This is Part 6 of a series walking through my book Voice and AI. In the previous chapter, we learned to turn raw audio into spectrograms — rich, but large and redundant. Early speech systems couldn't afford that richness, and the response shaped the field for decades.
Storage was expensive, CPUs were slow, data was scarce. Working directly with raw waveforms or full spectrograms was impractical, so classical speech processing compressed speech into compact features that kept what mattered and discarded the rest. That forced a genuinely hard question — what parts of the signal actually matter for understanding speech? — and the classical methods are careful answers, drawn from acoustics, linguistics, and psychoacoustics.
The Human Ear as Inspiration: MFCCs
Many classical features mimic how we hear. The auditory system doesn't respond equally to all frequencies, and we perceive pitch and loudness nonlinearly. MFCCs (Mel-Frequency Cepstral Coefficients) lean into this. They capture the spectral envelope of speech (closely tied to formants and vocal tract shape) while deliberately downplaying fine harmonic detail. The recipe: take a short-time Fourier transform of each frame, map frequency onto the perceptual mel scale, pool the energy into a small number of overlapping bands (typically a few dozen), take logarithms (mirroring loudness perception), then decorrelate with a discrete cosine transform and keep only the first coefficients, often just 12 or 13 numbers per frame.
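To make that recipe concrete, here is a minimal NumPy sketch of the pipeline. It is an illustration under assumptions, not a reference implementation: the function names, the 25 ms / 10 ms framing at 16 kHz, the 26-band filterbank, and the particular mel formula are all choices made for this example, and production libraries differ in details such as pre-emphasis, normalization, and liftering.

```python
import numpy as np

def hz_to_mel(hz):
    # One common mel formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters whose edges are equally spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def mfcc(signal, sample_rate, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_coeffs=13):
    signal = np.asarray(signal, dtype=float)

    # 1. Slice the signal into overlapping, Hamming-windowed frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # 2. Short-time power spectrum (frames are zero-padded to n_fft).
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2

    # 3. Pool energy into mel-spaced bands, then take the log.
    mel_energy = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energy = np.log(mel_energy + 1e-10)  # small floor avoids log(0)

    # 4. Decorrelate with an unnormalized DCT-II; keep the first n_coeffs.
    n = np.arange(n_filters)
    k = np.arange(n_coeffs)[:, None]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return log_energy @ dct_basis.T
```

With these defaults, calling mfcc(x, 16000) on one second of 16 kHz audio returns an array of shape (98, 13): 98 frames, 13 coefficients each. Compare that with the 257 spectral bins per frame that the same FFT size would otherwise hand to the model.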
LPC: Modeling Production Instead of Perception
Where MFCCs are inspired by hearing, Linear Predictive Coding is inspired by production. LPC treats speech as the output of a source–filter system and estimates the vocal tract filter that produced the signal. It does this by predicting each sample as a linear combination of the previous few samples; the prediction coefficients encode the tract's resonances, and whatever the predictor cannot explain is attributed to the excitation. LPC is powerful for low-bitrate coding and early synthesis because it represents filter and excitation separately, each with very few parameters. Its weakness is that it assumes a relatively clean signal and is sensitive to noise, which makes it less robust in the real world than MFCCs.
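One standard way to estimate those coefficients is the autocorrelation method solved with the Levinson-Durbin recursion. The sketch below assumes that method; the function names lpc and resonances are illustrative, and a real coder would also window the frame and choose the order to suit the sample rate.

```python
import numpy as np

def lpc(frame, order):
    """Estimate a[0..order-1] such that x[n] is approximated by
    sum_k a[k] * x[n - 1 - k]. Returns (a, residual_energy)."""
    frame = np.asarray(frame, dtype=float)

    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(frame[: len(frame) - k], frame[k:])
                  for k in range(order + 1)])

    a = np.zeros(order)
    error = r[0]  # prediction error energy of the order-0 predictor
    for i in range(order):
        # Reflection coefficient for growing the predictor to order i + 1.
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / error
        prev = a[:i].copy()
        a[:i] = prev - k * prev[::-1]
        a[i] = k
        error *= 1.0 - k * k
    return a, error

def resonances(a, sample_rate):
    """Resonance frequencies in Hz, read from the pole angles of the
    all-pole filter 1 / (1 - sum_k a[k] z^-(k+1))."""
    poles = np.roots(np.concatenate(([1.0], -np.asarray(a, dtype=float))))
    poles = poles[np.imag(poles) > 0]  # keep one pole of each conjugate pair
    return np.sort(np.angle(poles) * sample_rate / (2.0 * np.pi))
```

A quick sanity check is to feed lpc the impulse response of a known filter, say x[n] = 1.3 x[n-1] - 0.8 x[n-2] driven by a single impulse: an order-2 fit should recover coefficients very close to 1.3 and -0.8, and resonances then reports a single peak whose frequency depends on the sample rate you pass in.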
Capturing Change: Delta Features
Speech is never static, and the changes carry information. Classical systems add delta and delta-delta features: estimates of the first and second time derivatives of the base features, computed across a few neighboring frames. They tell the model how things are moving, not just where they are. Including these dynamics improves recognition because transitions between sounds are themselves cues. It's a recurring lesson: speech understanding depends as much on change as on absolute values.
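A common way to compute deltas is a short regression over neighboring frames rather than a plain frame-to-frame difference, since the regression is less jumpy. The sketch below assumes that formulation with a window of two frames on each side and edge frames repeated as padding; the function name delta and the width parameter are illustrative choices.

```python
import numpy as np

def delta(features, width=2):
    """Regression-based delta over +/- width frames.

    features: 2-D array of shape (n_frames, n_coeffs).
    Computes d[t] = sum_n n * (c[t+n] - c[t-n]) / (2 * sum_n n^2).
    """
    features = np.asarray(features, dtype=float)
    n_frames = features.shape[0]

    # Repeat the first and last frames so every frame has neighbors.
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")

    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(features)
    for n in range(1, width + 1):
        ahead = padded[width + n : width + n + n_frames]
        behind = padded[width - n : width - n + n_frames]
        out += n * (ahead - behind)
    return out / denom

def add_dynamics(features, width=2):
    # Stack base features, deltas, and delta-deltas side by side.
    d = delta(features, width)
    dd = delta(d, width)
    return np.hstack([np.asarray(features, dtype=float), d, dd])
```

Applied to 13 base coefficients per frame, add_dynamics yields 39 numbers per frame, which is the shape many classical recognizers were built around. Away from the padded edges, a feature that rises by a constant amount each frame gets a delta equal to that amount and a delta-delta of zero.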
What Chapter 6 Sets Up
Don't write these off as obsolete. Many modern systems still use MFCC-like inputs, and even end-to-end models implicitly learn similar structures because the physics and biology haven't changed. Knowing the classics gives you historical context, sharper intuition about representation, and a real edge when debugging constrained real-world systems. With this, we've completed the journey from voice as a human phenomenon to voice as machine-ready data. What remains is interpretation.
Next up — Chapter 7: The Problem of Understanding Speech. Part III opens with the central challenge of the field — why turning sound into meaning is so genuinely hard, from speaker variability to ambiguity to real-time constraints.