Chapter 5: Signal Processing Fundamentals — Seeing Sound with Spectrograms

This is Part 5 of a series walking through my book Voice and AI. In the previous chapter, voice finally became data — a long sequence of numbers. Faithful, but inconvenient: the important structure is hidden inside it. This chapter is about revealing that structure.

A waveform tells you how air pressure changes over time, but not which frequencies are present, how they evolve, or which parts are speech versus noise. Two waveforms that look completely different can sound alike; two that look alike can sound completely different. Speech is structured in frequency and time at once — brief noisy consonants, longer harmonic vowels, prosody unfolding over hundreds of milliseconds. To capture that, we need representations that reflect how sound behaves, not just how it was sampled.

Thinking in Frequency Instead of Time

The pivotal shift is moving from "what is the signal doing right now?" to "which frequencies are present, and how strong are they?" Any complex sound can be described as a sum of simple sinusoids at different frequencies, each with its own strength. The Fourier Transform is the tool that performs this shift, and the intuition is friendlier than the math. Picture listening to an orchestra and picking out which instruments are playing: you're separating sound into components by pitch and timbre. The Fourier Transform does something similar, producing a spectrum that shows how much energy each frequency contributes.
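To make the orchestra picture concrete, here is a minimal sketch using NumPy. The 16 kHz sample rate and the two tone frequencies (220 Hz and 880 Hz) are assumptions chosen purely for illustration; they stand in for "instruments" mixed into one waveform.

```python
import numpy as np

sample_rate = 16000  # Hz (assumed for illustration)
duration = 1.0       # seconds
t = np.arange(int(sample_rate * duration)) / sample_rate

# A synthetic "sound": a strong 220 Hz tone plus a weaker 880 Hz tone.
signal = 1.0 * np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

# Real-input FFT: one complex value per frequency bin from 0 Hz up to Nyquist.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

# Scale so that a sine of amplitude A shows up as roughly A in its bin.
magnitude = np.abs(spectrum) * 2 / len(signal)

# Pick the two strongest bins and report their frequencies in ascending order.
top_two = np.sort(freqs[np.argsort(magnitude)[-2:]])
print("strongest frequencies (Hz):", np.round(top_two, 1))
print("their amplitudes:", np.round(magnitude[[220, 880]], 3))
```

In the time domain the mix is just a wiggly line. In the spectrum, the two strongest bins land at the two tone frequencies, with heights that reflect how loud each tone was.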

Short-Time Analysis and the Spectrogram

A single transform over an entire utterance tells you which frequencies appear overall, but not when. Since speech changes constantly, we use short-time analysis: split the waveform into short overlapping frames (often 20–30 milliseconds), transform each one, and stack the results into a time–frequency representation. A spectrogram visualizes exactly this: time on one axis, frequency on the other, brightness for energy. Harmonics of a voiced sound appear as stacked horizontal lines, formants as broader bands of energy, noise as diffuse smears.
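The framing procedure described above can be sketched in a few lines of NumPy. This is an illustrative implementation, not a reference one: the function name `spectrogram`, the 25 ms frame, the 10 ms hop, and the Hann taper are all assumptions picked for the example, and the test signal is synthetic (a 400 Hz tone that switches to 1200 Hz halfway through).

```python
import numpy as np

def spectrogram(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Return (spec_db, freqs); spec_db has shape (num_frames, num_bins)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = np.hanning(frame_len)  # taper applied to every frame

    # Slice the waveform into overlapping frames (any incomplete tail is dropped).
    num_frames = 1 + (len(signal) - frame_len) // hop_len
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(num_frames)
    ])

    # One FFT per frame; keep magnitude only and convert to decibels.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    spec_db = 20 * np.log10(magnitude + 1e-10)  # small offset avoids log(0)
    freqs = np.fft.rfftfreq(frame_len, d=1 / sample_rate)
    return spec_db, freqs

sample_rate = 16000  # Hz (assumed)
t = np.arange(sample_rate) / sample_rate  # one second of samples

# First half: 400 Hz tone. Second half: 1200 Hz tone.
signal = np.where(t < 0.5,
                  np.sin(2 * np.pi * 400 * t),
                  np.sin(2 * np.pi * 1200 * t))

spec_db, freqs = spectrogram(signal, sample_rate)
peak_hz = freqs[spec_db.argmax(axis=1)]  # strongest frequency in each frame

print("spectrogram shape (frames, bins):", spec_db.shape)
print("peak in first frame (Hz):", round(float(peak_hz[0])))
print("peak in last frame (Hz):", round(float(peak_hz[-1])))
```

A single transform over the whole second would show both tones with no hint of their order. The per-frame peaks recover the "when": early frames peak at the first tone, late frames at the second. Plotting `spec_db.T` as an image (for example with matplotlib's `imshow`) would give the familiar spectrogram picture.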

Key idea: Spectrograms turn invisible sound into something you can inspect — and spectrogram-like representations are what many machine learning models actually consume. Learn to read one and you can debug half the field.

Windowing, Magnitude, and Phase

You can't just chop the signal into frames abruptly: hard edges act like artificial discontinuities and smear energy into frequencies that aren't really there (spectral leakage). Each frame is therefore multiplied by a window function that tapers its edges, and the window length trades time resolution against frequency resolution: short windows catch rapid changes but blur frequency; long windows nail frequency but smear timing. The transform also produces complex values with both magnitude (how strong each frequency is) and phase (where each component sits in its cycle, and so how the components line up in time). For recognition, magnitude carries most of the perceptually relevant information and phase is often discarded. For synthesis, phase comes roaring back, since poor phase handling produces buzzy, unnatural sound.
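Both points can be checked numerically. The sketch below is illustrative only: the 410 Hz tone, the 25 ms frame, the 2 kHz cutoff, and the helper name `far_leakage_db` are assumptions made for the example. It first compares how much energy an untapered frame and a Hann-tapered frame spill into frequencies far from the tone, then splits a spectrum into magnitude and phase and rebuilds the frame with and without the phase.

```python
import numpy as np

sample_rate = 16000  # Hz (assumed)
frame_len = 400      # 25 ms at 16 kHz
t = np.arange(frame_len) / sample_rate

# 410 Hz does not complete a whole number of cycles in the frame,
# so the frame edges do not line up: a worst case for abrupt chopping.
frame = np.sin(2 * np.pi * 410 * t)
freqs = np.fft.rfftfreq(frame_len, d=1 / sample_rate)

def far_leakage_db(frame, window):
    """Strongest bin above 2 kHz, in dB relative to the spectral peak."""
    mag = np.abs(np.fft.rfft(frame * window))
    return 20 * np.log10(mag[freqs > 2000].max() / mag.max())

rect_db = far_leakage_db(frame, np.ones(frame_len))     # abrupt chop
hann_db = far_leakage_db(frame, np.hanning(frame_len))  # tapered edges
print(f"leakage above 2 kHz, no taper: {rect_db:.1f} dB")
print(f"leakage above 2 kHz, Hann:     {hann_db:.1f} dB")

# Magnitude and phase: two halves of each complex FFT value.
windowed = frame * np.hanning(frame_len)
spectrum = np.fft.rfft(windowed)
magnitude = np.abs(spectrum)
phase = np.angle(spectrum)

# Magnitude plus phase rebuilds the frame; magnitude alone does not.
with_phase = np.fft.irfft(magnitude * np.exp(1j * phase), n=frame_len)
without_phase = np.fft.irfft(magnitude, n=frame_len)
print("rebuilt with phase matches:   ", np.allclose(with_phase, windowed))
print("rebuilt without phase matches:", np.allclose(without_phase, windowed))
```

The tapered frame should show far less energy away from the tone than the abruptly chopped one. On the resolution trade-off: FFT bins are spaced `sample_rate / frame_len` apart, so this 25 ms frame gives 40 Hz spacing, while a 5 ms frame (80 samples) would give 200 Hz spacing in exchange for finer timing. The second half of the sketch shows why synthesis cannot ignore phase: the magnitude alone does not determine the waveform.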

Key idea: Not all parts of the signal matter equally for every task. Deciding what to keep and what to discard is a recurring design theme across the whole book.

What Chapter 5 Sets Up

Signal processing isn't just plumbing — it encodes assumptions about what matters in speech (that harmonic and resonant structure is meaningful, that speech is locally stable over short frames). Those assumptions are usually valid, and even models that learn straight from waveforms are still shaped by the same underlying physics. We've now turned raw audio into structured representations that expose patterns. The next step is compressing them into compact features useful for modeling.

Next up — Chapter 6: Classical Speech Processing. Before deep learning, engineers had to hand-design features. We meet MFCCs and LPC — methods built on deep insights about human hearing and voice production that, remarkably, still echo inside today's models.

Want the full picture? Grab Voice and AI here for the full, intuition-first treatment of signal processing.