This is Part 2 of a series walking through my book Voice and AI. In the previous chapter, we defined voice as something physical, biological, linguistic, and social all at once — distinct from both raw sound and abstract language. Now we stop asking what voice means and start asking what voice is made of.
Every voice AI system, no matter how advanced, ultimately operates on a physical signal: air in motion. Those movements obey the same laws as waves in water or vibrations in a string. You don't need heavy math to build strong intuition here — and that intuition pays off everywhere later, because it explains why some problems are hard and why certain features keep showing up in speech systems.
Sound as Vibration
Sound is vibration traveling through a medium, usually air. When the vocal cords vibrate, they push and pull air molecules, sending out waves of slightly higher and lower pressure. Your ears convert those pressure changes into signals your brain reads as sound. From a physics standpoint there's nothing special about voice — it follows the same principles as any sound. What makes it special is how structured and information-rich it is.
Frequency and Amplitude: Pitch, Loudness, and Emphasis
Frequency describes how fast a vibration repeats, measured in hertz (cycles per second), and maps closely to perceived pitch. In ordinary adult speech, the fundamental frequency typically falls between roughly 80 and 300 Hz, shaped by the length, tension, and mass of the vocal cords; children's voices often sit higher. Amplitude describes how strong the vibration is, mapping roughly to loudness. In speech, neither is just a number: rising pitch can signal a question, a sudden drop can signal finality, and shifts in loudness carry emphasis and emotional intensity.
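To make frequency and amplitude concrete, here is a minimal sketch in plain Python. It builds three pure tones and measures each one's dominant frequency and strength. The sample rate, the tone parameters, and the helper names (`sine_wave`, `rms`, `peak_frequency`) are my own choices for illustration, and the brute-force Fourier transform is only practical for short signals like these.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (assumed for illustration)

def sine_wave(freq_hz, amplitude, n_samples=400):
    """Samples of a pure tone: amplitude * sin(2*pi*f*t)."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n_samples)]

def rms(samples):
    """Root-mean-square level, a common proxy for signal strength."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def peak_frequency(samples):
    """Frequency (Hz) of the strongest bin in a brute-force DFT."""
    n = len(samples)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):
        coeff = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                    for i, s in enumerate(samples))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * SAMPLE_RATE / n

quiet = sine_wave(120, 0.2)  # low pitch, soft
loud = sine_wave(120, 0.8)   # same pitch, four times the amplitude
high = sine_wave(240, 0.2)   # one octave up, soft

print(peak_frequency(quiet), round(rms(quiet), 3))  # 120.0 0.141
print(peak_frequency(loud), round(rms(loud), 3))    # 120.0 0.566
print(peak_frequency(high), round(rms(high), 3))    # 240.0 0.141
```

The two measurements are independent: changing amplitude leaves the frequency estimate alone, and changing frequency leaves the RMS level alone. Perceived pitch and loudness are less tidy than this, which is why the text says "maps closely" and "roughly".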
Harmonics: Why Voices Are Rich
If voice were a single frequency it would sound like a tuning fork. Real vocal cords produce a fundamental frequency plus many higher frequencies, called harmonics, at integer multiples of it: for a 100 Hz fundamental, that means 200 Hz, 300 Hz, 400 Hz, and so on. Harmonics give voice its richness and, crucially, its structure. That structure is what lets machines extract meaningful patterns instead of treating speech as random noise.
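A small sketch shows what that structure looks like to a machine. It sums sinusoids at integer multiples of a 100 Hz fundamental and then lists the frequencies that stand out in the spectrum. The 1/h amplitude rolloff, the threshold, and the function names are assumptions for illustration; real glottal spectra are messier.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (assumed)
N = 400             # 50 ms of audio, giving 20 Hz between spectrum bins

def harmonic_tone(f0, n_harmonics):
    """Sum of sinusoids at f0, 2*f0, ... with an assumed 1/h amplitude rolloff."""
    return [sum(math.sin(2 * math.pi * h * f0 * i / SAMPLE_RATE) / h
                for h in range(1, n_harmonics + 1))
            for i in range(N)]

def magnitude_spectrum(samples):
    """Brute-force DFT magnitudes, scaled so a unit sine reads about 1.0."""
    n = len(samples)
    return [abs(sum(s * cmath.exp(-2j * math.pi * k * i / n)
                    for i, s in enumerate(samples))) / (n / 2)
            for k in range(n // 2)]

def strong_frequencies(samples, threshold=0.05):
    """Frequencies (Hz) whose spectral magnitude exceeds the threshold."""
    spectrum = magnitude_spectrum(samples)
    return [round(k * SAMPLE_RATE / len(samples))
            for k, mag in enumerate(spectrum) if mag > threshold]

print(strong_frequencies(harmonic_tone(100, 1)))  # [100]
print(strong_frequencies(harmonic_tone(100, 5)))  # [100, 200, 300, 400, 500]
```

The tuning-fork case has one peak. The voice-like case has a regular ladder of peaks, and the spacing of that ladder is the fundamental frequency.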
Formants: Shaping Sound into Speech
Harmonics alone don't explain why "ah" and "ee" sound different. Formants do. Formants are frequency bands amplified by the shape of your vocal tract — throat, mouth, tongue, nasal cavity. Move your tongue or open your mouth and you change which frequencies resonate. Different vowels are distinguished largely by the positions of their first few formants, which is why you recognize the same vowel across speakers with very different pitch.
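One way to see this is a toy source-filter sketch: the vocal cords supply a ladder of harmonics, and the vocal tract boosts whichever harmonics fall near its resonances. Everything specific below is an assumption for illustration: the formant values are rough, commonly quoted averages for adult male speakers, the bell-shaped `resonance_gain` curve and its 100 Hz bandwidth are stand-ins for a real vocal tract model, and only the first two formants are used.

```python
# Approximate first and second formants in Hz (assumed, adult male averages).
VOWEL_FORMANTS = {"ah": (730, 1090), "ee": (270, 2290)}

def resonance_gain(freq_hz, center_hz, bandwidth_hz=100.0):
    """Bell-shaped boost around one resonance (a simplification, not a tract model)."""
    return 1.0 / (1.0 + ((freq_hz - center_hz) / (bandwidth_hz / 2)) ** 2)

def vowel_spectrum(f0, vowel, max_hz=3000):
    """Map each harmonic frequency to its amplitude after formant shaping."""
    spectrum = {}
    h = 1
    while h * f0 <= max_hz:
        freq = h * f0
        source = 1.0 / h  # assumed rolloff of the raw vocal cord signal
        tract = sum(resonance_gain(freq, center) for center in VOWEL_FORMANTS[vowel])
        spectrum[freq] = source * tract
        h += 1
    return spectrum

def strongest_harmonic(f0, vowel):
    """Frequency of the loudest harmonic for a given pitch and vowel."""
    spectrum = vowel_spectrum(f0, vowel)
    return max(spectrum, key=spectrum.get)

for f0 in (100, 200):
    print(f0, strongest_harmonic(f0, "ah"), strongest_harmonic(f0, "ee"))
# 100 700 300
# 200 800 200
```

Doubling the pitch moves every harmonic, yet the loudest one stays close to the first formant of each vowel: around 700 to 800 Hz for "ah" and 200 to 300 Hz for "ee". The envelope identifies the vowel; the harmonic spacing identifies the pitch.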
Timbre and Time: Character and Change
Timbre is the hardest property to define and one of the most important — it's why a violin and a piano playing the same note sound different, and why you recognize a person's voice even when they whisper. Physically, timbre arises from the relative strengths of harmonics, the fine structure of formants, and subtle airflow noise. It's distributed across the whole signal, with no single number that captures it, which is exactly why it matters so much for speaker recognition and voice cloning.
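As a rough illustration, the sketch below compares two tones with the same fundamental but different harmonic balance, using the spectral centroid (the amplitude-weighted average frequency). The centroid is only one coarse descriptor, often associated with perceived brightness, and by itself it does not capture timbre. The two amplitude profiles and the names are assumptions.

```python
def spectral_centroid(f0, harmonic_amps):
    """Amplitude-weighted mean frequency of a harmonic series starting at f0."""
    freqs = [f0 * (index + 1) for index in range(len(harmonic_amps))]
    return sum(f * a for f, a in zip(freqs, harmonic_amps)) / sum(harmonic_amps)

F0 = 220  # both tones play the same note

# Hypothetical harmonic profiles for eight harmonics.
mellow = [1 / h ** 2 for h in range(1, 9)]  # upper harmonics fade quickly
bright = [1 / h for h in range(1, 9)]       # upper harmonics stay strong

print(round(spectral_centroid(F0, mellow)))
print(round(spectral_centroid(F0, bright)))
```

The second tone's centroid comes out well above the first, even though both have the same pitch. Systems that model speaker identity typically rely on many such measurements at once, or on learned representations, precisely because no single one is enough.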
And none of this is static. Pitch rises and falls, formants shift as the mouth moves, and loudness varies, often within tens of milliseconds. Voice is a sequence of structured changes, which is why we analyze it with time–frequency views like spectrograms rather than single snapshots.
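The idea behind a spectrogram can be sketched in a few lines: cut the signal into short frames and compute a spectrum for each one. The example below uses a test signal that jumps from 100 Hz to 200 Hz halfway through. The frame size, the non-overlapping unwindowed frames, and the function names are simplifying assumptions; practical spectrograms normally use overlapping, windowed frames and a fast Fourier transform.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (assumed)
FRAME_SIZE = 400    # 50 ms per frame

def tone(freq_hz, n_samples):
    """Unit-amplitude sine wave."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n_samples)]

def frame_spectrum(frame):
    """Brute-force DFT magnitudes for one frame."""
    n = len(frame)
    return [abs(sum(s * cmath.exp(-2j * math.pi * k * i / n)
                    for i, s in enumerate(frame)))
            for k in range(n // 2)]

def spectrogram(samples, frame_size=FRAME_SIZE):
    """List of per-frame spectra: one row per time step, one column per frequency bin."""
    return [frame_spectrum(samples[start:start + frame_size])
            for start in range(0, len(samples) - frame_size + 1, frame_size)]

def pitch_track(samples):
    """Strongest frequency (Hz) in each frame."""
    track = []
    for spectrum in spectrogram(samples):
        peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
        track.append(peak_bin * SAMPLE_RATE / FRAME_SIZE)
    return track

signal = tone(100, 800) + tone(200, 800)  # 100 Hz, then 200 Hz
print(pitch_track(signal))  # [100.0, 100.0, 200.0, 200.0]
```

A single spectrum of the whole signal would show energy at both 100 Hz and 200 Hz with no hint of the order. The frame-by-frame view recovers when each one happened.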
What Chapter 2 Sets Up
Speech recognition works because machines learn patterns in frequency, harmonics, and formants over time. Synthesis works because machines learn to generate signals with realistic pitch, timbre, and dynamics. When these physical properties are ignored, systems fail in subtle ways — intelligible but lifeless, or fine in a quiet room and collapsing in noise. This chapter gives you the vocabulary to see why.
Next up — Chapter 3: Biology of the Human Voice. Physics explains how sound behaves but not why voices differ so much between people. For that we go to the source: lungs, vocal cords, and the resonant vocal tract — and why individuality is built into the system from the start.