This is Part 3 of a series walking through my book Voice and AI. In the previous chapter, we looked at voice as a physical signal — frequency, harmonics, formants, and timbre. Now we move closer to the source and ask how that signal is produced in the first place.
The human voice is generated by biological hardware: lungs, muscles, cartilage, soft tissue, and bone working in tight coordination. Unlike an engineered device, this system isn't standardized. It varies from person to person, changes over time, and responds to emotion, health, and context. That biological variability is exactly why voice is so expressive — and why voice AI is so hard to perfect.
The Source–Filter Model
Voice production is usefully described in three parts. The power source is the lungs, providing airflow — no air, no sound. The sound source is the vocal cords, vibrating as air passes through them to create the basic signal. The filter is the vocal tract — throat, mouth, tongue, lips, nasal cavity — which shapes that sound through resonance. This is the source–filter model, and though simplified, it underpins a remarkable amount of speech processing, classical and modern alike.
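To make the three parts concrete, here is a minimal sketch of the source–filter idea in plain Python. It is an illustration under simplifying assumptions, not how any particular speech system works. The source is a bare impulse train, the filter is a chain of simple two-pole resonators, and the formant values in `AH_LIKE_FORMANTS` are illustrative numbers chosen for this example. All function names are hypothetical.

```python
import math

def glottal_source(f0_hz, duration_s, sample_rate):
    """Sound source: one impulse per pitch period (a crude stand-in for vocal cord vibration)."""
    n = int(duration_s * sample_rate)
    period = round(sample_rate / f0_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(signal, center_hz, bandwidth_hz, sample_rate):
    """Filter stage: two-pole resonator, y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius < 1, so the filter is stable
    theta = 2 * math.pi * center_hz / sample_rate        # pole angle sets the resonant frequency
    a1 = 2 * r * math.cos(theta)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(f0_hz, formants, duration_s=0.5, sample_rate=16000):
    """Pass the source through one resonator per formant, then normalize to a peak of 1."""
    signal = glottal_source(f0_hz, duration_s, sample_rate)
    for center_hz, bandwidth_hz in formants:
        signal = resonator(signal, center_hz, bandwidth_hz, sample_rate)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]

# Assumed (center frequency, bandwidth) pairs in Hz, roughly "ah"-like; illustrative only.
AH_LIKE_FORMANTS = [(700, 80), (1200, 90), (2600, 120)]
samples = synthesize_vowel(120, AH_LIKE_FORMANTS)
```

Changing `f0_hz` changes the pitch without touching the vowel quality, while changing the formant list changes the vowel without touching the pitch. That independence is the practical appeal of the model.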
Breath, Vibration, and Resonance
Speech begins with breath. Airflow from the lungs is never constant: speakers adjust pressure continuously for loudness, emphasis, and phrasing. Because the same breath also has to keep the body supplied with oxygen, pauses and hesitations are natural features of speech rather than flaws. That air then passes through the vocal cords (also called vocal folds) in the larynx, two folds of tissue that vibrate and interrupt the airflow in a roughly regular pattern, producing the fundamental frequency and its harmonics. Pitch is controlled mainly by the folds' tension, length, and thickness.
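Why does a regular interruption of airflow produce a fundamental and harmonics? A small numerical check helps. The sketch below assumes an idealized, perfectly periodic pulse train (real vocal cord vibration is only approximately periodic) and measures its energy at a few frequencies with a direct DFT sum. The helper name `dft_magnitude` and the chosen sample rate are assumptions made for this example.

```python
import math

def dft_magnitude(signal, freq_hz, sample_rate):
    """Magnitude of the signal's DFT evaluated at a single frequency, scaled by length."""
    re = sum(s * math.cos(2 * math.pi * freq_hz * i / sample_rate)
             for i, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq_hz * i / sample_rate)
             for i, s in enumerate(signal))
    return math.hypot(re, im) / len(signal)

SAMPLE_RATE = 8000
F0_HZ = 100
PERIOD = SAMPLE_RATE // F0_HZ  # 80 samples between pulses

# 0.1 seconds of an idealized pulse train: one pulse every pitch period.
pulse_train = [1.0 if i % PERIOD == 0 else 0.0 for i in range(800)]

# Energy at whole-number multiples of F0 (the harmonics)...
at_harmonics = [dft_magnitude(pulse_train, k * F0_HZ, SAMPLE_RATE) for k in (1, 2, 3)]
# ...versus energy halfway between them.
between_harmonics = [dft_magnitude(pulse_train, f, SAMPLE_RATE) for f in (150, 250)]
```

In this idealized case the energy sits at multiples of the fundamental and cancels out halfway between them. Raising or lowering `F0_HZ` moves the whole ladder of harmonics together, which is what we hear as a change in pitch.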
The raw buzz from the vocal cords isn't speech yet. Resonance in the vocal tract shapes it: moving the tongue shifts formants, opening the mouth changes resonance, lowering the soft palate adds nasal sound. Small, precise movements yield an enormous variety of sounds — and no two speakers configure them identically.
Individuality and Emotion
Even two people with the same language and accent are distinguishable. That individuality comes from anatomy (vocal cord size, vocal tract length, mouth and nasal structure), habitual patterns (rhythm, pitch range, articulation), and external factors (age, health, emotional state). Together they form a vocal fingerprint — which is why voice can identify a person, and why it's hard to perfectly imitate.
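One way to see how anatomy alone shifts a voice is the textbook uniform-tube approximation: treat the vocal tract as a tube closed at the vocal cords and open at the lips, whose resonances fall at odd multiples of c / 4L. This is a rough sketch for a neutral, schwa-like tract shape only. The function name and the speed-of-sound value of 350 m/s (warm, moist air) are assumptions for illustration.

```python
def tube_formants(tract_length_cm, count=3, speed_of_sound_cm_s=35000.0):
    """Resonances of a uniform tube closed at one end: F_n = (2n - 1) * c / (4 * L)."""
    return [
        (2 * n - 1) * speed_of_sound_cm_s / (4 * tract_length_cm)
        for n in range(1, count + 1)
    ]

# Two hypothetical speakers who differ only in vocal tract length.
longer_tract = tube_formants(17.5)
shorter_tract = tube_formants(14.0)
```

Under this approximation a shorter tract raises every resonance by the same ratio, so two speakers producing the same vowel still place their formants in different spots. Real vocal tracts are not uniform tubes, so this captures the trend rather than the measured values.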
Voice is also tightly linked to emotion, not because emotion is a separate signal but because emotion changes the body. Stress tends to tighten muscles, excitement tends to raise pitch and speaking rate, and sadness tends to lower energy. Emotion shows up as many small acoustic cues at once, which is why generating emotional speech is never as simple as labeling output "happy" or "sad."
What Chapter 3 Sets Up
Three implications carry forward. Data matters: models must learn from real human voices to capture biological variation. Perfect replication is unrealistic: even your own voice changes from day to day, so a synthetic voice held perfectly constant tends to sound less natural, not more. And ethics matter, because a voice is tied to a specific body and can identify its owner, which makes it personal data by its very nature. Understanding the biology explains both the power and the limits of the technology.
Next up — Chapter 4: Digitizing Voice. Part I grounded voice in reality — concept, physics, and biology. Now we cross into the digital world and ask how a continuous biological signal that was never designed to be digital gets turned into numbers a machine can use.