Chapter 3: The Biology of the Human Voice — Why Individuality Is Built In

Voice and AI, Chapter 3: how lungs, vocal cords, and the resonant vocal tract produce speech via the source–filter model — and why biological variability makes voice expressive, personal, and hard to replicate.

By Sho Shimoda

This is Part 3 of a series walking through my book Voice and AI. In the previous chapter, we looked at voice as a physical signal — frequency, harmonics, formants, and timbre. Now we move closer to the source and ask how that signal is produced in the first place.


The human voice is generated by biological hardware: lungs, muscles, cartilage, soft tissue, and bone working in tight coordination. Unlike an engineered device, this system isn't standardized. It varies from person to person, changes over time, and responds to emotion, health, and context. That biological variability is exactly why voice is so expressive — and why voice AI is so hard to perfect.

The Source–Filter Model

Voice production is usefully described in three parts. The power source is the lungs, providing airflow — no air, no sound. The sound source is the vocal cords, vibrating as air passes through them to create the basic signal. The filter is the vocal tract — throat, mouth, tongue, lips, nasal cavity — which shapes that sound through resonance. This is the source–filter model, and though simplified, it underpins a remarkable amount of speech processing, classical and modern alike.

Key idea: Separating the sound source from the filter isn't just anatomy — it's a design pattern that reappears throughout voice technology, from classical vocoders to neural synthesis.
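To make the three-part split concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the method any particular vocoder or synthesizer uses: the impulse-train source, the two-pole resonators, the `breath` gain, and the example formant values are all simplifications chosen for this sketch.

```python
import math

def glottal_source(f0_hz, duration_s, sample_rate=16000):
    """Sound source: an impulse train, one pulse per vocal-cord cycle.
    (Assumption: a real glottal pulse is smoother than a single-sample spike.)"""
    n_samples = int(round(duration_s * sample_rate))
    period = int(round(sample_rate / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def resonator(signal, center_hz, bandwidth_hz, sample_rate=16000):
    """Filter: a two-pole resonator standing in for one vocal-tract resonance.
    Implements y[n] = x[n] + 2r*cos(theta)*y[n-1] - r^2*y[n-2] (gain not normalized)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    theta = 2.0 * math.pi * center_hz / sample_rate
    a1 = 2.0 * r * math.cos(theta)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize(f0_hz, formants, duration_s=0.5, sample_rate=16000, breath=1.0):
    """Power source (breath gain) -> sound source (pulses) -> filter (resonators)."""
    source = glottal_source(f0_hz, duration_s, sample_rate)
    signal = [breath * s for s in source]
    for center_hz, bandwidth_hz in formants:
        signal = resonator(signal, center_hz, bandwidth_hz, sample_rate)
    return signal

# Illustrative (center, bandwidth) pairs in Hz for an "ah"-like vowel; rough values.
AH_LIKE = [(700.0, 80.0), (1200.0, 90.0)]
samples = synthesize(120.0, AH_LIKE, duration_s=0.1)
```

Changing `f0_hz` alters only the source (the pitch), while changing the formant list alters only the filter (the vowel quality). Setting `breath=0.0` silences everything, which is the "no air, no sound" rule in code form.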

Breath, Vibration, and Resonance

Speech begins with breath. Airflow from the lungs is never constant: speakers adjust pressure continuously for loudness, emphasis, and phrasing. Because breathing also serves vital functions, pauses and hesitations are natural features of speech rather than flaws. That air passes through the vocal cords in the larynx, two folds of tissue that vibrate and interrupt the airflow in a nearly periodic pattern, producing the fundamental frequency and its harmonics. Pitch is controlled mainly by their tension, length, and thickness.
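The link between a repeating pulse and "the fundamental frequency and its harmonics" can be checked numerically. The sketch below assumes an idealized, perfectly regular pulse train (real vocal cords are only approximately regular) and a 16 kHz sample rate; the function names are made up for this example. It measures how much energy the signal has at a few chosen frequencies.

```python
import cmath
import math

def pulse_train(period_samples, n_samples):
    """Idealized vocal-cord output: one unit pulse every `period_samples` samples."""
    return [1.0 if i % period_samples == 0 else 0.0 for i in range(n_samples)]

def dft_magnitude(signal, freq_hz, sample_rate):
    """Magnitude of the discrete Fourier sum of `signal` at a single frequency."""
    acc = 0j
    for n, x in enumerate(signal):
        if x != 0.0:  # zeros contribute nothing, so skip them
            acc += x * cmath.exp(-2j * math.pi * freq_hz * n / sample_rate)
    return abs(acc)

SAMPLE_RATE = 16000
# A pulse every 160 samples at 16 kHz repeats 100 times per second: f0 = 100 Hz.
signal = pulse_train(160, 1600)  # 0.1 s of signal, 10 pulses

for freq in (100.0, 150.0, 200.0, 300.0):
    print(freq, round(dft_magnitude(signal, freq, SAMPLE_RATE), 6))
```

Energy shows up at 100 Hz and its integer multiples (200 Hz, 300 Hz) and cancels at 150 Hz, halfway between two harmonics. That comb of evenly spaced harmonics is the raw material the vocal tract then shapes.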

The raw buzz from the vocal cords isn't speech yet. Resonance in the vocal tract shapes it: moving the tongue shifts formants, opening the mouth changes resonance, lowering the soft palate adds nasal sound. Small, precise movements yield an enormous variety of sounds — and no two speakers configure them identically.

Key idea: Vocal cords are living tissue — flexible, imperfect, fatigue-prone. Those micro-irregularities are part of what makes a voice sound natural. Synthetic voices that are too regular often sound artificial precisely because they're too clean.
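One common way to put a number on those micro-irregularities is jitter: how much the length of each vocal-cord cycle differs from the next. The sketch below is a toy model, not a measurement tool. It assumes cycle lengths wobble uniformly at random around a base period (a simplification of real tissue behavior), and the helper names and the 1% wobble are chosen for illustration only.

```python
import random

def cycle_periods(f0_hz, n_cycles, jitter_fraction=0.0, seed=0):
    """Durations (seconds) of successive vocal-cord cycles.
    jitter_fraction=0.0 gives a perfectly regular, 'too clean' voice."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    base = 1.0 / f0_hz
    return [base * (1.0 + rng.uniform(-jitter_fraction, jitter_fraction))
            for _ in range(n_cycles)]

def relative_jitter(periods):
    """Mean absolute difference between consecutive periods, divided by the mean period."""
    if len(periods) < 2:
        raise ValueError("need at least two cycles to measure jitter")
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_period = sum(periods) / len(periods)
    return mean_diff / mean_period

clean = cycle_periods(120.0, 50)                          # synthetic-style regularity
human_like = cycle_periods(120.0, 50, jitter_fraction=0.01, seed=1)

print("clean jitter:     ", relative_jitter(clean))
print("human-like jitter:", relative_jitter(human_like))
```

The perfectly regular sequence scores exactly zero, while the perturbed one scores a small positive value. A synthesizer that wants to avoid the "too clean" sound has to reintroduce variation of this kind in some form.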

Individuality and Emotion

Even two people who share a language and an accent can be told apart by voice. That individuality comes from anatomy (vocal cord size, vocal tract length, mouth and nasal structure), habitual patterns (rhythm, pitch range, articulation), and changing factors such as age, health, and emotional state. Together they form a vocal fingerprint, which is why voice can identify a person and why it is hard to imitate perfectly.
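To see how anatomy alone can separate two voices, consider a textbook idealization that goes beyond what this summary spells out: treat the vocal tract as a uniform tube, closed at the vocal cords and open at the lips. Such a tube resonates at odd quarter-wavelength frequencies, F_k = (2k - 1) * c / (4 * L), where L is the tube length and c is the speed of sound. The tube model, the 350 m/s value for c, and the two example lengths below are assumptions for illustration; a real vocal tract is not uniform.

```python
def tube_formants(tract_length_m, n_formants=3, speed_of_sound=350.0):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other.
    F_k = (2k - 1) * c / (4 * L) for k = 1, 2, 3, ..."""
    return [(2 * k - 1) * speed_of_sound / (4.0 * tract_length_m)
            for k in range(1, n_formants + 1)]

# Two hypothetical speakers who differ only in vocal tract length.
longer_tract = tube_formants(0.175)   # 17.5 cm
shorter_tract = tube_formants(0.140)  # 14.0 cm

for name, formants in (("17.5 cm", longer_tract), ("14.0 cm", shorter_tract)):
    print(name, [round(f) for f in formants])
```

The 17.5 cm tube resonates near 500, 1500, and 2500 Hz, and the shorter tube shifts every resonance upward by the same ratio. Even before habits, health, or emotion enter the picture, a few centimeters of anatomy move the entire formant pattern.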

Voice is also tightly linked to emotion, not because emotion is a separate signal but because emotion changes the body. Stress tightens muscles, excitement raises pitch and speed, sadness lowers energy. Emotion shows up as many small acoustic cues at once, which is why generating emotional speech is never as simple as labeling output "happy" or "sad."

Important: Because voice reflects identity and physical traits, it must be treated as personal data. The biological basis of voice makes misuse more sensitive than most other forms of media — a theme the book returns to in depth.

What Chapter 3 Sets Up

Three implications carry forward. Data matters: models must learn from real human voices to capture biological variation. Perfect replication is unrealistic: even your own voice changes day to day, so a synthetic voice held perfectly stable tends to sound less natural, not more. And ethics matter, because voice is personal data by its very nature. Understanding the biology explains both the power and the limits of the technology.


Next up — Chapter 4: Digitizing Voice. Part I grounded voice in reality — concept, physics, and biology. Now we cross into the digital world and ask how a continuous biological signal that was never designed to be digital gets turned into numbers a machine can use.
