Chapter 12: Classical Text-to-Speech — Concatenative, Unit Selection, and Parametric

This is Part 12 of a series walking through my book Voice and AI. In the previous chapter, we defined what synthetic speech is really after. Now we look at how the first systems pursued it — built from carefully engineered parts, decades before neural networks learned prosody.

Understanding classical TTS matters not because it's the future, but because it explains why the future looks the way it does. These systems established the core concepts modern models would later absorb.

The Classical Pipeline

Classical TTS followed a strict sequence. Text was normalized ("12" to "twelve," "Dr." to "doctor"), converted into linguistic units via part-of-speech tagging and pronunciation dictionaries, assigned prosody (stress, pauses, pitch movements), and finally rendered into speech. Each stage demanded expert knowledge, baked in assumptions, and — critically — passed its errors downstream to accumulate.
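To make the first stage concrete, here is a minimal sketch of text normalization. The lookup tables and the names `ABBREVIATIONS`, `spell_number`, and `normalize` are my own illustrative assumptions, not taken from any real TTS front end, and the number speller only covers 0 to 99.

```python
import re

# Toy lookup tables, invented for illustration only.
ABBREVIATIONS = {"Dr.": "doctor", "Mr.": "mister", "St.": "street"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def spell_number(n: int) -> str:
    """Spell out 0..99; larger numbers are left as digits in this sketch."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"
    return str(n)


def normalize(text: str) -> str:
    """Expand known abbreviations, then spell out standalone integers."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    return re.sub(r"\b\d+\b", lambda m: spell_number(int(m.group())), text)


print(normalize("Dr. Lee lives at 12 Elm St."))
```

Even this toy shows how errors start early: the table always reads "St." as "street", so "St. Mary" would come out wrong, and every later stage would inherit that mistake.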

Concatenative and Unit Selection

One of the earliest widely successful approaches was concatenative synthesis: stitch together small segments of recorded human speech, such as phonemes, syllables, or larger units. Because the building blocks are real recordings, it can sound very natural under the right conditions. But it needs large, carefully recorded databases, offers limited prosody control, and must approximate when a needed segment is missing, which produces audible discontinuities. Unit selection refined this by searching a large database for the unit sequence that best matches the target, balancing acoustic continuity against linguistic constraints. It is brilliant when it works and obviously broken when it fails, and it is heavy, requiring large databases and an expensive search.
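The search can be sketched as a small dynamic program. This is a toy under stated assumptions: each recorded unit carries just one pitch value and one number per edge, the costs are plain absolute differences, and the names `Unit`, `target_cost`, `join_cost`, and `select_units` are hypothetical. Real systems used much richer features and pruned the search heavily.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    phone: str    # phoneme label of the recorded segment
    pitch: float  # average pitch in Hz (toy feature)
    start: float  # spectral value at the left edge (toy 1-D feature)
    end: float    # spectral value at the right edge


def target_cost(unit: Unit, wanted_pitch: float) -> float:
    """How far the recorded unit is from what the text asked for."""
    return abs(unit.pitch - wanted_pitch)


def join_cost(left: Unit, right: Unit) -> float:
    """How audible the seam between two consecutive units would be."""
    return abs(left.end - right.start)


def select_units(targets, database, join_weight=1.0):
    """Viterbi search over candidates; targets is a list of (phone, pitch)."""
    if not targets:
        return [], 0.0
    lattice = [[u for u in database if u.phone == phone] for phone, _ in targets]
    if any(not column for column in lattice):
        raise ValueError("no recorded unit for at least one target phone")

    # best[i][j] = (cheapest total cost ending at lattice[i][j], backpointer)
    best = [[(target_cost(u, targets[0][1]), None) for u in lattice[0]]]
    for i in range(1, len(lattice)):
        column = []
        for unit in lattice[i]:
            arrivals = [best[i - 1][k][0] + join_weight * join_cost(prev, unit)
                        for k, prev in enumerate(lattice[i - 1])]
            k_best = min(range(len(arrivals)), key=arrivals.__getitem__)
            cost = arrivals[k_best] + target_cost(unit, targets[i][1])
            column.append((cost, k_best))
        best.append(column)

    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda idx: best[-1][idx][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(lattice) - 1, -1, -1):
        path.append(lattice[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Invented database: two candidates each for "k" and "ae", one for "t".
database = [
    Unit("k", 120.0, 0.0, 1.0),
    Unit("k", 110.0, 0.0, 5.0),
    Unit("ae", 118.0, 1.2, 2.0),
    Unit("ae", 120.0, 6.0, 2.5),
    Unit("t", 115.0, 2.1, 0.0),
]
targets = [("k", 120.0), ("ae", 120.0), ("t", 115.0)]

path, total = select_units(targets, database)
print([(u.phone, u.pitch) for u in path])
```

With these made-up numbers, the search picks the 118 Hz "ae" even though the 120 Hz one matches the target pitch exactly, because the exact match would leave a much rougher seam. Setting `join_weight=0.0` flips that choice, which is the continuity-versus-target trade-off in miniature.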

Key idea: Concatenative systems borrow naturalness from real recordings but can't generalize beyond them. Parametric systems generalize freely but pay for it in naturalness. That tension defined classical TTS for years.

Parametric TTS and the Vocoder Bottleneck

Parametric TTS takes the opposite bet: instead of storing recordings, it learns statistical models of speech parameters (spectral shape, pitch, duration), predicts them from text, then generates audio with a vocoder. It's compact, flexible, allows fine prosody control, and adapts easily to new voices — but tends to sound less natural, since over-smoothing of parameters produces muffled, robotic speech. And the vocoder itself was often the limiting factor: early vocoders introduced artifacts that capped quality even when the predicted parameters were accurate.
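One common explanation for over-smoothing is that a statistical model trained on many renditions ends up predicting something close to their average. The sketch below illustrates that effect with three invented pitch contours; the numbers are made up, and real systems modeled full spectral and duration parameters, not five frames of pitch.

```python
from statistics import pvariance

# Three toy pitch contours (Hz) for the same phrase, each placing its
# accent on a different frame. Values are invented for illustration.
contours = [
    [100, 140, 100, 100, 100],
    [100, 100, 140, 100, 100],
    [100, 100, 100, 140, 100],
]

# Assumption: the model's prediction is approximated by the per-frame mean,
# which is what minimizing squared error over these examples would yield.
predicted = [sum(frame) / len(frame) for frame in zip(*contours)]

print("predicted contour:", [round(p, 1) for p in predicted])
print("peak in each recording:", max(contours[0]))
print("peak in prediction:", round(max(predicted), 1))
print("variance of a recording:", pvariance(contours[0]))
print("variance of prediction:", round(pvariance(predicted), 1))
```

Every recording has a sharp 140 Hz accent, but the averaged contour peaks at only about 113 Hz and its variance drops from 256 to roughly 43. The accent has been smeared across three frames, which is the flattened, muffled quality described above.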

Important: Classical systems leaned on rule-based prosody drawn from linguistic theory. Rules handled structured text adequately but couldn't capture context-dependent nuance — which is exactly why long passages so often sounded monotone.
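A toy rule set shows both the appeal and the limit. The three rules below (declination, accent on content words, a boundary tone on the last word) are simplified stand-ins for the kind of rules such systems used; the function name `assign_pitch`, the word list, and every Hz value are assumptions made for this sketch.

```python
# Tiny function-word list, invented for illustration.
FUNCTION_WORDS = {"a", "an", "the", "of", "to", "in", "on", "and", "is", "it"}


def assign_pitch(sentence, start_hz=130.0, end_hz=100.0,
                 accent_hz=15.0, boundary_hz=20.0):
    """Return (word, pitch target in Hz) pairs from three hand-written rules."""
    stripped = sentence.strip()
    is_question = stripped.endswith("?")
    words = stripped.rstrip("?.!").split()
    n = len(words)
    result = []
    for i, word in enumerate(words):
        # Rule 1: declination, pitch falls linearly across the utterance.
        fraction = i / (n - 1) if n > 1 else 0.0
        pitch = start_hz + (end_hz - start_hz) * fraction
        # Rule 2: content words receive a pitch accent.
        if word.lower() not in FUNCTION_WORDS:
            pitch += accent_hz
        # Rule 3: boundary tone, rising for questions and falling otherwise.
        if i == n - 1:
            pitch += boundary_hz if is_question else -boundary_hz
        result.append((word, round(pitch, 1)))
    return result


print(assign_pitch("The cat sat."))
print(assign_pitch("The cat sat?"))
print(assign_pitch("The dog ran."))
```

The question and the statement come out differently, so the rules do capture structure. But "The cat sat." and "The dog ran." receive identical pitch targets, because nothing in the rules looks at meaning or surrounding context. Repeat that over a long passage and the result is the monotone delivery described above.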

What Chapter 12 Sets Up

Classical TTS achieved remarkable results for its era and established the durable concepts — text normalization, phoneme generation, prosody modeling, vocoding. Neural systems didn't discard these ideas; they absorbed them. But classical TTS plateaued, with naturalness and expressiveness lagging expectations, and the breakthrough came when neural networks were applied not to individual components but to the entire synthesis process.

Next up — Chapter 13: The Neural TTS Revolution. Tacotron, WaveNet, and FastSpeech — how learning the text-to-speech mapping directly shattered the old ceiling and made prosody emerge instead of being programmed.

Want the full picture? Grab Voice and AI for the whole story of how voices learned to sound human.