This is Part 12 of a series walking through my book Voice and AI. In the previous chapter, we defined what synthetic speech is really after. Now we look at how the first systems pursued it — built from carefully engineered parts, decades before neural networks learned prosody.
Understanding classical TTS matters not because it's the future, but because it explains why the future looks the way it does. These systems established the core concepts modern models would later absorb.
The Classical Pipeline
Classical TTS followed a strict sequence. Text was normalized ("12" to "twelve," "Dr." to "doctor"), converted into linguistic units via part-of-speech tagging and pronunciation dictionaries, assigned prosody (stress, pauses, pitch movements), and finally rendered into speech. Each stage demanded expert knowledge and baked in its own assumptions. Critically, each stage also passed its errors downstream, where they accumulated.
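To make the staged design concrete, here is a minimal Python sketch of a classical front end. Everything in it is an illustrative assumption rather than a real system: the tiny abbreviation table, the two-word ARPAbet-style lexicon, the single prosody rule, and all function names are hypothetical. The final rendering stage is left out.

```python
# Toy tables: every entry is a made-up illustration, not a real lexicon.
ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
LEXICON = {  # ARPAbet-style phonemes; digits mark stress
    "doctor": ["D", "AA1", "K", "T", "ER0"],
    "twelve": ["T", "W", "EH1", "L", "V"],
}


def number_to_words(n: int) -> str:
    """Spell out 0..99; larger values fall back to digit-by-digit reading."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"
    return " ".join(ONES[int(d)] for d in str(n))


def normalize(text: str) -> list[str]:
    """Stage 1: expand abbreviations and digits into plain lowercase words."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
            continue
        token = token.strip(".,!?")
        if token.isdecimal():
            words.extend(number_to_words(int(token)).split())
        elif token:
            words.append(token)
    return words


def to_phonemes(words: list[str]) -> list[list[str]]:
    """Stage 2: dictionary lookup. Unknown words are crudely spelled out
    letter by letter, and that wrong guess is passed downstream unchanged."""
    return [LEXICON.get(w, list(w.upper())) for w in words]


def assign_prosody(words: list[str]) -> list[dict]:
    """Stage 3: one hand-written rule, a pause after the final word only."""
    last = len(words) - 1
    return [{"word": w, "pause_after": i == last} for i, w in enumerate(words)]


def front_end(text: str) -> list[dict]:
    """Run the stages in strict order; no stage can revisit an earlier one."""
    words = normalize(text)
    phonemes = to_phonemes(words)
    prosody = assign_prosody(words)
    return [dict(p, phonemes=ph) for p, ph in zip(prosody, phonemes)]


for entry in front_end("Dr. Smith has 12 cats."):
    print(entry)
```

Note what happens to "Smith": it is missing from the toy lexicon, so the fallback emits letters instead of phonemes, and the prosody stage has no way to notice. The same is true of the punctuation that normalization throws away. That is the error accumulation described above, in miniature.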
Concatenative and Unit Selection
One of the earliest approaches to produce convincingly natural output was concatenative synthesis: stitch together small segments of recorded human speech, from phonemes and syllables up to larger units. Because the building blocks are real recordings, it can sound very natural under the right conditions. But it needs large, carefully recorded databases, offers limited prosody control, and must approximate when a needed segment is missing, which produces audible discontinuities. Unit selection refined this by searching a large database for the unit sequence that best matches the target, balancing acoustic continuity between neighboring units against the linguistic constraints of the target. Brilliant when it works, obviously broken when it fails, and heavy either way: the databases are large and the search is expensive.
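The search at the heart of unit selection is usually framed as a shortest-path problem: each candidate unit pays a target cost for how poorly it fits the desired specification and a join cost for how badly it connects to its neighbor. The Python sketch below is a toy version under stated assumptions. Pitch is the only feature, the `Unit` fields, cost functions, and four-unit database are all hypothetical, and the dynamic-programming search is a standard Viterbi-style pass rather than any particular system's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One recorded segment. Pitch is the only feature in this toy model."""
    name: str
    phone: str
    start_pitch: float  # Hz at the left edge
    end_pitch: float    # Hz at the right edge


def target_cost(unit: Unit, desired_pitch: float) -> float:
    """How far the unit's average pitch is from what the text calls for."""
    return abs((unit.start_pitch + unit.end_pitch) / 2 - desired_pitch)


def join_cost(left: Unit, right: Unit) -> float:
    """Pitch jump at the seam between two adjacent units."""
    return abs(left.end_pitch - right.start_pitch)


def select_units(targets, database, join_weight=1.0):
    """Return (best unit sequence, total cost) for a list of
    (phone, desired_pitch) targets, using dynamic programming."""
    candidates = [[u for u in database if u.phone == phone]
                  for phone, _ in targets]
    if any(not c for c in candidates):
        # A real system would back off to an approximation here.
        raise LookupError("no recorded unit for at least one target phone")

    # best[i][j] = (cheapest cost of any path ending in candidates[i][j],
    #               index of the previous unit on that path)
    best = [[(target_cost(u, targets[0][1]), -1) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for unit in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_weight * join_cost(prev, unit), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(unit, targets[i][1]), back))
        best.append(row)

    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


DATABASE = [
    Unit("hh_low", "HH", 120.0, 120.0),
    Unit("hh_high", "HH", 180.0, 180.0),
    Unit("ay_falling", "AY", 160.0, 80.0),  # perfect average, rough seam
    Unit("ay_flat", "AY", 125.0, 125.0),    # slightly off, smooth seam
]

path, total = select_units([("HH", 120.0), ("AY", 120.0)], DATABASE)
print([u.name for u in path], total)
```

In this toy database the search picks `ay_flat` over `ay_falling`, even though `ay_falling` matches the target pitch exactly, because the seam with the preceding unit would be audibly worse. The work grows with the number of candidates per position squared, which hints at why searching a large database was expensive.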
Parametric TTS and the Vocoder Bottleneck
Parametric TTS takes the opposite bet: instead of storing recordings, it learns statistical models of speech parameters (spectral shape, pitch, duration), predicts them from text, then generates audio with a vocoder. It's compact, flexible, allows fine prosody control, and adapts easily to new voices. But it tends to sound less natural: the statistical models average over many renditions of the same sound, and the resulting over-smoothed parameters produce muffled, robotic speech. The vocoder itself was often the limiting factor as well: early vocoders introduced artifacts that capped quality even when the predicted parameters were accurate.
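One common way to picture over-smoothing: a statistical model trained to minimize average error tends toward the mean of the examples it has seen, and the mean of several lively contours is a flatter contour. The Python sketch below illustrates only that averaging effect. The three pitch contours are invented numbers, the function names are hypothetical, and no real parametric system is this simple.

```python
from statistics import mean

# Three invented pitch contours (Hz per frame) for the "same" phrase.
# Each speaker rendition has a clear 40 Hz accent, but at a different frame.
RENDITIONS = [
    [100.0, 140.0, 100.0, 100.0, 100.0],
    [100.0, 100.0, 140.0, 100.0, 100.0],
    [100.0, 100.0, 100.0, 140.0, 100.0],
]


def average_contour(contours):
    """Frame-by-frame mean: what an error-minimizing model drifts toward."""
    return [mean(frame) for frame in zip(*contours)]


def pitch_range(contour):
    """Distance between the highest and lowest frame, in Hz."""
    return max(contour) - min(contour)


predicted = average_contour(RENDITIONS)
print("range of each rendition:", [pitch_range(c) for c in RENDITIONS])
print("range of averaged contour:", round(pitch_range(predicted), 1))
```

Every individual rendition has a sharp accent, yet the averaged contour spreads it into a low, wide bump with about a third of the original pitch range. Applied to spectral parameters as well as pitch, the same flattening is one way to understand the muffled quality described above.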
What Chapter 12 Sets Up
Classical TTS achieved remarkable results for its era and established the durable concepts — text normalization, phoneme generation, prosody modeling, vocoding. Neural systems didn't discard these ideas; they absorbed them. But classical TTS plateaued, with naturalness and expressiveness lagging expectations, and the breakthrough came when neural networks were applied not to individual components but to the entire synthesis process.
Next up — Chapter 13: The Neural TTS Revolution. Tacotron, WaveNet, and FastSpeech — how learning the text-to-speech mapping directly shattered the old ceiling and made prosody emerge instead of being programmed.