This is Part 13 of a series walking through my book Voice and AI. In the previous chapter, classical TTS hit its ceiling — rigid pipelines, rule-based prosody, imperfect vocoders. Then that ceiling broke.
Neural networks didn't just improve synthetic speech; they changed what people believed was possible. Voices became fluid and expressive. Prosody emerged naturally instead of being imposed by rules. The gap between human and machine speech narrowed dramatically — and everyone noticed.
Learning the Mapping, and Tacotron
The most important shift was conceptual. Classical systems decomposed synthesis into many hand-tuned subproblems; neural TTS asked whether a model could learn the text-to-speech mapping directly. Linguistic structure didn't vanish; it was learned from data rather than hard-coded. Tacotron (2017) was an early demonstration: a sequence-to-sequence model that takes text and produces a mel spectrogram capturing both content and prosody. Its key ingredient was attention, which learns to align characters with acoustic frames automatically and, in doing so, decides how long each sound lasts and how prosody unfolds. Timing and intonation were now learned jointly inside a single model rather than programmed as separate rules.
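To make the alignment idea concrete, here is a minimal sketch of how attention weights between output frames and input characters imply durations. It assumes plain dot-product scoring and hand-made toy vectors; Tacotron itself uses a learned attention mechanism inside a trained network, so the names `attend`, `durations_from_alignment`, `char_keys`, and `frame_queries` are illustrative, not taken from any library.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys):
    """Attention weights of one output frame over all input characters.

    Assumption: dot-product scoring, chosen here for brevity.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(scores)

def durations_from_alignment(alignment, num_chars):
    """Count how many frames attend most strongly to each character."""
    counts = [0] * num_chars
    for weights in alignment:
        counts[weights.index(max(weights))] += 1
    return counts

# Toy encoder outputs: one vector per input character (hypothetical values).
char_keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Toy decoder states: one vector per acoustic frame (hypothetical values).
frame_queries = [
    [4.0, 0.0, 0.0],
    [4.0, 1.0, 0.0],
    [0.0, 4.0, 0.0],
    [0.0, 4.0, 0.0],
    [0.0, 4.0, 1.0],
    [0.0, 0.0, 4.0],
]

alignment = [attend(q, char_keys) for q in frame_queries]
durations = durations_from_alignment(alignment, len(char_keys))
print(durations)  # [2, 3, 1]
```

Each row of `alignment` is one frame's distribution over characters. Read top to bottom, the rows trace a roughly diagonal path through the text, and the number of frames that dwell on a character is, in effect, its duration.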
WaveNet and the Two-Stage Architecture
Tacotron's output was still a spectrogram, not sound, so waveform synthesis remained the bottleneck. The answer was WaveNet, introduced in 2016 as a generative model of raw audio and later adopted as a neural vocoder. It generates audio sample by sample, modeling the waveform's probability distribution directly instead of relying on simplified signal models, and it reached a fidelity well beyond earlier vocoders. The cost was speed: generating one sample at a time means tens of thousands of sequential steps per second of audio, which made early versions impractical for real-time use. Together, Tacotron-style models and WaveNet established a durable two-stage architecture: predict an acoustic representation from text, then convert it to a waveform with a neural vocoder. Each stage can improve independently, and errors become easier to localize: prosody problems usually originate in the first stage, audio artifacts in the vocoder.
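The sample-by-sample loop is easier to appreciate in code. The sketch below assumes 8-bit mu-law quantization (256 amplitude classes, as in the original WaveNet setup) and replaces the neural network with a hypothetical stand-in, `toy_next_sample_distribution`, that simply favors values near the previous sample. It illustrates the autoregressive structure and its cost, not WaveNet's actual model.

```python
import math
import random

MU = 255  # 8-bit mu-law companding gives 256 amplitude classes

def mu_law_encode(x):
    """Map an amplitude in [-1, 1] to an integer class in [0, 255]."""
    compressed = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((compressed + 1) / 2 * MU))

def mu_law_decode(c):
    """Map an integer class in [0, 255] back to an amplitude in [-1, 1]."""
    compressed = 2 * c / MU - 1
    expanded = math.expm1(abs(compressed) * math.log1p(MU)) / MU
    return math.copysign(expanded, compressed)

def toy_next_sample_distribution(history):
    """Hypothetical stand-in for the network.

    Returns a distribution over the 256 classes that is peaked near the
    previous sample. A real model would condition on a long history and
    on the spectrogram from the first stage.
    """
    last = history[-1] if history else 128
    weights = [math.exp(-abs(c - last) / 4.0) for c in range(MU + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def generate(num_samples, seed=0):
    """Draw samples one at a time; step t cannot start before step t-1 ends."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_samples):
        probs = toy_next_sample_distribution(history)
        next_class = rng.choices(range(MU + 1), weights=probs, k=1)[0]
        history.append(next_class)
    return [mu_law_decode(c) for c in history]

audio = generate(200)  # 200 samples is only 12.5 ms at an assumed 16 kHz
```

Because every sample depends on the ones before it, the loop cannot be parallelized across time. At an assumed 16 kHz sample rate, one second of audio needs 16,000 sequential passes through the model, which is where the latency problem comes from.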
FastSpeech and Parallelization
Tacotron had drawbacks: slow sequential inference, and attention failures that caused skipped or repeated words. FastSpeech rethought alignment with a teacher–student approach: an autoregressive teacher model with attention provides the alignments, which are then used to train a feed-forward model that needs no attention-based alignment at inference. Instead, the feed-forward model predicts a duration for each input token and expands the sequence to match. The result is faster, more stable synthesis with predictable timing and explicit duration control, which opens the door to better prosody steering. The broader theme is parallelization: generating speech in parallel rather than sequentially slashed latency and made high-quality TTS practical for real products.
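Explicit duration control can be sketched in a few lines. FastSpeech expands each input token's hidden state by its predicted number of frames, a step the paper calls a length regulator. The version below is a simplified illustration: strings stand in for hidden-state vectors, the durations are made up rather than predicted, and the `speed` parameter is a convention chosen for this example.

```python
def length_regulate(phoneme_states, durations, speed=1.0):
    """Repeat each phoneme state for its (scaled) number of frames.

    speed > 1.0 shortens every duration (faster speech);
    speed < 1.0 lengthens it (slower speech).
    """
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme state")
    if speed <= 0:
        raise ValueError("speed must be positive")
    frames = []
    for state, duration in zip(phoneme_states, durations):
        scaled = max(0, round(duration / speed))
        frames.extend([state] * scaled)
    return frames

# Placeholder states and durations (hypothetical values).
states = ["h", "e", "l", "o"]
durations = [2, 3, 1, 4]

normal = length_regulate(states, durations)
slow = length_regulate(states, durations, speed=0.5)

print(len(normal), len(slow))  # 10 20
```

Since the output length is known before any frame is generated, every frame can be computed in parallel, and changing the speaking rate is a matter of scaling the durations rather than retraining anything.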
What Chapter 13 Sets Up
This was a shift in the center of gravity — from engineering to learning, rules to data, explicit design to emergent behavior. But one decisive component still deserves its own focus: the vocoder that turns representations into actual sound.
Next up — Chapter 14: Vocoders. Why waveform generation so often decides whether synthetic speech sounds convincing — from Griffin–Lim and the role of phase to HiFi-GAN and practical real-time neural vocoders.