Chapter 13: The Neural TTS Revolution — Tacotron, WaveNet, and FastSpeech

Voice and AI, Chapter 13: how neural networks broke the TTS ceiling — learning text-to-speech directly, prosody emerging from data, WaveNet's waveform quality, and FastSpeech's speed and stability.

By Sho Shimoda

This is Part 13 of a series walking through my book Voice and AI. In the previous chapter, classical TTS hit its ceiling — rigid pipelines, rule-based prosody, imperfect vocoders. Then that ceiling broke.


Neural networks didn't just improve synthetic speech; they changed what people believed was possible. Voices became fluid and expressive. Prosody emerged naturally instead of being imposed by rules. The gap between human and machine speech narrowed dramatically — and everyone noticed.

Learning the Mapping, and Tacotron

The most important shift was conceptual. Classical systems decomposed synthesis into many hand-tuned subproblems; neural TTS asked whether a model could learn the text-to-speech mapping directly. Linguistic structure didn't vanish. It was learned from data rather than hard-coded.

Tacotron was an early demonstration. It takes text and predicts a mel spectrogram that carries both content and prosody. Its key mechanism was attention: at each output frame the decoder learns which input characters to focus on, so the alignment between characters and acoustic frames is learned automatically, and with it how long each sound lasts and how prosody unfolds. Timing and intonation were now learned together with everything else, not programmed.
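To make the alignment idea concrete, here is a minimal sketch of a single attention step, written as an illustration under stated assumptions rather than as Tacotron's actual implementation. The 2-D character encodings and decoder queries are made-up values, and plain dot-product scoring is used for brevity, whereas the published Tacotron models use learned scoring functions.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    """One decoder step of dot-product attention (an assumed, simplified scorer).

    query          -- decoder state for the current acoustic frame
    encoder_states -- one vector per input character
    Returns (weights over characters, weighted-average context vector).
    """
    scores = [sum(q * k for q, k in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context

# Hypothetical encodings for the three characters of "cat".
encoder_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
# Hypothetical decoder queries, one per output frame.
frame_queries = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]

# For each frame, record which character receives the most attention.
alignment = []
for query in frame_queries:
    weights, _ = attend(query, encoder_states)
    alignment.append(weights.index(max(weights)))

# Frames per character, read off the alignment: an implicit duration.
frames_per_char = [alignment.count(i) for i in range(len(encoder_states))]
```

In this toy run the first character holds the decoder's attention for two frames and the others for one each, which is the sense in which durations fall out of the alignment instead of being specified by hand.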

Key idea: Prosody emerged as a surprise. Models were only trained to match spectrograms, yet pitch variation, rhythm, and phrasing appeared on their own — the model absorbed the statistical regularities of human speech. The flip side: because prosody is implicit, it's hard to control.

WaveNet and the Two-Stage Architecture

Tacotron's output was still a spectrogram, not sound, so waveform synthesis remained the bottleneck. WaveNet, introduced in 2016 shortly before Tacotron, removed it. Used as a neural vocoder, WaveNet generates audio sample by sample, modeling the probability distribution of each sample given all previous ones instead of relying on simplified signal models. The fidelity was unprecedented. The cost was speed: every sample depends on the ones before it, so one second of 16 kHz audio takes 16,000 sequential steps, which made early real-time use impractical.

Together, Tacotron and WaveNet established a durable two-stage architecture: predict an acoustic representation from text, then convert it to a waveform with a neural vocoder. Each stage can improve independently, and the split clarifies where errors originate (prosody problems in the first stage, audio artifacts in the vocoder).
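The sample-by-sample loop is easy to sketch, and doing so shows why it is slow. The code below is a toy under loud assumptions: `toy_next_sample_distribution` is a hypothetical stand-in for the neural network (it just favors values near the previous sample), and the 16 kHz rate is an assumed example. The 256-level mu-law quantization mirrors the discretization described in the WaveNet paper, but nothing else here resembles the real model.

```python
import math
import random

MU = 255             # 8-bit mu-law companding: 256 discrete classes
SAMPLE_RATE = 16000  # assumed example rate, in samples per second

def mu_law_encode(x):
    """Map an amplitude in [-1, 1] to an integer class in [0, 255]."""
    compressed = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((compressed + 1) / 2 * MU))

def mu_law_decode(index):
    """Approximate inverse of mu_law_encode (exact up to quantization error)."""
    compressed = 2 * index / MU - 1
    return math.copysign(math.expm1(abs(compressed) * math.log1p(MU)) / MU, compressed)

def toy_next_sample_distribution(history):
    """HYPOTHETICAL stand-in for the network: P(next class | previous samples)."""
    center = history[-1] if history else 128
    probs = [0.0] * 256
    for offset, p in ((-1, 0.25), (0, 0.5), (1, 0.25)):
        probs[min(255, max(0, center + offset))] += p
    return probs

def generate(num_samples, seed=0):
    """Autoregressive generation: each step needs all earlier outputs."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_samples):
        probs = toy_next_sample_distribution(history)  # depends on history
        next_class = rng.choices(range(256), weights=probs)[0]
        history.append(next_class)                     # feeds the next step
    return [mu_law_decode(c) for c in history]

audio = generate(100)
# The loop cannot be parallelized across time: one model call per sample.
sequential_steps_per_second = SAMPLE_RATE
```

Because each iteration consumes the previous iteration's output, the number of model evaluations grows with the number of audio samples, not with the number of spectrogram frames.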

FastSpeech and Parallelization

Tacotron had drawbacks: slow sequential inference, and attention failures that caused skipped or repeated words. FastSpeech rethought alignment with a teacher–student approach. An autoregressive teacher model with attention supplies the alignment, from which per-phoneme durations are extracted; these train a feed-forward model that predicts durations itself and needs no attention-based alignment at inference. The result is faster, more stable synthesis with predictable timing and explicit duration control, which opens the door to better prosody steering. The broader theme is parallelization: generating all frames at once rather than one after another slashed latency and made high-quality TTS practical for real products.
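Explicit duration control is simple to illustrate. FastSpeech expands phoneme-level states to frame-level ones with a component the paper calls a length regulator; the sketch below is a minimal version of that idea, not the paper's code. The string "states", the `speed` parameter, and the choice to keep at least one frame per phoneme are assumptions made for this example.

```python
def length_regulate(phoneme_states, durations, speed=1.0):
    """Repeat each phoneme's state once per output frame.

    phoneme_states -- one hidden state per phoneme (any object works here)
    durations      -- predicted number of frames for each phoneme
    speed          -- hypothetical control: >1 shortens, <1 lengthens
    """
    frames = []
    for state, duration in zip(phoneme_states, durations):
        # Assumption for this sketch: never drop a phoneme entirely.
        n_frames = max(1, round(duration / speed))
        frames.extend([state] * n_frames)
    return frames

# Stand-in states for the phonemes of "cat"; real states would be vectors.
phoneme_states = ["K", "AE", "T"]
durations = [2, 3, 1]

normal = length_regulate(phoneme_states, durations)
slow = length_regulate(phoneme_states, durations, speed=0.5)
```

Once the frame sequence is fixed up front like this, every frame can be decoded in parallel, and words cannot be skipped or repeated because each phoneme is guaranteed its own span of frames.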

Important: The revolution introduced new trade-offs — bigger data appetite, harder prosody control, and new failure modes. Instead of monotone speech, errors now show up as strange emphasis, timing glitches, or instability under unusual input.

What Chapter 13 Sets Up

This was a shift in the center of gravity — from engineering to learning, rules to data, explicit design to emergent behavior. But one decisive component still deserves its own focus: the vocoder that turns representations into actual sound.


Next up — Chapter 14: Vocoders. Why waveform generation so often decides whether synthetic speech sounds convincing — from Griffin–Lim and the role of phase to HiFi-GAN and practical real-time neural vocoders.
