Chapter 9: Modern ASR Architectures — CTC, Attention, and Transformers

Voice and AI, Chapter 9: how end-to-end speech recognition handles alignment implicitly — CTC, attention-based models, transformers and self-attention, streaming constraints, and why architecture choice depends on use case.

By Sho Shimoda

This is Part 9 of a series walking through my book Voice and AI. In the previous chapter, the field crossed into end-to-end learning. This chapter explains how modern architectures actually work — the ideas, not the equations.


Classical ASR was modular, with separate components for feature extraction, acoustic modeling, pronunciation, and language modeling. Modern architectures collapse much of that pipeline into a single trainable system: audio in, text out. Structure doesn't disappear; it gets learned rather than hard-coded. The same old problems remain (variability, alignment, context), they're just approached differently.

The Alignment Problem, Two Ways

Audio is continuous, text is discrete, and there's no obvious mapping between them. Where HMMs handled alignment explicitly, end-to-end models handle it implicitly, and two big ideas emerged.

Connectionist Temporal Classification (CTC) lets the model emit a symbol at every time step, including a special "blank" for no output. Repeated symbols are merged and blanks removed to get the final text, and training sums the probability of every frame-level alignment that collapses to the target, so the model learns alignment without explicit supervision. It shines on monotonic problems like speech, where output order follows input order. The catch is that CTC assumes each frame's output is conditionally independent of the others given the audio, which limits how well it can model dependencies between output symbols.
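To make the "sum over all alignments" idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not how any particular toolkit implements CTC: the blank is assumed to sit at index 0, the per-frame probabilities are made up, and the names ctc_collapse and ctc_target_probability are hypothetical. Real implementations work in log space for numerical stability.

```python
BLANK = 0  # assumption: index 0 is the CTC blank symbol


def ctc_collapse(path):
    """Turn a frame-level path into text: merge repeats, then drop blanks."""
    out = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != BLANK:
            out.append(symbol)
        prev = symbol
    return out


def ctc_target_probability(probs, target):
    """Total probability of every frame-level path that collapses to `target`.

    probs[t][k] is the model's probability of symbol k at time step t.
    Uses the standard forward recursion over the blank-interleaved target.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [BLANK]
    for symbol in target:
        ext += [symbol, BLANK]
    size = len(ext)

    # At t = 0 a path can start on the leading blank or the first symbol.
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]              # stay on the same position
            if s >= 1:
                total += alpha[s - 1]     # advance by one position
            # Skipping the blank is allowed only between different symbols.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid paths end on the last symbol or on the trailing blank.
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


# Made-up probabilities: 3 frames, symbols {0: blank, 1, 2}.
frame_probs = [
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
]
print(ctc_collapse([1, 1, 0, 1, 2, 2, 0]))  # [1, 1, 2]
print(ctc_target_probability(frame_probs, [1, 2]))
```

Note how the blank does real work in the collapse step: without it, the path 1, 1, 1 could never produce a doubled symbol, because repeats are merged.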

Attention-based models instead decide which parts of the input to focus on for each output symbol, computing a weighted summary of the relevant regions. This makes alignment soft and flexible rather than fixed, and because each output symbol is conditioned on the ones before it, the result is often more fluent. The trade-off is that attention can struggle with very long inputs, and nothing in the mechanism forces the alignment to be monotonic.
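The "weighted summary" can be sketched in a few lines. This is a simplified illustration, assuming scaled dot-product scoring and toy two-dimensional vectors; the function names and numbers are hypothetical, and real models learn the query and frame representations rather than taking them as given.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_frames):
    """One decoder step: score every input frame, then average by weight.

    Returns (weights, context), where context is the weighted summary
    of the encoder frames for this output symbol.
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, frame)) / scale
        for frame in encoder_frames
    ]
    weights = softmax(scores)
    dim = len(encoder_frames[0])
    context = [
        sum(w * frame[d] for w, frame in zip(weights, encoder_frames))
        for d in range(dim)
    ]
    return weights, context


# Toy example: three encoder frames, one decoder query.
frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, context = attend([4.0, 0.0], frames)
print(weights)  # the first frame gets the largest weight
print(context)
```

Nothing here stops the weights from jumping backward in time on the next output step, which is the monotonicity problem mentioned above.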

Key idea: CTC and attention are two different bets on the same hard problem — turning a continuous signal into discrete symbols without anyone hand-labeling where each one begins.

Transformers and Self-Attention

Transformers relate every part of a sequence to every other part via self-attention, capturing long-range dependencies in parallel instead of stepping through input one frame at a time. In ASR they serve in both encoders and decoders, modeling acoustic context over long spans and integrating language information, which is why transformer-based models have become dominant. The cost is compute: the number of pairwise comparisons in self-attention grows with the square of the sequence length, so long audio sequences demand careful design and optimization.
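A stripped-down sketch of self-attention shows where that cost comes from. It assumes a single head and skips the learned query, key, and value projections that real transformers apply, so treat it as an illustration of the data flow only; the function name and inputs are hypothetical.

```python
import math


def self_attention(frames):
    """Every frame attends to every frame of the same sequence.

    Simplification: frames are used directly as queries, keys, and values.
    Returns (weights, outputs), where weights is a T x T matrix.
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    weights = []
    outputs = []
    for query in frames:
        # One score per (query frame, key frame) pair: T scores per row.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale for key in frames
        ]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        # New representation of this frame: weighted mix of all frames.
        outputs.append(
            [sum(w * v[d] for w, v in zip(row, frames)) for d in range(dim)]
        )
    return weights, outputs


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, outputs = self_attention(frames)
print(len(weights) * len(weights[0]))  # 9 pairwise weights for 3 frames
```

Doubling the number of frames quadruples the size of the weight matrix, which is why long recordings get expensive quickly.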

Streaming and Hybrid Reality

Many systems must run in real time, which rules out relying on unlimited future context. Attention has to work incrementally, and transformers get adapted with chunked attention or caching for streaming, all in service of low latency with as little accuracy loss as possible. Real systems are also rarely architecturally pure: many combine CTC and attention losses during training, or bolt on external language models to lift accuracy without retraining everything. The mindset is pragmatic: robust performance over purity.
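Two of these ideas are small enough to sketch. The first function builds a chunked attention mask, one common way to bound how far ahead a frame can look; the second interpolates the two training losses. Both are illustrations under assumptions: the function names, the left_chunks parameter, and the default weight of 0.3 are hypothetical choices, not values from the book.

```python
def chunked_attention_mask(num_frames, chunk_size, left_chunks=1):
    """mask[i][j] is True when frame i may attend to frame j.

    A frame sees its own chunk plus `left_chunks` earlier chunks,
    and never anything past the end of its own chunk.
    """
    mask = []
    for i in range(num_frames):
        chunk = i // chunk_size
        first = max(0, chunk - left_chunks) * chunk_size
        last = min(num_frames, (chunk + 1) * chunk_size)  # exclusive bound
        mask.append([first <= j < last for j in range(num_frames)])
    return mask


def joint_loss(ctc_loss, attention_loss, ctc_weight=0.3):
    """Weighted mix of the two losses; ctc_weight is a tunable assumption."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


# 6 frames in chunks of 2: lookahead is at most one frame.
for row in chunked_attention_mask(6, 2):
    print("".join("x" if allowed else "." for allowed in row))

print(joint_loss(2.0, 1.0, ctc_weight=0.5))  # 1.5
```

With a mask like this, latency is bounded by the chunk size, since the model only has to wait for the current chunk to finish before producing output. Smaller chunks mean lower latency but less context per frame.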

Important: There's no single best architecture. CTC is efficient and stable, attention is expressive, transformers are powerful but resource-hungry. Offline transcription, real-time assistants, multilingual support, and edge deployment each push you toward different choices.

What Chapter 9 Sets Up

So far we've quietly assumed one language. Reality involves many languages, code-switching, and low-resource scenarios that architecture alone can't solve.


Next up — Chapter 10: Multilingual and Low-Resource Speech. How transfer learning stretches coverage across thousands of languages, where the performance gaps persist, and why supporting a language well is as much a social question as a technical one.

Want the full picture? Grab Voice and AI here for the complete architecture deep-dive.