This is Part 9 of a series walking through my book Voice and AI. In the previous chapter, the field crossed into end-to-end learning. This chapter explains how modern architectures actually work — the ideas, not the equations.
Classical ASR was modular: separate feature extraction, acoustic, pronunciation, and language models. Modern architectures collapse much of that pipeline into a single trainable system — audio in, text out. Structure doesn't disappear; it gets learned rather than hard-coded. The same old problems remain — variability, alignment, context — they're just approached differently.
The Alignment Problem, Two Ways
Audio is continuous, text is discrete, and there's no obvious mapping. Where HMMs handled alignment explicitly, end-to-end models learn it as part of training, and two big ideas emerged. Connectionist Temporal Classification (CTC) lets the model emit a symbol at every time step, including a special "blank" for no output, then sums over all alignments that could produce the target, so it learns alignment without frame-level supervision. It shines on monotonic problems like speech, where output order follows input order. Its weakness is that it treats each output as conditionally independent of the others given the audio, which limits how well it can model dependencies between output symbols.
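To make the "sum over all alignments" idea concrete, here is a minimal sketch in Python. It is a brute-force toy, not how a real system computes the CTC loss (production code uses a dynamic-programming forward pass instead of enumerating paths). The names `BLANK`, `collapse`, and `ctc_probability` are made up for this illustration.

```python
from itertools import product

BLANK = "_"  # hypothetical blank symbol chosen for this sketch


def collapse(path):
    """Map a frame-level path to a label: merge repeats, then drop blanks."""
    out = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != BLANK:
            out.append(symbol)
        prev = symbol
    return "".join(out)


def ctc_probability(probs, target):
    """Add up the probability of every path that collapses to `target`.

    `probs` is a list with one dict per frame, mapping symbol -> probability.
    Frames are multiplied independently, which is the CTC assumption.
    Brute force: the number of paths grows exponentially with frame count.
    """
    symbols = sorted(probs[0])
    total = 0.0
    for path in product(symbols, repeat=len(probs)):
        if collapse(path) == target:
            p = 1.0
            for t, symbol in enumerate(path):
                p *= probs[t][symbol]
            total += p
    return total


# Two frames, two symbols, uniform probabilities (made-up numbers).
frames = [{"a": 0.5, BLANK: 0.5}, {"a": 0.5, BLANK: 0.5}]
p_a = ctc_probability(frames, "a")  # paths "aa", "a_", "_a" all collapse to "a"
```

Three different frame-level paths produce the same one-letter transcript, and training only cares about their total. That is what lets the model settle on an alignment without ever being told one.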
Attention-based models instead decide which parts of the input to focus on for each output symbol, computing a weighted summary of the relevant regions. The alignment becomes soft and flexible, readable from the attention weights rather than imposed in advance, and the output is often more fluent. The trade-off is that attention can struggle with very long inputs, and nothing in the mechanism itself forces the alignment to stay monotonic.
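A single attention step can be sketched in a few lines of plain Python. This is a simplified illustration that assumes dot-product scoring and tiny hand-written vectors; the names `softmax`, `attend`, `query`, and `frames` are hypothetical, and real models use learned projections and batched tensor operations.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, frames):
    """One decoder step: score every input frame, then average them by weight."""
    scores = [sum(q * f for q, f in zip(query, frame)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    context = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return weights, context


# Three encoded audio frames and one decoder query (made-up numbers).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
query = [0.0, 5.0]
weights, context = attend(query, frames)  # nearly all weight lands on frame 1
```

The `weights` list is the soft alignment for this output symbol. Nothing stops the next step from putting its weight on an earlier frame, which is the flexibility and the monotonicity risk in one picture.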
Transformers and Self-Attention
Transformers relate every part of a sequence to every other part via self-attention, capturing long-range dependencies in a single step instead of passing information along one frame at a time. In ASR they serve in both encoders and decoders, modeling acoustic context over long spans and integrating language information, which is why transformer-based models have become dominant. The cost is compute: because every frame is compared with every other frame, the work grows quadratically with sequence length, so long audio demands careful design and optimization.
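Here is a rough sketch of self-attention over a short sequence, again in plain Python. It deliberately skips the learned query, key, and value projections and the multiple heads a real transformer has, so treat it as an illustration of the all-pairs comparison only. The function name `self_attention` and the sample vectors are assumptions of this sketch.

```python
import math


def self_attention(xs):
    """Single-head self-attention with no learned projections (a simplification).

    Every position is used as query, key, and value. Returns the new
    representation for each position plus the full weight matrix.
    """
    dim = len(xs[0])
    scale = math.sqrt(dim)  # scaled dot-product, as in standard transformers
    outputs, weight_rows = [], []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in xs]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        w = [e / z for e in exps]
        weight_rows.append(w)
        outputs.append(
            [sum(wi * v[d] for wi, v in zip(w, xs)) for d in range(dim)]
        )
    return outputs, weight_rows


# Four frames of two features each (made-up numbers).
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
outputs, weight_rows = self_attention(xs)
num_comparisons = sum(len(row) for row in weight_rows)  # 4 frames -> 16 scores
```

Four frames need sixteen scores; double the frames and the count quadruples. At typical frame rates, a few minutes of audio already means a very large weight matrix, which is where the design effort goes.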
Streaming and Hybrid Reality
Many systems must run in real time, which forbids relying on unlimited future context. Attention has to work incrementally, so transformers get adapted for streaming with chunked attention, which limits how far ahead each frame can look, or with caching, which avoids recomputing past frames. The goal is low latency while giving up as little accuracy as possible. And real systems are rarely architecturally pure: many combine CTC and attention losses during training, or bolt on external language models to lift accuracy without retraining everything. The mindset is pragmatic: robust performance over purity.
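Both ideas in this section are simple to sketch. The first function builds a chunked attention mask, where each frame may see its own chunk and a fixed number of earlier chunks. The second mixes a CTC loss and an attention loss with a single weight, which is one common way to combine them. The names `chunk_mask`, `joint_loss`, `left_chunks`, and `ctc_weight`, along with all the numbers, are assumptions for illustration; actual systems differ in how they define chunks and choose the weight.

```python
def chunk_mask(num_frames, chunk_size, left_chunks):
    """mask[i][j] is True when frame i is allowed to attend to frame j.

    A frame sees every frame in its own chunk plus `left_chunks` earlier
    chunks, and nothing from later chunks. Look-ahead is therefore bounded
    by the chunk size.
    """
    mask = []
    for i in range(num_frames):
        chunk_i = i // chunk_size
        row = []
        for j in range(num_frames):
            chunk_j = j // chunk_size
            row.append(chunk_i - left_chunks <= chunk_j <= chunk_i)
        mask.append(row)
    return mask


def joint_loss(ctc_loss, attention_loss, ctc_weight):
    """Weighted sum of the two training losses (ctc_weight between 0 and 1)."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


# Six frames, chunks of two frames, one chunk of history (made-up settings).
mask = chunk_mask(num_frames=6, chunk_size=2, left_chunks=1)
loss = joint_loss(ctc_loss=2.0, attention_loss=1.0, ctc_weight=0.3)
```

With these settings, frame 0 can see frames 0 and 1 only, while frame 5 can see frames 2 through 5. Smaller chunks mean lower latency and less context, which is the trade-off streaming systems tune.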
What Chapter 9 Sets Up
So far we've quietly assumed one language. Reality involves many languages, code-switching, and low-resource scenarios that architecture alone can't solve.
Next up — Chapter 10: Multilingual and Low-Resource Speech. How transfer learning stretches coverage across thousands of languages, where the performance gaps persist, and why supporting a language well is as much a social question as a technical one.