This is Part 20 of a series walking through my book Voice and AI. In the previous chapter, we closed the section on personal voices. Now Part VI zooms out: users never experience models, they experience systems — and voice AI is a system of components working together.
At a high level, voice AI looks simple: a user speaks, the system listens, it replies with speech. Behind that simplicity sits a pipeline of components, each with its own assumptions, delays, and failure modes. The seams between steps matter as much as the steps themselves.
The Canonical Pipeline
The common architecture is ASR → LLM → TTS: speech recognition turns audio into text, a language model interprets it and decides what to say, and text-to-speech turns the response back into audio. It powers assistants, call-center bots, and conversational agents. Each step is conceptually clear — but the joins are where the trouble lives. ASR doesn't emit clean, punctuation-rich text; it emits hypotheses, partial results, and confidence-weighted guesses, often missing capitalization, confusing homophones, or leaving numbers ambiguous. A language model that expects clean text must be adapted to noisy input.
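To make the seam between ASR and the language model concrete, here is a minimal sketch of passing recognizer uncertainty forward instead of hiding it. The `AsrHypothesis` type, the `build_llm_input` helper, and the prompt wording are all hypothetical names chosen for illustration; they do not correspond to any particular recognizer's API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AsrHypothesis:
    """One candidate transcript from a (hypothetical) recognizer."""
    text: str          # raw output, often lowercase and unpunctuated
    confidence: float  # 0.0 to 1.0, as reported by the recognizer
    is_final: bool     # False for partial results that may still change


def build_llm_input(hypotheses: list[AsrHypothesis], max_alternatives: int = 3) -> str:
    """Format final hypotheses as a prompt fragment that exposes ASR uncertainty."""
    # Partial results are still changing, so only finalized hypotheses are passed on.
    finals = [h for h in hypotheses if h.is_final]
    if not finals:
        return ""
    # Keep the most confident alternatives, best first.
    ranked = sorted(finals, key=lambda h: h.confidence, reverse=True)[:max_alternatives]
    lines = ["Transcript candidates from speech recognition (may contain errors):"]
    for rank, hyp in enumerate(ranked, start=1):
        lines.append(f"{rank}. [{hyp.confidence:.2f}] {hyp.text}")
    return "\n".join(lines)


if __name__ == "__main__":
    example = [
        AsrHypothesis("transfer fifty dollars to savings", 0.62, True),
        AsrHypothesis("transfer fifteen dollars to savings", 0.31, True),
        AsrHypothesis("transfer fif", 0.40, False),
    ]
    print(build_llm_input(example))
```

The design choice here is to hand the model several ranked candidates with their scores, so that a homophone or number ambiguity ("fifty" versus "fifteen") is visible downstream rather than silently resolved at the seam.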
Latency and Error Propagation
Latency isn't one number. It accumulates across ASR decoding, network delay, LLM inference, TTS generation, and playback. A system whose stages are each fast can still feel slow if they are poorly coordinated, and users are highly sensitive to conversational delay: even small pauses break the illusion of interaction. Designing for low latency therefore demands end-to-end optimization, not just faster models. That includes choosing between batch processing (more context and higher accuracy, but more latency) and streaming (more responsive, but less certain). The two are often combined in hybrids such as streaming ASR with delayed finalization.
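The point about accumulation can be shown with simple arithmetic. The sketch below assumes a strictly sequential pipeline; the per-stage numbers and the 500 ms budget are invented for illustration, not measurements, and the function names are hypothetical.

```python
from __future__ import annotations

# Illustrative per-stage latencies in milliseconds (made-up numbers).
STAGE_LATENCY_MS = {
    "asr_decoding": 150,
    "network": 80,
    "llm_inference": 300,
    "tts_generation": 120,
    "playback_start": 50,
}


def total_latency_ms(stages: dict[str, int]) -> int:
    """End-to-end delay when stages run strictly one after another."""
    return sum(stages.values())


def slowest_stage(stages: dict[str, int]) -> str:
    """Name of the stage contributing the most delay."""
    return max(stages, key=lambda name: stages[name])


def over_budget_ms(stages: dict[str, int], budget_ms: int) -> int:
    """How far the pipeline exceeds the budget (0 if it fits)."""
    return max(0, total_latency_ms(stages) - budget_ms)


if __name__ == "__main__":
    print("total:", total_latency_ms(STAGE_LATENCY_MS), "ms")
    print("slowest stage:", slowest_stage(STAGE_LATENCY_MS))
    print("over a 500 ms budget by:", over_budget_ms(STAGE_LATENCY_MS, 500), "ms")
```

No single stage here takes longer than 300 ms, yet the sequential total overshoots the assumed budget. Speeding up one model helps only as much as that stage's share of the sum, which is why streaming and overlapping stages matter.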
Context, the Language Model, and TTS as Interface
Voice interactions are rarely single-turn: users refer back, interrupt, and self-correct. Managing context across turns therefore requires tight coordination, which timing and overlapping speech complicate. The language model is the decision-making core, interpreting intent and managing dialogue state. But it operates on text, so spoken input must be mapped to structured representations and the model's output mapped back into conversational speech. The model must also be aware of its own uncertainty, since overconfident responses can be harmful. Finally, TTS is the user interface: however intelligent the system is internally, users judge it by how it sounds.
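One way to picture multi-turn context and uncertainty-aware behavior together is a small dialogue state that keeps recent turns and asks for confirmation when the last user turn was recognized with low confidence. This is a sketch under stated assumptions: `Turn`, `DialogueState`, `next_action`, the window of six turns, and the 0.6 threshold are all hypothetical choices, not a prescribed design.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Turn:
    speaker: str              # "user" or "assistant"
    text: str
    confidence: float = 1.0   # ASR confidence; meaningful for user turns only


@dataclass
class DialogueState:
    turns: list[Turn] = field(default_factory=list)
    max_turns: int = 6        # assumed context window, in turns

    def add(self, turn: Turn) -> None:
        """Append a turn and drop the oldest ones beyond the window."""
        self.turns.append(turn)
        self.turns = self.turns[-self.max_turns:]

    def context(self) -> str:
        """Render the retained turns as text for the language model."""
        return "\n".join(f"{t.speaker}: {t.text}" for t in self.turns)


def next_action(state: DialogueState, confirm_below: float = 0.6) -> str:
    """Choose 'wait', 'confirm', or 'respond' from the latest user turn."""
    last_user = next((t for t in reversed(state.turns) if t.speaker == "user"), None)
    if last_user is None:
        return "wait"
    # Low recognition confidence: ask the user to confirm rather than act.
    if last_user.confidence < confirm_below:
        return "confirm"
    return "respond"


if __name__ == "__main__":
    state = DialogueState()
    state.add(Turn("user", "book a table for two", 0.92))
    state.add(Turn("assistant", "For what time?"))
    state.add(Turn("user", "at ate", 0.41))
    print(state.context())
    print(next_action(state))
```

A real system would also need to handle interruptions and overlapping speech, which this sketch deliberately leaves out.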
What Chapter 20 Sets Up
The pipeline shapes turn-taking, error handling, and trust — whether an interaction feels smooth or frustrating. Good voice AI means thinking in pipelines, not components. With the architecture in view, we can look more closely at interaction itself.
Next up — Chapter 21: Conversational Voice AI. Turn-taking, interruptions, backchannels, and repair — why conversation is a timing problem and a system-level challenge, not something any single model can own.