This is Part 21 of a series walking through my book Voice and AI. In the previous chapter, we assembled the pipeline. But a system that recognizes speech and generates responses is not automatically conversational.
Conversation isn't just an exchange of words — it's a coordinated activity. Timing, turn-taking, interruption, and repair matter as much as content. Humans manage these signals effortlessly; voice AI has to approximate them. And conversation is a system-level problem, not any one model's responsibility.
Timing, Turn-Taking, and Silence
In text, timing is forgiving — a pause between messages rarely feels awkward. In voice, timing is everything: respond too fast and the system feels intrusive, too slow and it feels unresponsive, with even a few hundred milliseconds shifting perception. Humans manage turns through subtle, ambiguous cues — pauses, intonation, breathing — and a voice system must infer turn boundaries from incomplete signals. Silence might mean the user is thinking, finished, or that the microphone failed. Misreading silence is one of the most common sources of frustration in voice interfaces.
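One simple way to approximate a turn boundary is to wait for a run of trailing silence after speech has been heard. The sketch below assumes an upstream voice-activity detector already labels each audio frame as speech or not; the function name `find_end_of_turn`, the 20 ms frame size, and the 700 ms threshold are illustrative assumptions, not values from the book.

```python
from typing import Iterable, Optional


def find_end_of_turn(
    frames: Iterable[bool],
    frame_ms: int = 20,          # assumed frame duration
    min_silence_ms: int = 700,   # assumed silence needed to call the turn over
) -> Optional[int]:
    """Return the time in ms at which the turn is judged finished, else None.

    `frames` holds one boolean per audio frame: True means speech was detected.
    """
    heard_speech = False
    silence_ms = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            # Any speech resets the silence timer, so a short pause
            # in the middle of a sentence does not end the turn.
            heard_speech = True
            silence_ms = 0
        elif heard_speech:
            # Only count silence that follows speech; silence before
            # the user has said anything is not an end of turn.
            silence_ms += frame_ms
            if silence_ms >= min_silence_ms:
                return (i + 1) * frame_ms
    return None
```

A fixed threshold like this is exactly the trade-off described above: lower it and the system cuts in on users who are thinking, raise it and every reply feels slow. It also cannot tell a finished turn from a failed microphone, which is why production systems tend to combine silence with other cues.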
Interruptions, Backchannels, and Repair
In natural conversation people interrupt to correct, clarify, or hurry things along, so a conversational system must support barge-in — stopping its own speech to listen immediately — which requires interruptible TTS, fast-resuming ASR, and decisions about partial utterances. Humans also give feedback while listening ("uh-huh," "I see"); most voice AI lacks true backchanneling, and adding lightweight acknowledgments well requires real context awareness. And since misunderstandings are inevitable, systems must detect when understanding is uncertain and ask for clarification rather than guess.
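Barge-in and repair can be pictured as a small controller sitting between the components. The following is a minimal sketch under stated assumptions: `TurnController`, its method names, and the 0.6 confidence threshold are hypothetical, and the `log` list stands in for real calls to a TTS engine.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List


class State(Enum):
    LISTENING = auto()
    SPEAKING = auto()


@dataclass
class TurnController:
    clarify_below: float = 0.6  # assumed ASR confidence threshold
    state: State = State.LISTENING
    log: List[str] = field(default_factory=list)  # stand-in for TTS commands

    def start_speaking(self, text: str) -> None:
        self.state = State.SPEAKING
        self.log.append(f"tts_start:{text}")

    def on_user_speech_detected(self) -> None:
        # Barge-in: if the user talks while the system is speaking,
        # stop playback and go back to listening.
        if self.state is State.SPEAKING:
            self.log.append("tts_stop")
            self.state = State.LISTENING

    def on_transcript(self, text: str, confidence: float) -> str:
        # Repair: ask for confirmation instead of guessing
        # when recognition confidence is low.
        if confidence < self.clarify_below:
            return f'Sorry, did you say "{text}"?'
        return f"OK: {text}"
```

For example, calling `start_speaking("hello")` followed by `on_user_speech_detected()` leaves the controller in `State.LISTENING` with a `tts_stop` entry in the log. What to do with the partial utterance the system was in the middle of speaking is left out here and remains a design decision.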
Overlap, Memory, and Personality
Real conversations are messy: people talk over each other, noise interferes, multiple speakers appear, and the system must decide whose speech to attend to and when to defer. It must also maintain dialogue state (what has been said, agreed, and left unresolved) in a way that stays robust to recognition errors and interruptions; dialogue state is less a data structure than an evolving hypothesis about shared understanding. And conversational behavior is personality: verbosity, interruption habits, and how mistakes are handled all define the system's character.
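To make the "evolving hypothesis" idea concrete, here is one possible sketch in which every remembered value carries a confidence rather than being treated as fact. The class `DialogueState`, the slot names, the 0.7 threshold, and the "keep the more confident value" rule are all assumptions for illustration, not the book's design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DialogueState:
    # slot name -> (current best value, confidence in that value)
    slots: Dict[str, Tuple[str, float]] = field(default_factory=dict)
    confirm_below: float = 0.7  # assumed threshold for "still unresolved"

    def observe(self, slot: str, value: str, confidence: float) -> None:
        # Assumed policy: new evidence replaces the old value only if it is
        # at least as confident, so a noisy misrecognition cannot overwrite
        # something the system was already fairly sure about.
        current = self.slots.get(slot)
        if current is None or confidence >= current[1]:
            self.slots[slot] = (value, confidence)

    def confirm(self, slot: str) -> None:
        # The user explicitly agreed, so treat the value as settled.
        value, _ = self.slots[slot]
        self.slots[slot] = (value, 1.0)

    def unresolved(self) -> List[str]:
        # Slots the system should still verify before acting on them.
        return sorted(
            name for name, (_, conf) in self.slots.items()
            if conf < self.confirm_below
        )
```

A real system would need a richer update rule: an explicit correction from the user ("no, Boston") should probably win even when its recognition confidence is lower, which this sketch does not handle.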
What Chapter 21 Sets Up
No single component controls conversation — ASR affects timing and confidence, the language model affects intent and repair, TTS affects interruption and responsiveness. Conversation emerges from how they interact, which is why quality often improves through better integration rather than better models. But conversation doesn't always happen through voice alone.
Next up — Chapter 22: Multimodal Voice AI. When speech becomes one channel among text, visuals, touch, and context — shared representations, audio tokens, and the choreography of speaking while showing.