Chapter 25: Cost Models — Designing Voice AI That's Sustainable

Voice and AI, Chapter 25: where voice AI costs come from — compute that scales with audio duration, real-time vs. offline, storage and bandwidth, the cost of quality, and why cost awareness makes better design.

Last updated on: Sho Shimoda

This is Part 25 of a series walking through my book Voice and AI. In the previous chapter, platform choice turned out to be a product decision. This chapter tackles a factor that is discussed less often but is just as decisive: cost. Every second of audio, every inference, every real-time guarantee has a price.


The goal here isn't exact numbers — it's intuition about where costs come from, how they scale, and how design decisions affect sustainability.

Why Voice Has Unique Cost Characteristics

Audio is time-based, so processing cost scales with duration, not just complexity: a ten-second utterance costs roughly ten times as much to process as a one-second one, regardless of content. Voice systems also demand real-time or near-real-time performance, which limits batching and raises compute cost. And because the pipeline has multiple stages, ASR, language models, and TTS all contribute. Compute is the largest cost driver in most systems: ASR scales with audio length, TTS with output length, and language models with token count and context size. GPUs boost performance but cost more; CPUs cost less but add latency. Deciding where to spend compute is a strategic choice.
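To make the scaling concrete, here is a minimal sketch of a per-turn cost estimate. The unit prices, the `StageRates` class, and the `turn_cost` function are all hypothetical placeholders chosen for illustration, not figures from the book or from any provider. The point is only the shape: each stage is billed against a different quantity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageRates:
    """Illustrative unit prices. Every number here is a made-up placeholder."""
    asr_per_audio_second: float = 0.0004  # ASR scales with input audio length
    llm_per_token: float = 0.000002       # language model scales with token count
    tts_per_character: float = 0.000015   # TTS scales with output length


def turn_cost(audio_seconds: float, llm_tokens: int, tts_characters: int,
              rates: StageRates = StageRates()) -> dict[str, float]:
    """Return the per-stage and total cost of one conversational turn."""
    costs = {
        "asr": audio_seconds * rates.asr_per_audio_second,
        "llm": llm_tokens * rates.llm_per_token,
        "tts": tts_characters * rates.tts_per_character,
    }
    costs["total"] = sum(costs.values())
    return costs


# Same reply, different input length: only the ASR share changes.
short = turn_cost(audio_seconds=1, llm_tokens=300, tts_characters=120)
long = turn_cost(audio_seconds=10, llm_tokens=300, tts_characters=120)
print(long["asr"] / short["asr"])  # about 10: ASR cost grows linearly with duration
print(short["total"], long["total"])
```

Breaking the total into stages like this shows which stage dominates for a given use case, which is usually the first thing to know before optimizing anything.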

Key idea: Real-time processing is more expensive than offline processing because capacity must be available immediately, even at peak. Resources are provisioned for the worst case, so they sit underutilized the rest of the time. Moving non-critical tasks out of the real-time path can cut cost dramatically.
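The underutilization effect can be sketched with simple arithmetic. The traffic profile and the per-instance capacity below are invented for illustration, and the offline figure assumes the work can be deferred and spread evenly across the window, which real workloads only approximate.

```python
import math


def instances_needed(concurrent_streams: float, streams_per_instance: int) -> int:
    """Smallest whole number of instances that can serve the given load."""
    return math.ceil(concurrent_streams / streams_per_instance)


def realtime_utilization(hourly_streams: list[int], streams_per_instance: int) -> float:
    """Fraction of provisioned capacity actually used when sized for the peak hour."""
    peak_instances = instances_needed(max(hourly_streams), streams_per_instance)
    provisioned = peak_instances * streams_per_instance * len(hourly_streams)
    return sum(hourly_streams) / provisioned


# Hypothetical concurrent streams per hour, with one sharp peak.
hourly_streams = [5, 5, 10, 40, 100, 40]
streams_per_instance = 10  # assumed capacity of one instance

# Real-time: capacity must cover the busiest hour at all times.
realtime_instances = instances_needed(max(hourly_streams), streams_per_instance)

# Offline: the same work, smoothed over the window, only needs the average.
average_load = sum(hourly_streams) / len(hourly_streams)
offline_instances = instances_needed(average_load, streams_per_instance)

print(realtime_instances, offline_instances)
print(realtime_utilization(hourly_streams, streams_per_instance))
```

With this profile the real-time fleet is several times larger than the offline one and spends most of the window idle, which is the gap that moving work off the real-time path recovers.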

Storage, Bandwidth, and the Cost of Quality

Voice systems generate large volumes of data — raw audio, intermediate features, and logs eat storage, while audio streams consume bandwidth. Retention policies matter, because storing everything indefinitely is expensive and usually unnecessary, and compression trades cost against quality and debuggability. Quality itself costs: larger models, higher sampling rates, and more expressive TTS all raise compute and bandwidth, and the relationship isn't linear — small quality gains can demand large cost increases. The discipline is deciding what level of quality the use case actually requires.
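Storage volume for raw audio follows directly from the recording format: uncompressed PCM takes sample rate times bytes per sample times channels, for every second recorded. The sketch below applies that formula to a retention policy. The function names and the daily volume are illustrative assumptions, and compressed formats would come out smaller.

```python
def pcm_bytes(seconds: float, sample_rate_hz: int = 16_000,
              bit_depth: int = 16, channels: int = 1) -> int:
    """Size of uncompressed PCM audio in bytes."""
    return int(seconds * sample_rate_hz * (bit_depth // 8) * channels)


def retained_gigabytes(audio_hours_per_day: float, retention_days: int,
                       **audio_format: int) -> float:
    """Steady-state storage (in GB, 10^9 bytes) once the retention window is full."""
    bytes_per_day = pcm_bytes(audio_hours_per_day * 3600, **audio_format)
    return bytes_per_day * retention_days / 1e9


# One hour of 16 kHz, 16-bit mono audio.
print(pcm_bytes(3600))  # 115200000 bytes, about 115 MB

# Hypothetical service recording 1,000 hours a day.
print(retained_gigabytes(1000, retention_days=30))
print(retained_gigabytes(1000, retention_days=365))

# Tripling the sampling rate triples storage for the same audio.
print(pcm_bytes(3600, sample_rate_hz=48_000) / pcm_bytes(3600))
```

Both retention length and sampling rate act as straight multipliers, so each is worth setting deliberately rather than by default.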

Important: Cost scales with usage but not smoothly — bursty traffic raises peak capacity needs and idle capacity still costs money. If the cost structure doesn't match the revenue model (per-minute, subscription, tiers), the product becomes unsustainable.
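One way to check whether cost structure and revenue model line up is to compute the margin per billed minute, spreading the cost of provisioned capacity over the minutes actually sold. This is a rough sketch under a per-minute pricing assumption; the function and all of the prices and volumes are hypothetical.

```python
def margin_per_minute(price_per_minute: float, billed_minutes: float,
                      provisioned_cost: float, variable_cost_per_minute: float) -> float:
    """Revenue minus cost for one billed minute.

    provisioned_cost is what the reserved capacity costs for the period,
    whether or not it is used, so idle capacity raises the per-minute cost.
    """
    if billed_minutes <= 0:
        raise ValueError("billed_minutes must be positive")
    cost = provisioned_cost / billed_minutes + variable_cost_per_minute
    return price_per_minute - cost


# Same price and same provisioned capacity, two usage levels (all values invented).
busy_month = margin_per_minute(price_per_minute=0.05, billed_minutes=200_000,
                               provisioned_cost=3_000, variable_cost_per_minute=0.01)
quiet_month = margin_per_minute(price_per_minute=0.05, billed_minutes=50_000,
                                provisioned_cost=3_000, variable_cost_per_minute=0.01)

print(busy_month)   # positive: capacity cost is spread over many minutes
print(quiet_month)  # negative: idle capacity outweighs the per-minute price
```

The same calculation can be repeated for a subscription or tiered model by replacing the per-minute price with the effective revenue per minute of usage.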

Controlling Cost

Several levers help: model optimization reduces compute, caching avoids repeated work, hybrid architectures shift load to cheaper environments, and feature prioritization reserves expensive operations for cases where they add real value. Cost-aware design is an ongoing process, not a one-time decision — and far from being only a limitation, cost constraints guide design: they force clarity about what matters and reward efficiency and simplicity. Systems built with cost awareness are often more robust and maintainable as a result.
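Of these levers, caching is the simplest to illustrate. The sketch below caches synthesized audio by voice and text so that a repeated prompt is only synthesized once. `SynthesisCache` and `fake_tts` are hypothetical stand-ins rather than any real TTS API, and a production cache would also need a size limit and an eviction policy.

```python
import hashlib
from typing import Callable


class SynthesisCache:
    """In-memory cache for synthesized audio, keyed by voice and text."""

    def __init__(self, synthesize: Callable[[str, str], bytes]) -> None:
        self._synthesize = synthesize      # the expensive call being avoided
        self._store: dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text: str, voice: str) -> str:
        # Collapse whitespace only; case is kept because it can affect pronunciation.
        normalized = " ".join(text.split())
        return hashlib.sha256(f"{voice}\x00{normalized}".encode("utf-8")).hexdigest()

    def get(self, text: str, voice: str) -> bytes:
        key = self._key(text, voice)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._synthesize(text, voice)
        return self._store[key]


calls: list[str] = []


def fake_tts(text: str, voice: str) -> bytes:
    """Stand-in for a real synthesis call; records each invocation."""
    calls.append(text)
    return f"{voice}:{text}".encode("utf-8")


cache = SynthesisCache(fake_tts)
first = cache.get("Your order has shipped.", "voice-a")
second = cache.get("Your  order has shipped.", "voice-a")  # extra space, same key
other = cache.get("Your order has shipped.", "voice-b")    # different voice, new entry

print(cache.hits, cache.misses, len(calls))
```

Caching pays off most for fixed prompts such as greetings and confirmations, and much less for replies that are unique to each user.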

What Chapter 25 Sets Up

That closes Part VII — how voice AI scales, how platforms shape capability, and how cost constrains design. Now we shift from infrastructure to experience: the choices that determine whether people actually accept and trust voice in their daily lives.


Next up — Chapter 26: Designing for Voice. Part VIII opens with voice UX as its own discipline — why an interface you can't see, that exists only in time, demands a fundamentally different set of design principles.

Want the full picture? Grab Voice and AI here for the complete treatment of voice AI economics.