This is Part 4 of a series walking through my book Voice and AI. In the previous chapter, we closed Part I by tracing voice back to its biological source. Now Part II begins, and we cross the boundary between the physical and the digital: how does a living, continuous signal become numbers a machine can store and learn from?
Computers don't do continuity. Sound in the real world has a value at every instant, with infinitely many values between any two moments. Machines need discrete snapshots. Digitization is how we cross that gap — and although it's usually treated as a low-level technical step, it's one of the most consequential design choices in any voice system. What you capture here determines what's even possible later.
Sampling: Measuring Sound Over Time
Sampling measures the signal at regular intervals, like a strobe light freezing motion. Flash slowly and motion looks jerky; flash quickly and it looks smooth. The sampling rate is the number of measurements taken per second, expressed in hertz: 8,000 Hz, 16,000 Hz, or 44,100 Hz (CD quality). Most of the information in human speech sits below about 8,000 Hz, and a fundamental result, the Nyquist-Shannon sampling theorem, says the sampling rate must be at least twice the highest frequency you want to capture. Content above half the sampling rate doesn't just disappear; it folds back into the recording as a false lower frequency, an effect called aliasing. That's why many speech systems land on 16,000 Hz: it covers frequencies up to 8,000 Hz while keeping the data manageable.
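To make the Nyquist idea concrete, here is a minimal sketch in plain Python. The function names (`nyquist_limit`, `apparent_frequency`, `sample_tone`) are my own, chosen for illustration, and the example assumes an idealized pure tone with no anti-aliasing filter in front of the sampler. It shows that a tone above half the sampling rate produces exactly the same samples as a lower tone, so the two cannot be told apart once digitized.

```python
import math


def nyquist_limit(sample_rate_hz: float) -> float:
    """Highest frequency a given sampling rate can represent unambiguously."""
    return sample_rate_hz / 2.0


def apparent_frequency(tone_hz: float, sample_rate_hz: float) -> float:
    """Frequency a pure tone appears to have after sampling.

    Tones at or below the Nyquist limit come back unchanged; tones above
    it fold back (alias) into the range 0..sample_rate_hz / 2.
    """
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


def sample_tone(tone_hz: float, sample_rate_hz: float, n_samples: int) -> list[float]:
    """Take n_samples evenly spaced snapshots of an ideal cosine tone."""
    return [
        math.cos(2 * math.pi * tone_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]


if __name__ == "__main__":
    rate = 16_000
    print("Nyquist limit:", nyquist_limit(rate), "Hz")
    for tone in (3_000, 7_000, 10_000):
        print(tone, "Hz is recorded as", apparent_frequency(tone, rate), "Hz")

    # A 10 kHz tone and a 6 kHz tone yield the same snapshots at 16 kHz.
    high = sample_tone(10_000, rate, 32)
    low = sample_tone(6_000, rate, 32)
    print("Indistinguishable:", all(abs(a - b) < 1e-9 for a, b in zip(high, low)))
```

In practice, recording hardware and resampling libraries typically apply a low-pass filter before sampling to remove content above the limit, which is why aliasing is rarely audible in well-made recordings.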
Bit Depth and Quantization: How Precisely We Measure
If sampling decides when we measure, bit depth decides how precisely. Each sample records amplitude, and bit depth sets how many distinct values that amplitude can take: 8-bit allows 256 levels, 16-bit allows 65,536, and 24-bit allows about 16.8 million. Quantization is the step that maps continuous amplitude onto those discrete levels, and it's the first place information is unavoidably lost, because every measurement is rounded to the nearest level. The loss is usually imperceptible for speech, but it's never neutral: later stages can amplify or suppress those rounding artifacts, which is exactly why audio pipelines must be designed as a whole, not as isolated steps.
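A rough way to see what bit depth buys you is to quantize a test tone and measure how much rounding error creeps in. The sketch below is illustrative only: `quantize` and `snr_db` are hypothetical helper names, the signal is assumed to be a full-scale sine in the range -1.0 to 1.0, and the quantizer is a simple uniform rounding scheme rather than what any particular converter does.

```python
import math


def quantize(x: float, bits: int) -> float:
    """Round x (assumed in [-1.0, 1.0]) to the nearest signed integer code
    of the given bit depth, then scale it back to a float."""
    max_code = 2 ** (bits - 1) - 1      # e.g. 32767 for 16-bit
    min_code = -(2 ** (bits - 1))       # e.g. -32768 for 16-bit
    code = round(x * max_code)
    code = max(min_code, min(max_code, code))
    return code / max_code


def snr_db(signal: list[float], quantized: list[float]) -> float:
    """Signal-to-noise ratio in decibels, treating rounding error as noise."""
    signal_power = sum(s * s for s in signal)
    noise_power = sum((s - q) ** 2 for s, q in zip(signal, quantized))
    return 10 * math.log10(signal_power / noise_power)


if __name__ == "__main__":
    rate = 16_000
    # One second of a full-scale 440 Hz sine.
    tone = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
    for bits in (8, 16, 24):
        rounded = [quantize(s, bits) for s in tone]
        print(f"{bits}-bit: {2 ** bits:,} levels, SNR about {snr_db(tone, rounded):.1f} dB")
```

The familiar rule of thumb is that each extra bit adds roughly 6 dB of signal-to-noise ratio, so the jump from 8-bit to 16-bit is large, while the jump from 16-bit to 24-bit matters mostly when the recording is quiet or will be amplified heavily later.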
Channels, Formats, and the Real World
Most speech systems use mono, since speech doesn't inherently need stereo. Stereo adds spatial information that is useful for tasks like meeting transcription with speaker localization, but it also adds complexity. Once digitized, voice has to be stored in a format: WAV (typically uncompressed), FLAC (lossless compression), or MP3 and AAC (lossy compression). For training, uncompressed or lossless audio is usually preferred, because lossy compression discards information that may not matter to human ears but can matter to a machine. And none of this happens in a vacuum: microphone quality, room acoustics, and background noise all enter the signal before digitization, and once noise is captured it becomes part of the data.
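The choices so far (sampling rate, bit depth, channel count) end up written into the file itself. As a small sketch, the code below uses Python's standard `wave` module to write one second of a synthetic tone as a mono, 16-bit, 16,000 Hz WAV held in memory, then reads the header back. The helper names and the 220 Hz test tone are my own assumptions for illustration, not a recommended pipeline.

```python
import io
import math
import struct
import wave


def tone_as_wav_bytes(freq_hz: float = 220.0, seconds: float = 1.0,
                      sample_rate_hz: int = 16_000) -> bytes:
    """Render a half-scale sine tone as a mono 16-bit PCM WAV file in memory."""
    n_samples = int(seconds * sample_rate_hz)
    frames = b"".join(
        # "<h" packs one little-endian signed 16-bit integer per sample.
        struct.pack("<h", round(0.5 * 32767 * math.sin(
            2 * math.pi * freq_hz * i / sample_rate_hz)))
        for i in range(n_samples)
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)           # mono
        writer.setsampwidth(2)           # 2 bytes per sample = 16-bit
        writer.setframerate(sample_rate_hz)
        writer.writeframes(frames)
    return buffer.getvalue()


def describe_wav(data: bytes) -> dict:
    """Read the basic recording parameters back out of WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as reader:
        return {
            "channels": reader.getnchannels(),
            "bit_depth": 8 * reader.getsampwidth(),
            "sample_rate_hz": reader.getframerate(),
            "frames": reader.getnframes(),
        }


def pcm_bytes_per_second(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Storage cost of uncompressed PCM audio, excluding the file header."""
    return sample_rate_hz * (bit_depth // 8) * channels


if __name__ == "__main__":
    wav_bytes = tone_as_wav_bytes()
    print(describe_wav(wav_bytes))
    print("Speech setup:", pcm_bytes_per_second(16_000, 16, 1), "bytes per second")
    print("CD stereo:", pcm_bytes_per_second(44_100, 16, 2), "bytes per second")
```

The arithmetic in `pcm_bytes_per_second` is the practical reason speech systems favor 16,000 Hz mono: uncompressed storage scales directly with each of the three parameters.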
What Chapter 4 Sets Up
Voice is now a sequence of numbers representing pressure over time. But raw waveforms are dense, high-dimensional, and hard to reason about directly — they aren't how machines usually think about sound. The next step is extracting structure.
Next up — Chapter 5: Signal Processing Fundamentals. We move from the time domain into frequency, build intuition for the Fourier Transform and short-time analysis, and meet the spectrogram — the representation that lets both engineers and models actually "see" sound.