This is Part 17 of a series walking through my book Voice and AI. In the previous chapter, we saw that cloning rests on one idea: voice identity can be represented separately from the words. Speaker embeddings are the mechanism that makes that possible.
Earlier in the book we treated voice as a signal carrying content and identity at the same time. Speaker embeddings disentangle the two — instead of "this sound at this moment," they encode "this is who is speaking," abstracted from what's being said. That abstraction is what lets modern systems generalize across any text while preserving a voice.
What an Embedding Is
An embedding is a compact numerical representation of something complex: words in NLP, images in vision, and here, voice identity. Speaker embeddings are typically fixed-length vectors whose individual values have no human-readable meaning; what matters is their position relative to others. Voices that sound similar land close together, and voices that sound different land far apart. They're learned from data: a model trains on many speakers and learns to map speech to vectors so that segments from the same speaker cluster and segments from different speakers separate. Training uses either a speaker-classification objective, as in the d-vector and x-vector systems below, or a metric-learning loss that optimizes distances between embeddings directly. The result is a space where identity is encoded geometrically.
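To make "encoded geometrically" concrete, here is a minimal sketch of how closeness between two embeddings is commonly measured with cosine similarity. The four-dimensional vectors, the speaker names, and the 0.7 threshold are all made up for illustration; real embeddings typically have hundreds of dimensions, and a real threshold would be tuned on held-out data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def same_speaker(a, b, threshold=0.7):
    """Hypothetical verification rule; the threshold is an illustrative assumption."""
    return cosine_similarity(a, b) >= threshold

# Invented toy embeddings, not output from any real model.
alice_clip_1 = [0.9, 0.1, 0.3, 0.2]
alice_clip_2 = [0.8, 0.2, 0.35, 0.15]
bob_clip = [0.1, 0.9, -0.2, 0.4]

same = cosine_similarity(alice_clip_1, alice_clip_2)
diff = cosine_similarity(alice_clip_1, bob_clip)

print(f"alice vs alice: {same:.3f}")
print(f"alice vs bob:   {diff:.3f}")
print("same speaker?", same_speaker(alice_clip_1, alice_clip_2))
```

Cosine similarity ignores vector length and compares direction only, which is one reason it is a common choice for comparing speaker embeddings.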
d-Vectors and x-Vectors
The d-vector was an early neural approach: a network trained to discriminate between speakers, where frame-level activations from a hidden layer are averaged at inference to produce an embedding. It showed that neural networks could capture identity compactly, enabling early speaker verification and adaptation. x-vectors improved on this by building temporal aggregation into the network itself: a statistics pooling layer summarizes frame-level features across the whole segment, so the model is trained on segments rather than individual frames. They represent variable-length segments more effectively and, trained with augmented data, proved more robust to noise and variability, which is why they became widespread in speaker recognition and influenced cloning architectures. The core idea never changed: learn a representation of who is speaking, not what.
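The difference between the two aggregation styles can be sketched in a few lines. This is a simplified illustration, not either system's actual implementation: the frame vectors are invented, and in a real x-vector network the pooled statistics feed further layers before the embedding is read out.

```python
import math

def mean_pool(frames):
    """Average frame-level vectors into one vector (d-vector style averaging)."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / n for i in range(dim)]

def stats_pool(frames):
    """Concatenate per-dimension mean and standard deviation
    (the idea behind x-vector statistics pooling)."""
    n = len(frames)
    mean = mean_pool(frames)
    std = [
        math.sqrt(sum((f[i] - mean[i]) ** 2 for f in frames) / n)
        for i in range(len(mean))
    ]
    return mean + std

# Three invented 2-dimensional frame activations.
frames = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]

pooled_mean = mean_pool(frames)    # one value per dimension
pooled_stats = stats_pool(frames)  # means followed by standard deviations

print(pooled_mean)
print(pooled_stats)
```

Both functions turn any number of frames into a fixed-length vector. Statistics pooling also keeps how much each dimension varied over the segment, which plain averaging discards.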
Few-Shot, Zero-Shot, and Drift
Embeddings power few-shot and zero-shot cloning: a system can average embeddings from several samples for a more stable estimate, or work from a single sample. Short samples raise uncertainty because they can capture noise or transient traits instead of stable identity, which helps explain why some cloned voices sound inconsistent. And because embeddings are statistical summaries, they shift when the input shifts: background noise, emotion, and speaking style all move them. The result can be drift, where a voice sounds accurate at first and then gradually loses resemblance. Producing stable embeddings across conditions is an active research problem.
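One way to picture both ideas is to average several enrollment embeddings into a reference, then score later segments against it. The sketch below assumes that setup. The vectors, the 0.9 drift threshold, and the names `enrollment` and `generated` are hypothetical, and real systems would obtain these embeddings from a trained speaker encoder.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so only its direction matters."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def centroid(embeddings):
    """Average unit-length embeddings, then renormalize the result."""
    normed = [l2_normalize(e) for e in embeddings]
    dim = len(normed[0])
    mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)

def cosine(a, b):
    """Cosine similarity computed on normalized copies of both vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

# Invented embeddings from three enrollment samples of one speaker.
enrollment = [[0.9, 0.1, 0.3], [0.8, 0.2, 0.4], [1.0, 0.0, 0.2]]
reference = centroid(enrollment)

# Invented embeddings of successive generated segments, moving away over time.
generated = [[0.88, 0.12, 0.31], [0.7, 0.4, 0.3], [0.4, 0.8, 0.2]]
scores = [cosine(reference, g) for g in generated]

DRIFT_THRESHOLD = 0.9  # illustrative assumption, not a standard value
drifted = [s < DRIFT_THRESHOLD for s in scores]

for i, (s, d) in enumerate(zip(scores, drifted)):
    print(f"segment {i}: similarity {s:.3f}, drifted={d}")
```

Normalizing before averaging keeps any single sample from dominating the reference just because its vector is longer. Tracking similarity segment by segment is one simple way to notice drift before a listener does.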
What Chapter 17 Sets Up
Now that we understand how identity is represented internally, we can look at how those representations are used to build real voices.
Next up — Chapter 18: Building Custom Voices. The practical side — data requirements, scripted vs. natural speech, fine-tuning vs. conditioning, prompting, evaluation, and the maintenance cost most teams underestimate.