3. Creating Speech

Once you’ve created a voice and entered your text in the Studio Dashboard, you’re ready to generate speech. Clone Voice Translator produces clear, natural audio in your cloned voice that you can play, download, or reuse.

3.1 Fine-Tuning Your Voice

Each saved voice has adjustable controls under Your Saved Clones in the Studio Dashboard. These controls shape how the voice sounds. Adjust them, click Save, then generate:

Stability: How consistent the delivery is. Higher values are steadier; lower values are more varied and expressive.
Similarity: How closely the output matches your original recording. Very high values can amplify recording artifacts.
Style: How much expressiveness and emphasis is applied to the speech.
Speaker Boost: An on/off option that enhances the clarity and presence of the voice.

Use the Preview button on a voice to hear a quick sample of the current settings before generating a full clip.

3.2 Generating the Audio

Select your voice, enter your text, and click Generate Audio. Clone Voice Translator converts your text into a high-quality recording, usually within a few seconds; longer text takes more time.

When it’s ready, the audio appears in the player on the page. You can play it there, and it is also saved to your Audio Library for download and reuse.

Tip: If you are testing variations, generate a few versions and compare them in your Audio Library before settling on one.

3.3 Editing and Re-Generating

You can freely edit the text and generate again as many times as needed. Each generation produces a new clip, which makes it easy to refine pronunciation or timing.

If your text includes unusual words or names, try spelling them phonetically or adding punctuation to adjust pauses.
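
If the same names recur across many scripts, the phonetic substitution can be done before pasting the text into the Studio Dashboard. The following is a minimal Python sketch, not a product feature: the function name apply_respellings and the respellings in the dictionary are hypothetical, and the right spelling for any given name is whatever sounds correct when you generate it.

```python
import re

# Hypothetical respellings; replace them with whatever sounds right in your own tests.
RESPELLINGS = {
    "Nguyen": "Nwin",
    "Siobhan": "Shiv-awn",
}

def apply_respellings(text: str, respellings: dict[str, str]) -> str:
    """Replace whole-word, case-sensitive matches with their phonetic respellings."""
    if not respellings:
        return text
    # Longest entries first, so a multi-word entry wins over a shorter one it starts with.
    keys = sorted(respellings, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda match: respellings[match.group(1)], text)

script = "Please welcome Siobhan Nguyen to the stage."
print(apply_respellings(script, RESPELLINGS))
```

Keep the original script alongside the respelled copy, since the respelled text is meant only for generation and not for captions or transcripts.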

3.4 Downloading and Using Your Audio

Open your Audio Library to download a clip. Audio is provided as an MP3 file for easy use in videos, e-learning, or accessibility applications.

How you may use the audio depends on your plan’s license — see Subscriptions & Plans for commercial-use details.

Note: For long scripts or audiobooks, generate speech in smaller sections to keep the tone stable and speed up processing.
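
One way to prepare a long script is to split it at sentence boundaries, so that no section ends mid-sentence. The sketch below is an illustration under stated assumptions: the function name split_script is hypothetical, and the max_chars value is an arbitrary example, not a documented limit of Clone Voice Translator.

```python
import re

def split_script(text: str, max_chars: int = 1000) -> list[str]:
    """Group whole sentences into sections of at most max_chars characters.

    max_chars is an illustrative value, not a documented product limit.
    A single sentence longer than max_chars is kept whole in its own section.
    """
    # Split after ., !, or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    sections: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current section.
            sections.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        sections.append(current)
    return sections

script = "First sentence. Second sentence! Third one? Fourth."
for number, section in enumerate(split_script(script, max_chars=35), start=1):
    print(number, section)
```

Each printed section can then be pasted and generated as its own clip. Splitting on sentence punctuation is a simple heuristic; abbreviations such as "Dr." will also trigger a split, so review the sections before generating.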

3.5 Choosing a Voice Model (Pro & Business)

On the Pro and Business plans, the Generation Studio shows a Voice model selector:

Standard (v2) — the default. Fast and reliable across many languages. For pauses, type <break time="1.0s" /> (up to ~3 seconds; use sparingly) or use punctuation such as commas, periods, and ellipses.
Expressive (v3) — a more expressive model that understands audio tags you type in the text: pauses like [pause], delivery like [whispers] or [excited], and accents such as [American accent] or [British accent]. Place a tag where you want it to take effect (e.g. at the start of the text). These bracket tags work on v3 only.
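
For Standard (v2), the pause tag described above can be produced by a small helper when you assemble text outside the Studio. This is a hedged sketch: the names break_tag and join_with_pauses are hypothetical, and treating 3.0 seconds as a hard cap is an assumption based on the "up to ~3 seconds" guidance.

```python
MAX_BREAK_SECONDS = 3.0  # assumption: the "up to ~3 seconds" guidance treated as a cap

def break_tag(seconds: float) -> str:
    """Return a Standard (v2) pause tag, clamped to the range 0 to MAX_BREAK_SECONDS."""
    clamped = max(0.0, min(seconds, MAX_BREAK_SECONDS))
    return f'<break time="{clamped:.1f}s" />'

def join_with_pauses(parts: list[str], seconds: float = 1.0) -> str:
    """Join text fragments, inserting the same pause tag between each pair."""
    return f" {break_tag(seconds)} ".join(part.strip() for part in parts)

print(join_with_pauses(["Welcome back.", "Let's get started."], seconds=1.0))
# prints: Welcome back. <break time="1.0s" /> Let's get started.
```

The helper inserts a tag between every pair of fragments, so pass only the fragments that need a deliberate pause and rely on punctuation elsewhere, in keeping with the advice to use break tags sparingly.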

Free accounts always use Standard (v2). The same voice can sound a little different between models because expressiveness is handled differently — try both and keep whichever sounds best for your content. The selected model applies to the next clip you generate.