ElevenLabs Eleven v3 has moved voice cloning from novelty to production tool. I use it on every commercial project that involves narration, corporate training, or multilingual delivery. Here's the complete workflow.
The Source Recording
The single biggest determinant of clone quality is the source recording. ElevenLabs requires a minimum of 1 minute of audio. I use 3–5 minutes on every professional clone. Here's why and how.
What to record:
- Varied sentence structures (declarative, question, exclamatory)
- Different emotional registers (neutral, warm, authoritative, conversational)
- Fast and slow delivery sections
- Words with complex phonemes — "particularly," "specifically," "extraordinary"
Recording conditions:
- Treated room or professional booth — no reverb, no background noise
- Condenser microphone at 24-bit/48 kHz minimum
- Consistent distance — don't move from the mic
- No audible breaths between sentences — edit them out before uploading
What kills clone quality:
- Background noise (even low-level HVAC)
- Room reverb — the output sounds muffled even after cloning
- Phone/laptop mic recordings
- Inconsistent distance causing volume variation
The Script Format That Works
ElevenLabs reads your text and decides how to deliver it. You can guide it with formatting:
Punctuation controls pacing:
- Full stops create natural pauses
- Commas create shorter pauses
- Em dashes — like this — create a mid-thought pause with slight emphasis
- Ellipsis... creates a trailing, uncertain pause
Capitalization for emphasis:
- ALL CAPS on a word creates heavy stress
- Avoid it for more than one word per sentence — it sounds artificial
Emotional tags (v3 feature): Use ElevenLabs' audio tags for explicit emotional direction:
[happy] Great news on that front.
[serious] This part is important.
[whispers] Just between us...
These are available in Eleven v3 and significantly improve performance on scripts that require emotional range.
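The formatting rules above are mechanical enough to script. A minimal Python sketch — the helper names are my own, and the square-bracket tag syntax should be checked against the current v3 prompting guide, since tag vocabulary is free-form:

```python
# Helpers for assembling a v3-formatted script line by line.

def tagged(tag: str, text: str) -> str:
    """Prefix a line with a v3 audio tag, e.g. [whispers] or [serious]."""
    return f"[{tag}] {text}"

def emphasize(sentence: str, word: str) -> str:
    """Uppercase exactly one word for heavy stress.
    More than one per sentence sounds artificial."""
    return sentence.replace(word, word.upper(), 1)

script = "\n".join([
    tagged("serious", "This part is important."),
    emphasize("We ship this week.", "this"),
])
```

Building the script programmatically also makes it easy to keep one canonical text and vary only the tags between takes.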
Clone Settings
In the ElevenLabs voice settings panel:
Stability: Lower stability (0.3–0.5) produces more natural variation, better for conversational tone. Higher stability (0.7–0.9) is more consistent, better for long-form narration.
Similarity: Keep at 0.75–0.85 for production use. Very high similarity can introduce artifacts.
Style exaggeration: 0 for neutral delivery. Increase carefully — it amplifies the expressive patterns in the original recording, which can over-stylise.
Speaker boost: On for voice clones that are thin or lack presence. Off for naturally warm voices.
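The two profiles above translate directly into API payloads. A sketch, assuming the field names (`stability`, `similarity_boost`, `style`, `use_speaker_boost`) of the ElevenLabs REST API's `voice_settings` object — verify against the current API reference before shipping:

```python
# Preset voice_settings payloads matching the values discussed above.

def voice_settings(profile: str) -> dict:
    """Return a voice_settings dict for a given delivery profile."""
    presets = {
        # Lower stability -> more natural variation, conversational tone
        "conversational": {"stability": 0.4, "similarity_boost": 0.8,
                           "style": 0.0, "use_speaker_boost": True},
        # Higher stability -> consistent delivery for long-form narration
        "narration": {"stability": 0.8, "similarity_boost": 0.8,
                      "style": 0.0, "use_speaker_boost": False},
    }
    return presets[profile]
```

Keeping presets in one place stops the settings from drifting between takes on the same project.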
Multilingual Delivery
Eleven v3 handles 70+ languages with the same cloned voice. The quality varies by language family:
- European Latin languages (French, Spanish, Italian, Portuguese): Excellent — natural cadence, correct phoneme mapping
- Germanic languages (German, Dutch, Swedish): Very good
- Eastern European (Polish, Czech, Romanian): Good — slight accent in complex words
- East Asian languages (Japanese, Korean, Mandarin): Functional for corporate use, not yet native-grade
For multilingual campaign work, always have a native speaker review before delivery. Not for the voice quality — for the script itself. ElevenLabs will speak whatever you write, including grammatically wrong constructions.
Integration with HeyGen
The workflow for AI presenter videos:
- Clone voice in ElevenLabs, export as WAV
- Import to HeyGen as custom voice
- Apply to avatar of choice
- HeyGen syncs lip movement to ElevenLabs audio
This produces better lip sync quality than HeyGen's own voice generation for English-language content, because ElevenLabs' prosody is more natural. For multilingual content, use HeyGen's built-in translation pipeline — it handles language-specific lip sync better.
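Step 1 of that pipeline can be sketched as code. This builds the text-to-speech request only; the endpoint shape and field names follow the public ElevenLabs REST API, but the `model_id` string, key, and voice ID are placeholders — confirm all three in the current docs. Note the API returns MP3 by default, so convert locally to WAV before the HeyGen import:

```python
API_KEY = "your-xi-api-key"        # placeholder
VOICE_ID = "your-clone-voice-id"   # placeholder

def build_tts_request(text: str) -> dict:
    """Assemble the request for the ElevenLabs text-to-speech endpoint."""
    return {
        "url": f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        "headers": {"xi-api-key": API_KEY,
                    "Content-Type": "application/json"},
        "json": {
            "text": text,
            "model_id": "eleven_v3",  # assumed v3 model identifier
            "voice_settings": {"stability": 0.8, "similarity_boost": 0.8},
        },
    }

# Actual generation (requires the requests package and a live key):
# import requests
# r = requests.post(**build_tts_request("Script text here."))
# open("narration.mp3", "wb").write(r.content)
```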
Cost Management
ElevenLabs bills by character. A 60-second script at average speaking pace is approximately 900–1,100 characters. At Creator tier rates, this is a few cents.
Where cost accumulates: iteration. Running 20 variations of a 1,000-character script in different emotional registers adds up. Write the script once, correctly, then generate. The prompt engineering discipline that applies to video applies equally here.
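The iteration math is worth making concrete. A sketch with a hypothetical per-1,000-character rate — pull your tier's actual rate from your ElevenLabs plan:

```python
def estimate_cost(script: str, usd_per_1k_chars: float) -> float:
    """Rough cost of one generation, billed by character count."""
    return len(script) / 1000 * usd_per_1k_chars

# One take of a 1,000-character script at a hypothetical $0.20/1k chars,
# versus 20 iterations of the same script.
one_take = estimate_cost("x" * 1000, usd_per_1k_chars=0.20)
twenty_takes = 20 * one_take  # iteration, not length, drives the bill
```

At those assumed numbers, one take is cents while twenty takes is measured in dollars — which is the argument for writing the script once, correctly, before generating.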
What v3 Still Can't Do
- Heavy regional accents: The clone will soften the source accent. If a strong Irish, Scottish, or regional American accent is required, the source recording needs to be very long (10+ minutes) and the accent very consistent.
- Singing: ElevenLabs has a separate music product. The voice clone tool is not for singing.
- Real-time: The API has a streaming mode but it's not production-ready for latency-sensitive applications.
The Bottom Line
A well-recorded source voice + a well-formatted script + correct v3 settings produces output that passes broadcast review. I've delivered ElevenLabs voice to clients who had no idea it was AI-generated. The pipeline is mature — the skill is in the setup.