ElevenLabs Eleven v3 has moved voice cloning from novelty to production tool. I use it on every commercial project that involves narration, corporate training, or multilingual delivery. Here's the complete workflow.
The Source Recording
The single biggest determinant of clone quality is the source recording. ElevenLabs requires a minimum of 1 minute of audio. I use 3–5 minutes on every professional clone. Here's why and how.
What to record:
- Varied sentence structures (declarative, question, exclamatory)
- Different emotional registers (neutral, warm, authoritative, conversational)
- Fast and slow delivery sections
- Words with complex phonemes — "particularly," "specifically," "extraordinary"
Recording conditions:
- Treated room or professional booth — no reverb, no background noise
- Condenser microphone at 24-bit/48 kHz minimum
- Consistent distance — don't move from the mic
- No audible breaths between sentences — edit them out before uploading
What kills clone quality:
- Background noise (even low-level HVAC)
- Room reverb — the output sounds muffled even after cloning
- Phone/laptop mic recordings
- Inconsistent distance causing volume variation
The Script Format That Works
ElevenLabs reads your text and decides how to deliver it. You can guide it with formatting:
Punctuation controls pacing:
- Full stops create natural pauses
- Commas create shorter pauses
- Em dashes — like this — create a mid-thought pause with slight emphasis
- Ellipsis... creates a trailing, uncertain pause
Capitalization for emphasis:
- ALL CAPS on a word creates heavy stress
- Avoid it for more than one word per sentence — it sounds artificial
Emotional tags (v3 feature): Use ElevenLabs' audio tags for explicit emotional direction:
[happy] Great news on that front.
[serious] This part is important.
[whispers] Just between us...
These are available in Eleven v3 and significantly improve performance on scripts that require emotional range.
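The formatting rules above are mechanical enough to script. A minimal Python sketch — the helper names are my own, and the square-bracket tag syntax should be checked against the current v3 prompting guide, since tag vocabulary is free-form:

```python
# Helpers for assembling a v3-formatted script line by line.

def tagged(tag: str, text: str) -> str:
    """Prefix a line with a v3 audio tag, e.g. [whispers] or [serious]."""
    return f"[{tag}] {text}"

def emphasize(sentence: str, word: str) -> str:
    """Uppercase exactly one word for heavy stress.
    More than one per sentence sounds artificial."""
    return sentence.replace(word, word.upper(), 1)

script = "\n".join([
    tagged("serious", "This part is important."),
    emphasize("We ship this week.", "this"),
])
```

Building the script programmatically also makes it easy to keep one canonical text and vary only the tags between takes.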
Clone Settings
In the ElevenLabs voice settings panel:
Stability: Lower stability (0.3–0.5) produces more natural variation, better for conversational tone. Higher stability (0.7–0.9) is more consistent, better for long-form narration.
Similarity: Keep at 0.75–0.85 for production use. Very high similarity can introduce artifacts.
Style exaggeration: 0 for neutral delivery. Increase carefully — it amplifies the expressive patterns in the original recording, which can over-stylise.
Speaker boost: On for voice clones that are thin or lack presence. Off for naturally warm voices.
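The two profiles above translate directly into API payloads. A sketch, assuming the field names (`stability`, `similarity_boost`, `style`, `use_speaker_boost`) of the ElevenLabs REST API's `voice_settings` object — verify against the current API reference before shipping:

```python
# Preset voice_settings payloads matching the values discussed above.

def voice_settings(profile: str) -> dict:
    """Return a voice_settings dict for a given delivery profile."""
    presets = {
        # Lower stability -> more natural variation, conversational tone
        "conversational": {"stability": 0.4, "similarity_boost": 0.8,
                           "style": 0.0, "use_speaker_boost": True},
        # Higher stability -> consistent delivery for long-form narration
        "narration": {"stability": 0.8, "similarity_boost": 0.8,
                      "style": 0.0, "use_speaker_boost": False},
    }
    return presets[profile]
```

Keeping presets in one place stops the settings from drifting between takes on the same project.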
Multilingual Delivery
Eleven v3 handles 70+ languages with the same cloned voice. The quality varies by language family:
- European Latin languages (French, Spanish, Italian, Portuguese): Excellent — natural cadence, correct phoneme mapping
- Germanic languages (German, Dutch, Swedish): Very good
- Eastern European (Polish, Czech, Romanian): Good — slight accent in complex words
- East Asian languages (Japanese, Korean, Mandarin): Functional for corporate use, not yet native-grade
For multilingual campaign work, always have a native speaker review before delivery. Not for the voice quality — for the script itself. ElevenLabs will speak whatever you write, including grammatically wrong constructions.
Integration with HeyGen
The workflow for AI presenter videos:
- Clone voice in ElevenLabs, export as WAV
- Import to HeyGen as custom voice
- Apply to avatar of choice
- HeyGen syncs lip movement to ElevenLabs audio
This produces better lip sync quality than HeyGen's own voice generation for English-language content, because ElevenLabs' prosody is more natural. For multilingual content, use HeyGen's built-in translation pipeline — it handles language-specific lip sync better.
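Step 1 of that pipeline can be sketched as code. This builds the text-to-speech request only; the endpoint shape and field names follow the public ElevenLabs REST API, but the `model_id` string, key, and voice ID are placeholders — confirm all three in the current docs. Note the API returns MP3 by default, so convert locally to WAV before the HeyGen import:

```python
API_KEY = "your-xi-api-key"        # placeholder
VOICE_ID = "your-clone-voice-id"   # placeholder

def build_tts_request(text: str) -> dict:
    """Assemble the request for the ElevenLabs text-to-speech endpoint."""
    return {
        "url": f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        "headers": {"xi-api-key": API_KEY,
                    "Content-Type": "application/json"},
        "json": {
            "text": text,
            "model_id": "eleven_v3",  # assumed v3 model identifier
            "voice_settings": {"stability": 0.8, "similarity_boost": 0.8},
        },
    }

# Actual generation (requires the requests package and a live key):
# import requests
# r = requests.post(**build_tts_request("Script text here."))
# open("narration.mp3", "wb").write(r.content)
```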
Cost Management
ElevenLabs bills by character. A 60-second script at average speaking pace is approximately 900–1,100 characters. At Creator tier rates, this is a few cents.
Where cost accumulates: iteration. Running 20 variations of a 1,000-character script in different emotional registers adds up. Write the script once, correctly, then generate. The prompt engineering discipline that applies to video applies equally here.
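The iteration math is worth making concrete. A sketch with a hypothetical per-1,000-character rate — pull your tier's actual rate from your ElevenLabs plan:

```python
def estimate_cost(script: str, usd_per_1k_chars: float) -> float:
    """Rough cost of one generation, billed by character count."""
    return len(script) / 1000 * usd_per_1k_chars

# One take of a 1,000-character script at a hypothetical $0.20/1k chars,
# versus 20 iterations of the same script.
one_take = estimate_cost("x" * 1000, usd_per_1k_chars=0.20)
twenty_takes = 20 * one_take  # iteration, not length, drives the bill
```

At those assumed numbers, one take is cents while twenty takes is measured in dollars — which is the argument for writing the script once, correctly, before generating.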
What v3 Still Can't Do
- Heavy regional accents: The clone will soften the source accent. If a strong Irish, Scottish, or regional American accent is required, the source recording needs to be very long (10+ minutes) and the accent very consistent.
- Singing: ElevenLabs has a separate music product. The voice clone tool is not for singing.
- Real-time: The API has a streaming mode but it's not production-ready for latency-sensitive applications.
The Bottom Line
A well-recorded source voice + a well-formatted script + correct v3 settings produces output that passes broadcast review. I've delivered ElevenLabs voice to clients who had no idea it was AI-generated. The pipeline is mature — the skill is in the setup.