StableVoice API

Reserve an audio output slot on StableUpload, then call POST /api/speech. Poll GET /api/jobs/:jobId with SIWX, using the returned job ID, until the audio URL is ready.
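The polling step can be sketched as a small loop. This is a minimal illustration, not the service's client: the job fields "status" and "audioUrl" and the "failed" value are hypothetical names, since the job response shape is not specified here, and the SIWX-authenticated HTTP call is left to a fetch_job function that you supply.

```python
import time
from typing import Any, Callable, Dict


def poll_job(
    fetch_job: Callable[[str], Dict[str, Any]],
    job_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_job(job_id) until the job exposes an audio URL or fails.

    fetch_job is expected to perform the SIWX-authenticated
    GET /api/jobs/:jobId request and return the decoded JSON body.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job(job_id)
        # ASSUMPTION: "audioUrl" and "status" are hypothetical field names.
        if job.get("audioUrl"):
            return job["audioUrl"]
        if job.get("status") == "failed":
            raise RuntimeError(f"job {job_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} not ready after {timeout_s}s")
        sleep(interval_s)
```

A generous timeout is probably sensible, given that some models listed below have slow cold-starts.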

Models

Chatterbox Turbo

Default English TTS for the bundled voice catalog. Fast 350M-parameter model. Long-form text is automatically chunked on the worker, so you can submit up to 2,500 characters per call. Use voxcpm2 for higher-quality custom cloning.
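For scripts longer than the per-call limit, one option is to split the text client-side and submit one call per piece. The sketch below assumes the 2500-character limit stated above; the function name and the sentence-boundary heuristic are illustrative choices, not part of the API.

```python
import re
from typing import List

MAX_CHARS_PER_CALL = 2500  # per-call limit stated for Chatterbox Turbo


def split_for_calls(text: str, limit: int = MAX_CHARS_PER_CALL) -> List[str]:
    """Split text into pieces of at most `limit` characters.

    Pieces break at sentence ends where possible; a single sentence
    longer than the limit is hard-wrapped.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pieces: List[str] = []
    current = ""
    for sentence in sentences:
        # Hard-wrap a sentence that is itself over the limit.
        while len(sentence) > limit:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            pieces.append(current)
            current = sentence
    if current:
        pieces.append(current)
    return pieces
```

Each piece would then go out as its own /api/speech call, with the worker's auto-chunking handling the text inside each call.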

Options: temperature, topP, topK, repetitionPenalty, normalizeReferenceLoudness

Chatterbox

More expressive English variant of Chatterbox with CFG, exaggeration, and min-p controls. Same auto-chunking as Turbo. Use voxcpm2 for higher-quality custom cloning.

Options: temperature, topP, minP, repetitionPenalty, exaggeration, cfgWeight

Chatterbox Multilingual

Multilingual TTS across 23 languages, with auto-chunking for long-form text. Use voxcpm2 if you want both multilingual coverage and high-quality custom cloning.

Options: language, temperature, topP, minP, repetitionPenalty, exaggeration, cfgWeight

F5-TTS

Cheap, fast cloning fallback. About half the per-second cost of voxcpm2, with a 5-10× faster cold-start. Clones are noticeably less faithful, so use it when latency or budget matters more than clone fidelity, or when you specifically need an MIT-licensed model. English only.

Options: nfeStep, cfgStrength, speed

VoxCPM2

Recommended for voice cloning. OpenBMB 2B diffusion-autoregressive model with the highest fidelity in the catalog: multilingual, 48 kHz output, and long-form text handled without truncation. Supports voice design, controllable cloning, and ultimate cloning with a transcript. Use short style prompts only; verbose style text can leak into speech. Cold-start is slow (~130s), but the quality is worth it.

Options: language, cfgValue, inferenceTimesteps, cloneMode, stylePrompt, voiceDescription

Qwen3-TTS 1.7B

Experimental eval backend. Alibaba Qwen 1.7B Base voice-clone model with 3-second rapid cloning and 10-language support. Added for side-by-side evals against voxcpm2; do not treat as the default until benchmark results justify it.

Options: language, xVectorOnlyMode
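Because each model accepts a different option set, a client may want to catch mismatched options before submitting a paid job. The sketch below copies the option names from the lists above and keys them by display name, since only some model IDs (voxcpm2, qwen3-tts-1.7b) appear in this document. The helper name is hypothetical, and how the server treats unknown options is not specified here.

```python
from typing import Any, Dict, List

# Option names per model, copied from the "Options:" lines above.
MODEL_OPTIONS = {
    "Chatterbox Turbo": {"temperature", "topP", "topK", "repetitionPenalty",
                         "normalizeReferenceLoudness"},
    "Chatterbox": {"temperature", "topP", "minP", "repetitionPenalty",
                   "exaggeration", "cfgWeight"},
    "Chatterbox Multilingual": {"language", "temperature", "topP", "minP",
                                "repetitionPenalty", "exaggeration", "cfgWeight"},
    "F5-TTS": {"nfeStep", "cfgStrength", "speed"},
    "VoxCPM2": {"language", "cfgValue", "inferenceTimesteps", "cloneMode",
                "stylePrompt", "voiceDescription"},
    "Qwen3-TTS 1.7B": {"language", "xVectorOnlyMode"},
}


def unknown_options(model: str, options: Dict[str, Any]) -> List[str]:
    """Return the option names that the given model does not list."""
    allowed = MODEL_OPTIONS[model]  # raises KeyError for an unlisted model
    return sorted(name for name in options if name not in allowed)
```

For example, unknown_options("F5-TTS", {"speed": 1.0, "temperature": 0.7}) returns ["temperature"], because F5-TTS lists only nfeStep, cfgStrength, and speed.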

When to use each clone path

Best clone: voxcpm2 + clean 8-12s referenceAudioUrl. If the transcript is exact, add referenceText and cloneMode="ultimate".
No exact transcript: omit referenceText and let the worker transcribe. Never guess it.
Cadence: use short voxcpm2 stylePrompt cues. Long prompts can leak into spoken output.
Designed voice: cloneMode="voice-design" + voiceDescription; omit referenceAudioUrl.
Qwen eval: compare qwen3-tts-1.7b against voxcpm2 with exact transcript and xVectorOnlyMode.
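The voxcpm2 rules above can be expressed as a small field builder. This is a sketch under assumptions: the field names come from this document, but where they sit in the /api/speech request body (top level or nested under options) is not stated here, and the function name is hypothetical.

```python
from typing import Dict, Optional


def vox_clone_fields(
    reference_audio_url: Optional[str] = None,
    exact_transcript: Optional[str] = None,
    voice_description: Optional[str] = None,
    style_prompt: Optional[str] = None,
) -> Dict[str, str]:
    """Pick the voxcpm2 clone fields for one of the paths described above."""
    fields: Dict[str, str] = {}
    if voice_description is not None:
        # Designed voice: description only, no reference audio.
        if reference_audio_url is not None:
            raise ValueError("voice design omits referenceAudioUrl")
        fields["cloneMode"] = "voice-design"
        fields["voiceDescription"] = voice_description
    elif reference_audio_url is not None:
        fields["referenceAudioUrl"] = reference_audio_url
        # Send a transcript only when it is exact; otherwise omit it
        # and let the worker transcribe.
        if exact_transcript is not None:
            fields["referenceText"] = exact_transcript
            fields["cloneMode"] = "ultimate"
    else:
        raise ValueError("need a reference audio URL or a voice description")
    if style_prompt:
        fields["stylePrompt"] = style_prompt  # keep this short
    return fields
```

Passing only a reference audio URL yields a payload without referenceText, which matches the "never guess it" rule.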

Catalog

Voices: Aaron, Abigail, Anaya, Andy, Archer, Brian, Chloe, Dylan, Emmanuel, Ethan, Evelyn, Gavin, Gordon, Ivan, Laura, Lucy, Madison, Marisol, Meera, Walter.

Aaron

Standard American male, grounded and balanced — fits a generic white-collar Western character, product narration, and support flows.

Abigail

Australian female, bright and approachable — picture an Aussie or general English-speaking character. Friendly without getting silly.

Anaya

Indian-accented female, crisp and energetic — natural fit for South Asian characters or any role where the agent should picture an Indian speaker.

Andy

Casual American male with a dry edge — picture a laid-back white guy in his 20s or 30s. Works for informal narration and creator content.

Archer

Confident American male with cinematic gravitas — picture a movie-trailer voice or a composed lead character. Suits high-drama launches.

Brian

Steady American male, technical and matter-of-fact — picture an engineer or operator. Low-friction for ops, tutorials, and engineering reads.

Chloe

Light, playful Australian female — picture a young Aussie. Best when small interface moments should feel more alive.

Dylan

Relaxed American male with a natural, understated cadence — picture a low-key indie/folk vibe. Good for narration that should not feel overproduced.

Emmanuel

Polished Black male, articulate and warm — picture an African American or African character. A dependable voice for structured explanation and presentations.

Ethan

Upbeat North American male, clear and brisk — picture a friendly American or Canadian guy. Useful for task-oriented reads with forward motion.

Evelyn

Smooth and expressive American female — picture a reassuring white-collar professional woman. Good for warmer flows where reassurance matters.

Gavin

Bold, animated American male — picture a high-energy host or hype guy. Suits ads, intros, and energetic explainers that need presence.

Gordon

Measured, authoritative older American male — picture a seasoned narrator or documentary host. Good when the read should feel stable and serious.

Ivan

Slavic-accented male (Russian/Eastern European), precise with a deadpan edge — picture a Russian or Eastern European character. Works for analytical, dry technical reads.

Laura

Clear and friendly North American female — picture a competent American or Canadian woman. A practical default for help content and product education.

Lucy

Balanced and lively North American female — picture a warm, modern American/Canadian woman. The safest default for general assistant and product narration.

Madison

Polished, upbeat American female — picture a young media-trained woman. Useful for media-ready content and confident product copy.

Marisol

Warm Latina female with lightly Spanish-inflected English — picture a Hispanic or Latin American character. Fits hospitality, travel, and conversational flows.

Meera

Indian-accented female, calm and thoughtful — picture a South Asian character with a measured cadence. Suited to longer explanation and reflective narration.

Walter

Older American male with a weathered, country-tinged delivery — picture a grandfatherly classic-Western character. Fits deliberate reads with a little ceremony.

Output formats: wav, mp3.

Endpoints

GET  /api/voices         SIWX  Model and voice catalog.
GET  /api/voice-samples  SIWX  Voice sample catalog.
POST /api/speech         paid  TTS job.
GET  /api/jobs/:jobId    SIWX  Job status.
GET  /api/jobs           SIWX  Job list.
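A client can keep the endpoint list as data and look up which access label applies before sending a request. The table below is copied from the list above; the helper names and the matching logic are illustrative, and the meaning of each label (SIWX versus paid) is taken as given rather than implemented.

```python
from typing import Optional

# (method, path template) -> access label, copied from the endpoint list.
ENDPOINTS = {
    ("GET", "/api/voices"): "SIWX",
    ("GET", "/api/voice-samples"): "SIWX",
    ("POST", "/api/speech"): "paid",
    ("GET", "/api/jobs/:jobId"): "SIWX",
    ("GET", "/api/jobs"): "SIWX",
}


def _matches(template: str, path: str) -> bool:
    """Compare path segments, treating ':name' segments as wildcards."""
    t_parts = template.strip("/").split("/")
    p_parts = path.strip("/").split("/")
    return len(t_parts) == len(p_parts) and all(
        t.startswith(":") or t == p for t, p in zip(t_parts, p_parts)
    )


def access_for(method: str, path: str) -> Optional[str]:
    """Return the access label for a request, or None if it is not listed."""
    for (listed_method, template), access in ENDPOINTS.items():
        if listed_method == method.upper() and _matches(template, path):
            return access
    return None
```

For example, access_for("GET", "/api/jobs/abc123") returns "SIWX", and access_for("POST", "/api/speech") returns "paid".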