StableVoice API
Reserve an audio output slot on StableUpload, then call /api/speech. Poll the returned job ID with SIWX until the audio URL is ready.
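The polling step above can be sketched as a small loop. This is a minimal sketch, not the documented client: the response keys `status`, `audioUrl`, and `error`, the `"failed"` status value, and the interval and timeout defaults are all assumptions, and `fetch_job` is a hypothetical callable that performs the SIWX-authenticated GET /api/jobs/:jobId request and returns the decoded JSON body.

```python
import time
from typing import Callable


def poll_job(
    fetch_job: Callable[[str], dict],
    job_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a job until it reports an audio URL, then return that URL.

    Assumed (not documented) response keys: "status", "audioUrl", "error".
    """
    waited = 0.0
    while True:
        job = fetch_job(job_id)
        if job.get("audioUrl"):
            return job["audioUrl"]
        if job.get("status") == "failed":  # assumed failure marker
            raise RuntimeError(job.get("error", "job failed"))
        if waited >= timeout_s:
            raise TimeoutError(f"job {job_id} not ready after {timeout_s}s")
        sleep(interval_s)
        waited += interval_s
```

Passing `fetch_job` and `sleep` in as arguments keeps the loop independent of any particular HTTP or signing library, so it can be exercised with canned responses. Slow cold starts may call for a larger `timeout_s`.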
Models
Chatterbox Turbo
Default English TTS for the bundled voice catalog. Fast 350M model. Long-form text is automatically chunked on the worker so you can submit up to 2500 chars per call. Use voxcpm2 for higher-quality custom cloning.
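The worker chunks text automatically within a single call, but text longer than the 2500-character limit still has to be divided across calls by the caller. A minimal sketch of one way to do that follows, assuming you want to split on sentence boundaries; `split_for_calls` is a hypothetical client-side helper and not part of the API.

```python
import re

MAX_CHARS = 2500  # per-call limit stated for Chatterbox Turbo


def split_for_calls(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into pieces of at most `limit` characters.

    Sentences are kept whole where possible; a single sentence longer
    than `limit` is hard-wrapped.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pieces: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-wrap any sentence that cannot fit in one call on its own.
        while len(sentence) > limit:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            pieces.append(current)
            current = sentence
    if current:
        pieces.append(current)
    return pieces
```

Each returned piece would then be submitted as its own /api/speech call, and the resulting audio files joined afterwards.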
Options: temperature, topP, topK, repetitionPenalty, normalizeReferenceLoudness
Chatterbox
More expressive English variant of Chatterbox with CFG, exaggeration, and min-p controls. Same auto-chunking as Turbo. Use voxcpm2 for higher-quality custom cloning.
Options: temperature, topP, minP, repetitionPenalty, exaggeration, cfgWeight
Chatterbox Multilingual
Multilingual TTS across 23 languages, with auto-chunking for long-form. Use voxcpm2 if you want both multilingual coverage AND high-quality custom cloning.
Options: language, temperature, topP, minP, repetitionPenalty, exaggeration, cfgWeight
F5-TTS
Cheap, fast cloning fallback. About half the per-second cost of voxcpm2, with a 5-10× faster cold start. Clones are noticeably less faithful, so use it when latency or budget matters more than clone fidelity, or when you specifically need an MIT-licensed model. English only.
Options: nfeStep, cfgStrength, speed
VoxCPM2
Recommended for voice cloning. OpenBMB 2B diffusion-AR — highest fidelity in the catalog, multilingual, 48kHz output, handles long-form text without truncation. Supports voice design, controllable cloning, and ultimate cloning with a transcript. Use short style prompts only; verbose style text can leak into speech. Cold-start is slow (~130s), but quality is worth it.
Options: language, cfgValue, inferenceTimesteps, cloneMode, stylePrompt, voiceDescription
Qwen3-TTS 1.7B
Experimental eval backend. Alibaba Qwen 1.7B Base voice-clone model with 3-second rapid cloning and 10-language support. Added for side-by-side evals against voxcpm2; do not treat as the default until benchmark results justify it.
Options: language, xVectorOnlyMode
When to use each clone path
- Best clone: voxcpm2 + clean 8-12s referenceAudioUrl. If the transcript is exact, add referenceText and cloneMode="ultimate".
- No exact transcript: omit referenceText and let the worker transcribe. Never guess it.
- Cadence: use short Vox stylePrompt cues. Long prompts can leak into spoken output.
- Designed voice: cloneMode="voice-design" + voiceDescription; omit referenceAudioUrl.
- Qwen eval: compare qwen3-tts-1.7b against Vox with exact transcript and xVectorOnlyMode.
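The VoxCPM2 clone paths listed above can be expressed as a small selection function. This is a sketch under assumptions: the field names and the `cloneMode` values come from this page, but the `model` key, the flat shape of the returned dict, and the function itself are hypothetical and may not match the real request body.

```python
from typing import Optional


def clone_options(
    reference_audio_url: Optional[str] = None,
    exact_transcript: Optional[str] = None,
    style_prompt: Optional[str] = None,
    voice_description: Optional[str] = None,
) -> dict:
    """Choose VoxCPM2 clone fields from what the caller actually has."""
    opts: dict = {"model": "voxcpm2"}  # "model" key is an assumed name
    if reference_audio_url is None:
        # Designed voice: describe the voice and send no reference audio.
        if not voice_description:
            raise ValueError("need referenceAudioUrl or voiceDescription")
        opts["cloneMode"] = "voice-design"
        opts["voiceDescription"] = voice_description
    else:
        opts["referenceAudioUrl"] = reference_audio_url
        if exact_transcript:
            # Only an exact transcript qualifies for ultimate cloning.
            opts["referenceText"] = exact_transcript
            opts["cloneMode"] = "ultimate"
        # Otherwise referenceText is omitted so the worker transcribes.
    if style_prompt:
        opts["stylePrompt"] = style_prompt  # keep this short
    return opts
```

For example, `clone_options(reference_audio_url=url)` with no transcript leaves out `referenceText` entirely instead of sending a guess.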
Catalog
Voices: Aaron, Abigail, Anaya, Andy, Archer, Brian, Chloe, Dylan, Emmanuel, Ethan, Evelyn, Gavin, Gordon, Ivan, Laura, Lucy, Madison, Marisol, Meera, Walter.
Aaron
Standard American male, grounded and balanced — fits a generic white-collar Western character, product narration, and support flows.
Abigail
Australian female, bright and approachable — picture an Aussie or general English-speaking character. Friendly without getting silly.
Anaya
Indian-accented female, crisp and energetic — natural fit for South Asian characters or any role where the agent should picture an Indian speaker.
Andy
Casual American male with a dry edge — picture a laid-back white guy in his 20s or 30s. Works for informal narration and creator content.
Archer
Confident American male with cinematic gravitas — picture a movie-trailer voice or a composed lead character. Suits high-drama launches.
Brian
Steady American male, technical and matter-of-fact — picture an engineer or operator. Low-friction for ops, tutorials, and engineering reads.
Chloe
Light, playful Australian female — picture a young Aussie. Best when small interface moments should feel more alive.
Dylan
Relaxed American male with a natural, understated cadence — picture a low-key indie/folk vibe. Good for narration that should not feel overproduced.
Emmanuel
Polished Black male, articulate and warm — picture an African American or African character. A dependable voice for structured explanation and presentations.
Ethan
Upbeat North American male, clear and brisk — picture a friendly American or Canadian guy. Useful for task-oriented reads with forward motion.
Evelyn
Smooth and expressive American female — picture a reassuring white-collar professional woman. Good for warmer flows where reassurance matters.
Gavin
Bold, animated American male — picture a high-energy host or hype guy. Suits ads, intros, and energetic explainers that need presence.
Gordon
Measured, authoritative older American male — picture a seasoned narrator or documentary host. Good when the read should feel stable and serious.
Ivan
Slavic-accented male (Russian/Eastern European), precise with a deadpan edge — picture a Russian or Eastern European character. Works for analytical, dry technical reads.
Laura
Clear and friendly North American female — picture a competent American or Canadian woman. A practical default for help content and product education.
Lucy
Balanced and lively North American female — picture a warm, modern American/Canadian woman. The safest default for general assistant and product narration.
Madison
Polished, upbeat American female — picture a young media-trained woman. Useful for media-ready content and confident product copy.
Marisol
Warm Latina female with lightly Spanish-inflected English — picture a Hispanic or Latin American character. Fits hospitality, travel, and conversational flows.
Meera
Indian-accented female, calm and thoughtful — picture a South Asian character with a measured cadence. Suited to longer explanation and reflective narration.
Walter
Older American male with a weathered, country-tinged delivery — picture a grandfatherly classic-Western character. Fits deliberate reads with a little ceremony.
Output formats: wav, mp3.
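A client may want to reject an unknown voice or output format before submitting a paid job. The sketch below hardcodes the voice names and formats listed on this page; a live client would more likely read them from GET /api/voices. The helper name and the case-insensitive matching are assumptions, not API behavior.

```python
VOICES = [
    "Aaron", "Abigail", "Anaya", "Andy", "Archer", "Brian", "Chloe",
    "Dylan", "Emmanuel", "Ethan", "Evelyn", "Gavin", "Gordon", "Ivan",
    "Laura", "Lucy", "Madison", "Marisol", "Meera", "Walter",
]
OUTPUT_FORMATS = ("wav", "mp3")

# Lookup table so callers can pass "lucy" or "LUCY" and get "Lucy" back.
_BY_LOWER = {name.lower(): name for name in VOICES}


def normalize_voice_and_format(voice: str, output_format: str) -> tuple[str, str]:
    """Return the catalog spelling of the voice and a lowercase format."""
    canonical = _BY_LOWER.get(voice.strip().lower())
    if canonical is None:
        raise ValueError(f"unknown voice: {voice!r}")
    fmt = output_format.strip().lower()
    if fmt not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    return canonical, fmt
```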
Endpoints
- GET /api/voices: SIWX model and voice catalog.
- GET /api/voice-samples: SIWX voice sample catalog.
- POST /api/speech: paid TTS job.
- GET /api/jobs/:jobId: SIWX status.
- GET /api/jobs: SIWX job list.
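The five endpoints can be kept in one lookup table so that methods and paths are not scattered through client code. This is a minimal sketch: the methods and paths are the ones listed here, while the base URL, the table keys, and `build_request` are hypothetical. SIWX signing and payment are left to the caller.

```python
from urllib.parse import quote

# Method and path for each endpoint listed on this page.
ENDPOINTS = {
    "voices": ("GET", "/api/voices"),
    "voice_samples": ("GET", "/api/voice-samples"),
    "speech": ("POST", "/api/speech"),
    "job": ("GET", "/api/jobs/{jobId}"),
    "jobs": ("GET", "/api/jobs"),
}


def build_request(base_url: str, name: str, **params: str) -> tuple[str, str]:
    """Return (method, absolute URL) for a named endpoint.

    Path parameters such as jobId are percent-encoded before substitution.
    """
    method, template = ENDPOINTS[name]
    encoded = {key: quote(str(value), safe="") for key, value in params.items()}
    return method, base_url.rstrip("/") + template.format(**encoded)
```

With a placeholder host, `build_request("https://voice.example.test", "job", jobId="abc123")` yields the GET request for that job's status.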