Text-to-speech on Modal as an x402 API.

Generate speech with bundled voices, or zero-shot clone any voice from a 3-15 second StableUpload reference clip. Open-source models (Chatterbox, F5-TTS, VoxCPM2, Qwen3-TTS), WAV or MP3 output, SIWX job history, and audio delivered to your StableUpload slot.
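
Before uploading a reference clip, it can help to confirm locally that it falls inside the 3-15 second window. The sketch below is not part of the service: it assumes the clip is an uncompressed PCM WAV file (Python's standard `wave` module cannot read MP3), and the function names are illustrative.

```python
import wave

MIN_SECONDS = 3.0   # lower bound of the reference-clip window
MAX_SECONDS = 15.0  # upper bound of the reference-clip window


def clip_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid_reference_clip(path: str) -> bool:
    """True when the clip length falls inside the 3-15 second window."""
    return MIN_SECONDS <= clip_duration_seconds(path) <= MAX_SECONDS
```

A clip that fails this check can be trimmed or re-recorded before upload, rather than discovered as unusable after a paid call.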

Models

Chatterbox Turbo

Default English TTS for the bundled voice catalog. Fast 350M-parameter model. Long-form text is automatically chunked on the worker, so you can submit up to 2,500 characters per call. Use voxcpm2 for higher-quality custom cloning.
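
The worker handles chunking within a call, but text longer than the per-call limit still has to be split across calls by the client. The following is a minimal client-side sketch, not the worker's own chunker: it greedily packs whole sentences into pieces of at most 2,500 characters, and the function name and sentence-boundary regex are assumptions made for illustration.

```python
import re

MAX_CHARS = 2500  # per-call character limit stated above


def split_for_calls(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into pieces of at most `limit` characters."""
    # Split after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pieces: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that exceeds the limit on its own.
        while len(sentence) > limit:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            pieces.append(current)
            current = sentence
    if current:
        pieces.append(current)
    return pieces
```

Each returned piece can be submitted as its own call, and the resulting audio files concatenated in order.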

Chatterbox

More expressive English variant of Chatterbox with CFG, exaggeration, and min-p controls. Same auto-chunking as Turbo. Use voxcpm2 for higher-quality custom cloning.

Chatterbox Multilingual

Multilingual TTS across 23 languages, with auto-chunking for long-form. Use voxcpm2 if you want both multilingual coverage AND high-quality custom cloning.

F5-TTS

Cheap, fast cloning fallback. About half the per-second cost of voxcpm2 and a 5-10× faster cold start. Clones are noticeably less faithful, so use it when latency or budget matters more than clone fidelity, or when you specifically need an MIT-licensed model. English only.

VoxCPM2

Recommended for voice cloning. OpenBMB 2B diffusion-AR — highest fidelity in the catalog, multilingual, 48kHz output, handles long-form text without truncation. Supports voice design, controllable cloning, and ultimate cloning with a transcript. Use short style prompts only; verbose style text can leak into speech. Cold-start is slow (~130s), but quality is worth it.

Qwen3-TTS 1.7B

Experimental eval backend. Alibaba Qwen 1.7B Base voice-clone model with 3-second rapid cloning and 10-language support. Added for side-by-side evals against voxcpm2; do not treat as the default until benchmark results justify it.
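
The guidance in the model descriptions above can be condensed into a small selection helper. This is only a sketch of that guidance: `voxcpm2` is the identifier used on this page, while the other model ID strings and the function itself are hypothetical and should be checked against the API docs before use.

```python
def pick_model(
    *,
    cloning: bool,
    english_only: bool = True,
    prefer_cheap_fast: bool = False,
    expressive: bool = False,
) -> str:
    """Map a few requirements to a model ID, following the descriptions above."""
    if cloning:
        # F5-TTS is the cheap, fast fallback, but it is English only.
        if prefer_cheap_fast and english_only:
            return "f5-tts"  # hypothetical ID
        # VoxCPM2 is the recommended cloning model and is multilingual.
        return "voxcpm2"
    if not english_only:
        return "chatterbox-multilingual"  # hypothetical ID
    # Bundled-voice English TTS: Turbo by default, Chatterbox for more expression.
    return "chatterbox" if expressive else "chatterbox-turbo"  # hypothetical IDs
```

Qwen3-TTS is deliberately never returned here, since it is described as an experimental eval backend rather than a default.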

Starting price

$0.02

Bundled voices

Voice samples

Hear all 20 bundled voices.

Short static MP3 auditions generated with Chatterbox Turbo, ready to play without paying or starting a job.

Aaron

MP3

Beep beep. The deploy passed, and my coffee has entered production.

Standard American male, grounded and balanced — fits a generic white-collar Western character, product narration, and support flows.

American male, grounded, balanced

Abigail

MP3

I opened one tab to test audio. It became a lifestyle.

Australian female, bright and approachable — picture an Aussie or general English-speaking character. Friendly without getting silly.

Australian accent, female, bright

Anaya

MP3

Tiny update: the button works. Huge update: I said tiny update.

Indian-accented female, crisp and energetic — natural fit for South Asian characters or any role where the agent should picture an Indian speaker.

Indian accent, female, energetic

Andy

MP3

This sample is legally a vibe, technically a waveform.

Casual American male with a dry edge — picture a laid-back white guy in his 20s or 30s. Works for informal narration and creator content.

American male, casual, dry

Archer

MP3

Ship it, then whisper ship it again for cache warmth.

Confident American male with cinematic gravitas — picture a movie-trailer voice or a composed lead character. Suits high-drama launches.

American male, cinematic, confident

Brian

MP3

I asked Modal for a snack and it returned a GPU.

Steady American male, technical and matter-of-fact — picture an engineer or operator. Low-friction for ops, tutorials, and engineering reads.

American male, steady, technical

Chloe

MP3

Psst. Your browser just learned twenty voices. Casual.

Light, playful Australian female — picture a young Aussie. Best when small interface moments should feel more alive.

Australian accent, female, playful

Dylan

MP3

If this loads fast, pretend I planned it that way.

Relaxed American male with a natural, understated cadence — picture a low-key indie/folk vibe. Good for narration that should not feel overproduced.

American male, relaxed, natural

Emmanuel

MP3

I put the syllables in a trench coat and called it speech.

Polished Black male, articulate and warm — picture an African American or African character. A dependable voice for structured explanation and presentations.

Black male, polished, articulate

Ethan

MP3

Audio sample number nine is feeling extremely compiled.

Upbeat North American male, clear and brisk — picture a friendly American or Canadian guy. Useful for task-oriented reads with forward motion.

North American male, upbeat, clear

Evelyn

MP3

I am not buffering. I am building dramatic suspense.

Smooth and expressive American female — picture a reassuring white-collar professional woman. Good for warmer flows where reassurance matters.

American female, expressive, reassuring

Gavin

MP3

The waveform said squiggle squiggle and invoices got paid.

Bold, animated American male — picture a high-energy host or hype guy. Suits ads, intros, and energetic explainers that need presence.

American male, bold, animated

Gordon

MP3

This voice has been toasted to a perfect golden latency.

Measured, authoritative older American male — picture a seasoned narrator or documentary host. Good when the read should feel stable and serious.

older American male, authoritative, measured

Ivan

MP3

Behold: one sentence, lightly seasoned with computation.

Slavic-accented male (Russian/Eastern European), precise with a deadpan edge — picture a Russian or Eastern European character. Works for analytical, dry technical reads.

Slavic accent, male, deadpan

Laura

MP3

Click once for sound. Click twice for confidence.

Clear and friendly North American female — picture a competent American or Canadian woman. A practical default for help content and product education.

North American female, clear, friendly

Lucy

MP3

Hello from StableVoice. I brought receipts and a tiny reverb.

Balanced and lively North American female — picture a warm, modern American/Canadian woman. The safest default for general assistant and product narration.

North American female, balanced, lively

Madison

MP3

The landing page asked for personality, so I arrived with tags.

Polished, upbeat American female — picture a young media-trained woman. Useful for media-ready content and confident product copy.

American female, polished, media-ready

Marisol

MP3

Today's forecast: ninety percent chance of nice audio.

Warm Latina female with lightly Spanish-inflected English — picture a Hispanic or Latin American character. Fits hospitality, travel, and conversational flows.

Latina, female, warm

Meera

MP3

I tried to be serious, then the waveform did a little wiggle.

Indian-accented female, calm and thoughtful — picture a South Asian character with a measured cadence. Suited to longer explanation and reflective narration.

Indian accent, female, thoughtful

Walter

MP3

Back in my day, samples were shipped after they loaded.

Older American male with a weathered, country-tinged delivery — picture a grandfatherly classic-Western character. Fits deliberate reads with a little ceremony.

older American male, country-tinged, weathered