{"version":1,"resources":["https://stablevoice.dev/api/speech","https://stablevoice.dev/api/voices","https://stablevoice.dev/api/jobs","https://stablevoice.dev/api/jobs/:jobId","DELETE https://stablevoice.dev/api/jobs/:jobId","https://stablevoice.dev/api/voice-samples","https://stablevoice.dev/api/recording-tokens"],"mppResources":["https://stablevoice.dev/api/speech","https://stablevoice.dev/api/recording-tokens"],"description":"Pay-per-request text-to-speech on Modal. StableVoice serves the Chatterbox, F5-TTS, VoxCPM2, and Qwen3-TTS open models with bundled voices, optional custom voice references, mp3/wav output, and StableUpload storage slots.","instructions":"# StableVoice API\n\nBase URL: `https://stablevoice.dev`\n\nPay-per-request text-to-speech. Reserve a StableUpload output slot, call `POST /api/speech`, then poll `GET /api/jobs/{jobId}`.\n\n## Models\n\nDefault to `voxcpm2` for custom voice cloning. It has the best fidelity, multilingual coverage, 48kHz output, and long-form handling.\n\n| model | best for | license | notes |\n|---|---|---|---|\n| `voxcpm2` | production custom clones | Apache-2.0 | Use `cloneMode: \"ultimate\"` + exact `referenceText` for best similarity. |\n| `qwen3-tts-1.7b` | experimental clone evals | Apache-2.0 | Compare against Vox; try exact `referenceText` and `xVectorOnlyMode`. |\n| `f5-tts` | cheap/fast English clones | MIT | Lower fidelity; reference is clipped to 12s. |\n| `chatterbox-turbo` | bundled-voice TTS | MIT | Best default for short catalog-voice utterances and paralinguistic tags. |\n| `chatterbox` | expressive bundled TTS | MIT | English, CFG/exaggeration controls. |\n| `chatterbox-multilingual` | multilingual bundled TTS | MIT | 23 languages. 
|\n\nHard rules:\n- Exact transcript available: pass `referenceText`; for Vox use `options.cloneMode: \"ultimate\"`.\n- Transcript uncertain: omit `referenceText`; for Qwen also compare `options.xVectorOnlyMode: true`.\n- Cadence steering: use short Vox `options.stylePrompt` only. The model may read long style instructions aloud.\n- Designed voice without a reference: use Vox `options.cloneMode: \"voice-design\"` + `options.voiceDescription`; do not send `referenceAudioUrl`.\n\nCall `GET /api/voices` for the full model catalog, bundled voice guide, supported languages, cloning notes, and pricing.\n\n## Workflow\n\n```\n1. Optional: GET stablevoice.dev/api/voices\n2. POST stableupload.dev/api/upload       # reserve output slot, filename matches format\n3. POST stablevoice.dev/api/speech        # paid job\n4. GET  stablevoice.dev/api/jobs/{jobId}  # SIWX poll every 2-5s\n```\n\n## Endpoints\n\n- `POST /api/speech`: paid TTS job. Body: `text` (1-2500), `model`, `voice`, `language`, `format`, `output`, optional `referenceAudioUrl`, optional `referenceText`, `options`, `clientRequestId`.\n- `GET /api/jobs/{jobId}`: SIWX status. When complete, read `result.outputs.audio.publicUrl`.\n- `GET /api/voices`: SIWX model/voice catalog and decision guidance.\n- `GET /api/voice-samples`: SIWX bundled voice MP3 previews.\n- `GET /api/jobs?cursor=...&limit=50` and `DELETE /api/jobs/{jobId}`: SIWX job history.\n- `POST /api/recording-tokens` and `GET /api/recording-tokens/{token}`: capture flow, see below.\n\n## Voice cloning\n\nUse `referenceAudioUrl` for cloning. It must be a `https://f.stableupload.dev/...` URL. Reference clip duration by model:\n\n- `voxcpm2` and `qwen3-tts-1.7b`: 3-10s, recommended 8s.\n- `f5-tts`: 10-15s, recommended 12s.\n- `chatterbox-*`: 5-15s, recommended 10s.\n\nCloning workflow:\n\n```\n1. POST stableupload.dev/api/upload       # reserve slot for the reference clip\n2. PUT or POST the audio to that slot     # wav/mp3/m4a, mono or stereo, any sample rate\n3. POST stableupload.dev/api/upload       # reserve slot for generated speech\n4. 
POST stablevoice.dev/api/speech        # set referenceAudioUrl + output to the two publicUrls\n5. GET  stablevoice.dev/api/jobs/{jobId}  # SIWX poll\n```\n\nBest clone request:\n\n```json\n{\n  \"model\": \"voxcpm2\",\n  \"text\": \"Your output text here.\",\n  \"referenceAudioUrl\": \"https://f.stableupload.dev/.../voice.wav\",\n  \"referenceText\": \"Exact transcript of the reference clip.\",\n  \"options\": {\n    \"cloneMode\": \"ultimate\"\n  }\n}\n```\n\nQwen comparison request:\n\n```json\n{ \"model\": \"qwen3-tts-1.7b\", \"text\": \"Your output text here.\", \"referenceAudioUrl\": \"https://f.stableupload.dev/.../voice.wav\", \"options\": { \"xVectorOnlyMode\": true } }\n```\n\n## Capture flow: record from a human without a wallet\n\nUse this when the person being recorded does not have a wallet:\n\n1. Reserve a StableUpload slot: filename `.mp3`, `contentType: \"audio/mpeg\"`, `policyTtlSeconds: 3600`.\n2. `POST /api/recording-tokens` ($0.01) with that slot's `uploadUrl` or `postUrl`/`postFields`, `publicUrl`, `expiresAt` set to the slot's `uploadUrlExpiresAt`, and optional `suggestedText`/speaker label.\n3. Send `recordUrl` to the person. They record up to 60s; StableVoice transcodes the recording and uploads it to your slot.\n4. Poll `GET /api/recording-tokens/{token}` until complete, then use the returned `publicUrl` as `referenceAudioUrl`.\n\nMinimum speech price: $0.02.\n"}