A single 780M-parameter model that speaks 50+ languages in any of 300+ voices — same voice, every language, one API call. Emotion tags, code-switching, and streaming, all built in.
A browser studio with 30+ emotion tags, multi-track timelines, and real-time previews. No installs, no GPUs, no audio engineering.
Pass voice="aria" with text in any of 50+ languages — even mixed mid-sentence. Emotion tags inline. Streams from token zero.
```python
# pip install svara
from svara import Client

c = Client()
audio = c.speak(
    text="नमस्ते, world! [laugh] こんにちは.",
    voice="aria",
    emotion="warm",
    stream=True,
)
```
Apache 2.0 weights, 1.5GB on disk; runs on a single T4, or quantized to 420MB on CPU. Fine-tune your own voice talent. Sign the deepfake-detection terms. Air-gapped if you need it.
Most TTS models force you to pick a language first, then a voice. Svara flips that — pick a voice, speak in any of 50+ languages with the same identity. The rest follows.
3–5× smaller than comparable models. Runs on a single T4 at a 0.18 real-time factor (synthesis takes 18% of the audio's duration), or quantized to 420MB on a CPU. Fine-tune on a laptop.
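The stated sizes are mutually consistent. A quick back-of-envelope check, assuming 16-bit weights for the full checkpoint (the per-parameter bit-width for the quantized model is inferred here, not stated anywhere in the docs):

```python
params = 780e6  # 780M parameters (stated)

# Full-precision checkpoint: 2 bytes/param at fp16/bf16
fp16_gb = params * 2 / 1e9
print(f"fp16 size ≈ {fp16_gb:.2f} GB")  # ≈ 1.56 GB, matches "1.5GB on disk"

# Quantized checkpoint: 420 MB on disk implies roughly 4-bit weights
quant_bits = 420e6 * 8 / params
print(f"quantized ≈ {quant_bits:.1f} bits/param")  # ≈ 4.3 bits, i.e. ~4-bit quantization
```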
Aria sounds like Aria in English, Hindi, Japanese, or all three in one sentence. No swapping voices when you switch language — the model handles code-switching natively.
30+ tags — [happy] [whisper] [sigh] [speed=1.2] — written inline. No emotion sliders, no separate API.
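The tags ride along in the text itself. A minimal sketch of how a script with inline markers might be split into plain text and tags — the regex and helper below are illustrative only, not Svara's actual parser:

```python
import re

# Matches bare tags like [happy] and parameterized ones like [speed=1.2]
TAG_RE = re.compile(r"\[([a-z]+(?:=[0-9.]+)?)\]")

def extract_tags(text: str):
    """Return (plain_text, tags) for a script with inline [tag] markers."""
    tags = TAG_RE.findall(text)
    plain = TAG_RE.sub("", text).strip()
    return plain, tags

script = "[happy] Welcome back! [whisper] Don't tell anyone. [speed=1.2] Let's go."
plain, tags = extract_tags(script)
print(tags)  # ['happy', 'whisper', 'speed=1.2']
```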
First audio byte at ~80ms. Pipe directly into a phone call, an LLM, or a browser <audio> tag.
No language router, no voice ID lookup, no emotion API. POST /v1/audio/speech takes text + voice + emotion. That's it.
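The request body really is just those three fields. A sketch using only the standard library — the base URL, auth header, and anything beyond text/voice/emotion are assumptions, not documented API surface:

```python
import json
import urllib.request

# The three fields the endpoint takes (stated); nothing else required.
payload = {
    "text": "नमस्ते, world! [laugh] こんにちは.",
    "voice": "aria",
    "emotion": "warm",
}
body = json.dumps(payload).encode("utf-8")

# Hypothetical host and auth scheme — substitute your own.
req = urllib.request.Request(
    "https://api.example.com/v1/audio/speech",
    data=body,
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer $SVARA_API_KEY",
    },
)
# with urllib.request.urlopen(req) as resp:  # response streams audio bytes
#     audio = resp.read()
```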
Use the weights commercially, modify, redistribute. Watermarking and a deepfake-detector ship alongside.
100K characters / month free. No credit card. Apache 2.0 weights on day one.