svara-global-v1 · 780M params · open weights · live

One small voice model.
Every language. Every voice.

A single 780M-parameter model that speaks 50+ languages in any of 300+ voices — same voice, every language, one API call. Emotion tags, code-switching, and streaming, all built in.

Open Studio → · Read the API docs · Talk to sales
50+
Languages, in every voice
300+
Voices, all multilingual
<200ms
P50 streaming latency
780M
Params · runs on a single T4
Studio · for creators

Write a script. Pick a voice. Ship.

A browser studio with 30+ emotion tags, multi-track timelines, and real-time previews. No installs, no GPUs, no audio engineering.

From $9 · 1M chars / mo
Open Studio
API · for developers

One endpoint. Any language. Same voice.

Pass voice="aria" with text in any of 50+ languages — even mixed mid-sentence. Emotion tags inline. Streams from token zero.

# pip install svara
from svara import Client

c = Client()
audio = c.speak(
  text="नमस्ते, world! [laugh] こんにちは.",
  voice="aria",      # same voice identity, whatever the language
  emotion="warm",
  stream=True,       # audio starts arriving from the first token
)
From $0.030 / 1K chars · 187ms p50
Read API docs
Enterprise · self-hosted

Your voices, your VPC, your weights.

Apache 2.0 weights, 1.5GB on disk, runs on a single T4 or quantized to 420MB on CPU. Fine-tune your own voice talents. Sign deepfake-detection terms. Air-gapped if you need it.

Single-tenant deployment
SOC 2 · HIPAA · GDPR DPA
Custom voice cloning (1 min audio)
Watermark + provenance
Dedicated support
Annual contract · custom
Talk to enterprise

Built around six product decisions.

Most TTS models force you to pick a language first, then a voice. Svara flips that — pick a voice, speak in any of 50+ languages with the same identity. The rest follows.

780M parameters, open weights

Smaller than the category by 3–5×. Runs on a single T4 at a 0.18 real-time factor (audio generates roughly 5× faster than it plays), or quantized to 420MB on a CPU. Fine-tune on a laptop.
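Those size figures are consistent with fp16 weights and roughly 4-bit quantization. A quick back-of-envelope check (our arithmetic, not an official spec):

```python
PARAMS = 780e6  # parameter count from the spec line above

fp16_gb = PARAMS * 2 / 1e9    # 2 bytes per param in fp16
int4_mb = PARAMS * 0.5 / 1e6  # 0.5 bytes per param at 4-bit

print(f"{fp16_gb:.2f} GB fp16")  # ~1.56 GB, matching the 1.5GB on disk
print(f"{int4_mb:.0f} MB int4")  # ~390 MB, near the quoted 420MB once overhead is added
```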

Same voice, 50+ languages

Aria sounds like Aria in English, Hindi, Japanese, or all three in one sentence. No swapping voices when you switch language — the model handles code-switching natively.

Emotion tags in the prompt

30+ tags — [happy] [whisper] [sigh] [speed=1.2] — written inline. No emotion sliders, no separate API.
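As a rough illustration of the inline syntax, here's a sketch that pulls tags like [happy] or [speed=1.2] out of a prompt string. The regex and helper name are ours for illustration, not part of the Svara API:

```python
import re

# Matches inline tags such as [happy], [whisper], or [speed=1.2].
TAG_RE = re.compile(r"\[([a-z]+)(?:=([\d.]+))?\]")

def extract_tags(prompt: str):
    """Return (plain_text, tags) for a prompt with inline emotion tags."""
    tags = [(m.group(1), m.group(2)) for m in TAG_RE.finditer(prompt)]
    plain = TAG_RE.sub("", prompt)
    return " ".join(plain.split()), tags

text, tags = extract_tags("Hello [whisper] world [speed=1.2] again")
# text == "Hello world again"
# tags == [("whisper", None), ("speed", "1.2")]
```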

Streaming from token zero

First audio byte at ~80ms. Pipe directly into a phone call, an LLM, or a browser <audio> tag.
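The consumption pattern is plain chunk iteration. A sketch with a simulated stream — the real client's streaming interface may differ:

```python
import io

def fake_stream():
    """Stand-in for a streaming response: yields audio bytes as generated."""
    for chunk in (b"\x00\x01", b"\x02\x03", b"\x04"):
        yield chunk

sink = io.BytesIO()  # in practice: a phone-call leg, an LLM turn, or <audio>
for chunk in fake_stream():
    sink.write(chunk)  # play or forward each chunk on arrival — no full-file wait
```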

One endpoint

No language router, no voice ID lookup, no emotion API. POST /v1/audio/speech takes text + voice + emotion. That's it.
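A minimal sketch of the request body, assuming a plain JSON POST. Auth headers and the base URL are deployment details not specified here, and any field beyond the three named above would be an assumption on our part:

```python
import json

# The three fields the endpoint takes: text + voice + emotion.
payload = {
    "text": "Hello [happy] world",
    "voice": "aria",
    "emotion": "warm",
}

# Serialized body for POST /v1/audio/speech.
body = json.dumps(payload, ensure_ascii=False)
```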


Apache 2.0

Use the weights commercially, modify, redistribute. Watermarking and a deepfake-detector ship alongside.

Used by builders at
Cromwell&Co. NORTH/STAR Polyglot Vesper byteline Auralis

The whole catalog. One API call. Free to start.

100K characters / month free. No credit card. Apache 2.0 weights on day one.