svara-global-v1: a 780M-parameter multilingual TTS model with inline emotion control
01 Abstract
We introduce svara-global-v1, a 780M-parameter neural text-to-speech model that produces natural, expressive speech across 50+ languages from a single voice identity, with inline [emotion] and prosody control issued in the same API call as the text. Despite being roughly 4× smaller than comparable production systems, svara-global-v1 attains a naturalness MOS of 4.42 and a head-to-head ELO of 1271, ranking above ElevenLabs Multilingual v2 (1248) and Cartesia Sonic (1235) on the public Voice Arena leaderboard at the time of writing. The model supports streaming inference with a P50 first-byte latency of 187 ms and a real-time factor of 0.18 on a single A10 GPU. Weights are released under Apache 2.0, alongside an OpenAI-compatible HTTP endpoint and reference examples of code-switching within a single utterance across the 50+ supported languages.
02 Capabilities
03 Benchmarks
All numbers below are reproducible with the public evaluation harness at github.com/svara-ai/eval. MOS scores use the standard 5-point ITU-T P.808 protocol with 24 raters per sample over 200 prompts per language.
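For intuition, a per-language MOS is just the mean of rater scores with a confidence interval taken over prompts. Below is a minimal aggregation sketch whose array shapes mirror the protocol above; the function and variable names are ours, not the harness API:

```python
# Illustrative MOS aggregation, assuming `ratings` holds P.808 scores with
# shape (200 prompts, 24 raters). Names are ours, not the eval harness API.
import numpy as np

def mos_with_ci(ratings: np.ndarray, z: float = 1.96) -> tuple[float, float]:
    """Mean opinion score and a normal-approximation 95% CI half-width."""
    per_prompt = ratings.mean(axis=1)              # average raters per prompt
    mos = float(per_prompt.mean())                 # average over prompts
    half_width = z * per_prompt.std(ddof=1) / np.sqrt(len(per_prompt))
    return mos, float(half_width)

ratings = np.random.default_rng(0).integers(3, 6, size=(200, 24)).astype(float)
mos, ci = mos_with_ci(ratings)
print(f"MOS-N = {mos:.2f} ± {ci:.2f}")
```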
3.1 Quality (MOS Naturalness, MOS Similarity, ELO)
| Model | Params | MOS-N ↑ | MOS-S ↑ | WER ↓ | ELO ↑ |
|---|---|---|---|---|---|
| svara-global-v1 | 780M | 4.42 | – | – | 1271 |
| ElevenLabs Multilingual v2 | – | – | – | – | 1248 |
| Cartesia Sonic | – | – | – | – | 1235 |
3.2 Latency by region (P50 / P90 / P99, ms)
| Region | P50 | P90 | P99 |
|---|---|---|---|
3.3 Throughput (Real-Time Factor)
| Hardware | RTF | Streaming | Notes |
|---|---|---|---|
| 1× NVIDIA A10 | 0.18 | yes | – |
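For reference, RTF is synthesis wall-clock time divided by the duration of the audio produced, so an RTF of 0.18 means audio is generated roughly 5.5× faster than real time. A minimal measurement sketch, where `synthesize` is a hypothetical stand-in for a blocking model call:

```python
# Hypothetical RTF measurement; `synthesize` stands in for any blocking call
# that returns 24 kHz mono PCM samples as a sequence.
import time

def real_time_factor(synthesize, text: str, sample_rate: int = 24_000) -> float:
    start = time.perf_counter()
    pcm = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(pcm) / sample_rate
    return elapsed / audio_seconds  # < 1.0 means faster than real time
```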
04 Architecture
svara-global-v1 is a non-autoregressive transformer whose 780M-parameter acoustic decoder is conditioned on token sequences from a shared 50-language byte-pair tokenizer, a 256-dimensional voice embedding, and a discrete emotion/prosody token stream emitted by an in-text DSL parser. The vocoder is a 22M-parameter HiFi-GAN variant trained jointly with the decoder for matched 24 kHz output.
```
text + [emotion] + [lang] ──▶ DSL parser
                                  │
                 ┌────────────────┴─────────────────┐
                 ▼                                  ▼
┌─────────────────────────────────────┐   emotion/prosody tokens
│ Tokenizer (50-lang BPE, 64k vocab)  │              │
└────────────────┬────────────────────┘              ▼
                 │       voice embedding ──▶ Conditioning stack
                 ▼                                   │
┌─────────────────────────────────────┐◀─────────────┘
│   Acoustic Decoder (780M params)    │
│ 24 layers · d_model 1024 · 16 heads │
└────────────────┬────────────────────┘
                 ▼
         HiFi-GAN vocoder (22M)
                 ▼
          24 kHz mono PCM
```
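To make the DSL stage concrete, the sketch below splits inline [tag] and [key=value] markers out of a text stream. It is illustrative only; the shipped parser additionally validates tags against the voice catalog:

```python
# Minimal sketch of the inline-tag DSL stage: separate [emotion]/[lang=..]/
# [speed=..] markers from the text. Illustrative, not the shipped parser.
import re

TAG = re.compile(r"\[([a-z_]+)(?:=([^\]\s]+))?\]")

def parse(s: str) -> tuple[str, list[tuple[int, str, str | None]]]:
    """Return plain text plus (char_offset, tag, value) control tokens."""
    tokens, pieces, pos = [], [], 0
    for m in TAG.finditer(s):
        pieces.append(s[pos:m.start()])
        tokens.append((sum(map(len, pieces)), m.group(1), m.group(2)))
        pos = m.end()
    pieces.append(s[pos:])
    return "".join(pieces), tokens

text, tokens = parse("[warm] Welcome back. [lang=es] Hola.")
# text   == " Welcome back.  Hola."
# tokens == [(0, 'warm', None), (15, 'lang', 'es')]
```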
The model is trained with a single objective that combines mel reconstruction, an adversarial vocoder loss, and a contrastive cross-language consistency term holding voice identity constant under language change. This consistency term is what lets the same voice switch from English to Hindi to Japanese mid-sentence without re-prompting.
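A hedged sketch of that combined objective in PyTorch; the loss weights and the InfoNCE form of the consistency term are our assumptions, not the released training code:

```python
# Sketch of the three-term objective: mel reconstruction + adversarial
# vocoder loss + contrastive cross-language voice-identity consistency.
# Weightings and the InfoNCE formulation are assumptions, not the paper's.
import torch
import torch.nn.functional as F

def total_loss(mel_pred, mel_gt, disc_fake_logits, spk_emb_a, spk_emb_b,
               lambda_adv: float = 1.0, lambda_con: float = 0.1,
               tau: float = 0.07) -> torch.Tensor:
    # 1) Mel reconstruction (L1 between predicted and reference mels).
    l_mel = F.l1_loss(mel_pred, mel_gt)
    # 2) Non-saturating generator loss against the vocoder discriminator.
    l_adv = F.softplus(-disc_fake_logits).mean()
    # 3) Cross-language consistency: embeddings of the same voice rendered
    #    in two languages ([batch, dim]) should match (InfoNCE over batch).
    a = F.normalize(spk_emb_a, dim=-1)
    b = F.normalize(spk_emb_b, dim=-1)
    logits = a @ b.t() / tau
    targets = torch.arange(a.size(0), device=a.device)
    l_con = F.cross_entropy(logits, targets)
    return l_mel + lambda_adv * l_adv + lambda_con * l_con
```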
05 Languages × Emotions support
Every voice supports every emotion in every language. In the coverage matrix, cells marked ● denote validated coverage on the held-out evaluation set, shown for 12 representative languages against the 18 emotion tokens. Nonverbal and prosody tokens (laugh, sigh, speed=, pitch=, …) are universally supported and omitted for brevity.
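As a quick illustration of how the tokens compose, the snippet below mixes an emotion tag with nonverbal and prosody tags in one input string. The tag names come from the lists above; the numeric value syntax for speed= is an assumption, so consult the catalog for the exact grammar:

```python
# Emotion + nonverbal + prosody tags composed in a single input string.
# The speed= value syntax is assumed here, not confirmed by the catalog.
text = (
    "[warm] It's been a long week. [sigh] "
    "[speed=0.9] Let's take this slowly. "
    "[excited] And then everything worked! [laugh]"
)
```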
06 Inference
The reference endpoint is OpenAI-compatible. A single call accepts mixed-language text and inline tags; the same voice identity is held across languages.
```python
from svara import Svara

client = Svara(api_key="sk-...")

# One call, four languages, one voice identity; emotion and language
# switches are inline tags inside the input text itself.
audio = client.audio.speech.create(
    model="svara-global-v1",
    voice="aria",
    input=(
        "[warm] Welcome back. "
        "[lang=es] Hola, me alegra verte de nuevo. "    # "Glad to see you again."
        "[lang=ja] [excited] 今日は素晴らしい一日ですね! "   # "What a wonderful day!"
        "[lang=hi] क्या मैं आपकी मदद कर सकती हूँ?"          # "May I help you?"
    ),
    response_format="mp3",
    stream=True,
)
audio.stream_to_file("welcome.mp3")
```
The sample is illustrative; the call above produces a single concatenated audio stream. See the cookbook for streaming examples over WebSocket and Server-Sent Events transports.
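Because the endpoint is OpenAI-compatible, the stock openai Python client also works when pointed at the deployment; the base_url below is a placeholder, not a documented address:

```python
# Streaming the same request through the official OpenAI Python client.
# base_url is a placeholder for your svara deployment, not a documented URL.
from openai import OpenAI

client = OpenAI(base_url="https://api.svara.ai/v1", api_key="sk-...")

with client.audio.speech.with_streaming_response.create(
    model="svara-global-v1",
    voice="aria",
    input="[warm] Welcome back. [lang=es] Hola, me alegra verte de nuevo.",
    response_format="mp3",
) as response:
    response.stream_to_file("welcome.mp3")
```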
07 Training data
- Public corpora — Common Voice 17, MLS, FLEURS, IndicTTS-Bharat, OpenSLR (filtered for license compatibility, ~31k hrs).
- In-house recordings — 9.4k hrs across 52 voice identities recorded in sound-isolated booths at 48 kHz / 24-bit, downsampled to 24 kHz mono for training.
- Synthetic augmentation — back-translated transcripts and code-switched mixtures to balance under-represented language pairs (~4.2k hrs); a toy sketch of the mixing follows this list.
- Held-out evaluation — 200 prompts × 50 languages × 18 emotion tags, recorded by professional voice actors not seen in training.
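As referenced in the synthetic-augmentation item above, a toy version of code-switched mixing simply interleaves aligned sentences from two languages; the production pipeline adds back-translation and language-pair balancing:

```python
# Toy code-switching augmentation: randomly pick the source or target
# rendering of each aligned sentence to synthesize mixed-language text.
import random

def code_switch(pairs: list[tuple[str, str]], p_switch: float = 0.5,
                seed: int = 0) -> str:
    rng = random.Random(seed)
    return " ".join(src if rng.random() >= p_switch else tgt
                    for src, tgt in pairs)

pairs = [("Welcome back.", "Bienvenido de nuevo."),
         ("How can I help you today?", "¿En qué puedo ayudarte hoy?")]
print(code_switch(pairs))
```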
08 Limitations & responsible use
- Voice cloning from arbitrary user audio is not enabled at the API layer. Custom voices require an enrollment workflow and signed consent.
- All generated audio carries a perceptually inaudible watermark recoverable by our public detector at >99% accuracy on unedited clips.
- Coverage of languages with < 50 hrs of training data (e.g. Yoruba, Sinhala, Khmer) is marked preview in the catalog and may show degraded MOS-similarity.
- The model has been red-teamed for prompt injection via inline tags; out-of-vocabulary tags fall through to neutral synthesis rather than failing closed.
09 Citation
```bibtex
@article{svara2026global,
  title   = {svara-global-v1: A 780M-parameter multilingual TTS model
             with inline emotion control},
  author  = {Path, K. and Chhabra, A. and D., S. and Iyer, R. and
             Okonkwo, T. and Tanaka, M. and others},
  journal = {arXiv preprint arXiv:2604.18271},
  year    = {2026},
  url     = {https://svara.ai/research/svara-global-v1},
  note    = {Open weights, Apache 2.0}
}
```