svara research / model card / svara-global-v1 · v1.0 · 2026-04-30 · Apache 2.0

svara-global-v1: a 780M-parameter multilingual TTS model with inline emotion control

Svara Research · K. Path, A. Chhabra, S. D., R. Iyer, T. Okonkwo, M. Tanaka, et al.
contact: research@svara.ai · released 2026-04-30 · weights mirror: hf://kenpath/svara-tts-v1

01 Abstract

We introduce svara-global-v1, a 780M-parameter neural text-to-speech model that produces natural, expressive speech across 50+ languages from a single voice identity, with inline [emotion] and prosody control issued in the same API call as the text.

Despite being roughly 4× smaller than comparable production systems, svara-global-v1 attains a Naturalness MOS of 4.42 and a head-to-head ELO of 1271, ranking above ElevenLabs Multilingual v2 (1248) and Cartesia Sonic (1235) on the public Voice Arena leaderboard at the time of writing. The model supports streaming inference at P50 = 187 ms first-byte latency and a real-time factor of 0.18 on a single A10 GPU. Weights are released under Apache 2.0, alongside an OpenAI-compatible HTTP endpoint and reference examples of code-switching among the 50+ supported languages within a single utterance.

02 Capabilities

Quantized INT8 weights run in < 512 MB of VRAM and are deployable on a single T4 with a real-time factor below 0.35. CPU inference is supported but not recommended for production streaming workloads.

03 Benchmarks

All numbers below are reproducible from the public evaluation harness in github.com/svara-ai/eval. MOS scores use the standard 5-point ITU-T P.808 protocol with 24 raters per sample over 200 prompts per language.
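As a sanity check on the harness output, a MOS can be recomputed from raw rater scores. The helper below is a simplified aggregation (per-prompt mean, then mean over prompts); ITU-T P.808 additionally specifies rater screening, trap questions, and confidence intervals, all omitted here.

```python
from statistics import mean

def mos(prompt_scores: list[list[int]]) -> float:
    """Mean Opinion Score: average each prompt's rater scores (1-5 scale),
    then average over prompts. prompt_scores[i] holds all rater scores
    for prompt i (24 raters per sample in the protocol above)."""
    per_prompt = [mean(scores) for scores in prompt_scores]
    return round(mean(per_prompt), 2)

# Two toy prompts with four raters each (the harness uses 200 prompts x 24 raters).
print(mos([[5, 4, 4, 5], [4, 4, 5, 4]]))  # -> 4.38
```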

3.1 Quality (MOS Naturalness, MOS Similarity, ELO)

Voice Arena, 2026-04 cycle. ELO computed over 12,400 pairwise judgments.
Model | Params | MOS-N ↑ | MOS-S ↑ | WER ↓ | ELO ↑

3.2 Latency by region (P50 / P90 / P99, ms)

First-byte latency for streaming TTS over HTTPS, measured from 9 PoPs against a 60-token prompt. Uniform sample of 10,000 requests per region across 24h.
Region | P50 | P90 | P99
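The P50/P90/P99 figures are plain order statistics over each region's 10,000-sample window. A minimal nearest-rank sketch, with toy latencies standing in for real measurements:

```python
import math

def percentile(samples_ms: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample >= p% of the distribution."""
    ranked = sorted(samples_ms)
    k = math.ceil(p / 100 * len(ranked))  # 1-based rank
    return ranked[k - 1]

# Toy first-byte latencies in ms; note how a single slow tail sample
# dominates P99 while leaving P50 untouched.
latencies = [150, 160, 170, 180, 190, 200, 250, 300, 400, 900]
print(percentile(latencies, 50), percentile(latencies, 90), percentile(latencies, 99))
# -> 190 400 900
```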

3.3 Throughput (Real-Time Factor)

RTF = synthesis time ÷ output audio duration. Lower is faster; RTF < 1 means real-time-capable.
Hardware | RTF | Streaming | Notes
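The RTF definition above is a one-line ratio; this sketch just makes the real-time threshold explicit:

```python
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: wall-clock synthesis time over output audio duration.
    RTF < 1 means synthesis can stay ahead of playback."""
    return synthesis_seconds / audio_seconds

# Producing 10 s of audio in 1.8 s matches the reported A10 figure.
print(round(rtf(1.8, 10.0), 2))  # -> 0.18
```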

04 Architecture

svara-global-v1 is a non-autoregressive transformer with a 780M-parameter acoustic decoder conditioned on a shared 50-language byte-pair tokenizer, a 256-dimensional voice embedding, and a discrete emotion / prosody token stream emitted by an in-text DSL parser. The vocoder is a 22M-parameter HiFi-GAN variant trained jointly with the decoder for matched 24 kHz output.

text + [emotion] + [lang]   ──▶  DSL parser
                                    │
                                    ▼
              ┌─────────────────────────────────────┐
              │ Tokenizer  (50-lang BPE, 64k vocab) │
              └────────────────┬────────────────────┘
                               ▼
       voice embedding ──▶  Conditioning stack ◀──  emotion/prosody tokens
                               ▼
              ┌─────────────────────────────────────┐
              │ Acoustic Decoder   (780M params)    │
              │ 24 layers · d_model 1024 · 16 heads │
              └────────────────┬────────────────────┘
                               ▼
                   HiFi-GAN vocoder (22M)
                               ▼
                       24 kHz mono PCM
      

The model is trained with a single objective combining mel reconstruction, adversarial vocoder loss, and a contrastive cross-language consistency term that holds voice identity constant under language change. This is what allows the same voice to switch from English to Hindi to Japanese mid-sentence without re-prompting.
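The in-text DSL parser at the front of the pipeline can be sketched as a bracket-tag scanner that carries emotion/language state across text segments. The grammar below (bare tags set emotion, key=value tags set named controls) is an illustrative assumption, not the shipped parser:

```python
import re

# Matches bracketed control tags such as [warm], [lang=es], [speed=1.2].
TAG = re.compile(r"\[([a-z]+)(?:=([^\]]+))?\]")

def parse(text: str) -> list[tuple[dict, str]]:
    """Split text into (state, segment) pairs, carrying state forward
    so a tag applies to all following text until overridden."""
    state = {"emotion": "neutral", "lang": "auto"}
    segments, pos = [], 0
    for m in TAG.finditer(text):
        chunk = text[pos:m.start()].strip()
        if chunk:
            segments.append((dict(state), chunk))
        key, value = m.group(1), m.group(2)
        if value is None:
            state["emotion"] = key   # bare tag: [warm], [excited]
        else:
            state[key] = value       # key=value tag: [lang=es]
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        segments.append((dict(state), tail))
    return segments

print(parse("[warm] Welcome back. [lang=es] Hola."))
```

Each emitted segment pairs the active control state with plain text, which is the shape the conditioning stack consumes in the diagram above.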

05 Languages × Emotions support

Every voice supports every emotion in every language. Marked cells denote validated coverage on the held-out evaluation set; below we show 12 representative languages against the 18 emotion tokens. Nonverbal and prosody tokens (laugh, sigh, speed=, pitch=, …) are universally supported and omitted for brevity.

06 Inference

The reference endpoint is OpenAI-compatible. A single call accepts mixed-language text and inline tags; the same voice identity is held across languages.

from svara import Svara

client = Svara(api_key="sk-...")

# Inline tags switch emotion and language mid-utterance;
# the "aria" voice identity is held constant throughout.
audio = client.audio.speech.create(
    model="svara-global-v1",
    voice="aria",
    input=(
        "[warm] Welcome back. "
        "[lang=es] Hola, me alegra verte de nuevo. "   # "Hello, glad to see you again."
        "[lang=ja] [excited] 今日は素晴らしい一日ですね! "  # "What a wonderful day it is!"
        "[lang=hi] क्या मैं आपकी मदद कर सकती हूँ?"        # "May I help you?"
    ),
    response_format="mp3",
    stream=True,  # chunks arrive as they are synthesized
)

audio.stream_to_file("welcome.mp3")

Sample is illustrative; the call above produces a single concatenated stream. See the cookbook for streaming examples with WebSocket and Server-Sent Events transports.
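Since the endpoint is OpenAI-compatible, any plain HTTP client can call it without the SDK. The sketch below only constructs the request body; the `/v1/audio/speech` path and field names follow the OpenAI speech-API convention and are assumptions about this deployment:

```python
import json

def speech_request(text: str, model: str = "svara-global-v1",
                   voice: str = "aria", fmt: str = "mp3") -> dict:
    """Build the JSON body for POST {base_url}/v1/audio/speech (OpenAI-style).
    Inline [emotion]/[lang=..] tags travel inside the `input` string untouched."""
    return {
        "model": model,
        "voice": voice,
        "input": text,
        "response_format": fmt,
        "stream": True,
    }

body = speech_request("[warm] Welcome back. [lang=es] Hola.")
print(json.dumps(body, indent=2))
```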

07 Training data

  • Public corpora — Common Voice 17, MLS, FLEURS, IndicTTS-Bharat, OpenSLR (filtered for license compatibility, ~31k hrs).
  • In-house recordings — 9.4k hrs across 52 voice identities recorded in sound-isolated booths at 48 kHz / 24-bit, downsampled to 24 kHz mono for training.
  • Synthetic augmentation — back-translated transcripts and code-switched mixtures to balance under-represented language pairs (~4.2k hrs).
  • Held-out evaluation — 200 prompts × 50 languages × 18 emotion tags, recorded by professional voice actors not seen in training.
Per-corpus license attestations are listed in DATASETS.md. We release training manifests but not raw audio for in-house recordings; voice talent consent forms are on file.

08 Limitations & responsible use

  • Voice cloning from arbitrary user audio is not enabled at the API layer. Custom voices require an enrollment workflow and signed consent.
  • All generated audio includes a perceptually inaudible watermark recoverable by our public detector at > 99% accuracy on unedited clips.
  • Coverage of languages with < 50 hrs of training data (e.g. Yoruba, Sinhala, Khmer) is marked preview in the catalog and may show degraded MOS-similarity.
  • The model has been red-teamed for prompt injection via inline tags; out-of-vocabulary tags fall through to neutral synthesis rather than failing closed.

09 Citation

@article{svara2026global,
  title   = {svara-global-v1: A 780M-parameter multilingual TTS model
             with inline emotion control},
  author  = {Path, K. and Chhabra, A. and D., S. and Iyer, R. and
             Okonkwo, T. and Tanaka, M. and others},
  journal = {arXiv preprint arXiv:2604.18271},
  year    = {2026},
  url     = {https://svara.ai/research/svara-global-v1},
  note    = {Open weights, Apache 2.0}
}