Announcing svara-global-v1 — open weights, 50+ languages, one voice.

Audio is the new UX.

svara-global-v1 is a 780M-parameter open-weights speech model that holds one voice across 50+ languages, with emotion as a first-class input. Apache 2.0. Sub-200ms streaming. Production-ready today.

April 30, 2026 — Svara Research
Asha 🇮🇳 · Hindi · Happy
Conversational, warm, code-switching
"Aaj subah maine ek nayi recipe try ki — turned out [happy] amazing!"

Aria 🇺🇸 · English · Expressive
Confident, dynamic, emotionally rich
"[stern] You promised you'd handle this. [pause] [softer] So please — just handle it."

Meera 🇮🇳 · Tamil · Surprised
Animated, lifelike, on-the-fly switching
"[gasp] Adhu unmaiyana? [excited] Inga vandhu sollu — naan nambela!"
780M parameters · 50+ languages, one voice · 187ms first-byte streaming · Apache 2.0 open weights
Speech generation builds trust through natural rhythm, emotion, and even humour.
— Svara Research, April 2026
CAPABILITIES

A small model that out-talks the giants.

svara-global-v1 is the first sub-billion-parameter TTS model to hold a single, recognisable voice across 50+ languages — without re-training, voice-bank stitching, or per-language fine-tunes. Inline tags steer emotion and prosody at sentence level.

01 / Multilingual
One voice. Fifty languages. One API call.
Switch languages mid-sentence and the speaker stays the same person. Hindi, Mandarin, French, Yoruba — same timbre, same identity. Built on shared phonetic conditioning rather than per-language voice banks.
[lang=en] Welcome back. [lang=hi] Aap kaise hain? [lang=es] Bienvenido.
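
As a sketch of what that looks like in code (the svara package, SvaraTTS client, and generate method below are hypothetical placeholders, not a published API; only the [lang=…] tag syntax comes from this release):

# Hypothetical client sketch; names are illustrative only.
from svara import SvaraTTS

tts = SvaraTTS(model="svara-global-v1", voice="asha")

# One request, three languages. The speaker stays the same because the
# voice is a single conditioning vector, not a per-language voice bank.
audio = tts.generate(
    "[lang=en] Welcome back. [lang=hi] Aap kaise hain? [lang=es] Bienvenido."
)
audio.save("welcome.wav")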
02 / Emotion control
Emotion as a first-class input.
34 inline tags — 18 emotions, 8 nonverbals, 8 prosody — drop directly into the text. No separate emotion model, no audio reference required. Tested on creative writing, agent calls, and audiobooks.
It was a long day. [sigh] But I made it through. [laugh]
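
With the same hypothetical client as above, emotion steering is just text in the prompt, with no reference clip and no second model:

# Emotion tags ([happy], [stern]), nonverbals ([sigh], [laugh]) and
# prosody tags ([pause]) drop inline, exactly as in the samples above.
from svara import SvaraTTS  # hypothetical client, as before

tts = SvaraTTS(model="svara-global-v1", voice="aria")
audio = tts.generate("It was a long day. [sigh] But I made it through. [laugh]")
audio.save("long_day.wav")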
03 / Sub-billion
780M parameters. 420MB quantised.
Runs in real time on a single A10 (RTF 0.18), T4 (RTF 0.31), or even CPU (RTF 0.74). Smaller than Whisper-Large, ~4× smaller than ElevenLabs Multilingual v2, with comparable naturalness scores.
[Chart: relative model size, Svara vs. ElevenLabs v2, Cartesia, and OpenAI]
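
For a sense of what those numbers buy: RTF is generation wall-clock divided by audio duration, so lower is faster and RTF < 1 beats real time. Plain arithmetic on the quoted figures:

# Hours of compute to render a 10-hour audiobook at each quoted RTF.
for device, rtf in [("A10", 0.18), ("T4", 0.31), ("CPU", 0.74)]:
    print(f"{device}: {10 * rtf:.1f} h for 10 h of audio "
          f"({1 / rtf:.1f}x real time)")
# A10: 1.8 h for 10 h of audio (5.6x real time)
# T4: 3.1 h for 10 h of audio (3.2x real time)
# CPU: 7.4 h for 10 h of audio (1.4x real time)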
04 / Open weights
Apache 2.0. Yours to fork.
Full weights on Hugging Face. Run on-prem, fine-tune for your domain, ship to air-gapped environments. No usage caps, no telemetry, no rug-pull risk. Used in production by 14 design partners across publishing, gaming, and IVR.
$ huggingface-cli download svara/global-v1
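
The same fetch from Python, using the stock huggingface_hub helper (repo ID as above; everything else is standard library behaviour):

# Programmatic equivalent of the CLI call; files are cached under
# ~/.cache/huggingface and the local snapshot path is returned.
from huggingface_hub import snapshot_download

checkpoint_dir = snapshot_download("svara/global-v1")
print(checkpoint_dir)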
VOICE LIBRARY

Twelve flagship voices.
One model under the hood.

Each voice is a single conditioning vector — every voice can speak every language, with every emotion, at every speed.
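
To make "one conditioning vector" concrete, a sketch (the load_voice helper and client are assumptions, as in the earlier examples; reusing a single embedding across languages and emotions is the claim above):

# One embedding, reused across language and emotion tags at will.
from svara import SvaraTTS, load_voice  # hypothetical helpers

tts = SvaraTTS(model="svara-global-v1")
meera = load_voice("meera")  # a single vector, not a per-language bank

for line in ("[lang=ta] [excited] Vanakkam!", "[lang=en] [happy] Good evening."):
    tts.generate(line, voice=meera)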

BENCHMARKS

Smaller. Faster. Preferred.

2,400 blind A/B comparisons from listeners in 14 countries, scored as pairwise win rates. svara-global-v1 (780M params) ranks above models 2–4× its size.

WIN-RATE — BLIND A/B PREFERENCE (n=2,400)
svara-global-v1 · 780M: 0.612
ElevenLabs Multilingual v2 · ~3B: 0.561
Cartesia Sonic · ~2B: 0.518
OpenAI TTS-1 HD: 0.474
Google Chirp 3: 0.408
A/B preference rate against a balanced panel of competitors, English + 9 multilingual buckets. Higher is better. Full method in the technical report.
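
For intuition on the headline number, a back-of-envelope 95% interval on the 0.612 win rate, assuming for illustration that all n=2,400 comparisons sit on a single pairing (the report's per-pairing counts may differ):

# Normal-approximation confidence interval for a binomial win rate.
import math

def win_rate_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

lo, hi = win_rate_ci(0.612, 2400)
print(f"{lo:.3f} .. {hi:.3f}")  # 0.593 .. 0.631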
  • MOS-Naturalness 4.42 / 5: within 0.04 of human reference recordings.
  • MOS-Similarity 4.18 / 5: a single voice maintains identity across 50+ languages.
  • WER 2.1%: lower than every closed-source competitor we tested.
  • P50 latency 187ms: streaming first-byte across 9 global regions.
  • RTF 0.18 on A10: about 5.5× real time on a $500/mo GPU.
CASCADED SPEECH-TO-SPEECH

Translate live conversations
without losing the speaker.

Pair Svara with any ASR. Identity carries across the chain — the listener hears the same person, in their language, in real time. Ideal for support, telehealth, and global product calls.

Sarah · English (customer, US) → svara-global-v1 (EN ⇄ JA, 187ms) → Ryo · Japanese (support, Tokyo)
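
In code, one relay turn of the EN → JA leg looks roughly like this; asr and translate stand in for whatever ASR and MT stack Svara is paired with, and tts is the same hypothetical client sketched earlier:

# Identity carries because the same voice vector conditions the
# Japanese synthesis Ryo hears; ~187ms to first byte per turn.
def relay_turn(audio_chunk, asr, translate, tts):
    text_en = asr(audio_chunk)                        # any streaming ASR
    text_ja = translate(text_en, src="en", dst="ja")  # any MT system
    return tts.generate(f"[lang=ja] {text_ja}")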
ARCHITECTURE

Three stages. One forward pass.

A compact transformer stack with shared cross-lingual conditioning, decoded into mel-spectrograms, then a small streaming vocoder. On consumer hardware, the full chain generates audio in well under real time.

STAGE 01
Multilingual tokenizer
A unified byte-level tokenizer with phoneme priors. The same input pipeline handles every language and script, including Devanagari, CJK, and right-to-left writing systems.
128k vocab · BPE + IPA priors
STAGE 02
Acoustic decoder
780M-parameter transformer with cross-lingual conditioning, voice-vector input, and inline emotion / prosody token slots. Trained on 480k hours of permissively licensed speech.
780M params · FP16 1.56GB · INT8 420MB
STAGE 03
Streaming vocoder
A lightweight HiFi-GAN successor outputs 24kHz audio in 40ms chunks. Tuned for jitter-free WebSocket streaming and low CPU usage.
24kHz · 40ms frames · 18MB
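
Putting the stages together, a shape-level sketch; all component names are illustrative, while the stage boundaries, the mel-spectrogram hand-off, and the 24kHz / 40ms framing come from the notes above:

# End-to-end chain, stage by stage, with placeholder components.
def synthesize(text, voice_vec, tokenizer, decoder, vocoder):
    # Stage 1: unified byte-level tokens with IPA priors; one input
    # pipeline for every script.
    tokens = tokenizer.encode(text)  # inline [lang=…]/emotion tags included

    # Stage 2: the 780M acoustic decoder, conditioned on the voice
    # vector, emits a mel-spectrogram in a single forward pass.
    mel = decoder(tokens, voice=voice_vec)

    # Stage 3: the streaming vocoder renders 24kHz audio in 40ms
    # chunks, yielded as soon as each one is ready.
    yield from vocoder.stream(mel, frame_ms=40)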
SOLUTIONS

Where teams ship Svara.

VOICE AGENTS
Sub-200ms voice agents
Plug into Pipecat, LiveKit, or Daily for agents that respond before the user expects. Hold accent + identity across the call.
PUBLISHING
Audiobook narration
SSML + emotion tags + chapter-aware long-form generation. Used for 1,200 commercial titles to date.
LOCALISATION
Dub into 50+ languages
Same speaker, same emotion, every locale. Drops dubbing turnaround from weeks to hours.
TELEPHONY
IVR & contact centres
SIP-native streaming, 99.94% uptime, regional residency. Replaces canned IVR with on-the-fly responses.

Build with svara-global-v1.

Open weights on Hugging Face, hosted API for the production path. Free credits to start; pay only for what you stream.
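
A streaming quickstart sketch against the hosted API; the endpoint URL and wire format below are placeholders rather than documented values, and only the 24kHz / 40ms chunking comes from the vocoder notes:

# Placeholder endpoint and schema; substitute values from your dashboard.
# Requires the websockets package (pip install websockets).
import asyncio
import websockets

async def stream_tts(text: str, out_path: str = "out.pcm") -> None:
    async with websockets.connect("wss://api.svara.example/v1/stream") as ws:
        await ws.send(text)
        with open(out_path, "wb") as f:
            async for chunk in ws:   # PCM audio arriving in ~40ms chunks
                f.write(chunk)

asyncio.run(stream_tts("[happy] Hello from svara-global-v1!"))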
