Announcing svara-global-v1 — open weights, 50+ languages, one voice.

Audio is the new UX.

svara-global-v1 is a 780M-parameter open-weights speech model that holds one voice across 50+ languages, with emotion as a first-class input. Apache 2.0. Sub-200ms streaming. Production-ready today.

April 30, 2026 — Svara Research
Asha 🇮🇳 · Hindi · Happy
Conversational, warm, code-switching
"Aaj subah maine ek nayi recipe try ki — turned out [happy] amazing!"

Aria 🇺🇸 · English · Expressive
Confident, dynamic, emotionally rich
"[stern] You promised you'd handle this. [pause] [softer] So please — just handle it."

Meera 🇮🇳 · Tamil · Surprised
Animated, lifelike, on-the-fly switching
"[gasp] Adhu unmaiyana? [excited] Inga vandhu sollu — naan nambela!"
780M parameters · 50+ languages, one voice · 187ms first-byte streaming · Apache 2.0 open weights
Speech generation builds trust through natural rhythm, emotion, and even humour.
— Svara Research, April 2026
CAPABILITIES

A small model that out-talks the giants.

svara-global-v1 is the first sub-billion-parameter TTS model to hold a single, recognisable voice across 50+ languages — without re-training, voice-bank stitching, or per-language fine-tunes. Inline tags steer emotion and prosody at sentence level.

01 / Multilingual
One voice. Fifty languages. One API call.
Switch languages mid-sentence and the speaker stays the same person. Hindi, Mandarin, French, Yoruba — same timbre, same identity. Built on shared phonetic conditioning rather than per-language voice banks.
[lang=en] Welcome back. [lang=hi] Aap kaise hain? [lang=es] Bienvenido.
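
As a sketch of what that looks like in code (the svara package, SvaraTTS client, and generate method below are hypothetical placeholders, not a published API; only the [lang=…] tag syntax comes from this release):

# Hypothetical client sketch; names are illustrative only.
from svara import SvaraTTS

tts = SvaraTTS(model="svara-global-v1", voice="asha")

# One request, three languages. The speaker stays the same because the
# voice is a single conditioning vector, not a per-language voice bank.
audio = tts.generate(
    "[lang=en] Welcome back. [lang=hi] Aap kaise hain? [lang=es] Bienvenido."
)
audio.save("welcome.wav")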
02 / Emotion control
Emotion as a first-class input.
34 inline tags — 18 emotions, 8 nonverbals, 8 prosody — drop directly into the text. No separate emotion model, no audio reference required. Tested on creative writing, agent calls, and audiobooks.
It was a long day. [sigh] But I made it through. [laugh]
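
With the same hypothetical client as above, emotion steering is just text in the prompt, with no reference clip and no second model:

# Emotion tags ([happy], [stern]), nonverbals ([sigh], [laugh]) and
# prosody tags ([pause]) drop inline, exactly as in the samples above.
from svara import SvaraTTS  # hypothetical client, as before

tts = SvaraTTS(model="svara-global-v1", voice="aria")
audio = tts.generate("It was a long day. [sigh] But I made it through. [laugh]")
audio.save("long_day.wav")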
03 / Sub-billion
780M parameters. 420MB quantised.
Runs in real time on a single A10 (RTF 0.18), T4 (RTF 0.31), or even CPU (RTF 0.74). Smaller than Whisper-Large, ~4× smaller than ElevenLabs Multilingual v2, with comparable naturalness scores.
[Chart: relative model size, Svara vs. ElevenLabs v2, Cartesia, and OpenAI]
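
For a sense of what those numbers buy: RTF is generation wall-clock divided by audio duration, so lower is faster and RTF < 1 beats real time. Plain arithmetic on the quoted figures:

# Hours of compute to render a 10-hour audiobook at each quoted RTF.
for device, rtf in [("A10", 0.18), ("T4", 0.31), ("CPU", 0.74)]:
    print(f"{device}: {10 * rtf:.1f} h for 10 h of audio "
          f"({1 / rtf:.1f}x real time)")
# A10: 1.8 h for 10 h of audio (5.6x real time)
# T4: 3.1 h for 10 h of audio (3.2x real time)
# CPU: 7.4 h for 10 h of audio (1.4x real time)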
04 / Open weights
Apache 2.0. Yours to fork.
Full weights on Hugging Face. Run on-prem, fine-tune for your domain, ship to air-gapped environments. No usage caps, no telemetry, no rug-pull risk. Used in production by 14 design partners across publishing, gaming, and IVR.
$ huggingface-cli download svara/global-v1
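
The same fetch from Python, using the stock huggingface_hub helper (repo ID as above; everything else is standard library behaviour):

# Programmatic equivalent of the CLI call; files are cached under
# ~/.cache/huggingface and the local snapshot path is returned.
from huggingface_hub import snapshot_download

checkpoint_dir = snapshot_download("svara/global-v1")
print(checkpoint_dir)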
VOICE LIBRARY

Twelve flagship voices.
One model under the hood.

Each voice is a single conditioning vector — every voice can speak every language, with every emotion, at every speed.
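
To make "one conditioning vector" concrete, a sketch (the load_voice helper and client are assumptions, as in the earlier examples; reusing a single embedding across languages and emotions is the claim above):

# One embedding, reused across language and emotion tags at will.
from svara import SvaraTTS, load_voice  # hypothetical helpers

tts = SvaraTTS(model="svara-global-v1")
meera = load_voice("meera")  # a single vector, not a per-language bank

for line in ("[lang=ta] [excited] Vanakkam!", "[lang=en] [happy] Good evening."):
    tts.generate(line, voice=meera)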

BENCHMARKS

Smaller. Faster. Preferred.

2,400 blind A/B comparisons from listeners in 14 countries, scored as pairwise win rates. svara-global-v1 (780M params) ranks above models 2–4× its size.

WIN-RATE — BLIND A/B PREFERENCE (n=2,400)
svara-global-v1 · 780M: 0.612
ElevenLabs Multilingual v2 · ~3B: 0.561
Cartesia Sonic · ~2B: 0.518
OpenAI TTS-1 HD: 0.474
Google Chirp 3: 0.408
A/B preference rate against a balanced panel of competitors, English + 9 multilingual buckets. Higher is better. Full method in the technical report.
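
For intuition on the headline number, a back-of-envelope 95% interval on the 0.612 win rate, assuming for illustration that all n=2,400 comparisons sit on a single pairing (the report's per-pairing counts may differ):

# Normal-approximation confidence interval for a binomial win rate.
import math

def win_rate_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

lo, hi = win_rate_ci(0.612, 2400)
print(f"{lo:.3f} .. {hi:.3f}")  # 0.593 .. 0.631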
  • MOS-Naturalness 4.42 / 5: within 0.04 of human reference recordings.
  • MOS-Similarity 4.18 / 5: a single voice maintains identity across 50+ languages.
  • WER 2.1%: lower than every closed-source competitor we tested.
  • P50 latency 187ms: streaming first-byte across 9 global regions.
  • RTF 0.18 on A10: about 5.5× real time on a $500/mo GPU.
CASCADED SPEECH-TO-SPEECH

Translate live conversations
without losing the speaker.

Pair Svara with any ASR. Identity carries across the chain — the listener hears the same person, in their language, in real time. Ideal for support, telehealth, and global product calls.

Sarah · English (customer, US) → svara-global-v1 (EN ⇄ JA, 187ms) → Ryo · Japanese (support, Tokyo)
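
In code, one relay turn of the EN → JA leg looks roughly like this; asr and translate stand in for whatever ASR and MT stack Svara is paired with, and tts is the same hypothetical client sketched earlier:

# Identity carries because the same voice vector conditions the
# Japanese synthesis Ryo hears; ~187ms to first byte per turn.
def relay_turn(audio_chunk, asr, translate, tts):
    text_en = asr(audio_chunk)                        # any streaming ASR
    text_ja = translate(text_en, src="en", dst="ja")  # any MT system
    return tts.generate(f"[lang=ja] {text_ja}")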
ARCHITECTURE

Three stages. One forward pass.

A compact transformer stack with shared cross-lingual conditioning, decoded into mel-spectrograms, then a small streaming vocoder. On consumer hardware, the full chain generates audio in well under real time.

STAGE 01
Multilingual tokenizer
A unified byte-level tokenizer with phoneme priors. The same input pipeline handles every language and script, including Devanagari, CJK, and right-to-left writing systems.
128k vocab · BPE + IPA priors
STAGE 02
Acoustic decoder
780M-parameter transformer with cross-lingual conditioning, voice-vector input, and inline emotion / prosody token slots. Trained on 480k hours of permissively licensed speech.
780M params · FP16 1.56GB · INT8 420MB
STAGE 03
Streaming vocoder
A lightweight HiFi-GAN successor outputs 24kHz audio in 40ms chunks. Tuned for jitter-free WebSocket streaming and low CPU usage.
24kHz · 40ms frames · 18MB
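
Putting the stages together, a shape-level sketch; all component names are illustrative, while the stage boundaries, the mel-spectrogram hand-off, and the 24kHz / 40ms framing come from the notes above:

# End-to-end chain, stage by stage, with placeholder components.
def synthesize(text, voice_vec, tokenizer, decoder, vocoder):
    # Stage 1: unified byte-level tokens with IPA priors; one input
    # pipeline for every script.
    tokens = tokenizer.encode(text)  # inline [lang=…]/emotion tags included

    # Stage 2: the 780M acoustic decoder, conditioned on the voice
    # vector, emits a mel-spectrogram in a single forward pass.
    mel = decoder(tokens, voice=voice_vec)

    # Stage 3: the streaming vocoder renders 24kHz audio in 40ms
    # chunks, yielded as soon as each one is ready.
    yield from vocoder.stream(mel, frame_ms=40)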
SOLUTIONS

Where teams ship Svara.

VOICE AGENTS
Sub-200ms voice agents
Plug into Pipecat, LiveKit, or Daily for agents that respond before the user expects. Hold accent + identity across the call.
PUBLISHING
Audiobook narration
SSML + emotion tags + chapter-aware long-form generation. Used for 1,200 commercial titles to date.
LOCALISATION
Dub into 50+ languages
Same speaker, same emotion, every locale. Drops dubbing turnaround from weeks to hours.
TELEPHONY
IVR & contact centres
SIP-native streaming, 99.94% uptime, regional residency. Replaces canned IVR with on-the-fly responses.

Build with svara-global-v1.

Open weights on Hugging Face, hosted API for the production path. Free credits to start; pay only for what you stream.
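
A streaming quickstart sketch against the hosted API; the endpoint URL and wire format below are placeholders rather than documented values, and only the 24kHz / 40ms chunking comes from the vocoder notes:

# Placeholder endpoint and schema; substitute values from your dashboard.
# Requires the websockets package (pip install websockets).
import asyncio
import websockets

async def stream_tts(text: str, out_path: str = "out.pcm") -> None:
    async with websockets.connect("wss://api.svara.example/v1/stream") as ws:
        await ws.send(text)
        with open(out_path, "wb") as f:
            async for chunk in ws:   # PCM audio arriving in ~40ms chunks
                f.write(chunk)

asyncio.run(stream_tts("[happy] Hello from svara-global-v1!"))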
