
XTTS-v2
Audio
XTTS-v2 (≈ 1.8 GB checkpoint, Coqui Public Model License, CPML)
Multilingual voice-cloning TTS you can run on a single mid-range GPU.
Zero-shot cloning. Feed it a voice clip of about 6 seconds and it mimics the speaker's tone, accent, and emotion, then speaks in any of 17 languages (EN, ES, FR, DE, IT, PT, PL, TR, RU, NL, CS, AR, ZH-CN, JA, HU, KO, HI).
Cross-language & style transfer. Keep the same speaker timbre while switching languages or emotional style; supports multi-reference mixing for smoother prosody.
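A minimal sketch of a cloning call through the Coqui TTS Python package, assuming it is installed (pip install TTS) and that the reference clips exist locally. The helper names build_request and synthesize, and all file names, are hypothetical; the model identifier and the tts_to_file keyword arguments follow the Coqui documentation as best understood, so verify them against the installed version.

```python
# Language codes from the list above, lowercased as the Coqui API expects.
XTTS_LANGUAGES = frozenset({
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi",
})

# Model identifier as published in the Coqui model catalog.
MODEL_NAME = "tts_models/multilingual/multi-dataset/xtts_v2"


def build_request(text, reference_clips, language, out_path):
    """Validate inputs and return keyword arguments for TTS.tts_to_file.

    Hypothetical helper: it only checks the language code and that at
    least one reference clip was given.
    """
    lang = language.lower()
    if lang not in XTTS_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    if not reference_clips:
        raise ValueError("at least one reference clip is required")
    return {
        "text": text,
        # Several clips of the same speaker may be passed as a list.
        "speaker_wav": [str(clip) for clip in reference_clips],
        "language": lang,
        "file_path": str(out_path),
    }


def synthesize(request, device="cuda"):
    """Load XTTS-v2 and write the cloned speech to request['file_path']."""
    # Imported lazily so build_request works without the TTS package.
    from TTS.api import TTS

    tts = TTS(MODEL_NAME).to(device)
    return tts.tts_to_file(**request)
```

Switching the output language while keeping the same speaker then only means changing the language argument, for example build_request("Hola, ¿qué tal?", ["speaker_a.wav", "speaker_b.wav"], "es", "hola.wav") followed by synthesize(...).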
Hardware reality. Weights are ~1.8 GB on disk; inference peaks at roughly 2-4 GB of VRAM and under 5 GB of system RAM, so the total stays below 10 GB even for book-length runs on an RTX 3090.
24 kHz output, streaming OK. Latency is around 200 ms, low enough for real-time chat; it supports chunked streaming as well as batched long-form synthesis.
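One way to prepare long-form input for chunked or batched synthesis is to pack whole sentences into bounded chunks before sending them to the model. The sketch below is an illustration, not XTTS-v2's own splitter: the function names and the 250-character limit are assumptions, and only the 24 kHz sample rate comes from the description above.

```python
import re

SAMPLE_RATE_HZ = 24_000  # output sample rate stated above


def split_into_chunks(text, max_chars=250):
    """Greedily pack whole sentences into chunks of at most max_chars.

    max_chars is an assumed limit, not a documented model constraint.
    A single sentence longer than max_chars is kept as one oversized
    chunk rather than being cut mid-sentence.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def duration_seconds(num_samples, sample_rate=SAMPLE_RATE_HZ):
    """Convert a count of output samples into playback time in seconds."""
    return num_samples / sample_rate
```

Each chunk can be synthesized on its own and played or concatenated in order; duration_seconds(len(samples)) gives the playback length of a generated chunk, which is handy for comparing synthesis time against real time.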
License heads-up. CPML permits non-commercial use only, for both the weights and the generated audio; commercial use needs a separate license from Coqui, so read the terms before wiring it into paid APIs.
Why pick it for Norman AI?
XTTS-v2 fits high-quality, cross-language speech synthesis into a sub-10 GB memory envelope, which suits adding voice chat, multilingual dubbing, or branded voice avatars to our stack without new infrastructure or complex data collection.
