Green Fern

XTTS-v2

Audio

XTTS-v2 (≈ 1.8 GB checkpoint, Coqui-Public-Model-License)

Multilingual voice-cloning TTS you can run on a single mid-range GPU.

  • Zero-shot cloning. Feed a 6-second voice clip and it mimics tone, accent, and emotion—then speaks in any of 17 languages (EN, ES, FR, DE, IT, PT, PL, TR, RU, NL, CS, AR, ZH-CN, JA, HU, KO, HI).

  • Cross-language & style transfer. Keep the same speaker timbre while switching languages or emotional style; supports multi-reference mixing for smoother prosody.

  • Hardware reality. Weights are ~1.8 GB on disk; inference peaks at roughly 2-4 GB of VRAM and under 5 GB of system RAM, so the combined footprint stays below 10 GB even for book-length runs on an RTX 3090.

  • 24 kHz output, streaming OK. Streaming latency is ~200 ms, low enough for real-time chat; supports chunked streaming or batched long-form synthesis.

  • License heads-up. CPML restricts the model and its outputs to non-commercial use; read the terms, and confirm whether a separate commercial license applies, before wiring it into paid APIs.
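The zero-shot cloning flow in the list above can be sketched with the open-source Coqui `TTS` Python package, run locally rather than through Norman. This is a minimal sketch under assumptions: the package is installed (`pip install TTS`), a CUDA GPU is available, and the helper names `normalize_language` and `clone_and_speak` and the file paths are illustrative, not part of any official API.

```python
# Language codes from the list above, lowercased as the Coqui TTS package expects them.
SUPPORTED_LANGUAGES = (
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi",
)


def normalize_language(code: str) -> str:
    """Lowercase a language code and check it against the supported list."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {code!r}")
    return normalized


def clone_and_speak(text, reference_wav, language,
                    out_path="output.wav", device="cuda"):
    """Synthesize `text` in the voice of `reference_wav` (a short, clean clip).

    Hypothetical helper: names and defaults are illustrative.
    """
    # Imported lazily so normalize_language works without the TTS package.
    from TTS.api import TTS

    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
    tts.tts_to_file(
        text=text,
        speaker_wav=reference_wav,
        language=normalize_language(language),
        file_path=out_path,
    )
    return out_path
```

A call such as `clone_and_speak("Bonjour tout le monde.", "my_voice.wav", "FR")` would then write French speech in the reference speaker's voice to `output.wav`; the same reference clip can be reused with any other supported language code.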

Why pick it for Norman AI?

XTTS-v2 drops high-quality, cross-language speech synthesis into a sub-10 GB envelope—perfect for adding voice chat, multilingual dubbing, or branded voice avatars to our stack without new infra or complex data collection.

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant",
     "content": "Sure! Here are some ways to eat bananas and dragonfruits together"},
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]

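# Assumes `norman` is an already-initialized client and that this call runs inside an async function.
response = await norman.invoke(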
    {
        "model_name": "phi-4",
        "inputs": [
            {
                "display_title": "Prompt",
                "data": messages
            }
        ]
    }
)
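For the batched long-form synthesis and 24 kHz output mentioned in the feature list, a common preparation step is to split long text into sentence-aligned chunks and to convert latency budgets into sample counts. The sketch below is one way to do that in plain Python; `chunk_text`, `samples_for`, and the 250-character default are illustrative assumptions, not values taken from XTTS-v2 or Norman.

```python
import re

SAMPLE_RATE_HZ = 24_000  # output rate stated in the feature list


def samples_for(duration_ms: int, sample_rate: int = SAMPLE_RATE_HZ) -> int:
    """Number of audio samples covering `duration_ms` milliseconds."""
    return sample_rate * duration_ms // 1000


def chunk_text(text: str, max_chars: int = 250) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole rather than cut mid-word.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized on its own and the resulting audio concatenated; at 24 kHz, `samples_for(200)` shows that a 200 ms budget corresponds to 4,800 samples.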