Green Fern

Whisper Large-v3

Audio

Whisper Large-v3 (1.55 B params, Apache-2.0)

OpenAI’s top-tier speech-to-text model—more accurate, still fits on one card.

  • Spec sheet. Same core architecture as Large-v2, but with 128-Mel-bin inputs and a new Cantonese language token; trained on 1 M hours of weakly labeled plus 4 M hours of pseudo-labeled audio, with a 30 s receptive field.

  • Best-in-class accuracy. Cuts word error rate by 10–20 % relative to Large-v2; roughly 2 % WER on LibriSpeech test-clean, ~3.9 % on test-other.

  • Runs on modest GPUs. FP16 weights ≈3 GB; expect 6–10 GB VRAM for real-time use, or squeeze to 4 GB with 4-bit/ggml builds.

  • Full multilingual + translation. Auto-detects 99 languages and can output English translations out-of-the-box.
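The memory figures in the spec bullets follow directly from the parameter count; a back-of-the-envelope sketch (assuming 2 bytes per weight at FP16 and 0.5 bytes for 4-bit builds, and ignoring activations, decoding state, and framework overhead):

```python
# Rough VRAM estimate for the model weights alone. Activations and
# framework overhead add several more GB during real-time inference,
# which is where the 6-10 GB working figure comes from.
PARAMS = 1.55e9  # Whisper Large-v3 parameter count

def weight_gb(bytes_per_param: float) -> float:
    """Raw weight size in GiB at a given precision."""
    return PARAMS * bytes_per_param / 1024**3

fp16 = weight_gb(2.0)   # roughly 2.9 GiB, matching the "~3 GB" figure
int4 = weight_gb(0.5)   # well under 1 GiB, which is why 4-bit builds fit 4 GB cards

print(f"FP16 weights: {fp16:.1f} GiB, 4-bit weights: {int4:.1f} GiB")
```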

Why pick it for Norman AI?

State-of-the-art transcription and on-the-fly translation for calls, podcasts, or video captions—no extra infra, permissive license, and small enough to co-host with your other micro-services on a single A10G.

# `norman` is the Norman AI SDK client; `invoke` is a coroutine, so this
# call must run inside an async function (e.g. driven by asyncio.run).
response = await norman.invoke(
    {
        "model_name": "whisper-large-v3",
        "inputs": [
            {
                "display_title": "Audio File",
                "data": "/Users/alice/Desktop/global_conference.wav"
            },
            {
                "display_title": "Task",
                "data": "transcribe"
            },
            {
                "display_title": "Output Language",
                "data": "english"
            }
        ]
    }
)
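For on-the-fly translation rather than same-language transcription, the same call shape should work with the Task input switched to translate mode. A sketch only: the field names are copied from the example above, and the `translate` keyword follows Whisper's own transcribe/translate task convention (Whisper's translate mode always outputs English).

```python
# Hypothetical payload variant: identical fields to the transcription
# example above, with the Task input set to Whisper's "translate" mode.
# The Output Language input is dropped, since translate always targets English.
payload = {
    "model_name": "whisper-large-v3",
    "inputs": [
        {"display_title": "Audio File",
         "data": "/Users/alice/Desktop/global_conference.wav"},
        {"display_title": "Task", "data": "translate"},
    ],
}
# response = await norman.invoke(payload)
```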


See our SDK