
Whisper Large-v3 (1.55 B params, Apache-2.0)
OpenAI’s top-tier speech-to-text model: more accurate than Large-v2, and it still fits on a single GPU.
Spec sheet. Same encoder-decoder core as Large-v2, but with 128-Mel-bin spectrogram inputs and a new Cantonese language token; trained on 1 M h of weakly labeled plus 4 M h of pseudo-labeled audio, with a 30 s receptive field.
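You can sanity-check those input specs yourself: the Hugging Face feature extractor for the public openai/whisper-large-v3 checkpoint exposes them directly (a minimal sketch, assuming transformers is installed and the checkpoint can be downloaded):

```python
# Minimal sketch: inspect Whisper Large-v3's input spec via Hugging Face
# transformers (assumes `pip install transformers` and network access).
from transformers import WhisperFeatureExtractor

fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")

print(fe.feature_size)   # 128   -> the 128 Mel bins new in v3 (v2 used 80)
print(fe.chunk_length)   # 30    -> seconds per window, the 30 s receptive field
print(fe.sampling_rate)  # 16000 -> expected input sample rate in Hz
```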
Best-in-class accuracy. Cuts word-error-rate by 10–20 % relative to Large-v2; roughly 2 % WER on LibriSpeech test-clean and 3.9 % on test-other.
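For reference, WER is just word-level edit distance divided by the reference word count. A quick sketch with the jiwer package (our pick for illustration, not something Whisper ships) shows the arithmetic:

```python
# Sketch: computing WER for a transcript with the jiwer package
# (an illustrative choice; any WER implementation gives the same number).
from jiwer import wer

reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + deletions + insertions) / reference word count
print(f"WER: {wer(reference, hypothesis):.2%}")  # 2 substitutions / 9 words ≈ 22%
```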
Runs on modest GPUs. FP16 weights are ≈3 GB; expect 6–10 GB of VRAM for real-time use, or squeeze to ~4 GB with 4-bit quantized ggml builds such as whisper.cpp.
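A single-GPU FP16 setup is a few lines with the transformers ASR pipeline (a sketch; meeting.wav is a placeholder for your audio file, and a CUDA device is assumed):

```python
# Sketch: FP16 transcription on one GPU via the transformers ASR pipeline.
# "meeting.wav" is a placeholder path; chunk_length_s=30 matches the model's
# native window for long-form audio.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,  # ≈3 GB of weights in half precision
    device="cuda:0",
    chunk_length_s=30,
)

result = asr("meeting.wav", return_timestamps=True)
print(result["text"])
```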
Full multilingual + translation. Auto-detects the spoken language across ~100 supported languages (Large-v2’s 99 plus the new Cantonese token) and can output English translations out of the box.
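Leaving the language unset triggers auto-detection, and translation to English is a single generation flag (a sketch; clip_fr.wav is a hypothetical French recording):

```python
# Sketch: auto language detection plus translate-to-English with the
# transformers ASR pipeline. "clip_fr.wav" is a hypothetical French clip.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
)

# No language pinned: Whisper detects it from the first 30 s window.
# task="translate" makes the decoder emit English regardless of source.
out = asr("clip_fr.wav", generate_kwargs={"task": "translate"})
print(out["text"])
```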
Why pick it for Norman AI?
State-of-the-art transcription and on-the-fly translation for calls, podcasts, or video captions—no extra infra, permissive license, and small enough to co-host with your other micro-services on a single A10G.
See our SDK
