
wav2vec2-base-960h
Audio
wav2vec2-base-960h (94 M params, Apache-2.0)
Open-source ASR that lives happily on a laptop CPU.
Spec sheet. 12-layer, 768-dim transformer sitting on a small CNN front-end; trained self-supervised, then fine-tuned on 960 h of LibriSpeech 16 kHz audio.
Benchmark accuracy. About 3.4 % WER on LibriSpeech test-clean and 8.6 % on test-other with plain greedy decoding (no language model), which is good enough for production English captions on clean, read-style speech.
Tiny footprint. FP32 weights are ≈380 MB (≈190 MB in FP16); 4-bit quantization drops below 100 MB, so real-time CPU or edge-GPU inference is practical.
English-only. This model's tokenizer covers just the letters A-Z plus an apostrophe; for other languages use a multilingual model such as XLS-R.
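As a rough illustration of what "plain greedy decoding" and WER mean for a letter-level vocabulary like this one, the sketch below picks the best token per frame, collapses repeats, drops the CTC blank, and then scores the result against a reference. The vocabulary order, the `<pad>` blank, the `|` word separator, and the helper names are assumptions made for the example, not values read from the checkpoint.

```python
import string

# Illustrative vocabulary (assumed order, not the checkpoint's actual indices).
BLANK = "<pad>"     # CTC blank symbol
WORD_SEP = "|"      # word-boundary token
VOCAB = [BLANK, WORD_SEP, "'"] + list(string.ascii_uppercase)


def greedy_ctc_decode(frame_scores):
    """Decode per-frame scores (one list of len(VOCAB) floats per frame)."""
    # Step 1: take the highest-scoring token index in every frame.
    best = [max(range(len(s)), key=s.__getitem__) for s in frame_scores]
    # Step 2: collapse consecutive repeats; step 3: drop blanks.
    chars, prev = [], None
    for idx in best:
        if idx != prev and VOCAB[idx] != BLANK:
            chars.append(VOCAB[idx])
        prev = idx
    return "".join(chars).replace(WORD_SEP, " ").strip()


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def one_hot(token):
    """Hypothetical helper: a frame that puts all its score on one token."""
    return [1.0 if t == token else 0.0 for t in VOCAB]


frames = [one_hot(t) for t in
          ["H", "H", BLANK, "I", "I", WORD_SEP, "Y", "O", "O", BLANK, "U"]]
hypothesis = greedy_ctc_decode(frames)
print(hypothesis, word_error_rate("HI YOU", hypothesis))  # prints: HI YOU 0.0
```

Read this way, a 3.4 % WER works out to roughly 34 word-level errors per 1,000 reference words.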
Why pick it for Norman AI?
We can bolt on speech-to-text for calls, voice notes, or video transcription without new GPUs or license headaches, and the Apache-2.0 license on the weights permits commercial use.
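Before sizing hardware, the weight footprint can be estimated as parameters × bits per parameter ÷ 8. The sketch below applies that to the 94 M parameter count quoted above. It ignores activation memory and the per-block scale overhead that real quantization formats add, so treat the results as lower bounds. By this estimate, roughly 380 MB corresponds to 32-bit storage, with 16-bit at about half that. The function name and the list of precisions are illustrative assumptions.

```python
PARAMS = 94_000_000  # parameter count quoted in the card


def weight_size_mb(num_params, bits_per_param):
    """Raw weight storage in megabytes (1 MB = 1e6 bytes), no overhead."""
    return num_params * bits_per_param / 8 / 1e6


# Assumed precisions for comparison; quantized formats add some metadata
# (scales, zero points) that this estimate leaves out.
sizes = {
    name: weight_size_mb(PARAMS, bits)
    for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]
}

for name, mb in sizes.items():
    print(f"{name}: {mb:.0f} MB")
```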
