Green Fern

wav2vec2-base-960h

Audio

wav2vec2-base-960h (94 M params, Apache-2.0)

Open-source ASR that lives happily on a laptop CPU.

  • Spec sheet. 12-layer, 768-dim transformer sitting on a small CNN front-end; trained self-supervised, then fine-tuned on 960 h of LibriSpeech 16 kHz audio.

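  • Real-world accuracy. ~3.4 % WER on LibriSpeech test-clean and ~8.6 % on test-other with plain greedy CTC decoding and no language model. That is good enough for production English captions on reasonably clean audio; expect higher error rates on noisy or conversational speech, since LibriSpeech is read audiobook material.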

  • Tiny footprint. FP32 weights are ≈380 MB (94 M params × 4 bytes), or ≈190 MB in FP16; 4-bit quantization drops below 100 MB, so real-time CPU or edge-GPU inference is easy.

  • English-only. For other languages use XLS-R; this model’s tokenizer is just letters + apostrophe.
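
The "plain greedy decoding" mentioned above can be sketched in a few lines of pure Python. This is a minimal illustration, not the model's actual tokenizer: the `VOCAB` list below is a simplified, assumed layout (blank token at index 0, `|` as the word delimiter, then apostrophe and uppercase letters), and the input frames are hand-made one-hot scores rather than real model output.

```python
from itertools import groupby

# Assumed, simplified vocabulary for illustration only:
# index 0 is the CTC blank, "|" marks a word boundary.
VOCAB = ["<pad>", "|", "'"] + [chr(c) for c in range(ord("A"), ord("Z") + 1)]
BLANK_ID = 0


def greedy_ctc_decode(logits):
    """Decode a list of frames, each a list of per-token scores."""
    # 1. Take the highest-scoring token id in every frame.
    best_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    # 2. Collapse consecutive repeats, then 3. drop blank tokens.
    collapsed = [tok for tok, _ in groupby(best_ids) if tok != BLANK_ID]
    # 4. Map ids to characters and turn the word delimiter into a space.
    return "".join(VOCAB[i] for i in collapsed).replace("|", " ").strip()


def one_hot_frames(symbols):
    """Hypothetical helper: build fake per-frame scores from symbols."""
    frames = []
    for symbol in symbols:
        frame = [0.0] * len(VOCAB)
        frame[VOCAB.index(symbol)] = 1.0
        frames.append(frame)
    return frames


# The blank between the two L frames is what keeps the double letter.
frames = one_hot_frames(
    ["H", "H", "E", "L", "L", "<pad>", "L", "O", "O", "|", "Y", "O", "U", "<pad>"]
)
print(greedy_ctc_decode(frames))
```

Because the output alphabet is only letters, apostrophe, and a word delimiter, the transcript comes back as unpunctuated text; casing and punctuation would need a separate post-processing step.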

Why pick it for Norman AI?

We can bolt on speech-to-text for calls, voice notes, or video transcriptions without new GPUs or license headaches: the Apache-2.0 license permits commercial use.

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant",
     "content": "Sure! Here are some ways to eat bananas and dragonfruits together"},
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]

response = await norman.invoke(
    {
        "model_name": "phi-4",
        "inputs": [
            {
                "display_title": "Prompt",
                "data": messages
            }
        ]
    }
)