
Llama-3.2-3B
Text
Llama 3.2 3B (3.21 B params, Llama 3.2 Community License)
Compact Meta transformer that fits a 128 k-token context window into an 8 GB GPU budget.
Long-context, low-footprint. The full-precision model handles contexts up to 128 k tokens using grouped-query attention (GQA); BF16 weights clock in at ≈7.4 GB of VRAM, while the 4-bit SpinQuant/QLoRA builds shrink below 4 GB for laptops and edge devices (see the loading sketch after this list).
Multilingual by default. Officially supports eight languages (EN, DE, FR, IT, PT, ES, HI, TH), with broader coverage from pretraining, making it a solid base for dialogue, retrieval, and summarization agents.
Punches above its weight. The instruction-tuned variant hits 63.4 MMLU (5-shot) and 78.6 ARC-C (0-shot), rivaling many 7-13 B open models.
Plug-and-play ops. Works out of the box with transformers >= 4.43, Meta's reference llama codebase, vLLM, llama.cpp, Ollama, and the SpinQuant mobile builds (see the usage sketch at the end of this section).
License heads-up. You must display "Built with Llama" attribution, and products exceeding 700 M monthly active users (MAU) need a separate license from Meta; that makes it more restrictive than Apache-2.0.
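
To make the low-footprint claim concrete, here is a minimal loading sketch using transformers with a bitsandbytes 4-bit (NF4) config. NF4 is a stand-in for the SpinQuant builds named above (a different quantization scheme), and the Hugging Face model ID and settings below are illustrative assumptions, not a vetted deployment recipe.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"  # assumed HF repo ID

    # 4-bit NF4 quantization keeps weight memory well under 4 GB.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=bnb_config,
        device_map="auto",  # place layers on the available GPU(s)
    )

Note that actual VRAM use also depends on context length: the KV cache grows with token count even when the weights are quantized.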
Why pick it for Norman AI?
Near-7 B performance, a 128 k context window, and multilingual reach on a single-card budget let us ship long-memory chat, retrieval agents, or edge inference tiers without straining our infrastructure or our power bill.
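
As a quick smoke test of the plug-and-play claim, here is a minimal chat sketch through the transformers text-generation pipeline (transformers >= 4.43, as noted above); the model ID, prompt, and generation settings are illustrative assumptions.

    from transformers import pipeline

    chat = pipeline(
        "text-generation",
        model="meta-llama/Llama-3.2-3B-Instruct",  # assumed HF repo ID
        device_map="auto",
    )

    messages = [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "In one sentence, why pair long context with a small model?"},
    ]

    # The pipeline accepts chat messages directly and returns the full
    # conversation; the last entry is the assistant reply.
    result = chat(messages, max_new_tokens=96)
    print(result[0]["generated_text"][-1]["content"])

For serving, the same model can be exposed through vLLM's OpenAI-compatible server or run as a GGUF conversion in llama.cpp/Ollama; only the loading layer changes.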
