Green Fern

Llama-3.2-3B

Text

Llama-3.2 3B (3.21 B params, Llama 3.2 Community License)

Compact Meta transformer that packs a 128 k-token context window into an 8 GB GPU budget.

  • Long-context, low-footprint. The full-precision model reads up to 128 k tokens thanks to grouped-query attention (GQA); BF16 weights clock in at ≈7.4 GB VRAM, while 4-bit SpinQuant/QLoRA builds shrink below 4 GB for laptops and edge devices (see the loading sketch after this list).

  • Multilingual by default. Officially supports eight core languages (EN, DE, FR, IT, PT, ES, HI, TH), with pretraining on a broader set, making it a solid base for dialogue, retrieval, and summarization agents.

  • Punches above its weight. The instruction-tuned variant hits 63.4 MMLU (5-shot) and 78.6 ARC-C (0-shot), rivaling many 7-13 B open models.

  • Plug-and-play ops. Works out of the box with transformers >= 4.43, Meta's reference llama repo, vLLM, llama.cpp, Ollama, and the SpinQuant mobile builds.

  • License heads-up. You must display “Built with Llama” and obtain separate permission if your products exceed 700 M monthly active users (MAU), making the terms more restrictive than Apache-2.0.
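
For local experiments, here is a minimal loading sketch with Hugging Face transformers covering both footprints from the bullets above. It assumes the public meta-llama checkpoint on the Hugging Face Hub (gated behind the license click-through) and the bitsandbytes backend for the optional 4-bit path; the prompt and generation settings are illustrative, not prescribed.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-3B-Instruct"

# Full BF16 weights: roughly the 7.4 GB VRAM figure quoted above.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Alternative: 4-bit NF4 quantization to fit under ~4 GB (laptops, edge boxes).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# model = AutoModelForCausalLM.from_pretrained(
#     model_id, quantization_config=bnb, device_map="auto"
# )

tokenizer = AutoTokenizer.from_pretrained(model_id)
chat = [{"role": "user", "content": "Summarize grouped-query attention in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    chat, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))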

Why pick it for Norman AI?

Near-7 B performance, a 128 k context window, and multilingual reach on a single-card budget let us ship long-memory chat, retrieval agents, or edge inference tiers without straining our infrastructure or our power bill.


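To call the model through Norman instead, the snippet below replays a multi-turn chat. This is a sketch under two assumptions: an async context with a configured norman client already in scope, and the checkpoint registered under the identifier llama-3.2-3b (adjust to match your deployment).
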
# Assumes an async context (e.g. inside an async def) and a configured
# Norman client named `norman` in scope.
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together"},
    {"role": "user", "content": "What about solving the equation 2x + 3 = 7?"},
]

response = await norman.invoke(
    {
        # Identifier assumed; use the name this checkpoint is registered under.
        "model_name": "llama-3.2-3b",
        "inputs": [
            {
                "display_title": "Prompt",
                "data": messages
            }
        ]
    }
)
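
The structure of response depends on the Norman client's return schema, so inspect it once before wiring any downstream parsing.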