Green Fern

Qwen3-4B


Qwen3-4B (4B params, Apache-2.0)

Small-footprint transformer that punches above its weight.

  • Dual-mode reasoning. Toggle enable_thinking to get full chain-of-thought for hard math/code problems, or turn it off for fast, lightweight chat. One model, two personas.

  • Spec sheet. 36 layers, grouped-query attention (32 Q / 8 KV heads), native 32k-token window; stretch to 131k with YaRN if you need it.

  • Runs almost anywhere. Full-precision needs ~8-16 GB of VRAM; 4-bit quant drops below 4 GB, so laptops and CPU boxes are fine.

  • What it’s good at. Strong logical reasoning, code generation, multilingual chat (100+ languages), and tool/agent calls; it beats the older Qwen 2.5 line in human evals.

  • Plug-and-play. Works out-of-the-box with transformers >= 4.51, vLLM, SGLang, Ollama, llama.cpp, LMStudio, etc.—just from_pretrained("Qwen/Qwen3-4B") and go.
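A note on consuming the dual-mode output: with thinking enabled, Qwen3 emits its chain-of-thought between `<think>` and `</think>` tags ahead of the final answer, and with thinking disabled the tags are absent. A minimal sketch of separating the two (the helper name is ours, not part of any library):

```python
def split_thinking(text: str) -> tuple[str, str]:
    """Split Qwen3 output into (chain-of-thought, final answer).

    In thinking mode the reasoning arrives wrapped in <think>...</think>
    before the visible reply; in non-thinking mode there are no tags and
    the whole string is the answer.
    """
    marker = "</think>"
    if marker in text:
        thinking, answer = text.split(marker, 1)
        return thinking.replace("<think>", "").strip(), answer.strip()
    return "", text.strip()

# Thinking mode: reasoning and answer come apart cleanly.
cot, ans = split_thinking("<think>2x = 4, so x = 2.</think>x = 2")
print(cot)  # 2x = 4, so x = 2.
print(ans)  # x = 2
```

With `enable_thinking` off, the same function passes the reply through untouched, so downstream code can treat both modes uniformly.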
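The spec-sheet and VRAM figures above can be sanity-checked with back-of-envelope arithmetic. Layer and head counts come from the bullets; the head dimension of 128 is an assumption on our part (the spec sheet does not list it), so treat the KV-cache numbers as rough:

```python
# Back-of-envelope memory math for Qwen3-4B (36 layers, 32 Q / 8 KV heads).
PARAMS = 4e9
LAYERS, Q_HEADS, KV_HEADS = 36, 32, 8
HEAD_DIM = 128          # assumption: not stated in the spec sheet above
FP16_BYTES = 2

def weight_gb(bits_per_param: float) -> float:
    """Raw weight storage in decimal GB (excludes KV cache and activations)."""
    return PARAMS * bits_per_param / 8 / 1e9

def kv_cache_bytes(tokens: int, kv_heads: int) -> int:
    """fp16 KV cache: K and V tensors per layer, per cached token."""
    return 2 * LAYERS * kv_heads * HEAD_DIM * FP16_BYTES * tokens

# Weights: fp32/fp16 bracket the "~8-16 GB" claim; 4-bit lands near 2 GB,
# leaving headroom under the 4 GB figure for runtime overhead.
print(f"fp32 {weight_gb(32):.0f} GB, fp16 {weight_gb(16):.0f} GB, "
      f"4-bit {weight_gb(4):.0f} GB")

# Grouped-query attention: 8 KV heads instead of 32 cuts the cache 4x.
gqa = kv_cache_bytes(32_768, KV_HEADS)
mha = kv_cache_bytes(32_768, Q_HEADS)
print(f"KV cache @32k tokens: {gqa / 2**30:.1f} GiB (vs {mha / 2**30:.1f} GiB full MHA)")
```

Under these assumptions the weights alone are ~8 GB in fp16 and ~2 GB at 4-bit, consistent with the hardware claims above.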

Why pick it for Norman AI?

You get open licensing, solid reasoning, long-context support, and minimal hardware demands—all in one model you can ship today without bending your infra.


# Multi-turn conversation in the OpenAI-style chat format.
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant",
     "content": "Sure! Here are some ways to eat bananas and dragonfruits together:"},
    {"role": "user", "content": "What about solving the equation 2x + 3 = 7?"},
]

# invoke is asynchronous: call it from inside an async function
# (or wrap it with asyncio.run).
response = await norman.invoke(
    {
        # Model identifier for the Qwen3-4B deployment this page describes;
        # adjust to however your Norman instance registers the model.
        "model_name": "qwen3-4b",
        "inputs": [
            {
                "display_title": "Prompt",
                "data": messages
            }
        ]
    }
)