Qwen3-VL-4B-Instruct (Qwen)

Compact vision-language model for understanding images and text together.

  • Multimodal by design. A ~4B-parameter model from the Qwen3 family that handles text + image inputs in a single instruction-tuned model (a sketch of such a mixed input follows this list).

  • Sees and reasons. Can describe images, answer visual questions, extract details, and combine visual context with text instructions.

  • Instruction-tuned. Optimized for chat-style prompts and structured tasks, making its outputs more predictable than those of base, non-instruction-tuned VLMs.

  • Efficient footprint. Much smaller than large multimodal models, making it practical for single-GPU or optimized deployments.

  • Simple integration. Single model, standard inputs, no separate vision encoder setup required.
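
As a sketch of what a mixed text-and-image input can look like, the message below pairs an image with a question about it. The content-part layout (typed "image_url" and "text" entries) follows the common OpenAI-style chat convention and is an assumption, not a confirmed Norman schema; check the Norman docs for the exact field names.

# Hypothetical multimodal message. The image_url/text content-part
# schema below is assumed; adjust field names to match Norman's API.
vision_messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/receipt.png"}},
            {"type": "text", "text": "What is the total amount on this receipt?"},
        ],
    },
]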

Why pick it for Norman AI?

Qwen3-VL-4B-Instruct is a solid choice when you need multimodal understanding without heavy infrastructure. It’s well suited for image analysis, visual Q&A, reading document screenshots, and mixed text-image workflows. The example below sends a plain chat-style prompt through Norman's invoke call:

# Chat-style prompt: a short multi-turn conversation.
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together:"},
    {"role": "user", "content": "What about solving a 2x + 3 = 7 equation?"},
]

# Invoke the model; assumes `norman` is an initialized Norman client
# from earlier setup. `inputs` carries the conversation as its data.
response = await norman.invoke(
    {
        "model_name": "qwen3-vl-4b-instruct",
        "inputs": [
            {
                "display_title": "Prompt",
                "data": messages,
            }
        ],
    }
)
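
To ask a visual question instead, pass a message list like the vision_messages sketch above as the "data" value; the surrounding invoke call stays the same.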