
Janus-Pro-7B
Janus-Pro 7B (≈ 7 B params, MIT + DeepSeek Model License)
Unified transformer that both understands images and generates them.
Spec sheet. 30 decoder layers · 4096-dim embeddings · 32 attention heads · 4096-token context window; SigLIP-L vision encoder (384 × 384) for "look", VQ tokenizer (16× downsampling) for "draw".
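If you want to sanity-check those numbers yourself, here is a minimal sketch that reads them straight from the checkpoint config. It assumes the Hugging Face repo id deepseek-ai/Janus-Pro-7B and that the nested language-model config uses the standard transformers field names; both are assumptions, not guarantees.

```python
# Sanity-check the spec sheet straight from the checkpoint config.
# Assumptions: repo id "deepseek-ai/Janus-Pro-7B", and a nested
# language-model config exposing standard transformers field names.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("deepseek-ai/Janus-Pro-7B", trust_remote_code=True)
lm_cfg = getattr(cfg, "language_config", cfg)  # fall back to the top-level config

print(lm_cfg.num_hidden_layers)        # expect 30 decoder layers
print(lm_cfg.hidden_size)              # expect 4096-dim embeddings
print(lm_cfg.num_attention_heads)      # expect 32 heads
print(lm_cfg.max_position_embeddings)  # expect a 4096-token context window
```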
Dual super-powers. Scores 79.2 on MMBench for visual Q&A and 0.80 on GenEval for text-to-image, edging past DALL-E 3 and Stable Diffusion 3 Medium on DeepSeek's reported benchmarks.
Hardware reality. Full-precision inference needs ~24 GB VRAM; 8-/4-bit quantization cuts that to 12–14 GB, so a single 3090/4090 or A10G card is plenty for production inference.
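A sketch of the quantized load path, using the standard transformers + bitsandbytes route. The repo id comes from above; whether the multimodal checkpoint loads cleanly through AutoModelForCausalLM is an assumption, and actual VRAM use depends on batch and image size.

```python
# 4-bit load to fit a single 24 GB (or smaller) card; swap in
# load_in_8bit=True for the 8-bit path. Assumes the checkpoint loads via
# AutoModelForCausalLM with trust_remote_code enabled.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # store 4-bit, compute in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/Janus-Pro-7B",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```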
Any-to-Any I/O. Accepts text ↔ image ↔ mixed sequences in one stream—no separate diffusion server, no cross-model routing.
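The message schema makes that concrete. A sketch following the conversation format in the official deepseek-ai/Janus GitHub README: the "<|User|>" role strings and the "<image_placeholder>" token are that repo's conventions, and the file name is hypothetical.

```python
# One mixed text+image turn in a single stream; the same model both reads
# the image and writes the answer, with no cross-model routing.
conversation = [
    {
        "role": "<|User|>",
        "content": "<image_placeholder>\nDescribe the layout of this dashboard.",
        "images": ["dashboard.png"],  # hypothetical local file
    },
    {"role": "<|Assistant|>", "content": ""},  # filled in by generation
]
```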
Dev-ready. Works with transformers ≥ 4.51, the Diffusers JanusProPipeline, vLLM, llama.cpp GGUF, Ollama, and a one-liner CLI (pip install janus-pro && janus-pro "<prompt>").
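For a minimal end-to-end visual-Q&A pass, here is a sketch using DeepSeek's reference package (installed from the deepseek-ai/Janus GitHub repo). The class and method names follow that repo's README; treat them as assumptions if the API has since moved.

```python
# Visual Q&A with the reference `janus` package; names follow the
# deepseek-ai/Janus README. The image path is hypothetical.
import torch
from transformers import AutoModelForCausalLM
from janus.models import VLChatProcessor
from janus.utils.io import load_pil_images

model_path = "deepseek-ai/Janus-Pro-7B"
processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = processor.tokenizer

model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
model = model.to(torch.bfloat16).cuda().eval()

conversation = [
    {"role": "<|User|>", "content": "<image_placeholder>\nWhat is in this image?",
     "images": ["photo.jpg"]},
    {"role": "<|Assistant|>", "content": ""},
]

# Encode text and image into one input stream, then decode the answer.
pil_images = load_pil_images(conversation)
inputs = processor(conversations=conversation, images=pil_images,
                   force_batchify=True).to(model.device)
inputs_embeds = model.prepare_inputs_embeds(**inputs)
outputs = model.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=256,
    do_sample=False,
)
print(tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True))
```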
License heads-up. Code is MIT; weight use follows DeepSeek’s model license—commercial OK, but mind the usual “no illegal/abusive content” clauses.
Why pick it for Norman AI?
Janus-Pro 7B lets us ship multimodal chat and on-the-fly image generation from a single 24 GB GPU: no diffusion backend, no model juggling, and permissive MIT-style licensing (weights under DeepSeek's model license). Perfect for a unified "creative assistant" tier or edge deployments where every watt and GPU slot counts.
