DeepSeek has rolled out Vision Mode in beta on its chatbot website and mobile app, giving the company its first multimodal capability alongside the existing Fast and Expert modes. According to TechNode, the rollout reached a limited set of users on April 29, days after the V4 model release.
The mode uses visual chain-of-thought for tasks like geometric reasoning, chart analysis, and turning UI screenshots into HTML. It traces back to a DeepSeek paper, "Thinking with Visual Primitives," which treats points and bounding boxes as units of reasoning rather than final outputs. The idea: bake coordinates directly into the thinking trace so the model points instead of describing.
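To make the idea concrete, here is a minimal sketch of what a reasoning trace with inline coordinates might look like, and how a client could pull the primitives back out. Everything in it is an assumption for illustration: the `<point>` and `<box>` tag syntax, the example trace, and the `extract_primitives` helper are hypothetical, and DeepSeek's actual trace format is not described in the reporting.

```python
import re
from dataclasses import dataclass

# Hypothetical tag syntax, invented for this sketch.
# DeepSeek's real trace format is not described in the source.
POINT_RE = re.compile(r"<point>(\d+),(\d+)</point>")
BOX_RE = re.compile(r"<box>(\d+),(\d+),(\d+),(\d+)</box>")


@dataclass(frozen=True)
class Point:
    x: int
    y: int


@dataclass(frozen=True)
class Box:
    x1: int
    y1: int
    x2: int
    y2: int

    def contains(self, p: Point) -> bool:
        """Return True if the point lies inside or on the edge of the box."""
        return self.x1 <= p.x <= self.x2 and self.y1 <= p.y <= self.y2


def extract_primitives(trace: str):
    """Collect every point and bounding box embedded in a reasoning trace."""
    points = [Point(int(x), int(y)) for x, y in POINT_RE.findall(trace)]
    boxes = [Box(*map(int, coords)) for coords in BOX_RE.findall(trace)]
    return points, boxes


# Invented example trace: coordinates stand in for phrases like
# "the third object from the left".
trace = (
    "The legend sits at <box>40,20,200,60</box>. "
    "The first label is at <point>120,45</point>, "
    "and the line ends near <point>310,220</point>."
)

points, boxes = extract_primitives(trace)
for p in points:
    print(p, "inside legend box:", boxes[0].contains(p))
```

The point of the sketch is that a coordinate stays unambiguous no matter how many reasoning steps follow it, whereas a verbal reference has to be re-resolved each time.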
DeepSeek calls the underlying issue the "reference gap." Natural language gets vague fast ("the third object from the left" stops meaning anything after a few reasoning steps), so the model anchors to coordinates instead. The architecture sits on the V4-Flash backbone, a 284B-parameter mixture-of-experts model with 13B active at inference.
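The mixture-of-experts figures are easier to read as a ratio. The short calculation below uses only the two reported numbers; reading "13B active at inference" as the parameters used per token is an assumption, and the function name is illustrative.

```python
# Reported figures, in billions of parameters.
TOTAL_PARAMS_B = 284
ACTIVE_PARAMS_B = 13  # assumed to mean parameters used per token


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters that participate in one forward pass."""
    return active_b / total_b


frac = active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
print(f"Active share: {frac:.1%}")            # about 4.6%
print(f"Total-to-active ratio: {1 / frac:.1f}x")  # about 21.8x
```

Under that reading, under 5% of the weights are exercised for any given token, which is the usual trade in a sparse model: large total capacity with per-token compute closer to that of a much smaller dense model.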
The benchmarks look strong, but they are self-reported and have not been confirmed by independent testing. DeepSeek claims 67% on maze navigation against GPT-5.4's 50%, plus a roughly 10x edge in image token efficiency over Claude and Gemini. Worth noting: the company quietly pulled the technical report shortly after posting it, with no explanation given.
Vision Mode handles static images only. No audio, video, or image generation support yet.
Bottom Line
DeepSeek's Vision Mode runs on the V4-Flash backbone (284B params, 13B active) and handles static images only, with no audio, video, or generation support.
Quick Facts
- Launched April 29, 2026, in limited beta
- Available on DeepSeek web and mobile app
- Backbone: V4-Flash, 284B params, 13B active
- Maze navigation: 67% vs GPT-5.4's 50% (company-reported)
- Technical report posted then pulled by DeepSeek




