DeepSeek Launches Vision Mode in Chatbot Beta

DeepSeek has rolled out Vision Mode in beta on its chatbot website and mobile app. It is the company's first multimodal capability and sits alongside the existing Fast and Expert modes. The rollout reached a limited set of users on April 29, according to TechNode, just days after the V4 model release.

The mode uses visual chain-of-thought reasoning for tasks like geometric reasoning, chart analysis, and turning UI screenshots into HTML. It traces back to a DeepSeek paper, "Thinking with Visual Primitives," which treats points and bounding boxes as units of reasoning rather than final outputs. The idea: bake coordinates directly into the thinking trace so the model points instead of describing.

DeepSeek calls the underlying issue the "reference gap." Natural language gets vague fast ("the third object from the left" stops meaning anything after a few reasoning steps), so the model anchors to coordinates instead. The architecture sits on the V4-Flash backbone, a 284B-parameter mixture-of-experts model with 13B active at inference.
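To make the "reference gap" concrete, here is a minimal Python sketch. It is an illustration only, not DeepSeek's format or code: the `Box` class, the `nth_from_left` helper, and the pixel values are all hypothetical. It shows an ordinal phrase losing its meaning once an earlier reasoning step drops an object, while a coordinate reference keeps pointing at the same region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical format)."""
    x1: int
    y1: int
    x2: int
    y2: int


def nth_from_left(boxes: list[Box], n: int) -> Box:
    """Resolve an ordinal phrase like 'the n-th object from the left'."""
    return sorted(boxes, key=lambda b: b.x1)[n - 1]


# Three made-up objects in an image.
scene = [Box(10, 40, 50, 90), Box(70, 35, 110, 85), Box(130, 30, 170, 80)]

# Step 1: both styles of reference pick out the same object.
by_words = nth_from_left(scene, 3)
by_coords = Box(130, 30, 170, 80)

# Step 2: a later reasoning step drops the leftmost object from consideration.
remaining = [b for b in scene if b != scene[0]]

# "Third from the left" no longer resolves to anything...
try:
    drifted = nth_from_left(remaining, 3)
except IndexError:
    drifted = None

# ...while the coordinate reference still names the same region.
still_there = by_coords in remaining
```

In this toy scene the object that started as "third from the left" is now second from the left, so the phrase has to be re-derived at every step. The box coordinates do not change.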

The benchmarks look strong, but they are self-reported and have not been confirmed by independent testing. DeepSeek claims 67% on maze navigation against GPT-5.4's 50%, plus a roughly 10x edge in image token efficiency over Claude and Gemini. Worth noting: the company quietly pulled the technical report shortly after posting it, with no explanation given.

Vision Mode handles static images only. No audio, video, or image generation support yet.

Bottom Line

DeepSeek's Vision Mode runs on the V4-Flash backbone (284B params, 13B active) and handles static images only, with no audio, video, or generation support.

Quick Facts

Launched April 29, 2026 in limited beta
Available on DeepSeek web and mobile app
Backbone: V4-Flash, 284B params, 13B active
Maze navigation: 67% vs GPT-5.4's 50% (company-reported)
Technical report posted then pulled by DeepSeek

Tags: DeepSeek, multimodal AI, computer vision, Vision Mode, chain-of-thought, V4-Flash, China AI

Andrés Martínez

AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.
