DeepSeek Adds Vision Mode to Its Chatbot

Beta image recognition mode brings visual chain-of-thought to DeepSeek's web and mobile apps.

Andrés Martínez, AI Content Writer
June 20, 2026 | 2 min read

DeepSeek has rolled out Vision Mode in beta on its chatbot website and mobile app, giving the company its first multimodal capability alongside the existing Fast and Expert modes. The rollout reached a limited set of users on April 29, 2026, per TechNode reporting, just days after the V4 model release.

The mode runs on visual chain-of-thought for tasks like geometric reasoning, chart analysis, and turning UI screenshots into HTML. It traces back to a DeepSeek paper, Thinking with Visual Primitives, which treats points and bounding boxes as units of reasoning rather than final outputs. The idea: bake coordinates directly into the thinking trace so the model points instead of describing.
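To make the idea concrete, here is a minimal Python sketch of what a reasoning trace with embedded coordinates might look like and how the primitives could be pulled back out. The `<point>` and `<box>` tag syntax, the sample trace, and the `extract_primitives` helper are all hypothetical, invented for illustration; DeepSeek's actual trace format is not described here.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Primitive:
    kind: str      # "point" or "box"
    coords: tuple  # (x, y) for a point, (x1, y1, x2, y2) for a box


# Hypothetical tag syntax, assumed for this sketch only.
_TAG = re.compile(r"<(point|box)>\s*([\d\s,]+?)\s*</\1>")


def extract_primitives(trace: str) -> list[Primitive]:
    """Collect every point and bounding box embedded in a reasoning trace."""
    found = []
    for match in _TAG.finditer(trace):
        kind = match.group(1)
        coords = tuple(int(v) for v in match.group(2).split(","))
        expected = 2 if kind == "point" else 4
        if len(coords) != expected:
            raise ValueError(f"{kind} needs {expected} values, got {len(coords)}")
        found.append(Primitive(kind, coords))
    return found


# An invented trace: each step points at pixels instead of describing them.
trace = (
    "The maze entrance is at <point>12, 40</point>. "
    "The wall <box>30, 0, 34, 80</box> blocks the direct route, "
    "so move down to <point>12, 90</point> first."
)

primitives = extract_primitives(trace)
for p in primitives:
    print(p.kind, p.coords)
```

Because each step carries explicit coordinates, a later step can refer back to the same location without relying on a phrase that may have drifted in meaning.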

DeepSeek calls the underlying issue the "reference gap." Natural language gets vague fast ("the third object from the left" stops meaning anything after a few reasoning steps), so the model anchors to coordinates instead. The architecture sits on the V4-Flash backbone, a 284B-parameter mixture-of-experts model with 13B active at inference.
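For a sense of scale, the reported parameter counts imply that only a small slice of the model runs for any given token. The short calculation below uses just the two figures quoted above; the variable names are illustrative.

```python
# Figures as reported for V4-Flash, in billions of parameters.
TOTAL_PARAMS_B = 284
ACTIVE_PARAMS_B = 13

# Share of the model that is active at inference time.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

print(f"Active share: {active_fraction:.1%}")  # Active share: 4.6%
```

In other words, roughly one parameter in 22 is active at inference, which is the usual trade-off of a mixture-of-experts design: large total capacity with a much smaller per-token compute cost.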

The benchmarks look strong, but they are self-reported and independent testing hasn't confirmed them. DeepSeek claims 67% on maze navigation against GPT-5.4's 50%, plus a roughly 10x edge in image token efficiency over Claude and Gemini. Worth noting: the company quietly pulled the technical report shortly after posting it, with no explanation given.

Vision Mode handles static images only. No audio, video, or image generation support yet.


Bottom Line

DeepSeek's Vision Mode runs on the V4-Flash backbone (284B params, 13B active) and handles static images only, with no audio, video, or generation support.

Quick Facts

  • Launched April 29, 2026 in limited beta
  • Available on DeepSeek web and mobile app
  • Backbone: V4-Flash, 284B params, 13B active
  • Maze navigation: 67% vs GPT-5.4's 50% (company-reported)
  • Technical report posted then pulled by DeepSeek
Tags: DeepSeek, multimodal AI, computer vision, Vision Mode, chain-of-thought, V4-Flash, China AI