
Google Gemini 3 Flash Adds Agentic Vision for Code-Powered Image Analysis

The model now writes and runs Python to actively investigate images instead of just looking at them once.

Andrés Martínez, AI Content Writer
January 29, 2026 · 3 min read
Illustration of AI vision system actively analyzing and annotating a detailed image with zoom regions and bounding boxes

Google announced Agentic Vision for Gemini 3 Flash on January 27, making the model capable of generating and executing Python code to manipulate images during analysis. The feature is available now through the Gemini API in Google AI Studio and Vertex AI.

What it actually does

Standard vision models look at an image once and guess. Miss a detail, and that's it. Agentic Vision turns this into an iterative loop: the model thinks about what it needs, writes code to manipulate the image, then examines the result before responding.

The Think-Act-Observe cycle works like this: Gemini analyzes the query and plans its approach, generates Python to crop, zoom, rotate, or annotate the image, then appends the modified image back into its context window for another look.
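As a rough sketch of how a developer would turn this on through the API, the snippet below uses the google-genai Python SDK to send an image query with the code execution tool enabled. The model id "gemini-3-flash" and the file name are assumptions for illustration, not confirmed identifiers from the announcement.

```python
# Minimal sketch: image query with code execution enabled via google-genai.
# "gemini-3-flash" and "hand.jpg" are placeholder assumptions.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

with open("hand.jpg", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-3-flash",  # assumed model id for illustration
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        "How many fingers are raised in this photo?",
    ],
    config=types.GenerateContentConfig(
        # Code execution is the switch that lets the model write and run
        # Python (crop, zoom, annotate) before it answers.
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)

print(response.text)
```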

Google claims a 5-10% improvement across vision benchmarks with code execution enabled. That's a modest bump, and whether users notice will depend entirely on the task. The real question is whether it reduces the frustrating hallucinations that plague vision models on detail-heavy images.

The finger-counting problem

Here's the scenario Google keeps returning to: ask an AI to count fingers on a hand, and it often gets it wrong. With Agentic Vision, Gemini 3 Flash draws bounding boxes and numeric labels over each finger it identifies before answering. The model essentially creates a visual scratchpad to avoid losing track mid-count.
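To make the scratchpad idea concrete, the sketch below shows the kind of annotation code the model might generate internally: draw a numbered box over each detected region, save the labeled image, and look at it again before answering. The coordinates are invented for illustration; Google hasn't published the actual generated code.

```python
# Illustrative only: numbered bounding boxes as a visual scratchpad.
from PIL import Image, ImageDraw

image = Image.open("hand.jpg").convert("RGB")
draw = ImageDraw.Draw(image)

# Hypothetical finger bounding boxes as (left, top, right, bottom) tuples.
detections = [
    (120, 40, 170, 160),
    (180, 25, 230, 150),
    (240, 30, 290, 155),
    (300, 50, 350, 170),
]

for index, box in enumerate(detections, start=1):
    draw.rectangle(box, outline="red", width=3)               # box around each finger
    draw.text((box[0], box[1] - 18), str(index), fill="red")  # numeric label above it

image.save("hand_annotated.jpg")  # appended back into context for a second look
```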

It's a clever workaround for a problem that's embarrassed vision models for years. Whether it generalizes well to messier real-world images remains to be seen.

Who's using it

PlanCheckSolver.com, a building plan validation platform, reported a 5% accuracy improvement after enabling code execution. Its use case involves high-resolution architectural drawings where fine details matter: the model crops specific sections (roof edges, building details) and re-examines them at higher resolution.

For data visualization tasks, the model can extract numbers from tables and generate Matplotlib charts rather than attempting mental arithmetic. Google's demo app shows this working on benchmark comparison tables.
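Below is a sketch of the sort of chart-generation code the model could produce instead of doing the arithmetic "in its head": plot values it has read off a table with Matplotlib. The benchmark names and scores are placeholders, not figures from Google's demo.

```python
# Placeholder data only: bar chart comparing scores read from a table.
import matplotlib.pyplot as plt

benchmarks = ["Benchmark A", "Benchmark B", "Benchmark C"]
baseline = [61.2, 54.8, 70.1]        # placeholder scores without code execution
with_code_exec = [66.5, 60.3, 74.9]  # placeholder scores with code execution

x = range(len(benchmarks))
width = 0.35

plt.bar([i - width / 2 for i in x], baseline, width, label="Baseline")
plt.bar([i + width / 2 for i in x], with_code_exec, width, label="With code execution")
plt.xticks(list(x), benchmarks)
plt.ylabel("Score")
plt.legend()
plt.savefig("benchmark_comparison.png")
```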

What's missing

Some behaviors still require explicit prompting. Zooming happens implicitly when the model detects small details, but rotation and visual math need a nudge from the user. Google says fully implicit behavior is coming.

The company also plans to add web search and reverse image search as tools, letting the model ground its understanding with external context. Expansion to other Gemini model sizes beyond Flash is on the roadmap.

Developers can access Agentic Vision by enabling Code Execution in the AI Studio Playground. The feature is rolling out in the Gemini app under the Thinking model option. Full developer documentation covers implementation details.

Andrés Martínez, AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.
