
Google Gemini 3 Flash Adds Agentic Vision for Code-Powered Image Analysis

The model now writes and runs Python to actively investigate images instead of just looking at them once.

Andrés Martínez, AI Content Writer
January 29, 2026 · 3 min read
Illustration of AI vision system actively analyzing and annotating a detailed image with zoom regions and bounding boxes

Google announced Agentic Vision for Gemini 3 Flash on January 27, making the model capable of generating and executing Python code to manipulate images during analysis. The feature is available now through the Gemini API in Google AI Studio and Vertex AI.

What it actually does

Standard vision models look at an image once and guess. Miss a detail, and that's it. Agentic Vision turns this into an iterative loop: the model thinks about what it needs, writes code to manipulate the image, then examines the result before responding.

The Think-Act-Observe cycle works like this: Gemini analyzes the query and plans its approach, generates Python to crop, zoom, rotate, or annotate the image, then appends the modified image back into its context window for another look.
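As a rough sketch of how a developer would turn this on through the API, the snippet below uses the google-genai Python SDK to send an image query with the code execution tool enabled. The model id "gemini-3-flash" and the file name are assumptions for illustration, not confirmed identifiers from the announcement.

```python
# Minimal sketch: image query with code execution enabled via google-genai.
# "gemini-3-flash" and "hand.jpg" are placeholder assumptions.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

with open("hand.jpg", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-3-flash",  # assumed model id for illustration
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        "How many fingers are raised in this photo?",
    ],
    config=types.GenerateContentConfig(
        # Code execution is the switch that lets the model write and run
        # Python (crop, zoom, annotate) before it answers.
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)

print(response.text)
```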

Google claims a 5-10% improvement across vision benchmarks with code execution enabled. That's a modest bump, and whether users notice will depend entirely on the task. The real question is whether it reduces the frustrating hallucinations that plague vision models on detail-heavy images.

The finger-counting problem

Here's the scenario Google keeps returning to: ask an AI to count fingers on a hand, and it often gets it wrong. With Agentic Vision, Gemini 3 Flash draws bounding boxes and numeric labels over each finger it identifies before answering. The model essentially creates a visual scratchpad to avoid losing track mid-count.
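To make the scratchpad idea concrete, the sketch below shows the kind of annotation code the model might generate internally: draw a numbered box over each detected region, save the labeled image, and look at it again before answering. The coordinates are invented for illustration; Google hasn't published the actual generated code.

```python
# Illustrative only: numbered bounding boxes as a visual scratchpad.
from PIL import Image, ImageDraw

image = Image.open("hand.jpg").convert("RGB")
draw = ImageDraw.Draw(image)

# Hypothetical finger bounding boxes as (left, top, right, bottom) tuples.
detections = [
    (120, 40, 170, 160),
    (180, 25, 230, 150),
    (240, 30, 290, 155),
    (300, 50, 350, 170),
]

for index, box in enumerate(detections, start=1):
    draw.rectangle(box, outline="red", width=3)               # box around each finger
    draw.text((box[0], box[1] - 18), str(index), fill="red")  # numeric label above it

image.save("hand_annotated.jpg")  # appended back into context for a second look
```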

It's a clever workaround for a problem that's embarrassed vision models for years. Whether it generalizes well to messier real-world images remains to be seen.

Who's using it

PlanCheckSolver.com, a building plan validation platform, reported a 5% accuracy improvement after enabling code execution. Its use case involves high-resolution architectural drawings where fine details matter: the model crops specific sections (roof edges, building details) and re-examines them at higher resolution.

For data visualization tasks, the model can extract numbers from tables and generate Matplotlib charts rather than attempting mental arithmetic. Google's demo app shows this working on benchmark comparison tables.
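Below is a sketch of the sort of chart-generation code the model could produce instead of doing the arithmetic "in its head": plot values it has read off a table with Matplotlib. The benchmark names and scores are placeholders, not figures from Google's demo.

```python
# Placeholder data only: bar chart comparing scores read from a table.
import matplotlib.pyplot as plt

benchmarks = ["Benchmark A", "Benchmark B", "Benchmark C"]
baseline = [61.2, 54.8, 70.1]        # placeholder scores without code execution
with_code_exec = [66.5, 60.3, 74.9]  # placeholder scores with code execution

x = range(len(benchmarks))
width = 0.35

plt.bar([i - width / 2 for i in x], baseline, width, label="Baseline")
plt.bar([i + width / 2 for i in x], with_code_exec, width, label="With code execution")
plt.xticks(list(x), benchmarks)
plt.ylabel("Score")
plt.legend()
plt.savefig("benchmark_comparison.png")
```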

What's missing

Some behaviors still require explicit prompting. Zooming happens implicitly when the model detects small details, but rotation and visual math need a nudge from the user. Google says fully implicit behavior is coming.

The company also plans to add web search and reverse image search as tools, letting the model ground its understanding with external context. Expansion to other Gemini model sizes beyond Flash is on the roadmap.

Developers can access Agentic Vision by enabling Code Execution in the AI Studio Playground. The feature is rolling out in the Gemini app under the Thinking model option. Full developer documentation covers implementation details.

Andrés Martínez, AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.
