NVIDIA Releases LocateAnything-3B Grounding Model

A dense scene of many small identical figures, each tightly enclosed by an individual rectangular detection box

NVIDIA Research has released LocateAnything-3B, a vision-language model built for visual grounding: pointing at objects, GUI elements, and text regions in cluttered scenes. Weights, code, and a demo are live on Hugging Face, with the code shipping inside the Eagle repo on GitHub.

The trick is how it draws boxes. Most VLMs spit out coordinates one token at a time, left to right, which is slow and lets an early mistake throw off everything after it. LocateAnything instead treats each box as a fixed unit and predicts the whole thing in one parallel step. NVIDIA calls this Parallel Box Decoding, described in the tech report.
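To make the contrast concrete, here is a toy step count, not NVIDIA's implementation. It assumes a box is four coordinates (x1, y1, x2, y2), that a coordinate-by-coordinate decoder spends one step per coordinate token, and that a box-level decoder spends one step per box. The function names are hypothetical, and the sketch ignores labels, separator tokens, and the compute cost of each step.

```python
# Toy step-count model under the assumptions stated above.
# It does not reproduce Parallel Box Decoding; it only counts decode steps.

COORDS_PER_BOX = 4  # assumed box format: x1, y1, x2, y2


def sequential_steps(num_boxes: int, tokens_per_coord: int = 1) -> int:
    """Steps when each coordinate token is generated one after another."""
    return num_boxes * COORDS_PER_BOX * tokens_per_coord


def parallel_box_steps(num_boxes: int) -> int:
    """Steps when each box is emitted as one unit, all coordinates at once."""
    return num_boxes


def step_ratio(num_boxes: int, tokens_per_coord: int = 1) -> float:
    """How many times more decode steps the sequential scheme needs."""
    if num_boxes <= 0:
        raise ValueError("num_boxes must be positive")
    return sequential_steps(num_boxes, tokens_per_coord) / parallel_box_steps(num_boxes)


if __name__ == "__main__":
    for n in (1, 50, 200):
        print(n, sequential_steps(n), parallel_box_steps(n), step_ratio(n))
```

Under these assumptions the sequential scheme needs four times as many steps regardless of how many boxes are in the scene. That ratio is a property of the toy model, not a throughput prediction; real speedups depend on tokenization, batching, and hardware.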

Throughput claims are inconsistent. The model card cites up to 2.5x higher throughput, while NVIDIA's own launch posts and follow-on coverage cite a 10x figure against Qwen3-VL. Both numbers come from the company, and the gap is worth noting before anyone budgets around it.
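A quick back-of-the-envelope sketch shows why the gap matters for planning. The baseline rate and workload below are made-up placeholders, not measurements, and the sketch assumes both multipliers apply to the same baseline, which the sources do not confirm.

```python
# Back-of-the-envelope budgeting sketch. BASELINE_IMAGES_PER_SEC and
# NUM_IMAGES are hypothetical placeholders, not benchmark results.

SECONDS_PER_HOUR = 3600

BASELINE_IMAGES_PER_SEC = 2.0   # hypothetical baseline model throughput
NUM_IMAGES = 1_000_000          # hypothetical workload size


def hours_to_process(num_images: int, baseline_images_per_sec: float, speedup: float) -> float:
    """Wall-clock hours to process a workload at baseline rate times speedup."""
    if baseline_images_per_sec <= 0 or speedup <= 0:
        raise ValueError("rate and speedup must be positive")
    return num_images / (baseline_images_per_sec * speedup) / SECONDS_PER_HOUR


if __name__ == "__main__":
    for label, speedup in (("baseline", 1.0), ("model card, 2.5x", 2.5), ("launch posts, 10x", 10.0)):
        hours = hours_to_process(NUM_IMAGES, BASELINE_IMAGES_PER_SEC, speedup)
        print(f"{label}: {hours:.1f} hours")
```

Whatever the baseline, the two claims differ by a factor of four in estimated processing time, so a plan built on the 10x figure would need four times the hardware or time if only the 2.5x figure holds.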

It was trained on roughly 12M images and 138M queries spanning natural scenes, robotics, driving, GUI interaction, and document understanding, which is why it handles interface elements and OCR text alongside real-world objects. One catch the original writeup skipped: the release is under the NVIDIA License, research and non-commercial use only. Commercial deployment isn't permitted.

Bottom Line

LocateAnything-3B predicts complete bounding boxes in a single parallel step and ships under a research-only NVIDIA License, not a commercial one.

Quick Facts

Parameters: 3 billion
Architecture: MoonViT-SO-400M vision encoder + Qwen2.5-3B language model
Training data: ~12M images, 138M+ queries, 785M bounding boxes (company-reported)
Throughput: up to 2.5x per model card, up to 10x vs Qwen3-VL per NVIDIA posts (neither independently verified)
License: NVIDIA License, non-commercial / research use only

Tags: NVIDIA, computer vision, vision-language models, object detection, open weights, GUI agents

Andrés Martínez

AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.
