
NVIDIA Releases LocateAnything-3B Visual Grounding Model

The 3B model predicts whole bounding boxes in parallel instead of one coordinate at a time.

Andrés Martínez, AI Content Writer
June 28, 2026 · 2 min read

[Image: A dense scene of many small identical figures, each tightly enclosed by an individual rectangular detection box]

NVIDIA Research has released LocateAnything-3B, a vision-language model built for visual grounding: pointing at objects, GUI elements, and text regions in cluttered scenes. Weights, code, and a demo are live on Hugging Face, with the code shipping inside the Eagle repo on GitHub.

The trick is how it draws boxes. Most VLMs spit out coordinates one token at a time, left to right, which is slow and lets an early mistake throw off everything after it. LocateAnything instead treats each box as a fixed unit and predicts the whole thing in one parallel step. NVIDIA calls this Parallel Box Decoding, described in the tech report.
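To make the difference concrete, here is a toy step count in Python. It does not run the model or reproduce NVIDIA's decoder. It assumes, purely for illustration, that a box is four coordinate tokens (x1, y1, x2, y2) and that a parallel decoder emits one whole box per step. The function names are hypothetical.

```python
# Toy illustration only: counts decoding steps, runs no model.
# ASSUMPTION (not from the tech report): a box is 4 coordinate tokens
# (x1, y1, x2, y2), and parallel decoding emits one full box per step.

COORDS_PER_BOX = 4  # assumed box encoding


def sequential_steps(num_boxes: int, coords_per_box: int = COORDS_PER_BOX) -> int:
    """Decoding steps when every coordinate is its own left-to-right token."""
    return num_boxes * coords_per_box


def parallel_box_steps(num_boxes: int) -> int:
    """Decoding steps when each box is predicted as a single unit."""
    return num_boxes


def step_reduction(num_boxes: int) -> float:
    """Ratio of sequential steps to parallel steps for the same boxes."""
    if num_boxes <= 0:
        raise ValueError("num_boxes must be positive")
    return sequential_steps(num_boxes) / parallel_box_steps(num_boxes)


if __name__ == "__main__":
    for n in (1, 50, 500):
        print(n, sequential_steps(n), parallel_box_steps(n), step_reduction(n))
```

Under these assumptions the step count drops by the number of coordinates per box, whatever the scene size. Fewer steps is not the same thing as measured throughput, which also depends on batching, hardware, and the rest of the output sequence.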

Throughput claims don't line up. The model card cites up to 2.5x higher throughput, while NVIDIA's own launch posts and follow-on coverage push a 10x figure against Qwen3-VL. Both numbers come from NVIDIA, neither has been independently verified, and the 4x gap between them is worth noting before anyone budgets around either one.
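A quick back-of-the-envelope sketch shows why the gap matters for planning. The baseline rate and workload below are made-up placeholders, and the code assumes both multipliers apply to the same baseline, which the sources do not confirm.

```python
# Back-of-the-envelope only. The baseline throughput and image count are
# hypothetical placeholders, not measurements of any model.
# ASSUMPTION: both claimed speedups are relative to the same baseline.

CLAIMED_SPEEDUPS = {
    "model card (up to 2.5x)": 2.5,
    "launch posts (up to 10x)": 10.0,
}


def hours_to_process(num_images: int, baseline_images_per_sec: float, speedup: float) -> float:
    """Wall-clock hours to process num_images at baseline rate times speedup."""
    if baseline_images_per_sec <= 0 or speedup <= 0:
        raise ValueError("rates and speedups must be positive")
    seconds = num_images / (baseline_images_per_sec * speedup)
    return seconds / 3600.0


def budget_range(num_images: int, baseline_images_per_sec: float) -> dict[str, float]:
    """Estimated hours under each claimed speedup."""
    return {
        label: hours_to_process(num_images, baseline_images_per_sec, speedup)
        for label, speedup in CLAIMED_SPEEDUPS.items()
    }


if __name__ == "__main__":
    # Hypothetical job: 1,000,000 images, baseline of 10 images per second.
    for label, hours in budget_range(1_000_000, 10.0).items():
        print(f"{label}: {hours:.1f} h")
```

Whatever baseline is plugged in, the two estimates differ by a factor of four, so a compute budget built on the 10x figure would run out well early if the 2.5x figure is closer to reality.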

The model was trained on roughly 12M images and 138M queries spanning natural scenes, robotics, driving, GUI interaction, and document understanding, which is why it handles interface elements and OCR text alongside real-world objects. One catch the original writeup skipped: the release is under the NVIDIA License, research and non-commercial use only. Commercial deployment isn't permitted.
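For a sense of how dense that training data is, the company-reported totals (including the 785M bounding boxes listed under Quick Facts) can be turned into rough averages. The inputs are rounded, so treat the outputs as approximate.

```python
# Rough averages from the rounded, company-reported totals in this article.
IMAGES = 12_000_000    # "~12M images"
QUERIES = 138_000_000  # "138M+ queries"
BOXES = 785_000_000    # "785M bounding boxes"

queries_per_image = QUERIES / IMAGES
boxes_per_query = BOXES / QUERIES
boxes_per_image = BOXES / IMAGES

if __name__ == "__main__":
    print(f"queries per image: ~{queries_per_image:.1f}")
    print(f"boxes per query:   ~{boxes_per_query:.1f}")
    print(f"boxes per image:   ~{boxes_per_image:.1f}")
```

Several boxes per query on average fits the cluttered, many-object scenes the model targets, though averages say nothing about how the boxes are distributed across domains.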


Bottom Line

LocateAnything-3B predicts complete bounding boxes in a single parallel step and ships under a research-only NVIDIA License, not a commercial one.

Quick Facts

  • 3 billion parameters
  • Architecture: MoonViT-SO-400M vision encoder + Qwen2.5-3B language model
  • Training data: ~12M images, 138M+ queries, 785M bounding boxes (company-reported)
  • Throughput: up to 2.5x higher per model card, up to 10x vs Qwen3-VL per NVIDIA posts (neither independently verified)
  • License: NVIDIA License, non-commercial / research use only

Tags: NVIDIA, computer vision, vision-language models, object detection, open weights, GUI agents