NVIDIA Research has released LocateAnything-3B, a vision-language model built for visual grounding: pointing at objects, GUI elements, and text regions in cluttered scenes. Weights, code, and a demo are live on Hugging Face, with the code shipping inside the Eagle repo on GitHub.
The trick is how it draws boxes. Most VLMs spit out coordinates one token at a time, left to right, which is slow and lets an early mistake throw off everything after it. LocateAnything instead treats each box as a fixed unit and predicts the whole thing in one parallel step. NVIDIA calls this Parallel Box Decoding, described in the tech report.
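To make the difference concrete, here is a toy step count, not NVIDIA's implementation. It assumes a baseline that spends one decoding step per coordinate token and four coordinates per box (x1, y1, x2, y2); real tokenizers may use more or fewer tokens per coordinate. The function names are hypothetical.

```python
# Toy comparison of decoding steps. Assumption: one step per coordinate
# token in the sequential case, four coordinates per box.
COORDS_PER_BOX = 4


def sequential_steps(num_boxes: int, coords_per_box: int = COORDS_PER_BOX) -> int:
    """Steps needed when every coordinate is emitted as its own token."""
    if num_boxes <= 0:
        raise ValueError("num_boxes must be positive")
    return num_boxes * coords_per_box


def parallel_box_steps(num_boxes: int) -> int:
    """Steps needed when each box is emitted whole in a single step."""
    if num_boxes <= 0:
        raise ValueError("num_boxes must be positive")
    return num_boxes


def step_reduction(num_boxes: int) -> float:
    """Ratio of sequential steps to parallel-box steps."""
    return sequential_steps(num_boxes) / parallel_box_steps(num_boxes)


if __name__ == "__main__":
    for n in (1, 10, 50):
        print(n, sequential_steps(n), parallel_box_steps(n), step_reduction(n))
```

Under these assumptions the step count drops by the number of coordinates per box. Step count is not the same as measured throughput, which also depends on the vision encoder, batching, and hardware.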
The throughput claims don't line up. The model card cites up to 2.5x higher throughput, while NVIDIA's own launch posts and follow-on coverage push a 10x figure against Qwen3-VL. Both numbers originate with NVIDIA and neither has been independently verified, so the gap is worth noting before anyone budgets around it.
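A rough sketch of why the gap matters for planning follows. The baseline rate and job size are made-up placeholders, and treating both figures as speedups over the same baseline is an assumption, since the two claims may not share a reference point.

```python
# Hypothetical planning arithmetic; none of these inputs come from NVIDIA.
BASELINE_IMAGES_PER_SEC = 5.0   # placeholder baseline throughput
NUM_IMAGES = 1_000_000          # placeholder job size


def hours_to_process(num_images: int, baseline_rate: float, speedup: float) -> float:
    """Wall-clock hours for a job, given a baseline rate and a speedup factor."""
    if baseline_rate <= 0 or speedup <= 0:
        raise ValueError("baseline_rate and speedup must be positive")
    seconds = num_images / (baseline_rate * speedup)
    return seconds / 3600


if __name__ == "__main__":
    for label, speedup in (("baseline", 1.0), ("2.5x claim", 2.5), ("10x claim", 10.0)):
        hours = hours_to_process(NUM_IMAGES, BASELINE_IMAGES_PER_SEC, speedup)
        print(f"{label}: {hours:.1f} hours")
```

Whatever baseline is plugged in, the two claims differ by a factor of four in estimated run time, which is enough to change a hardware budget.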
It was trained on roughly 12M images and 138M queries spanning natural scenes, robotics, driving, GUI interaction, and document understanding, which is why it handles interface elements and OCR text alongside real-world objects. One catch the original writeup skipped: the release is under the NVIDIA License, research and non-commercial use only. Commercial deployment isn't permitted.
Bottom Line
LocateAnything-3B predicts complete bounding boxes in a single parallel step and ships under a research-only NVIDIA License, not a commercial one.
Quick Facts
- 3 billion parameters
- Architecture: MoonViT-SO-400M vision encoder + Qwen2.5-3B language model
- Training data: ~12M images, 138M+ queries, 785M bounding boxes (company-reported)
- Throughput: up to 2.5x per model card, up to 10x vs Qwen3-VL per NVIDIA posts (neither independently verified)
- License: NVIDIA License, non-commercial / research use only
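
For a sense of annotation density, the training-data figures above can be divided out. The inputs are the rounded, company-reported numbers ("~12M" and "138M+"), so the ratios are rough estimates only.

```python
# Rounded, company-reported figures from the Quick Facts list.
IMAGES = 12_000_000
QUERIES = 138_000_000
BOXES = 785_000_000

queries_per_image = QUERIES / IMAGES   # average queries attached to one image
boxes_per_query = BOXES / QUERIES      # average boxes answering one query
boxes_per_image = BOXES / IMAGES       # average boxes per image overall

if __name__ == "__main__":
    print(f"queries per image: {queries_per_image:.1f}")
    print(f"boxes per query:   {boxes_per_query:.1f}")
    print(f"boxes per image:   {boxes_per_image:.1f}")
```

That works out to roughly 11.5 queries per image and between 5 and 6 boxes per query on average, assuming the rounded totals are representative.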




