NVIDIA released its Nemotron 3 Nano Omni model last week, detailed on the company's release blog. The 30-billion-parameter open multimodal model handles text, images, audio and video in a single inference pass, with only 3 billion parameters active per token through a hybrid Mamba-Transformer mixture-of-experts design.
Throughput is up to 9x higher than other open omni models at comparable interactivity, per NVIDIA's own measurements; that figure is self-reported, not independently verified. The more concrete win is on OSWorld, the GUI navigation benchmark, where Nano Omni scores 47.4 against 11.1 for the previous Nemotron Nano V2 VL.
Architecture details from the technical report: 23 Mamba-2 layers, 23 MoE layers with 128 experts, and 6 grouped-query attention layers. Vision runs through the C-RADIOv4-H encoder; audio uses Parakeet-TDT. The model is English-only at launch.
Training data sourcing is unusually candid. The model card lists Qwen3-VL, Qwen3.5, Qwen2.5-VL and OpenAI's gpt-oss-120b as sources for synthetic captions and reasoning traces. Separately, NVIDIA reports the broader Nemotron family hit over 50 million downloads in the past year.
Three weight formats are live: BF16, FP8 and NVFP4. The model also runs on vLLM, SGLang, llama.cpp and Ollama, and ships as a NIM microservice. Commercial use is permitted under the NVIDIA Open Model License.
Bottom Line
Nano Omni jumps from 11.1 to 47.4 on the OSWorld GUI navigation benchmark versus the previous Nemotron Nano V2 VL.
Quick Facts
- 30 billion total parameters, 3 billion active per token
- OSWorld GUI score: 47.4 vs 11.1 for Nemotron Nano V2 VL
- 9x throughput claim is NVIDIA-reported, not independently verified
- Three weight formats released: BF16, FP8, NVFP4
- License: NVIDIA Open Model License (commercial use permitted)