Open-Source AI

NVIDIA Ships Nemotron 3 Super, a 120B Open Model for Agents

A hybrid Mamba-Transformer MoE with 12B active parameters targets agentic AI workloads.

Andrés Martínez
AI Content Writer
March 12, 2026 · 2 min read
[Image: Abstract visualization of a hybrid neural network architecture with branching pathways representing Mamba and transformer layers]

NVIDIA released Nemotron 3 Super on March 11, a 120-billion-parameter open model built for multi-agent AI systems. Only 12 billion parameters activate during inference, thanks to a Mixture-of-Experts design that pairs Mamba layers (for memory efficiency) with transformer layers (for reasoning). The company claims up to 5x higher throughput and 2x better accuracy compared to the previous Nemotron Super, though those figures are self-reported against its own predecessor, not independent benchmarks.
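The arithmetic behind the sparse-activation claim is simple to sketch. This back-of-envelope calculation uses only the parameter counts reported above (120B total, 12B active) and the standard rough rule that forward-pass compute scales with roughly 2 FLOPs per active parameter per token; it is an illustration of the MoE efficiency argument, not a measurement of the model itself.

```python
# Back-of-envelope sketch of MoE sparse activation, using the
# parameter counts NVIDIA reports for Nemotron 3 Super.
total_params = 120e9   # total parameters
active_params = 12e9   # parameters activated per token

active_fraction = active_params / total_params
print(f"Active per token: {active_fraction:.0%}")  # 10%

# Per-token forward-pass FLOPs scale with active parameters
# (~2 FLOPs per parameter), so the saving vs. a dense 120B model is:
flops_dense = 2 * total_params
flops_moe = 2 * active_params
print(f"Compute reduction vs dense: {flops_dense / flops_moe:.0f}x")  # 10x
```

This is why an MoE model of this size can be served far faster than a dense model with the same total parameter count: each token only pays for the experts routed to it.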

Speed is the headline number. According to Artificial Analysis, Nemotron 3 Super hits 478 output tokens per second, outpacing OpenAI's similarly sized gpt-oss-120B at 264 tokens per second. NVIDIA calls it "the fastest model in its class," and the benchmarks back that up, at least on throughput. The model supports a 1-million-token context window and runs in NVFP4 precision on Blackwell GPUs, which cuts memory requirements versus FP8 on Hopper.
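The memory claim can be sanity-checked the same way. The sketch below assumes NVFP4 stores roughly 4 bits per weight and FP8 roughly 8 bits per weight, and ignores quantization scaling factors, the KV cache, and activations, so the numbers are a lower bound on real serving memory, not a deployment figure.

```python
# Rough weight-memory footprint of a 120B-parameter model at
# different precisions (bits per weight / 8 = bytes per weight).
params = 120e9

def weight_gb(bits_per_weight: float) -> float:
    """Weight storage in GB at the given precision."""
    return params * bits_per_weight / 8 / 1e9

print(f"FP8:   {weight_gb(8):.0f} GB")  # 120 GB
print(f"NVFP4: {weight_gb(4):.0f} GB")  # 60 GB, half the FP8 footprint
```

Halving the bits per weight halves the weight footprint, which is the mechanism behind the Blackwell-vs-Hopper memory comparison in the paragraph above.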

What makes this release unusual: NVIDIA is publishing the full training pipeline alongside weights. That means over 10 trillion tokens of pre- and post-training data, 15 reinforcement learning environments, and evaluation recipes. Checkpoints ship in BF16, FP8, and NVFP4 on Hugging Face. The blog post positions this as a middle tier between last December's 30B Nano and a 500B Ultra model expected later in 2026.

Enterprise partners are already lined up. Perplexity, Google Cloud Vertex AI, Oracle Cloud, and a dozen inference providers offer access from day one. The technical blog details cookbooks for vLLM and SGLang. GTC kicks off next week, and an Ultra announcement there would not be surprising.
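For readers who want to try the model locally, vLLM's standard serving command should apply. The exact Hugging Face repo ID below is an assumption (check NVIDIA's published cookbook for the real name), and the parallelism and context settings are illustrative, not a tested deployment recipe.

```shell
# Hypothetical invocation; the repo ID nvidia/Nemotron-3-Super is an
# assumed placeholder. Tensor parallelism and max context length
# should be sized to your GPU cluster.
vllm serve nvidia/Nemotron-3-Super \
    --tensor-parallel-size 8 \
    --max-model-len 1000000
```

This exposes an OpenAI-compatible endpoint on localhost, which is the usual pattern vLLM cookbooks build on.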


Bottom Line

Nemotron 3 Super delivers 478 tokens per second with only 12B active parameters, making it the fastest open model in the 120B class according to Artificial Analysis.

Quick Facts

  • 120B total parameters, 12B active at inference
  • 478 output tokens/sec (measured by Artificial Analysis)
  • 1-million-token context window
  • 10T+ pretraining tokens and 15 RL environments released openly
  • Available on Hugging Face, build.nvidia.com, Perplexity, OpenRouter
Tags: NVIDIA, Nemotron, open source LLM, agentic AI, Mixture of Experts, Mamba, inference optimization
Andrés Martínez

AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.


