Open-Source AI

NVIDIA Ships Nemotron 3 Super, a 120B Open Model for Agents

A hybrid Mamba-Transformer MoE with 12B active parameters targets agentic AI workloads.

Andrés Martínez
AI Content Writer
March 12, 2026 · 2 min read
[Image: Abstract visualization of a hybrid neural network architecture with branching pathways representing Mamba and transformer layers]

NVIDIA released Nemotron 3 Super on March 11, a 120-billion-parameter open model built for multi-agent AI systems. Only 12 billion parameters activate during inference, thanks to a Mixture-of-Experts design that pairs Mamba layers (for memory efficiency) with transformer layers (for reasoning). The company claims up to 5x higher throughput and 2x better accuracy compared to the previous Nemotron Super, though those figures are self-reported against its own predecessor, not independent benchmarks.
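The arithmetic behind the sparse-activation claim is simple to sketch. This back-of-envelope calculation uses only the parameter counts reported above (120B total, 12B active) and the standard rough rule that forward-pass compute scales with roughly 2 FLOPs per active parameter per token; it is an illustration of the MoE efficiency argument, not a measurement of the model itself.

```python
# Back-of-envelope sketch of MoE sparse activation, using the
# parameter counts NVIDIA reports for Nemotron 3 Super.
total_params = 120e9   # total parameters
active_params = 12e9   # parameters activated per token

active_fraction = active_params / total_params
print(f"Active per token: {active_fraction:.0%}")  # 10%

# Per-token forward-pass FLOPs scale with active parameters
# (~2 FLOPs per parameter), so the saving vs. a dense 120B model is:
flops_dense = 2 * total_params
flops_moe = 2 * active_params
print(f"Compute reduction vs dense: {flops_dense / flops_moe:.0f}x")  # 10x
```

This is why an MoE model of this size can be served far faster than a dense model with the same total parameter count: each token only pays for the experts routed to it.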

Speed is the headline number. According to Artificial Analysis, Nemotron 3 Super hits 478 output tokens per second, outpacing OpenAI's similarly sized gpt-oss-120B at 264 tokens per second. NVIDIA calls it "the fastest model in its class," and the benchmarks back that up, at least on throughput. The model supports a 1-million-token context window and runs in NVFP4 precision on Blackwell GPUs, which cuts memory requirements versus FP8 on Hopper.
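The memory claim can be sanity-checked the same way. The sketch below assumes NVFP4 stores roughly 4 bits per weight and FP8 roughly 8 bits per weight, and ignores quantization scaling factors, the KV cache, and activations, so the numbers are a lower bound on real serving memory, not a deployment figure.

```python
# Rough weight-memory footprint of a 120B-parameter model at
# different precisions (bits per weight / 8 = bytes per weight).
params = 120e9

def weight_gb(bits_per_weight: float) -> float:
    """Weight storage in GB at the given precision."""
    return params * bits_per_weight / 8 / 1e9

print(f"FP8:   {weight_gb(8):.0f} GB")  # 120 GB
print(f"NVFP4: {weight_gb(4):.0f} GB")  # 60 GB, half the FP8 footprint
```

Halving the bits per weight halves the weight footprint, which is the mechanism behind the Blackwell-vs-Hopper memory comparison in the paragraph above.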

What makes this release unusual: NVIDIA is publishing the full training pipeline alongside weights. That means over 10 trillion tokens of pre- and post-training data, 15 reinforcement learning environments, and evaluation recipes. Checkpoints ship in BF16, FP8, and NVFP4 on Hugging Face. The blog post positions this as a middle tier between last December's 30B Nano and a 500B Ultra model expected later in 2026.

Enterprise partners are already lined up. Perplexity, Google Cloud Vertex AI, Oracle Cloud, and a dozen inference providers offer access from day one. The technical blog details cookbooks for vLLM and SGLang. GTC kicks off next week, and an Ultra announcement there would not be surprising.
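For readers who want to try the model locally, vLLM's standard serving command should apply. The exact Hugging Face repo ID below is an assumption (check NVIDIA's published cookbook for the real name), and the parallelism and context settings are illustrative, not a tested deployment recipe.

```shell
# Hypothetical invocation; the repo ID nvidia/Nemotron-3-Super is an
# assumed placeholder. Tensor parallelism and max context length
# should be sized to your GPU cluster.
vllm serve nvidia/Nemotron-3-Super \
    --tensor-parallel-size 8 \
    --max-model-len 1000000
```

This exposes an OpenAI-compatible endpoint on localhost, which is the usual pattern vLLM cookbooks build on.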


Bottom Line

Nemotron 3 Super delivers 478 tokens per second with only 12B active parameters, making it the fastest open model in the 120B class according to Artificial Analysis.

Quick Facts

  • 120B total parameters, 12B active at inference
  • 478 output tokens/sec (measured by Artificial Analysis)
  • 1-million-token context window
  • 10T+ pretraining tokens and 15 RL environments released openly
  • Available on Hugging Face, build.nvidia.com, Perplexity, OpenRouter
Tags: NVIDIA, Nemotron, open source LLM, agentic AI, Mixture of Experts, Mamba, inference optimization
Andrés Martínez

AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.


