Nous Research TST: 2-3x LLM Pretraining Speedup Claim


Nous Research has released a pretraining method called Token Superposition Training (TST) that the lab says speeds up LLM pretraining by a factor of 2 to 3 in wall-clock time at matched FLOPs. The research blog describes a two-phase approach that leaves the final model, optimizer, tokenizer, and training data unchanged.

The trick happens early. For the first 20 to 40 percent of training, the model reads "bags" of contiguous tokens instead of single ones, averaging their embeddings on input and predicting the next bag through a multi-hot cross-entropy loss. Order inside each bag is discarded. The rest of the run reverts to standard next-token prediction, and inference looks identical to that of a conventionally trained model.
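The bag mechanics described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`make_bags`, `bag_embedding`, `multi_hot_cross_entropy`), the toy sizes, the non-overlapping bag layout, and the choice to spread target probability evenly over the tokens of the next bag are all assumptions made for the example. The paper's exact loss normalization may differ.

```python
import math
import random


def make_bags(tokens, bag_size):
    """Group a token sequence into contiguous, non-overlapping bags.

    Assumption: a ragged tail shorter than bag_size is dropped.
    """
    n_bags = len(tokens) // bag_size
    return [tokens[i * bag_size:(i + 1) * bag_size] for i in range(n_bags)]


def bag_embedding(bag, embedding_table):
    """Average the embeddings of a bag's tokens; order inside the bag is lost."""
    dim = len(embedding_table[0])
    return [
        sum(embedding_table[tok][d] for tok in bag) / len(bag)
        for d in range(dim)
    ]


def multi_hot_cross_entropy(logits, target_bag):
    """Cross-entropy against a multi-hot target over the next bag.

    Assumption: target mass is split evenly across the bag's tokens
    (repeats counted); the paper's exact normalization may differ.
    """
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return -sum(logits[tok] - log_z for tok in target_bag) / len(target_bag)


# Toy usage with made-up sizes and a random embedding table.
random.seed(0)
vocab_size, dim, bag_size = 10, 4, 3
embedding_table = [[random.gauss(0.0, 1.0) for _ in range(dim)]
                   for _ in range(vocab_size)]

tokens = [1, 5, 2, 7, 7, 3, 9]
bags = make_bags(tokens, bag_size)          # [[1, 5, 2], [7, 7, 3]]
model_input = bag_embedding(bags[0], embedding_table)

# Stand-in for a model's output logits: all zeros, i.e. a uniform prediction.
logits = [0.0] * vocab_size
loss = multi_hot_cross_entropy(logits, bags[1])
print(bags, round(loss, 4))
```

Because the input is a mean, shuffling the tokens inside a bag yields the same vector, which is the sense in which order is discarded. With uniform logits the loss equals the log of the vocabulary size, a quick sanity check for any variant of this loss.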

Bag size scales with model size. The arXiv paper, submitted May 7 by Bowen Peng, Théo Gigant, and Jeffrey Quesnelle, reports optimal values between 3 and 8 tokens for smaller dense models, with larger bags for the 10B mixture-of-experts variant. Nous tested four scales: 270M, 600M, 3B dense, and a 10B-A1B MoE.

On the biggest run, the company reports its TST model hit a lower final loss than a FLOP-matched baseline in roughly 40 percent of the wall-clock time, consuming 4,768 B200-GPU-hours against the baseline's 12,311. It also beat the baseline on HellaSwag, ARC, and MMLU. All numbers are self-reported, and the paper notes the method burns through more raw training data per unit of compute, which makes it a poor fit for data-bound setups.
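The speedup implied by the two GPU-hour figures can be checked with a short calculation. The variable names are illustrative, and the only inputs are the two company-reported numbers quoted above.

```python
# Company-reported B200-GPU-hours for the 10B-A1B MoE run.
tst_gpu_hours = 4_768
baseline_gpu_hours = 12_311

speedup = baseline_gpu_hours / tst_gpu_hours      # how many times faster
fraction = tst_gpu_hours / baseline_gpu_hours     # share of baseline time used

print(f"speedup: {speedup:.2f}x, fraction of baseline: {fraction:.1%}")
# prints "speedup: 2.58x, fraction of baseline: 38.7%"
```

That is consistent with the "roughly 40 percent" description and falls inside the 2 to 3 times range the lab cites.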

An ablation in the paper kills the gains entirely: re-initializing the input embedding and output head before phase two pushes final loss above the standard baseline. Shared representations across both phases, the authors argue, are what make the method work.

Bottom Line

Nous reports that its 10B-A1B MoE reached a lower final loss than a FLOP-matched baseline in 4,768 B200-GPU-hours, versus 12,311 for the conventional run. All figures are self-reported.

Quick Facts

2-3x wall-clock speedup at matched FLOPs (company-reported)
Up to 2.5x wall-clock reduction at 10B-A1B MoE scale (company-reported)
4,768 vs 12,311 B200-GPU-hours on the 10B MoE run
Superposition phase runs for first 20-40% of training steps
Validated at 270M, 600M, 3B dense and 10B-A1B MoE


Andrés Martínez

AI Content Writer

