Andrej Karpathy has released a new leaderboard tracking how fast you can train a model to GPT-2 performance on a single 8×H100 node. The current record: 3.04 hours, roughly $73 at typical cloud rates. When OpenAI trained the original GPT-2 in 2019, the bill ran to approximately $43,000.
That's roughly a 600× reduction. If the trend holds, training cost for this capability level falls to about 40% of the previous year's figure, every year.
The benchmark that matters
The nanochat repo targets the CORE metric from the DCLM paper, a composite score across 22 benchmarks. GPT-2's CORE score sits at 0.256525. Karpathy's January 29 run hit 0.25851, clearing the bar in 10,949 seconds of wall-clock training time.
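For intuition, CORE is (as I read the DCLM paper) an average of centered accuracies: each task's raw accuracy is rescaled so that random guessing maps to 0 and perfect accuracy maps to 1, then the 22 tasks are averaged. A minimal sketch of that aggregation, with made-up task numbers rather than the real CORE suite:

```python
# Hedged sketch of a CORE-style aggregate (per my reading of DCLM's
# "centered accuracy"); the task entries below are illustrative.
def centered_accuracy(acc: float, baseline: float) -> float:
    """Rescale accuracy so random guessing -> 0.0 and perfect -> 1.0."""
    return (acc - baseline) / (1.0 - baseline)

tasks = [
    # (raw accuracy, random-guess baseline)
    (0.62, 0.25),   # e.g. a 4-way multiple-choice task
    (0.55, 0.50),   # e.g. a binary task
    (0.31, 0.25),
]

core = sum(centered_accuracy(a, b) for a, b in tasks) / len(tasks)
print(f"CORE-style score: {core:.4f}")
```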
The setup is deliberately constrained: one node, eight H100s, FineWeb-edu data. No exotic infrastructure. At Lambda's rate of around $24 per hour for the node, three hours of training comes to about $73. Karpathy frames this as "the best ChatGPT that $100 can buy," which undersells it a bit. You're getting verifiable GPT-2 capability, not a toy.
What actually changed
The 2019 GPT-2 training used 32 TPU v3 chips for roughly a week. The hardware difference alone accounts for some of the gap, but not 600×.
The architecture diverges substantially from the vanilla transformer. RoPE replaces learned positional embeddings. RMSNorm everywhere, with no learnable parameters. Sliding window attention follows an SSSL pattern, in which three short-window layers alternate with one full-context layer. Value embeddings at alternating layers add parameters cheaply. The model has 768M total parameters but only 110M "scaling parameters" by Karpathy's accounting.
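To make two of those ideas concrete, here is a minimal PyTorch-flavored sketch of parameter-free RMSNorm and an SSSL window schedule. It illustrates the description above rather than reproducing nanochat's code; the 1024-token window and the layer loop are assumptions.

```python
import torch

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RMSNorm with no learnable gain: rescale by the root-mean-square
    # of the activations along the channel dimension.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

def window_schedule(n_layers: int, short_window: int = 1024):
    # SSSL: three sliding-window ("short") layers followed by one
    # full-context ("long") layer, repeating. None = full context.
    pattern = [short_window, short_window, short_window, None]
    return [pattern[i % 4] for i in range(n_layers)]

print(window_schedule(8))
# [1024, 1024, 1024, None, 1024, 1024, 1024, None]
```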
Flash Attention 3 contributes a roughly 9% throughput improvement over FA2. It reaches about 75% utilization on H100s versus roughly 35% for Flash Attention 2, by exploiting Hopper-specific features like warp specialization and TMA asynchrony.
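As an aside, stock PyTorch's scaled_dot_product_attention flash backend is FA2-based; FA3 itself ships separately (for example via the flash-attn project's Hopper build). Still, as a hedged illustration of how a training loop typically pins attention to a flash kernel instead of the slower fallback paths:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Requires a CUDA GPU with a flash-capable kernel for these shapes/dtypes.
q = torch.randn(1, 8, 1024, 128, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Restrict SDPA to the FlashAttention backend; raises if no flash kernel
# supports the inputs, rather than silently falling back to the math path.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```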
The Muon optimizer replaces AdamW for weight matrices. Karpathy says he "tried to delete Muon and couldn't." It uses matrix orthogonalization with Nesterov momentum and something called Polar Express for faster convergence. Embeddings and scalars still use AdamW. The split design seems fussy, but apparently nothing else worked as well.
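For reference, the core of Muon as published by Keller Jordan approximately orthogonalizes each 2D weight's momentum-smoothed gradient with a Newton-Schulz iteration. The sketch below uses the widely circulated quintic coefficients; nanochat's variant (with the Polar Express iteration) may differ, so treat this as illustrative rather than the repo's exact code, and the learning rate and momentum handling are simplified.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map G to the nearest (semi-)orthogonal matrix.

    Quintic Newton-Schulz iteration with the coefficients published for
    Muon; operates on a 2D weight-shaped update.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)          # bring the spectral norm into range
    transpose = X.size(0) > X.size(1)
    if transpose:
        X = X.T                         # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transpose:
        X = X.T
    return X.to(G.dtype)

# One simplified Muon-style step (real Muon uses Nesterov momentum and
# per-shape scaling); embeddings and scalars would stay on AdamW.
W = torch.randn(768, 3072)
grad = torch.randn_like(W)
momentum = 0.95 * torch.zeros_like(W) + grad
W -= 0.02 * newton_schulz_orthogonalize(momentum)   # lr is illustrative
```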
What didn't work
The discussion post lists several abandoned ideas. Multi-token prediction added 13GB of memory for no improvement. Varlen attention was unnecessary since the BOS-aligned dataloader handles document boundaries adequately. FP8 for the language model head technically works but costs 2GB extra memory for only 1% speedup.
Half-truncated RoPE, asymmetric softcapping, skip connections, attention gates, batch size schedules: all tried, all rejected. Some of these have worked elsewhere. The lesson is that gains don't automatically transfer across setups. Karpathy ran 320 hyperparameter experiments to tune the final configuration.
The scaling question
Nanochat includes a depth parameter that scales everything else. Increase depth, and model dimension, head count, and optimal token budget scale accordingly. The $73 run uses depth=24. A depth=26 model takes about 12 hours and $300. The $1,000 tier runs 41.6 hours.
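A hedged sketch of what a single depth knob can look like. The ratios below (64 channels of width per layer, 128-dim heads, a Chinchilla-style ~20 tokens per parameter) are assumptions for illustration, not nanochat's actual constants:

```python
from dataclasses import dataclass

@dataclass
class Config:
    depth: int
    # Illustrative ratios, not nanochat's actual constants.
    width_per_layer: int = 64     # model width grows linearly with depth
    head_dim: int = 128           # fixed head size
    tokens_per_param: int = 20    # Chinchilla-style token budget

    @property
    def model_dim(self) -> int:
        return self.depth * self.width_per_layer

    @property
    def n_heads(self) -> int:
        return self.model_dim // self.head_dim

    def token_budget(self, n_params: int) -> int:
        return self.tokens_per_param * n_params

cfg = Config(depth=24)
print(cfg.model_dim, cfg.n_heads)   # 1536 12  (illustrative)
```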
Whether this matters depends on your goals. GPT-2 capability, the specific target here, is modest by current standards. These models can hold conversations and write simple code, but they hallucinate confidently and struggle with reasoning. The point isn't to compete with GPT-4. It's to make LLM training accessible for experimentation and education.
The codebase is about 1,000 lines of meaningful code. No sprawling configuration objects or framework abstractions. You can read it. Karpathy built almost all of it by hand, noting that Claude and Codex agents were "net unhelpful" because the repo is "too far off the data distribution."
The speedrun leaderboard is open for contributions, and improvements to the record are welcome. Most of the relevant work happens in three files: base_train.py, gpt.py, and optim.py. Nanochat is available on GitHub.