
Karpathy's Nanochat Trains GPT-2 Level Model for $73, Down from $43K in 2019

A 600x cost reduction in seven years, driven by better hardware, algorithms, and data.

Oliver Senti, Senior AI Editor
February 1, 2026
[Image: a stopwatch showing 3 hours next to roughly $73 in cash, with a GPU server silhouette in the background.]

Andrej Karpathy has released a new leaderboard tracking how fast you can train a model to GPT-2 performance on a single 8×H100 node. The current record: 3.04 hours, roughly $73 at typical cloud rates. When OpenAI trained the original GPT-2 in 2019, the bill ran to approximately $43,000.

That's roughly a 600× reduction. If the trend holds, training cost at this capability level falls to about 40% of the previous year's figure, every year.
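
A quick back-of-the-envelope check of those figures, using only numbers quoted in this article (the ~$43,000 2019 bill, the 3.04-hour run, and the ~$24/hour node rate):

    # Back-of-the-envelope check of the cost-reduction claims above.
    cost_2019 = 43_000        # approximate 2019 GPT-2 training bill, USD
    cost_2026 = 3.04 * 24     # 3.04 hours on an 8xH100 node at ~$24/hour
    years = 7                 # 2019 -> 2026

    reduction = cost_2019 / cost_2026
    yearly_factor = reduction ** (1 / years)

    print(f"2026 run: ~${cost_2026:.0f}")                   # ~$73
    print(f"overall reduction: ~{reduction:.0f}x")          # ~589x, i.e. roughly 600x
    print(f"cost vs. prior year: {1 / yearly_factor:.0%}")  # ~40%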

The benchmark that matters

The nanochat repo targets the CORE metric from the DCLM paper, a composite score across 22 benchmarks. GPT-2's CORE score sits at 0.256525. Karpathy's January 29 run hit 0.25851, clearing the bar in 10,949 seconds of wall-clock training time.
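
For concreteness, "clearing the bar" is just a threshold comparison on a composite number. Below is a minimal sketch, assuming the composite is the mean of per-task accuracies rescaled so that chance-level performance maps to zero; the task entries are placeholders, not DCLM's actual suite:

    GPT2_CORE = 0.256525   # GPT-2's CORE score, the speedrun target

    # Placeholder (accuracy, random-chance baseline) pairs standing in for the 22 CORE tasks.
    results = [(0.62, 0.25), (0.55, 0.50), (0.41, 0.25)]

    def centered(acc: float, baseline: float) -> float:
        """Rescale accuracy so chance-level performance scores 0 and a perfect score is 1."""
        return (acc - baseline) / (1.0 - baseline)

    core = sum(centered(a, b) for a, b in results) / len(results)
    print(f"CORE = {core:.4f}, clears the GPT-2 bar: {core >= GPT2_CORE}")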

The setup is deliberately constrained: one node, eight H100s, FineWeb-edu data. No exotic infrastructure. At Lambda's rate of around $24/hour for the node, the roughly three-hour run comes to about $73. Karpathy frames this as "the best ChatGPT that $100 can buy," which undersells it a bit. You're getting verifiable GPT-2 capability, not a toy.

What actually changed

The 2019 GPT-2 training used 32 TPU v3 chips for roughly a week. The hardware difference alone accounts for some of the gap, but not 600×.

The architecture diverges substantially from the vanilla transformer. RoPE replaces learned positional embeddings. RMSNorm is used everywhere, with no learnable parameters. Sliding window attention follows an SSSL pattern, in which three short-window layers alternate with one full-context layer. Value embeddings at alternating layers add parameters cheaply. The model has 768M total parameters but only 110M "scaling parameters" by Karpathy's accounting.
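
As one concrete example of the "no learnable parameters" choice, a parameter-free RMSNorm just rescales activations by their root-mean-square. A minimal PyTorch sketch (not nanochat's exact code):

    import torch

    def rmsnorm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        """Parameter-free RMSNorm: divide by the root-mean-square, no learned gain or bias."""
        return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

    x = torch.randn(2, 8, 768)   # (batch, sequence, model_dim)
    print(rmsnorm(x).shape)      # torch.Size([2, 8, 768])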

Flash Attention 3 contributes a roughly 9% throughput improvement over FA2. It reaches 75% utilization on H100s versus 35% for Flash Attention 2, exploiting Hopper-specific features like warp specialization and TMA asynchrony.

The Muon optimizer replaces AdamW for weight matrices. Karpathy says he "tried to delete Muon and couldn't." It uses matrix orthogonalization with Nesterov momentum and something called Polar Express for faster convergence. Embeddings and scalars still use AdamW. The split design seems fussy, but apparently nothing else worked as well.
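
For readers unfamiliar with Muon, the core idea is to apply momentum to a weight matrix's gradient and then orthogonalize the resulting update before stepping. The sketch below is a simplification for a single matrix, using the standard Newton-Schulz quintic iteration for the orthogonalization step rather than the Polar Express variant mentioned above; it is not nanochat's implementation:

    import torch

    def orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
        """Approximately map a matrix to its nearest semi-orthogonal matrix (Newton-Schulz quintic)."""
        a, b, c = 3.4445, -4.7750, 2.0315   # coefficients used by the reference Muon code
        x = g / (g.norm() + 1e-7)           # normalize so the iteration converges
        transposed = x.size(0) > x.size(1)
        if transposed:
            x = x.T
        for _ in range(steps):
            s = x @ x.T
            x = a * x + (b * s + c * s @ s) @ x
        return x.T if transposed else x

    def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
        """One Muon-style update for a single 2D weight matrix (Nesterov momentum + orthogonalization)."""
        momentum_buf.mul_(beta).add_(grad)
        update = grad.add(momentum_buf, alpha=beta)   # Nesterov-style lookahead
        weight.add_(orthogonalize(update), alpha=-lr)

    w = torch.randn(768, 3072)
    buf = torch.zeros_like(w)
    muon_step(w, torch.randn_like(w), buf)

In the split design described above, a step like this would apply only to the 2D weight matrices; embeddings and scalars stay on AdamW.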

What didn't work

The discussion post lists several abandoned ideas. Multi-token prediction added 13GB of memory for no improvement. Varlen attention was unnecessary, since the BOS-aligned dataloader already handles document boundaries adequately. FP8 for the language model head technically works but costs an extra 2GB of memory for only a 1% speedup.

Half-truncated RoPE, asymmetric softcapping, skip connections, attention gates, batch size schedules: all tried, all rejected. Some of these have worked elsewhere. The lesson is that gains don't automatically transfer across setups. Karpathy ran 320 hyperparameter experiments to tune the final configuration.

The scaling question

Nanochat includes a depth parameter that scales everything else. Increase depth, and model dimension, head count, and optimal token budget scale accordingly. The $73 run uses depth=24. A depth=26 model takes about 12 hours and $300. The $1,000 tier runs 41.6 hours.
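
A hedged sketch of what such a single-knob configuration can look like; the constants below (64 channels per layer of depth, 128-dimensional heads, a fixed tokens-per-parameter ratio) are illustrative assumptions, not nanochat's actual scaling rules:

    from dataclasses import dataclass

    @dataclass
    class TrainPlan:
        depth: int
        model_dim: int
        n_heads: int
        token_budget: int

    def plan_from_depth(depth: int,
                        dim_per_layer: int = 64,    # assumption: width grows linearly with depth
                        head_dim: int = 128,        # assumption: fixed head size
                        tokens_per_param: int = 20  # assumption: Chinchilla-style data budget
                        ) -> TrainPlan:
        """Derive model width, head count, and token budget from a single depth knob (illustrative)."""
        model_dim = depth * dim_per_layer
        n_heads = model_dim // head_dim
        approx_params = 12 * depth * model_dim ** 2   # rough transformer parameter count
        return TrainPlan(depth, model_dim, n_heads, tokens_per_param * approx_params)

    print(plan_from_depth(24))   # the $73 tier described above
    print(plan_from_depth(26))   # the roughly 12-hour, ~$300 tier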

Whether this matters depends on your goals. GPT-2 capability, the specific target here, is modest by current standards. These models can hold conversations and write simple code, but they hallucinate confidently and struggle with reasoning. The point isn't to compete with GPT-4. It's to make LLM training accessible for experimentation and education.

The codebase is about 1,000 lines of meaningful code. No sprawling configuration objects or framework abstractions. You can read it. Karpathy built almost all of it by hand, noting that Claude and Codex agents were "net unhelpful" because the repo is "too far off the data distribution."


The speedrun leaderboard is open for contributions. Most work happens in three files: base_train.py, gpt.py, and optim.py. Improvements to the record are welcome. Nanochat is available on GitHub.

Oliver Senti

Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.
