Andrej Karpathy released the first version of his nanochat miniseries on January 7th. The pitch: stop thinking about training one model, start thinking about training a family of models controlled by a single dial (compute). Turn the dial up, get monotonically better results. That's the whole thing.
The actual insight here
Here's what Karpathy is getting at: if you do the scaling-law analysis correctly, you don't have to wonder whether spending more compute will actually help. You just know. The curves don't intersect. Each model size has one correct training length, and you can calculate it ahead of time.
He trained 11 models (d10 through d20) in about 4 hours on 8×H100s. Total cost: roughly $100. The validation loss curves look exactly how they should, clean and parallel, no crossing. If your curves cross, something's wrong with your setup.
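The "curves don't cross" property is easy to check mechanically: sample two models' validation losses at the same compute points and see whether their ordering ever flips. A minimal sketch of that check; the function name and data are illustrative, not nanochat's code:

```python
def curves_cross(loss_a, loss_b):
    """Return True if two validation-loss curves, sampled at the same
    compute points, ever swap order -- the failure mode described above."""
    diffs = [a - b for a, b in zip(loss_a, loss_b)]
    signs = {d > 0 for d in diffs if d != 0}
    return len(signs) > 1

# Clean, parallel curves: the smaller model is worse at every budget.
curves_cross([3.0, 2.5, 2.2], [2.8, 2.4, 2.1])  # False

# Crossing curves: something is wrong with the setup.
curves_cross([3.0, 2.3], [2.8, 2.4])  # True
```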
Chinchilla, but the ratio is different
So about those scaling laws. Nanochat reproduces the Chinchilla finding that parameters and tokens scale equally with compute, both going up as C^0.5. This means the optimal ratio of data to parameters is constant regardless of how much compute you throw at it.
But here's the interesting part: Chinchilla measured that ratio at 20. Nanochat gets 8.
That's a big difference. Karpathy floats two possibilities: maybe Muon (the optimizer) prefers bigger models trained shorter, or maybe it's an artifact of the smaller scale he's working at. He doesn't commit to either explanation. I appreciate that he just says "I don't know" instead of making something up.
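The arithmetic behind the C^0.5 scaling is short enough to write out. Assuming the standard C ≈ 6ND approximation (C compute in FLOPs, N parameters, D training tokens) and a fixed tokens-per-parameter ratio D/N = r, you can solve for the optimal allocation directly. This is a sketch of the Chinchilla-style calculation, not nanochat's code:

```python
import math

def optimal_allocation(C, ratio):
    """Split a FLOPs budget C between parameters N and tokens D,
    assuming C = 6*N*D and a fixed ratio D/N. Substituting D = ratio*N
    gives C = 6*ratio*N**2, so both N and D scale as C**0.5."""
    N = math.sqrt(C / (6 * ratio))
    D = ratio * N
    return N, D

# Same 1e19 FLOPs budget under Chinchilla's ratio of 20 vs nanochat's 8:
n20, d20 = optimal_allocation(1e19, 20)
n8, d8 = optimal_allocation(1e19, 8)
# A ratio of 8 buys a bigger model trained on fewer tokens --
# sqrt(20/8), roughly 1.6x more parameters for the same compute.
```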
The CORE metric thing
Validation loss comparisons are annoying. Different data distributions, different tokenizers, different everything. Karpathy wanted to compare against GPT-2 and GPT-3, so he pulled in the CORE metric from the DCLM paper. 22 benchmark tasks, aggregated into one number.
For GPT-2 this was straightforward since the models are public. For GPT-3 he had to get creative. The models were never released, but the paper has evaluation tables. He found 6 tasks that overlap between CORE and what's reported in the GPT-3 paper, used GPT-2 for calibration, and estimated the rest. The methodology is in a Jupyter notebook if you want to check his work.
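The aggregation idea behind CORE is to average "centered" accuracies, so that random guessing scores 0 and perfect accuracy scores 1 on every task regardless of how many answer choices it has. A minimal sketch of that idea, not nanochat's or DCLM's exact implementation; the example numbers are made up:

```python
def core_score(results):
    """Aggregate per-task accuracies into one CORE-style number.
    Each entry is (accuracy, chance_baseline). Centering maps the
    chance baseline to 0 and perfect accuracy to 1 before averaging,
    making tasks with different numbers of choices comparable."""
    centered = [(acc - base) / (1 - base) for acc, base in results]
    return sum(centered) / len(centered)

# e.g. two 4-way multiple-choice tasks (chance 0.25) and one
# binary task (chance 0.5):
score = core_score([(0.45, 0.25), (0.60, 0.5), (0.30, 0.25)])
```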
What it costs to match GPT-2
The extrapolation table at the end is the part everyone's going to screenshot. Matching GPT-2 Small (124M params, CORE 0.114): about $3, 8 minutes. GPT-2 XL (1.6B params, CORE 0.257): $546, 22 hours.
There's a lot of extrapolation happening here. Karpathy says so himself. But as a sanity check, the predicted FLOPs to reach GPT-3 175B performance is 5.7e23. The actual GPT-3 training run was around 3e23. Same ballpark.
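That sanity check is one line of arithmetic with the C ≈ 6ND rule of thumb. GPT-3 175B was trained on roughly 300B tokens (a figure from the GPT-3 paper), which lands right in the "around 3e23" ballpark:

```python
# Back-of-envelope check on the GPT-3 training compute quoted above,
# using the standard C = 6 * N * D approximation.
N = 175e9   # parameters
D = 300e9   # training tokens (GPT-3 paper)
C = 6 * N * D   # about 3.15e23 FLOPs
```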
The modded-nanogpt dig
There's a brief shot at modded-nanogpt buried in the discussion. Karpathy says some changes "mildly gamed the metric" by using batch size 1 with very long sequences. The issue: fewer tokens end up with cropped context at the start of batches, which artificially improves validation loss. The improvement isn't real.
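The mechanism is simple to quantify: with a fixed token budget, longer sequences mean fewer sequence starts, so a smaller fraction of tokens are predicted with little preceding context. A sketch under assumptions of mine (the 512-token "short context" window is an illustrative threshold, not anything from the source):

```python
def cropped_context_fraction(seq_len, window=512):
    """Fraction of tokens in each training sequence that see fewer than
    `window` tokens of preceding context. Swapping, say, batch size 32
    with 2048-token sequences for batch size 1 with 65536-token
    sequences shrinks this fraction -- and validation loss -- without
    the model actually getting any better."""
    return min(window, seq_len) / seq_len

cropped_context_fraction(2048)    # 0.25 of tokens have short context
cropped_context_fraction(65536)   # under 1%
```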
He's being polite about it. "A bit subtle" is doing a lot of work there.
The boring parts, quickly
Hyperparameter tuning happened. Learning rates are close to optimal. Warmdown ratio moved from 0.2 to 0.4. Sequence length of 2048 balances context with document diversity. Batch size of 0.5M tokens is slightly large for pure FLOPs efficiency but good for wall-clock time. "By no means exhaustive."
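Collected in one place, the settings above look like this. The dict keys are my illustrative names, not nanochat's actual variable names:

```python
# The tuned settings described above (hypothetical key names):
config = dict(
    warmdown_ratio=0.4,    # moved up from 0.2
    seq_len=2048,          # context length vs. document diversity
    batch_tokens=524_288,  # ~0.5M tokens; tuned for wall-clock time,
                           # slightly large for pure FLOPs efficiency
)
```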
What's next
Miniseries v2 has a simple goal: lift the line. Same framework, better bang for buck. The pretraining code is the target.
The real value here isn't any single model. It's that someone finally put together a clean, reproducible demonstration that scaling laws actually work at accessible scales. You can verify this yourself for about $100. The code is up. The methodology is documented.
Whether the Chinchilla ratio really should be 8 instead of 20 is going to take more experiments at larger scale to figure out.