Andrej Karpathy thinks today's frontier models are oversized, and not for the reason most people assume. His pitch, laid out at length on Dwarkesh Patel's podcast in an episode whose transcript was published in October, is that the technology is not the bottleneck. The training data is.
The 0.07-bit problem
Take Llama 3's 70B variant, trained on 15 trillion tokens. Karpathy's back-of-envelope math on the podcast puts the model's effective storage at roughly 0.07 bits for every token it read, a strikingly small number. For comparison, during inference the KV cache holds around 320 kilobytes per token of in-context material. The ratio between the two figures works out to roughly 35 million to one.
Translation: after reading most of the open web, the model remembers almost nothing of it with any precision. What it retains is a vague, lossy impression, and the parameters act less like a lookup table than like memory hardware for a noisy compression job.
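The arithmetic is easy to check. Here is a minimal sketch that reproduces the figures, assuming the 70 billion parameters are counted at 16 bits each; that assumption is mine, since Karpathy did not spell out the derivation on air:

```python
# Back-of-envelope check on the podcast numbers.
# Assumption: weights counted at 16 bits per parameter.
params = 70e9                # Llama 3 70B
tokens = 15e12               # pre-training tokens
kv_bytes_per_token = 320e3   # ~320 KB of KV cache per in-context token

weight_bits = params * 16                  # total bits held in the weights
bits_per_token = weight_bits / tokens      # effective storage per token read
kv_bits_per_token = kv_bytes_per_token * 8

print(f"bits per trained token: {bits_per_token:.3f}")          # ~0.075
print(f"KV-cache bits per token: {kv_bits_per_token:,.0f}")     # 2,560,000
print(f"ratio: {kv_bits_per_token / bits_per_token:,.0f} to one")  # ~34 million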
Why is the signal so thin? Because the pre-training corpus is mostly junk. Whatever your mental image of "the internet" looks like, it is not Wikipedia and the Wall Street Journal. Open a real web dump to a random page and you get stock tickers, broken HTML, spam, SEO slop, and prose that barely parses. Karpathy's framing is that labs are building trillion-parameter compression engines because that is what it takes to squeeze a coherent worldview out of that slurry.
What he actually wants
The alternative he keeps pitching is a "cognitive core." In a post on X, he described it as a few-billion-parameter model that deliberately gives up encyclopedic recall for capability. Always-on, aggressively tool-using, multimodal at both input and output, with local fine-tuning slots for personalization. Facts live outside the model. The model looks them up.
His example was oddly specific. The core does not need to memorize that William the Conqueror's reign ended in September 1087. It should recognize the name, know it does not know the date, and fetch it. Same for SHA-256 hashes. Compute them when the user asks, do not bake them into weights.
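A minimal sketch of that division of labor, with a hypothetical lookup_fact() standing in for whatever retrieval backend the core would call; the function names and routing are illustrative, not anything Karpathy specified:

```python
import hashlib

def lookup_fact(query: str) -> str:
    """Hypothetical retrieval call. In a real system this hits a search
    index or knowledge base, not the model's weights."""
    external_store = {
        "end of William the Conqueror's reign": "September 1087",
    }
    return external_store.get(query, "unknown")

def sha256_tool(text: str) -> str:
    # Hashes get computed on demand, never memorized in weights.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# The core recognizes the name, knows it lacks the date, and fetches it.
print(lookup_fact("end of William the Conqueror's reign"))  # September 1087

# And it computes rather than recalls.
print(sha256_tool("hello")[:16])  # first 16 hex chars of the digest
```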
On the podcast he pushed the target size lower: a billion parameters, maybe, if the training data were clean enough. When Patel suggested the floor could be even smaller, Karpathy pulled back: a billion knobs is probably the minimum for "something interesting." Those were his words, and he admitted he already felt contrarian at that number.
The catch nobody is solving
There is no clean internet. That is the part of the plan that stays hand-wavy. Synthetic data has its own collapse problems, which Karpathy warned about separately in the same interview. Curated high-quality pre-training corpora are a research frontier, not a shipping product, and most of the public work here lives in papers rather than deployed models.
The industry is drifting in this direction anyway. Open models like Gemma and Qwen keep getting better at reasoning benchmarks while sitting in the 1B to 10B range, and inference prices have dropped sharply over the past two years on the back of smaller, better-trained checkpoints. Whether any of this crosses into the "cognitive core" Karpathy imagines is a separate question, and one that probably does not get answered in 2026.
Dwarkesh's full interview runs about two hours. The cognitive core segment sits in the first half, between the AGI-timeline discussion and Karpathy's broadside against reinforcement learning, which he calls terrible on roughly the same grounds: thin signal buried in noise.