Andrej Karpathy dropped his 2025 year-in-review on December 20, and it reads less like a retrospective than a paradigm autopsy. The former Tesla AI director and OpenAI co-founder identifies six "paradigm changes" that reshaped how LLMs work, what they are, and how we interact with them. Some of these shifts were anticipated. Others, by his own admission, blindsided him.
The through-line connecting all six is a growing intuition that we've been thinking about AI intelligence wrong. Not incrementally wrong. Categorically wrong.
The fourth training stage nobody planned for
For years, the recipe for building a production LLM looked stable. Pretraining (the GPT-2/3 era, circa 2020), supervised fine-tuning (InstructGPT, 2022), and reinforcement learning from human feedback (RLHF, also 2022). Three stages. Well understood. Expensive but predictable.
Then came Reinforcement Learning from Verifiable Rewards.
RLVR didn't emerge from a single paper or announcement. It crystallized as labs discovered that training LLMs against automatically verifiable rewards, particularly in math and coding, produced something that looked suspiciously like reasoning. The DeepSeek R1 paper, released in January 2025, became the canonical reference. When models were trained on problems whose correctness could be checked programmatically, they spontaneously developed intermediate calculation steps, self-correction behaviors, and what Karpathy calls "problem solving strategies for going back and forth."
The key distinction from RLHF: RLVR rewards aren't gameable. A math answer is either correct or it isn't. There's no human labeler to fool, no preference model to exploit. This allowed for far longer optimization runs than the relatively thin SFT and RLHF stages that preceded it.
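The mechanics can be sketched in a few lines. This is a hedged illustration, not any lab's actual pipeline: the `Answer:` convention and the function names are assumptions, but the core idea, a reward computed by programmatic checking rather than by a human or a learned preference model, is exactly what distinguishes RLVR.

```python
# Illustrative sketch of a verifiable reward in the RLVR spirit.
# The 'Answer: <value>' convention and function names are assumptions.

def extract_final_answer(completion: str) -> str:
    """Pull the final answer out of a reasoning trace, assuming the
    model ends with a line of the form 'Answer: <value>'."""
    for line in reversed(completion.strip().splitlines()):
        if line.startswith("Answer:"):
            return line.removeprefix("Answer:").strip()
    return ""

def verifiable_reward(completion: str, expected: str) -> float:
    """1.0 if the checked answer matches, else 0.0 -- there is no
    labeler to fool and no preference model to exploit."""
    return 1.0 if extract_final_answer(completion) == expected else 0.0

trace = "Let x = 12 * 7.\n12 * 7 = 84.\nAnswer: 84"
print(verifiable_reward(trace, "84"))  # 1.0
```

Because the signal is binary and automatic, the same checker can score millions of rollouts, which is what makes the long optimization runs economical.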
And here's where the economics shifted. RLVR offered, in Karpathy's framing, "high capability per dollar." Labs started redirecting compute originally earmarked for pretraining into extended RL runs instead. The 2025 frontier models weren't dramatically larger than their predecessors. They'd been trained longer against verifiable objectives.
OpenAI's o1 (late 2024) was the first public demonstration. But Karpathy marks o3, released early this year, as the real inflection point, the moment when "you could intuitively feel the difference." A new scaling knob emerged too: test-time compute. Longer reasoning traces, more "thinking time," more capability. The tradeoff between training compute and inference compute became negotiable in ways it hadn't been before.
Why "jagged intelligence" changes everything
This is where Karpathy's analysis gets uncomfortable. The RLVR breakthrough created something weird. Not uniformly smarter models. Models that spike in capability wherever verifiable rewards exist, and plateau or crater everywhere else.
He calls this "jagged intelligence," borrowing from a meme that's circulated in AI circles for months. Human intelligence is jagged too, obviously, but in predictable ways shaped by evolutionary pressures. We're good at spatial reasoning, social dynamics, pattern recognition in the physical world. We struggle with large numbers, long time horizons, formal logic without external aids.
LLM jaggedness follows a completely different topology. These systems are, as Karpathy puts it, "simultaneously a genius polymath and a confused and cognitively challenged grade schooler, seconds away from getting tricked by a jailbreak to exfiltrate your data."
The metaphor he's settled on: we're not evolving animals. We're summoning ghosts.
Everything about the LLM stack differs from biological intelligence. The architecture, the training data, the optimization pressure. Human neural nets evolved to keep a tribe alive in a jungle. LLM neural nets are optimized to imitate text, collect rewards on math puzzles, and win upvotes on LMSYS Arena. The resulting entities occupy entirely different regions of mind-space.
This framing matters because it predicts benchmark behavior. Benchmarks, by construction, are verifiable environments. Which means they're immediately susceptible to RLVR. Labs construct training environments adjacent to benchmark distributions, grow capability spikes to cover them, and produce models that crush leaderboards while remaining unreliable in the wild. "Training on the test set," Karpathy notes, "is a new art form."
His trust in benchmarks has collapsed accordingly. What does it look like to dominate every benchmark without achieving AGI? Apparently, it looks like 2025.
The emergence of LLM apps as infrastructure
Cursor became unavoidable this year. But Karpathy's interest isn't in the product itself. It's in what Cursor revealed about the emerging software stack.
A new layer appeared. Not models, not applications in the traditional sense. Something in between. LLM apps that bundle context engineering, orchestrate multiple model calls into complex directed acyclic graphs (DAGs), provide application-specific GUIs, and offer what Karpathy calls an "autonomy slider," letting users dial between tight human control and longer AI leashes.
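A toy sketch makes the shape of this layer concrete. Everything here is hypothetical, `call_model` and `ask_human` are stand-ins, not any vendor's API, but it shows the pattern: chained model calls, plus a threshold that decides when the human gets pulled back into the loop.

```python
# Hypothetical sketch of the "LLM app" layer: a few chained model calls
# with an autonomy threshold. call_model and ask_human are stand-ins.

def call_model(prompt: str) -> str:
    """Stand-in for a real model API call."""
    return f"<completion for: {prompt!r}>"

def ask_human(question: str) -> str:
    """Stand-in for a human-in-the-loop checkpoint."""
    return input(question)

def run_app(task: str, autonomy: float) -> str:
    # Stage 1: context engineering -- gather what the model needs.
    context = call_model(f"Summarize relevant context for: {task}")
    # Stage 2: draft a plan from that context.
    plan = call_model(f"Plan steps for {task} given {context}")
    # The autonomy slider: below the threshold, a human reviews the plan.
    if autonomy < 0.5:
        plan = ask_human(f"Approve or edit this plan: {plan}\n> ") or plan
    # Stage 3: execute the (possibly human-edited) plan.
    return call_model(f"Execute: {plan}")
```

The slider is just a scalar here; real products expose it as modes (tab-complete, edit-selection, full-agent), but the control flow is the same idea.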
The debate over this layer's thickness has been running all year. Will foundation model labs capture the entire value chain, or does meaningful defensibility exist for applications built on top? Karpathy's bet: labs will train increasingly capable generalists (he calls them "the generally capable college student"), but vertical apps will organize, fine-tune, and deploy these generalists into specific domains by adding private data, sensors, actuators, and feedback loops.
The precedent he's pointing to is the relationship between operating systems and applications. Windows didn't eliminate the need for Photoshop.
Claude Code and the localhost paradigm
If Cursor demonstrated the LLM app layer, Claude Code demonstrated something adjacent: an agent that lives on your computer.
Karpathy considers this Anthropic's structural advantage over OpenAI's competing Codex efforts. OpenAI, in his view, focused on cloud deployments in containers orchestrated from ChatGPT. Anthropic shipped a CLI tool that runs locally, integrates with your specific environment, and strings together tool use and reasoning in extended problem-solving loops.
The distinction seems subtle. It's not. Cloud-first agents assume a certain level of capability and autonomy that current models don't reliably possess. Local agents can work "hand in hand with developers and their specific setup," compensating for jagged capabilities with constant human supervision.
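The loop a local agent runs can be sketched roughly as follows. This is an assumption-laden illustration, not Claude Code's implementation: the `RUN:`/`DONE:` protocol and the placeholder policy are invented for clarity. The point is that tool calls execute in the user's own environment, with the human nearby to catch jagged failures.

```python
# Hedged sketch of a local agent loop: reason, pick a tool, run it on
# the user's machine, feed the result back. The RUN:/DONE: protocol
# and placeholder policy are illustrative assumptions.

import subprocess

def run_shell(cmd: str) -> str:
    """Execute a command locally -- the agent lives in your environment."""
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return out.stdout + out.stderr

def agent_step(observation: str) -> str:
    """Stand-in for a model call that returns either a shell command
    prefixed 'RUN: ' or a final answer prefixed 'DONE: '."""
    return "DONE: " + observation  # placeholder policy for the sketch

def agent_loop(task: str, max_steps: int = 5) -> str:
    observation = task
    for _ in range(max_steps):
        action = agent_step(observation)
        if action.startswith("DONE:"):
            return action.removeprefix("DONE:").strip()
        observation = run_shell(action.removeprefix("RUN:").strip())
    return observation
```

A cloud-first agent runs the same loop, but inside a container with a synthetic environment; the local version trades sandboxing for direct access to the developer's actual setup.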
Claude Code "changed what AI looks like," Karpathy writes. "It's not just a website you go to like Google, it's a little spirit/ghost that 'lives' on your computer."
The framing is deliberately weird. He's leaning into the ghost metaphor. It captures something important about the uncanny relationship between these tools and their users: intimate, asymmetric, and fundamentally inhuman despite the natural language interface.
Vibe coding wasn't supposed to be a thing
In February 2025, Karpathy posted a tweet. A shower thought, by his own description. "There's a new kind of coding I call 'vibe coding', where you fully give in to the vibes, embrace exponentials, and forget that the code even exists."
He expected nothing from it. The post went viral. Within weeks it had a Wikipedia page. By year end, Y Combinator reported that 25% of its Winter 2025 batch had codebases that were 95% AI-generated. Andrew Ng pushed back on the term. Simon Willison wrote essays distinguishing vibe coding from responsible AI-assisted programming. The Wall Street Journal covered it entering commercial use. Merriam-Webster acknowledged it.
Karpathy remains amused and slightly bewildered by the trajectory. But he's also synthesized what he thinks it means.
Vibe coding represents another instance of what he'd identified earlier in his "Power to the People" essay: LLMs flip the usual technology diffusion pattern. Historically, transformative technologies start with governments and corporations before reaching consumers. Early computers did ballistics. The internet was ARPANET. GPS was military.
LLMs inverted this. ChatGPT reached 100 million users faster than any consumer product in history. Regular people got productivity boosts before Fortune 500 companies figured out deployment strategies. And vibe coding democratized software creation in a way that no previous paradigm had approached.
The capability threshold crossed sometime this year. Anyone who can describe what they want in English can now produce working software. Not necessarily good software. Not maintainable software. But functional software that would have required professional developers twelve months ago.
Karpathy vibe-coded a Rust tokenizer for one of his projects. He built MenuGen, a menu-image generator that he describes as "a major cost center in my life." He's written entire ephemeral apps just to debug something, then discarded them after single use. "Code is suddenly free, ephemeral, malleable, discardable."
The implications he's tracking: job description rewrites, software terraforming, and a generation of new programmers who may never learn traditional syntax.
The LLM GUI problem, and Nano Banana's answer
The last paradigm shift Karpathy identifies is the weirdest, and possibly the most important for understanding where this goes next.
Chatting with LLMs, he argues, is analogous to issuing commands to a computer console in the 1980s. Text is the native representation for these systems, but it's not the preferred format for humans. We find reading effortful. We process visual and spatial information far more efficiently. This mismatch explains why the GUI was invented for traditional computing.
LLMs need an equivalent. They should speak to us in images, infographics, slides, animations, interactive web apps. The current accommodations (emoji, Markdown formatting, tables) are just dressed-up text. They're not the full solution.
Google's Nano Banana, released in its initial form in late August and upgraded to Nano Banana Pro in November, represents Karpathy's first candidate for what that solution might look like. What makes it notable isn't pure image generation quality. It's the joint capability emerging from text generation, image generation, and world knowledge all tangled together in the same weights.
The model connects to Google Search. It can generate infographics about real-world topics by pulling current data. It produces legible text directly in images. It maintains character consistency across edits. These capabilities weren't possible when image generation and language modeling lived in separate systems.
The implication: we're approaching a point where LLMs can communicate with humans in humans' preferred format rather than forcing humans to adapt to the machine's native representation. That shift, if it generalizes, changes the entire interaction paradigm.
The summary that isn't a summary
Karpathy closes with a characteristic both/and that captures his positioning all year. LLMs are simultaneously smarter than he expected and dumber than he expected. They're extremely useful and the industry hasn't realized anywhere near 10% of their potential at present capability. There are countless ideas to try and the field feels wide open. We'll see rapid continued progress and yet there's a lot of work left.
The paradox is only apparent. In a domain with jagged intelligence, where capability spikes coexist with embarrassing failures, both claims can be true. We're summoning increasingly powerful ghosts. We're learning what they can and can't do. We're building infrastructure to work alongside them despite their alien shape.
And we're doing all of this faster than anyone planned for, with benchmarks that no longer mean what they used to, in a year that Karpathy considers just the beginning.