GPT-Image-2 Leaked on LM Arena Under Tape Codenames

Three unnamed image generation models showed up on LM Arena on Friday under the codenames maskingtape-alpha, gaffertape-alpha, and packingtape-alpha. Within hours, the AI community had a working theory: this is OpenAI's GPT-Image-2, tested in the wild using the exact playbook Google used with Nano Banana last August.

Developer Pieter Levels was among the first to call it out on X, posting that the models showed "extremely good world knowledge and great text rendering" and speculating they could outperform Nano Banana Pro. Venture investor Justine Moore followed up with her own tests, noting that simple prompts like "average engineer's screen" and "young woman taking selfie with Sam Altman" produced results with an uncanny level of contextual awareness. The Sam Altman detail is telling: ask yourself which company's training data would contain enough photos of him to render his face accurately from a casual prompt.

What actually works (and what doesn't)

Community testing has been happening fast. One user found that packingtape-alpha correctly rendered the time on a watch, something Nano Banana Pro botched. In a side-by-side of a first-person Minecraft scene set in Manhattan, maskingtape-alpha beat its own siblings and Nano Banana Pro. Another comparison using the prompt "top-down strategy game about optimizing an AI data center" made Nano Banana Pro look a generation behind.

The models are particularly strong on UI screenshots and game interfaces, which suggests heavy training on screen captures. Text rendering, the traditional Achilles' heel of image generators, appears meaningfully better than the current competition.

But they still can't pass the Rubik's Cube reflection test, a spatial reasoning benchmark that trips up every image model I've seen. And reports circulating on Russian-language Telegram channels describe aggressive content filtering that produces some bizarre results, including one case where the model allegedly rendered a map of Africa with "CIGER" instead of "Niger." I couldn't verify that specific claim independently, but it would track with OpenAI's historically strict approach to content safety in image generation.

The Nano Banana playbook

Here's the thing. We've seen this movie before.

In August 2025, Nano Banana appeared on LM Arena with no branding, racked up 2.5 million votes, and built the largest Elo lead in Arena history at 171 points. Google teased its involvement through banana emoji posts from executives including Demis Hassabis before formally confirming it was Gemini 2.5 Flash Image. The model brought 10 million new users to Gemini and briefly pushed the Gemini app past ChatGPT to the top of the App Store.

That whole sequence (anonymous Arena debut, community hype, corporate reveal) appears to have left an impression on OpenAI. The adhesive-tape naming convention even feels like a direct nod to Google's fruit-based codename. And the Arena-first approach makes strategic sense: blind testing generates organic buzz that no marketing budget can replicate.

Why now

The timing is hard to ignore. OpenAI killed Sora on March 24, just six months after launching the standalone app. The shutdown torpedoed a billion-dollar Disney partnership that included licensing 200-plus characters for fan-generated video. Disney found out less than an hour before the public announcement, which is the kind of partner management that makes you wince.

Sam Altman framed it as a compute allocation decision in a recent interview. Sora was burning roughly a million dollars a day while its user base collapsed from a peak of about one million to under 500,000. "We have to concentrate our compute and our product capacity," he said. Where does that freed-up compute go? A next-generation image model seems like an obvious answer.

GPT Image 1.5 already topped the LM Arena image leaderboard in December 2025 with an Elo of 1264, edging past Nano Banana Pro. If the tape models are GPT-Image-2, OpenAI is doubling down on the one consumer AI category where viral adoption is actually happening, just a couple of weeks after Google released Nano Banana 2.

The photorealism problem

Something the original Russian-language post flagged deserves more attention: some of the generated images are getting hard to distinguish from photographs. The poster admitted they couldn't always tell whether users were uploading real camera photos to troll, or whether the generations had simply gotten that good.

That's a different kind of problem from text rendering or spatial reasoning. And it's one that gets thornier the better these models become.

What happens next

OpenAI hasn't confirmed anything. The models may have already been pulled from Arena by the time you read this. But if maskingtape-alpha and its siblings hold up under sustained blind testing the way Nano Banana did, the Elo scores will speak for themselves. OpenAI would then have successfully copied the exact Arena-first strategy that caught it flat-footed eight months ago, which is either clever adaptation or a sign of how thoroughly Google's playbook reset expectations for how you launch an image model in 2026.