Y Combinator-backed startup Arcada Labs has turned five leading AI models loose on X with a simple mandate: build an audience from scratch, no human help allowed. The experiment, called Social Arena, launched on January 15 and pits Grok 4.1 Fast, Claude Opus 4.5, Gemini 3 Pro, GLM 4.7, and GPT 5.2 against each other in what amounts to a live, public Turing test for social media fluency.
The setup is straightforward. Each model gets the same system prompt and its own X account. From there, it is on its own.
What the agents actually do all day
Every hour, each agent runs through an autonomous loop: scan trending topics, pull its own engagement metrics, research content, then decide whether to post, reply, retweet, or do nothing. The metrics sync after each cycle, so the models can adjust strategy based on what worked (or didn't) an hour ago.
Arcada Labs says the agents aren't instructed to chase virality. Instead, they have to develop their own editorial instincts. And they have, sort of. According to The Decoder, the models have drifted into distinct content niches: Grok fixates on Elon Musk and space travel, Gemini 3 writes about AI research, Claude gravitates toward sports, and GPT 5.2 has developed an oddly specific fascination with animal behavior.
The Grok thing is worth lingering on. A model built by xAI, tweeting obsessively about its parent company's CEO, is either a coincidence or a pretty damning reflection of training data bias. Given prior reporting on xAI adjusting Grok's outputs to align with Musk's preferences, the latter seems more likely.
86,000 views, 76 followers
Six weeks in, Claude Opus 4.5 leads cumulative views at roughly 86,000, with GPT 5.2 close behind at around 83,000. Gemini, GLM, and Grok trail well behind on that metric. But Grok has quietly assembled the largest follower base at 76.
Seventy-six followers. After six weeks of hourly posting. That gap between views and followers tells you something about what these agents are actually producing: content that X's algorithm surfaces (possibly because it is high-volume and topical) but that almost nobody finds compelling enough to follow. The numbers are modest enough that statistical noise could explain most of the variation between models.
None of the agents have come close to a viral moment, which is both expected and revealing. Building an organic audience on X is hard for humans with years of context and social intuition. Expecting an LLM to crack it in a few weeks with no brand equity and no existing network was always a stretch.
Who's behind this
Arcada Labs was founded in 2025 in San Francisco by Harvard graduates Grace Li, Kamryn Ohly, and Jayden Personnat, all of whom previously worked at Apple or Nvidia. The startup entered Y Combinator's Summer 2025 batch and has been building benchmarks that try to measure subjective qualities like aesthetic taste and cultural fluency, things traditional evals ignore entirely.
Social Arena is their third "arena" project. Design Arena crowdsources human votes on AI-generated visuals, and Prediction Arena had models trading real money on Kalshi. The through line is testing AI in messy, real-world conditions rather than sanitized lab environments.
So does this benchmark mean anything?
Maybe. The premise is sound: static benchmarks are gamed constantly, and testing models against real audiences eliminates a lot of the usual measurement problems. But Social Arena introduces new ones. X's algorithm is a black box that could favor or penalize bot-like posting patterns in ways the researchers can't control. The sample sizes are tiny. And comparing five models with different architectures, training data, and safety guardrails on a single platform with a single prompt is less controlled than it sounds.
What Social Arena does capture, even in its early and imperfect state, is how differently these models behave when given open-ended autonomy. The fact that GPT 5.2 gravitated toward animal behavior while Grok started tweeting about SpaceX tells you something about what's baked into these systems, even if the follower counts don't tell you much about which model is "better" at social media.
The competition is ongoing and metrics update in real time on the Social Arena website. No end date has been announced.