Arcada Labs is testing a different kind of AI benchmark: one where models do not just answer questions, but try to function as autonomous social media agents on X.
The experiment, called "Social Arena," puts five leading AI models into public-facing competition. The goal is to see whether they can attract attention, build a following and develop a recognizable online persona without human help.
Why this benchmark is different
Many AI benchmarks test models in isolation. Arcada Labs is taking another route by comparing agents head-to-head across practical tasks and visible outcomes.
In this case, the benchmark is not centered on knowledge questions or logic puzzles. It asks whether an AI system can operate in a social environment where success depends on timing, topic selection, tone, audience reaction and consistency.
That makes the contest harder to reduce to a single technical score. A model may be strong at reasoning but less effective at developing a public identity. Another may find attention more easily but struggle to turn that attention into followers.
Arcada Labs is using the arena to examine abilities that are usually difficult to measure: cultural fluency, taste, persona-building and the capacity to adapt based on feedback.
The five AI models in the arena
The agents are powered by Grok 4.1 Fast, Claude Opus 4.5, Gemini 3 Pro, GLM 4.7 and GPT 5.2. Each one runs with a different "personality," but all agents receive the same system prompt to keep the comparison fair.
Their performance can be compared on the project's website through metrics including views, likes and followers. Those numbers matter because the benchmark is designed around visible social media behavior, not private test answers.
Every hour, each AI agent goes through an autonomous cycle. During that process, the agent checks current trends, reviews its own performance data, researches content and decides whether to post, reply, like or share.
After each cycle, engagement metrics are synced. That gives every model fresh information it can use to adjust its strategy in the next round.
The structure means the agents are not simply producing isolated posts. They are operating in a loop: observe, act, measure, adjust. That loop is central to what makes the benchmark a test of autonomous social media behavior.
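A rough sketch can make the observe, act, measure, adjust loop concrete. Arcada Labs has not published its implementation, so everything below is an assumption for illustration: the `Agent` class, the `run_cycle` and `measure` functions, the "skip" option and the random engagement numbers are hypothetical stand-ins, and the simple "repeat what worked" rule stands in for a real model's decision.

```python
import random
from dataclasses import dataclass, field

# Actions named in the article, plus a hypothetical "skip" for doing nothing.
ACTIONS = ("post", "reply", "like", "share", "skip")


@dataclass
class Agent:
    name: str
    # Past (action, engagement) pairs the agent can learn from.
    history: list = field(default_factory=list)

    def observe(self, trends):
        """Observe: bundle current trends with the agent's own past results."""
        return {"trends": trends, "history": list(self.history)}

    def act(self, observation, rng):
        """Act: stand-in for the model's choice. Usually repeats the action
        with the highest total engagement so far, sometimes explores."""
        totals = {}
        for action, engagement in observation["history"]:
            totals[action] = totals.get(action, 0) + engagement
        if not totals or rng.random() < 0.2:
            return rng.choice(ACTIONS)
        return max(totals, key=totals.get)

    def adjust(self, action, engagement):
        """Adjust: store the outcome so the next cycle can use it."""
        self.history.append((action, engagement))


def measure(action, rng):
    """Measure: hypothetical stand-in for synced engagement metrics."""
    return 0 if action == "skip" else rng.randint(0, 100)


def run_cycle(agent, trends, rng):
    """One cycle: observe, act, measure, adjust."""
    observation = agent.observe(trends)
    action = agent.act(observation, rng)
    engagement = measure(action, rng)
    agent.adjust(action, engagement)
    return action, engagement


if __name__ == "__main__":
    rng = random.Random(0)
    agent = Agent("demo-agent")
    for hour in range(24):  # one simulated day of hourly cycles
        run_cycle(agent, ["placeholder trend"], rng)
    print(len(agent.history))  # 24 recorded outcomes
```

The point of the sketch is the feedback path: what `measure` returns in one cycle becomes part of what `observe` hands to the agent in the next.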
Early results show attention is not the same as following
The competition kicked off on January 15, 2026. As of the reporting this article is based on, Claude Opus 4.5 leads cumulative views at around 86,000, while GPT 5.2 is close behind at 83,000.
The other models trail far behind on views, but followers tell a different story. Grok 4.1 Fast has the largest following of any agent, though that amounts to just 76 followers.
That gap is important. Views can show reach, but followers suggest that some users are choosing to keep seeing an agent's output. Social Arena separates those signals rather than treating social media performance as one simple ranking.
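One way to picture keeping those signals separate is to rank each metric on its own instead of merging them into a single score. The sketch below is illustrative only: the `reported` table and the `rank_by` helper are hypothetical names, and the table holds just the figures quoted in this article, with `None` wherever no number was reported.

```python
# Only figures quoted in the article are filled in; None means "not reported".
# The view counts are approximate, as in the article.
reported = {
    "Claude Opus 4.5": {"views": 86_000, "followers": None},
    "GPT 5.2": {"views": 83_000, "followers": None},
    "Grok 4.1 Fast": {"views": None, "followers": 76},
    "Gemini 3 Pro": {"views": None, "followers": None},
    "GLM 4.7": {"views": None, "followers": None},
}


def rank_by(metric, data):
    """Rank agents on one metric, highest first, skipping unreported values."""
    known = [
        (name, metrics[metric])
        for name, metrics in data.items()
        if metrics[metric] is not None
    ]
    return sorted(known, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(rank_by("views", reported))
    print(rank_by("followers", reported))
```

Because each metric has its own ranking, a different agent can top each list, which is the pattern the early results show.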
The benchmark also allows observers to compare how different strategies develop over time. A model might chase topical visibility, settle into a narrow theme or build a persona that creates a smaller but more committed audience.
The agents are finding their own themes
According to the startup, the agents are not instructed to pursue "viral" content. Instead, they have to develop their own sense of taste and choose their own topics.
Some early patterns have already appeared:
- Grok leans heavily into Musk and space travel.
- Claude Opus 4.5 gravitates toward sports.
- Gemini 3 Pro stays with technical AI topics.
- GPT 5.2 is currently focused on animal behavior.
The Grok pattern tracks with earlier reports that xAI tweaked Grok's behavior to favor things Elon Musk likes. In the context of Social Arena, that matters because the benchmark is not only showing output quality. It is also showing how model behavior can shape public identity.
For brands, platforms and developers, that distinction is practical. An AI agent that behaves consistently online can become associated with certain topics, tones and priorities. Whether that consistency comes from model tendencies, system prompts or feedback from engagement data, it affects how the agent is perceived.
Arcada Labs is chasing harder-to-measure AI skills
Arcada Labs was founded in San Francisco in 2025, according to Everydev.ai, and joined Y Combinator that summer. The startup is run by Harvard graduates Grace Li, Kamryn Ohly and Jayden Personnat, who previously worked at Apple and Nvidia.
Grace Li is CEO, Kamryn Ohly is CTO and Jayden Personnat is AI lead. Their focus is on benchmarks that move beyond logical reasoning and into areas shaped by human preference.
That includes aesthetics and taste. These are qualities traditional tests often struggle to capture because they depend on context, audience and judgment rather than a single correct answer.
Social Arena is one example of that broader approach. More AI agent competitions, including ones for design and event prediction, are listed on the startup's website.
The larger implication is straightforward: as AI agents become more autonomous, measuring them only through isolated prompts may miss important behavior. A model that acts in public, reacts to feedback and builds a persona needs to be evaluated in an environment where those behaviors can actually appear.