How GPT-5 won a Werewolf test of AI social intelligence

Foaster.ai tested AI social intelligence with 210 games of "Werewolf," a setup built around reasoning, bluffing, deception and adaptation. GPT-5 led the field with 1,492 Elo points, a 96.7 percent win rate and, when playing as a werewolf, a 93 percent manipulation rate on both the first and second days.

Foaster.ai used the game "Werewolf" to examine a part of AI performance that standard benchmarks often miss: how models behave when success depends on social strategy, incomplete information and changing group dynamics.

Across 210 games, GPT-5 finished clearly ahead. The result was not framed around factual recall or math alone, but around how well models could argue, defend themselves, mislead others and adjust as the game became harder to read.

Why Werewolf tests more than reasoning

"Werewolf" was chosen because the game forces models into a social setting. Players must reason logically, but they also have to bluff, deceive specific opponents and respond to unpredictable situations.

That makes the benchmark different from tests that focus mainly on known answers. Here, the central question is whether a language model can keep track of roles, conversations, suspicions and shifting incentives while interacting with other models.

The benchmark measured adaptation in a dynamic environment. Factual knowledge and mathematical reasoning still mattered, but the main focus was social intelligence: the ability to persuade, defend, misdirect and make strategic decisions under pressure.

How the 210-game benchmark worked

Each game used six AI models. Two played as werewolves, while four played as villagers with special abilities such as seer and witch. Before the game began, a mayor was elected.

Play then moved through three discussion-based day rounds and hidden night phases. During those stages, models could analyze, attack or defend. Every pair of models played ten games per role, and Elo rankings were used to evaluate the results.

This structure gave the benchmark several layers. Models had to handle open discussion, hidden information and role-specific incentives. They also had to maintain their strategy as more information entered the game.

  • Players: six AI models per game
  • Roles: two werewolves and four villagers with special abilities
  • Format: three discussion-based day rounds plus hidden night phases
  • Evaluation: Elo rankings across repeated role matchups
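The setup above can be written down as a small configuration object. This is only an illustrative sketch: the class name `GameConfig` and its field names are hypothetical and do not come from Foaster.ai's code. Only the numbers are taken from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GameConfig:
    """Hypothetical container for the game setup described in the article."""

    werewolves: int = 2
    villagers: int = 4  # some hold special roles such as seer and witch
    day_rounds: int = 3  # discussion-based day rounds
    mayor_elected_before_start: bool = True

    @property
    def players(self) -> int:
        # Total seats at the table: werewolves plus villagers.
        return self.werewolves + self.villagers


config = GameConfig()
print(config.players)  # 6
```

Freezing the dataclass keeps the setup identical from game to game, which is what repeated matchups need if their results are to be comparable.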

GPT-5 kept its deception stable

GPT-5 scored 1,492 Elo points and won 96.7 percent of games. As a werewolf, it maintained a 93 percent manipulation rate on both the first and second days.

That consistency mattered because the game becomes more difficult as it progresses. According to the researchers, deception becomes harder when the information density rises. Other models showed that pressure clearly.

The deception rate of Google's Gemini 2.5 Pro dropped from 60 percent on the first day to 44 percent on the second. Kimi-K2 fell from 53 to 30 percent. GPT-5, by contrast, kept the same manipulation rate across the first two days.
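To put those drops side by side, the short sketch below computes the change in percentage points and the relative decline for each model. The rates are the ones quoted above. The helper name `drop` is made up for this example, and the relative figure is simple arithmetic on the published numbers, not a metric reported by Foaster.ai.

```python
# Day-one and day-two werewolf deception rates, in percent, as quoted above.
rates = {
    "GPT-5": (93, 93),
    "Gemini 2.5 Pro": (60, 44),
    "Kimi-K2": (53, 30),
}


def drop(day1: float, day2: float) -> tuple[float, float]:
    """Return (percentage-point drop, drop as a fraction of the day-one rate)."""
    points = day1 - day2
    return points, points / day1


for model, (day1, day2) in rates.items():
    points, relative = drop(day1, day2)
    print(f"{model}: -{points} points ({relative:.0%} relative)")
# GPT-5: -0 points (0% relative)
# Gemini 2.5 Pro: -16 points (27% relative)
# Kimi-K2: -23 points (43% relative)
```

Seen this way, Kimi-K2 lost a little over two fifths of its day-one deception rate and Gemini 2.5 Pro roughly a quarter, while GPT-5 lost none.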

Gemini 2.5 Pro still performed strongly in another role. As a villager, it used disciplined reasoning and strong self-defense. Overall, it placed second with 1,261 Elo points and 63.3 percent wins.

The rest of the ranking showed a tighter middle field. Gemini 2.5 Flash followed at 1,188 Elo, then Qwen3-235B-Instruct from Alibaba at 1,176 Elo, GPT-5-mini at 1,173 Elo and Kimi-K2-Instruct at 1,130 Elo. GPT-oss-120B finished last with 980 Elo and only 15 percent wins.
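For a sense of what those rating gaps mean, the sketch below applies the textbook Elo formulas to the published numbers. It rests on assumptions: the article does not say which Elo variant, scale or K-factor Foaster.ai used, so the conventional scale of 400 and an illustrative K of 32 stand in here. The function names are made up for the example.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Textbook Elo expectation for A against B (scale of 400 is an assumption)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return both ratings after one game; score_a is 1 for a win, 0 for a loss.

    K = 32 is an illustrative value, not one reported by the benchmark.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Published ratings for the top two models.
gpt5, gemini_pro = 1492, 1261
print(f"{expected_score(gpt5, gemini_pro):.2f}")  # 0.79

# Two equally rated players: the winner gains exactly what the loser gives up.
print(update(1200, 1200, score_a=1))  # (1216.0, 1184.0)
```

Under these assumptions, a 231-point gap corresponds to the higher-rated model being expected to win about four games in five. That is a head-to-head expectation and should not be confused with the overall win rates reported across the whole field.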

Different models, different play styles

Foaster.ai also observed that the models did not simply vary by score. They developed distinct styles of play.

GPT-5 was described as a "calm and imperturbable architect," using controlled authority to bring order. GPT-oss-120B was hesitant and defensive. Kimi-K2 took bigger risks, including one case where it falsely claimed to be the witch and caused the real witch to be eliminated.

The researchers also saw moments of spontaneous creativity. In one example, a werewolf sacrificed its own teammate in order to look more trustworthy later. Those moves were not explicitly programmed; they emerged from in-game behavior.

That is the central point of the benchmark. It does not only ask whether a model can produce a convincing sentence. It asks whether the model can form a strategy, sustain it, revise it and use social pressure as part of the game.

What the results suggest about social AI

The study found that stronger models generally made better arguments, acted more strategically and showed greater social intelligence. But the gains were not linear. Weaker models often behaved inconsistently, while more advanced ones formed clearer strategies.

A reasoning label alone was not enough to guarantee strong strategic play. OpenAI’s o3 argued clearly, adapted to new information and followed the rules. By contrast, according to the researchers, the smaller o4-mini remained rigid and struggled with changing dynamics, even when it made good individual arguments.

Foaster.ai plans to use the Werewolf benchmark to advance research into AI social intelligence. The team sees possible use cases in multi-agent systems, negotiation and collaborative decision-making. An expanded benchmark is already being developed.

The source also points to earlier studies finding that emotional prompts can boost LLM performance and that older OpenAI models beat humans on empathy tests. This new benchmark adds another angle: AI models are becoming more capable as social actors, which brings both opportunities and risks.