The Decoder, November 27, 2025

How poetry exposes a weak spot in AI safety filters

A new study found that malicious requests written as poems bypassed safeguards far more often than the same kinds of prompts in prose. Across 25 leading models, poetic jailbreaks reached success rates of up to 100 percent, raising questions about how AI safety filters detect intent.

A new study suggests that large language models can be pushed past their safeguards with a surprisingly simple change in form: turn the request into verse. Researchers found that harmful prompts written as poems slipped through AI safety filters much more often than plain-language versions.

The finding matters because the attacks did not require long conversations, repeated attempts, or complex jailbreak chains. According to the source article, every attack worked with a single input, and the transformation process can be fully automated.

What the researchers tested

Researchers from Italian universities and the DEXAI Icaro Lab tested 25 leading models from nine providers, including Google, OpenAI, Anthropic, DeepSeek, Qwen, and Meta. Their first test set consisted of 20 handcrafted poems designed to disguise malicious requests through poetic structure.

Those handcrafted poems achieved an average success rate of 62 percent across the tested models. Some providers failed to block more than 90 percent of the requests. In the most extreme cases, poetic prompts reached success rates of up to 100 percent.

The researchers kept the specific prompts private for safety reasons. They did, however, provide an adjusted example showing how a harmful instruction can be hidden inside metaphor, rhythm, and a harmless-sounding narrative frame.

The central issue is not that the models failed to recognize the prompts as poetry. It is that poetic language appeared to interfere with the systems meant to recognize harmful intent. The study suggests that condensed metaphors, rhythmic structures, and unusual narrative framing can disrupt pattern recognition in safety filters.

Poetry outperformed prose in safety tests

To test the effect at scale, the researchers converted all 1,200 prompts from the MLCommons AILuminate Safety Benchmark into verse. The results showed a large difference between ordinary wording and poetic reformulation.

Poetic versions were far more effective than prose. The average success rate rose from 8 percent to 43 percent when prompts were rewritten as verse, a gain of 35 percentage points and more than a fivefold relative increase.
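The two averages reported above can be compared in two ways, as a percentage-point gain or as a relative multiplier, and the two sound very different. A minimal Python sketch of that arithmetic follows. It uses only the 8 percent and 43 percent figures from the article; the function name `compare_rates` is illustrative and does not come from the study.

```python
def compare_rates(prose_rate: float, verse_rate: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative multiplier) for two success rates.

    Both rates are fractions between 0 and 1. The prose rate must be
    non-zero, otherwise the relative multiplier is undefined.
    """
    if not 0 < prose_rate <= 1 or not 0 <= verse_rate <= 1:
        raise ValueError("rates must be fractions in [0, 1], with prose_rate > 0")
    point_gain = (verse_rate - prose_rate) * 100  # absolute change
    multiplier = verse_rate / prose_rate          # relative change
    return point_gain, multiplier


# Average attack success rates reported in the article: 8% prose, 43% verse.
point_gain, multiplier = compare_rates(0.08, 0.43)
print(f"{point_gain:.0f} percentage points, {multiplier:.1f}x relative")
```

Stating both numbers side by side avoids confusing "35 points higher" with "35 percent higher".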

The evaluation covered around 60,000 model responses. Three models served as judges, while humans verified an additional 2,100 responses. Responses were treated as unsafe when they included specific instructions, technical details, or advice that would enable harmful activity.
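One way to picture this kind of evaluation is as a vote among the judges followed by a simple count. The Python sketch below is an illustration under stated assumptions, not the study's pipeline: the article says three models served as judges but does not say how their verdicts were combined, so the majority vote here is an assumption, and the sample votes are placeholders rather than study data.

```python
from collections import Counter

VALID_LABELS = {"safe", "unsafe"}


def majority_label(judge_labels: list[str]) -> str:
    """Combine an odd number of 'safe'/'unsafe' votes into one label.

    Majority voting is an assumption made for this sketch.
    """
    if len(judge_labels) % 2 == 0:
        raise ValueError("use an odd number of judges so ties cannot occur")
    if not set(judge_labels) <= VALID_LABELS:
        raise ValueError("labels must be 'safe' or 'unsafe'")
    return Counter(judge_labels).most_common(1)[0][0]


def attack_success_rate(votes_per_response: list[list[str]]) -> float:
    """Fraction of responses whose majority label is 'unsafe'."""
    if not votes_per_response:
        raise ValueError("need at least one response")
    labels = [majority_label(votes) for votes in votes_per_response]
    return labels.count("unsafe") / len(labels)


# Placeholder votes for four responses, three judges each (not study data).
sample_votes = [
    ["unsafe", "unsafe", "safe"],
    ["safe", "safe", "safe"],
    ["safe", "unsafe", "safe"],
    ["unsafe", "unsafe", "unsafe"],
]
print(attack_success_rate(sample_votes))  # 0.5
```

A human check like the one described above would then compare a sample of these majority labels against human labels to see how often the judges agree with people.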

That testing setup is important because it shows the problem was not limited to a few carefully chosen examples. The larger benchmark conversion points to a broader weakness: a model may reject a direct harmful request, but respond when the same intent is wrapped in creative language.

Provider results varied sharply

The study found major differences among providers and models. Google's Gemini 2.5 Pro did not block any of the 20 handcrafted poems. DeepSeek models also struggled, with an attack success rate of over 95 percent.

Other systems performed much better. OpenAI's GPT-5 Nano blocked 100 percent of the attempts, while Anthropic's Claude Haiku 4.5 allowed only 10 percent through.

The same pattern held when the researchers moved from the 20 handcrafted poems to the larger set of 1,200 transformed prompts. DeepSeek and Google showed increases in failure rates of over 55 percentage points, while Anthropic and OpenAI remained comparatively robust, with changes below ten percentage points.
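A per-provider shift like the ones described above is the verse failure rate minus the prose failure rate, expressed in percentage points. The Python sketch below shows that calculation on made-up records. The provider names, the record layout, and the function `failure_rate_shift` are all hypothetical, and none of the numbers are study data.

```python
from collections import defaultdict

# Placeholder records, NOT study data: (provider, style, judged_unsafe)
records = [
    ("provider_a", "prose", False),
    ("provider_a", "prose", False),
    ("provider_a", "verse", True),
    ("provider_a", "verse", False),
    ("provider_b", "prose", False),
    ("provider_b", "prose", True),
    ("provider_b", "verse", True),
    ("provider_b", "verse", False),
]


def failure_rate_shift(records):
    """Return {provider: verse rate minus prose rate, in percentage points}.

    Providers that lack responses for either style are left out.
    """
    # counts[provider][style] = [unsafe responses, total responses]
    counts = defaultdict(lambda: {"prose": [0, 0], "verse": [0, 0]})
    for provider, style, unsafe in records:
        tally = counts[provider][style]
        tally[0] += int(unsafe)
        tally[1] += 1

    shifts = {}
    for provider, by_style in counts.items():
        rates = {s: u / n for s, (u, n) in by_style.items() if n}
        if len(rates) == 2:
            shifts[provider] = (rates["verse"] - rates["prose"]) * 100
    return shifts


shifts = failure_rate_shift(records)
for provider, shift in sorted(shifts.items()):
    print(f"{provider}: {shift:+.0f} percentage points")
```

Reporting the shift rather than the raw verse rate separates a model that was already weak in prose from one that degrades specifically under stylistic rewriting.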

According to the researchers, that consistency suggests the vulnerability is systematic rather than tied to one prompt category. In other words, the weakness appears to come from how certain systems process transformed language, not from one isolated failure case.

Model size also appeared to matter. Smaller models generally rejected more requests. In the GPT-5 family, GPT-5 Nano had an attack success rate of zero percent, while the standard GPT-5 let 10 percent of the attacks through. The researchers hypothesize that smaller models may either struggle more with metaphorical structure or respond more conservatively to unusual inputs.

Which risks slipped through most often

The poetic prompts focused on four main risk domains: CBRN (chemical, biological, radiological, and nuclear) hazards, cyberattacks, tampering, and loss of control. Among the handcrafted poems, cyberattack prompts were the most effective. These included requests such as infiltrating code or cracking passwords, and they reached an 84 percent success rate.

In the transformed MLCommons dataset, data protection prompts showed the largest shift. Their success rate rose from 8 percent in prose to 53 percent in verse.

That cross-domain pattern is one of the study's most important implications. If poetry helps bypass safeguards across different risk areas, then the issue is probably not just a narrow content filter problem. It points toward a deeper challenge in distinguishing surface wording from underlying intent.

Why benchmarks may miss the problem

The findings raise concerns about how AI safety is tested. Static benchmarks, including those used under the EU AI Act, assume model responses are stable enough to evaluate through standard prompts. This study challenges that assumption by showing that minimal stylistic changes can sharply lower refusal rates.

The researchers argue that relying only on standard benchmark tests systematically overestimates model robustness. They say approval procedures should include stress tests that vary wording styles and linguistic patterns.
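The overestimate the researchers describe can be made concrete by comparing the refusal rate on standard wording with the worst refusal rate across wording styles. The Python sketch below assumes refusal rates per style have already been measured. The style names, the rates, and the function `robustness_report` are placeholders chosen for illustration, not results or tooling from the study.

```python
def robustness_report(refusal_rate_by_style: dict[str, float],
                      baseline: str = "plain") -> dict:
    """Compare the baseline refusal rate with the worst rate across styles.

    'overestimate' is how much a baseline-only test overstates robustness
    relative to the weakest style that was tested.
    """
    if baseline not in refusal_rate_by_style:
        raise KeyError(f"no refusal rate recorded for baseline style {baseline!r}")
    base = refusal_rate_by_style[baseline]
    weakest_style = min(refusal_rate_by_style, key=refusal_rate_by_style.get)
    worst = refusal_rate_by_style[weakest_style]
    return {
        "baseline": base,
        "weakest_style": weakest_style,
        "worst_case": worst,
        "overestimate": base - worst,
    }


# Placeholder refusal rates per wording style (not study data).
measured = {"plain": 0.875, "verse": 0.5, "archaic": 0.75, "bureaucratic": 0.8125}
report = robustness_report(measured)
print(report)
```

Reporting the worst case alongside the baseline makes a stylistic weak spot visible even when the average across styles still looks acceptable.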

The study also suggests that current filters may focus too heavily on the surface form of text. A prompt can look harmless because it uses metaphor, narrative, or rhyme, while still asking for unsafe help. For safety systems, that creates a difficult but essential task: identify intent even when the request is deliberately disguised.

The researchers tested English and Italian inputs. They plan to investigate the exact mechanisms behind poetic prompts and examine other styles, such as archaic or bureaucratic language.

The broader lesson is straightforward. Higher model performance does not automatically mean stronger security. If a harmful request becomes more likely to succeed simply because its linguistic wrapper changes, AI safety filters need to be tested against form as well as content.