The Decoder, November 25, 2025

Why Claude Opus 4.5 still has a prompt injection problem

Claude Opus 4.5 performed better than Google's Gemini 3 Pro and GPT-5.1 in a Gray Swan benchmark on prompt-injection security. But the same test shows that repeated strong attacks can still break through its safeguards at a significant rate.

Claude Opus 4.5 appears to be ahead of major rivals on prompt-injection security, but the lead does not mean the problem is solved. A benchmark by the security firm Gray Swan found that the model can still be pushed past its safeguards, especially when an attacker gets multiple attempts.

The result is a useful reminder for anyone building with large language models: stronger defenses matter, but prompt injection remains a live risk. The benchmark points to progress while also showing how quickly that progress can erode under repeated attack.

What Gray Swan tested

Gray Swan looked at how Claude Opus 4.5 handles prompt injection, a technique that slips hidden instructions into a prompt in order to bypass safety filters. The source describes prompt injection as a long-standing weakness in large language models.

In the benchmark, a single "very strong" prompt injection attack broke through Opus 4.5's safeguards 4.7 percent of the time. That number is low compared with the results reported for some rival models, but it is not zero.

The more important finding is what happens when the attacker can keep trying. With ten attempts, the success rate rises to 33.6 percent. With 100 attempts, it reaches 63 percent.

Those numbers change the practical reading of the result. A model that looks relatively secure against one attempt may look much less secure when the same kind of attack can be repeated.
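One way to read the multi-attempt figures is to compare them with a simple baseline. The sketch below assumes every attempt is an independent try with the same 4.7 percent chance of success. That independence assumption is ours, not the benchmark's, and the source does not say how the ten-attempt and 100-attempt figures were computed. The function name `success_within` is likewise only illustrative.

```python
def success_within(attempts: int, per_attempt_rate: float) -> float:
    """Chance that at least one of `attempts` tries succeeds,
    ASSUMING the tries are independent (an assumption, not the benchmark's method)."""
    return 1.0 - (1.0 - per_attempt_rate) ** attempts


# Figures reported in the article for Claude Opus 4.5.
SINGLE_ATTEMPT_RATE = 0.047
REPORTED = {1: 0.047, 10: 0.336, 100: 0.63}

for attempts, reported in REPORTED.items():
    baseline = success_within(attempts, SINGLE_ATTEMPT_RATE)
    print(
        f"{attempts:>3} attempts: independence baseline {baseline:.1%}, "
        f"reported {reported:.1%}"
    )
```

Under that assumption the baseline comes out at roughly 38 percent for ten attempts and above 99 percent for 100 attempts, both higher than the reported 33.6 and 63 percent. One possible reading is that repeated attempts are not independent in practice, for example if some test cases resist every variation of an attack, but the source does not confirm this.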

Why repeated attempts matter

Prompt injection is not only about whether one malicious instruction works immediately. The benchmark shows that the number of attempts can sharply affect the outcome.

A 4.7 percent success rate for one "very strong" attack suggests Claude Opus 4.5 can often resist that pressure. But the jump to 33.6 percent at ten attempts and 63 percent at 100 attempts shows that persistence changes the risk profile.

That matters because safety cannot be judged only by the first result. If a system gives an attacker many chances, the model's defense has to hold up repeatedly, not just once.
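To make "hold up repeatedly" concrete, the arithmetic can be run in reverse: given a tolerated overall breach rate and a number of attempts, how low would the per-attempt success rate have to be? The sketch below again assumes independent attempts, which is our simplification. The 5 percent target and the function name `max_per_attempt_rate` are hypothetical values chosen for illustration and do not come from the benchmark.

```python
def max_per_attempt_rate(target_cumulative: float, attempts: int) -> float:
    """Largest per-attempt success rate that keeps the chance of at least one
    success within `attempts` tries at or below `target_cumulative`,
    ASSUMING independent attempts (an illustrative simplification)."""
    return 1.0 - (1.0 - target_cumulative) ** (1.0 / attempts)


# Hypothetical target: at most a 5% chance of any breach across 100 attempts.
rate = max_per_attempt_rate(target_cumulative=0.05, attempts=100)
print(f"Required per-attempt rate: {rate:.4%}")
```

Under these assumptions, the per-attempt rate would need to be around 0.05 percent, roughly a hundredth of the 4.7 percent single-attempt figure reported for Opus 4.5.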

The source does not describe the full setup of the benchmark, so the safest conclusion is limited but clear: Claude Opus 4.5 is comparatively strong in this test, yet its safeguards can still fail under sustained pressure.

How Claude Opus 4.5 compares with rivals

The benchmark found that Opus 4.5 performs better than models including Google's Gemini 3 Pro and GPT-5.1. According to the source, those models show attack success rates as high as 92 percent.

That comparison is significant because it places Claude Opus 4.5 at the stronger end of the tested group. It suggests that Anthropic's model is more resistant to prompt injection than those rivals in the benchmark described.

Still, the comparison should not obscure the larger point. Being better than other models is different from being secure against prompt injection in an absolute sense.

For developers, researchers, and teams evaluating AI systems, the benchmark sends three messages at once:

Relative progress: Claude Opus 4.5 proved more resistant to prompt injection than the named rivals.
Remaining exposure: A "very strong" attack still succeeded 4.7 percent of the time on one attempt.
Accumulated risk: The success rate increased to 33.6 percent at ten attempts and 63 percent at 100 attempts.

That combination is the core story. The model is harder to break in this benchmark, but it is not immune.

The agent-style system problem

The source also highlights why prompt injection becomes more serious in agent-style systems. These systems expose more potential entry points, which makes such attacks easier to carry out.

That context matters because the risk is not limited to a simple back-and-forth chat. When an AI system is designed to operate in a more agent-like way, there can be more places for hidden instructions to appear and more chances for those instructions to influence behavior.

The benchmark numbers therefore carry weight beyond a model leaderboard. If prompt injection remains effective even against a stronger model, then system design and deployment choices become part of the security question.

The source does not provide a complete mitigation plan, and it would be wrong to invent one here. But it does support one practical reading: prompt-injection security should be treated as an ongoing constraint, not a box that can be checked once a model performs well against competitors.

The takeaway for AI security

Claude Opus 4.5's performance in Gray Swan's benchmark is encouraging in relative terms. It resisted prompt injection better than Google's Gemini 3 Pro and GPT-5.1, according to the results described.

But the same benchmark also shows that strong attacks can still get through. A single attack succeeded 4.7 percent of the time, and repeated attempts pushed the success rate much higher.

That is the unresolved tension around prompt injection today. The defenses are improving, but the weakness remains meaningful, especially when attackers have repeated chances and when agent-style systems create more entry points.

For anyone watching the future of AI safety, Claude Opus 4.5 is not a simple success story or a failure story. It is evidence of progress inside a security problem that still has not been fully closed.