MIT Tech Review, AI, February 3, 2025

How Anthropic's jailbreak shield changes the LLM safety fight

Anthropic has built a separate LLM-based shield designed to block universal jailbreaks before they reach Claude or before unsafe answers leave it. Tests showed a sharp drop in successful attacks, but experts still point to false positives, higher computing costs and new evasion methods as unresolved limits.

Anthropic has introduced a new defense against jailbreaks, a common way of pushing large language models into behavior their makers tried to prevent. The system is aimed at a hard problem in AI safety: stopping prompts that can make a model ignore its restrictions and provide harmful answers.

The company’s approach does not claim to make Claude invulnerable. Instead, it adds a barrier around the model, screening both attempted inputs and possible outputs. The result, according to the testing described, is a stronger obstacle for attackers, but not a perfect one.

What Anthropic is trying to stop

Most large language models are trained to reject certain requests. Anthropic’s Claude, for example, will refuse questions about chemical weapons. DeepSeek’s R1 appears to be trained to refuse questions about Chinese politics.

Jailbreaks are designed to get around those refusals. Some ask the model to role-play as a character that ignores safety rules. Others alter formatting, capitalization, or letters in ways that can confuse a model’s safeguards while still communicating the request.

Jailbreaks are a kind of adversarial attack: an input crafted to make a model behave in an unexpected way. The source notes that this weakness in neural networks has been studied at least since it was first described by Ilya Sutskever and coauthors in 2013, and that there is still no known method for building a model that cannot be jailbroken.

Anthropic’s immediate concern is especially severe. The company is focused on LLMs it believes could help someone with basic technical skills, such as an undergraduate science student, create, obtain, or deploy chemical, biological, or nuclear weapons.

The focus is on universal jailbreaks

Anthropic’s new shield targets what it calls universal jailbreaks. These are not small tricks that produce a single minor violation. They are attacks that can disable a model’s defenses more broadly.

One example named in the source is Do Anything Now, a jailbreak that instructs a model to act as a DAN, short for “doing anything now.” Anthropic treats this type of attack as a master key because it can try to switch off the safety behavior across many categories at once.

Mrinank Sharma at Anthropic, who led the team behind the work, described the difference between weaker and stronger attacks this way: “There are jailbreaks that get a tiny little bit of harmful stuff out of the model, like, maybe they get the model to swear,” and then there are attacks that “just turn the safety mechanisms off completely.”

That distinction matters because a universal jailbreak raises the stakes. If an attacker can make a model broadly disregard its refusal rules, the risk is not limited to one prompt. It becomes a wider failure of the model’s safety setup.

How the shield was built

Anthropic already keeps a list of question types its models should refuse. To build the new defense, the company used Claude to generate many synthetic questions and answers covering both allowed and disallowed exchanges.

The source gives a simple example of the boundary Anthropic wanted the system to learn: questions about mustard were acceptable, while questions about mustard gas were not. That kind of distinction is important because a useful AI system must still answer benign questions in scientific or technical areas without opening the door to dangerous instructions.

Anthropic expanded the training material in two ways. It translated the exchanges into a handful of different languages, and it rewrote them using techniques associated with jailbreak attempts. This helped create a data set for a filter trained to detect prompts and responses that looked like possible jailbreaks.
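
The rewriting step can be pictured with a small sketch. This is not Anthropic's pipeline: the two rewrite functions, the label names, and the sample questions are all assumptions chosen for illustration, and the translation step is left out. The idea it shows is only that each labeled example is kept and also copied in disguised forms that carry the same label.

```python
# Hypothetical sketch of jailbreak-style data augmentation.
# The rewrites, labels, and examples below are illustrative assumptions,
# not the techniques or data Anthropic actually used.

LEET = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"}


def alternate_caps(text: str) -> str:
    """Alternate lower and upper case by character position."""
    return "".join(c.upper() if i % 2 else c.lower() for i, c in enumerate(text))


def leetspeak(text: str) -> str:
    """Swap a few letters for look-alike digits."""
    return "".join(LEET.get(c.lower(), c) for c in text)


REWRITES = [alternate_caps, leetspeak]


def augment(examples):
    """Return each (text, label) pair plus one rewritten copy per rewrite.

    The label is carried over unchanged, so a filter trained on the result
    sees the same request in several surface forms.
    """
    augmented = []
    for text, label in examples:
        augmented.append((text, label))
        for rewrite in REWRITES:
            augmented.append((rewrite(text), label))
    return augmented


seed = [
    ("how is mustard made", "allowed"),
    ("how is mustard gas made", "disallowed"),
]
dataset = augment(seed)
```

Two seed examples and two rewrites yield six labeled rows here. A real data set would be far larger and would use many more transformation styles.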

The shield itself is an LLM. It sits around the underlying model, trying to stop dangerous requests from reaching it and trying to stop unwanted responses from being returned.
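
As a rough picture of that arrangement, the sketch below wraps a model call with an input check and an output check. Everything in it is hypothetical: the function names, the refusal text, and the toy keyword checks are stand-ins so the example runs, whereas the shield described here is a trained LLM rather than a keyword match.

```python
from typing import Callable

# Hypothetical refusal text; the real system's wording is not given in the source.
REFUSAL = "Sorry, I can't help with that."


def shielded_model(
    prompt: str,
    model: Callable[[str], str],
    input_flagged: Callable[[str], bool],
    output_flagged: Callable[[str, str], bool],
) -> str:
    """Screen the prompt, call the model, then screen the response."""
    # Input side: stop a flagged prompt before it reaches the model.
    if input_flagged(prompt):
        return REFUSAL
    response = model(prompt)
    # Output side: withhold a flagged response instead of returning it.
    if output_flagged(prompt, response):
        return REFUSAL
    return response


# Toy stand-ins so the sketch runs end to end (assumptions, not real classifiers).
def toy_model(prompt: str) -> str:
    return f"Answer to: {prompt}"


def toy_input_flagged(prompt: str) -> bool:
    return "mustard gas" in prompt.lower()


def toy_output_flagged(prompt: str, response: str) -> bool:
    return "mustard gas" in response.lower()
```

Checking both sides matters because a disguised prompt may slip past the input check while the resulting answer is still recognizably harmful on the way out.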

What the tests showed

Anthropic tested the shield with a bug bounty. Experienced jailbreakers were invited to attack Claude using a list of 10 forbidden questions. The company offered $15,000 to anyone who could get the model to answer all 10, which was the threshold Anthropic set for a universal jailbreak.

According to Anthropic, 183 people spent more than 3,000 hours looking for weaknesses. Nobody got Claude to answer more than five of the 10 questions.

The company also ran a second test using 10,000 jailbreaking prompts generated by an LLM. Without the shield, 86% of the attacks succeeded. With the shield in place, 4.4% worked.
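
To put those two percentages side by side, the short calculation below converts them into approximate prompt counts and a relative reduction. The counts are derived here from the reported rates and the 10,000-prompt total; they are not figures quoted by Anthropic.

```python
# Reported figures from the second test.
total_prompts = 10_000
rate_without_shield = 0.86   # 86% of attacks succeeded
rate_with_shield = 0.044     # 4.4% of attacks succeeded

# Derived values (computed here, not reported in the source).
successes_without = round(total_prompts * rate_without_shield)
successes_with = round(total_prompts * rate_with_shield)
relative_reduction = 1 - rate_with_shield / rate_without_shield

print(f"Successful attacks without shield: {successes_without}")
print(f"Successful attacks with shield:    {successes_with}")
print(f"Relative reduction in success rate: {relative_reduction:.1%}")
```

On these numbers, roughly 8,600 successful attacks fall to about 440, a relative reduction of about 95%.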

Alex Robey, who studies jailbreaks at Carnegie Mellon University, called the approach “at the frontier of blocking harmful queries.” He also said it is rare to see evaluations at this scale, and that Anthropic showed robustness against attacks known to bypass many other production models.

The limits are still important

The results do not end the jailbreak problem. Robey, who has developed a defense system called SmoothLLM, argues that the best approach may be to wrap LLMs in multiple defenses that overlap in different ways. His system injects statistical noise into a model to disrupt the mechanisms that make jailbreaks possible.

There are trade-offs. Robey took part in Anthropic’s bug bounty and said the shield could also block harmless questions, including basic non-malicious questions about biology and chemistry. Anthropic says newer versions created since the bug bounty have reduced false positives.

Cost is another issue. Running the shield, which is itself an LLM, increases computing costs by almost 25% compared with running the underlying model alone.

Experts also expect attackers to keep adapting. Yuekang Li, who studies jailbreaks at the University of New South Wales in Sydney, pointed to prompts written in a cipher, such as replacing each letter with the next one so “dog” becomes “eph.” If a model can understand that encrypted text but the shield does not catch it, the defense could be bypassed.
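
The cipher Li describes is a simple letter shift, and a few lines of code show why it is easy to produce and easy to miss. This is an illustrative sketch only; the function name is made up, and the keyword check at the end is a deliberately naive stand-in for a filter, not how the shield works.

```python
import string


def shift_letters(text: str, shift: int = 1) -> str:
    """Replace each letter with the one `shift` places later, wrapping z to a."""
    shift %= 26
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    table = str.maketrans(
        lower + upper,
        lower[shift:] + lower[:shift] + upper[shift:] + upper[:shift],
    )
    # Characters outside A-Z and a-z (spaces, digits, punctuation) pass through.
    return text.translate(table)


encoded = shift_letters("dog")          # "eph"
decoded = shift_letters(encoded, -1)    # "dog"

# A naive keyword check no longer matches once the text is shifted.
blocked_phrase = "mustard gas"
still_visible = blocked_phrase in shift_letters(blocked_phrase)  # False
```

A capable model may still read the shifted text, which is the gap Li is pointing to: the defense has to recognize the request in whatever form the model itself can decode.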

Dennis Klinkhammer, a machine learning researcher at FOM University of Applied Sciences in Cologne, Germany, said synthetic data is important because attack strategies change quickly. In his view, the ability to generate training data across many threat scenarios helps safeguards keep up.

Anthropic is inviting people to test the shield themselves. Sharma summed up the security posture plainly: “We’re not saying the system is bulletproof.” The practical goal is to make a universal jailbreak require so much effort that many attackers are deterred.