WIRED AI January 31, 2025 TERMINATOR

Why DeepSeek R1’s jailbreak failures matter for AI safety

Researchers from Cisco and the University of Pennsylvania tested DeepSeek’s R1 reasoning model with 50 malicious prompts and said every attack succeeded. Separate work from Adversa AI also found that DeepSeek could be bypassed with multiple jailbreak methods, keeping pressure on AI safety and enterprise risk controls.

WTF Index TERMINATOR

◄ Terminator 4 Idiocracy 0 ►

The story centers on a reasoning model being easily jailbroken into harmful outputs, raising safety and misuse risks.

Why DeepSeek R1’s jailbreak failures matter for AI safety

DeepSeek’s R1 reasoning model has drawn attention for being new and cheaper, but recent security testing points to a more difficult question: how much safety was built into the model before it reached wide attention?

Researchers from Cisco and the University of Pennsylvania tested the model with 50 malicious prompts intended to produce toxic content. According to the findings described in the source article, DeepSeek’s model did not detect or block any of them.

A complete failure in one benchmark test

The Cisco and University of Pennsylvania researchers used 50 randomly selected prompts from HarmBench, a library of standardized evaluation prompts. The test covered six HarmBench categories, including general harm, cybercrime, misinformation, and illegal activities.

The result was stark. The researchers said they saw a “100 percent attack success rate.” DJ Sampath, the VP of product, AI software and platform at Cisco, told WIRED: “A hundred percent of the attacks succeeded, which tells you that there’s a trade-off.”

Sampath connected the result to the choices companies make when building and releasing AI systems. “Yes, it might have been cheaper to build something here, but the investment has perhaps not gone into thinking through what types of safety and security things you need to put inside of the model,” he said.

The researchers tested R1 running locally on machines, rather than through DeepSeek’s website or app. The source article notes that DeepSeek’s website and app send data to China.

What jailbreaks and prompt injection mean

Large language models are built to respond to instructions. That makes them powerful, but it also creates a security problem: attackers can try to shape the model’s instructions in ways the developer did not intend.

Jailbreaks are one type of prompt-injection attack. They are attempts to get around the safety systems that limit what a model can generate. The source article gives examples of harmful outputs companies try to prevent, including hate speech, bomb-making instructions, propaganda, guides to making explosives, and disinformation.

Early jailbreaks often relied on clever wording. One well-known example was called “Do Anything Now” or DAN. As AI companies have strengthened their defenses, some jailbreaks have become more complex, including prompts generated using AI or built with special and obfuscated characters.

The broader class of indirect prompt injection is also described as one of the biggest security flaws in current AI systems. In these attacks, a model can take in outside data, such as hidden instructions on a website it is summarizing, and then act on that information.

DeepSeek was not the only weak model

The Cisco researchers also compared DeepSeek R1’s performance with other models. Some models, including Meta’s Llama 3.1, faltered almost as severely as DeepSeek’s R1.

Sampath argued, however, that the most relevant comparison is with OpenAI’s o1 reasoning model. R1 is a reasoning model, which takes longer to generate answers and uses more complex processes to try to produce better results. In Cisco’s comparison, OpenAI’s o1 reasoning model performed best among the models tested.

The source article also says DeepSeek’s censorship of subjects deemed sensitive by China’s government has been easily bypassed. That adds to a growing body of evidence suggesting that DeepSeek’s safety and security measures may not match those of other companies developing large language models.

DeepSeek did not respond to WIRED’s request for comment about its model’s safety setup. The company had been dealing with an avalanche of attention and had not spoken publicly about a range of questions.

Adversa AI found similar weaknesses

Separate analysis from the AI security company Adversa AI, shared with WIRED, also found that DeepSeek was vulnerable to a wide range of jailbreak tactics. Those ranged from simple language tricks to more complex AI-generated prompts.

Alex Polyakov, the CEO of Adversa AI, said DeepSeek appeared to detect and reject some well-known jailbreak attacks. He also said that “it seems that these responses are often just copied from OpenAI’s dataset.”

But Adversa AI’s tests of four different types of jailbreaks, from linguistic methods to code-based tricks, found that DeepSeek’s restrictions could be bypassed. “Every single method worked flawlessly,” Polyakov said.

He also warned that the issue was not limited to new or unknown attacks. “What’s even more alarming is that these aren’t novel ‘zero-day’ jailbreaks—many have been publicly known for years,” he said.

Why enterprises should pay attention

Jailbreaks are not unique to DeepSeek. Polyakov told WIRED that eliminating them entirely is nearly impossible, comparing them to buffer overflow vulnerabilities in software and SQL injection flaws in web applications.

The risk grows when AI models are placed inside larger products and business workflows. Sampath said the problem becomes more serious when models are used in important complex systems and jailbreaks create downstream consequences.

For enterprises, the lesson is not that one model failed one test. It is that model safety has to be treated as an ongoing operational concern, especially when AI is connected to business processes, user data, or automated actions.

Polyakov put the point bluntly: “DeepSeek is just another example of how every model can be broken—it’s just a matter of how much effort you put in. Some attacks might get patched, but the attack surface is infinite.”

His conclusion was direct: “If you’re not continuously red-teaming your AI, you’re already compromised.”