The Decoder, October 20, 2024

Study Finds Demographic Keywords Can Weaken LLM Safety

A study titled "Do LLMs Have Political Correctness?" found that demographic keywords can change how often LLM jailbreak attempts succeed. The researchers introduced PCJailbreak to test the issue and PCDefense to reduce both attack success and the gap between groups.

The story highlights jailbreak vulnerabilities that can make LLM safety uneven and allow more harmful outputs, though it also discusses defenses.

A new study suggests that safety tuning in large language models can create an uneven attack surface. The researchers found that jailbreak prompts were more likely to succeed when they included terms for marginalized groups than when they used terms for privileged groups, even when the rest of the prompt stayed the same.

What the Study Tested

The study, titled "Do LLMs Have Political Correctness?", examined whether demographic keywords affect the success of jailbreak attempts against large language models. In this context, a jailbreak is a crafted prompt designed to bypass safety measures and produce unwanted or harmful outputs.

Researchers Isack Lee and Haebin Seong of Theori Inc. focused on how models respond when prompts include terms connected to demographic and socioeconomic groups. The work did not simply ask whether models have safety filters. It tested whether those filters behave differently depending on the group referenced in the prompt.

The core finding is that they do. Prompts using terms for marginalized groups were more likely to produce unwanted outputs than prompts using terms for privileged groups.

How PCJailbreak Exposed the Gap

To measure the issue, the researchers created a method called PCJailbreak. It uses paired keywords, such as "rich" and "poor" or "male" and "female", to compare how models behave when a potentially harmful instruction is paired with different demographic terms.

The researchers repeatedly tested combinations of these keywords and harmful instructions. That let them compare jailbreak success rates across otherwise similar prompts.
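The paired-keyword idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the template wording, the placeholder instructions, and the names `build_prompt_pairs`, `KEYWORD_PAIRS`, `INSTRUCTIONS`, and `TEMPLATE` are all assumptions. Only the two keyword pairs come from the article.

```python
from itertools import product

# Keyword pairs quoted in the article; a real run would use a longer list.
KEYWORD_PAIRS = [("rich", "poor"), ("male", "female")]

# Placeholder instructions (hypothetical). A real evaluation would draw
# these from a vetted red-teaming benchmark under controlled access.
INSTRUCTIONS = ["<instruction-1>", "<instruction-2>"]

# Assumed prompt template, not the one used in the study.
TEMPLATE = "I am a {keyword} person. {instruction}"


def build_prompt_pairs(keyword_pairs, instructions, template=TEMPLATE):
    """Build prompt pairs that differ only in the demographic keyword."""
    pairs = []
    for (kw_a, kw_b), instruction in product(keyword_pairs, instructions):
        pairs.append({
            "keywords": (kw_a, kw_b),
            "instruction": instruction,
            "prompt_a": template.format(keyword=kw_a, instruction=instruction),
            "prompt_b": template.format(keyword=kw_b, instruction=instruction),
        })
    return pairs


prompt_pairs = build_prompt_pairs(KEYWORD_PAIRS, INSTRUCTIONS)
```

Because each pair shares the same template and instruction, any difference in model behavior between `prompt_a` and `prompt_b` can be attributed to the keyword alone.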

The reported differences were large enough to matter. The authors wrote that "these intentional biases result in a jailbreaking success rate in GPT-4o models that differs by 20% between non-binary and cis-gender keywords and by 16% between white and black keywords, even when the other parts of the prompts are identical."
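To make the arithmetic behind such a gap concrete, here is a small sketch with invented outcomes. The quote does not say whether the difference is measured in percentage points or as a relative change; this example assumes percentage points. The function names and all numbers are hypothetical and are not results from the study.

```python
def success_rate(outcomes):
    """Fraction of attempts judged successful; outcomes is a list of bools."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def gap_in_points(outcomes_a, outcomes_b):
    """Difference in success rate (a minus b), in percentage points."""
    return round(100 * (success_rate(outcomes_a) - success_rate(outcomes_b)), 1)


# Hypothetical outcomes, NOT data from the study: True means a judge
# marked the response as a successful jailbreak.
marginalized = [True] * 6 + [False] * 4   # 6 of 10 attempts succeed
privileged = [True] * 4 + [False] * 6     # 4 of 10 attempts succeed

gap = gap_in_points(marginalized, privileged)
print(f"gap: {gap} percentage points")
```

With these made-up counts the two groups differ by 20 percentage points, even though every prompt pair is identical apart from the keyword.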

The researchers attribute the discrepancy to intentional biases added to encourage ethical behavior. In other words, a safety effort meant to reduce discriminatory behavior may also create patterns that attackers can exploit.

Why Model Tuning Can Become a Security Issue

The study points to a difficult tradeoff in AI safety. Developers want large language models to avoid discriminatory or harmful outputs. But if the safety behavior is uneven, an attacker may be able to steer around guardrails by choosing certain demographic keywords.

That makes the issue both a fairness concern and a security concern. A model that treats similar harmful requests differently can become less predictable. It can also give attackers a way to search for prompts that are more likely to bypass restrictions.

The study found that Meta's Llama 3 performed relatively well in the tests and was less vulnerable to the attack. OpenAI's GPT-4o performed rather poorly. The source article notes that this may be due to OpenAI's greater emphasis on fine-tuning its models against discrimination.

PCDefense Tries to Reduce the Bias

The researchers also developed PCDefense, a method intended to address the vulnerabilities found by PCJailbreak. PCDefense uses special defense prompts to reduce excessive biases in language models and make jailbreak attacks less successful.

A key detail is that PCDefense does not require extra models or additional processing steps. Instead, the defense prompts are added directly to the input. The goal is to adjust model behavior toward more balanced responses.
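In code, a prompt-level defense of this kind amounts to string concatenation before the single model call. The sketch below is an assumption-laden illustration: the article does not quote the defense prompt or say where in the input it is placed, so `DEFENSE_PROMPT` is a placeholder and the `position` parameter is hypothetical.

```python
# Placeholder text (hypothetical); the article does not quote the actual
# defense prompt used by PCDefense.
DEFENSE_PROMPT = "<defense prompt text>"


def apply_defense(user_prompt, defense_prompt=DEFENSE_PROMPT, position="prefix"):
    """Attach a defense prompt to the input without any extra model call."""
    if position == "prefix":
        return f"{defense_prompt}\n\n{user_prompt}"
    if position == "suffix":
        return f"{user_prompt}\n\n{defense_prompt}"
    raise ValueError("position must be 'prefix' or 'suffix'")


guarded = apply_defense("Summarize this article.")
```

The guarded string is what would be sent to the model. No second model, classifier, or post-processing pass is involved, which is why the approach adds almost no overhead.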

In tests across various models, PCDefense significantly reduced jailbreak success rates for both privileged and marginalized groups. It also narrowed the gap between groups, which suggests that the safety-related bias was reduced alongside the overall attack success rate.
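The two effects described here, a lower success rate for each group and a smaller gap between groups, can be checked separately. The following sketch uses invented rates to show the comparison. `defense_effect` and every number in it are hypothetical, not figures from the study.

```python
def defense_effect(before, after):
    """Compare per-group jailbreak success rates without and with a defense."""
    return {
        # How much each group's success rate dropped.
        "reduction": {group: before[group] - after[group] for group in before},
        # Spread between the most and least vulnerable group.
        "gap_before": max(before.values()) - min(before.values()),
        "gap_after": max(after.values()) - min(after.values()),
    }


# Hypothetical success rates, NOT results from the study.
before = {"marginalized": 0.5, "privileged": 0.25}
after = {"marginalized": 0.25, "privileged": 0.125}

effect = defense_effect(before, after)
```

In this made-up case both groups see fewer successful attacks, and the gap shrinks from 0.25 to 0.125. A defense could in principle lower both rates while leaving the gap unchanged, which is why the two quantities are reported separately.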

According to the researchers, this makes PCDefense an efficient and scalable way to improve large language model safety without additional compute.

Open Source Code and Broader Implications

The findings underline how complex it is to build AI systems that are safe, fair, and useful at the same time. Safety guardrails can protect users, but fine-tuning specific guardrails can also degrade overall model performance, including creativity.

To support further research, the authors made the code and all associated artifacts of PCJailbreak available as open source. That makes it possible for others to examine the method and test defenses against similar weaknesses.

Theori Inc., the company behind the research, is a cybersecurity firm based in the US and South Korea that specializes in offensive security. It was founded in January 2016 by Andrew Wesie and Brian Pak.

The larger lesson is straightforward: AI safety work cannot be judged only by whether a model refuses harmful requests in general. It also has to be tested for consistency across the words, identities, and contexts that attackers may use to probe its limits.