Large language models have become better at avoiding openly racist responses, especially after training with methods that use human feedback. But research described by Ars Technica shows a harder problem: bias can remain active when race is signaled indirectly through language.
The study focused on African American English, or AAE, and found that several models associated its speakers with sharply negative traits. Those associations did not stay abstract. In tests involving employment and criminal justice scenarios, the models treated AAE speakers less favorably than speakers using standard American English.
What the researchers tested
The work examined whether models that no longer produce overtly biased answers still carry subtler forms of bias. The comparison matters because current systems can appear neutral in a simple prompt while behaving differently when the signal is less direct.
The researchers drew on the idea behind the Princeton Trilogy studies. Starting in 1933, Princeton University students were periodically asked to list six terms they associated with different ethnic groups. Early responses about African Americans included negative words such as “lazy,” “ignorant,” and “stupid,” alongside “musical” and “religious.” Over time, the openly negative terms became less severe and more positive associations appeared.
When modern language models are asked directly for words associated with African Americans, newer systems can produce positive lists. GPT3.5 and GPT4, after reinforcement learning from human feedback, produced only positive terms in that direct test. RoBERTa and T5 also produced mostly positive lists, while GPT2 still reflected more of society’s older biases.
The central question was whether that improvement meant the bias had been removed, or whether it had been pushed out of view in direct questions.
AAE revealed a different pattern
To test that question, the researchers used paired phrases. One phrase used standard American English, while the other used patterns often seen in African American English. They then asked models to associate terms with the speakers of those phrases.
The results were striking. For AAE speakers, every term produced by every model was negative. GPT2, RoBERTa, and T5 returned “dirty,” “stupid,” “rude,” “ignorant,” and “lazy.” GPT3.5 changed part of the list to include “aggressive” and “suspicious.” GPT4 returned “suspicious,” “aggressive,” “loud,” “rude,” and “ignorant.”
That contrast is the core finding. A model could respond positively when asked directly about African Americans, yet respond negatively when the prompt used language patterns associated with AAE. The researchers summarized the issue by saying that “language models exhibit archaic stereotypes about speakers of AAE that most closely agree with the most-negative human stereotypes about African Americans ever experimentally recorded, dating from before the civil rights movement.”
The study also checked whether the effect applied broadly to American English variants. A similar test using the Appalachian dialect did not show the same pattern, which supported the conclusion that the bias was specific to AAE in these experiments.
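The paired-phrase comparison used in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' code: the prompt template wording, the `TextPair` and `association_shift` names, and the placeholder texts are all hypothetical, and the model scoring step is left out because the article does not describe it.

```python
from dataclasses import dataclass

# Hypothetical prompt template; the exact wording is an assumption,
# not taken from the study.
TEMPLATE = 'A person who says "{text}" tends to be'


@dataclass(frozen=True)
class TextPair:
    sae: str  # standard American English version of a text
    aae: str  # African American English version of the same content


def build_prompts(pair, template=TEMPLATE):
    """Return two prompts that differ only in the embedded text."""
    return {
        "sae": template.format(text=pair.sae),
        "aae": template.format(text=pair.aae),
    }


def association_shift(scores_sae, scores_aae):
    """Per-trait score difference (AAE minus SAE).

    The inputs map trait words to whatever score a model assigns them
    after each prompt; how that score is obtained is outside this sketch.
    """
    return {
        trait: scores_aae[trait] - scores_sae[trait]
        for trait in scores_sae
        if trait in scores_aae
    }


# Placeholder texts only; real paired sentences would go here.
pair = TextPair(
    sae="<SAE version of a sentence>",
    aae="<AAE version of the same sentence>",
)
prompts = build_prompts(pair)
```

Because the two prompts are identical apart from the quoted text, any consistent difference in the traits a model attaches to them can be attributed to the language variety rather than to the topic.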
Why this matters beyond word association
Negative associations inside a model become more serious when the model is used to make or influence decisions. The researchers pointed to AI tools that screen social media histories of job applicants. If those histories include AAE, a model may treat the applicant differently because of the language they use.
The source notes that this practice is forbidden by the EU’s AI regulations. The researchers therefore tested scenarios where model judgments could carry real-world consequences.
In an employment test, the models were given samples of standard American English and AAE, then asked what jobs the speakers might have. For standard American English, many suggestions were high-education roles, including professor, astronaut, psychiatrist, and diplomat. For AAE speakers, the models had more difficulty producing job lists, and many suggestions were lower-prestige roles such as cook and guard.
Newer GPT models did sometimes offer higher-prestige suggestions for AAE speakers. But those were mainly in athletics or the performing arts, which the source notes do not carry the same education requirements as the roles suggested for standard American English speakers.
Legal scenarios showed smaller but serious differences
The researchers also tested a hypothetical trial. The key evidence was a paragraph written either in standard American English or in AAE. Across the models, the AAE speaker was more likely to be convicted, although the margins were described as relatively small.
One example from the source gives the scale of the difference: the AAE speaker was convicted in about 69 percent of the cases, while the standard American English speaker was convicted 62 percent of the time.
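To put those two figures side by side, the short calculation below works out the gap in percentage points and the relative rate. It uses only the two conviction rates quoted above; the variable names are illustrative.

```python
# Conviction rates quoted in the article, in percent.
aae_conviction_pct = 69
sae_conviction_pct = 62

# Absolute gap in percentage points.
gap_points = aae_conviction_pct - sae_conviction_pct

# Relative rate: how many times as often the AAE speaker was convicted.
relative_rate = round(aae_conviction_pct / sae_conviction_pct, 2)

print(gap_points)     # 7
print(relative_rate)  # 1.11
```

In this example the AAE speaker was convicted about 7 percentage points more often, or roughly 1.11 times the rate for the standard American English speaker.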
A separate experiment asked about sentencing after a first-degree murder conviction. In that test, models were more likely to call for a death sentence for an AAE speaker.
These tests do not show how a model would behave in a real legal proceeding. They do show that language-linked bias can move from word association into judgment-like outputs. That is the part that makes the finding especially relevant for anyone considering LLMs in hiring, screening, assessment, or legal support.
Human feedback helps, but it has limits
The source describes the finding as connected to a broader pattern in the US: open racism may be discouraged in many settings, while racially polarized behavior can still persist. In the model tests, earlier GPT versions without human feedback tuning showed both overt bias against African Americans and implicit bias against AAE speakers. Larger models with stronger human feedback training reduced the overt bias, but the AAE-linked bias remained.
The researchers looked at two data sets used for human feedback training and found that they did not include examples of AAE usage. That suggests a practical path for improvement: include AAE, and possibly other language variants, in feedback training.
But the article also makes clear that this would not solve the whole problem. LLMs are trained on enormous bodies of material. The broader the training collection, the greater the chance that writing from times or communities where racism was more accepted becomes part of the model’s statistical foundation. Screening during pre-training may remove a lot of that material, but the results described here suggest enough remains to shape behavior.
The lesson is not simply that models need better filters. It is that a model can appear to pass a direct fairness test while failing a more realistic one. For LLM bias against African American English, the risk is that polished surface behavior may hide older stereotypes that still affect consequential outputs.