Apple Intelligence is meant to make everyday communication easier by condensing notifications, text messages and emails. A new independent investigation by the non-profit AI Forensics suggests that convenience may come with a serious tradeoff: the system can reshape information in ways that reflect social bias.
The investigation analyzed more than 10,000 AI-generated summaries. Because Apple Intelligence can surface summaries automatically on hundreds of millions of iPhones, iPads and Macs, the findings raise a bigger question than whether a single summary is wrong. They point to what happens when an automated system quietly becomes part of how people read messages, news-like alerts and professional communication.
What AI Forensics tested
According to Apple's technical report, the on-device model has roughly three billion parameters and runs locally. AI Forensics accessed the model through Apple's own developer framework, the same interface Apple offers third-party developers for integration.
The researchers then designed tests to see how Apple Intelligence summarized different kinds of text. One part of the work focused on ethnicity. The researchers wrote 200 fictitious news stories with explicitly mentioned ethnicities and created four variations of each. Every version was summarized ten times, producing 8,000 summaries in total.
The results showed an uneven pattern. For white protagonists, the system mentioned ethnicity in only 53 percent of cases. For Black protagonists, that rose to 64 percent. For Hispanic protagonists, it rose to 86 percent. For Asian protagonists, it reached 89 percent.
That difference matters because a summary is not just shorter text. It decides what is worth preserving. In these tests, Apple Intelligence was more likely to keep ethnicity visible for some groups than for others, making whiteness less marked while other identities appeared more salient.
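The arithmetic behind the ethnicity test, and the kind of tally that produces a per-group mention rate, can be sketched in a few lines of Python. This is a minimal illustration, not AI Forensics' code: the function name `mention_rate`, the keyword list and the sample summaries are invented here, and a real analysis would likely need more careful matching than a whole-word keyword search.

```python
import re

# Test design as described in the report: 200 stories x 4 variations x 10 runs.
STORIES, VARIATIONS, RUNS = 200, 4, 10
TOTAL_SUMMARIES = STORIES * VARIATIONS * RUNS  # 8000


def mention_rate(summaries, keywords):
    """Share of summaries containing at least one keyword as a whole word."""
    if not summaries:
        return 0.0
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, keywords)) + r")\b",
        re.IGNORECASE,
    )
    hits = sum(1 for text in summaries if pattern.search(text))
    return hits / len(summaries)


# Hypothetical sample summaries, invented for illustration only.
sample = [
    "An Asian teacher won a regional award.",
    "A teacher won a regional award.",
    "Asian teacher honored for classroom work.",
    "Local teacher receives award.",
]

print(TOTAL_SUMMARIES)                  # 8000
print(mention_rate(sample, ["Asian"]))  # 0.5
```

Running the same tally separately for each group's summaries is what allows a comparison such as 53 percent versus 89 percent.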
Gender signals changed too
AI Forensics also examined gender using 200 real BBC headlines. In that analysis, women's first names were kept in 80 percent of summaries. Men's first names were kept in 69 percent of summaries.
The source article notes that men were more often referenced by surname only, a pattern research links to higher perceived status. That does not mean every individual summary carried the same implication. But across many summaries, the naming pattern suggests that the system can reproduce subtle differences in how status is signaled.
This is the kind of issue that can be easy to miss in one notification and hard to ignore in aggregate. A reader may not notice whether a summary preserves a first name or switches to a surname. But when a system does that differently across gendered names, it can change tone and perceived authority at scale.
Ambiguity became assumption
The most revealing tests involved ambiguous language. AI Forensics built more than 70,000 scenarios featuring two people with different professions and an ambiguous pronoun. In those cases, the original text left the reference open, so a careful summary should have kept that ambiguity intact.
Apple Intelligence often did the opposite. In 77 percent of cases, the system assigned the pronoun to a specific person even though the source text had not done so. Two-thirds of those invented assignments followed gender stereotype lines. The system preferred to assign "she" to the nurse and "he" to the surgeon.
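To make the setup concrete, here is a rough sketch of how scenarios of this kind could be generated. The report's actual templates are not reproduced here: the profession list, the sentence template and the function name `build_scenarios` are all invented for illustration.

```python
from itertools import permutations

# Hypothetical professions and sentence template, invented for illustration.
PROFESSIONS = ["nurse", "surgeon", "engineer", "teacher"]
PRONOUNS = ["he", "she"]
TEMPLATE = (
    "The {a} spoke with the {b} after the shift "
    "because {pronoun} was running late."
)


def build_scenarios(professions, pronouns):
    """One sentence per ordered pair of distinct professions and per pronoun."""
    return [
        TEMPLATE.format(a=a, b=b, pronoun=p)
        for a, b in permutations(professions, 2)
        for p in pronouns
    ]


scenarios = build_scenarios(PROFESSIONS, PRONOUNS)
print(len(scenarios))  # 12 ordered pairs x 2 pronouns = 24
print(scenarios[0])
```

In each generated sentence the pronoun could refer to either person, so a faithful summary would leave the reference open instead of attaching it to one profession.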
Across eight other social dimensions, the system added assignments that were not present in the original text 15 percent of the time. Nearly three-quarters of those matched common prejudices. The source article gives three examples: a Syrian student was connected to terrorism, a pregnant applicant was described as unfit for work, and a person with short stature was depicted as incompetent. None of those details appeared in the source text.
That distinction is central. A summary can be flawed because it omits context, compresses nuance or chooses the wrong emphasis. Here, the concern is more direct: the system sometimes introduced claims the original text did not support, and those additions often moved in biased directions.
Why the comparison with Gemma3-1B matters
AI Forensics argues that these outcomes are not an unavoidable result of summarization. For comparison, the researchers tested Google's Gemma3-1B, an open-weight model with only about a third as many parameters as Apple's model.
In identical scenarios, Gemma3-1B hallucinated, that is, added an assignment the source text did not contain, six percent of the time, compared to 15 percent for Apple. When Google's model did hallucinate, the assignments matched stereotypes in 59 percent of cases, versus 72 percent for Apple.
The comparison does not make Gemma3-1B error-free. It does suggest that model size, summarization pressure and ambiguity do not automatically have to produce the same level of biased output. AI Forensics presents the Apple results as a design and deployment problem, not merely a generic limitation of AI.
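The two percentages for each model can be combined into a single figure: the share of all scenarios in which the model both added an unsupported assignment and did so along stereotype lines. The short calculation below is derived here from the reported numbers and does not appear in the report itself; the function name `stereotyped_share` is made up for this sketch.

```python
def stereotyped_share(assignment_rate, stereotype_rate):
    """Share of all scenarios with an unsupported, stereotype-matching assignment."""
    return assignment_rate * stereotype_rate


apple = stereotyped_share(0.15, 0.72)  # rates reported for Apple's model
gemma = stereotyped_share(0.06, 0.59)  # rates reported for Gemma3-1B

print(f"Apple: {apple:.1%}")
print(f"Gemma3-1B: {gemma:.1%}")
print(f"Ratio: {apple / gemma:.1f}x")
```

On these figures, roughly 10.8 percent of scenarios ended with a stereotyped invention for Apple against about 3.5 percent for Gemma3-1B, a gap of about three to one.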
The stakes are higher because users may not ask for summaries
Large language models are already known to reproduce social biases. The source article points to a University of Michigan study that found models consistently perform better when given male or gender-neutral roles rather than female ones.
Apple Intelligence is different because the user does not need to open a chatbot or write a prompt. The summaries can appear unprompted on lock screens, in message threads and in inboxes. That means the system can stand between sender and recipient without either person choosing a generative rewrite in that moment.
AI Forensics also frames the issue in regulatory terms. Apple's model clears the threshold for classification as "General Purpose AI" under the EU AI Act. Given its reach, it could even qualify as a model with systemic risk. According to the report, Apple has not signed the voluntary Code of Practice but benefits from a two-year transition period.
The findings also follow a previous Apple Intelligence controversy. Earlier in 2025, Apple Intelligence made headlines for generating fake news summaries attributed to the BBC and the New York Times. Apple responded by turning off summaries for news apps. AI Forensics says personal and professional messages were not affected by that fix, even though similar distortions are at issue there.
The practical lesson is straightforward: automated summaries are not neutral containers. They choose what to emphasize, what to omit and, in some cases, what to infer. When those choices are made across everyday communication on Apple devices, even small distortions can become a meaningful information problem.