Why AI training data privacy risks start with web scraping

New research found personally identifiable information in DataComp CommonPool, a major open-source data set used for image generation research. The findings raise hard questions about web scraping, consent, deletion, and whether privacy filters can work at the scale of modern AI training data.

The story centers on large-scale scraped training data exposing personal information and creating persistent privacy and misuse risks.

One of the largest open-source data sets used to train image generation models appears to include large amounts of personal information. New research found images of identity documents, faces, job applications, and other records inside DataComp CommonPool, a data set built from web-scraped image-text pairs.

The issue is not only that private material appeared in one data set. The deeper concern is that web-scraped AI training data can be copied, downloaded, reused, and used to train downstream models long after it was posted, often without the people connected to that data knowing it is there.

What researchers found in DataComp CommonPool

DataComp CommonPool was released in 2023 with 12.8 billion data samples. At the time, its curators said it was intended for academic research, but its license does not prohibit commercial use.

Researchers audited only 0.1% of CommonPool's data. Even in that limited sample, they found thousands of images that included personally identifiable information. Those images included identifiable faces and identity documents such as credit cards, driver's licenses, passports, and birth certificates.

The researchers also validated over 800 job application documents, including resumes and cover letters. Those documents were confirmed through LinkedIn and other web searches as being associated with real people. In many more cases, the researchers did not validate the material because they lacked time or because image quality made confirmation difficult.

Some resumes contained sensitive information. The source article says examples included disability status, the results of background checks, birth dates and birthplaces of dependents, and race. When resumes were connected to people with online presences, researchers also found contact information, government identifiers, sociodemographic information, face photographs, home addresses, and contact information for other people such as references.

Why the scale changes the privacy problem

DataComp CommonPool was created as a follow-up to LAION-5B, which was used to train models including Stable Diffusion and Midjourney. Both CommonPool and LAION-5B draw on the same source: web scraping by the nonprofit Common Crawl between 2014 and 2022.

Because the researchers examined only a small portion of CommonPool, they estimate that the full data set could contain hundreds of millions of images with personally identifiable information, including faces and identity documents. They also estimated that a face-blurring algorithm missed 102 million faces across the full data set.
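The arithmetic behind that kind of extrapolation can be sketched in a few lines. This is a minimal illustration only, assuming the audited 0.1% behaves like a uniform random sample and that counts scale proportionally; the researchers' own estimation method may differ, and the function names and the example count of 3 are hypothetical, not figures from the paper.

```python
# Figures taken from the article.
TOTAL_SAMPLES = 12_800_000_000  # CommonPool size at release
AUDIT_PER_THOUSAND = 1          # audited share: 0.1% = 1 sample in every 1,000


def audited_sample_size(total: int, per_thousand: int) -> int:
    """Number of samples covered by an audit of the given share."""
    return total * per_thousand // 1000


def extrapolate(count_in_sample: int, per_thousand: int) -> int:
    """Scale a count seen in the audit up to the full data set.

    Assumption (not from the paper): the audited share is a uniform
    random sample, so counts scale by the inverse of that share.
    """
    return count_in_sample * 1000 // per_thousand


if __name__ == "__main__":
    print(audited_sample_size(TOTAL_SAMPLES, AUDIT_PER_THOUSAND))
    # Hypothetical count, chosen only to show the scale factor.
    print(extrapolate(3, AUDIT_PER_THOUSAND))
```

Under these assumptions, a 0.1% audit covers 12.8 million samples, and each item found in it stands for roughly 1,000 items in the full data set. That multiplier is why a modest count in the audit translates into a very large estimate overall.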

CommonPool has been downloaded more than 2 million times over the past two years. Rachel Hong, a PhD student in computer science at the University of Washington and the paper's lead author, said this means there are likely many downstream models trained on the same data set. That could duplicate similar privacy risks across systems.

The overlap with LAION-5B matters as well. Commercial AI models often do not disclose their training data, but shared data sources mean the same personally identifiable information likely appears in LAION-5B and in other downstream models trained on CommonPool data.

Why filtering was not enough

The curators of DataComp CommonPool were aware that personally identifiable information was likely to appear in the data set. They took some privacy measures, including automatically detecting and blurring faces.

But the research found that those measures did not catch everything. In the audited sample, Hong's team found and validated over 800 faces that the algorithm missed. The source article also notes that the curators did not apply filters that could have recognized known personal-information strings, such as emails or Social Security numbers.
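To make the idea of a string filter concrete, here is a minimal sketch of pattern matching over caption text. It is an illustration, not the curators' or the researchers' tooling: the patterns, function names, and placeholder tokens are assumptions, the email pattern is deliberately simple, and the Social Security pattern only checks the familiar 3-2-4 digit layout, so both will produce false positives and misses.

```python
import re

# Illustrative patterns only; real PII detection needs far more care.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
SSN_LIKE_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # format check, not validity


def find_pii_strings(caption: str) -> dict[str, list[str]]:
    """Return the email-like and SSN-like substrings found in a caption."""
    return {
        "email": EMAIL_RE.findall(caption),
        "ssn_like": SSN_LIKE_RE.findall(caption),
    }


def redact(caption: str) -> str:
    """Replace matched strings with placeholder tokens."""
    caption = EMAIL_RE.sub("[EMAIL]", caption)
    return SSN_LIKE_RE.sub("[SSN]", caption)


if __name__ == "__main__":
    # Fictitious caption; 000-00-0000 is not a valid Social Security number.
    text = "Resume of J. Doe, jane.doe@example.com, SSN 000-00-0000"
    print(find_pii_strings(text))
    print(redact(text))
```

A filter like this only sees text fields such as captions. Personal information rendered inside the image itself, for example a photographed ID card, would pass through untouched unless the text were first extracted from the pixels.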

William Agnew, a postdoctoral fellow in AI ethics at Carnegie Mellon University and one of the coauthors, put the problem plainly: "Filtering is extremely hard to do well." He also said that if researchers web-scrape, private data will still be present even after filtering because of the scale involved.

Face blurring also leaves other privacy gaps. The blurring filter is optional and can be removed. Captions and photo metadata can contain names, exact locations, and other personal details that are not solved by altering the image alone.

Consent becomes harder to define

CommonPool was built from web data scraped between 2014 and 2022. The source article notes that many images likely date to before 2022, when ChatGPT was released. That timing raises a consent problem: even if someone made information available on the web, they could not have agreed to its use in large AI models that did not yet exist.

The researchers also found examples involving children's personal information, including depictions of birth certificates, passports, and health status. The source describes these as appearing in contexts suggesting they had been shared for limited purposes.

Another difficulty is that web scrapers often scrape data from each other. Agnew described a case where a person might upload something, later remove it, and still find that the removal no longer solves the problem because copies have moved elsewhere.

Hugging Face, which hosts CommonPool and distributes training data sets, integrates with a tool that theoretically lets people search for and remove their own information from a data set. But the researchers point out that this requires people to know their data is there in the first place.

What this means for AI policy and practice

The paper calls on the machine-learning community to rethink indiscriminate web scraping. It also discusses possible violations of existing privacy laws and the limits of those laws when applied to massive training data sets.

Marietje Schaake, a Dutch lawmaker turned tech policy expert and fellow at Stanford's Cyber Policy Center, noted that Europe has the GDPR and California has the CCPA, but there is no federal data protection law in the United States. That means Americans have different privacy rights depending on where they live.

Even where privacy laws exist, they may not reach every actor involved. The source article says such laws apply to companies that meet certain criteria for size and other characteristics, and may not necessarily apply to researchers who create and curate data sets such as DataComp CommonPool.

The larger assumption under pressure is that information available on the internet is automatically fair game for AI training. Hong, Agnew, and their colleagues argue that "publicly available" can still include material many people would consider private, such as resumes, credit card numbers, IDs, family blogs, and childhood news stories.

The clearest implication is practical: AI training data privacy cannot be treated as a cleanup task after collection. At the scale of CommonPool, the decision to scrape first and filter later leaves personal data embedded in data sets, downloads, and possibly trained models.