The Decoder, September 11, 2025

Why AI training data has become tech’s own double standard

Reports cited by THE DECODER describe a sharp conflict in how major tech companies treat data access. The same industry seeking broad access to copyrighted work for AI training often bars others from scraping its own platforms without prior written consent.

WTF Index: TERMINATOR (Terminator 2, Idiocracy 1)

The story centers on large-scale unauthorized data scraping and corporate control over access, a mild power-and-harm concern more than a human-decline story.

New reporting highlighted by THE DECODER puts a familiar AI dispute in sharper focus: who gets to use whose data, and on what terms. A two-year investigation by the International Confederation of Music Publishers (ICMP), alongside analysis by The Atlantic, points to a pattern in which major technology companies rely on large-scale scraping for AI training while restricting similar behavior on their own services.

The issue is not just whether AI systems used copyrighted music, lyrics, videos or platform content. It is also whether the rules being promoted for AI development are the same rules companies apply when their own platforms are the target.

What ICMP says it found

ICMP alleges that Google, Microsoft, Meta, OpenAI, and X trained AI systems at scale on copyrighted music. According to a Billboard-exclusive report cited by THE DECODER, the organization spent two years compiling evidence and describes the activity as "the largest IP theft in human history."

ICMP director general John Phelan said the industry is asking for broad access to data while requiring others to seek permission before using platform content. He told Billboard that "tens of millions of works" are being infringed every day.

The dossier, according to ICMP, includes several kinds of material:

Private datasets that allegedly show U.S. music apps Udio and Suno scraping YouTube.
Analyses suggesting Meta’s Llama 3 was trained on lyrics by artists such as The Weeknd and Ed Sheeran.
Court filings in the publishers’ lawsuit against Anthropic alleging that Claude reproduced hundreds of song lyrics, including "American Pie" and "Halo."
Claims that Microsoft’s Copilot and Google’s Gemini replicated copyrighted lyrics.

THE DECODER notes that not every item described as evidence carries the same weight. Chatbot statements about training data, for example, are weak proof because of how language models generate answers. Even so, the broader picture described in the source is that AI firms large and small have drawn heavily on copyrighted datasets across text, music and imagery.

YouTube videos became training material too

The Atlantic’s analysis, as summarized by THE DECODER, focuses on video. It reports that at least 15.8 million YouTube videos from more than 2 million channels were downloaded without permission and placed into at least 13 datasets. Nearly 1 million of those videos were how-to clips.

Titles and channel names were often removed, but The Atlantic reported that unique IDs could still be used to recover them. Mass downloading violates YouTube’s terms of service, yet The Atlantic wrote that YouTube has done little to stop it, and the company did not comment.

The companies named as having used the datasets include Microsoft, Meta, Amazon, Nvidia, Runway, ByteDance, Snap, and Tencent. Meta, Amazon, and Nvidia said they respect creators and believe their use is lawful. Amazon said it is currently focused on producing "compelling, high-quality advertisements from simple prompts."

The exposure was not limited to obscure uploads. The Atlantic reported that news and educational channels were heavily represented, including the BBC with at least 33,000 videos and TED with nearly 50,000. Hundreds of thousands of videos from individual creators were also included.

How scraped video is made useful for AI

The source describes a practical reason these datasets matter: raw video is not enough. To train AI video systems, clips need structure. Long videos are split into segments and paired with English captions created by crowd workers or automatically by AI, so the system can connect language with moving images.
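The segment-and-caption step can be pictured with a minimal sketch. Everything specific here is an assumption made for illustration rather than a detail from the reporting: the fixed 10-second window, the `Caption` and `Clip` types, the `segment_video` function, and the rule that a caption belongs to any clip it overlaps in time.

```python
from dataclasses import dataclass


@dataclass
class Caption:
    start: float  # seconds from the start of the video
    end: float
    text: str


@dataclass
class Clip:
    video_id: str
    start: float
    end: float
    caption: str


def segment_video(video_id, duration, captions, window=10.0):
    """Split one video into fixed windows and attach overlapping caption text.

    The window length and the overlap rule are illustrative assumptions.
    """
    clips = []
    start = 0.0
    while start < duration:
        end = min(start + window, duration)
        # A caption is attached to a clip when their time ranges overlap.
        texts = [c.text for c in captions if c.start < end and c.end > start]
        if texts:  # segments with no caption text are skipped
            clips.append(Clip(video_id, start, end, " ".join(texts)))
        start = end
    return clips


# Hypothetical 25-second how-to video with two timed captions.
captions = [
    Caption(0.0, 4.0, "Whisk the eggs."),
    Caption(12.0, 15.0, "Pour into the pan."),
]
clips = segment_video("vid-001", 25.0, captions)
```

In this toy case the final five seconds carry no caption, so only two text-and-video pairs result. Real pipelines likely cut on scene boundaries rather than fixed windows; the sketch only shows why each clip ends up with its own piece of language.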

Dataset builders also appear to have selected material by quality signals. Curators of HowTo100M and HD-VILA-100M relied on high view counts, while HD-VG-130M used AI to choose clips of "aesthetic quality." The Atlantic also reported that datasets often avoid videos with overlays such as subtitles or logos, making watermarks a possible deterrent.
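A filter of this kind is simple to express. The sketch below is hypothetical: the `views` and `has_overlay` fields, the `select_candidates` function and the 100,000-view threshold are invented for illustration, and the reporting does not say what cutoffs any curator actually used.

```python
def select_candidates(videos, min_views=100_000):
    """Keep videos above a view threshold that carry no on-screen overlays.

    Field names and the default threshold are illustrative assumptions.
    """
    return [
        v for v in videos
        if v["views"] >= min_views and not v["has_overlay"]
    ]


# Hypothetical records standing in for scraped video metadata.
videos = [
    {"id": "a", "views": 2_500_000, "has_overlay": False},
    {"id": "b", "views": 3_000_000, "has_overlay": True},  # watermarked
    {"id": "c", "views": 1_200, "has_overlay": False},     # too few views
]
selected = [v["id"] for v in select_candidates(videos)]
```

Video "b" has the most views but is dropped because of its overlay, which is the mechanism behind the suggestion that watermarks may act as a deterrent.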

A leak from Runway, reported by 404 Media and cited by The Atlantic, showed interest in videos with "high camera movement," "beautiful cinematic landscapes," and "super high quality sci-fi short films." One channel was labeled "THE HOLY GRAIL OF CAR CINEMATICS SO FAR."

These details matter because they suggest the data was not an untraceable mass of anonymous files. The material was sorted, labeled and selected according to attributes that were useful for model training.

The contradiction at the center of the fight

The reports describe a double standard. On one side, AI companies argue for broad access to content so they can train models. On the other, ICMP points to terms at Facebook, YouTube, X, Google, OpenAI, Microsoft, and Adobe that require prior written consent for data use on their own platforms.

The same tension appears in the discussion over transparency. THE DECODER says the reporting challenges the argument that disclosing training data is too complex. The data ICMP reviewed and leaks from companies such as Runway suggest that scraped content can be tracked with metadata such as artist, genre, and tempo.

That kind of traceability is relevant to the sort of disclosure envisioned by the EU’s AI Act, according to the source. If training material is already being categorized in detail for technical use, the question becomes whether companies can also account for that material to creators, publishers and regulators.
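To make the point concrete: if a training manifest already records fields like artist, genre and tempo, turning it into a per-artist summary is a small grouping step. The manifest layout below (`source_id`, `artist`, `genre`, `tempo_bpm`) and the `summarize_by_artist` function are assumptions for illustration, not a description of any company's records or of a format the AI Act prescribes.

```python
from collections import defaultdict


def summarize_by_artist(manifest):
    """Group manifest rows by artist and list the source IDs attributed to each."""
    summary = defaultdict(list)
    for row in manifest:
        summary[row["artist"]].append(row["source_id"])
    # Sort the IDs so the report is stable regardless of manifest order.
    return {artist: sorted(ids) for artist, ids in summary.items()}


# Hypothetical manifest rows; the artists and track IDs are placeholders.
manifest = [
    {"source_id": "trk-002", "artist": "Artist A", "genre": "pop", "tempo_bpm": 120},
    {"source_id": "trk-001", "artist": "Artist A", "genre": "pop", "tempo_bpm": 96},
    {"source_id": "trk-003", "artist": "Artist B", "genre": "folk", "tempo_bpm": 80},
]
report = summarize_by_artist(manifest)
```

The sketch says nothing about whether such manifests exist in this form. It only illustrates that once the metadata is kept, accounting for it is not the hard part.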

Why creators see a platform problem

The practical stakes are already visible in products named by the source. Meta is developing its Movie Gen text-to-video suite, Snap offers AI Video Lenses, and Google’s Gemini can animate photos into short clips or generate new videos with Veo 3.

At the same time, platforms are training on their own content. THE DECODER reports that Google trains on at least 70 million YouTube clips and Meta on more than 65 million Instagram clips.

For creators, that creates a direct concern: the platforms they helped fill with music, video, tutorials and visual work are also becoming places where synthetic content can compete for attention. The core dispute is therefore not only about past scraping. It is about whether the next generation of AI products is being built with one set of rules for technology companies and another set for everyone else.