New RSL protocol pushes AI data licensing toward scale

Real Simple Licensing is a new system designed to let publishers state AI data licensing terms in machine-readable form. It has support from major web publishers, but its impact depends on whether major AI labs decide to use it.

The AI industry is being pushed toward a harder conversation about training data. After Anthropic’s $1.5 billion copyright settlement, and with as many as 40 other pending cases seeking damages for unlicensed data, a new protocol is trying to turn web content licensing into something that can work at internet scale.

That protocol is Real Simple Licensing, or RSL. It is backed by a group of technologists and web publishers, including major names such as Reddit, Quora, and Yahoo, and it aims to give AI companies a clearer way to identify licensing terms before using publisher content.

What RSL Is Trying To Fix

The central problem is straightforward: AI companies need large amounts of data, and web publishers want clearer control over how their content is used. Without some kind of licensing system, AI companies could face a wave of copyright lawsuits that some worry could permanently slow the industry.

RSL is intended to create a shared technical and legal framework for AI data licensing. Eckart Walther, a co-founder of RSL and co-creator of the RSS standard, described the goal to TechCrunch in direct terms: “We need to have machine-readable licensing agreements for the internet.”

That phrase matters because the web is too large for one-off negotiations to cover every page, publisher, and use case. If licensing terms are machine-readable, AI companies can more easily determine what a publisher allows, what requires permission, and what may need payment.

Groups such as the Dataset Providers Alliance have already pushed for clearer data collection practices. RSL goes further by attempting to provide both a technical layer and a legal structure that could make those practices easier to apply in the real world.

How The Protocol Works

On the technical side, the RSL Protocol lets publishers state specific licensing terms for their content. Those terms can indicate whether an AI company needs a custom license or whether Creative Commons provisions apply.

Participating websites will place those terms in their “robots.txt” file using a prearranged format. That would make the terms easier for software systems to find and interpret, rather than leaving companies to infer rights from scattered policies or manual review.

The approach is designed for scale. A publisher can publish terms once, and AI companies can read them automatically. In principle, that gives rights holders a cleaner way to say what is allowed and gives AI companies a clearer path to compliance.
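
As a rough illustration of what "read them automatically" could mean in practice, the sketch below scans a robots.txt body for a licensing pointer. The article does not describe the actual format, so the `License:` field name, the example URL, and the `find_license_urls` helper are all assumptions made for illustration, not the RSL specification.

```python
def find_license_urls(robots_txt: str) -> list[str]:
    """Return the value of every 'License:' line in a robots.txt body.

    The 'License' field name is an assumption for this sketch; the
    article only says terms are placed in robots.txt in a prearranged format.
    """
    urls = []
    for raw_line in robots_txt.splitlines():
        # robots.txt comments start with '#'; drop them before parsing.
        line = raw_line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        # Split only on the first colon so the URL's own colons survive.
        field, value = line.split(":", 1)
        if field.strip().lower() == "license" and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Allow: /
# Hypothetical licensing pointer
License: https://example.com/license.xml
"""

if __name__ == "__main__":
    print(find_license_urls(EXAMPLE_ROBOTS_TXT))
```

A crawler built this way would fetch the referenced terms before ingesting any page, and fall back to manual review when no pointer is present.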

RSL also includes a legal layer through the RSL Collective. The collective licensing organization can negotiate terms and collect royalties, in a model compared in the source to ASCAP for musicians or MPLC for films.

The point is to reduce friction on both sides. Licensees, meaning the AI companies, get a single point of contact for paying royalties. Rights holders get a way to set terms with many potential licensees at once, instead of negotiating every possible use separately.

Who Is Already Involved

A number of publishers have joined the RSL Collective. The list includes Yahoo, Reddit, Medium, O’Reilly Media, Ziff Davis, Internet Brands, People Inc., and The Daily Beast. Ziff Davis owns Mashable and CNET, while Internet Brands owns WebMD.

Other companies are supporting the standard without joining the collective. That group includes Fastly, Quora, and Adweek.

One notable part of the early coalition is that some participating publishers already have their own licensing arrangements. Reddit, for example, receives an estimated $60 million a year from Google for use of its training data.

RSL does not prevent separate deals. Publishers can still negotiate their own terms inside the system. The source compares this to Taylor Swift setting special licensing terms while still collecting royalties through ASCAP.

For smaller publishers, however, collective terms may matter more. A publisher without enough leverage to secure a direct AI licensing deal could still participate in a shared system that sets terms and collects payment.

The Hard Part: Measuring Use

The promise of RSL is clearer licensing. The difficult part is proving when payment is owed.

With music, it is comparatively easy to determine when a song has been played. AI models create a more complicated problem because the value of a document may appear during training, in real-time retrieval, or in later outputs that do not obviously point back to one original source.

The source identifies Google’s AI Search Abstracts as the simpler case. Because that kind of product draws from the web in real time and maintains strict attribution for each fact, it is easier to connect use to a source.

Large language model training is harder. If the act of training is not logged when it happens, it may be nearly impossible to confirm that a specific document was ingested into a model.

The challenge grows if publishers want payment per inference rather than a blanket fee. The source notes that one of the stock RSL licenses offers that option, but it would require AI companies to track usage in a way that can support royalty reporting.
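
To make the tracking requirement concrete, here is a minimal sketch of the kind of ledger a per-inference license would imply: every use of a publisher's content is counted as it happens, and royalties are computed from the counts. The `UsageLedger` class, the publisher names, and the rate are hypothetical; the article does not describe how any lab or the RSL Collective actually reports usage.

```python
from collections import Counter
from decimal import Decimal


class UsageLedger:
    """Counts uses of each publisher's content so royalties can be reported."""

    def __init__(self) -> None:
        self._counts: Counter[str] = Counter()

    def record(self, publisher: str, uses: int = 1) -> None:
        # Must be called at the moment of use; uses that were never
        # logged cannot be reconstructed later.
        self._counts[publisher] += uses

    def royalties(self, rate_per_use: Decimal) -> dict[str, Decimal]:
        # Decimal avoids floating-point drift when summing tiny per-use fees.
        return {pub: n * rate_per_use for pub, n in self._counts.items()}


if __name__ == "__main__":
    ledger = UsageLedger()
    ledger.record("publisher-a.example")
    ledger.record("publisher-a.example")
    ledger.record("publisher-b.example")
    # Hypothetical rate; real terms would come from the license itself.
    print(ledger.royalties(Decimal("0.001")))
```

A blanket fee needs none of this bookkeeping, which is part of why per-inference terms are the harder option to adopt.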

Doug Leeds, a co-founder of RSL and former CEO of IAC Publishing, argues that the problem can be managed. “Some of the licensing agreements they’ve already done have required them to be able to report on it, so it’s possible,” he said. “It doesn’t have to be perfect. It just has to be good enough to get people paid.”

Why AI Lab Adoption Is The Real Test

RSL can give publishers a mechanism for declaring terms, but it cannot by itself force AI companies to accept them. The major question is whether frontier labs and other AI companies will treat the system as a practical standard.

There are signs that AI companies do pay for data when they see enough value. The source points to the success of Scale AI and Mercor as evidence that frontier labs have no problem paying for data.

The web, however, has often been viewed differently: as a source of cheap, low-quality data. With datasets such as Common Crawl already available, RSL faces the challenge of persuading labs to pay royalties for something they have been used to accessing for free.

Another complication is the blurred line between web-scraping and machine-enhanced browsing. The recent dustup between Cloudflare and Perplexity shows that even identifying the activity can be difficult.

Leeds points to comments from AI leaders calling for a system like RSL, including Sundar Pichai at last year’s DealBook Summit. Whether those calls were firm commitments or broad public signals, the RSL team appears ready to test them.

“They have said outwardly to everyone, something like this needs to exist,” Leeds told TechCrunch. “We need a protocol. We need a system.”

RSL may now provide that system. Its success will depend less on whether publishers can publish machine-readable terms, and more on whether AI companies decide those terms are worth honoring at scale.