MIT Tech Review AI June 24, 2026 NEUTRAL

Why AI Needs a Real-Time Web Data Infrastructure Layer

AI systems increasingly need fresh, relevant, and trustworthy web data, not just larger models or static training sets. A web data infrastructure layer can help enterprises retrieve, structure, and govern public web information at scale.

AI is moving deeper into enterprise workflows, but many systems still face a basic constraint: the information they need is often blocked, unstructured, or changing too quickly for static training data to remain useful.

That is creating demand for a new kind of AI infrastructure. The emerging web data infrastructure layer is designed to help models discover, retrieve, and use current information from the public web at scale.

The Web Was Not Built for AI Retrieval

The modern web contains vast amounts of information, but it was not designed for automated discovery and retrieval by AI applications. New AI tools need to find, map, and interpret a digital environment that includes hundreds of millions of existing web domains and billions of new URLs created each week.

For enterprises, the challenge is not only volume. Web data varies by geography, language, format, and access rules. It may be locked behind technical barriers, presented in formats that are difficult for models to use, or updated faster than traditional systems can track.

Or Lenchner, CEO of Bright Data, a web data collection platform, describes the scale of the opportunity this way:

“The data suggests there's far more data out there. Think of the universe: It's out there, but you don't know what you don't know.”

That gap matters because AI performance increasingly depends on more than model architecture. Compute, networking, retrieval, and data engineering all shape whether an AI system can provide answers grounded in current and verifiable information.

Static Training Data Is No Longer Enough

Early AI advances were driven by scaling training data and model size. But traditional model training relies on snapshots of information collected at a particular point in time. In fast-moving business contexts, that kind of static knowledge can quickly become stale.

Companies may need to monitor competitor pricing, consumer sentiment, market trends, inventory, security threats, and customer behavior. Those conditions change continuously. An AI system that cannot retrieve current information may provide an answer that is technically fluent but operationally weak.

Lenchner puts the business risk plainly:

“If it can't retrieve real-time information, it lacks context. In a business setting, that's not acceptable anymore. Stale answers lead to bad decisions and disappointed consumers.”

Real-time web data can also support trust. One survey found that 56% of AI practitioners said businesses need access to real-time web data to improve trust in AI outputs. The source article also notes that live, high-quality web data can reduce AI hallucinations by giving models a more relevant knowledge base.

Retrieval-augmented generation, or RAG, is one response to this problem. With RAG, models pull in external data when a query is made. But large-scale retrieval alone does not guarantee current, contextually relevant, and trustworthy outputs in operational settings.
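To make the RAG idea concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular platform works: the `Document` type, the word-overlap scoring, and the `max_age` freshness cutoff are all assumptions made for this example. It shows external documents being selected at query time, with stale ones excluded, before a prompt is assembled for a model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Document:
    """Hypothetical retrieved snippet with the time it was fetched."""
    text: str
    fetched_at: datetime


def retrieve(query, docs, now, max_age, top_k=2):
    """Return up to top_k fresh documents ranked by shared-word count."""
    query_words = set(query.lower().split())
    # Freshness filter: drop anything older than max_age.
    fresh = [d for d in docs if now - d.fetched_at <= max_age]
    scored = [(len(query_words & set(d.text.lower().split())), d) for d in fresh]
    # Keep only documents that share at least one word with the query.
    scored = [(score, d) for score, d in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_k]]


def build_prompt(query, context):
    """Assemble the text that would be sent to a model (no model call here)."""
    lines = [f"- {d.text}" for d in context]
    return "Context:\n" + "\n".join(lines) + f"\n\nQuestion: {query}"


now = datetime(2026, 6, 24, tzinfo=timezone.utc)
docs = [
    Document("widget price is 19 dollars", now - timedelta(hours=1)),
    Document("widget price is 25 dollars", now - timedelta(days=30)),
    Document("weather is sunny", now - timedelta(hours=1)),
]
hits = retrieve("current widget price", docs, now, max_age=timedelta(days=1))
prompt = build_prompt("current widget price", hits)
```

In this toy run only the one-hour-old price snippet reaches the prompt. The month-old price is filtered out as stale, which is the failure mode the article attributes to static snapshots. A production system would likely use semantic search in place of word overlap.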

According to Gartner, 60% of AI projects that are not supported by AI-ready data will be abandoned by the end of 2026. In this context, AI-ready data means information that is accurate, structured, organized, and contextualized.

What the New Infrastructure Layer Does

A web data infrastructure layer is meant to help AI systems discover data, access it in real time, and shape it for a specific context. Instead of relying only on more computing power, this kind of platform focuses on retrieving and transforming web information so it can feed AI systems more effectively.

The source describes platforms that emulate human browsing behavior to access available content and turn raw code into structured data feeds. This can be important for websites that do not work well with traditional scraping tools, including sites heavy in JavaScript or protected by aggressive antibot software.
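As a rough illustration of turning raw page code into a structured record, the sketch below uses Python's standard `html.parser` module. The sample markup, the class names `product-name` and `product-price`, and the output fields are hypothetical. It only handles static HTML; JavaScript-heavy or bot-protected pages, which the article highlights, would need a rendering or browsing layer that is not shown here.

```python
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Collects text from elements whose class attribute maps to a field."""

    # Hypothetical mapping from CSS class to output field name.
    FIELDS = {"product-name": "name", "product-price": "price"}

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current = None  # field being filled, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        self._current = self.FIELDS.get(css_class)

    def handle_data(self, data):
        if self._current and data.strip():
            self.record[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None


def to_record(html):
    """Parse raw HTML into a dict, converting the price to a float."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    record = dict(parser.record)
    if "price" in record:
        record["price"] = float(record["price"].lstrip("$"))
    return record


RAW_HTML = (
    '<div><h1 class="product-name">Trail Shoe</h1>'
    '<span class="product-price">$89.50</span></div>'
)
structured = to_record(RAW_HTML)
```

The result is a small dictionary with a text name and a numeric price, the kind of typed, labeled output a downstream AI system can consume more easily than markup.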

Lenchner describes the technical ambition as infrastructure that can mimic a web user with identifying information such as IP address, location, and “1,000 more parameters.” The same passage describes this happening “80 billion times a day for millions of websites.”

At enterprise scale, systems often combine multiple sources:

Public web retrieval
APIs
Licensed datasets
Proprietary internal data

The hard part is integrating those fragmented inputs into a timely and usable knowledge layer. Some research has found that 97% of AI organizations depend on real-time web data infrastructure, while 90% feel boxed in by various restrictions.
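One simple way to picture that integration step is a merge that keeps the most recently observed value for each entity, whatever its origin. The sketch below is an assumption-laden example, not a description of any vendor's pipeline: the `Record` type, the source labels, and the "newest wins" rule are all chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Record:
    """Hypothetical normalized record shared by all source types."""
    key: str
    value: str
    source: str  # e.g. "web", "api", "licensed", "internal"
    observed_at: datetime


def merge_sources(*feeds):
    """Keep the most recently observed record for each key."""
    latest = {}
    for feed in feeds:
        for rec in feed:
            current = latest.get(rec.key)
            if current is None or rec.observed_at > current.observed_at:
                latest[rec.key] = rec
    return latest


def at(hour):
    return datetime(2026, 6, 24, hour, tzinfo=timezone.utc)


web_feed = [Record("sku-1", "19.00", "web", at(12))]
api_feed = [Record("sku-1", "18.00", "api", at(9))]
internal_feed = [Record("sku-2", "in stock", "internal", at(8))]

merged = merge_sources(web_feed, api_feed, internal_feed)
```

Here the noon web observation for `sku-1` replaces the 9:00 API value, and `sku-2` comes through from internal data. Recency is only one possible policy; an organization might instead rank licensed or internal sources above public web data for certain fields.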

Latency is another pressure point. If the end user is waiting for an AI output, retrieval has to happen quickly enough to support the experience. As Lenchner says, “You need to retrieve data at scale, but also in real time. Latency becomes an issue because of the end user who is waiting for the output.”
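The latency constraint can be sketched as a retrieval call with a time budget. In the example below, `fetch_live` is a stand-in that simulates a slow or fast fetch with `asyncio.sleep`. The budget values and the fallback to a cached answer are assumptions for illustration, not something the article prescribes.

```python
import asyncio


async def fetch_live(query, delay_s):
    """Stand-in for a live web fetch; delay_s simulates network time."""
    await asyncio.sleep(delay_s)
    return f"live result for {query!r}"


async def retrieve_with_budget(query, delay_s, budget_s, cached):
    """Return (answer, origin); fall back to cache if the budget is exceeded."""
    try:
        live = await asyncio.wait_for(fetch_live(query, delay_s), timeout=budget_s)
        return live, "live"
    except asyncio.TimeoutError:
        # The live fetch was cancelled by wait_for; serve the older answer.
        return cached, "cache"


fast = asyncio.run(retrieve_with_budget("price", 0.01, 0.5, "cached price"))
slow = asyncio.run(retrieve_with_budget("price", 0.2, 0.01, "cached price"))
```

Returning the origin alongside the answer lets the calling application decide whether to flag a response as possibly stale, which ties the latency trade-off back to the trust concerns raised earlier.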

Governance Becomes Part of the Stack

Continuous retrieval also raises governance questions. The source describes platforms that can enforce strict compliance protocols aligned with global privacy frameworks, including the EU’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA).

These systems can also be limited to openly accessible, public information, avoiding paywalls or private logins. Networks used for access can be vetted and consent-based, with incentives provided to owners of IP addresses.
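A public-only rule like this can be expressed as a small policy gate that runs before any page is retrieved. The sketch below is hypothetical: the `Candidate` fields and the reason strings are invented for the example, and it does not attempt to model GDPR or CCPA obligations, which involve far more than access checks.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical description of a page being considered for retrieval."""
    url: str
    requires_login: bool
    behind_paywall: bool


def policy_decision(page):
    """Return (allowed, reason) for a single candidate page."""
    if page.requires_login:
        return False, "requires login"
    if page.behind_paywall:
        return False, "behind paywall"
    return True, "public"


def filter_public(pages):
    """Split candidates into allowed pages and an audit log of every decision."""
    allowed, audit_log = [], []
    for page in pages:
        ok, reason = policy_decision(page)
        audit_log.append((page.url, ok, reason))
        if ok:
            allowed.append(page)
    return allowed, audit_log


candidates = [
    Candidate("https://example.com/products", False, False),
    Candidate("https://example.com/account", True, False),
    Candidate("https://example.com/premium-report", False, True),
]
allowed, audit_log = filter_public(candidates)
```

Recording a reason for every decision, including the allowed ones, gives compliance teams a trail to review, which fits the article's point that governance is becoming part of the stack.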

That governance layer matters because web data retrieval is becoming critical infrastructure for some companies. Lenchner argues that building it internally can become a full-time engineering problem that competes with the actual AI work.

As a result, organizations may look to specialized platforms for data retrieval, orchestration, and observability. The goal is not just to collect more information, but to make that information usable, timely, compliant, and relevant to the AI system’s task.

AI Systems Need Knowledge as Much as Intelligence

The emerging web data infrastructure layer points to a broader shift in how enterprises think about AI. A powerful model is not enough if the information feeding it is outdated, incomplete, or poorly organized.

Lenchner frames the relationship between models and data as intelligence and knowledge:

“Think of the trained model as intelligence and relevant data as knowledge. A powerful intelligence layer sitting on top of a hollow knowledge layer is like a genius who knows nothing—useless in practice. Intelligence and knowledge have to come together.”

Practical examples are already visible in the source article. A retail company can use public information to support a dynamic pricing engine, while global brands can track trademark infringements.

As organizations invest in this layer, AI systems may become more responsive, reliable, and aligned with real-world conditions. Over time, the distinction between AI models and the infrastructure that feeds them may begin to fade.