The Decoder, September 16, 2024

Why World Labs is betting AI needs true 3D understanding

Fei-Fei Li has founded World Labs, a San Francisco-based startup building AI models that can understand the three-dimensional world. The company raised $230 million and plans to train large world models focused on spatial intelligence.

World Labs is entering the AI race with a specific premise: today’s systems can generate impressive images, videos and language, but they still lack a deeper grasp of how the physical world is built and how objects relate in three-dimensional space.

The San Francisco-based startup was founded by Fei-Fei Li, a prominent AI researcher known as the "godmother of AI." Its goal is to develop AI with spatial intelligence, a capability Li sees as central to stronger machine reasoning.

A new AI startup built around spatial intelligence

World Labs has raised $230 million in its initial funding round. The investor list includes venture capital firms Andreessen Horowitz, New Enterprise Associates, and Radical Ventures. Technology companies AMD, Intel, and Nvidia also participated through their investment divisions.

The company currently has 20 employees and is focused on developing models that can understand the three-dimensional world. That emphasis sets it apart from AI systems that mainly produce text, images or video without necessarily modeling the structure behind what they show.

Li explained the gap this way to Reuters: "The images and videos that you have seen so far coming out of generative AI models do not give you enough of the whole sense of how a 3D world is built."

That distinction matters because a flat image can contain visual information without giving an AI system a practical understanding of geometry, placement, physical relationships or likely outcomes. Spatial intelligence is about moving from recognition toward reasoning about the world as a place where objects occupy positions and actions have consequences.

What large world models are meant to do

World Labs plans to train "large world models," or LWMs. These models will build on the same Transformer architecture that powers OpenAI's ChatGPT chatbot, though Li stressed that the Transformer will not be the sole foundation of the company’s models.

The phrase large world models points to a broader ambition than generating a response or producing an image. The aim is to build systems that can reason about how the three-dimensional physical world works.

In plain terms, that means an AI system would need to interpret more than visible surfaces. It would need to understand where things are, how they relate to one another, what might happen next, and what kind of action would make sense in response.

The example offered is simple but revealing: a picture of a cat pushing a glass to the edge of a table. A person can quickly understand the glass, its geometry, its location in 3D space, its relationship to the table and the cat, and the likely outcome. The human brain can also connect that understanding to action.

For AI, that chain of understanding is much harder. Recognizing a cat, a glass and a table is only one part of the task. Reasoning about the arrangement and likely physical result requires a richer model of the scene.
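To make the difference between recognition and spatial reasoning concrete, here is a minimal sketch in Python. It is an illustration only, not anything World Labs has described: the `Box` class, the one-dimensional geometry, the coordinates and the "center over the surface" stability rule are all simplifying assumptions. A recognizer that outputs only the labels "cat," "glass" and "table" has nothing to compute with, while even a crude geometric description of the scene supports a prediction.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical 1D footprint of an object along one horizontal axis (meters)."""
    name: str
    x_min: float
    x_max: float

    @property
    def center_x(self) -> float:
        return (self.x_min + self.x_max) / 2.0


def is_supported(obj: Box, surface: Box) -> bool:
    """Toy stability rule (an assumption): the object stays put
    while its center lies over the supporting surface."""
    return surface.x_min <= obj.center_x <= surface.x_max


def predict_after_push(obj: Box, surface: Box, push_dx: float) -> str:
    """Shift the object by push_dx and report whether it is still supported."""
    moved = Box(obj.name, obj.x_min + push_dx, obj.x_max + push_dx)
    return "stays on table" if is_supported(moved, surface) else "falls off table"


# Made-up scene: a 1 m wide table with a glass near its right edge.
table = Box("table", 0.0, 1.0)
glass = Box("glass", 0.90, 0.98)

print(predict_after_push(glass, table, 0.02))  # small nudge from the cat
print(predict_after_push(glass, table, 0.10))  # larger push
```

With these made-up numbers, the small nudge leaves the center of the glass over the table and the larger push carries it past the edge. A real system would have to infer the geometry from pixels and handle full 3D physics, which is the hard part the article describes.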

The "seeing and doing" cycle

Li describes spatial intelligence as part of a "virtuous cycle of seeing and doing." The idea is that perception and action are linked. A system that understands the world better can act more effectively, and acting in the world in turn deepens what the system understands.

This is also connected to work at Li’s lab at Stanford University. The lab is working on teaching computers to act in the 3D world. One example involves using a large language model to instruct a robotic arm to carry out tasks from verbal instructions.

The tasks cited are opening a door and making a sandwich. Those examples show why spatial intelligence is more than a visual feature. A robotic arm needs to connect language, perception and movement in a world where objects have shape, position and physical constraints.

For an AI system, understanding an instruction is only the beginning. To act, it has to interpret the surrounding space, identify the relevant objects, reason about their relationships and carry out a sequence of movements that fits the situation.

That is the broader reasoning challenge World Labs is targeting. The company’s work is not described as merely improving image or video generation. It is aimed at giving machines a more useful internal understanding of the world those images and videos represent.

Fei-Fei Li’s role in the next stage of AI

Li’s background gives World Labs unusual visibility. She developed ImageNet, a comprehensive image dataset that enabled a new generation of computer vision technologies. From 2017 to 2018, she led the AI department at Google Cloud.

In addition to her work at World Labs, Li will continue her role at the Human-Centered AI Institute at Stanford University. The institute focuses on developing AI technologies that improve the human condition.

That dual role is notable because World Labs is pursuing a technical goal with broad implications for how AI systems perceive and act. Spatial intelligence could influence how future AI systems reason about scenes, follow instructions and interact with physical environments.

For now, the company’s public direction is clear: build large world models that go beyond language and surface-level generation, and move AI closer to understanding the three-dimensional world in which people live and act.