Artificial intelligence systems are getting better at recognizing what is in a video. The harder problem is understanding what should happen next. Meta’s Video Joint Embedding Predictive Architecture, or V-JEPA, points to one possible path: learning from video in a way that begins to resemble physical intuition.
The source article compares this to a familiar developmental test. Show an infant a glass of water, then hide it behind a board. If the board appears to pass through the glass as though nothing were there, many 6-month-olds are surprised, and by a year almost all children have an intuitive notion of object permanence. V-JEPA is not a child, but it produces a related computational signal when events in videos violate what it has learned to expect.
Why video understanding is harder than recognition
Many AI systems built for video are trained to classify what is happening or identify the outline of an object. They may label a clip as showing “a person playing tennis,” or detect something like a car ahead. The source article explains that these systems often operate in “pixel space,” meaning the model treats individual pixels as the basic material to reason over.
That approach can be useful, but it has a practical weakness. In a video of a suburban street, some details matter more than others. The position of nearby cars and the color of a traffic light can be important, while the motion of leaves on trees may distract the system from what matters.
Randall Balestriero, a computer scientist at Brown University, summarized the issue directly: “When you go to images or video, you don’t want to work in [pixel] space because there are too many details you don’t want to model.”
V-JEPA was designed to avoid spending its effort on every visible detail. Instead of trying to rebuild missing pixels, it works with higher-level descriptions of what is happening in the video.
What V-JEPA predicts instead of pixels
In 2022, Yann LeCun, a computer scientist at New York University and the director of AI research at Meta, created JEPA, a predecessor to V-JEPA that works on still images. The V-JEPA architecture, released in 2024, extends the idea to video.
The key move is to use latent representations. These are compact descriptions that preserve essential information while leaving out unnecessary detail. The source article gives the example of line drawings of cylinders: an encoder can convert each image into numbers representing core properties such as height, width, orientation and location. A decoder can then use those essential details to recreate an image.
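As a rough illustration of that idea, the sketch below hand-writes a toy encoder and decoder in Python. It is a stand-in built on assumptions, not how V-JEPA's learned encoders work: the "drawing" is a filled rectangle on a small grid (roughly a cylinder seen side-on), the latent is just four numbers, and orientation is left out. The names draw, encode and decode are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]              # grid of 0/1 pixels
Latent = Tuple[int, int, int, int]   # (top, left, height, width)


def draw(latent: Latent, size: int = 12) -> Image:
    """Render a filled rectangle described by the latent onto a size x size grid."""
    top, left, height, width = latent
    return [
        [1 if top <= r < top + height and left <= c < left + width else 0
         for c in range(size)]
        for r in range(size)
    ]


def encode(image: Image) -> Latent:
    """Toy encoder: keep only the shape's location and extent.

    Assumes the image contains exactly one filled rectangle.
    """
    rows = [r for r, row in enumerate(image) if any(row)]
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    top, left = rows[0], cols[0]
    return (top, left, rows[-1] - top + 1, cols[-1] - left + 1)


def decode(latent: Latent, size: int = 12) -> Image:
    """Toy decoder: redraw the image from the latent alone."""
    return draw(latent, size)


image = draw((2, 3, 5, 4))
latent = encode(image)                          # four numbers
n_pixels = sum(len(row) for row in image)       # 12 x 12 pixel values
reconstruction = decode(latent)
```

In this toy case the four-number latent is enough to rebuild all the pixel values exactly, because the drawing contains nothing except the shape. Real video has far more incidental detail, which is what a learned latent representation is meant to leave out.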
V-JEPA applies this idea to video. During training, parts of video frames are masked. In some cases, the final few frames are fully masked. One encoder processes masked frames and turns them into latent representations. A second encoder processes the full unmasked frames and turns those into another set of latent representations.
Then the predictor has one job: use the representations from the masked input to predict the representations from the complete video. This means the system is not asked to guess the exact color and value of every missing pixel. It is asked to infer the more important structure of the scene.
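The shape of that objective can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Meta's implementation: the encoders and predictor are random linear maps rather than trained networks, masking is simplified to hiding whole frames, the visible frames are mean-pooled into one summary, and the distance is a mean absolute difference. All function and variable names are hypothetical.

```python
import random
from typing import List, Set

Vector = List[float]
Matrix = List[Vector]


def random_matrix(n_out: int, n_in: int, rng: random.Random) -> Matrix:
    """Random weights standing in for a trained network (hypothetical)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


def apply(weights: Matrix, x: Vector) -> Vector:
    """Linear layer: one output value per row of weights."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def latent_prediction_loss(video: List[Vector], masked: Set[int],
                           ctx_w: Matrix, tgt_w: Matrix, pred_w: Matrix) -> float:
    # Context branch: encode only the frames that were left visible.
    context = [apply(ctx_w, f) for t, f in enumerate(video) if t not in masked]
    # Pool the visible latents into one summary vector (a simplification).
    summary = [sum(col) / len(col) for col in zip(*context)]
    total, count = 0.0, 0
    for t in sorted(masked):
        # Target branch: encode the true, unmasked frame.
        target = apply(tgt_w, video[t])
        # Predictor: guess that latent from the summary plus the frame position.
        predicted = apply(pred_w, summary + [t / len(video)])
        # The comparison happens between latents, never between pixels.
        total += sum(abs(p - q) for p, q in zip(predicted, target))
        count += len(target)
    return total / count


rng = random.Random(0)
n_pixels, n_latent, n_frames = 16, 4, 6
video = [[rng.random() for _ in range(n_pixels)] for _ in range(n_frames)]
masked = {4, 5}  # hide the final two frames
ctx_w = random_matrix(n_latent, n_pixels, rng)
tgt_w = random_matrix(n_latent, n_pixels, rng)
pred_w = random_matrix(n_latent, n_latent + 1, rng)
loss = latent_prediction_loss(video, masked, ctx_w, tgt_w, pred_w)
```

The sketch only computes the loss once. Training would repeatedly adjust the weights to shrink it, and the point to notice is that each masked frame contributes four latent values to the error rather than sixteen pixel values.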
Quentin Garrido, a research scientist at Meta, described the purpose this way: “This enables the model to discard unnecessary … information and focus on more important aspects of the video.” He added, “Discarding unnecessary information is very important and something that V-JEPA aims at doing efficiently.”
How the model is adapted after pretraining
V-JEPA’s initial training is not the end of the process. Once the model has learned to produce and predict these representations, it can be adapted for practical tasks such as classifying images or identifying actions in videos.
That adaptation still requires some human-labeled data. For example, videos need to be tagged with the actions they contain. But the source article says this phase uses much less labeled data than would be needed if the entire system were trained end to end for those final tasks.
Another important implication is reuse. The same encoder and predictor networks can be adapted for different tasks. In plain terms, the model learns a general way to make sense of video first, then that capability can be directed toward more specific goals.
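One way to picture this adaptation step is a frozen encoder with a small task-specific head fitted on a handful of labeled clips. The Python sketch below assumes exactly that setup and swaps in deliberately simple parts: frozen_encoder is a hand-written, hypothetical stand-in for a pretrained network, and the head is a nearest-centroid classifier chosen for brevity, not the method the V-JEPA team used.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Clip = Sequence[Sequence[float]]   # frames, each a flat list of pixel values
Vector = List[float]


def frozen_encoder(clip: Clip) -> Vector:
    """Hypothetical stand-in for a pretrained encoder; it is never updated.

    Returns two features: average brightness and average frame-to-frame change.
    """
    n_pixels = len(clip[0])
    brightness = sum(sum(frame) for frame in clip) / (len(clip) * n_pixels)
    change = sum(
        abs(a - b)
        for prev, nxt in zip(clip, clip[1:])
        for a, b in zip(prev, nxt)
    ) / ((len(clip) - 1) * n_pixels)
    return [brightness, change]


def fit_centroids(labeled: Sequence[Tuple[Clip, str]]) -> Dict[str, Vector]:
    """Fit the small head: one average feature vector per label."""
    groups: Dict[str, List[Vector]] = defaultdict(list)
    for clip, label in labeled:
        groups[label].append(frozen_encoder(clip))
    return {
        label: [sum(col) / len(col) for col in zip(*feats)]
        for label, feats in groups.items()
    }


def classify(clip: Clip, centroids: Dict[str, Vector]) -> str:
    """Assign the label whose centroid is closest in feature space."""
    z = frozen_encoder(clip)
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(z, centroids[label])),
    )


# Two labeled clips are enough for this toy task (made-up data).
still = [[0.5, 0.5, 0.5]] * 4
moving = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
centroids = fit_centroids([(still, "still"), (moving, "moving")])
label = classify([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], centroids)
```

Only fit_centroids ever sees labels, which is why the labeled set can stay small. Pointing the same frozen encoder at a different task would mean fitting a new head, not retraining the encoder.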
The surprise test for intuitive physics
In February, the V-JEPA team reported results on tests of intuitive physical properties of the real world. These included object permanence, the constancy of shape and color, and the effects of gravity and collisions.
One benchmark was IntPhys, which asks AI models to identify whether actions in a video are physically plausible or implausible. On that test, V-JEPA was nearly 98 percent accurate. A well-known model that predicts in pixel space was only a little better than chance.
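A plausible-versus-implausible benchmark can be scored in several ways. The Python sketch below shows one simple option and rests on assumptions the source article does not spell out: that videos come in matched pairs, and that the model assigns each video a single score for how unexpected it found the clip. The scores are made-up numbers for illustration, not reported results.

```python
from typing import Sequence, Tuple


def pairwise_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs in which the implausible video received the higher score.

    Each pair is (score_for_plausible_video, score_for_implausible_video).
    Ties count as errors.
    """
    correct = sum(1 for plausible, implausible in pairs if implausible > plausible)
    return correct / len(pairs)


# Hypothetical scores for four matched pairs of videos.
example_pairs = [(0.1, 0.9), (0.2, 0.8), (0.5, 0.4), (0.3, 0.7)]
accuracy = pairwise_accuracy(example_pairs)
```

Under this scoring, a model that scores videos at random would be right about half the time, which is the sense in which a result can be "a little better than chance."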
The team also tried to measure the model’s “surprise.” They showed new videos to a V-JEPA model that had been pretrained on natural videos, then mathematically calculated the gap between what the model expected in future frames and what actually happened. When future frames showed physically impossible events, the prediction error rose sharply.
The source article gives a simple example. If a ball rolled behind an occluding object and disappeared, the model registered a prediction error when the ball failed to reappear later. That response is not human awareness, but it is a useful signal: the model had formed an expectation, and the video violated it.
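The measurement itself can be sketched as a per-frame gap between a predicted latent and the latent actually observed. The Python example below is a toy under stated assumptions: each latent is a single number standing for an object's position, and the hypothetical predict_next function is a constant-velocity rule standing in for the learned predictor. The violation here is an object that suddenly jumps backward, a simpler case than the occluded ball.

```python
from typing import List

Vector = List[float]


def predict_next(prev2: Vector, prev1: Vector) -> Vector:
    """Stand-in predictor (an assumption): the object keeps its current velocity."""
    return [2.0 * b - a for a, b in zip(prev2, prev1)]


def surprise_curve(latents: List[Vector]) -> List[float]:
    """Mean absolute gap between predicted and observed latents, frame by frame.

    The first two frames only seed the predictor, so the curve starts at frame 2.
    """
    curve = []
    for t in range(2, len(latents)):
        predicted = predict_next(latents[t - 2], latents[t - 1])
        gap = sum(abs(p - o) for p, o in zip(predicted, latents[t]))
        curve.append(gap / len(predicted))
    return curve


# Plausible clip: the object moves one unit per frame.
plausible = [[float(x)] for x in range(8)]
# Implausible clip: at frame 5 the object jumps back to where it started.
implausible = [[0.0], [1.0], [2.0], [3.0], [4.0], [0.0], [1.0], [2.0]]

plausible_surprise = surprise_curve(plausible)
implausible_surprise = surprise_curve(implausible)
```

The plausible clip yields a flat curve of zeros, while the implausible clip spikes at the jump and again one frame later, before the predictor's two-frame history has caught up. A threshold on such a curve, or a comparison between two clips, is what turns raw prediction error into a "surprise" signal.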
What researchers see in the result
Micha Heilbron, a cognitive scientist at the University of Amsterdam who studies how brains and artificial systems make sense of the world, called the claims plausible and the results “super interesting.” He was especially struck by the idea that this kind of intuitive physics can be learned through exposure rather than built in from the start.
Heilbron said, “We know from developmental literature that babies don’t need a lot of exposure to learn these types of intuitive physics.” He added, “It’s compelling that they show that it’s learnable in the first place, and you don’t have to come with all these innate priors.”
Karl Friston, a computational neuroscientist at University College London, also saw the work as moving in the right direction, particularly in mimicking the “way our brains learn and model the world.” But he also identified a gap: “What is missing from [the] current proposal is a proper encoding of uncertainty.”
That matters because future frames are not always predictable from past frames. If the available information is not enough to know what comes next, the model should be able to represent that uncertainty. According to the source article, V-JEPA does not quantify this uncertainty.
The broader importance is clear from the source article’s framing: autonomous robots need something like physical intuition to plan their movements and interact with the physical environment. V-JEPA does not solve every part of that problem. But by learning from video, filtering out unnecessary detail, and reacting when physical expectations are broken, it shows how an AI system can begin to model the world in a more useful way.