TechCrunch AI July 29, 2024

Meta brings Segment Anything 2 from images into video

Meta introduced Segment Anything 2 at SIGGRAPH, extending its image segmentation work into video. The model will be open and free to use, with a free demo and a public annotated database of 50,000 videos.

Extending segmentation from images to video modestly increases AI capability with possible surveillance and tracking uses, though the story is mainly a routine model release.

Meta is moving one of its best-known vision AI tools from still images into video. At SIGGRAPH on Monday, CEO Mark Zuckerberg showed Segment Anything 2, a follow-up to Segment Anything that is designed to identify and outline objects across video rather than only in individual pictures.

The shift matters because video adds a much harder layer of work. A model is no longer looking at a single frame in isolation. It has to deal with a stream of visual information, while still giving users a fast way to point at what they want identified.

Why segmentation became a key vision AI task

Segmentation is the computer vision process of separating the meaningful parts of an image. In plain terms, it is how a model decides that one part of a picture is a dog, another part is a tree, and where the boundary between them lies.
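To make the idea concrete, here is a minimal sketch of what a segmentation result looks like as data. It assumes a tiny hand-written label map rather than real model output; the grid, the labels, and the helper name boundary_pixels are all hypothetical and only illustrate the dog-and-tree example.

```python
# Toy label map: each character is the class assigned to one pixel.
# "D" = dog, "T" = tree, "." = background. All values are illustrative.
LABELS = [
    "..TT..",
    ".DTT..",
    "DDDT..",
    "DDD...",
]


def boundary_pixels(labels, a, b):
    """Return (row, col) cells labeled `a` that directly touch a cell labeled `b`."""
    rows, cols = len(labels), len(labels[0])
    found = []
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] != a:
                continue
            # Check the four direct neighbors (down, up, right, left).
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and labels[nr][nc] == b:
                    found.append((r, c))
                    break
    return found


# Dog pixels that sit on the dog/tree boundary.
dog_edge = boundary_pixels(LABELS, "D", "T")
print(dog_edge)
```

A real model produces the label map itself from raw pixels; the sketch only shows the kind of output that step yields and how a boundary can be read off it.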

This problem is not new. The source article notes that segmentation has been worked on for decades. What has changed recently is the speed and reliability of the models, with Meta’s Segment Anything described as a major step forward.

Segment Anything stood out because it could quickly and reliably identify and outline a wide range of objects in an image. That made it useful beyond a narrow set of preselected categories. Instead of needing a system built for only one kind of visual task, users could apply the model more flexibly.

That flexibility is central to why Segment Anything 2 is notable. Meta is not simply improving a still-image model. It is taking the same basic promise and applying it to a more demanding medium.

What Segment Anything 2 changes

Segment Anything 2, also referred to as SA2 in the source article, applies natively to video. That is different from running the original Segment Anything model separately on every frame of a clip.

Running an image model frame by frame may be possible, but it is not the most efficient workflow. Video creates extra computational pressure, and the article frames SA2 as evidence that the field is moving quickly enough to make fast video segmentation practical.
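The difference between the two workflows can be sketched with a toy example. This is not SA2's architecture or API, which the source article does not describe; it is a simplified stand-in in which "segmentation" is a flood fill on a character grid and "memory" is the previous frame's mask. The names component, all_components, track, and VIDEO are hypothetical.

```python
def component(frame, seed):
    """Flood-fill the 4-connected region of '#' cells that contains `seed`."""
    rows, cols = len(frame), len(frame[0])
    if frame[seed[0]][seed[1]] != "#":
        return set()  # the click landed on background
    region, stack = set(), [seed]
    while stack:
        r, c = stack.pop()
        if (r, c) in region:
            continue
        region.add((r, c))
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and frame[nr][nc] == "#":
                stack.append((nr, nc))
    return region


def all_components(frame):
    """Split a frame into its separate '#' regions."""
    seen, parts = set(), []
    for r, row in enumerate(frame):
        for c, cell in enumerate(row):
            if cell == "#" and (r, c) not in seen:
                part = component(frame, (r, c))
                seen |= part
                parts.append(part)
    return parts


def track(frames, click):
    """One click on frame 0; each later frame reuses the previous mask."""
    mask = component(frames[0], click)
    masks = [mask]
    for frame in frames[1:]:
        # Keep the region that overlaps most with the previous mask.
        best = max(all_components(frame), key=lambda part: len(part & mask), default=set())
        mask = best if best & mask else set()
        masks.append(mask)
    return masks


# Three frames: a 2x2 block drifts right while a column on the far right stays put.
VIDEO = [
    ["##...#", "##...#", "......"],
    [".##..#", ".##..#", "......"],
    ["..##.#", "..##.#", "......"],
]

masks = track(VIDEO, (0, 0))
print([sorted(m) for m in masks])

# Frame by frame, the same click no longer lands on the object in frame 1.
print(component(VIDEO[1], (0, 0)))
```

In the frame-by-frame version, every frame needs its own prompt because nothing links one frame to the next. In the tracked version, a single prompt is enough because each frame starts from what was found in the previous one.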

Zuckerberg described possible scientific use cases in his conversation with Nvidia CEO Jensen Huang, including studying coral reefs and natural habitats. The important point is not only that the model can process video. It is that users can tell it what they want, and the system can work in a zero-shot setting, that is, without additional training for the specific task.

That combination makes the tool relevant for people who need to inspect changing visual scenes. A still image can show one moment. A video can show motion, continuity, and context. Bringing segmentation into that format expands what a vision model can help analyze.

Open access, but serious hardware

Meta plans to make Segment Anything 2 open and free to use, as it did with the first Segment Anything model. The source article says there is no word of a hosted version, though a free demo is available.

That open release fits a broader pattern at Meta. The company has released tools and models such as PyTorch, Llama, Segment Anything, and other AI systems freely, making them part of the broader AI ecosystem.

Still, open access does not mean the model is lightweight. The article makes clear that SA2 remains a large model that needs serious hardware. Video processing is more computationally demanding than still-image processing, and the progress here depends partly on efficiency gains across the industry.
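A back-of-the-envelope calculation gives a sense of the scale involved. The clip length and frame rate below are hypothetical, chosen only for illustration, and the sketch says nothing about SA2's actual per-frame cost, which the source article does not report.

```python
def frames_in_clip(duration_s: float, fps: float) -> int:
    """Number of frames in a clip of the given length and frame rate."""
    return round(duration_s * fps)


# Hypothetical clip: one minute of video at 30 frames per second.
n_frames = frames_in_clip(60, 30)
print(n_frames)  # 1800 frames, versus 1 for a still image
```

Even a short clip contains far more frames than a single photo, which is why per-frame efficiency matters so much for video.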

That is an important distinction for builders. The model may be free to use, but practical use still depends on the ability to run it. For research groups, developers, and companies, the hardware requirement may shape how quickly SA2 can move from demo to production workflows.

The training data behind the model

Meta is also releasing a large annotated database of 50,000 videos created for this purpose. That dataset is one of the most concrete parts of the announcement, because video segmentation models need extensive labeled material to learn from.

According to the paper describing SA2, a second database of over 100,000 "internally available" videos was also used for training. That second database is not being released publicly, according to the source article.

This creates a familiar tension in open AI releases. Meta is making the model available and releasing a substantial public dataset, but not every part of the training pipeline is being opened. The source article does not explain why the internally available video database is being kept private.

For the broader AI community, the 50,000-video database may still become a useful resource. Annotated video data is valuable because it gives researchers and builders material for testing, comparing, and improving video vision systems.

Why Meta’s strategy matters

Zuckerberg framed openness as both helpful to the ecosystem and useful for Meta’s own goals. In the source article, he said the company is not releasing these systems purely out of altruism, but because an ecosystem around the technology can make what Meta is building stronger.

That logic is important. AI models are not just standalone files. They become more useful when researchers test them, developers build around them, and users expose their limits. Open releases can create feedback, adoption, and complementary tools that closed releases may not generate as quickly.

Segment Anything 2 therefore sits at the intersection of two Meta priorities described in the source article: stronger vision AI and a continued push into open AI releases. It also shows how quickly computer vision is moving from image understanding toward more flexible video understanding.

The result is a model that could be widely used, especially by people who need to identify and follow objects in video without building a narrowly specialized system first. Its real impact will depend on how easily researchers and builders can run it, test it, and adapt it to their own work.