The Decoder July 30, 2024 TERMINATOR

Why SAM 2 raises the stakes for open-source computer vision

Meta has released SAM 2, an open-source foundation model for segmenting images and videos. The model builds on SAM with video training, a memory module, and the new SA-V dataset, while still facing limits around scene cuts, long occlusions, fine details, and crowded moving objects.

WTF Index TERMINATOR

Terminator 2, Idiocracy 0

Open video segmentation strengthens computer vision capabilities that could support tracking and surveillance, though the story is mainly a routine model release.

Meta has moved its Segment Anything work from still images into video with SAM 2, a new foundation model for image and video segmentation. The company is releasing the model, code, and dataset openly, giving machine vision researchers and builders a broader base for experiments that depend on identifying objects across frames.

The release matters because segmentation is one of the basic tasks behind computer vision. A system that can mark out an object in an image or video can support editing tools, robotics research, and generative AI video effects. SAM 2 aims to make that process more accurate, faster, and easier to guide with fewer user interactions.

From SAM to SAM 2

Meta introduced its original Segment Anything Model, or SAM, in April 2023. That model focused on image segmentation and was described as the "GPT-3 moment" for computer vision because of its quality gains.

SAM 2 is the follow-up. The important shift is that the new model was trained on video data and can handle both images and video. SAM was trained on 11 million images and focused on still-image segmentation; SAM 2 is designed to follow objects across video frames as well as segment still images.

According to Meta, the video capabilities are meant to hold up even when video quality is lower or when objects are partly hidden. That matters because real video is rarely clean. Objects pass behind other objects, lighting and clarity can change, and the thing being tracked may only be partly visible at a given moment.

In the example given, a boy stands behind a tree: the model tracks only the visible part of him rather than treating the hidden portion as if it were available. That kind of occlusion has been a difficult machine vision problem, and SAM 2 is presented as a major step toward handling it.

The SA-V dataset behind the model

SAM 2 was trained on SA-V, short for Segment Anything Video. Meta describes SA-V as the largest publicly available video segmentation dataset to date.

The scale is central to the release. SA-V contains 50,900 videos and 642,600 mask annotations, each of which follows one object across the frames of a video. Counted frame by frame, that adds up to 35.5 million individual masks, which Meta says is 53 times more than previous datasets. The dataset also includes nearly 200 hours of annotated video.
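
A quick back-of-the-envelope calculation helps put those figures in proportion. The sketch below uses only the numbers quoted above. The 200-hour total is rounded up from "nearly 200 hours," so the per-video duration is an upper estimate, and the variable names are purely illustrative.

```python
# Figures as reported for SA-V.
videos = 50_900
mask_annotations = 642_600
individual_masks = 35_500_000
annotated_hours = 200  # "nearly 200 hours", so treat this as an upper bound

# Simple ratios derived from the reported totals.
annotations_per_video = mask_annotations / videos
masks_per_annotation = individual_masks / mask_annotations
seconds_per_video = annotated_hours * 3600 / videos

print(f"{annotations_per_video:.1f} mask annotations per video")
print(f"{masks_per_annotation:.1f} individual masks per annotation")
print(f"at most about {seconds_per_video:.0f} seconds of footage per video")
```

On these averages, a typical clip is short, carries roughly a dozen mask annotations, and each annotation spans dozens of individual masks.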

Those figures explain why the dataset is part of the story rather than a background detail. Video segmentation models need examples of objects changing position, being covered, reappearing, and moving through different contexts. A large annotated video dataset gives the model more material for learning those patterns.

Meta used a human annotation workflow called a Data Engine to create the dataset. The two SAM systems helped in that process. Human annotators used SAM 2 interactively to label video segments, and the resulting data was then used to improve SAM 2.

Meta says this process can label videos up to 8.4 times faster than other systems because of the "SAM model in the loop". In practical terms, the model helps the human annotators work faster, while the human-reviewed data helps the model improve.

How SAM 2 follows objects through video

SAM 2 builds on the Transformer-based architecture of the earlier SAM model. The key new component is a memory module. That memory stores information about objects and previous interactions across video frames.

This design lets SAM 2 keep track of objects over longer video sequences and respond to user input as the video progresses. Instead of treating every frame as an isolated image, the model can use earlier context to inform what it sees later.

When SAM 2 is used on still images, the memory is empty. In that setting, the model behaves like SAM. That keeps the new model connected to the original image segmentation use case while extending it into video.
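
The role of the memory can be illustrated with a deliberately simplified toy. The sketch below is not SAM 2's architecture or API: the class `ToyMemorySegmenter`, its integer-label "frames," and the rule of matching pixels by value are all assumptions made for illustration. It only shows the idea described above: a prompt identifies the object once, a bounded memory carries that information forward, and with an empty memory the segmenter depends entirely on the prompt, as it would for a single image.

```python
from collections import deque


class ToyMemorySegmenter:
    """Toy stand-in for a memory-conditioned segmenter (hypothetical, not SAM 2)."""

    def __init__(self, max_memory=4):
        # Bounded FIFO: only the most recent object summaries are kept.
        self.memory = deque(maxlen=max_memory)

    def segment(self, frame, click=None):
        """Return a binary mask for `frame`, a list of rows of integer labels.

        click: optional (row, col) prompt. Without a click, the target is
        taken from memory, which is how earlier frames inform later ones.
        """
        if click is not None:
            row, col = click
            target = frame[row][col]
        elif self.memory:
            target = self.memory[-1]
        else:
            raise ValueError("no prompt and empty memory: nothing to segment")
        mask = [[1 if value == target else 0 for value in row] for row in frame]
        # Only remember the object when it was actually visible in this frame.
        if any(any(row) for row in mask):
            self.memory.append(target)
        return mask


# A two-row "video": object 7 is clicked once, moves, is hidden, then reappears.
frames = [
    [[0, 7, 0], [0, 7, 0]],
    [[7, 0, 0], [7, 0, 0]],
    [[0, 0, 0], [0, 0, 0]],  # object fully hidden in this frame
    [[0, 0, 7], [0, 0, 7]],
]
segmenter = ToyMemorySegmenter()
masks = [segmenter.segment(frames[0], click=(0, 1))]
masks += [segmenter.segment(frame) for frame in frames[1:]]
```

In this toy, the hidden frame yields an empty mask, and the remembered target lets the object be picked up again when it reappears. A real model has to do the same with learned features rather than exact pixel values, which is where the difficulty lies.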

Meta reports that SAM 2 achieved better segmentation accuracy with three times fewer interactions than previous approaches. The company also says it outperforms the current state of the art on established video object segmentation benchmarks.

The model also improves on the original SAM in image segmentation. According to Meta, SAM 2 produced better image segmentation results than SAM at six times the speed. Its inference speed is 44 frames per second, which puts it near real-time performance.
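
What "near real-time" means can be made concrete by converting frame rates into per-frame time budgets. The sketch below is plain arithmetic on the reported 44 frames per second; the playback rates of 24, 30, and 60 fps are common values chosen here for illustration, and the reported speed will depend on the hardware Meta measured it on.

```python
def frame_time_ms(fps):
    """Time available (or needed) per frame at a given frame rate, in milliseconds."""
    return 1000.0 / fps


model_fps = 44  # inference speed reported by Meta
model_ms = frame_time_ms(model_fps)

for video_fps in (24, 30, 60):  # illustrative playback rates, not from the source
    budget_ms = frame_time_ms(video_fps)
    keeps_up = model_ms <= budget_ms
    print(f"{video_fps} fps video: budget {budget_ms:.1f} ms, "
          f"model {model_ms:.1f} ms, keeps up: {keeps_up}")
```

At the reported speed the model needs a little under 23 ms per frame, which fits within the budget of 24 or 30 fps footage but not 60 fps.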

Meta also says SAM 2 should perform consistently across skin tones and age groups, with minimal variation between genders. Those claims matter because segmentation systems can be used on people as well as objects, and performance differences across groups can affect reliability.

Where SAM 2 still falls short

SAM 2 is not presented as a finished answer to every segmentation problem. The source identifies several areas where the model can still fail or lose precision.

It can lose objects after scene cuts.
It can struggle after long occlusions.
It can have difficulty segmenting very fine details.
It can slip when tracking individual objects inside groups of similar moving objects.

The crowded-object problem is especially important for video. When many similar objects are moving at the same time, a segmentation system has to decide which object is which across frames, and appearance alone may not be enough to tell them apart.

The researchers suggest that explicit modeling of movement could help address this. That points to one likely direction for future work: not only recognizing shapes and object regions, but also representing how objects move through a scene.
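
A toy example shows why movement information can help when objects look alike. The sketch below is an assumption-laden illustration, not the researchers' proposal: positions are one-dimensional, the helper names are hypothetical, and the motion model is the simplest possible one (constant velocity). Two identical objects cross paths; matching by last known position picks the wrong one, while predicting the next position from recent movement picks the right one.

```python
def nearest(candidates, reference):
    """Return the candidate position closest to the reference position."""
    return min(candidates, key=lambda c: abs(c - reference))


def match_by_last_position(track, candidates):
    """Appearance-blind baseline: pick whatever is closest to where the object was."""
    return nearest(candidates, track[-1])


def match_by_constant_velocity(track, candidates):
    """Predict the next position from the last two, then pick the closest candidate."""
    velocity = track[-1] - track[-2]
    return nearest(candidates, track[-1] + velocity)


# Object A moves right (2 -> 4 -> 6); identical object B moves left (7 -> 5 -> 3).
track_a = [2, 4]
candidates = [6, 3]  # positions observed in the next frame: A is at 6, B is at 3

print(match_by_last_position(track_a, candidates))
print(match_by_constant_velocity(track_a, candidates))
```

Here the position-only match selects 3, which is object B, while the constant-velocity match selects 6, the true continuation of A. Real scenes involve two-dimensional masks, changing speeds, and occlusions, so this is only the intuition behind modeling movement explicitly.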

Why the open-source release matters

Meta is releasing the SAM 2 model, code, and weights as open source under the Apache 2.0 license. The SA-V dataset is being released under the CC BY 4.0 license. SAM 2 can also be tested in an interactive demo.

That combination makes the release more than a model announcement. Researchers can inspect and use the model, work with the dataset, and test the system directly. For a field that depends heavily on benchmarks, data, and reproducible comparison, the release gives the community concrete material to build on.

The researchers see SAM 2 as an important advance for machine vision. The source points to possible applications including robots that move and interact with the real world more reliably, as well as video effects in generative AI video models.

The larger point is simple: SAM 2 brings image and video segmentation into a single open-source system with a large public video dataset behind it. It improves speed and accuracy, reduces the amount of interaction needed, and introduces memory for tracking objects across frames. Its limits are still clear, but its release gives computer vision a new foundation to test, challenge, and extend.