SAM 2: Segment Anything in Images and Videos
Kim, Sungchul
Contents
▪ Introduction
▪ Task
▪ Model
▪ Dataset (we can skip how to get this dataset)
▪ Experiments
▪ Conclusions
Introduction
Introducing SAM 2: The next generation of Meta Segment Anything Model for videos and images
Introduction
▪ Segment Anything Model 2 (SAM 2)
• A unified model for video and image segmentation
• Task : Promptable Visual Segmentation (PVS)
• Model : Memory attention, memory encoder, and memory bank modules
• Dataset : SA-V
Task
▪ Promptable visual segmentation (PVS)
• PVS allows providing prompts to the model on any frame of a video
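As a concrete illustration of PVS, the sketch below prompts a video on an arbitrary frame and propagates the resulting masklet. It follows the API of the publicly released sam2 package (facebookresearch/sam2); the config, checkpoint, and frame-directory paths are placeholders, and exact function names may differ between releases.

```python
# Hedged sketch of promptable visual segmentation with the public `sam2` package.
# All file paths are placeholders.
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml",    # model config (assumed path)
    "checkpoints/sam2.1_hiera_large.pt",     # checkpoint (assumed path)
)

with torch.inference_mode():
    # A directory of extracted JPEG frames for one video (assumed layout).
    state = predictor.init_state(video_path="videos/example_frames")

    # PVS: the prompt may land on any frame -- here a positive click on frame 30.
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=30,
        obj_id=1,
        points=np.array([[420, 260]], dtype=np.float32),  # (x, y) in pixels
        labels=np.array([1], dtype=np.int32),              # 1 = positive, 0 = negative click
    )

    # Propagate the masklet through the rest of the video.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu()  # binary mask per tracked object
```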
Model
▪ A generalization of SAM to the video (and image) domain
• SAM 2 supports point, box, and mask prompts on individual frames to define
the spatial extent of the object to be segmented across the video
• The frame embedding used by the SAM 2 decoder is not taken directly from the
image encoder; instead, it is conditioned on memories of past predictions
and of prompted frames
Model
▪ Image encoder
• Masked Autoencoder (MAE) pre-trained Hiera image encoder
• Hierarchical Vision Transformer (Hiera) is a simple hierarchical vision transformer, built by taking an existing
hierarchical model and removing all its bells-and-whistles, while supplying spatial bias through MAE
pretraining (see the sketch below)
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles (arXiv:2306.00989)
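To make the "hierarchical ViT without bells-and-whistles" idea concrete, here is a minimal sketch of a Hiera-like encoder: plain attention + MLP blocks, spatial pooling between stages, and a multi-scale output that the SAM 2 decoder can tap via skip connections. It is not the official Hiera code; all widths and depths are illustrative, and the spatial bias would come from MAE pretraining, which is omitted here.

```python
# Illustrative Hiera-style hierarchical ViT sketch (not the official implementation).
import torch
import torch.nn as nn

class PlainBlock(nn.Module):
    """A vanilla ViT block: LayerNorm -> MHSA -> LayerNorm -> MLP, no extra tricks."""
    def __init__(self, dim, heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                      # x: (B, H*W, C)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.norm2(x))

class TinyHieraLike(nn.Module):
    """Four stages; each stage halves the spatial resolution and doubles channels,
    producing a feature pyramid (strides 4/8/16/32)."""
    def __init__(self, in_chans=3, dims=(96, 192, 384, 768), depths=(1, 2, 7, 2)):
        super().__init__()
        self.patch = nn.Conv2d(in_chans, dims[0], kernel_size=4, stride=4)
        self.stages, self.downs = nn.ModuleList(), nn.ModuleList()
        for i, (d, n) in enumerate(zip(dims, depths)):
            self.stages.append(nn.ModuleList([PlainBlock(d, heads=max(1, d // 96)) for _ in range(n)]))
            self.downs.append(nn.Conv2d(d, dims[i + 1], 2, 2) if i + 1 < len(dims) else nn.Identity())

    def forward(self, img):                    # img: (B, 3, H, W)
        x = self.patch(img)                    # (B, C0, H/4, W/4)
        feats = []
        for stage, down in zip(self.stages, self.downs):
            B, C, H, W = x.shape
            t = x.flatten(2).transpose(1, 2)   # tokens (B, H*W, C)
            for blk in stage:
                t = blk(t)
            x = t.transpose(1, 2).reshape(B, C, H, W)
            feats.append(x)                    # multi-scale features for skip connections
            x = down(x)
        return feats

feats = TinyHieraLike()(torch.randn(1, 3, 256, 256))
print([f.shape for f in feats])  # stride-4, -8, -16, -32 feature maps
```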
Model
▪ Image encoder
• Hierarchical Vision Transformer (Hiera)
• (Architecture comparison figures from the Hiera paper, arXiv:2306.00989, contrasting Hiera with MViTv2 and CSWin Transformer)
Model
▪ Memory attention
• Conditions the current frame features on the past frames' features and
predictions, as well as on any new prompts
• Self-attention, followed by cross-attention to memories of frames and object
pointers, followed by an MLP
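The sketch below shows what one such memory-attention layer might look like in PyTorch: self-attention over the current frame's tokens, cross-attention to the concatenated memory tokens (spatial memories plus object pointers), then an MLP. It illustrates the description above; it is not the official SAM 2 implementation, and all dimensions are placeholders.

```python
# Illustrative memory-attention layer: self-attn -> cross-attn to memories -> MLP.
import torch
import torch.nn as nn

class MemoryAttentionLayer(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, frame_tokens, memory_tokens):
        # frame_tokens:  (B, H*W, dim) current-frame features from the image encoder
        # memory_tokens: (B, T, dim)   past-frame memories, prompted-frame memories,
        #                              and object pointers concatenated along T
        x = frame_tokens
        q = self.n1(x)
        x = x + self.self_attn(q, q, q)[0]                                     # self-attention
        x = x + self.cross_attn(self.n2(x), memory_tokens, memory_tokens)[0]   # cross-attention to memories
        return x + self.mlp(self.n3(x))                                        # MLP

# Conditioning the current frame on its memory bank (token counts are illustrative):
layer = MemoryAttentionLayer()
frame = torch.randn(1, 32 * 32, 256)
memories = torch.randn(1, 4 * 32 * 32 + 4, 256)   # a few past frames + object pointers
conditioned = layer(frame, memories)
```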
Model
▪ Prompt encoder and mask decoder
• Prompt encoder is identical to SAM’s
• can be prompted by clicks (positive or negative), bounding boxes, or masks to define the extent of the
object in a given frame
• Decoder design largely follows SAM
• stacking “two-way” transformer blocks that update prompt and frame embeddings.
• an additional head to predict whether the object of interest
is present on the current frame (sketched below)
• skip connections from the hierarchical image encoder
bypassing the memory attention to incorporate
high-resolution information for mask decoding
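The occlusion prediction mentioned above can be pictured as a small extra head on a dedicated decoder output token; the sketch below is purely illustrative and omits the SAM-style two-way transformer blocks and the high-resolution skip connections.

```python
# Illustrative "object present?" head on a dedicated decoder output token.
import torch
import torch.nn as nn

class ObjectPresenceHead(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, occlusion_token):        # (B, dim): decoder output for the occlusion token
        return self.mlp(occlusion_token)       # logit > 0  ->  object visible on this frame

head = ObjectPresenceHead()
token = torch.randn(2, 256)                    # decoder outputs for two frames
visible = head(token).sigmoid() > 0.5          # no mask is emitted when the object is absent
```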
Model
▪ Memory encoder
• Generates a memory by
• downsampling the output mask using a convolutional module
• summing it element-wise with the unconditioned frame embedding from the image encoder
▪ Memory bank
• retains information about past predictions for the target object in the video by
maintaining a FIFO queue of memories of up to N recent frames
• stores information from prompts in a FIFO queue of up to M prompted frames
• stores a list of object pointers as lightweight vectors for high-level semantic
information of the object to segment
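A minimal sketch of such a memory bank, with its two FIFO queues and object-pointer list, is shown below; N, M, the stored tensor shapes, and the unbounded pointer list are simplifications rather than the paper's exact design.

```python
# Illustrative memory bank: FIFO queues of recent and prompted frames plus object pointers.
from collections import deque
import torch

class MemoryBank:
    def __init__(self, num_recent=7, num_prompted=8):           # N recent, M prompted (illustrative)
        self.recent = deque(maxlen=num_recent)                   # spatial memories of recent frames
        self.prompted = deque(maxlen=num_prompted)               # spatial memories of prompted frames
        self.object_pointers = []                                 # lightweight semantic vectors (unbounded here)

    def add(self, memory_feat, object_pointer, is_prompted):
        # memory_feat:    (C, H, W) fused mask + frame-embedding features from the memory encoder
        # object_pointer: (C,)      high-level semantic vector from the mask decoder
        (self.prompted if is_prompted else self.recent).append(memory_feat)
        self.object_pointers.append(object_pointer)

    def tokens(self):
        """Flatten everything into one token sequence for memory attention."""
        spatial = [m.flatten(1).t() for m in list(self.prompted) + list(self.recent)]  # (H*W, C) each
        pointers = [p.unsqueeze(0) for p in self.object_pointers]                      # (1, C) each
        return torch.cat(spatial + pointers, dim=0)              # (T, C)

bank = MemoryBank()
bank.add(torch.randn(256, 16, 16), torch.randn(256), is_prompted=True)
bank.add(torch.randn(256, 16, 16), torch.randn(256), is_prompted=False)
print(bank.tokens().shape)   # (2*16*16 + 2, 256)
```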
Model
▪ Training
• Pre-training
• Dataset : SA-1B
• Optimizer : AdamW w/ layer decay on image encoder and a reciprocal square-root schedule
• Loss : Focal loss and dice loss for the mask prediction, and mean-absolute-error (MAE) loss for the IoU prediction, weighted 20:1:1
• Full-training
• Dataset : SA-1B, SA-V, and internal dataset
• Optimizer : AdamW w/ layer decay on image encoder and a cosine schedule
• Loss : Focal loss, dice loss, MAE loss, and cross-entropy loss for the object-presence prediction, weighted 20:1:1:1
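The full-training loss mix can be sketched as below: focal and dice terms for the mask, an L1 (mean-absolute-error) term for the predicted IoU, and a cross-entropy term for object presence, combined with the 20:1:1:1 weighting. It uses torchvision's sigmoid_focal_loss; the specific focal/dice settings are assumptions, not the paper's exact configuration.

```python
# Illustrative 20:1:1:1 loss combination (focal : dice : IoU-MAE : object presence).
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def dice_loss(logits, target, eps=1e-6):
    prob = logits.sigmoid().flatten(1)
    target = target.flatten(1)
    inter = (prob * target).sum(-1)
    return 1 - (2 * inter + eps) / (prob.sum(-1) + target.sum(-1) + eps)

def sam2_style_loss(mask_logits, gt_mask, pred_iou, gt_iou, presence_logit, gt_present):
    l_focal = sigmoid_focal_loss(mask_logits, gt_mask, reduction="mean")
    l_dice = dice_loss(mask_logits, gt_mask).mean()
    l_iou = F.l1_loss(pred_iou, gt_iou)                        # MAE on the IoU prediction
    l_obj = F.binary_cross_entropy_with_logits(presence_logit, gt_present)
    return 20 * l_focal + 1 * l_dice + 1 * l_iou + 1 * l_obj   # 20:1:1:1 weighting

# Toy shapes: batch of 2 predicted masks at 64x64 resolution.
loss = sam2_style_loss(
    torch.randn(2, 64, 64), (torch.rand(2, 64, 64) > 0.5).float(),
    torch.rand(2), torch.rand(2),
    torch.randn(2), torch.ones(2),
)
print(f"loss = {loss.item():.3f}")
```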
Data
▪ SA-V dataset
Experimental results
▪ Zero-shot experiments
• Video tasks
• Promptable video segmentation : involves simulating an interactive setting that resembles the real user experience (UX)
• Semi-supervised video object segmentation : prompts only on the first frame of the video
Experimental results
▪ Zero-shot experiments
• Image tasks
• 1-click and 5-click mIoUs by dataset domain
• model speed in frames per second (FPS) on a single A100 GPU
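For reference, a per-frame FPS figure like the one above can be measured roughly as follows; `model`, the 1024x1024 input, and the iteration counts are placeholders rather than the paper's benchmarking protocol.

```python
# Rough per-frame FPS measurement on a single GPU (placeholder model and input size).
import time
import torch

def measure_fps(model, input_size=1024, warmup=10, iters=100, device="cuda"):
    model = model.eval().to(device)
    x = torch.randn(1, 3, input_size, input_size, device=device)
    with torch.inference_mode():
        for _ in range(warmup):            # warm-up excludes one-time setup costs
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()           # wait for queued GPU work before stopping the clock
    return iters / (time.perf_counter() - start)   # frames per second
```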
Experimental results
▪ Model architecture ablations
Conclusions
▪ Three key aspects:
• extending the promptable segmentation task to video
• equipping the SAM architecture to use memory when applied to video
• the diverse SA-V dataset for training and benchmarking video segmentation.
▪ Hiera
• This model should be considered for integration into OTX
