SAM 2: Segment Anything in Images and Videos
Kim, Sungchul
Contents
▪ Introduction
▪ Task
▪ Model
▪ Dataset (we can skip how to get this dataset)
▪ Experiments
▪ Conclusions
Introduction
Introducing SAM 2: The next generation of Meta Segment Anything Model for videos and images
Introduction
▪ Segment Anything Model 2 (SAM 2)
• A unified model for video and image segmentation
• Task : Promptable Visual Segmentation (PVS)
• Model : Memory attention, memory encoder, and memory bank modules
• Dataset : SA-V
Task
▪ Promptable visual segmentation (PVS)
• PVS allows providing prompts to the model on any frame of a video
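As a concrete illustration of PVS, the sketch below prompts a video on an arbitrary frame and propagates the resulting masklet. It follows the API of the publicly released sam2 package (facebookresearch/sam2); the config, checkpoint, and frame-directory paths are placeholders, and exact function names may differ between releases.

```python
# Hedged sketch of promptable visual segmentation with the public `sam2` package.
# All file paths are placeholders.
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml",    # model config (assumed path)
    "checkpoints/sam2.1_hiera_large.pt",     # checkpoint (assumed path)
)

with torch.inference_mode():
    # A directory of extracted JPEG frames for one video (assumed layout).
    state = predictor.init_state(video_path="videos/example_frames")

    # PVS: the prompt may land on any frame -- here a positive click on frame 30.
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=30,
        obj_id=1,
        points=np.array([[420, 260]], dtype=np.float32),  # (x, y) in pixels
        labels=np.array([1], dtype=np.int32),              # 1 = positive, 0 = negative click
    )

    # Propagate the masklet through the rest of the video.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu()  # binary mask per tracked object
```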
Model
▪ A generalization of SAM to the video (and image) domain
• SAM 2 supports point, box, and mask prompts on individual frames to define
the spatial extent of the object to be segmented across the video
• The frame embedding used by the SAM 2 decoder is not taken directly from the
image encoder; instead, it is conditioned on memories of past predictions
and of prompted frames
Model
▪ Image encoder
• Masked Autoencoder (MAE) pre-trained Hiera image encoder
• Hierarchical Vision Transformer (Hiera) is a simple hierarchical vision transformer, built by taking an existing
hierarchical model and removing all its bells-and-whistles, while supplying spatial bias through MAE
pretraining (see the sketch below)
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles (arXiv:2306.00989)
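To make the "hierarchical ViT without bells-and-whistles" idea concrete, here is a minimal sketch of a Hiera-like encoder: plain attention + MLP blocks, spatial pooling between stages, and a multi-scale output that the SAM 2 decoder can tap via skip connections. It is not the official Hiera code; all widths and depths are illustrative, and the spatial bias would come from MAE pretraining, which is omitted here.

```python
# Illustrative Hiera-style hierarchical ViT sketch (not the official implementation).
import torch
import torch.nn as nn

class PlainBlock(nn.Module):
    """A vanilla ViT block: LayerNorm -> MHSA -> LayerNorm -> MLP, no extra tricks."""
    def __init__(self, dim, heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                      # x: (B, H*W, C)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.norm2(x))

class TinyHieraLike(nn.Module):
    """Four stages; each stage halves the spatial resolution and doubles channels,
    producing a feature pyramid (strides 4/8/16/32)."""
    def __init__(self, in_chans=3, dims=(96, 192, 384, 768), depths=(1, 2, 7, 2)):
        super().__init__()
        self.patch = nn.Conv2d(in_chans, dims[0], kernel_size=4, stride=4)
        self.stages, self.downs = nn.ModuleList(), nn.ModuleList()
        for i, (d, n) in enumerate(zip(dims, depths)):
            self.stages.append(nn.ModuleList([PlainBlock(d, heads=max(1, d // 96)) for _ in range(n)]))
            self.downs.append(nn.Conv2d(d, dims[i + 1], 2, 2) if i + 1 < len(dims) else nn.Identity())

    def forward(self, img):                    # img: (B, 3, H, W)
        x = self.patch(img)                    # (B, C0, H/4, W/4)
        feats = []
        for stage, down in zip(self.stages, self.downs):
            B, C, H, W = x.shape
            t = x.flatten(2).transpose(1, 2)   # tokens (B, H*W, C)
            for blk in stage:
                t = blk(t)
            x = t.transpose(1, 2).reshape(B, C, H, W)
            feats.append(x)                    # multi-scale features for skip connections
            x = down(x)
        return feats

feats = TinyHieraLike()(torch.randn(1, 3, 256, 256))
print([f.shape for f in feats])  # stride-4, -8, -16, -32 feature maps
```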
Model
▪ Image encoder
• Hierarchical Vision Transformer (Hiera)
• (Architecture comparison figures from the Hiera paper, arXiv:2306.00989, contrasting Hiera with MViTv2 and CSWin Transformer)
Model
▪ Memory attention
• Conditions the current frame features on the past frames' features and
predictions, as well as on any new prompts
• Self-attention, followed by cross-attention to memories of frames and object
pointers, followed by an MLP
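The sketch below shows what one such memory-attention layer might look like in PyTorch: self-attention over the current frame's tokens, cross-attention to the concatenated memory tokens (spatial memories plus object pointers), then an MLP. It illustrates the description above; it is not the official SAM 2 implementation, and all dimensions are placeholders.

```python
# Illustrative memory-attention layer: self-attn -> cross-attn to memories -> MLP.
import torch
import torch.nn as nn

class MemoryAttentionLayer(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, frame_tokens, memory_tokens):
        # frame_tokens:  (B, H*W, dim) current-frame features from the image encoder
        # memory_tokens: (B, T, dim)   past-frame memories, prompted-frame memories,
        #                              and object pointers concatenated along T
        x = frame_tokens
        q = self.n1(x)
        x = x + self.self_attn(q, q, q)[0]                                     # self-attention
        x = x + self.cross_attn(self.n2(x), memory_tokens, memory_tokens)[0]   # cross-attention to memories
        return x + self.mlp(self.n3(x))                                        # MLP

# Conditioning the current frame on its memory bank (token counts are illustrative):
layer = MemoryAttentionLayer()
frame = torch.randn(1, 32 * 32, 256)
memories = torch.randn(1, 4 * 32 * 32 + 4, 256)   # a few past frames + object pointers
conditioned = layer(frame, memories)
```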
Model
▪ Prompt encoder and mask decoder
• Prompt encoder is identical to SAM’s
• can be prompted by clicks (positive or negative), bounding boxes, or masks to define the extent of the
object in a given frame
• Decoder design largely follows SAM
• stacking “two-way” transformer blocks that update prompt and frame embeddings.
• an additional head to predict whether the object of interest
is present on the current frame (sketched below)
• skip connections from the hierarchical image encoder
bypassing the memory attention to incorporate
high-resolution information for mask decoding
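The occlusion prediction mentioned above can be pictured as a small extra head on a dedicated decoder output token; the sketch below is purely illustrative and omits the SAM-style two-way transformer blocks and the high-resolution skip connections.

```python
# Illustrative "object present?" head on a dedicated decoder output token.
import torch
import torch.nn as nn

class ObjectPresenceHead(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, occlusion_token):        # (B, dim): decoder output for the occlusion token
        return self.mlp(occlusion_token)       # logit > 0  ->  object visible on this frame

head = ObjectPresenceHead()
token = torch.randn(2, 256)                    # decoder outputs for two frames
visible = head(token).sigmoid() > 0.5          # no mask is emitted when the object is absent
```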
Model
▪ Memory encoder
• Generates a memory by
• downsampling the output mask using a convolutional module
• summing it element-wise with the unconditioned frame embedding from the image encoder
▪ Memory bank
• retains information about past predictions for the target object in the video by
maintaining a FIFO queue of memories of up to N recent frames
• stores information from prompts in a FIFO queue of up to M prompted frames
• stores a list of object pointers as lightweight vectors for high-level semantic
information of the object to segment
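A minimal sketch of such a memory bank, with its two FIFO queues and object-pointer list, is shown below; N, M, the stored tensor shapes, and the unbounded pointer list are simplifications rather than the paper's exact design.

```python
# Illustrative memory bank: FIFO queues of recent and prompted frames plus object pointers.
from collections import deque
import torch

class MemoryBank:
    def __init__(self, num_recent=7, num_prompted=8):           # N recent, M prompted (illustrative)
        self.recent = deque(maxlen=num_recent)                   # spatial memories of recent frames
        self.prompted = deque(maxlen=num_prompted)               # spatial memories of prompted frames
        self.object_pointers = []                                 # lightweight semantic vectors (unbounded here)

    def add(self, memory_feat, object_pointer, is_prompted):
        # memory_feat:    (C, H, W) fused mask + frame-embedding features from the memory encoder
        # object_pointer: (C,)      high-level semantic vector from the mask decoder
        (self.prompted if is_prompted else self.recent).append(memory_feat)
        self.object_pointers.append(object_pointer)

    def tokens(self):
        """Flatten everything into one token sequence for memory attention."""
        spatial = [m.flatten(1).t() for m in list(self.prompted) + list(self.recent)]  # (H*W, C) each
        pointers = [p.unsqueeze(0) for p in self.object_pointers]                      # (1, C) each
        return torch.cat(spatial + pointers, dim=0)              # (T, C)

bank = MemoryBank()
bank.add(torch.randn(256, 16, 16), torch.randn(256), is_prompted=True)
bank.add(torch.randn(256, 16, 16), torch.randn(256), is_prompted=False)
print(bank.tokens().shape)   # (2*16*16 + 2, 256)
```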
Model
▪ Training
• Pre-training
• Dataset : SA-1B
• Optimizer : AdamW w/ layer decay on image encoder and a reciprocal square-root schedule
• Loss : Focal loss and dice loss for the mask prediction, and mean-absolute-error (MAE) loss for the IoU prediction, weighted 20:1:1
• Full-training
• Dataset : SA-1B, SA-V, and internal dataset
• Optimizer : AdamW w/ layer decay on image encoder and a cosine schedule
• Loss : Focal loss, dice loss, MAE loss, and cross-entropy loss for the object-presence prediction, weighted 20:1:1:1
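The full-training loss mix can be sketched as below: focal and dice terms for the mask, an L1 (mean-absolute-error) term for the predicted IoU, and a cross-entropy term for object presence, combined with the 20:1:1:1 weighting. It uses torchvision's sigmoid_focal_loss; the specific focal/dice settings are assumptions, not the paper's exact configuration.

```python
# Illustrative 20:1:1:1 loss combination (focal : dice : IoU-MAE : object presence).
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def dice_loss(logits, target, eps=1e-6):
    prob = logits.sigmoid().flatten(1)
    target = target.flatten(1)
    inter = (prob * target).sum(-1)
    return 1 - (2 * inter + eps) / (prob.sum(-1) + target.sum(-1) + eps)

def sam2_style_loss(mask_logits, gt_mask, pred_iou, gt_iou, presence_logit, gt_present):
    l_focal = sigmoid_focal_loss(mask_logits, gt_mask, reduction="mean")
    l_dice = dice_loss(mask_logits, gt_mask).mean()
    l_iou = F.l1_loss(pred_iou, gt_iou)                        # MAE on the IoU prediction
    l_obj = F.binary_cross_entropy_with_logits(presence_logit, gt_present)
    return 20 * l_focal + 1 * l_dice + 1 * l_iou + 1 * l_obj   # 20:1:1:1 weighting

# Toy shapes: batch of 2 predicted masks at 64x64 resolution.
loss = sam2_style_loss(
    torch.randn(2, 64, 64), (torch.rand(2, 64, 64) > 0.5).float(),
    torch.rand(2), torch.rand(2),
    torch.randn(2), torch.ones(2),
)
print(f"loss = {loss.item():.3f}")
```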
Data
▪ SA-V dataset
Experimental results
▪ Zero-shot experiments
• Video tasks
• Promptable video segmentation : involves simulating an interactive setting that resembles the real user experience (UX)
• Semi-supervised video object segmentation : prompts only on the first frame of the video
Experimental results
▪ Zero-shot experiments
• Image tasks
• 1-click and 5-click mIoUs by dataset domain
• model speed in frames per second (FPS) on a single A100 GPU
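For reference, a per-frame FPS figure like the one above can be measured roughly as follows; `model`, the 1024x1024 input, and the iteration counts are placeholders rather than the paper's benchmarking protocol.

```python
# Rough per-frame FPS measurement on a single GPU (placeholder model and input size).
import time
import torch

def measure_fps(model, input_size=1024, warmup=10, iters=100, device="cuda"):
    model = model.eval().to(device)
    x = torch.randn(1, 3, input_size, input_size, device=device)
    with torch.inference_mode():
        for _ in range(warmup):            # warm-up excludes one-time setup costs
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()           # wait for queued GPU work before stopping the clock
    return iters / (time.perf_counter() - start)   # frames per second
```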
Experimental results
▪ Model architecture ablations
Conclusions
▪ Three key aspects:
• extending the promptable segmentation task to video
• equipping the SAM architecture to use memory when applied to video
• the diverse SA-V dataset for training and benchmarking video segmentation.
▪ Hiera
• This model should be considered for integration into OTX
