Segment Anything

Segment Anything?
• Segments anything, whatever the object in the image.

Abstract
Goal
• Overview: LLMs have had a large impact on NLP through zero-shot generalization, but the equivalent effect in computer vision has so far been modest.
• Goal: a fast, powerful, and general model for image segmentation
• Achieving this requires answering the following three questions:
1. What task will enable zero-shot generalization?
2. What is the corresponding model architecture?
3. What data can power this task and model?

Method
Task
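• Prompt: specifies the object to generate a mask for
→ The object is specified in one of three ways (point, box, text)
[Figure: examples of a point prompt, a box prompt, and a text prompt ("fish")]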

Method
Data
• The data engine consists of three stages:
1. Assisted-manual stage
2. Semi-automatic stage
3. Fully automatic stage

Method
Data
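1. Assisted-manual stage
: SAM is trained on public segmentation datasets and run on new images; annotators then correct its predicted masks at the pixel level
The model was retrained 6 times, periodically, on the data collected so far
4.3M masks were collected from 120k images
[Figure: public segmentation dataset → train → infer → correct]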

Method
Data
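2. Semi-automatic stage
: SAM is trained on the dataset built in the previous stage and run on new images; annotators correct only the objects its predictions left out
As in stage 1, the model was retrained periodically (5 times) on the collected data
An additional 5.9M masks were collected from 180k images (10.2M masks in total = 4.3M from stage 1 + 5.9M)
[Figure: 4.3M masks from the previous stage → train → infer → correct the missed objects]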

Method
Data
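3. Fully automatic stage
: SAM is trained on the 10.2M masks from stages 1 and 2, and its predictions are used as the labels
The model is prompted with a 32×32 grid of points over the image and, for each point, predicts masks that may correspond to valid objects
Result: the SA-1B dataset (1.1B masks generated from 11M images)
[Figure: 10.2M masks from stages 1 and 2 → train → label → SA-1B]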
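The 32×32 point grid used as prompts in the fully automatic stage can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `build_point_grid` and `scale_to_image`, the cell-centred placement of the points, and the 480×640 example image are all assumptions.

```python
import numpy as np


def build_point_grid(n_per_side: int) -> np.ndarray:
    """Return an (n_per_side**2, 2) array of (x, y) points in (0, 1).

    Assumption: points sit at the centres of an n x n grid of cells.
    """
    offset = 1.0 / (2 * n_per_side)
    coords = np.linspace(offset, 1.0 - offset, n_per_side)
    xs, ys = np.meshgrid(coords, coords)
    return np.stack([xs.ravel(), ys.ravel()], axis=-1)


def scale_to_image(points: np.ndarray, height: int, width: int) -> np.ndarray:
    """Convert normalised (x, y) points to pixel coordinates."""
    return points * np.array([width, height], dtype=np.float64)


grid = build_point_grid(32)                    # 32 x 32 = 1024 point prompts
pixel_points = scale_to_image(grid, height=480, width=640)
print(grid.shape, pixel_points[0])
```

Each of the 1024 points would then be passed to the model as a separate point prompt.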

Method
Model
• SAM consists of three components:
• Image encoder
• Prompt encoder
• Mask decoder
Image encoder: a ViT pre-trained with MAE
1024×1024 input size (the shorter side is padded)
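The 1024×1024 input with a padded shorter side can be sketched as below. This assumes the longer side is first scaled to 1024 and that zero padding is added on the bottom and right; the slides do not state either detail. The helper names and the zero-filled stand-in image are hypothetical.

```python
import numpy as np

TARGET = 1024  # input size stated on the slide


def resized_shape(height: int, width: int, target: int = TARGET) -> tuple:
    """Shape after scaling the longer side to `target`, keeping aspect ratio."""
    scale = target / max(height, width)
    return int(round(height * scale)), int(round(width * scale))


def pad_to_square(image: np.ndarray, target: int = TARGET) -> np.ndarray:
    """Zero-pad an (H, W, C) image on the bottom/right to (target, target, C).

    Assumption: the padding side and fill value are illustrative choices.
    """
    h, w = image.shape[:2]
    return np.pad(image, ((0, target - h), (0, target - w), (0, 0)), mode="constant")


new_h, new_w = resized_shape(480, 640)
# Stand-in for an image that has already been resized to (new_h, new_w).
resized = np.zeros((new_h, new_w, 3), dtype=np.float32)
padded = pad_to_square(resized)
print((new_h, new_w), padded.shape)
```

The actual resizing (interpolation) is left out here; any image library could supply it.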

Method
Model – Image encoder
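• MAE (masked autoencoders)
: a training method that splits the image into a grid of patches, hides some of the patches, and learns to reconstruct the original
After training, only the encoder is used, as an embedding model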
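The patch-masking step of MAE can be sketched as follows. This covers only the data side (splitting into patches and hiding a random subset); the encoder, decoder, and reconstruction loss are omitted. The 16-pixel patch size, the 0.75 mask ratio, and the function names are assumptions for illustration, not values taken from the slides.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)


def random_masking(patches: np.ndarray, mask_ratio: float, rng):
    """Keep a random subset of patches; return them and a boolean hidden-mask."""
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    hidden = np.ones(n, dtype=bool)
    hidden[keep_idx] = False          # False = visible to the encoder
    return patches[keep_idx], hidden


rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
patches = patchify(image, patch=16)               # 4 x 4 = 16 patches
visible, hidden = random_masking(patches, 0.75, rng)
print(patches.shape, visible.shape, int(hidden.sum()))
```

During pre-training the encoder would see only `visible`, and a decoder would be asked to reconstruct the patches flagged in `hidden`.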

Method
Model – Prompt encoder
• mask: convolutions bring it to the dimensions of the image embedding, and it is then summed element-wise (per pixel) with the image embedding
• point & box: represented with positional encodings
• text: embedded with the text encoder taken from the CLIP model
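One way to turn point and box coordinates into positional encodings is with random Fourier features, sketched below. This is an illustrative assumption about the encoding; the slides only say "positional encoding". The names `make_encoder` and `encode`, the embedding size 256, and the example coordinates are hypothetical, and any learned per-prompt-type embeddings are left out.

```python
import numpy as np


def make_encoder(embed_dim: int, rng, scale: float = 1.0):
    """Build a coordinate encoder from a fixed random frequency matrix."""
    freqs = scale * rng.standard_normal((2, embed_dim // 2))

    def encode(points_xy: np.ndarray, height: int, width: int) -> np.ndarray:
        # Normalise pixel coordinates to [-1, 1].
        norm = points_xy / np.array([width, height], dtype=np.float64)
        norm = 2.0 * norm - 1.0
        proj = 2.0 * np.pi * (norm @ freqs)
        return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

    return encode


rng = np.random.default_rng(0)
encode = make_encoder(256, rng)

# A single point prompt at the image centre.
point_embedding = encode(np.array([[320.0, 240.0]]), 480, 640)

# A box prompt is encoded as its two corners (top-left, bottom-right).
box = np.array([[100.0, 50.0], [400.0, 300.0]])
box_embedding = encode(box, 480, 640)
print(point_embedding.shape, box_embedding.shape)
```

A box thus becomes two embedding vectors, one per corner, while a point becomes one.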

Method
Model – Mask decoder
attention block
cross attention
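The cross attention named on this slide can be sketched as single-head scaled dot-product attention in which prompt tokens query the image embedding. This is a simplified illustration: the learned query/key/value projections, multiple heads, and the rest of the decoder block are omitted, and the token counts and dimension are arbitrary assumptions.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries: np.ndarray, keys: np.ndarray, values: np.ndarray):
    """Scaled dot-product attention: each query attends over all keys."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights


rng = np.random.default_rng(0)
prompt_tokens = rng.standard_normal((5, 32))    # e.g. encoded prompt tokens
image_tokens = rng.standard_normal((64, 32))    # e.g. an 8 x 8 image embedding

updated_tokens, attn = cross_attention(prompt_tokens, image_tokens, image_tokens)
print(updated_tokens.shape, attn.shape)
```

Swapping the roles (image tokens as queries, prompt tokens as keys and values) gives attention in the opposite direction.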

Method
Ambiguity
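• Three candidate masks are generated
• Of the three masks, only the loss of the one most similar to the ground truth is backpropagated
• Ambiguity: it can be unclear whether the prompt targets the person or the bag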
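Backpropagating only the loss of the best candidate amounts to taking the minimum loss over the three masks, as in the sketch below. Binary cross-entropy is used here purely as a simple stand-in for the per-mask loss, and the toy 4×4 masks and function names are assumptions.

```python
import numpy as np


def bce_loss(prob: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between predicted probabilities and a 0/1 mask."""
    prob = np.clip(prob, eps, 1 - eps)
    return float(-np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob)))


def min_over_candidates(candidate_probs: np.ndarray, target: np.ndarray):
    """Return the index and loss of the candidate closest to the ground truth."""
    losses = [bce_loss(p, target) for p in candidate_probs]
    best = int(np.argmin(losses))
    return best, losses[best], losses


target = np.zeros((4, 4))
target[1:3, 1:3] = 1.0

candidates = np.stack([
    np.full((4, 4), 0.5),                 # uninformative prediction
    np.where(target == 1, 0.9, 0.1),      # close to the ground truth
    np.where(target == 1, 0.2, 0.8),      # mostly wrong
])

best, best_loss, all_losses = min_over_candidates(candidates, target)
print(best, round(best_loss, 4))
```

In a training framework, only `best_loss` would be passed to the backward pass, so the other two candidates receive no gradient for that sample.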

2. Zero-Shot Transfer Experiments

Zero-Shot Transfer Experiments
• Five tasks are evaluated:
1. Zero-Shot Single Point Valid Mask Evaluation
2. Zero-Shot Edge Detection
3. Zero-Shot Object Proposals
4. Zero-Shot Instance Segmentation
5. Zero-Shot Text-to-Mask

Zero-Shot Single Point Valid Mask Evaluation
• Task: measures how well the model generates the mask corresponding to a single clicked point

Zero-Shot Single Point Valid Mask Evaluation
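• Compared against the RITM model on 23 datasets
• The circle markers show the result when at least one of the three predicted masks is correct (the best of the three is counted)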
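Counting the best of the three predicted masks can be sketched as taking the maximum IoU against the ground truth. The function names `iou` and `oracle_iou` and the toy 6×6 masks are assumptions for illustration.

```python
import numpy as np


def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)


def oracle_iou(candidates, gt: np.ndarray) -> float:
    """Score of the candidate that best matches the ground truth."""
    return max(iou(m, gt) for m in candidates)


gt = np.zeros((6, 6), dtype=bool)
gt[1:5, 1:5] = True                       # 16-pixel object

whole = np.ones((6, 6), dtype=bool)       # covers the whole image
part = np.zeros((6, 6), dtype=bool)
part[1:3, 1:5] = True                     # upper half of the object
exact = gt.copy()

scores = [iou(m, gt) for m in (whole, part, exact)]
best = oracle_iou([whole, part, exact], gt)
print(scores, best)
```

Averaging this per-sample best score over a dataset gives an upper bound on what a perfect mask-selection rule could achieve.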

Zero-Shot Edge Detection
• For the edge task, the inference procedure is changed
Prompt with a 16×16 grid of points → 16×16×3 = 768 masks → NMS filtering → Sobel filter
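The final Sobel step of this pipeline can be sketched on a single mask probability map as below. This is a plain NumPy illustration under stated assumptions: edge-replicating padding, gradient magnitude as the edge strength, and a toy 8×8 map. The grid prompting and NMS steps are not shown.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T


def filter3x3(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Apply a 3x3 kernel (cross-correlation) with edge-replicated borders."""
    h, w = image.shape
    padded = np.pad(image, 1, mode="edge")
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def sobel_edges(prob_map: np.ndarray) -> np.ndarray:
    """Gradient magnitude of a mask probability map."""
    gx = filter3x3(prob_map, SOBEL_X)
    gy = filter3x3(prob_map, SOBEL_Y)
    return np.hypot(gx, gy)


# Toy probability map: background on the left, object on the right.
prob = np.zeros((8, 8))
prob[:, 4:] = 1.0
edges = sobel_edges(prob)
print(edges[0])
```

The response is non-zero only in the two columns that straddle the mask boundary, which is what makes the filter usable as an edge detector on mask outputs.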


Zero-Shot Object Proposals
• A 64×64 grid of points with an NMS threshold of 0.9 yields about 900 masks per image on average
• When more than 1000 masks are generated, they are limited to the top 1000 by confidence & stability score
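Limiting the proposals to the top 1000 can be sketched as a sort on a combined score. Averaging the confidence and stability scores is an assumption about how the two are combined, and the random scores and the name `top_k_proposals` are placeholders.

```python
import numpy as np


def top_k_proposals(confidence: np.ndarray, stability: np.ndarray, k: int = 1000) -> np.ndarray:
    """Indices of the k highest-scoring masks, best first.

    Assumption: the ranking score is the mean of confidence and stability.
    """
    score = (confidence + stability) / 2.0
    order = np.argsort(-score, kind="stable")
    return order[:k]


rng = np.random.default_rng(0)
confidence = rng.random(1200)     # placeholder per-mask confidence scores
stability = rng.random(1200)      # placeholder per-mask stability scores

keep = top_k_proposals(confidence, stability, k=1000)
print(len(keep))
```

When fewer than `k` masks are available, the slice simply returns all of them, so no special case is needed.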

Zero-Shot Instance Segmentation

Zero-Shot Text-to-Mask
• CLIP model
• Only the encoders of the CLIP model are used
CLIP image encoder
CLIP text encoder

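Conclusion
• Presents a foundation model for computer vision
• Builds a large-scale dataset (SA-1B), with more than 11× the images and more than 400× the masks of the existing Open Images dataset
• Zero-shot transfer (handles data not seen during training)
• Solves problems across a variety of tasks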
