This document summarizes the SSD object detection model. SSD is a single-shot detector that predicts bounding boxes and class probabilities from multiple feature maps extracted from a base network. It improves speed over two-stage detectors like Faster R-CNN by performing detection in a single stage, without region proposals, using default bounding boxes of different scales and aspect ratios on multiple feature maps. The document covers SSD's model architecture, training procedure, and experimental results, showing that SSD achieves real-time speeds with accuracy competitive with other detectors.
Slides by Míriam Bellver at the UPC Reading group for the paper:
Liu, Wei, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, and Scott Reed. "SSD: Single Shot MultiBox Detector." ECCV 2016.
Full listing of papers at:
https://github.com/imatge-upc/readcv/blob/master/README.md
6. SSD : Introduction
The state of the art is Faster R-CNN (a two-stage detector)
- Hypothesizes bounding boxes, resamples pixels or features for each box, and classifies each box
Too computationally intensive for embedded systems
- Even Faster R-CNN only reaches 7 FPS
Significantly increased speed has come at a cost
- YOLO is faster but less accurate
- Faster R-CNN: 7 FPS with mAP 73.2% vs. YOLO: 45 FPS with mAP 63.4%
SSD is the first deep-network-based object detector that
- does not resample pixels or features for bounding box hypotheses
- is as accurate as approaches that do
The goal: catch both rabbits at once (speed and accuracy)!
7. SSD : Single Shot Detector
- Uses multiple default boxes and runs prediction on multiple feature maps
- High-level features are well abstracted, so they detect large objects well
- Low-level features carry more precise location information
The intuition:
instead of detecting only on the last feature map, detect on the early, middle, and last feature maps.
9. SSD : Model
Multi-scale feature maps for detection
- Detection is performed on several different feature maps
- Lower layers localize objects more precisely, while higher layers are better abstracted, so the two are combined.
Convolutional predictors for detection
The SSD approach is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step to produce the final detections.
- Detection uses 3×3×P convolutional filters
- Each filter outputs either a score for a category (1 value) or a shape offset relative to the default box coordinates (4 values)
Default boxes and aspect ratios
- Our default boxes are similar to the anchor boxes used in Faster R-CNN
- As in Faster R-CNN, the default boxes serve as the initial guess and the network learns the offsets Δx, Δy, Δw, Δh
10. SSD : Model
Convolutional predictors for detection, in more detail
- Classifier: Conv 3×3×(k×(Classes+4)), with k default boxes per location
- Structure of the first box: 4 offsets (dx, dy, dh, dw) + 20 class scores (PASCAL VOC's 20 classes) + 1 background score,
  and likewise for the 2nd, 3rd, ... up to the 6th box
- Output channels: 150 = 6 × (21 + 4)
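A minimal PyTorch sketch of one such predictor head (the 512 input channels and the 38×38 feature map are assumed values for illustration):

```python
import torch
import torch.nn as nn

# With 6 default boxes per location and PASCAL VOC (20 classes + background),
# each location needs 6 * (21 + 4) = 150 output channels.
num_boxes, num_classes = 6, 21          # 20 VOC classes + background
in_channels = 512                       # hypothetical feature-map depth
predictor = nn.Conv2d(in_channels, num_boxes * (num_classes + 4),
                      kernel_size=3, padding=1)

feat = torch.randn(1, in_channels, 38, 38)       # e.g. a 38x38 feature map
out = predictor(feat)                            # (1, 150, 38, 38)
# Reshape to (batch, boxes, 25): 21 class scores + 4 offsets per default box
out = out.permute(0, 2, 3, 1).reshape(1, -1, num_classes + 4)
print(out.shape)                                 # torch.Size([1, 8664, 25])
```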
12. SSD : Training
Matching strategy
- Of the many default boxes, those that overlap a ground-truth (GT) box strongly become positives and the rest are treated as background; the criterion is IoU 0.5
- We then match default boxes to any ground truth with jaccard overlap higher than a threshold (0.5)
- Jaccard overlap is the same thing as IoU
The key difference between training SSD and training a typical detector that uses region proposals is that ground-truth information needs to be assigned to specific outputs in the fixed set of detector outputs. The same holds for YOLO and for the region proposal stage of Faster R-CNN.
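A minimal NumPy sketch of this matching rule, with boxes as (x1, y1, x2, y2); the best-overlap rule that guarantees every ground truth gets at least one default box is included, as in the paper:

```python
import numpy as np

def iou(box, boxes):
    # Jaccard overlap between one box and an array of boxes, all (x1, y1, x2, y2)
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def match(defaults, gts, threshold=0.5):
    # Label each default box with a ground-truth index; -1 means background
    labels = np.full(len(defaults), -1)
    for gi, gt in enumerate(gts):
        overlaps = iou(gt, defaults)
        labels[overlaps > threshold] = gi   # IoU > 0.5 -> positive
        labels[np.argmax(overlaps)] = gi    # best default always matches its GT
    return labels
```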
13. SSD : Training
Training objective
- Similar to Faster R-CNN's loss: L = (1/N)(L_conf + α·L_loc)
● L_conf : the confidence loss is the softmax loss over multiple class confidences
● L_loc : we regress to offsets for the center (cx, cy) of the default bounding box (d) and for its width (w) and height (h); i.e., the network learns how far the default box must be moved
The width and height offsets are in log space,
since their scale can grow large.
N : the number of matched default boxes
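A sketch of the offset encoding used by L_loc (the paper's variance/scaling constants are omitted for clarity):

```python
import numpy as np

# d = default box, g = matched ground truth, both as (cx, cy, w, h)
def encode(d, g):
    return np.array([(g[0] - d[0]) / d[2],  # center-x offset, relative to default width
                     (g[1] - d[1]) / d[3],  # center-y offset, relative to default height
                     np.log(g[2] / d[2]),   # width ratio in log space
                     np.log(g[3] / d[3])])  # height ratio in log space
```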
14. SSD : Training
- An image contains a cat and a dog (the cat is small, the dog is large)
- On the 8×8 (low-level) feature map, only the cat's boxes reach IoU ≥ 0.5 (the dog needs to be seen at a larger scale)
- On the 4×4 (high-level) feature map, only the dog's boxes reach IoU ≥ 0.5 (the cat is too small there)
- The region of the original image that one feature-map cell covers differs per feature map
Reading the first figure again in light of the matching algorithm and the loss makes this clear.
16. SSD : Training
- How the default boxes are generated
Choosing scales and aspect ratios for default boxes
● s_k = s_min + ((s_max − s_min)/(m − 1))·(k − 1)
● m : the number of feature maps boxes are drawn from
● s_min, s_max are constants (0.2 and 0.9)
● k indexes the chosen feature map
● Example for PASCAL VOC: s_k = 0.1 (conv4_3), then 0.2, 0.375, 0.55, 0.725, 0.9
- Once s_k is computed, choose the box aspect ratios
● a_r ∈ {1, 2, 3, 1/2, 1/3}
● width = s_k·√a_r, height = s_k/√a_r; a_r = 1 gives a square box, 2 a short-and-wide box, 1/2 a tall-and-narrow box
● This generates 5 boxes with different aspect ratios
● Whether 6 or 4 boxes are drawn, 1 extra square box is added using only the scale, s'_k = √(s_k·s_{k+1})
● The 4-box layers drop ratios 3 and 1/3, which leaves 4
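A small Python sketch of this scale/shape computation (the value of s_{m+1} used for the extra box at the last level is an assumption here; the paper only defines s'_k = √(s_k·s_{k+1})):

```python
import math

# s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1),  k = 1..m
s_min, s_max, m = 0.2, 0.9, 5

def scale(k):
    # s_{m+1} = 1.0 is an assumption for the extra box at the last level
    return 1.0 if k > m else s_min + (s_max - s_min) * (k - 1) / (m - 1)

for k in range(1, m + 1):
    s_k = scale(k)
    ratios = [1, 2, 3, 1/2, 1/3]                        # 4-box layers drop 3 and 1/3
    boxes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in ratios]
    boxes.append((math.sqrt(s_k * scale(k + 1)),) * 2)  # extra square box at s'_k
    print(f"k={k}: s_k={s_k:.3f},",
          [(round(w, 2), round(h, 2)) for w, h in boxes])
```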
17. SSD : Training
Hard negative mining (a sketch follows below)
- After the matching step, most of the default boxes are negatives, especially when the number of possible default boxes is large
- This is a problem common to all detectors: there are 8,732 default boxes, and if only the IoU ≥ 0.5 matches are kept as positives, almost all of the remaining samples are background
- Using the highest confidence loss for each default box,
- the ratio between the negatives and positives is kept at most 3:1
- That is, rank the negatives by confidence loss and keep only the hardest ones, up to 3× the number of positives
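A minimal PyTorch sketch of that 3:1 selection rule:

```python
import torch

def hard_negative_mining(conf_loss, positive_mask, neg_pos_ratio=3):
    # conf_loss:     (num_boxes,) per-default-box confidence loss
    # positive_mask: (num_boxes,) bool, True where a box matched a ground truth
    num_pos = int(positive_mask.sum())
    neg_loss = conf_loss.clone()
    neg_loss[positive_mask] = -float("inf")      # exclude positives from the ranking
    _, idx = neg_loss.sort(descending=True)      # hardest negatives first
    negative_mask = torch.zeros_like(positive_mask)
    negative_mask[idx[:neg_pos_ratio * num_pos]] = True
    return positive_mask | negative_mask         # boxes that contribute to L_conf
```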
Data augmentation
- Use the entire original input image.
- Sample a patch so that the minimum jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9.
- Randomly sample a patch.
- The sampled patch's aspect ratio is between 1/2 and 2.
- Horizontally flip with probability 0.5.
- Apply some photometric distortions.
18. SSD : Experimental Results
Base network
- VGG16
- We convert fc6 and fc7 to convolutional layers
- Subsample parameters from fc6 and fc7; change pool5 from 2×2 stride 2 to 3×3 stride 1
- We remove all the dropout layers and the fc8 layer
- We fine-tune the resulting model using SGD with initial learning rate 10^-3, 0.9 momentum, 0.0005 weight decay, and batch size 32
19. SSD : Experimental Results
- Both Fast and Faster R-CNN use input images whose minimum dimension is 600
- The two SSD models have exactly the same settings except for their input sizes (300×300 vs. 512×512)
20. SSD : Experimental Results
- Bounding-box size: XS=extra-small; S=small; M=medium; L=large; XL=extra-large. Aspect ratio: XT=extra-tall/narrow; T=tall; M=medium; W=wide; XW=extra-wide
- SSD does not detect small objects well
- Distorted aspect ratios, however, are handled fairly well
21. SSD : Experimental Results
Sensitivity and impact of different object characteristics
- The paper tries to address the small-object weakness with data augmentation, adding small objects to the training data via a zoom-out operation:
● we first randomly place an image on a canvas of 16× the original image size, filled with the mean values of the image
● before we do any random crop operation
● and then the image is pasted in
With this, small objects are found fairly well.
22. SSD : Experimental Results
Other reasons? The beginning of FPN
- Small objects are detected at the low layers.
- Low layers are not sufficiently abstracted, which makes detection hard.
- High layers are sufficiently abstracted, but small objects are hard to detect there (large objects are found well).
- So propagate the high layers' abstractions back down to the low layers, and then up again.
- This is where FPN starts; of that line of work we will look at RetinaNet.
24. RETINA : Introduction
The state of the art is two-stage detectors (Faster R-CNN, ...)
Could a simple one-stage detector achieve similar accuracy?
The problem is class imbalance (far too many negatives, i.e., background)
We propose a new loss function that acts as a more effective alternative to previous approaches for dealing with class imbalance
- Faster R-CNN heuristically reduces the number of candidate boxes through its RPN
- A single-stage detector proposes far more boxes, and most of them are background
- One stage: fast, simple
- Two stage: 10-40% better accuracy
- We propose the focal loss: cross entropy (CE) with a few extra terms
- A loss that makes easy samples even easier, so that training focuses on the hard samples
- YOLOv1 (98 boxes), YOLOv2 (~1k), OverFeat (~1-2k), SSD (~8-26k)
- The more default boxes, the better the performance tends to be
25. RETINA : Introduction
Cross entropy with imbalanced data
We propose a new loss function that acts as a more effective alternative to previous approaches for dealing with class imbalance
- We propose the focal loss: CE with a few extra terms
- A loss that makes easy samples even easier, so that training focuses on the hard samples
- With 100,000 easy and 100 hard examples,
- the total loss is roughly 40× bigger from the easy examples
- So CE is modified slightly
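A back-of-the-envelope illustration of that imbalance (the p_t values here are hypothetical; the slide only gives the ~40× figure):

```python
import math

ce = lambda p_t: -math.log(p_t)        # cross entropy of the true class
easy_total = 100_000 * ce(0.9)         # 100k well-classified examples, ~10,536
hard_total = 100 * ce(0.1)             # 100 hard examples, ~230
print(easy_total / hard_total)         # ~45: easy examples dominate the total loss
```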
27. RETINA : Focal Loss
Focal loss
- We introduce the focal loss starting from the cross entropy (CE) loss for binary classification: CE(p, y) = -log(p_t)
● y ∈ {±1} specifies the ground-truth class
● p ∈ [0, 1] is the model’s estimated probability for the class with label y = 1
● p_t = p if y = 1, and 1 − p otherwise
28. RETINA : Focal Loss
Balanced cross entropy
- α-balanced CE, CE(p_t) = -α_t·log(p_t), addresses imbalance by weighting, but does not separate easy from hard examples
Focal loss definition
- FL(p_t) = -(1 − p_t)^γ·log(p_t)
● For instance, with γ = 2, an example classified with p_t = 0.9 would have 100× lower loss compared with CE, and with p_t ≈ 0.968 about 1000× lower
- A loss that makes the easy samples even cheaper, so that training concentrates on the hard samples
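A minimal PyTorch sketch of the α-balanced focal loss with the paper's defaults α = 0.25, γ = 2:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # logits: raw predictions; targets: 1.0 for foreground, 0.0 for background
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()         # FL = -a_t (1-p_t)^g log(p_t)
```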
29. RETINA : RetinaNet Detector
RetinaNet detector
- RetinaNet is a single, unified network composed of a backbone network and two task-specific subnetworks
- The backbone is responsible for computing a convolutional feature map over an entire input image
- The first subnet performs classification on the backbone's output; the second subnet performs convolutional bounding box regression
- We construct a pyramid with levels P3 through P7
- The spatial resolution is upsampled by a factor of 2 using nearest neighbor for simplicity (FPN), merged via 1×1 lateral convolutions
- The well-abstracted features are brought down to the lower layers so that small objects are also detected well
31. RETINA : RetinaNet Detector
Open questions
- If we keep the backbone and just design the FPN part well, wouldn't performance improve?
- Does an FPN have to mix features top-down?
- What is the most efficient way to mix them?
- We don't really know, so let AutoML mix everything and test it
This leads to NAS-FPN.
34. NAS-FPN : Introduction
The challenge of designing a feature pyramid architecture lies in its huge design space
The key contribution of our work is in designing a search space that covers all possible cross-scale connections to generate multi-scale feature representations.
The discovered architecture, named NAS-FPN, offers great flexibility in building object detection architectures.
- Recently, Neural Architecture Search algorithms have demonstrated promising results in efficiently discovering top-performing architectures for image classification in a huge search space
Current state-of-the-art convolutional architectures for object detection are manually designed. Here we aim to learn a better feature pyramid network architecture for object detection.
35. NAS-FPN : Method
- The FPN architecture can be stacked N times for better accuracy
- The backbone model and the subnets for class and box predictions follow the original RetinaNet design
RetinaNet with NAS-FPN
36. NAS-FPN : Method
- 5 scales {C3, C4, C5, C6, C7} with corresponding feature strides of {8, 16, 32, 64, 128} pixels
- C6 and C7 are created by simply applying stride-2 and stride-4 max pooling to C5
- Proposes the merging cell: pick two feature maps and combine them with a suitable operation
Merging cell (sketched below)
- Pick two feature maps, choose an output resolution, and combine them with a binary op
- The input feature layers are adjusted to the output resolution by nearest-neighbor upsampling or max pooling, if needed, before applying the binary operation
- The merged feature layer is always followed by a ReLU, a 3×3 convolution, and a batch normalization layer
- The result goes back into the candidate pool, and this is repeated N times
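A minimal PyTorch sketch of a merging cell using sum as the binary op (the other searched op, global-pooling attention, is omitted; the channel count 256 is RetinaNet's usual pyramid width):

```python
import torch.nn as nn
import torch.nn.functional as F

class MergingCell(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn = nn.BatchNorm2d(channels)

    def _resize(self, x, size):
        if x.shape[-2:] == tuple(size):
            return x
        if x.shape[-1] < size[-1]:                    # lower resolution -> upsample
            return F.interpolate(x, size=size, mode="nearest")
        stride = x.shape[-1] // size[-1]              # higher resolution -> max pool
        return F.max_pool2d(x, kernel_size=stride, stride=stride)

    def forward(self, a, b, out_size):
        merged = self._resize(a, out_size) + self._resize(b, out_size)
        return self.bn(self.conv(F.relu(merged)))     # ReLU -> 3x3 conv -> BN
```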
38. NAS-FPN : Experiments
Architecture search for NAS-FPN
Proxy task
- To speed up the training of the RNN controller we need a proxy task
- The proxy task is trained for 10 epochs, instead of 50
- A small backbone architecture of ResNet-10 with 512×512 input image size
- Reward: we reserve a randomly selected 7,392 images from the COCO train2017 set as the validation set, which we use to obtain rewards
Controller
- The controller is a recurrent neural network (RNN), trained using the Proximal Policy Optimization (PPO) algorithm
- The total number of unique architectures generated by the RNN controller is tracked during the search
39. NAS-FPN : Experiments
Architecture search for NAS-FPN
- Left: the reward is computed as the AP of sampled architectures on the proxy task
- Right: the number of sampled unique architectures against the total number of sampled architectures
- The number of unique FPN structures converges at roughly 8,000
- And the result of throwing countless TPUs at it? (100 TPUs? 1,000 TPUs??)
41. NAS-FPN : Experiments
Architecture graph of NAS-FPN
- Feature layers in the same row have identical resolution
- The resolution decreases in the bottom-up direction
- Interpretation: the original FPN only has connections from low resolution to high resolution
- As NAS finds higher-AP architectures, it tends to add connections from high-resolution to low-resolution features
The more the discovered network wires in the high-resolution features that detect small objects, the better the performance.
43. NAS-FPN : Experiments
Further improvements with DropBlock
- We apply DropBlock with block size 3×3 after the batch normalization layers in the NAS-FPN layers
- Using DropBlock improves performance further
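A minimal sketch of DropBlock itself (simplified: seed positions are sampled over the whole map rather than only the valid interior region):

```python
import torch
import torch.nn.functional as F

def drop_block(x, drop_prob=0.1, block_size=3, training=True):
    # Zero out contiguous block_size x block_size regions of the feature map
    if not training or drop_prob == 0.0:
        return x
    _, _, h, w = x.shape
    # Seed rate gamma chosen so the expected dropped fraction ~ drop_prob
    gamma = (drop_prob * h * w / (block_size ** 2)
             / ((h - block_size + 1) * (w - block_size + 1)))
    seeds = (torch.rand_like(x) < gamma).float()
    # Grow each seed into a block_size x block_size dropped square
    mask = 1.0 - F.max_pool2d(seeds, block_size, stride=1, padding=block_size // 2)
    return x * mask * mask.numel() / mask.sum()   # rescale the kept activations
```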
44. NAS-FPN : Experiments
Open questions
- This is a case of AutoML being applied to the detection domain
- Running AutoML takes enormous hardware and time; can the rest of us actually do this?
- Is there a more efficient approach?
- When multi-resolution features are combined, they are only summed; is there no other way?
This leads to EfficientDet.
46. EFFICIENTDET : Introduction
State-of-the-art object detectors have also become increasingly expensive
- The latest AmoebaNet-based NAS-FPN detector requires 167M parameters and 3045B FLOPS (30× more than RetinaNet)
- Given these real-world resource constraints, model efficiency becomes increasingly important for object detection
Model efficiency has become increasingly important in computer vision. First, we propose a weighted bi-directional feature pyramid network (BiFPN). Second, we propose a compound scaling method (following EfficientNet). On these we build a new family of object detectors, called EfficientDet.
47. EFFICIENTDET : Introduction
Although these methods tend to achieve better efficiency, they usually sacrifice accuracy
- Most previous works only focus on a specific or a small range of resource requirements
- Yet real-world applications vary widely, from mobile devices to datacenters
A natural question:
Is it possible to build a scalable detection architecture with both higher accuracy and better efficiency across a wide spectrum of resource constraints?
The question shared by every object detection paper: catch accuracy and efficiency at the same time!
48. EFFICIENTDET : Introduction
Challenge 1: efficient multi-scale feature fusion
- FPN has been widely used for multi-scale feature fusion
- PANet, NAS-FPN, and other studies have developed more network structures for cross-scale feature fusion
- Most previous works simply sum the features up without distinction
- We propose a simple yet highly effective weighted bi-directional feature pyramid network (BiFPN)
- PANet adds one more bottom-up path on top of the RetinaNet/FPN top-down path
- The reasoning: low-level features carry more location information, so sending them up once more should add location information to the high-level features and improve performance.
49. EFFICIENTDET : Introduction
Challenge 2: model scaling
- Inspired by recent work on EfficientNet, we propose a compound scaling method for object detectors, which jointly scales up the resolution/depth/width of the backbone, feature network, and box/class prediction network
- There are three ways to grow a model (width, depth, resolution); scale all three together, in balance, as EfficientNet does.
50. EFFICIENTDET : Introduction
Our contributions can be summarized as:
- We propose BiFPN, a weighted bidirectional feature network for easy and fast multi-scale feature fusion
- We propose a new compound scaling method, which jointly scales up backbone, feature network, box/class network, and resolution in a principled way
- Based on BiFPN and compound scaling, we develop EfficientDet
52. EFFICIENTDET : BiFPN
Problem Formulation
- Formally, given a list of multi-scale features P_in (the features the feature pyramid consumes),
- our goal is to find a transformation f that can effectively aggregate different features
- and output a list of new features: P_out = f(P_in)
54. EFFICIENTDET : BiFPN
Cross-Scale Connections
- We observe that PANet achieves better accuracy than FPN and NAS-FPN
- Really?? Then why was NAS run at all??
- First, we remove the nodes that only have one input edge
- Our intuition is simple: if a node has only one input edge with no feature fusion, it contributes less; this yields the Simplified PANet
- Second, we add an extra edge from the original input to the output node if they are at the same level
- Third, unlike PANet, which has only one top-down and one bottom-up path, we treat each bidirectional (top-down & bottom-up) path as one feature network layer, and repeat the same layer multiple times to enable more high-level feature fusion
First → Second → Third → repeat N times
55. EFFICIENTDET : BiFPN
Weighted Feature Fusion
- A common way is to first resize the features to the same resolution and then sum them up
- Pyramid attention network introduces global self-attention upsampling to recover pixel localization (similar to SENet)
Unbounded fusion: O = Σ_i w_i · I_i
- w_i is a learnable weight that can be a scalar (per feature), a vector (per channel), or a multi-dimensional tensor (per pixel)
- We find a scalar works well, but the scalar weight is unbounded
- so we resort to weight normalization to bound the value range of each weight
56. EFFICIENTDET : BiFPN
Softmax-based fusion: O = Σ_i (e^{w_i} / Σ_j e^{w_j}) · I_i
- An intuitive idea is to apply softmax to each weight, such that all weights are normalized into probabilities in [0, 1] representing the importance of each input
- But the extra softmax leads to a significant slowdown on GPU hardware
Fast normalized fusion: O = Σ_i (w_i / (ε + Σ_j w_j)) · I_i
- w_i ≥ 0 is ensured by applying a ReLU after each w_i
- ε = 0.0001 is a small value to avoid numerical instability
- This fast fusion approach has very similar learning behavior and accuracy to the softmax-based fusion, but runs up to 30% faster on GPUs
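A minimal PyTorch sketch of fast normalized fusion:

```python
import torch
import torch.nn as nn

class FastNormalizedFusion(nn.Module):
    # O = sum_i (w_i / (eps + sum_j w_j)) * I_i, with w_i kept >= 0 by a ReLU
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):                 # inputs: list of same-shape tensors
        w = torch.relu(self.weights)           # ReLU keeps each weight non-negative
        w = w / (self.eps + w.sum())           # normalize without a softmax
        return sum(wi * x for wi, x in zip(w, inputs))
```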
59. EFFICIENTDET : Architecture
EfficientDet architecture
- EfficientNet as the backbone network
- BiFPN as the feature network, repeated n times
- A shared class/box prediction network
61. EFFICIENTDET : EFFICIENTNET
Compound Scaling
- We propose a new compound scaling method for object detection, which uses a simple compound coefficient φ to jointly scale up all dimensions of the backbone network, BiFPN network, class/box network, and resolution
- Grid search over all dimensions is prohibitively expensive, therefore we use a heuristic-based scaling approach
Backbone network
- We reuse the same width/depth scaling coefficients of EfficientNet-B0 to B6
62. EFFICIENTDET : EFFICIENTNET
BiFPN network
- We exponentially grow the BiFPN width W_bifpn (#channels): W_bifpn = 64·(1.35^φ)
- and linearly increase its depth D_bifpn (#layers): D_bifpn = 3 + φ
Box/class prediction network
- We fix their width to always match the BiFPN (i.e., W_pred = W_bifpn)
- but linearly increase the depth (#layers): D_box = D_class = 3 + ⌊φ/3⌋
(width = channel depth, depth = number of layers)
Input image resolution
- Since feature levels 3-7 are used in BiFPN, the input resolution must be divisible by 2^7 = 128
- so the resolution is increased linearly: R_input = 512 + φ·128
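A small sketch of these scaling equations (rounding channel counts to hardware-friendly multiples, as released implementations do, is omitted here):

```python
def efficientdet_config(phi):
    # Compound scaling per the equations above, phi = 0..6 for D0..D6
    return {
        "bifpn_width": int(64 * (1.35 ** phi)),  # W_bifpn grows exponentially
        "bifpn_depth": 3 + phi,                  # D_bifpn grows linearly
        "head_depth": 3 + phi // 3,              # D_box = D_class
        "resolution": 512 + phi * 128,           # stays divisible by 2^7 = 128
    }

for phi in range(7):
    print(f"D{phi}:", efficientdet_config(phi))
```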