You only look once

YOLOYou Only Look Once: Unified, Real-Time Object Detection
Joseph Redmon, Santosh Divvala, Ross Cirshick, Ali Farhadi
20180125 이진경

You Only Look Once:
Unified, Real-Time Object Detection
무엇을 한번 보나?
어떤 과정들이 unified?
어떻게 unify시켰지?
어떤 구조일까?
Real-time 이면 어느정도 속도?
정확도는 과연 높을까?

Faster R-CNN vs YOLO
Faster R-CNN
YOLO

Introduction
• Deformable Part Models (DPM) : sliding window approach
• R-CNN : region proposal methods
Current detection system

Sliding window
Region proposal methods.
Complex pipelines
slow
hard to optimize
Introduction
Current detection system
Why? Each individual component must be trained separately

Introduction
YOLO : object detection as a single regression problem.
Neural network
• Bounding box coordinates
• Class probabilities
Single regression
You Only Look Once at an image to predict what objects are present and where they are.
Directly optimize

Classification vs Regression??

Unified Detection
: Unify the separate components of object detection into a single neural network.
Neural network
Predict all bounding boxes for an image simultaneously
Uses features from the entire image
• End-to-end training
• Real-time speeds
기존의 방법과 무엇이 다른가?

Unified Detection
결과가 어떤 형태로 나오길래 다음과 같이
Bounding box와 각 bounding box의 class를
동시에 알 수 있을까?
Neural network

1) Divide the input image into a S*S grid.
Image > grid cell
• YOLO 에서는 detection을 위해 image 를 N*N으로 쪼갠 grid라는 단위를 이용한다.
• 예시에서의 S=7
Unified Detection

2) Each grid cell predicts
B bounding boxes and confidence scores for detecting that object.
• cell 내부에 중심을 두는 bounding box를 B개 propose.
• B개의 box 에서 confidence scores.
Image > grid cell > bounding boxes
* If no object exists in that cell, the confidence scores should be zero.
* 𝐵는 파라미터로 줄수있고, 이 예시에서 B=2.
Confidence scores reflect
1. how confident the model is that the box contains an object.
2. How accurate it thinks the box is that it predicts.
Unified Detection

* If no object exists in that cell, the confidence scores should be zero.
* 𝐵는 파라미터로 줄수있고, 이 예시에서 B=2.
• cell 내부에 중심을 두는 bounding box를 B개 propose.
• B개의 box 에서 confidence scores
Box를 어떤 형태로 predict? 어떻게 계산?
Unified Detection

Each bounding box consists of 5 predictions: 𝒙, 𝒚, 𝒘, 𝒉 and 𝒄𝒐𝒏𝒇𝒊𝒅𝒆𝒏𝒄𝒆.
• (𝒙, 𝒚) 𝒄𝒐𝒐𝒓𝒅𝒊𝒏𝒂𝒕𝒆𝒔 represent the center of the box relative to the bounds of the grid cell.
• The 𝒘𝒊𝒅𝒕𝒉(𝒘) and 𝒉𝒆𝒊𝒈𝒉𝒕(𝒉) are predicted relative to the whole image
• The 𝒄𝒐𝒏𝒇𝒊𝒅𝒆𝒏𝒄𝒆 𝒑𝒓𝒆𝒅𝒊𝒄𝒕𝒊𝒐𝒏 represents the IOU between the predicted box and any ground truth box.
*predicted box and any ground truth box
Unified Detection
𝒙, 𝒚, 𝒘, 𝒉 and 𝒄𝒐𝒏𝒇𝒊𝒅𝒆𝒏𝒄𝒆 같은 경우는 아직 blackbox인 neural network를 통해서 나올것이다.

IOU(intersection over union)
Unified Detection

Each bounding box consists of 5 predictions: 𝒙, 𝒚, 𝒘, 𝒉 and 𝒄𝒐𝒏𝒇𝒊𝒅𝒆𝒏𝒄𝒆.
• (𝒙, 𝒚) 𝒄𝒐𝒐𝒓𝒅𝒊𝒏𝒂𝒕𝒆𝒔 represent the center of the box relative to the bounds of the grid cell.
• The 𝒘𝒊𝒅𝒕𝒉(𝒘) and 𝒉𝒆𝒊𝒈𝒉𝒕(𝒉) are predicted relative to the whole image
• The 𝒄𝒐𝒏𝒇𝒊𝒅𝒆𝒏𝒄𝒆 𝒑𝒓𝒆𝒅𝒊𝒄𝒕𝒊𝒐𝒏 represents the IOU between the predicted box and any ground truth box.
*predicted box and any ground truth box
Unified Detection
(0.3, 0.2, 0.17, 0.76, 0.91)
𝒙, 𝒚, 𝒘, 𝒉 and 𝒄𝒐𝒏𝒇𝒊𝒅𝒆𝒏𝒄𝒆 같은 경우는 아직 blackbox인 neural network를 통해서 나올것이다.

Each grid cell also predicts C conditional class probabilities.
박스 안에 object가 존재한다면, i 클래스 일 확률은?
Conditional class probability 같은 경우는 아직 blackbox인 neural network를 통해서 나올것이다.
We only predict one set of class probabilities per grid cell, regardless of the number of boxes 𝐵.
그 cell안에 중심이 있는 object가 여러 개일 경우에도, 하나의 class로만 정의.
Cell의 크기를 줄인다면 ? Object 여러 개 찾을 수 있다. 계산량 많아짐. 속도 느려짐
Cell의 크기를 늘린다면 ? Bounding box 개수 줄어듬. 찾을 수 있는 object 개수 적어짐.
Unified Detection

Each bounding box consists of 5 predictions:
𝒙, 𝒚, 𝒘, 𝒉 and 𝒄𝒐𝒏𝒇𝒊𝒅𝒆𝒏𝒄𝒆.
Each grid cell also predicts C conditional class probabilities.
현재까지 계산한 것 이제 계산해야 되는것
how confident the model is that the box contains an object.
How accurate it thinks the box is that it predicts.

Unified Detection
현재까지 계산한 것의 형태

Unified Detection
These predictions are encoded as an S x S x (B x 5 + C) tensor.
Ex) 7*7*30

Neural network

class-specific confidence scores for each box =
(conditional class probabilities) x (the individual box confidence predictions)
class-specific confidence scores for each box.
Unified Detection

각 박스마다
X,y,w,h,confidence
각 cell마다 클래스
class-specific confidence scores for each box.
NMS 등으로 처리한 후.
2. Unified Detection

2.1. Design
Neural network

2.1. Design
Fully
connected
layers
Extract features from the image Predict the output probabilities and coordinates
Convolutional layers of the network

GoogLeNet Inception module
2.1. Design
GoogLeNet model

Fully
connected
layers
24 2
2.1. Design
YOLO model

2.1. Design
YOLO model
뒤에 training 에서 설명

x
y
왜 이런 activation function?
Relu는 node의 output이 0미만이면
아예 0으로 수렴 해당 노드 다음부터는
고려아예 안함.
0미만이어도 조금의 영향을 살려준다.

Fully
connected
layers
9 2
2.1. Design
Fast YOLO model

2.2. Training
classification
Accuracy of 88 %
1) Pretrain classification

detection
2.2. Training
2) detection
Pretrained model for classification +4 conv lay, 2 FC

Error를 어떤식으로 계산해서 Training하는가?
2.2. Training

Use sum-squared error  easy to optimize
x ,y ,w ,h ,confidence scores, conditional class probabilities
targetPrediction
S x S x (B x 5 + C) S x S x (B x 5 + C)
Normalize 시켰기 때문에 각 value의 error weight 는 현재 같다.
2.2. Training
𝑖=0
𝑆2
𝑗=0
𝐵
ℝ𝑖𝑗
𝑜𝑏𝑗 𝑜𝑟 𝑛𝑜𝑜𝑏𝑗
(𝑥𝑖 − 𝑥𝑖)2
+ (𝑦𝑖 − 𝑦𝑖)2
+
𝑖=0
𝑆2
𝑗=0
𝐵
ℝ𝑖𝑗
(𝑤𝑖 − 𝑤𝑖)2
+ (ℎ𝑖 − ℎ)2
+
𝑖=0
𝑆2
𝑗=0
𝐵
ℝ𝑖𝑗
(𝐶𝑖 − 𝐶𝑖)2
+
𝑖=0
𝑆2
ℝ𝑖𝑗
𝑐∈𝑐𝑙𝑎𝑠𝑠𝑒𝑠
(𝑝𝑖(𝑐) − 𝑝𝑖(𝑐))2
다음과 같이 sum-squared error 계산?
(대략적인 모습)

1) Localization error와 classification error의 Weight가 같다
Localization error, classification error weight를 다르게 준다.
• Weight가 같은게 왜 문제일까?
1) 이미 classificatio은 pre-trained 되어있기 때문에?
2) localization의 변수는 더 미세한 조절이 필요하다?
• 비율을 어떻게 줄까?
 ????
sum-squared error 이 training에서 갖는 단점
2.2. Training

2) Object가 없는 grid cell들이 많다. 이 셀들의 confidence score  0.
Object가 없는 cell들의 gradient 에 기울어진 방향으로 training 된다.
obj와 noobj cell 구분하여 confidence score 반영해야한다.
2.2. Training

target prediction
1 0.9
0.9
0.1
0.1
0.2
0.2
target
3) Box의 크기와 상관없이 error 계산.
weight : 큰 box에서의 작은 차이 < 작은 box에서의 작은 차이
prediction
2.2. Training
prediction
target
1
영향력
<<

1) Coordinate prediction(localization error)의 loss weight을 증가
2) noobj 의 confidence score의 loss weight을 감소
Localization error weight : classification error weight = 5 : 1
Obj confidence score error weight : noObj confidence score error weight = 1 : 0.5
3) Squre root of the bounding box width and height.
sum-squared error 이 training에서 갖는 단점 보완 방법
2.2. Training

𝑖=0
𝑆2
𝑗=0
𝐵
ℝ𝑖𝑗
(𝑥𝑖 − 𝑥𝑖)2
+ (𝑦𝑖 − 𝑦𝑖)2
+
𝑖=0
𝑆2
𝑗=0
𝐵
ℝ𝑖𝑗
(𝑤𝑖 − 𝑤𝑖)2
+ (ℎ𝑖 − ℎ)2
+
𝑖=0
𝑆2
𝑗=0
𝐵
ℝ𝑖𝑗
(𝐶𝑖 − 𝐶𝑖)2
+
𝑖=0
𝑆2
ℝ𝑖𝑗
𝑐∈𝑐𝑙𝑎𝑠𝑠𝑒𝑠
(𝑝𝑖(𝑐) − 𝑝𝑖(𝑐))2
2.2. Training

Training을 위한 error 계산시 고려하는 것.
1. Object가 있는 cell에 대해서만 classification error 줄이는 방향.
2. Responsible한 bounding box coordinate 에 대해서만 error 줄이는 방향
Training이 향하는 방향
2.2. Training

Training이 향하는 방향
2.2. Training

2.2. Training
What have done….

75 105 135
0.0001
0.001
0.00001
1) Learning rate
To avoid overfitting…
2) dropout( rate =0.5)
3) Data augmentation (scaling, translation ,exposure , saturation)
2.2. Training
What have done….

2.4. Limitations of YOLO
(1) Strong spatial constraints on bounding box predictions
Each grid cell only predicts two boxes and can only have one class.
For door
This spatial constraint limits the number of nearby objects that our model can predict.
Our model struggles with small objects that appear in groups
For motocycle

(2) Struggles to generalize to objects in new or unusual aspect ratios or configurations.
Our model learns to predict bounding boxes from data.
???training Unusual aspect ratio

(3) Loss function treats errors the same in small bounding boxes versus large bounding boxes.
???
Width,heigh이 아니고,
IOU에 대해서 Large , small box 고려안함

DPM R-CNN Other Fast Detectors
(FAST R-CNN,FASTER R-
CNN)
Deep multi-box OverFeat MultiGrasp
• Sliding window
• Extract features :disjoint pipeline
1. extract static features
2. classify regions
3. predict bounding boxes
…
• Find object : Region
proposals.
1. Selective search
2. CNN
3. SVM
4. NMS
disjoint
Selective search (x)
Sharing computation
and using neural
network region
proposal
Single object detection Efficiently performs
sliding window
detection.
Disjoint system.
Grid approach to
bounding box
predictions is based on
the MultiGrasp system
for regression to grasps.
Single convolutional neural network • Each grid cell proposes
a potential bounding
boxes and scores using
CNN.
• Bounding box 98 : 2000
• Single, jointly optimized
model
• real-time
• gene
• CNN to predict
bounding boxes in
and image
3. Comparisons to Other Detection Systems

4. Experiments
• Fast YOLO is the fastest object detection method On PASCAL.
• YOLO is 10mAP more accurate than the fast version
While still well above real-time in speed
• R-CNN –R : Selective Search(x)  static bounding box proposal
• Fast R-CNN : Relies on Selective Search slow
• Faster RNN :
Selective Search(x) neural network to propose bounding boxes

4.2. VOC 2007 Error Analysis
YOLO vs Fast R-CNN
Each prediction is either correct Or error.
we look at the top N predictions for that
category.(top N detections)
• YOLO struggles to localize objects correctly.
• Fast RCNN is almost 3x more likely to predict background error. false positive. (objec가 없는데 있다고 했다)

4.3. Combining Fast R-CNN and YOLO
YOLO & Fast R-CNN
YOLO 랑 combine 한것이 단순히 ensemble의 효과는 아니다. 왜냐하면 다른 Fast RCNN과 ensemble 했을 때에 비해 조금 더 많은 benefit을 갖고있다.
 YOLO makes different kinds of mistakes at test time  effective at boosting Fast R-CNN
 benefit from the speed of YOLO (x), why not? We run each model separately.
Give that prediction a boost
(YOLO 의 probability,
overlap between the two box 고려)
R-CNN Every proposal box
predicts a similar box?YOLO
Better localization
Better background
yes No 라면 background일 확률 높다?
YOLO에서 거른다?

4.5. Generalizability
We want our detector to learn robust representations of visual concepts
so that they can generalize to new domains or unexpected input at test time.
Picasso Dataset People-Art Dataset
• In real-word applications, it is hard to predict all possible use cases.
• Test data can diverge from wat the system has seen before.
.training
VOC data
Test For generalizing detection to artwork.
Detect people
Natural image Art image

4.5. Generalizability
• YOLO : AP degrades less than other methods
• R-CNN : highly overfit to PASCAL VOC.
• Selective Search : highly tuned for natural image
• Only sees small regions
• DPM : AP degrades less than other methods.
But it starts from a lower AP.
Artwork and natural images are very
different on a pixel level But they are similar
in terms of the size and shape of objects.

5. Real-Time Detection In The Wild
• When attachaed to a webcam it functions like a traking
system.
• Demo is available.

6. Conclusion
• A Unified model for object detection
• Simple to construct
• Can be trained directly on full images.
• Generalizes well to new domains

You only look once

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to You only look once

Similar to You only look once (20)

Recently uploaded

Recently uploaded (20)

You only look once

Editor's Notes