5. Introduction
• Deformable Part Models (DPM) : sliding window approach
• R-CNN : region proposal methods
Current detection system
6. Sliding window
Region proposal methods.
Complex pipelines
slow
hard to optimize
Introduction
Current detection system
Why? Each individual component must be trained separately
7. Introduction
YOLO : object detection as a single regression problem.
Neural network
• Bounding box coordinates
• Class probabilities
Single regression
You Only Look Once at an image to predict what objects are present and where they are.
Directly optimize
9. Unified Detection
: Unify the separate components of object detection into a single neural network.
Neural network
Predict all bounding boxes for an image simultaneously
Uses features from the entire image
• End-to-end training
• Real-time speeds
기존의 방법과 무엇이 다른가?
10. Unified Detection
결과가 어떤 형태로 나오길래 다음과 같이
Bounding box와 각 bounding box의 class를
동시에 알 수 있을까?
Neural network
Predict all bounding boxes for an image simultaneously
Uses features from the entire image
11. 1) Divide the input image into a S*S grid.
Image > grid cell
• YOLO 에서는 detection을 위해 image 를 N*N으로 쪼갠 grid라는 단위를 이용한다.
• 예시에서의 S=7
Unified Detection
12. 2) Each grid cell predicts
B bounding boxes and confidence scores for detecting that object.
• cell 내부에 중심을 두는 bounding box를 B개 propose.
• B개의 box 에서 confidence scores.
Image > grid cell > bounding boxes
* If no object exists in that cell, the confidence scores should be zero.
* 𝐵는 파라미터로 줄수있고, 이 예시에서 B=2.
Confidence scores reflect
1. how confident the model is that the box contains an object.
2. How accurate it thinks the box is that it predicts.
Unified Detection
13. Image > grid cell > bounding boxes
* If no object exists in that cell, the confidence scores should be zero.
* 𝐵는 파라미터로 줄수있고, 이 예시에서 B=2.
Confidence scores reflect
1. how confident the model is that the box contains an object.
2. How accurate it thinks the box is that it predicts.
2) Each grid cell predicts
B bounding boxes and confidence scores for detecting that object.
• cell 내부에 중심을 두는 bounding box를 B개 propose.
• B개의 box 에서 confidence scores
Box를 어떤 형태로 predict? 어떻게 계산?
Unified Detection
14. Image > grid cell > bounding boxes
Each bounding box consists of 5 predictions: 𝒙, 𝒚, 𝒘, 𝒉 and 𝒄𝒐𝒏𝒇𝒊𝒅𝒆𝒏𝒄𝒆.
• (𝒙, 𝒚) 𝒄𝒐𝒐𝒓𝒅𝒊𝒏𝒂𝒕𝒆𝒔 represent the center of the box relative to the bounds of the grid cell.
• The 𝒘𝒊𝒅𝒕𝒉(𝒘) and 𝒉𝒆𝒊𝒈𝒉𝒕(𝒉) are predicted relative to the whole image
• The 𝒄𝒐𝒏𝒇𝒊𝒅𝒆𝒏𝒄𝒆 𝒑𝒓𝒆𝒅𝒊𝒄𝒕𝒊𝒐𝒏 represents the IOU between the predicted box and any ground truth box.
*predicted box and any ground truth box
Unified Detection
2) Each grid cell predicts
B bounding boxes and confidence scores for detecting that object.
𝒙, 𝒚, 𝒘, 𝒉 and 𝒄𝒐𝒏𝒇𝒊𝒅𝒆𝒏𝒄𝒆 같은 경우는 아직 blackbox인 neural network를 통해서 나올것이다.
15. IOU(intersection over union)
Confidence scores reflect
1. how confident the model is that the box contains an object.
2. How accurate it thinks the box is that it predicts.
Unified Detection
16. Image > grid cell > bounding boxes
Each bounding box consists of 5 predictions: 𝒙, 𝒚, 𝒘, 𝒉 and 𝒄𝒐𝒏𝒇𝒊𝒅𝒆𝒏𝒄𝒆.
• (𝒙, 𝒚) 𝒄𝒐𝒐𝒓𝒅𝒊𝒏𝒂𝒕𝒆𝒔 represent the center of the box relative to the bounds of the grid cell.
• The 𝒘𝒊𝒅𝒕𝒉(𝒘) and 𝒉𝒆𝒊𝒈𝒉𝒕(𝒉) are predicted relative to the whole image
• The 𝒄𝒐𝒏𝒇𝒊𝒅𝒆𝒏𝒄𝒆 𝒑𝒓𝒆𝒅𝒊𝒄𝒕𝒊𝒐𝒏 represents the IOU between the predicted box and any ground truth box.
*predicted box and any ground truth box
Unified Detection
(0.3, 0.2, 0.17, 0.76, 0.91)
2) Each grid cell predicts
B bounding boxes and confidence scores for detecting that object.
𝒙, 𝒚, 𝒘, 𝒉 and 𝒄𝒐𝒏𝒇𝒊𝒅𝒆𝒏𝒄𝒆 같은 경우는 아직 blackbox인 neural network를 통해서 나올것이다.
17. Image > grid cell > bounding boxes
Each grid cell also predicts C conditional class probabilities.
박스 안에 object가 존재한다면, i 클래스 일 확률은?
Conditional class probability 같은 경우는 아직 blackbox인 neural network를 통해서 나올것이다.
We only predict one set of class probabilities per grid cell, regardless of the number of boxes 𝐵.
그 cell안에 중심이 있는 object가 여러 개일 경우에도, 하나의 class로만 정의.
2) Each grid cell predicts
B bounding boxes and confidence scores for detecting that object.
Cell의 크기를 줄인다면 ? Object 여러 개 찾을 수 있다. 계산량 많아짐. 속도 느려짐
Cell의 크기를 늘린다면 ? Bounding box 개수 줄어듬. 찾을 수 있는 object 개수 적어짐.
Unified Detection
Confidence scores reflect
1. how confident the model is that the box contains an object.
2. How accurate it thinks the box is that it predicts.
18. Each bounding box consists of 5 predictions:
𝒙, 𝒚, 𝒘, 𝒉 and 𝒄𝒐𝒏𝒇𝒊𝒅𝒆𝒏𝒄𝒆.
Each grid cell also predicts C conditional class probabilities.
현재까지 계산한 것 이제 계산해야 되는것
how confident the model is that the box contains an object.
How accurate it thinks the box is that it predicts.
23. Neural network
Predict all bounding boxes for an image simultaneously
Uses features from the entire image
24. 2) Each grid cell predicts
B bounding boxes and confidence scores for detecting that object.
Image > grid cell > bounding boxes
class-specific confidence scores for each box =
(conditional class probabilities) x (the individual box confidence predictions)
class-specific confidence scores for each box.
Confidence scores reflect
1. how confident the model is that the box contains an object.
2. How accurate it thinks the box is that it predicts.
Unified Detection
50. Use sum-squared error easy to optimize
x ,y ,w ,h ,confidence scores, conditional class probabilities
targetPrediction
S x S x (B x 5 + C) S x S x (B x 5 + C)
Normalize 시켰기 때문에 각 value의 error weight 는 현재 같다.
2.2. Training
𝑖=0
𝑆2
𝑗=0
𝐵
ℝ𝑖𝑗
𝑜𝑏𝑗 𝑜𝑟 𝑛𝑜𝑜𝑏𝑗
(𝑥𝑖 − 𝑥𝑖)2
+ (𝑦𝑖 − 𝑦𝑖)2
+
𝑖=0
𝑆2
𝑗=0
𝐵
ℝ𝑖𝑗
𝑜𝑏𝑗 𝑜𝑟 𝑛𝑜𝑜𝑏𝑗
(𝑤𝑖 − 𝑤𝑖)2
+ (ℎ𝑖 − ℎ)2
+
𝑖=0
𝑆2
𝑗=0
𝐵
ℝ𝑖𝑗
𝑜𝑏𝑗 𝑜𝑟 𝑛𝑜𝑜𝑏𝑗
(𝐶𝑖 − 𝐶𝑖)2
+
𝑖=0
𝑆2
ℝ𝑖𝑗
𝑜𝑏𝑗 𝑜𝑟 𝑛𝑜𝑜𝑏𝑗
𝑐∈𝑐𝑙𝑎𝑠𝑠𝑒𝑠
(𝑝𝑖(𝑐) − 𝑝𝑖(𝑐))2
다음과 같이 sum-squared error 계산?
(대략적인 모습)
51. 1) Localization error와 classification error의 Weight가 같다
Localization error, classification error weight를 다르게 준다.
• Weight가 같은게 왜 문제일까?
1) 이미 classificatio은 pre-trained 되어있기 때문에?
2) localization의 변수는 더 미세한 조절이 필요하다?
• 비율을 어떻게 줄까?
????
sum-squared error 이 training에서 갖는 단점
2.2. Training
52. 2) Object가 없는 grid cell들이 많다. 이 셀들의 confidence score 0.
Object가 없는 cell들의 gradient 에 기울어진 방향으로 training 된다.
obj와 noobj cell 구분하여 confidence score 반영해야한다.
sum-squared error 이 training에서 갖는 단점
2.2. Training
56. Training을 위한 error 계산시 고려하는 것.
1. Object가 있는 cell에 대해서만 classification error 줄이는 방향.
2. Responsible한 bounding box coordinate 에 대해서만 error 줄이는 방향
Training이 향하는 방향
2.2. Training
57. Image > grid cell > bounding boxes
Training이 향하는 방향
2.2. Training
58. Image > grid cell > bounding boxes
Training이 향하는 방향
2.2. Training
60. 75 105 135
0.0001
0.001
0.00001
1) Learning rate
To avoid overfitting…
2) dropout( rate =0.5)
3) Data augmentation (scaling, translation ,exposure , saturation)
2.2. Training
What have done….
61. 2.4. Limitations of YOLO
(1) Strong spatial constraints on bounding box predictions
Each grid cell only predicts two boxes and can only have one class.
For door
This spatial constraint limits the number of nearby objects that our model can predict.
Our model struggles with small objects that appear in groups
For motocycle
62. 2.4. Limitations of YOLO
(2) Struggles to generalize to objects in new or unusual aspect ratios or configurations.
Our model learns to predict bounding boxes from data.
???training Unusual aspect ratio
63. 2.4. Limitations of YOLO
(3) Loss function treats errors the same in small bounding boxes versus large bounding boxes.
???
Width,heigh이 아니고,
IOU에 대해서 Large , small box 고려안함
64. DPM R-CNN Other Fast Detectors
(FAST R-CNN,FASTER R-
CNN)
Deep multi-box OverFeat MultiGrasp
• Sliding window
• Extract features :disjoint pipeline
1. extract static features
2. classify regions
3. predict bounding boxes
…
• Find object : Region
proposals.
1. Selective search
2. CNN
3. SVM
4. NMS
disjoint
Selective search (x)
Sharing computation
and using neural
network region
proposal
Single object detection Efficiently performs
sliding window
detection.
Disjoint system.
Grid approach to
bounding box
predictions is based on
the MultiGrasp system
for regression to grasps.
Single convolutional neural network • Each grid cell proposes
a potential bounding
boxes and scores using
CNN.
• Bounding box 98 : 2000
• Single, jointly optimized
model
• real-time
• gene
• CNN to predict
bounding boxes in
and image
3. Comparisons to Other Detection Systems
66. 4. Experiments
• Fast YOLO is the fastest object detection method On PASCAL.
• YOLO is 10mAP more accurate than the fast version
While still well above real-time in speed
• R-CNN –R : Selective Search(x) static bounding box proposal
• Fast R-CNN : Relies on Selective Search slow
• Faster RNN :
Selective Search(x) neural network to propose bounding boxes
67. 4.2. VOC 2007 Error Analysis
YOLO vs Fast R-CNN
Each prediction is either correct Or error.
we look at the top N predictions for that
category.(top N detections)
• YOLO struggles to localize objects correctly.
• Fast RCNN is almost 3x more likely to predict background error. false positive. (objec가 없는데 있다고 했다)
68. 4.3. Combining Fast R-CNN and YOLO
YOLO & Fast R-CNN
YOLO 랑 combine 한것이 단순히 ensemble의 효과는 아니다. 왜냐하면 다른 Fast RCNN과 ensemble 했을 때에 비해 조금 더 많은 benefit을 갖고있다.
YOLO makes different kinds of mistakes at test time effective at boosting Fast R-CNN
benefit from the speed of YOLO (x), why not? We run each model separately.
Give that prediction a boost
(YOLO 의 probability,
overlap between the two box 고려)
R-CNN Every proposal box
predicts a similar box?YOLO
Better localization
Better background
yes No 라면 background일 확률 높다?
YOLO에서 거른다?
70. 4.5. Generalizability
We want our detector to learn robust representations of visual concepts
so that they can generalize to new domains or unexpected input at test time.
Picasso Dataset People-Art Dataset
• In real-word applications, it is hard to predict all possible use cases.
• Test data can diverge from wat the system has seen before.
.training
VOC data
Test For generalizing detection to artwork.
Detect people
Natural image Art image
71. 4.5. Generalizability
• YOLO : AP degrades less than other methods
• R-CNN : highly overfit to PASCAL VOC.
• Selective Search : highly tuned for natural image
• Only sees small regions
• DPM : AP degrades less than other methods.
But it starts from a lower AP.
Artwork and natural images are very
different on a pixel level But they are similar
in terms of the size and shape of objects.
72. 5. Real-Time Detection In The Wild
• When attachaed to a webcam it functions like a traking
system.
• Demo is available.
73. 6. Conclusion
• A Unified model for object detection
• Simple to construct
• Can be trained directly on full images.
• Generalizes well to new domains
Editor's Notes
*Predict할때 데이터는 어떤식으로?
*Pr(object) : confidence as Pr(Object) * IOU(truth,pred)
왜 2개만 곱하나?
*왜 단점에 box크기 고려안한다고되어있지?
http://www.cnblogs.com/lillylin/p/6207119.html
http://www.cnblogs.com/lillylin/p/6207119.html
DPM : extract static features, classify regions, predict bounding boxes for high scoring regions etc.
https://www.slideshare.net/aurot/googlenet-insights
1*1 convolutional kernel
Channel 별로 linear combination 해주는 효과
Inception module 계산량 대폭 줄여줌
Feature pooling