SlideShare a Scribd company logo
1 of 24
Download to read offline
YOLO9000:
Better, Faster, Stronger
16th July, 2017
Jinwon Lee
Samsung Electronics
Redmon, Joseph, et al. “YOLO9000: Better, Faster, Stronger"
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
Before We Start…
• YOLO
 PR-016 : presented by Taegyun Jeon
 https://youtu.be/eTDcoeqj1_w
• Faster R-CNN
 PR-013 : presented by Jinwon Lee
 https://youtu.be/kcPAGIgBGRs
• Concepts of Distance / Metric
 Terry’s deep learning talk by Terry Taewoong Um
 https://youtu.be/4KXgdf6Bmo4?list=PL0oFI08O71gKEXITQ7OG2SCCX
krtid7Fq
Let’s See the Result First
Let’s See the Result First
YOLO9000 - Better, Faster, and Stronger
• Better
 Batch normalization
 High resolution classifier
 Convolution with anchor boxes
 Dimension clusters
 Direct location prediction
 Fine-grained features
 Multi-scale training
• Faster
 Darknet-19
 Training for classification
 Training for detection
• Stronger
 Hierarchical classification
 Dataset combination with Word-tree
 Joint classification and detection
YOLOv2
YOLO9000
Introduction & Motivation
• Detection frameworks have become increasingly fast and
accurate  However, most detection methods are still
constrained to a small set of objects.
• We would like detection to scale to level of object
classification  However, labelling images for detection is far
more expensive than labelling for classification.
• Maybe we cannot see detection datasets on the same scale
as classification datasets in the near future.
Introduction
• Authors propose a new method to expand the scope of
current detection systems by using classification dataset we
already have.
• Joint training algorithm is also proposed that allows to train
object detectors on both detection and classification data
• First, improving upon the base YOLO detection system to
produce YOLOv2 (SOTA real-time detector)
• Then, using dataset combination method and joint training
algorithm to train model on more than 9000 classes!
To Improve YOLO
• YOLO’s shortcomings relative to SOTA detection systems
 YOLO makes a significant number of localization errors
 YOLO has relatively low recall compared to region proposal based
method
• Focus mainly on improving recall and localization while
maintain classification accuracy.
• It is still very important to detect FAST!
Better
• Batch Normalization
 Adding BN on all of conv layers in YOLO – 2% improvement in mAP
 Removing dropout without overfitting
• High Resolution Classifier
 YOLO trains the classifier network @224x224 and increase resolution
to 448 for detection  The network has to simultaneously switch to
learning object detection and adjust to the new input resolution
 Fine tuning the classifier network @448x448 resolution for 10
epochs on ImageNet – almost 4% improvement in mAP
Better
• Convolution with Anchor Boxes
 YOLO predicts the coordinates of b-boxes directly using fc layers.
 Faster R-CNN predicts b-boxes using hand-picked priors(anchor boxes).
 YOLOv2 removes the fc layers and use anchor boxes to predict
bounding boxes
 Shrinking the network to operate on 416x416 input images instead of
448x448
YOLO’s CNN downsample the image by a factor of 32
Objects, especially large objects, tend to occupy the center of the image, so it is
better to have one cell in the middle than to have four cells.  odd number of
grid cells are required(416x416  13x13)
 Predicting class and objectness for every anchor box
 Using anchor boxes makes a small decrease in accuracy
69.5mAP to 69.2mAP but 81% recall to 88% recall  high recall means there is
more room to improve
Better
• Dimension Clusters
 The problem with anchor boxes is
that they are hand-picked
 Running k-means clustering on the
training set b-boxes to find good
priors
 To prevent larger boxes from making
more error, they use a different
distance metric(not Euclidian distance)
d(box, centroid) = 1 – IOU(box, centroid)
 k=5 is chosen as a good tradeoff
between complexity and high recall
Better
• Direct Location Prediction
 Another problem with anchor
boxes is instability, in RPNs the
anchor box can be anywhere in
the image, regardless of what
location predicted the box
 Instead of predicting offsets,
YOLOv2 predicts locations
relative to the location of the
grid cells
 5 bounding boxes for each cell,
and 5 values for each
bounding box  almost 5%
improvement over the version
with anchor boxes
RCNN’s b-box regression
Different equations
– direct prediction
Better
• Fine-Grained Features
 Modified YOLO uses 13x13 feature maps to predict detections
Good at large object, finer grained features are required for small objects
 Simply adding a passthrough layer that brings features from an
earlier layer at 26x26 resolution
 Turn 26x26x512 feature map into a 13x13x2048 feature map which
can be concatenated with original features
 Detector runs on top of this expanded features
 1% performance increase
512
26
26
13
13
2048
Better
• Multi-Scale Training
 Removing fc layers  can be resized on the fly
 Every 10 batches YOLOv2 randomly choose a new image dimension
size in {320, 352, …, 608}
 Effect of the input size
At low resolution, fair detection performance but runs fast
At high resolution, SOTA detection and slower but still above real time speeds
Further Experiments
• Pascal VOC2012 results
• COCO2015 results
Faster
• Darknet-19
 mostly 3x3 filters (similar to VGG)
 Following the work on Network in Network
(NIN), use global average pooling to make
predictions as well as 1x1 filters to compress
the feature representation between 3x3
convolutions
 Darknet-19: 19 convolutional layers & 5 max-
pooling layers
 5.58 billion operations: 72.9% top-1 & 91.2%
top-5 accuracy (30.69 billion operations &
90.0% top-5 accuracy for VGG-16)
Faster
• Training for classification
 ImageNet 1000 classes for 160 epochs
 Standard data augmentation: random crops, rotations, hue,
saturation, and exposure shifts
 Initial training : 224x224  448x448 fine-tuning for 10 epochs
Higher resolution achieves a top-5 accuracy of 93.3%
• Training for detection
 Adding 3x3 conv layers with 1024 filters each followed by a final 1x1
conv layer
 For VOC, predicting 5 boxes with 5 coordinates each and 20 classes
per box, so 125 filters
 160 epochs with a start learning rate of 10-3 dividing it by 10 at 60
and 90 epochs
Stronger
• Hierarchical classification
 ImageNet labels are pulled from WordNet, a language database that
structures concepts and how they relate
“Norfolk terrier” and “Yorkshire terrier” are both hyponyms of “terrier” which is
a type of “hunting dog”, which is a type of “dog”, which is a “canine”, etc.
 WordNet is structured as a directed graph, not a tree, because
language is complex
To make a tree not a graph, If a concept has tow paths to the root, choose the
shorter path
 Root note is a “Physical Object”
Stronger
• Hierarchical classification
 To perform classification with WordTree, predicting conditional
probabilities at every node for the probability of each hyponym of
that synset given that synset
 Then if a picture of a Norfolk terrier is encountered, its probability is
calculated as below
Stronger
• Hierarchical classification
 To build WordTree1k, adding in all of the intermediate nodes which
expands the label space from 1000 to 1369
 During training, ground truth labels are propagated up
If an image is labelled as a “Norfolk terrier” it also gets labelled as a “dog”
and a “mammal”, etc.
 To compute the conditional probabilities the model predicts a vector
of 1369 values and computes the softmax over all synsets that are
hyponyms of the same concept  90.4% top-5 accuracy
Stronger
• Dataset Combination with
WordTree
 Example of using WordTree to
combine the labels from
ImageNet and COCO.
COCO – general concepts(higher
nodes)
ImageNet – specific concepts(lower
nodes and leaves)
Stronger
• Joint Classification and Detection
 New dataset created using the COCO detection dataset and the top
9000 classes from the full ImageNet release. ImageNet detection
challenge dataset is also added for evaluation
 Using YOLOv2 but only 3 priors instead of 5 to limit output size
 The network sees a detection image, backpropagate loss as normal.
For classification loss, only backpropagate loss at or above the
corresponding level and only backpropagate classification loss
Simply find the b-box that predicts the highest probability for that class and
compute the loss on just its predicted tree
Stronger
• Joint Classification and Detection
 Evaluating YOLO9000 on the ImageNet
detection task
ImageNet only share 44 object categories with COCO
19.7mAP overall with 16.0mAP on the disjoint 156
object classes that it has never seen
 YOLO9000 learns new species of animals well
but struggles with learning categories like
clothing and equipment
Conclusion
• YOLOv2 is fast and accurate
• YOLO9000 is a strong step towards closing the dataset size
gap between detection and classification
• Dataset combination using hierarchical classification would be
useful in the classification and segmentation domains

More Related Content

What's hot

PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)
Jinwon Lee
 

What's hot (20)

You Only Look Once: Unified, Real-Time Object Detection
You Only Look Once: Unified, Real-Time Object DetectionYou Only Look Once: Unified, Real-Time Object Detection
You Only Look Once: Unified, Real-Time Object Detection
 
You only look once: Unified, real-time object detection (UPC Reading Group)
You only look once: Unified, real-time object detection (UPC Reading Group)You only look once: Unified, real-time object detection (UPC Reading Group)
You only look once: Unified, real-time object detection (UPC Reading Group)
 
Recent Progress on Object Detection_20170331
Recent Progress on Object Detection_20170331Recent Progress on Object Detection_20170331
Recent Progress on Object Detection_20170331
 
Anchor free object detection by deep learning
Anchor free object detection by deep learningAnchor free object detection by deep learning
Anchor free object detection by deep learning
 
Yolov3
Yolov3Yolov3
Yolov3
 
Object Detection using Deep Neural Networks
Object Detection using Deep Neural NetworksObject Detection using Deep Neural Networks
Object Detection using Deep Neural Networks
 
Anatomy of YOLO - v1
Anatomy of YOLO - v1Anatomy of YOLO - v1
Anatomy of YOLO - v1
 
YOLOv4: optimal speed and accuracy of object detection review
YOLOv4: optimal speed and accuracy of object detection reviewYOLOv4: optimal speed and accuracy of object detection review
YOLOv4: optimal speed and accuracy of object detection review
 
YOLO v1
YOLO v1YOLO v1
YOLO v1
 
[PR12] You Only Look Once (YOLO): Unified Real-Time Object Detection
[PR12] You Only Look Once (YOLO): Unified Real-Time Object Detection[PR12] You Only Look Once (YOLO): Unified Real-Time Object Detection
[PR12] You Only Look Once (YOLO): Unified Real-Time Object Detection
 
Single Shot Multibox Detector
Single Shot Multibox DetectorSingle Shot Multibox Detector
Single Shot Multibox Detector
 
Yolo
YoloYolo
Yolo
 
SSD: Single Shot MultiBox Detector (UPC Reading Group)
SSD: Single Shot MultiBox Detector (UPC Reading Group)SSD: Single Shot MultiBox Detector (UPC Reading Group)
SSD: Single Shot MultiBox Detector (UPC Reading Group)
 
Object detection with deep learning
Object detection with deep learningObject detection with deep learning
Object detection with deep learning
 
PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)
 
Object detection and Instance Segmentation
Object detection and Instance SegmentationObject detection and Instance Segmentation
Object detection and Instance Segmentation
 
Deep learning for object detection
Deep learning for object detectionDeep learning for object detection
Deep learning for object detection
 
YOLO
YOLOYOLO
YOLO
 
Image Object Detection Pipeline
Image Object Detection PipelineImage Object Detection Pipeline
Image Object Detection Pipeline
 
Yolo v1 urop 발표자료
Yolo v1 urop 발표자료Yolo v1 urop 발표자료
Yolo v1 urop 발표자료
 

Similar to YOLO9000 - PR023

Andrii Belas "Overview of object detection approaches: cases, algorithms and...
Andrii Belas  "Overview of object detection approaches: cases, algorithms and...Andrii Belas  "Overview of object detection approaches: cases, algorithms and...
Andrii Belas "Overview of object detection approaches: cases, algorithms and...
Lviv Startup Club
 

Similar to YOLO9000 - PR023 (20)

MLIP - Chapter 5 - Detection, Segmentation, Captioning
MLIP - Chapter 5 - Detection, Segmentation, CaptioningMLIP - Chapter 5 - Detection, Segmentation, Captioning
MLIP - Chapter 5 - Detection, Segmentation, Captioning
 
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
 
PR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox DetectorPR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox Detector
 
Week5-Faster R-CNN.pptx
Week5-Faster R-CNN.pptxWeek5-Faster R-CNN.pptx
Week5-Faster R-CNN.pptx
 
Faster R-CNN - PR012
Faster R-CNN - PR012Faster R-CNN - PR012
Faster R-CNN - PR012
 
深度學習在AOI的應用
深度學習在AOI的應用深度學習在AOI的應用
深度學習在AOI的應用
 
Review: You Only Look One-level Feature
Review: You Only Look One-level FeatureReview: You Only Look One-level Feature
Review: You Only Look One-level Feature
 
object detection paper review
object detection paper reviewobject detection paper review
object detection paper review
 
Andrii Belas "Overview of object detection approaches: cases, algorithms and...
Andrii Belas  "Overview of object detection approaches: cases, algorithms and...Andrii Belas  "Overview of object detection approaches: cases, algorithms and...
Andrii Belas "Overview of object detection approaches: cases, algorithms and...
 
Find nuclei in images with U-net
Find nuclei in images with U-netFind nuclei in images with U-net
Find nuclei in images with U-net
 
“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...
“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...
“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...
 
SimCLR: A Simple Framework for Contrastive Learning of Visual Representations
SimCLR: A Simple Framework for Contrastive Learning of Visual RepresentationsSimCLR: A Simple Framework for Contrastive Learning of Visual Representations
SimCLR: A Simple Framework for Contrastive Learning of Visual Representations
 
Yolov3
Yolov3Yolov3
Yolov3
 
Object Detection (D2L5 Insight@DCU Machine Learning Workshop 2017)
Object Detection (D2L5 Insight@DCU Machine Learning Workshop 2017)Object Detection (D2L5 Insight@DCU Machine Learning Workshop 2017)
Object Detection (D2L5 Insight@DCU Machine Learning Workshop 2017)
 
Deep Learning in Computer Vision
Deep Learning in Computer VisionDeep Learning in Computer Vision
Deep Learning in Computer Vision
 
ML for detecting products on grocery shelves
ML for detecting products on grocery shelvesML for detecting products on grocery shelves
ML for detecting products on grocery shelves
 
Object Detection - Míriam Bellver - UPC Barcelona 2018
Object Detection - Míriam Bellver - UPC Barcelona 2018Object Detection - Míriam Bellver - UPC Barcelona 2018
Object Detection - Míriam Bellver - UPC Barcelona 2018
 
Densebox
DenseboxDensebox
Densebox
 
FINAL_Team_4.pptx
FINAL_Team_4.pptxFINAL_Team_4.pptx
FINAL_Team_4.pptx
 
A brief introduction to recent segmentation methods
A brief introduction to recent segmentation methodsA brief introduction to recent segmentation methods
A brief introduction to recent segmentation methods
 

More from Jinwon Lee

PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
Jinwon Lee
 
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...
Jinwon Lee
 
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionPR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
Jinwon Lee
 
PR-297: Training data-efficient image transformers & distillation through att...
PR-297: Training data-efficient image transformers & distillation through att...PR-297: Training data-efficient image transformers & distillation through att...
PR-297: Training data-efficient image transformers & distillation through att...
Jinwon Lee
 
PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be...
PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be...PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be...
PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be...
Jinwon Lee
 
PR243: Designing Network Design Spaces
PR243: Designing Network Design SpacesPR243: Designing Network Design Spaces
PR243: Designing Network Design Spaces
Jinwon Lee
 
PR-197: One ticket to win them all: generalizing lottery ticket initializatio...
PR-197: One ticket to win them all: generalizing lottery ticket initializatio...PR-197: One ticket to win them all: generalizing lottery ticket initializatio...
PR-197: One ticket to win them all: generalizing lottery ticket initializatio...
Jinwon Lee
 

More from Jinwon Lee (20)

PR-366: A ConvNet for 2020s
PR-366: A ConvNet for 2020sPR-366: A ConvNet for 2020s
PR-366: A ConvNet for 2020s
 
PR-355: Masked Autoencoders Are Scalable Vision Learners
PR-355: Masked Autoencoders Are Scalable Vision LearnersPR-355: Masked Autoencoders Are Scalable Vision Learners
PR-355: Masked Autoencoders Are Scalable Vision Learners
 
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
 
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...
 
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionPR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
 
PR-297: Training data-efficient image transformers & distillation through att...
PR-297: Training data-efficient image transformers & distillation through att...PR-297: Training data-efficient image transformers & distillation through att...
PR-297: Training data-efficient image transformers & distillation through att...
 
PR-270: PP-YOLO: An Effective and Efficient Implementation of Object Detector
PR-270: PP-YOLO: An Effective and Efficient Implementation of Object DetectorPR-270: PP-YOLO: An Effective and Efficient Implementation of Object Detector
PR-270: PP-YOLO: An Effective and Efficient Implementation of Object Detector
 
PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be...
PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be...PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be...
PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be...
 
PR243: Designing Network Design Spaces
PR243: Designing Network Design SpacesPR243: Designing Network Design Spaces
PR243: Designing Network Design Spaces
 
PR-231: A Simple Framework for Contrastive Learning of Visual Representations
PR-231: A Simple Framework for Contrastive Learning of Visual RepresentationsPR-231: A Simple Framework for Contrastive Learning of Visual Representations
PR-231: A Simple Framework for Contrastive Learning of Visual Representations
 
PR-217: EfficientDet: Scalable and Efficient Object Detection
PR-217: EfficientDet: Scalable and Efficient Object DetectionPR-217: EfficientDet: Scalable and Efficient Object Detection
PR-217: EfficientDet: Scalable and Efficient Object Detection
 
PR-197: One ticket to win them all: generalizing lottery ticket initializatio...
PR-197: One ticket to win them all: generalizing lottery ticket initializatio...PR-197: One ticket to win them all: generalizing lottery ticket initializatio...
PR-197: One ticket to win them all: generalizing lottery ticket initializatio...
 
PR-183: MixNet: Mixed Depthwise Convolutional Kernels
PR-183: MixNet: Mixed Depthwise Convolutional KernelsPR-183: MixNet: Mixed Depthwise Convolutional Kernels
PR-183: MixNet: Mixed Depthwise Convolutional Kernels
 
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural NetworksPR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
 
PR-155: Exploring Randomly Wired Neural Networks for Image Recognition
PR-155: Exploring Randomly Wired Neural Networks for Image RecognitionPR-155: Exploring Randomly Wired Neural Networks for Image Recognition
PR-155: Exploring Randomly Wired Neural Networks for Image Recognition
 
PR-144: SqueezeNext: Hardware-Aware Neural Network Design
PR-144: SqueezeNext: Hardware-Aware Neural Network DesignPR-144: SqueezeNext: Hardware-Aware Neural Network Design
PR-144: SqueezeNext: Hardware-Aware Neural Network Design
 
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
 
PR-108: MobileNetV2: Inverted Residuals and Linear Bottlenecks
PR-108: MobileNetV2: Inverted Residuals and Linear BottlenecksPR-108: MobileNetV2: Inverted Residuals and Linear Bottlenecks
PR-108: MobileNetV2: Inverted Residuals and Linear Bottlenecks
 
PR095: Modularity Matters: Learning Invariant Relational Reasoning Tasks
PR095: Modularity Matters: Learning Invariant Relational Reasoning TasksPR095: Modularity Matters: Learning Invariant Relational Reasoning Tasks
PR095: Modularity Matters: Learning Invariant Relational Reasoning Tasks
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unit
 

Recently uploaded

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational Performance
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Quantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation ComputingQuantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation Computing
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Navigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern EnterpriseNavigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern Enterprise
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 

YOLO9000 - PR023

  • 1. YOLO9000: Better, Faster, Stronger 16th July, 2017 Jinwon Lee Samsung Electronics Redmon, Joseph, et al. “YOLO9000: Better, Faster, Stronger" Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
  • 2. Before We Start… • YOLO  PR-016 : presented by Taegyun Jeon  https://youtu.be/eTDcoeqj1_w • Faster R-CNN  PR-013 : presented by Jinwon Lee  https://youtu.be/kcPAGIgBGRs • Concepts of Distance / Metric  Terry’s deep learning talk by Terry Taewoong Um  https://youtu.be/4KXgdf6Bmo4?list=PL0oFI08O71gKEXITQ7OG2SCCX krtid7Fq
  • 3. Let’s See the Result First
  • 4. Let’s See the Result First
  • 5. YOLO9000 - Better, Faster, and Stronger • Better  Batch normalization  High resolution classifier  Convolution with anchor boxes  Dimension clusters  Direct location prediction  Fine-grained features  Multi-scale training • Faster  Darknet-19  Training for classification  Training for detection • Stronger  Hierarchical classification  Dataset combination with Word-tree  Joint classification and detection YOLOv2 YOLO9000
  • 6. Introduction & Motivation • Detection frameworks have become increasingly fast and accurate  However, most detection methods are still constrained to a small set of objects. • We would like detection to scale to level of object classification  However, labelling images for detection is far more expensive than labelling for classification. • Maybe we cannot see detection datasets on the same scale as classification datasets in the near future.
  • 7. Introduction • Authors propose a new method to expand the scope of current detection systems by using classification dataset we already have. • Joint training algorithm is also proposed that allows to train object detectors on both detection and classification data • First, improving upon the base YOLO detection system to produce YOLOv2 (SOTA real-time detector) • Then, using dataset combination method and joint training algorithm to train model on more than 9000 classes!
  • 8. To Improve YOLO • YOLO’s shortcomings relative to SOTA detection systems  YOLO makes a significant number of localization errors  YOLO has relatively low recall compared to region proposal based method • Focus mainly on improving recall and localization while maintain classification accuracy. • It is still very important to detect FAST!
  • 9. Better • Batch Normalization  Adding BN on all of conv layers in YOLO – 2% improvement in mAP  Removing dropout without overfitting • High Resolution Classifier  YOLO trains the classifier network @224x224 and increase resolution to 448 for detection  The network has to simultaneously switch to learning object detection and adjust to the new input resolution  Fine tuning the classifier network @448x448 resolution for 10 epochs on ImageNet – almost 4% improvement in mAP
  • 10. Better • Convolution with Anchor Boxes  YOLO predicts the coordinates of b-boxes directly using fc layers.  Faster R-CNN predicts b-boxes using hand-picked priors(anchor boxes).  YOLOv2 removes the fc layers and use anchor boxes to predict bounding boxes  Shrinking the network to operate on 416x416 input images instead of 448x448 YOLO’s CNN downsample the image by a factor of 32 Objects, especially large objects, tend to occupy the center of the image, so it is better to have one cell in the middle than to have four cells.  odd number of grid cells are required(416x416  13x13)  Predicting class and objectness for every anchor box  Using anchor boxes makes a small decrease in accuracy 69.5mAP to 69.2mAP but 81% recall to 88% recall  high recall means there is more room to improve
  • 11. Better • Dimension Clusters  The problem with anchor boxes is that they are hand-picked  Running k-means clustering on the training set b-boxes to find good priors  To prevent larger boxes from making more error, they use a different distance metric(not Euclidian distance) d(box, centroid) = 1 – IOU(box, centroid)  k=5 is chosen as a good tradeoff between complexity and high recall
  • 12. Better • Direct Location Prediction  Another problem with anchor boxes is instability, in RPNs the anchor box can be anywhere in the image, regardless of what location predicted the box  Instead of predicting offsets, YOLOv2 predicts locations relative to the location of the grid cells  5 bounding boxes for each cell, and 5 values for each bounding box  almost 5% improvement over the version with anchor boxes RCNN’s b-box regression Different equations – direct prediction
  • 13. Better • Fine-Grained Features  Modified YOLO uses 13x13 feature maps to predict detections Good at large object, finer grained features are required for small objects  Simply adding a passthrough layer that brings features from an earlier layer at 26x26 resolution  Turn 26x26x512 feature map into a 13x13x2048 feature map which can be concatenated with original features  Detector runs on top of this expanded features  1% performance increase 512 26 26 13 13 2048
  • 14. Better • Multi-Scale Training  Removing fc layers  can be resized on the fly  Every 10 batches YOLOv2 randomly choose a new image dimension size in {320, 352, …, 608}  Effect of the input size At low resolution, fair detection performance but runs fast At high resolution, SOTA detection and slower but still above real time speeds
  • 15. Further Experiments • Pascal VOC2012 results • COCO2015 results
  • 16. Faster • Darknet-19  mostly 3x3 filters (similar to VGG)  Following the work on Network in Network (NIN), use global average pooling to make predictions as well as 1x1 filters to compress the feature representation between 3x3 convolutions  Darknet-19: 19 convolutional layers & 5 max- pooling layers  5.58 billion operations: 72.9% top-1 & 91.2% top-5 accuracy (30.69 billion operations & 90.0% top-5 accuracy for VGG-16)
  • 17. Faster • Training for classification  ImageNet 1000 classes for 160 epochs  Standard data augmentation: random crops, rotations, hue, saturation, and exposure shifts  Initial training : 224x224  448x448 fine-tuning for 10 epochs Higher resolution achieves a top-5 accuracy of 93.3% • Training for detection  Adding 3x3 conv layers with 1024 filters each followed by a final 1x1 conv layer  For VOC, predicting 5 boxes with 5 coordinates each and 20 classes per box, so 125 filters  160 epochs with a start learning rate of 10-3 dividing it by 10 at 60 and 90 epochs
  • 18. Stronger • Hierarchical classification  ImageNet labels are pulled from WordNet, a language database that structures concepts and how they relate “Norfolk terrier” and “Yorkshire terrier” are both hyponyms of “terrier” which is a type of “hunting dog”, which is a type of “dog”, which is a “canine”, etc.  WordNet is structured as a directed graph, not a tree, because language is complex To make a tree not a graph, If a concept has tow paths to the root, choose the shorter path  Root note is a “Physical Object”
  • 19. Stronger • Hierarchical classification  To perform classification with WordTree, predicting conditional probabilities at every node for the probability of each hyponym of that synset given that synset  Then if a picture of a Norfolk terrier is encountered, its probability is calculated as below
  • 20. Stronger • Hierarchical classification  To build WordTree1k, adding in all of the intermediate nodes which expands the label space from 1000 to 1369  During training, ground truth labels are propagated up If an image is labelled as a “Norfolk terrier” it also gets labelled as a “dog” and a “mammal”, etc.  To compute the conditional probabilities the model predicts a vector of 1369 values and computes the softmax over all synsets that are hyponyms of the same concept  90.4% top-5 accuracy
  • 21. Stronger • Dataset Combination with WordTree  Example of using WordTree to combine the labels from ImageNet and COCO. COCO – general concepts(higher nodes) ImageNet – specific concepts(lower nodes and leaves)
  • 22. Stronger • Joint Classification and Detection  New dataset created using the COCO detection dataset and the top 9000 classes from the full ImageNet release. ImageNet detection challenge dataset is also added for evaluation  Using YOLOv2 but only 3 priors instead of 5 to limit output size  The network sees a detection image, backpropagate loss as normal. For classification loss, only backpropagate loss at or above the corresponding level and only backpropagate classification loss Simply find the b-box that predicts the highest probability for that class and compute the loss on just its predicted tree
  • 23. Stronger • Joint Classification and Detection  Evaluating YOLO9000 on the ImageNet detection task ImageNet only share 44 object categories with COCO 19.7mAP overall with 16.0mAP on the disjoint 156 object classes that it has never seen  YOLO9000 learns new species of animals well but struggles with learning categories like clothing and equipment
  • 24. Conclusion • YOLOv2 is fast and accurate • YOLO9000 is a strong step towards closing the dataset size gap between detection and classification • Dataset combination using hierarchical classification would be useful in the classification and segmentation domains