Deep Learning Techniques for
Object Detection and Recognition
Chu-Song Chen
Outline
● Computer Vision
● Image Classification and Object Detection
● Crowdsourcing + Machine Learning
o ImageNet + ILSVRC Challenge
o Deep Convolutional Nets
● Recent Advances and Results
Computer Vision
● Research on methods for acquiring, processing,
analyzing, and understanding images and, in general, high-
dimensional data from the real world, in order to produce
numerical or symbolic information, e.g., in the form of
decisions.
Object Detection & Recognition
● Object recognition is one of the main tasks in
computer vision.
(Figures: semantic segmentation vs. object detection)
What is object detection?
● Image classification
● Object localization
● Object detection
● Segmentation
(listed in order of increasing difficulty)
Why is object detection important?
● Perception is one of the biggest bottlenecks of
○ Robotics
○ Self-driving cars
○ Surveillance
Applications
● Image classification
○ image search (Google, Baidu, Bing)
● Object detection
○ face
■ smart phone/cameras
■ detecting duplicate votes in elections
■ CCTV
■ border control
■ casinos
■ visa processing
■ crime solving
■ prosopagnosia (face blindness)
○ objects
■ license plates
■ pedestrian detection (Daimler, MobileEye):
● warning and automatic braking, reducing accidents
and their severity
■ vehicle detection for forward collision
warning (MobileEye)
■ traffic sign detection (MobileEye)
○ E-commerce
○ machine inspection
Machine Learning & Computer
Vision
● How to achieve object recognition?
o Typically through machine learning in computer vision.
● Training stage:
o Collect training sample images.
o Learn an object detector.
● Inference stage: Employ the learned detector for detection.
o Take pedestrian detection as an example:
Pedestrian detection: training phase
(traditional approach)
● Collecting training data
o Extracting features (or casting data into a feature space).
■ color, edge, gradient, silhouette, dimension reduction, etc.
o Learning an object detector/classifier
■ Many learning methods: e.g., Neural Networks, SVM,
Boosting, Cascaded AdaBoost, random forest.
(Figures: positive and negative training data)
Pedestrian detection: testing phase
(traditional approach)
● After learning a human detector
o A detection window can be used to scan the testing image
along x and y directions for human detection.
Pedestrian detection: inference phase
● Human detection
o Detection windows with different sizes are used to detect
humans with different scales.
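A minimal sketch of this multi-scale scan in Python/NumPy (the detector classify_window, the window size, stride, and scales are all illustrative assumptions, not from the slides):

# A minimal sketch of multi-scale sliding-window detection (NumPy only).
# `classify_window` is a hypothetical stand-in for the learned detector.
import numpy as np

def sliding_window_detect(image, classify_window, win=64, stride=16,
                          scales=(1.0, 0.75, 0.5), thresh=0.5):
    """Scan `image` along x and y at several scales; return hit boxes."""
    detections = []
    for s in scales:
        h, w = int(image.shape[0] * s), int(image.shape[1] * s)
        # Nearest-neighbour resize, to keep the sketch dependency-free.
        rows = (np.arange(h) / s).astype(int)
        cols = (np.arange(w) / s).astype(int)
        scaled = image[rows][:, cols]
        for y in range(0, h - win + 1, stride):
            for x in range(0, w - win + 1, stride):
                patch = scaled[y:y + win, x:x + win]
                if classify_window(patch) > thresh:
                    # Map the box back to original-image coordinates.
                    detections.append((int(x / s), int(y / s), int(win / s)))
    return detections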
Difficulties for object recognition
● Object recognition
o To a human: an image and an image block.
o To a machine: an array of real numbers.
Past breakthroughs in object
detection research
o Face detection: Haar
feature + AdaBoost
learning. (2000)
● Every mobile phone is
equipped with this function now.
o SIFT and HOG: locally
discriminative features
(2004), combined with SVMs
for object detection.
● A key component of RGB vision-based
positioning and localization.
Examples of several breakthroughs
in object detection research
● Deformable part models (2008):
o HOG feature
o Latent SVM + stochastic gradient descent (SGD) training
o Training scale of the above: 5K ~ 20K training images.
General object recognition
o The above methods brought many ingredients to applications.
o However, they still struggle to achieve general object
detection/recognition.
● Recent big breakthroughs in object detection
come from crowdsourcing + machine learning:
o More labeled training data are gathered via Amazon Mechanical
Turk.
o More suitable machine learning techniques: deep
convolutional neural networks (CNNs).
Artificial neural networks and deep
learning
● Why deep learning?
o A limitation of traditional methods: feature
extraction and classifier training are two independent
processes.
o One motivation of deep learning is to join feature
extraction and classification into a single framework.
o This leads to a large number of parameters. However,
when the number of training images is huge, the issue of
over-fitting is lessened.
o Deep learning: end-to-end learning.
That is, feature extraction + classification in a single step.
slide: R. Fergus
slide: R. Fergus
slide: Honglak Lee
Artificial neural networks and deep
learning
● Deep learning stems from artificial neural networks.
● There are many deep learning architectures.
● Among them, deep convolutional networks (CNNs)
perform best on recognition tasks.
● In the following, we will review convolutional neural
networks (CNN) for
o image classification
o object detection
Convolutional Neural Networks
● CNN: a neural network consisting of
o fully-connected layer
o convolution layer
o max-pooling
o nonlinear activation (ReLU or sigmoid)
o ………
Fully-connected layers
● If the input is an image, the fully connected layer
will have a huge number of links between layers:
● The weights are required to be learned.
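● For example, fully connecting a 32 × 32 × 3 image to 1000 hidden
units already requires 32 × 32 × 3 × 1000 ≈ 3.1 million weights.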
Convolution layer
● Instead of full connection, slide a k × k window over the
image and perform an inner product at every site.
● That is, apply a k × k FIR filter, or convolution, to the image.
● The coefficients are required to be learned.
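A minimal NumPy sketch of this operation (the 3 × 3 filter here is a hand-picked example, not a learned one):

# A minimal sketch of a k x k "FIR filter" slid over a grayscale image:
# at every site, take the inner product of the window and the filter
# (this is cross-correlation; flipping the kernel would give convolution).
import numpy as np

def conv2d_valid(image, kernel):
    k = kernel.shape[0]
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            out[y, x] = np.sum(image[y:y + k, x:x + k] * kernel)
    return out

edge = np.array([[-1., 0., 1.]] * 3)        # a hand-crafted 3x3 filter
response = conv2d_valid(np.random.rand(8, 8), edge)
print(response.shape)                        # (6, 6)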
Convolution vs. fully-connection
● Convolutional layer:
o Shared weights
o Shift invariance
o Local
(Figures: convolutional layer vs. fully connected layer)
Multiple FIR filters in a convolutional layer
● Often there are multiple FIR filters in a convolutional layer.
● The filters' outputs serve as the inputs of the next layer.
● So, if the number of filters used in a convolutional layer
is c_l, the output of this layer forms an n_l × n_l × c_l volume.
Multiple “volume” FIR filters
● So, the output of the convolutional layer has c_l
channels, forming an n_l × n_l × c_l volume.
● Actually, the FIR filters applied in a CNN are
of size k × k × c_l (though we usually
abbreviate this as k × k for simplicity); each is
indeed a "volume" FIR filter.
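A quick shape check in PyTorch (the sizes are illustrative assumptions) confirms both points: the layer's weight tensor holds a stack of volume filters, and its output is a multi-channel volume:

# Shape check with PyTorch: a layer with c_l = 16 filters applied to a
# 3-channel input; each filter is really a 5 x 5 x 3 "volume".
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, padding=2)
x = torch.randn(1, 3, 32, 32)     # one 32 x 32 RGB image
print(conv.weight.shape)          # torch.Size([16, 3, 5, 5])
print(conv(x).shape)              # torch.Size([1, 16, 32, 32])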
Input: an RGB (3-channel) image of size N × N
● E.g., N = 32: input to the first convolutional layer, which has 5
filters.
● E.g., N = 40: input to a cascade of convolutional layers, a
fully connected layer, and the final output layer (the entire network).
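A toy PyTorch version of the N = 40 pipeline (all layer sizes here are illustrative assumptions, not the network in the slides):

# A toy version of the slide's pipeline for a 40 x 40 RGB input:
# conv -> ReLU -> pool, twice, then one fully connected layer and an
# output layer. All sizes here are illustrative.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),   # 40 -> 20
    nn.Conv2d(8, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),  # 20 -> 10
    nn.Flatten(),
    nn.Linear(16 * 10 * 10, 128), nn.ReLU(),   # fully connected layer
    nn.Linear(128, 10),                        # output layer (10 classes)
)
print(net(torch.randn(1, 3, 40, 40)).shape)    # torch.Size([1, 10])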
A single
neuron
o Activation function examples:
o sigmoid
o ReLU
Nonlinear activation function
● A nonlinear activation is necessary; otherwise cascaded linear
layers can be replaced by a single equivalent layer.
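A quick numeric check of this fact (NumPy; the sizes are arbitrary):

# Two cascaded linear (no-activation) layers are equivalent to a
# single linear layer with W = W2 @ W1.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((5, 8)), rng.standard_normal((3, 5))
x = rng.standard_normal(8)
print(np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x))   # True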
Pooling for dimension (size)
reduction
● Reduces the feature-map size; otherwise the number of weights
in later layers will still be huge.
■ Summarizes the input.
● E.g., max pooling.
Max pooling layer (cont.)
● After max pooling, the size (i.e., dimension) of the feature
map is reduced.
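A minimal NumPy sketch of 2 × 2 max pooling with stride 2:

# 2 x 2 max pooling, stride 2: each output value is the max of a
# 2 x 2 block, so the feature map's height and width are halved.
import numpy as np

def max_pool_2x2(fmap):
    h, w = fmap.shape[0] // 2, fmap.shape[1] // 2
    return fmap[:2 * h, :2 * w].reshape(h, 2, w, 2).max(axis=(1, 3))

fmap = np.arange(16).reshape(4, 4)
print(max_pool_2x2(fmap))   # [[ 5  7] [13 15]]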
Why are ConvNets good for detection?
● Sharing parameters is good
○ taking advantage of local coherence to learn a more efficient representation:
■ no redundancy
■ translation invariance
■ slight rotation invariance with pooling
● Efficient for detection:
○ all computations are shared
○ can handle varying input sizes (no need to relearn weights for new sizes)
● ConvNets are convolutional all the way up, including the fully connected layers
(see the sketch below)
slide: Pierre Sermanet
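To make the last point concrete, here is a small PyTorch sketch (an illustrative assumption, not from the slides): a fully connected layer over a fixed-size feature map can be rewritten as a convolution, so the same weights accept larger inputs and emit a spatial map of scores.

# A fully connected layer over a 5 x 5 x 16 map is the same as a
# 5 x 5 convolution, so the net can also swallow larger images and
# emit a *map* of class scores instead of a single vector.
import torch
import torch.nn as nn

fc_as_conv = nn.Conv2d(16, 10, kernel_size=5)        # "FC" layer, 10 classes
print(fc_as_conv(torch.randn(1, 16, 5, 5)).shape)    # torch.Size([1, 10, 1, 1])
print(fc_as_conv(torch.randn(1, 16, 9, 9)).shape)    # torch.Size([1, 10, 5, 5])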
Big-data training images from Internet
● ILSVRC competition (ImageNet Challenge)
o ImageNet: collecting images according to the WordNet tree.
o ILSVRC: choosing words in different tree branches.
ILSVRC Image classification
challenge
Fine tuning
● ILSVRC (ImageNet challenge) is a large
dataset with diverse object classes.
● Using the pre-trained weights on ILSVRC for
fine-tuning is a popular strategy.
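A minimal sketch of that strategy in PyTorch/torchvision (the backbone choice and class count here are illustrative assumptions):

# A common fine-tuning recipe (sketch): start from ImageNet-pretrained
# weights, replace the classification head for the new task, and train
# only the new head (optionally unfreezing more layers later).
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False                         # freeze the backbone
model.fc = nn.Linear(model.fc.in_features, 9)       # e.g., 9 new classes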
Winner of ILSVRC 2012 image
classification: AlexNet
• 5 convolutional layers, 3 fully-connected layers
• The number of neurons in each layer is given by 253440, 186624, 64896, 64896, 43264,
4096, 4096, 1000.
● This was made possible by:
○ fast hardware: GPU-optimized code
○ big dataset: 1.2 million images vs thousands before
○ better regularization: dropout
Winner of ILSVRC 2014 image
classification: GoogLeNet
● Inception: the basic
building block of
GoogLeNet
● GoogLeNet: many later
versions exist. (Here, 7
inception modules.)
(Figure: a single inception module)
ILSVRC 2014 single-net best performer –
VGG network (11-19 layers)
Design criteria:
● Use 3 × 3 filters (to find small details in every layer).
● Max-pooling (halves the height and width of the feature map),
plus doubling the number of feature maps by doubling the filters.
ILSVRC 2015 winner – Residual
network (50-151 layers)
Design criteria:
● Add short-cut links.
● Fully connected layer → average pooling.
● Use batch normalization.
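A minimal PyTorch sketch of these criteria in a basic residual block (layer sizes are illustrative):

# A basic residual block: a short-cut link adds the input back to the
# output of the convolutional branch; batch normalization throughout.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        y = self.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return self.relu(y + x)                    # the short-cut link

print(ResidualBlock(16)(torch.randn(1, 16, 8, 8)).shape)  # [1, 16, 8, 8]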
From image classification to object
detection
● The above CNNs are designed for image classification (i.e., they
assume only one concept is contained in the input image).
● However, they serve as important building blocks for
feature extraction, and can be migrated to a new architecture
for object detection.
(Figures: image classification task vs. object detection task)
Object detection CNNs
● RCNN – Fast RCNN – Faster RCNN
● R-FCN
● SSD
● PVANet
● YOLO v2
● ……
R-CNN
●R-CNN: Regions with CNN features
Koen E. A. van de Sande, Jasper R. R.
Uijlings, Theo Gevers, and Arnold W. M. Smeulders,
"Segmentation as Selective Search for
Object Recognition," ICCV 2011.
● Scan the input image for possible objects using an algorithm called Selective
Search, generating ~2000 region proposals
● Run a convolutional neural net (CNN) on top of each of these region proposals.
The CNN is pre-trained on ImageNet and fine-tuned here.
● Take the output of each CNN and feed it into a) an SVM to classify the region
and b) a regressor to tighten the bounding box of the object, if such an object
exists.
● Bounding-box regression: output the center
and size of a tight bounding box of the object.
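As a concrete sketch of this parameterization (following the widely used R-CNN formulation; the variable names are ours), the regressor is trained to predict normalized offsets from the proposal to the ground-truth box:

# The standard R-CNN box-regression targets: a scale-invariant shift of
# the proposal's center and a log-space scaling of its width/height.
import numpy as np

def regression_targets(proposal, gt):
    px, py, pw, ph = proposal        # center x, center y, width, height
    gx, gy, gw, gh = gt
    return np.array([(gx - px) / pw,        # t_x
                     (gy - py) / ph,        # t_y
                     np.log(gw / pw),       # t_w
                     np.log(gh / ph)])      # t_h

print(regression_targets((50, 50, 20, 40), (55, 52, 24, 44)))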
Fast RCNN
● Generate region proposals based on the last feature map of the network, not from the
original image itself. As a result, we can train just one CNN for the entire image.
● The CNN is fine-tuned from the image classification network pre-trained on ImageNet.
● However, selective search in the original image is still needed.
● Without using SVMs: replace the SVMs with the CNN output.
Faster RCNN: region proposal network
● At the last layer of an initial CNN, a 3x3 sliding window moves across the feature map
and maps it to a lower dimension (e.g., 256-d).
● For each sliding-window location, it generates multiple possible regions based on k
fixed-ratio anchor boxes (default bounding boxes), as sketched below.
● Each region proposal consists of a) an "objectness" score for that region and b) 4
coordinates representing the bounding box of the region.
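A small sketch of anchor generation at one sliding-window location (the scales and aspect ratios are illustrative assumptions):

# k anchor boxes of several scales and aspect ratios ("a tall, a wide,
# and a large box"), all centered at the same point.
import numpy as np

def anchors_at(cx, cy, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    boxes = []
    for s in scales:
        for r in ratios:
            w, h = s * np.sqrt(r), s / np.sqrt(r)   # w/h = r, area = s^2
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)          # k = 9 anchors per location

print(anchors_at(100, 100).shape)   # (9, 4)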
● The main insight of Faster R-CNN was to replace the slow selective search algorithm
with a fast neural net. Specifically, it introduced the region proposal network (RPN).
● Faster R-CNN = RPN + Fast R-CNN
● In other words, look at each location in our last feature map and consider k boxes
centered around it: a tall one, a wide one, a large one, etc. For each of those boxes,
output whether or not we think it contains an object, and the coordinates of that box.
● Feed the proposal into what is essentially a Fast R-CNN.
● The bottom CNN is shared by both the region proposal network of Faster RCNN and
the bounding-box-regression/object-classification head of Fast RCNN.
Results of Faster RCNN
SSD
● Region proposal and classification are trained simultaneously, unlike Faster
RCNN, where they are trained alternately.
● Early convolutional layers are also used. Early layers correspond to smaller
objects, and later layers correspond to larger objects.
● Faster than Faster RCNN, with even better performance.
YOLO v2 (CVPR 2017)
● Modified from Faster RCNN and YOLO
o use batch normalization; remove dropout.
o pretrain a higher-resolution CNN classifier: from 224 ×
224 to 448 × 448.
o use 9000 classes in ImageNet for pre-training, instead
of 1000.
o direct location prediction: solves the instability in the
bounding box regression of Faster RCNN.
● State-of-the-art on standard detection tasks like the PASCAL
VOC and COCO datasets.
o At 67 FPS, YOLOv2 gets 76.8 mAP on VOC 2007. At 40 FPS, YOLOv2 gets
78.6 mAP, outperforming state-of-the-art methods like Faster RCNN with ResNet
and SSD while still running significantly faster.
Applications: Faster RCNN for clothing
detection
Dataset
● Total 9,667 images:
o 1,964 images annotated by ourselves
o 7,703 images with bounding-box annotations from a public
dataset (ATR)
Dataset (Cont’d)
● 9 categories:
o bag, belt, dress, footwear, glasses, hat, pants, skirt,
upperclothes
● # bounding boxes per category
LabelMe Annotation Tool
● A web-based tool to
create bounding boxes
and assign labels.
(Figures: our annotated data vs. ATR)
Detection Results
Approach: Faster RCNN
Quantitative Results
● Metric: mAP (mean Average Precision)
● A detection is considered correct if its IoU
(intersection over union) with ground truth ≥
0.5 and its label is correct.
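For reference, a minimal IoU computation (a sketch; the (x1, y1, x2, y2) box format is our assumption):

# IoU between two boxes given as (x1, y1, x2, y2); a detection counts as
# correct when IoU with the ground truth is >= 0.5 and the label matches.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))   # 0.333...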
● Detection performance:
❏ Performs better on larger items, e.g., upperclothes, dress, pants.
❏ Belts are very difficult to detect.
Summary
● The clothing-item detector trained with
bounding-box annotations can produce
satisfactory results, even when only a small set of
training data is used.
● Training data is an issue: it is time-consuming
to obtain ground-truth bounding boxes.
Face detection
● Face-detection CNN: trained on a large-
scale face image dataset following similar ideas.
● We show that the face detector can be
realized on a CPU-based machine, Zenbo.
Deep CNN face detection/alignment
on Zenbo
● Zenbo specifications
o CPU: Intel Atom x5-Z8550 2.4 GHz
o OS: Android 6.0.1
o RAM: 4 GB
o No GPU used
● Frames per second
o 2.5 FPS [resolution 640x480]
● Code optimizations
o C++ and the OpenBLAS library
o Multi-threaded computation
o No deep learning frameworks such as
TensorFlow or PyTorch used
Stingray detection from aerial drone footage over the ocean
● Chien-Hung Chen's master thesis (Dept. of
Mech. & Elec. Mach. Eng., NSYSU);
● advisor: Prof. Keng-Hao Liu
A difficult problem: humans may fail to track all
the stingrays successfully.
● Using Faster RCNN to train and detect
■ base net: ZF or VGG
■ Detection on video; using consecutive
frames to refine the results (a possible scheme is sketched below).
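The slides do not spell out how the "time information" is used; as a purely hypothetical sketch, one could keep a detection only when a sufficiently overlapping box also appears in a neighbouring frame, reusing an IoU function like the one shown earlier:

# Hypothetical temporal filtering: a detection survives only if an
# overlapping box appears in the previous or next frame as well.
def temporally_consistent(dets_per_frame, iou_fn, min_iou=0.3):
    kept = []
    for t, dets in enumerate(dets_per_frame):
        prev = dets_per_frame[t - 1] if t > 0 else []
        nxt = dets_per_frame[t + 1] if t + 1 < len(dets_per_frame) else []
        kept.append([d for d in dets
                     if any(iou_fn(d, o) >= min_iou for o in prev + nxt)])
    return kept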
Demo (close range)
(Videos: ZF model; VGG model; ZF model with time information;
VGG model with time information)
Demo (distant range)
(Videos: ZF model; VGG model; ZF model with time information;
VGG model with time information)
Demo (hard case)
(Videos: ZF model; VGG model; ZF model with time information;
VGG model with time information)
Quantitative Results
● In the ground truth, some stingray sequences
detected by our method were not marked by humans.
● After re-investigating these cases, the human
experts re-marked them as ground truth.
Results on some videos
Applications of deep CNN detectors
● Deep CNN object detection techniques have
grown very fast in recent years. Several
promising models have been developed.
● The methods can be used for machine
inspection.
● Preparing data (with ground-truth regions)
would be an issue.
o Make the data types diverse.
o If only a few images with labeled regions can be collected,
augmenting the data with perturbations (e.g., flipping,
rotation, cropping, lighting changes, blurring, sharpening,
JPEG compression, etc.) is a useful technique for training;
see the sketch below.
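A typical augmentation pipeline sketched with torchvision transforms (the parameter values are illustrative; for detection, box coordinates must be transformed along with the image, which is omitted here):

# A sketch of the perturbations listed above, applied randomly at
# training time to enlarge a small labeled dataset.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                # flipping
    transforms.RandomRotation(degrees=10),                 # rotation
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),  # cropping
    transforms.ColorJitter(brightness=0.3, contrast=0.3),  # lighting changes
    transforms.GaussianBlur(kernel_size=3),                # blurring
    transforms.ToTensor(),
])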
Acknowledgement
Part of the slides are from the CVPR 2014 tutorial, Deep
Learning for Computer Vision.