Deep Learning Techniques for
Object Detection and Recognition
Chu-Song Chen
Outline
● Computer Vision
● Image Classification and Object Detection
● Crowdsourcing + Machine Learning
o ImageNet + ILSVRC Challenge
o Deep Convolutional Nets
● Recent Advances and Results
Computer Vision
● Research on methods for acquiring, processing,
analyzing, and understanding images and, in general, high-
dimensional data from the real world, in order to produce
numerical or symbolic information, e.g., in the form of
decisions.
Object Detection & Recognition
● Object recognition is one of the main tasks in
computer vision.
(Figures: semantic segmentation vs. object detection)
What is object detection?
● Image classification
● Object localization
● Object detection
● Segmentation
(listed in order of increasing difficulty)
Why is object detection important?
● Perception is one of the biggest bottlenecks of
○ Robotics
○ Self-driving cars
○ Surveillance
Applications
● Image classification
○ image search (Google, Baidu, Bing)
● Object detection
○ face
■ smart phone/cameras
■ detecting duplicate votes in elections
■ CCTV
■ border control
■ casinos
■ visa processing
■ crime solving
■ prosopagnosia (face blindness)
○ objects
■ license plates
■ pedestrian detection (Daimler, MobileEye):
● warning and automatic braking, reducing accidents
and their severity
■ vehicle detection for forward collision
warning (MobileEye)
■ traffic sign detection (MobileEye)
○ E-commerce
○ machine inspection
Machine Learning & Computer
Vision
● How to achieve object recognition?
o Typically through machine learning in computer vision.
● Training stage:
o Collect training sample images.
o Learn an object detector.
● Inference stage: Employ the learned detector for detection.
o Take pedestrian detection as an example:
Pedestrian detection: training phase
(traditional approach)
● Collecting training data
o Extracting features (or casting data into a feature space).
■ color, edge, gradient, silhouette, dimension reduction, etc.
o Learning an object detector/classifier
■ Many learning methods: e.g., Neural Networks, SVM,
Boosting, Cascaded AdaBoost, random forest.
(Figures: positive and negative training data)
Pedestrian detection: testing phase
(traditional approach)
● After learning a human detector
o A detection window can be used to scan the testing image
along x and y directions for human detection.
Pedestrian detection: inference phase
● Human detection
o Detection windows with different sizes are used to detect
humans with different scales.
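A minimal sketch of this multi-scale scan in Python/NumPy (the detector classify_window, the window size, stride, and scales are all illustrative assumptions, not from the slides):

# A minimal sketch of multi-scale sliding-window detection (NumPy only).
# `classify_window` is a hypothetical stand-in for the learned detector.
import numpy as np

def sliding_window_detect(image, classify_window, win=64, stride=16,
                          scales=(1.0, 0.75, 0.5), thresh=0.5):
    """Scan `image` along x and y at several scales; return hit boxes."""
    detections = []
    for s in scales:
        h, w = int(image.shape[0] * s), int(image.shape[1] * s)
        # Nearest-neighbour resize, to keep the sketch dependency-free.
        rows = (np.arange(h) / s).astype(int)
        cols = (np.arange(w) / s).astype(int)
        scaled = image[rows][:, cols]
        for y in range(0, h - win + 1, stride):
            for x in range(0, w - win + 1, stride):
                patch = scaled[y:y + win, x:x + win]
                if classify_window(patch) > thresh:
                    # Map the box back to original-image coordinates.
                    detections.append((int(x / s), int(y / s), int(win / s)))
    return detections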
Difficulties for object recognition
● Object recognition
o To a human: an image and an image block.
o To a machine: an array of real numbers.
Past breakthroughs in object
detection research
o Face detection: Haar
feature + AdaBoost
learning. (2000)
● Every mobile phone is
equipped with this function now.
o SIFT and HOG: locally
discriminative features
(2004), combined with SVMs
for object detection.
● A key component of RGB vision-based
positioning and localization.
Examples of several breakthroughs
in object detection research
● Deformable part models (2008):
o HOG feature
o Latent SVM + stochastic gradient descent (SGD) training
o Training scale of the above: 5K ~ 20K training images.
General object recognition
o The above methods brought many ingredients to applications.
o However, they still struggle to achieve general object
detection/recognition.
● Recent big breakthroughs in object detection
come from crowdsourcing + machine learning:
o More labeled training data are gathered via Amazon Mechanical
Turk.
o More suitable machine learning techniques: deep
convolutional neural networks (CNNs).
Artificial neural networks and deep
learning
● Why deep learning?
o A limitation of traditional methods: feature
extraction and classifier training are two independent
processes.
o One motivation of deep learning is to join feature
extraction and classification into a single framework.
o This leads to a large number of parameters. However,
when the number of training images is huge, the issue of
over-fitting is lessened.
o Deep learning: end-to-end learning.
That is, feature extraction + classification in a single step.
slide: R. Fergus
slide: R. Fergus
slide: Honglak Lee
Artificial neural networks and deep
learning
● Deep learning stems from artificial neural networks.
● There are many deep learning architectures.
● Among them, deep convolutional networks (CNNs)
perform best on recognition tasks.
● In the following, we will review convolutional neural
networks (CNN) for
o image classification
o object detection
Convolutional Neural Networks
● CNN: a neural network consisting of
o fully-connected layer
o convolution layer
o max-pooling
o nonlinear activation (ReLU or sigmoid)
o ………
Fully-connected layers
● If the input is an image, the fully connected layer
will have a huge number of links between layers:
● The weights are required to be learned.
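● For example, fully connecting a 32 × 32 × 3 image to 1000 hidden
units already requires 32 × 32 × 3 × 1000 ≈ 3.1 million weights.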
Convolution layer
● Instead of full connection, slide a k × k window over the
image and perform an inner product at every site.
● That is, apply a k × k FIR filter, or convolution, to the image.
● The coefficients are required to be learned.
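A minimal NumPy sketch of this operation (the 3 × 3 filter here is a hand-picked example, not a learned one):

# A minimal sketch of a k x k "FIR filter" slid over a grayscale image:
# at every site, take the inner product of the window and the filter
# (this is cross-correlation; flipping the kernel would give convolution).
import numpy as np

def conv2d_valid(image, kernel):
    k = kernel.shape[0]
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            out[y, x] = np.sum(image[y:y + k, x:x + k] * kernel)
    return out

edge = np.array([[-1., 0., 1.]] * 3)        # a hand-crafted 3x3 filter
response = conv2d_valid(np.random.rand(8, 8), edge)
print(response.shape)                        # (6, 6)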
Convolution vs. fully-connection
● Convolutional layer:
o Shared weights
o Shift invariance
o Local
(Figures: convolutional layer vs. fully connected layer)
Multiple FIR filters in a convolutional layer
● Often there are multiple FIR filters in a convolutional layer.
● The filters' outputs serve as the inputs of the next layer.
● So, if the number of filters used in a convolutional layer
is c_l, the output of this layer forms an n_l × n_l × c_l volume.
Multiple “volume” FIR filters
● So, the output of the convolutional layer has c_l
channels, forming an n_l × n_l × c_l volume.
● Actually, the FIR filters applied in a CNN are
of size k × k × c_l (though we usually
abbreviate this as k × k for simplicity); each is
indeed a "volume" FIR filter.
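A quick shape check in PyTorch (the sizes are illustrative assumptions) confirms both points: the layer's weight tensor holds a stack of volume filters, and its output is a multi-channel volume:

# Shape check with PyTorch: a layer with c_l = 16 filters applied to a
# 3-channel input; each filter is really a 5 x 5 x 3 "volume".
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, padding=2)
x = torch.randn(1, 3, 32, 32)     # one 32 x 32 RGB image
print(conv.weight.shape)          # torch.Size([16, 3, 5, 5])
print(conv(x).shape)              # torch.Size([1, 16, 32, 32])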
Input: an RGB (3-channel) image of size N × N
● E.g., N = 32: input to the first convolutional layer, which has 5
filters.
● E.g., N = 40: input to a cascade of convolutional layers, a
fully connected layer, and the final output layer (the entire network).
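A toy PyTorch version of the N = 40 pipeline (all layer sizes here are illustrative assumptions, not the network in the slides):

# A toy version of the slide's pipeline for a 40 x 40 RGB input:
# conv -> ReLU -> pool, twice, then one fully connected layer and an
# output layer. All sizes here are illustrative.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),   # 40 -> 20
    nn.Conv2d(8, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),  # 20 -> 10
    nn.Flatten(),
    nn.Linear(16 * 10 * 10, 128), nn.ReLU(),   # fully connected layer
    nn.Linear(128, 10),                        # output layer (10 classes)
)
print(net(torch.randn(1, 3, 40, 40)).shape)    # torch.Size([1, 10])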
A single
neuron
o Activation function examples:
o sigmoid
o ReLU
Nonlinear activation function
● A nonlinear activation is necessary; otherwise cascaded linear
layers can be replaced by a single equivalent layer.
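A quick numeric check of this fact (NumPy; the sizes are arbitrary):

# Two cascaded linear (no-activation) layers are equivalent to a
# single linear layer with W = W2 @ W1.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((5, 8)), rng.standard_normal((3, 5))
x = rng.standard_normal(8)
print(np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x))   # True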
Pooling for dimension (size)
reduction
● Reduces the feature-map size; otherwise the number of weights
in later layers will still be huge.
■ Summarizes the input.
● E.g., max pooling.
Max pooling layer (cont.)
● After max pooling, the size (i.e., dimension) of the feature
map is reduced.
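A minimal NumPy sketch of 2 × 2 max pooling with stride 2:

# 2 x 2 max pooling, stride 2: each output value is the max of a
# 2 x 2 block, so the feature map's height and width are halved.
import numpy as np

def max_pool_2x2(fmap):
    h, w = fmap.shape[0] // 2, fmap.shape[1] // 2
    return fmap[:2 * h, :2 * w].reshape(h, 2, w, 2).max(axis=(1, 3))

fmap = np.arange(16).reshape(4, 4)
print(max_pool_2x2(fmap))   # [[ 5  7] [13 15]]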
Why are ConvNets good for detection?
● Sharing parameters is good
○ taking advantage of local coherence to learn a more efficient representation:
■ no redundancy
■ translation invariance
■ slight rotation invariance with pooling
● Efficient for detection:
○ all computations are shared
○ can handle varying input sizes (no need to relearn weights for new sizes)
● ConvNets are convolutional all the way up, including the fully connected layers
(see the sketch below)
slide: Pierre Sermanet
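To make the last point concrete, here is a small PyTorch sketch (an illustrative assumption, not from the slides): a fully connected layer over a fixed-size feature map can be rewritten as a convolution, so the same weights accept larger inputs and emit a spatial map of scores.

# A fully connected layer over a 5 x 5 x 16 map is the same as a
# 5 x 5 convolution, so the net can also swallow larger images and
# emit a *map* of class scores instead of a single vector.
import torch
import torch.nn as nn

fc_as_conv = nn.Conv2d(16, 10, kernel_size=5)        # "FC" layer, 10 classes
print(fc_as_conv(torch.randn(1, 16, 5, 5)).shape)    # torch.Size([1, 10, 1, 1])
print(fc_as_conv(torch.randn(1, 16, 9, 9)).shape)    # torch.Size([1, 10, 5, 5])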
Big-data training images from Internet
● ILSVRC competition (ImageNet Challenge)
o ImageNet: collecting images according to the WordNet tree.
o ILSVRC: choosing words in different tree branches.
ILSVRC Image classification
challenge
Fine tuning
● ILSVRC (ImageNet challenge) is a large
dataset with diverse object classes.
● Using the pre-trained weights on ILSVRC for
fine-tuning is a popular strategy.
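A minimal sketch of that strategy in PyTorch/torchvision (the backbone choice and class count here are illustrative assumptions):

# A common fine-tuning recipe (sketch): start from ImageNet-pretrained
# weights, replace the classification head for the new task, and train
# only the new head (optionally unfreezing more layers later).
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False                         # freeze the backbone
model.fc = nn.Linear(model.fc.in_features, 9)       # e.g., 9 new classes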
Winner of ILSVRC 2012 image
classification: AlexNet
• 5 convolutional layers, 3 fully-connected layers
• The number of neurons in each layer is given by 253440, 186624, 64896, 64896, 43264,
4096, 4096, 1000.
● This was made possible by:
○ fast hardware: GPU-optimized code
○ big dataset: 1.2 million images vs thousands before
○ better regularization: dropout
Winner of ILSVRC 2014 image
classification: GoogLeNet
● Inception: the basic
building block of
GoogLeNet
● GoogLeNet: many later
versions exist. (Here, 7
inception modules.)
(Figure: a single inception module)
ILSVRC 2014 single-net best performer –
VGG network (11-19 layers)
Design criteria:
● Use 3 × 3 filters (to find small details in every layer).
● Max-pooling (halves the height and width of the feature map),
plus doubling the number of feature maps by doubling the filters.
ILSVRC 2015 winner – Residual
network (50-151 layers)
Design criteria:
● Add short-cut links.
● Fully connected layer → average pooling.
● Use batch normalization.
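A minimal PyTorch sketch of these criteria in a basic residual block (layer sizes are illustrative):

# A basic residual block: a short-cut link adds the input back to the
# output of the convolutional branch; batch normalization throughout.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        y = self.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return self.relu(y + x)                    # the short-cut link

print(ResidualBlock(16)(torch.randn(1, 16, 8, 8)).shape)  # [1, 16, 8, 8]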
From image classification to object
detection
● The above CNNs are designed for image classification (i.e., they
assume only one concept is contained in the input image).
● However, they serve as important building blocks for
feature extraction, and can be migrated to a new architecture
for object detection.
(Figures: image classification task vs. object detection task)
Object detection CNNs
● RCNN – Fast RCNN – Faster RCNN
● R-FCN
● SSD
● PVANet
● YOLO v2
● ……
R-CNN
●R-CNN: Regions with CNN features
Koen E. A. van de Sande, Jasper R. R.
Uijlings, Theo Gevers, and Arnold W. M. Smeulders,
"Segmentation as Selective Search for
Object Recognition," ICCV 2011.
● Scan the input image for possible objects using an algorithm called Selective
Search, generating ~2000 region proposals
● Run a convolutional neural net (CNN) on top of each of these region proposals.
The CNN is pre-trained on ImageNet and fine-tuned here.
● Take the output of each CNN and feed it into a) an SVM to classify the region
and b) a regressor to tighten the bounding box of the object, if such an object
exists.
● Bounding-box regression: output the center
and size of a tight bounding box of the object.
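As a concrete sketch of this parameterization (following the widely used R-CNN formulation; the variable names are ours), the regressor is trained to predict normalized offsets from the proposal to the ground-truth box:

# The standard R-CNN box-regression targets: a scale-invariant shift of
# the proposal's center and a log-space scaling of its width/height.
import numpy as np

def regression_targets(proposal, gt):
    px, py, pw, ph = proposal        # center x, center y, width, height
    gx, gy, gw, gh = gt
    return np.array([(gx - px) / pw,        # t_x
                     (gy - py) / ph,        # t_y
                     np.log(gw / pw),       # t_w
                     np.log(gh / ph)])      # t_h

print(regression_targets((50, 50, 20, 40), (55, 52, 24, 44)))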
Fast RCNN
● Generate region proposals based on the last feature map of the network, not from the
original image itself. As a result, we can train just one CNN for the entire image.
● The CNN is fine-tuned from the image classification network pre-trained on ImageNet.
● However, selective search in the original image is still needed.
● Without using SVMs: replace the SVMs with the CNN output.
Faster RCNN: region proposal network
● At the last layer of an initial CNN, a 3x3 sliding window moves across the feature map
and maps it to a lower dimension (e.g., 256-d).
● For each sliding-window location, it generates multiple possible regions based on k
fixed-ratio anchor boxes (default bounding boxes), as sketched below.
● Each region proposal consists of a) an "objectness" score for that region and b) 4
coordinates representing the bounding box of the region.
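A small sketch of anchor generation at one sliding-window location (the scales and aspect ratios are illustrative assumptions):

# k anchor boxes of several scales and aspect ratios ("a tall, a wide,
# and a large box"), all centered at the same point.
import numpy as np

def anchors_at(cx, cy, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    boxes = []
    for s in scales:
        for r in ratios:
            w, h = s * np.sqrt(r), s / np.sqrt(r)   # w/h = r, area = s^2
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)          # k = 9 anchors per location

print(anchors_at(100, 100).shape)   # (9, 4)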
● The main insight of Faster R-CNN was to replace the slow selective search algorithm
with a fast neural net. Specifically, it introduced the region proposal network (RPN).
● Faster R-CNN = RPN + Fast R-CNN
● In other words, look at each location in our last feature map and consider k boxes
centered around it: a tall one, a wide one, a large one, etc. For each of those boxes,
output whether or not we think it contains an object, and the coordinates of that box.
● Feed the proposal into what is essentially a Fast R-CNN.
● The bottom CNN is shared by both the region proposal network of Faster RCNN and
the bounding-box-regression/object-classification head of Fast RCNN.
Results of Faster RCNN
SSD
● Region proposal and classification are trained simultaneously, unlike Faster
RCNN, where they are trained alternately.
● Early convolutional layers are also used. Early layers correspond to smaller
objects, and later layers correspond to larger objects.
● Faster than Faster RCNN, with even better performance.
YOLO v2 (CVPR 2017)
● Modified from Faster RCNN and YOLO
o use batch normalization; remove dropout.
o pretrain a higher-resolution CNN classifier: from 224 ×
224 to 448 × 448.
o use 9000 classes in ImageNet for pre-training, instead
of 1000.
o direct location prediction: solves the instability in the
bounding box regression of Faster RCNN.
● State-of-the-art on standard detection tasks like the PASCAL
VOC and COCO datasets.
o At 67 FPS, YOLOv2 gets 76.8 mAP on VOC 2007. At 40 FPS, YOLOv2 gets
78.6 mAP, outperforming state-of-the-art methods like Faster RCNN with ResNet
and SSD while still running significantly faster.
Applications: Faster RCNN for clothing
detection
Dataset
● Total 9,667 images:
o 1,964 images annotated by ourselves
o 7,703 images with bounding-box annotations from a public
dataset (ATR)
Dataset (Cont’d)
● 9 categories:
o bag, belt, dress, footwear, glasses, hat, pants, skirt,
upperclothes
● # bounding boxes per category
LabelMe Annotation Tool
● A web-based tool to
create bounding boxes
and assign labels.
(Figures: our annotated data vs. ATR)
Detection Results
Approach: Faster RCNN
Quantitative Results
● Metric: mAP (mean Average Precision)
● A detection is considered correct if its IoU
(intersection over union) with ground truth ≥
0.5 and its label is correct.
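For reference, a minimal IoU computation (a sketch; the (x1, y1, x2, y2) box format is our assumption):

# IoU between two boxes given as (x1, y1, x2, y2); a detection counts as
# correct when IoU with the ground truth is >= 0.5 and the label matches.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))   # 0.333...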
● Detection performance:
❏ Performs better on larger items, e.g., upperclothes, dress, pants.
❏ Belts are very difficult to detect.
Summary
● The clothing-item detector trained with
bounding-box annotations can produce
satisfactory results, even when only a small set of
training data is used.
● Training data is an issue: it is time-consuming
to obtain ground-truth bounding boxes.
Face detection
● Face-detection CNN: trained on a large-
scale face image dataset following similar ideas.
● We show that the face detector can be
realized on a CPU-based machine, Zenbo.
Deep CNN face detection/alignment
on Zenbo
● Zenbo specifications
o CPU: Intel Atom x5-Z8550 2.4 GHz
o OS: Android 6.0.1
o RAM: 4 GB
o No GPU used
● Frames per second
o 2.5 FPS [resolution 640x480]
● Code optimizations
o C++ and the OpenBLAS library
o Multi-threaded computation
o No deep learning frameworks such as
TensorFlow or PyTorch used
Stingray detection from aerial drone footage over the ocean
● Chien-Hung Chen's master thesis (Dept. of
Mech. & Elec. Mach. Eng., NSYSU);
● advisor: Prof. Keng-Hao Liu
A difficult problem: humans may fail to track all
the stingrays successfully.
● Using Faster RCNN to train and detect
■ base net: ZF or VGG
■ Detection on video; using consecutive
frames to refine the results (a possible scheme is sketched below).
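The slides do not spell out how the "time information" is used; as a purely hypothetical sketch, one could keep a detection only when a sufficiently overlapping box also appears in a neighbouring frame, reusing an IoU function like the one shown earlier:

# Hypothetical temporal filtering: a detection survives only if an
# overlapping box appears in the previous or next frame as well.
def temporally_consistent(dets_per_frame, iou_fn, min_iou=0.3):
    kept = []
    for t, dets in enumerate(dets_per_frame):
        prev = dets_per_frame[t - 1] if t > 0 else []
        nxt = dets_per_frame[t + 1] if t + 1 < len(dets_per_frame) else []
        kept.append([d for d in dets
                     if any(iou_fn(d, o) >= min_iou for o in prev + nxt)])
    return kept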
Demo (close range)
(Videos: ZF model; VGG model; ZF model with time information;
VGG model with time information)
Demo (distant range)
(Videos: ZF model; VGG model; ZF model with time information;
VGG model with time information)
Demo (hard case)
(Videos: ZF model; VGG model; ZF model with time information;
VGG model with time information)
Quantitative Results
● In the ground truth, some stingray sequences
detected by our method were not marked by humans.
● After re-investigating these cases, the human
experts re-marked them as ground truth.
Results on some videos
Applications of deep CNN detectors
● Deep CNN object detection techniques have
grown very fast in recent years. Several
promising models have been developed.
● The methods can be used for machine
inspection.
● Preparing data (with ground-truth regions)
would be an issue.
o Make the data types diverse.
o If only a few images with labeled regions can be collected,
augmenting the data with perturbations (e.g., flipping,
rotation, cropping, lighting changes, blurring, sharpening,
JPEG compression, etc.) is a useful technique for training;
see the sketch below.
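A typical augmentation pipeline sketched with torchvision transforms (the parameter values are illustrative; for detection, box coordinates must be transformed along with the image, which is omitted here):

# A sketch of the perturbations listed above, applied randomly at
# training time to enlarge a small labeled dataset.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                # flipping
    transforms.RandomRotation(degrees=10),                 # rotation
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),  # cropping
    transforms.ColorJitter(brightness=0.3, contrast=0.3),  # lighting changes
    transforms.GaussianBlur(kernel_size=3),                # blurring
    transforms.ToTensor(),
])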
Acknowledgement
Part of the slides are from the CVPR 2014 tutorial, Deep
Learning for Computer Vision.