Pres Tesi LM-2016+transcript_eng

UNIVERSITY OF BERGAMO
ENGINEERING DEPARTMENT
MASTER OF SCIENCE IN COMPUTER ENGINEERING
CONVOLUTIONAL NEURAL NETWORKS FOR
COMPUTER VISION
Supervisor
Mario Verdicchio
Candidate
Daniele Ettore Ciriello
13 June 2016
Welcome, and thank you for being here. I am Daniele Ciriello, and I
wrote my master’s thesis on convolutional neural networks for
computer vision.

Overview
Introduction
Neural Networks Fundamentals
State of the Art
Implementation
Results
Future Developments
2 of 32
The presentation is split into six parts. First I introduce computer
vision and neural networks, presenting their fundamental concepts
and the state of the art in image classification tasks. Then I present
the parts that make up the implemented project and the results of
some experiments. Finally, I show possible future developments and
practical applications.

Overview
Introduction
State of the Art
Implementation
Results
Future Developments
3 of 32
Let’s start by introducing the main concepts of computer vision and
artificial neural networks.

Computer Vision
Acquisition, processing, analysis and understanding of images
and high-dimensional data
Application examples:
Control processes
Navigation
Event recognition
Information organization
Interaction
4 of 32
Computer vision is a field of computer science that deals with the
acquisition, processing, analysis and understanding of the information
contained in images and high-dimensional data from the real world,
with the objective of producing information in numeric or symbolic
form, for example in the form of a decision.
It is the basis of many artificial intelligence systems, with
applications in, for example, process control, navigation systems,
information organization and human-machine interaction systems.
At the core of many of these applications there is often an image
classification problem, in which a system has to assign an image to
one class from a set of predefined classes.

Neural Networks and Computer Vision
Information-processing paradigm inspired by the biological nervous
system
Highly scalable non-linear decision models
Autonomously set the right parameters through learning
algorithms
Need big training data-sets
5 of 32
In recent years considerable progress has been possible thanks to
convolutional neural networks, which have become the state of the
art for image classification and many other computer vision
problems. Artificial neural networks are models inspired by the
biological nervous system, and they can represent non-linear, highly
scalable decision models. By using learning algorithms, we let the
network set by itself the parameters that minimize the loss.

Overview
Introduction
State of the Art
Implementation
Results
Future Developments
6 of 32
Let’s see some fundamental concepts about artificial neural networks.

Neural Units
[Figure: a neural unit with inputs x1, x2, ..., xn, weights w1, w2, ..., wn, bias b and output y]
y = f(w · x + b)
w: weights
b: bias
f: activation function
[Figure: plots of activation functions]
7 of 32
The basic element of artificial neural networks is the neural unit, or
artificial neuron. It takes n input values x, each with a
corresponding weight w, so inputs and weights can be seen as two
vectors x and w. The unit’s output is obtained by applying a
function, called the activation function, to the sum of the dot
product of x and w and a value b called the bias.
The simplest activation function, and the first to be used, is the
step function; neural units that use it are called perceptrons. The
best-known activation function in the literature, although it is no
longer in common use, is the sigmoid (or logistic) function; similar
functions, such as the hyperbolic tangent, also appear in the
literature. The most widely used activation function today is the
rectifier, and neurons that use it are called ReLUs (rectified linear
units).
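
The formula y = f(w · x + b) and the activation functions named above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names and the sample values of x, w and b are assumptions chosen for the example, not part of the thesis code.

```python
import math

def step(z):
    """Step activation used by perceptrons: 1 for non-negative z, else 0."""
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    """Sigmoid (logistic) activation."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectifier activation: max(0, z)."""
    return max(0.0, z)

def neural_unit(x, w, b, f):
    """Compute y = f(w . x + b) for a single neuron."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return f(z)

# Illustrative values (assumed): w . x + b = 0.5 - 2.0 + 0.75 + 0.5 = -0.25
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.25]
b = 0.5

y_step = neural_unit(x, w, b, step)          # 0.0, since -0.25 < 0
y_relu = neural_unit(x, w, b, relu)          # 0.0, negative values are clipped
y_sigmoid = neural_unit(x, w, b, sigmoid)    # between 0 and 0.5
y_tanh = neural_unit(x, w, b, math.tanh)     # between -1 and 0
```

The same unit gives different outputs only because of the choice of f, which is why the activation function is treated as a separate parameter here.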

Feed-forward Neural Networks
[Figure: feed-forward network with an input layer, a first hidden layer, a second hidden layer and an output layer]
y_1 = x,  y_j = f(W_j · y_{j-1} + b_j)
8 of 32
By arranging these elements in layers we can build neural networks.
In particular, classic feed-forward models are composed of several
layers: the first is called the input layer because it represents the
network’s input values, the last is called the output layer because it
carries the network’s output, and the internal layers are called
hidden layers, simply because they are neither input nor output
layers.
Layers like these are called affine or fully connected layers: the
inputs of each neuron are the outputs of all the neurons in the
previous layer.
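
As a rough sketch of how fully connected layers chain together, the following plain-Python example applies the recurrence in which the first activation is the input and each later activation is f applied to the previous one times the layer's weights plus its bias. The layer sizes, the weights and the use of ReLU on every layer (including the output) are assumptions made only for illustration.

```python
def relu(z):
    return max(0.0, z)

def affine_layer(W, b, y_prev, f):
    """One fully connected layer: each row of W holds the weights of one
    neuron, and every neuron sees all outputs of the previous layer."""
    return [f(sum(w * y for w, y in zip(row, y_prev)) + bias)
            for row, bias in zip(W, b)]

def feed_forward(x, layers, f=relu):
    """Propagate x through a list of (W, b) pairs, layer by layer."""
    y = x  # the input layer simply carries the input values
    for W, b in layers:
        y = affine_layer(W, b, y, f)
    return y

# Assumed toy network: 2 inputs -> 3 hidden neurons -> 1 output
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, 0.0, 1.0]),
    ([[1.0, 1.0, 1.0]], [-1.0]),
]

output = feed_forward([2.0, 1.0], layers)
# hidden layer: [1.0, 1.5, 1.0]; output layer: 1.0 + 1.5 + 1.0 - 1.0 = 2.5
```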

Convolution Layers
Groups of neurons which share parameters with all neurons of
the same group
Each neuron takes a portion of x as input
Very efficient for image processing problems
[Figure: one-dimensional convolution layer in which identical neurons A map the inputs x1, x2, ..., xn to the outputs y1, y2, ..., yn]
9 of 32
Many other types of layer exist. A very important one, especially for
computer vision and, more generally, pattern recognition problems, is
the convolution layer. It consists of groups of neurons in which each
neuron takes a portion of the input values as input and shares its
parameters with every other neuron in the same group.
The figure shows a one-dimensional convolution layer, composed of
neurons of type A, each of which takes a segment of x as input. In
computer vision problems, layers can be two-dimensional and take
areas of x as input.
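
The idea of neurons of type A sharing one set of parameters can be sketched as a one-dimensional convolution in plain Python. The segment length of two inputs per neuron and the sample values are assumptions for the example; as in most neural network libraries, the weights are applied without flipping them (strictly a cross-correlation).

```python
def conv1d(x, kernel, bias=0.0):
    """Apply the same weights (kernel) and bias to every segment of x.

    Each output value plays the role of one neuron A: it only sees
    len(kernel) neighbouring inputs, and all neurons share the kernel.
    """
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k)) + bias
            for i in range(len(x) - k + 1)]

x = [1.0, 2.0, 4.0, 7.0, 11.0]   # five inputs
kernel = [-1.0, 1.0]             # shared weights: difference of neighbours

y = conv1d(x, kernel)
# four outputs: [1.0, 2.0, 3.0, 4.0]
```

However long x is, the layer has only len(kernel) weights plus one bias, which is what makes this kind of layer economical on images.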

Other Types of Layer
Pooling
Softmax
Normalization
10 of 32
Convolution layers are often interleaved with pooling layers, which
sub-sample the input while keeping the number of channels unchanged.
Another important layer is the softmax layer, used as the output layer
in many classification problems because its output can be interpreted
as a probability distribution.
Another important layer type is the normalization layer, which
normalizes the input values to a unit interval along each channel.
The modularity of these simple concepts allows us to compose more
(or less) complex and specific models.
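
As a small sketch of two of these layers, the following plain-Python example shows max pooling on a single one-dimensional channel and a softmax over class scores. Max pooling with a window of two is an assumption (the slides do not say which pooling variant or window was used), and the sample values are illustrative.

```python
import math

def max_pool1d(x, size=2):
    """Sub-sample x by keeping the maximum of each non-overlapping window."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

def softmax(scores):
    """Turn arbitrary scores into positive values that sum to 1."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

pooled = max_pool1d([1.0, 3.0, 2.0, 0.0, 5.0, 4.0])
# [3.0, 2.0, 5.0]: half as many values as the input

probs = softmax([2.0, 1.0, 0.1])
# three probabilities, largest for the largest score
```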

Overview
Introduction
State of the Art
Implementation
Results
Future Developments
11 of 32
To illustrate the state of the art, I show the results of a competition
that takes place annually and attracts institutions from all over the
world, and that has become a benchmark for evaluating computer
vision systems.

ILSVRC
ImageNet Large-Scale Visual Recognition Challenge
1000 classes
1.2 M training samples
50 k validation samples
12 of 32
ILSVRC is a competition made up of several computer vision tasks,
such as localization and classification; over the years other
sub-tasks have been added, such as scene classification and object
localization in videos. This competition has become a benchmark for
analyzing the performance of large-scale convolutional networks.
The classification task is based on 1.2 million images and 1000
classes; in 2015 convolutional models surpassed the average human
error.

ILSVRC
Blue: Traditional computer vision
Purple: Deep learning
Red: Human capacity
13 of 32
The figure shows how, since deep learning systems (that is, systems
based on neural models with more than one hidden layer) came into
use, considerable progress became possible, to the point of
surpassing human capacity in 2015. You can also see how neural
models have supplanted the classic computer vision approaches.

Residual Networks
14 of 32
The model that won all the tasks in 2015 is a convolutional model
called the residual network, proposed by MSRA. It is a convolutional
network to which the concept of residual learning is applied.
The base structure in the middle is called the plain network and is
inspired by the VGG networks, a model presented the previous year
at the same competition (image above). Carrying the input values to
the output of a group of convolutional layers through a skip path
turns it into a residual convolutional network like the one below.
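
The difference between a plain group of layers and the same group with a skip path can be sketched as follows. To keep the example short, small matrix-vector products stand in for the convolutional layers; this substitution, the two-layer grouping and all names are assumptions for illustration and not the thesis implementation.

```python
def relu_vec(v):
    return [max(0.0, a) for a in v]

def linear(W, v):
    """Matrix-vector product, used here in place of a convolution."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def plain_block(x, W1, W2):
    """Two stacked layers without a skip path."""
    return relu_vec(linear(W2, relu_vec(linear(W1, x))))

def residual_block(x, W1, W2):
    """The same two layers, with the input added back through a skip path."""
    fx = linear(W2, relu_vec(linear(W1, x)))
    return relu_vec([f + a for f, a in zip(fx, x)])

zero = [[0.0, 0.0], [0.0, 0.0]]
x = [1.0, 2.0]

plain_out = plain_block(x, zero, zero)        # [0.0, 0.0]: the input is lost
residual_out = residual_block(x, zero, zero)  # [1.0, 2.0]: the input passes through
```

With all weights at zero the plain block discards its input, while the residual block behaves as the identity, so the layers only have to learn a correction to the input.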

Overview
Introduction
State of the Art
Implementation
Results
Future Developments
15 of 32
So my objective was to reproduce the results obtained by the residual
model on simpler data-sets.

Project Overview
PyFunt
Library for developing and training convolutional neural
networks
PyDatSet
Library for loading various data-sets and a collection of
functions for artificial training data augmentation
Deep-residual-networks-pyfunt
Implementation and training of parametric residual networks
on various data-sets
16 of 32
The project is made up of three Python repositories. PyFunt
contains the library for developing and training convolutional neural
networks. PyDatSet consists of a collection of functions for loading
various data-sets in a Python environment and a set of functions for
artificial training data augmentation. The main repository contains
the residual model implementation and the main application, which
uses the model and the two libraries to load the data-sets and train
the networks.
All three repositories are published on GitHub under an open-source
license, so that anyone can use them or contribute to their
development.

Package Diagram
17 of 32
This diagram shows how the various parts of the project interact
with each other to train the residual model. In particular, the main
application creates a ResNet object, which uses the implementations
of the various layers and initialization functions provided by pyfunt.
It then creates a Solver object, provided by the same library, and
passes it the model, the data loaded with pydatset and several
training hyperparameters, such as the number of epochs and the
learning rate. pyfunt also contains utilities to verify that the layers
are implemented correctly and to visualize the weights of the first
convolution layer.
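
The division of work between a model object and a solver object can be illustrated with a deliberately tiny sketch. Everything here is hypothetical: `TinyModel` and `TinySolver` are invented for the example and do not reproduce the actual pyfunt classes or their signatures; only the general pattern (a solver receives a model, the data, a number of epochs and a learning rate) follows the description above.

```python
class TinyModel:
    """Hypothetical stand-in for a network: predicts w * x."""

    def __init__(self):
        self.w = 0.0

    def loss_and_grad(self, x, y):
        """Squared-error loss and its derivative with respect to w."""
        err = self.w * x - y
        return 0.5 * err * err, err * x


class TinySolver:
    """Hypothetical trainer that runs plain stochastic gradient descent."""

    def __init__(self, model, data, num_epochs=10, learning_rate=0.1):
        self.model = model
        self.data = data
        self.num_epochs = num_epochs
        self.learning_rate = learning_rate
        self.loss_history = []

    def train(self):
        for _ in range(self.num_epochs):
            for x, y in self.data:
                loss, grad = self.model.loss_and_grad(x, y)
                self.model.w -= self.learning_rate * grad
                self.loss_history.append(loss)


# Assumed toy data following y = 2 * x
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
model = TinyModel()
solver = TinySolver(model, data, num_epochs=20, learning_rate=0.1)
solver.train()
# model.w is now very close to 2.0
```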

Overview
Introduction
State of the Art
Implementation
Results
Future Developments
18 of 32
So let’s look at the results of experiments on several data-sets.

CIFAR-10 - Canadian Institute For Advanced
Research
60 k images
(50 k + 10 k)
RGB 32x32
10 classes:
airplane
car
bird
cat
deer
dog
frog
horse
ship
truck
19 of 32
To replicate the results obtained by residual networks, I trained my
own implementation of the model on CIFAR-10, which consists of
60,000 RGB images, 32 pixels per side, spread over 10 disjoint
classes.
This data-set is much simpler than the previous one but, being
varied, makes it easy to see whether a network can learn properly
from the training samples.

CIFAR-10 – Results
Accuracy: 90.41 %; parameters: ~248 k
20 of 32
To evaluate a network’s behaviour we can analyze the learning
curves, which report the moving average of the cost at each iteration
(above), the error on the training data (the dotted lines) and the
error on the validation data (the solid lines below).
In this case, with a 20-layer network, the errors reveal a typical
phenomenon of these models called overfitting: the network uses its
many parameters to learn specific features of the training set
without generalizing enough to the validation set, in a certain sense
memorizing the training samples.
By artificially augmenting the size of the training set, we can reduce
this phenomenon and obtain better results in terms of validation error.
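
The moving average of the cost shown in the learning curves can be computed as in the sketch below. The slides do not say which kind of moving average or which window was used, so the simple windowed mean and the window of two values are assumptions, and the loss values are invented for the example.

```python
def moving_average(values, window):
    """Mean of the last `window` values at each iteration.

    During the first iterations, when fewer than `window` values exist,
    the mean is taken over the values seen so far.
    """
    averaged = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        averaged.append(running / min(i + 1, window))
    return averaged

losses = [4.0, 2.0, 3.0, 1.0, 2.0, 0.0]   # illustrative per-iteration costs
smooth = moving_average(losses, window=2)
# [4.0, 3.0, 2.5, 2.0, 1.5, 1.0]
```

The raw costs jump up and down from one iteration to the next, while the averaged curve decreases steadily, which makes the trend easier to read.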

MNIST – Modified National Institute of Standards
and Technology
70 k images
(60 k + 10 k)
B/W 28x28
handwritten
digits
10 classes
(digits from 0 to 9)
21 of 32
The MNIST data-set was first presented in 1991 and consists of 70
thousand black-and-white images of handwritten digits, 28 pixels per
side, split into 60 thousand training images and 10 thousand
validation images.

MNIST – Results
Accuracy: 99.64 %; parameters: 442 k
22 of 32
Another way to lower the validation error in residual models is to
increase the number of layers or the number of neurons in each
layer. In this case, for the first two experiments I halved the number
of neurons used in the previous experiments while increasing the
number of layers from 20 to 32. By doubling the number of neurons
again in the 32-layer model, I obtained an accuracy of 99.64%, which
means that, once trained, the network misclassifies just 36 of the
10,000 validation images.
The number of filters in each layer starts at 16 and is doubled at
intervals until it reaches 64.
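
The step from 99.64% accuracy to 36 misclassified images is simple arithmetic, shown here as a short check. The function names are chosen only for this example.

```python
def misclassified(total, accuracy_percent):
    """Number of wrong predictions implied by an accuracy on `total` samples."""
    return round(total * (100.0 - accuracy_percent) / 100.0)

def accuracy_percent(total, errors):
    """Accuracy, in percent, given the number of wrong predictions."""
    return 100.0 * (total - errors) / total

errors = misclassified(10_000, 99.64)      # 36
accuracy = accuracy_percent(10_000, 36)    # approximately 99.64
```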

MNIST – Results
23 of 32
This figure shows all 36 misclassified images. In the middle of each
cell is the original image, in the top left the correct class for the
image, in the bottom left the network’s wrong classification and in
the lower right the class ranked second by confidence.
In many cases the second classification is the correct one;
furthermore, some examples are almost indistinguishable even to the
human eye.
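
The first and second classifications shown in each cell can be read off the network's output probabilities, as in this sketch. The probability values are invented for illustration (they are not taken from the experiments), and `top_k` is a name chosen for the example.

```python
def top_k(probs, k=2):
    """Indices of the k most probable classes, most probable first."""
    return sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)[:k]

# Invented output for one digit image: class 4 is preferred, class 9 is second
probs = [0.01, 0.02, 0.05, 0.02, 0.48, 0.01, 0.01, 0.01, 0.01, 0.38]

first, second = top_k(probs)
# first == 4, second == 9

true_label = 9
second_guess_correct = (first != true_label) and (second == true_label)
```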

SFDDD - State Farm Distracted Driver Detection
~22 k RGB 640x480 images
of drivers
10 distraction classes:
safe driving
texting - right
talking on the phone - right
texting - left
talking on the phone - left
operating the radio
drinking
reaching behind
hair and makeup
talking to passenger
24 of 32
The last data-set I present was provided by State Farm through
Kaggle, a platform that hosts many machine learning competitions. In
this case the insurance company wanted to find the best accuracy
achievable in classifying RGB, 640x480 images of drivers into 10
distraction classes: safe driving, texting with the right hand, talking
on the phone with the right hand, and so on.
To simplify the data-set and the training process, I resized all the
images to 64x48 pixels, selecting for training a random portion of
each image at each iteration and, for validation, the central 32x32
portion.
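
The two cropping strategies can be sketched in plain Python on an image stored as a list of rows. Two assumptions are made here: that 64x48 means 64 pixels wide by 48 pixels high, and that the random training portion is also 32x32 (only the validation crop size is stated). The function names are chosen for this example and are not those of PyDatSet.

```python
import random

def crop(image, top, left, height, width):
    """Cut a height x width window out of an image stored as a list of rows."""
    return [row[left:left + width] for row in image[top:top + height]]

def random_crop(image, height, width, rng=random):
    """Window at a random position, to be drawn again at every iteration."""
    top = rng.randint(0, len(image) - height)
    left = rng.randint(0, len(image[0]) - width)
    return crop(image, top, left, height, width)

def center_crop(image, height, width):
    """Window centred in the image, used for validation."""
    top = (len(image) - height) // 2
    left = (len(image[0]) - width) // 2
    return crop(image, top, left, height, width)

# Dummy 48-row by 64-column image whose "pixels" are their own coordinates
image = [[(r, c) for c in range(64)] for r in range(48)]

train_patch = random_crop(image, 32, 32, random.Random(0))
val_patch = center_crop(image, 32, 32)
# val_patch starts at row 8, column 16 of the resized image
```

Drawing a new random window at each iteration shows the network slightly different versions of the same picture, which is one form of the data augmentation mentioned earlier.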

SFDDD – Results
Accuracy: 99.75 %, parameters: ~636 k
25 of 32
Given that the competition has not ended yet, the validation
data-set is not public yet, so I used a validation set of 2000 images,
randomly extracted and excluded from the roughly 22,000 images of
the training set. We can analyze the learning curves of two residual
networks of 32 and 44 layers. Although the difference in the loss
values is almost imperceptible, and although both networks recognize
practically all the training samples after 80 epochs, the 44-layer
ResNet reached an accuracy of 99.75% on my validation set.

SFDDD – Saliency Maps
26 of 32
By evaluating and visualizing the derivative of the cost with respect
to the input images, we obtain the so-called saliency maps, in which
lighter areas represent the portions of the image that contribute
most to the correct classification by the trained network. Here we
can see some samples from the class “talking on the phone with the
right hand”. It is interesting to note, for example, how the
classification is affected by the steering wheel area, where the hands
usually rest, or by the areas where the arm should be; in the fourth
picture, instead, the area with the greatest effect is the one around
the head.
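
The idea behind a saliency map, the magnitude of the derivative of the cost with respect to each input value, can be sketched on a toy model. Here a linear softmax classifier with three inputs stands in for the trained network, and the derivative is estimated by central finite differences rather than by backpropagation; both choices, the weights and the input are assumptions made to keep the example self-contained.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cost(x, W, label):
    """Cross-entropy cost of a linear softmax classifier on input x."""
    scores = [sum(w * a for w, a in zip(row, x)) for row in W]
    return -math.log(softmax(scores)[label])

def saliency(x, W, label, eps=1e-5):
    """Estimate |d cost / d x_i| for every input value x_i."""
    result = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + eps] + x[i + 1:]
        down = x[:i] + [x[i] - eps] + x[i + 1:]
        result.append(abs(cost(up, W, label) - cost(down, W, label)) / (2 * eps))
    return result

# Toy classifier: only the first "pixel" separates the two classes
W = [[2.0, 0.0, 0.1],
     [-2.0, 0.0, 0.1]]
x = [0.5, 0.5, 0.5]

s = saliency(x, W, label=0)
# s[0] is clearly the largest value; s[1] and s[2] are practically zero
```

Rendered as an image, the first input would be the light area: it is the only one whose change affects the cost, just as the lighter regions of the saliency maps mark the pixels that matter most to the network.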

Overview
Introduction
State of the Art
Implementation
Results
Future Developments
27 of 32
We come then to the conclusion, where I describe some possibilities
for future developments and practical applications.

Future Developments
Extend the framework
Implement other models
Train on other data-sets
28 of 32
In the near future I would like to continue the project by extending
the implemented library, for example to allow GPU-based
hardware-accelerated computation, by developing other types of
layer or new models as they are presented, or by training the same
model on other data-sets.

Examples of Applications in Practice
Health
. . .
Robotics
. . .
Navigation
. . .
Physics, Natural Sciences, art and entertainment, ...
29 of 32
Convolutional neural networks are being used in health care to
classify chest diseases from x-ray images and to localize and segment
tumor masses in the brain, or the pancreas in cross-sectional images
of the trunk, which is useful because the pancreas varies in size and
shape from person to person. In robotics, they are used for hand-eye
coordination systems in robotic arms with 7 degrees of freedom, to
improve the locomotion skills of robots on irregular ground through
reinforcement learning algorithms, and to improve the relationship
between humans and robots with facial expression classification
networks. In navigation they are used extensively in autonomous
vehicle systems, as well as, for example, for detecting and classifying
roads from satellite images or for classifying images of road signs.
Furthermore, neural models are being used successfully in physics
and the natural sciences, for example to classify flowers in botany or
plankton in marine biology, and in many other fields.

Thank you for the attention.

Images Credits (1 of 2)
Slide Image Credits URL (http://)
10 Super Vision UofT goo.gl/fbyXSK
10 UvA-Euvision UvA goo.gl/ltNvaP
12 ILSVRC samples ImageNet goo.gl/PvFCAv
13 ILSVRC results 1 NVIDIA goo.gl/QkG3Nf
13 ILSVRC results 2 NVIDIA goo.gl/THMZ1X
14 ResNet MSRA goo.gl/uR0ZXL
29 Medi1 NVIDIA goo.gl/DdPSvD
29 Medi2 BRATS goo.gl/rGgz22
29 Medi3 SPIE goo.gl/R5mbmj
29 Rob1 Google goo.gl/6BycQ7
29 Rob2 UBC goo.gl/A585Iz
29 Rob3 MSRC goo.gl/3exWqJ
31 of 32

Images Credits (2 of 2)
Slide Image Credits URL (http://)
29 Nav1 Google goo.gl/DFgPcl
29 Nav2 DeepOSM goo.gl/sR72BF
29 Nav4 IDSIA goo.gl/SR16Uk
32 of 32
