A gentle introduction to Deep Learning

Jose Fernando Rodrigues-Jr
University of Sao Paulo, Brazil
Supervision: Sihem Amer-Yahia
Université Grenoble Alpes, France
Funding: Fundação de Amparo à Pesquisa do Estado de São Paulo (Fapesp)
Grant 2018/17620-5

Laboratoire d’Informatique de Grenoble

About me and my university
● The University of Sao Paulo
-Ranked number 2 among Latin American universities
-Ranked in the 250-300 band worldwide (UGA is in the 300-350 band)
(source: Times Higher Education, 2019)
● Faculty member at the University of Sao Paulo since 2010, associate professor since 2014
● My campus is in the city of Sao Carlos, in the countryside of the state of Sao Paulo
● The HDI of Sao Carlos is 0.805 (Brazil is 0.754 and France is 0.897)

Deep Learning
-Deep Learning is number 1 among the IEEE Computer Society top 10 technology trends for 2018
https://www.computer.org/press-room/2017-news/top-technology-trends-2018;
-Not new: most of the techniques are 20, 30, even 50 years old;
-Not necessarily deep: some architectures have a single (hidden) layer;
-Myth: it is about artificial intelligence, not artificial consciousness.

Deep Learning
Specifically, Deep Learning refers to the revival of artificial intelligence (artificial
neural networks) due to four factors:
1) lots of data: while a child learns what a dog looks like from three images, a computer
demands 3 million images;
2) computing power: 20xx computers have memory and processing power orders of
magnitude higher than 19xx computers; GPUs scaled the process even more;
3) algorithmic improvements: gradient descent, back propagation, and architectural
innovations widened the range of possibilities;
4) robust frameworks: Theano, TensorFlow, Keras, and many others made complex
parallel math computing accessible.

Image classification breakthrough
ImageNet Large Scale Visual Recognition Challenge (ILSVRC), ImageNet for short
2017
Training: 1.2 million images
Validation: 50,000 images
Test: 150,000 images
1,000 classes

Image classification breakthrough: super-human performance

No more feature engineering
● The idea of feature engineering applies to data
processing problems that demand feature extraction.
This is not always the case;
● Yet, it is still possible to use Artificial Neural Networks
with manually extracted features; sometimes it is the
only course of action, as in regression problems.

Promising results
Soon (or already?) better than human skills:
-Computer vision
-Text translation
-Text generation
-Games: Go, Chess, …
-Medicine: heart attack, neurodegenerative diseases, oncology, …
Esteva, A. et al.; Dermatologist-level classification of skin cancer with deep neural networks,
Nature, 2017
-dermatologist-level performance on classifying skin lesions
-identified classes still unknown in the literature
-1500+ citations in one year

Turing Award 2018
Yoshua Bengio, French-Canadian (theoretical backgrounds)
Geoffrey Hinton, British-Canadian (back propagation, AlexNet)
Yann LeCun, French (convolutional networks and engineering)
“For conceptual and engineering breakthroughs that have made
deep neural networks a critical component of computing.”

Context

Biology inspiration
● Inspiration only, not simulation. How the brain works is
not yet fully understood.

Existential parenthesis
Why neurons?
-The universe is made up of sets (for Computer Science, unordered lists without repetition)
-In a world of sets, what do smart things do?
Ans.: they build up functions (or maps, for CS)
-What is a function, broadly speaking?
Given two sets X and Y, a function defines a mapping between them: f: X → Y
X and Y can be anything: objects, emotions, concepts, abstractions, skills, music, …
-To do that, nature (evolution) designed specialized cells, named neurons
A very big bunch of neurons is able to build functions!

First key concept:
1) An Artificial Neural Network (ANN) is a function;

Why neurons?
Compared to numeric Math sets, smart beings deal with sets whose domains have a
number of unique elements that cannot be completely foreseen or exhaustively listed.
Math function: f: ℕ → ℝ; for example f(x) = xe
Open function: f: {all possible dogs} → {all known dog breeds}
(Akita, Alaskan husky, Bichon Frisé, Border Terrier, Boxer, Brazilian Mastiff, …)

Principle - artificial neuron

In matrix form ⇒ Very important
● 1 input: a 1 x n feature vector (columns 0 ... n-1)
● 1 processing neuron: an n x 1 weight vector (rows 0 ... n-1)

● j input 1 x n feature vectors, stacked as rows 0 ... j-1, form the matrix $I_{j \times n}$
● k processing n x 1 neurons, placed as columns 0 ... k-1, form the matrix $M_{n \times k}$

Now, remember the matrix dot product: $(I \cdot M)_{ab} = \sum_{i=0}^{n-1} I_{ai} M_{ib}$
And the neuron principle: each neuron computes the weighted sum of its n inputs,
$\sum_{i=0}^{n-1} x_i w_i$, so entry (a, b) of the product is the response of neuron b
to input vector a.

Suppose:
● j input 1 x n feature vectors, with n = 10: I ⇒ j x 10 matrix (10 features per vector)
● k = 5 processing 10 x 1 neurons: M ⇒ 10 x 5 matrix (10 weights per neuron, 5 neurons)

Suppose:
● k = 5 processing 10 x 1 neurons
The processing of the j 1 x 10 vectors by the five 10 x 1 neurons
corresponds to the dot product $I_{j \times 10} \cdot M_{10 \times 5}$.
The output is a matrix O corresponding to j new vectors, each with 5
transformed features, that is, $O_{j \times 5}$.
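The matrix form can be sketched in a few lines. This is only an illustration in plain Python lists, where a framework would use arrays; the helper `matmul`, the choice j = 3, and the random values are assumptions that do not come from the slides.

```python
import random

def matmul(a, b):
    """Dot product of two matrices stored as lists of rows."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    return [[sum(row[i] * b[i][c] for i in range(inner))
             for c in range(len(b[0]))]
            for row in a]

random.seed(0)
j, n, k = 3, 10, 5  # 3 input vectors, 10 features each, 5 neurons

# I: j x n matrix, one feature vector per row
I = [[random.random() for _ in range(n)] for _ in range(j)]
# M: n x k matrix, one neuron (its 10 weights) per column
M = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(n)]

O = matmul(I, M)  # j x k: each input vector now has 5 transformed features
print(len(O), "x", len(O[0]))  # 3 x 5
```

Each column of M is one neuron, so adding a neuron only adds a column; the input matrix is untouched.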

Supervised learning
Training: I know the answer
→ Learning, building the model
Testing: the model does not see the answer
→ Evaluation, using the model
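A minimal sketch of how labeled data is usually separated into the two phases. The helper `train_test_split`, the 20% test fraction, and the toy data are assumptions made for illustration only.

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle labeled samples and split them into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labeled data: (feature vector, known answer)
data = [([x, x * x], x % 2) for x in range(10)]
train, test = train_test_split(data)
print(len(train), len(test))  # 8 2
```

The answers of the test samples are kept away from the model during training and used only to score its predictions.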

After all, an optimization problem
*Biases omitted for simplicity

Parameters (mostly weights)

Suppose:
● k = 5 processing 10 x 1 neurons: a 10 x 5 matrix (10 weights per neuron, 5 neurons, fed with 10 features)
This matrix is the object of the optimization: what weights lead to the desired output?

Second key concept:
2) The training of an ANN is an optimization problem;
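To make the optimization view concrete, the sketch below scores two candidate weight sets with a loss function. The single linear neuron, the mean squared error, and the toy data are assumptions chosen for illustration; the slides do not prescribe them.

```python
def predict(weights, x):
    """A single linear neuron without bias: weighted sum of the inputs."""
    return sum(w * xi for w, xi in zip(weights, x))

def mse_loss(weights, inputs, targets):
    """Mean squared error between predictions and known answers."""
    errors = [(predict(weights, x) - t) ** 2 for x, t in zip(inputs, targets)]
    return sum(errors) / len(errors)

# Toy data generated by the rule t = 2*x0 - 1*x1
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
targets = [2.0, -1.0, 1.0, 3.0]

loss_bad = mse_loss([0.0, 0.0], inputs, targets)    # arbitrary starting weights
loss_good = mse_loss([2.0, -1.0], inputs, targets)  # weights matching the rule
print(loss_bad, loss_good)  # 3.75 0.0
```

Training amounts to searching the weight space for the values that drive this number down.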

Attention:
- This presentation is only about the basics; in fact, it covers concepts of
Artificial Neural Networks;
- When feature extraction is involved, as in image and audio
processing, the process is much more complex;
- Actually, the depth of "Deep Learning" has to do with these more
complex problems;
- Nevertheless, the principles are the same.

Overall (theoretical) process
1. Specify a structure and a loss function to guide the optimization;
2. Feed forward with matrix multiplication and non-linear activations;
3. while (results are not satisfactory)
a. Compute the parameters’ adjustments using gradient descent;
b. Backpropagate through the network using the multivariate chain rule;
c. Update the weights accordingly and feed forward again;
4. Use the trained network for classification/regression.
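The whole process can be sketched on a tiny problem. The code below is an assumption-laden illustration, not the method of any particular framework: one sigmoid neuron, mean squared error, the logical OR as toy data, and a learning rate of 1.0 are all choices made here for brevity.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# 1. Structure: 2 inputs -> 1 sigmoid neuron with bias; loss: mean squared error
inputs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
targets = [0.0, 1.0, 1.0, 1.0]  # logical OR

def forward(w, b, x):
    # 2. Feed forward: weighted sum followed by a non-linear activation
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def loss(w, b):
    return sum((forward(w, b, x) - t) ** 2
               for x, t in zip(inputs, targets)) / len(inputs)

random.seed(0)
w = [random.uniform(-0.5, 0.5) for _ in range(2)]
b = 0.0
learning_rate = 1.0
initial_loss = loss(w, b)
epochs = 0

# 3. Repeat while the results are not satisfactory (with a safety cap)
while loss(w, b) > 0.01 and epochs < 10000:
    grad_w, grad_b = [0.0, 0.0], 0.0
    for x, t in zip(inputs, targets):
        y = forward(w, b, x)
        # Chain rule: dLoss/dz = dLoss/dy * dy/dz, with dy/dz = y * (1 - y)
        delta = 2.0 * (y - t) / len(inputs) * y * (1.0 - y)
        grad_w = [g + delta * xi for g, xi in zip(grad_w, x)]
        grad_b += delta
    # Gradient descent update: move against the gradient
    w = [wi - learning_rate * g for wi, g in zip(w, grad_w)]
    b -= learning_rate * grad_b
    epochs += 1

# 4. Use the trained function for classification
predictions = [round(forward(w, b, x)) for x in inputs]
print(epochs, predictions)
```

With more layers, the same `delta` term is passed backwards from layer to layer, which is what back propagation automates.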

Error landscape
● The set of parameters defines an error landscape
● We want to move along this landscape to find the best minimum
(preferably the global minimum)

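How to converge to the proper parameters?
The standard solution is the gradient descent algorithm
1. Calculate the partial derivative of the loss with respect to the weights, $\partial Loss / \partial W$
2. Backpropagate, updating W as $W \leftarrow W - \eta \, \partial Loss / \partial W$, where $\eta$ is the learning rate
3. Use the chain rule to propagate through all the layers
(Plot: Loss as a function of W)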

The learning rate states how much to move
in the direction contrary to the gradient.
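A one-dimensional sketch of the role of the learning rate. The quadratic loss with its minimum at w = 3 and the two learning rates are assumptions picked to make the effect visible.

```python
def gradient(w):
    """Derivative of the toy loss (w - 3)^2 with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate, steps):
    for _ in range(steps):
        # Move against the gradient, scaled by the learning rate
        w = w - learning_rate * gradient(w)
    return w

w_small = gradient_descent(0.0, learning_rate=0.1, steps=50)  # approaches 3
w_large = gradient_descent(0.0, learning_rate=1.1, steps=50)  # overshoots, diverges
print(w_small, w_large)
```

Too small a rate makes training slow; too large a rate jumps over the minimum at every step and moves away from it.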

Third key concept:
3) Gradient descent is the standard method to move along the error landscape;

Different gradient descent methods
● There are many gradient descent-based optimizers;
● They vary with respect to the speed of convergence, processing cost,
learning rate, and decay factor;
● Adadelta is a robust and widely used option;
○ It is stochastic, hence more robust against local minima

Adadelta uses an adaptive learning
rate; the closer to a minimum, the
smaller the learning rate.
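A sketch of the Adadelta update rule as described by Zeiler (2012), applied to a one-dimensional toy loss. The function name `adadelta_minimize`, the defaults rho = 0.95 and eps = 1e-6, and the toy loss are assumptions for illustration; framework implementations may differ in details.

```python
import math

def adadelta_minimize(gradient, w, steps, rho=0.95, eps=1e-6):
    """Minimize a 1-D function with the Adadelta update rule."""
    avg_sq_grad = 0.0    # running average of squared gradients
    avg_sq_update = 0.0  # running average of squared updates
    for _ in range(steps):
        g = gradient(w)
        avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g * g
        # Step size adapts to the ratio between past updates and past gradients
        update = -math.sqrt(avg_sq_update + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_update = rho * avg_sq_update + (1.0 - rho) * update * update
        w += update
    return w

# Toy loss (w - 3)^2, whose gradient is 2 * (w - 3); start at w = 0
w_final = adadelta_minimize(lambda w: 2.0 * (w - 3.0), w=0.0, steps=2000)
print(w_final)
```

No learning rate is passed in: the step is derived from the two running averages, so it shrinks when the current gradient is small compared to the recent ones.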

Millions of parameters
● Warning: even for mid-sized networks, the number of weights adds up
to thousands, even millions;
● This is responsible for the high computational cost of deep learning
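A quick way to see how the count grows, assuming a fully connected network; the layer sizes below are a hypothetical example, not taken from the slides.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per input-to-neuron connection
        total += n_out         # one bias per neuron
    return total

# Hypothetical network: 784 inputs, two hidden layers of 512 neurons, 10 outputs
print(count_parameters([784, 512, 512, 10]))  # 669706
```

Every one of these values receives a gradient and an update at each training step.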

Deep Learning Frameworks
Implementing all these concepts from scratch is very hard (really!);
To ease the process, academic and industrial players built frameworks that:
-make linear algebra expressions as simple as scalar algebra expressions;
-calculate partial derivatives automatically (one line of code);
-perform back propagation;
-distribute the computation over GPUs.
Main frameworks: Theano, Google TensorFlow, Microsoft Cognitive Toolkit,
PyTorch, Keras, Apache MXNet, Caffe, Chainer, and others.
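To give an idea of what "calculating partial derivatives automatically" means, here is a toy forward-mode automatic differentiation with dual numbers. It is a didactic assumption, not how any of the listed frameworks is implemented; the class `Dual`, the helper `partial_derivative`, and the example function are hypothetical.

```python
class Dual:
    """Number carrying a value and its derivative (forward-mode autodiff)."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def partial_derivative(f, args, index):
    """Derivative of f with respect to args[index], evaluated at args."""
    duals = [Dual(a, 1.0 if i == index else 0.0) for i, a in enumerate(args)]
    return f(*duals).deriv

def loss(w, x):
    return w * w * x + 3 * w  # by hand: d/dw = 2*w*x + 3, d/dx = w*w

print(partial_derivative(loss, [2.0, 5.0], index=0))  # 23.0
```

The function `loss` is written as ordinary arithmetic; the derivative comes out of the overloaded operators without any hand-written formula.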

Deep Learning Frameworks
Oh, it is so easy! - NO!
You still have to:
-model the data input and output: a great deal of numbers organized in
multi-dimensional arrays;
-model the layers in terms of size and connectivity; matrix dimensionality
will give you headaches;
-implement the neurons’ computations;
-implement the updating scheme;
-get used to symbolic coding.

How to choose a Framework
● You are a PhD student, or postdoc, on DL itself: Theano, TensorFlow, Torch
● You want to use DL only to get features: Keras, Caffe
● You work in industry: TensorFlow, Caffe
● You started your 2-month internship: Keras, Caffe
● You want to give practical assignments to your students: Keras, Caffe
● You are curious about deep learning: Caffe
● You don’t even know Python: Keras, Torch
Source: https://project.inria.fr/deeplearning/files/2016/05/DLFrameworks.pdf

Pitfalls
● Proper pre-processing;
● Optimizing the structure can be a never-ending process;
● Preventing over- or underfitting;
● Getting it to converge (to a high-quality local minimum);
● Making sure you have the right loss function;
● Doing data augmentation correctly.
Time-consuming
● Testing a single idea can take a week or more;
● Preprocessing large data takes a long time;
● Symbolic programming is tough;
● Hyper-parameters + process variations ⇒ the number of possible settings
explodes.

Further concepts beyond the introduction
● Regularization (L1, L2, ...)
● Cost (Loss) Function (exponential, cross-entropy, Hellinger, …)
● Activation Function (ReLU, hyperbolic tangent, sigmoid, …)
● Output layer (softmax, linear, …)
● Linear algebra using broadcasting
● Specialized layers (convolution, pooling, embedding, ...)
● Dropout, masking, padding, ...
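A few of these building blocks fit in a handful of lines. The sketch below uses the textbook definitions of ReLU, sigmoid, softmax, and cross-entropy; the input values are arbitrary and only serve as illustration.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, true_class):
    """Loss is low when the true class receives a high probability."""
    return -math.log(probabilities[true_class])

hidden = [relu(z) for z in [-1.5, 0.0, 2.0]]  # ReLU zeroes negative inputs
squashed = [math.tanh(z) for z in hidden]     # tanh maps into (-1, 1)
probs = softmax([2.0, 1.0, 0.1])              # output layer for 3 classes
print(hidden, probs, cross_entropy(probs, true_class=0))
```

The activation functions act on each neuron output separately, while softmax looks at the whole output layer at once.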

The DL zoo
https://towardsdatascience.com/the-mostly-complete-chart-of-neural-networks-explained-3fb6f2367464

Key concepts
1) An Artificial Neural Network is a function;
2) The training of an ANN is an optimization problem;
3) Gradient descent is the standard method to move along the error landscape.

That’s it for now
