This document provides a summary of topics covered in a deep neural networks tutorial, including:
- A brief introduction to artificial intelligence, machine learning, and artificial neural networks.
- An overview of common deep neural network architectures like convolutional neural networks, recurrent neural networks, autoencoders, and their applications in areas like computer vision and natural language processing.
- Advanced techniques for training deep neural networks, such as greedy layer-wise training, regularization methods like dropout, and unsupervised pre-training.
- Applications of deep learning beyond traditional discriminative models, including image synthesis, style transfer, and generative adversarial networks.
Deep Learning Architectures for NLP (Hungarian NLP Meetup, 2016-09-07) - Márton Miháltz
A brief survey of deep learning/neural network methods currently used in NLP: recurrent networks (LSTM, GRU), recursive networks, convolutional networks, hybrid architectures, and attention models. We look at specific papers in the literature targeting sentiment analysis, text classification and other tasks.
Deep Learning for NLP (without Magic) - Richard Socher and Christopher Manning (BigDataCloud)
A tutorial given at NAACL HLT 2013.
Richard Socher and Christopher Manning
http://nlp.stanford.edu/courses/NAACL2013/
Machine learning is everywhere in today's NLP, but by and large machine learning amounts to numerical optimization of weights for human-designed representations and features. The goal of deep learning is to explore how computers can take advantage of data to develop features and representations appropriate for complex interpretation tasks. This tutorial aims to cover the basic motivation, ideas, models and learning algorithms in deep learning for natural language processing. Recently, these methods have been shown to perform very well on various NLP tasks such as language modeling, POS tagging, named entity recognition, sentiment analysis and paraphrase detection, among others. The most attractive quality of these techniques is that they can perform well without any external hand-designed resources or time-intensive feature engineering. Despite these advantages, many researchers in NLP are not familiar with these methods. Our focus is on insight and understanding, using graphical illustrations and simple, intuitive derivations. The goal of the tutorial is to make the inner workings of these techniques transparent, intuitive and their results interpretable, rather than black boxes labeled "magic here". The first part of the tutorial presents the basics of neural networks, neural word vectors, several simple models based on local windows, and the math and algorithms of training via backpropagation. In this section, applications include language modeling and POS tagging. In the second section we present recursive neural networks, which can learn structured tree outputs as well as vector representations for phrases and sentences. We cover both the equations and their applications. We show how training can be achieved by a modified version of the backpropagation algorithm introduced before; these modifications allow the algorithm to work on tree structures. Applications include sentiment analysis and paraphrase detection.
We also draw connections to recent work in semantic compositionality in vector spaces. The principal goal, again, is to make these methods appear intuitive and interpretable rather than mathematically confusing. By this point in the tutorial, the audience members should have a clear understanding of how to build a deep learning system for word-, sentence- and document-level tasks. The last part of the tutorial gives a general overview of the different applications of deep learning in NLP, including bag-of-words models. We will provide a discussion of NLP-oriented issues in modeling, interpretation, representational power, and optimization.
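The "math and algorithms of training via backpropagation" covered in the tutorial's first part can be sketched with a minimal one-hidden-layer network in NumPy. The toy task (XOR), sizes and learning rate below are illustrative choices, not taken from the tutorial:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: learn XOR, which a linear model cannot solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.], [1.], [1.], [0.]])

W1 = rng.normal(0, 1.0, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 1.0, (8, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for _ in range(2000):
    # Forward pass
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the error derivative layer by layer
    dz2 = (p - y) / len(X)           # dLoss/dlogits (cross-entropy + sigmoid)
    dW2 = h.T @ dz2; db2 = dz2.sum(0)
    dh = dz2 @ W2.T
    dz1 = dh * (1.0 - h ** 2)        # tanh derivative
    dW1 = X.T @ dz1; db1 = dz1.sum(0)
    # Gradient step
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

pred = (sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2) > 0.5).astype(float)
print(pred.ravel())  # should approach the XOR labels [0, 1, 1, 0]
```

The same forward/backward pattern generalizes to the window-based models the tutorial describes; only the input construction changes.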
Word embeddings, applications of sequence modelling, recurrent neural networks, drawbacks of recurrent neural networks, the gated recurrent unit, the long short-term memory unit, and the attention mechanism.
Natural language processing techniques transition from machine learning to de... - Divya Gera
Natural language processing: its need, business applications, NLP with machine learning, text data preprocessing for machine learning, and NLP with deep learning.
Recurrent Neural Networks have been shown to be very powerful models, as they can propagate context over several time steps. Because of this they can be applied effectively to several problems in Natural Language Processing, such as language modelling, tagging problems and speech recognition. In this presentation we introduce the basic RNN model and discuss the vanishing gradient problem. We describe LSTM (Long Short-Term Memory) and Gated Recurrent Units (GRU). We also discuss bidirectional RNNs with an example. RNN architectures can be considered deep learning systems, where the number of time steps is the depth of the network. It is also possible to build an RNN with multiple hidden layers, each having recurrent connections from the previous time steps, representing abstraction in both time and space.
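The vanishing gradient problem mentioned above can be observed numerically in a vanilla RNN. The dimensions and weight scale below are illustrative assumptions; with small recurrent weights, the gradient flowing back in time is repeatedly multiplied by the recurrent matrix and the tanh derivative, so it shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_hid, T = 4, 8, 20  # arbitrary illustrative sizes

# Small recurrent weights (spectral radius < 1) make gradients vanish.
W_xh = rng.normal(0, 0.3, (d_hid, d_in))
W_hh = rng.normal(0, 0.3, (d_hid, d_hid))

def rnn_forward(xs):
    """Vanilla RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1})."""
    h = np.zeros(d_hid)
    hs = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h)
        hs.append(h)
    return hs

xs = rng.normal(size=(T, d_in))
hs = rnn_forward(xs)

# Backpropagate a unit gradient from the last hidden state to earlier steps:
# dh_t/dh_{t-1} = diag(1 - h_t^2) W_hh, applied once per time step.
g = np.ones(d_hid)
norms = []
for t in reversed(range(1, T)):
    g = W_hh.T @ (g * (1.0 - hs[t] ** 2))
    norms.append(np.linalg.norm(g))

print(norms[0], norms[-1])  # gradient norm one step back vs. far in the past
```

LSTM and GRU cells replace this repeated matrix multiplication with additive, gated updates, which is precisely how they mitigate the effect shown here.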
“Automatically learning multiple levels of representations of the underlying distribution of the data to be modelled”
Deep learning algorithms have shown superior learning and classification performance in areas such as transfer learning, speech and handwritten character recognition, and face recognition, among others.
(I have referred to many articles and experimental results provided by Stanford University.)
Alberto Massidda - Images and words: mechanics of automated captioning with n... (Codemotion)
Image captioning is the process of generating a textual description of an image. It uses both Natural Language Processing and Computer Vision to generate the captions. As in the proverbial "finger pointing to the moon", automated image captioning requires the ability to discern what is really going on in a scene and to generate a fluent description of the act taking place. In this talk we present the underlying mechanics of object detection and language generation using Convolutional and Recurrent Neural Networks.
Learning to understand phrases by embedding the dictionary - Roelof Pieters
A review of "Learning to Understand Phrases by Embedding the Dictionary" by Felix Hill, Kyunghyun Cho, Anna Korhonen and Yoshua Bengio,
at KTH's Deep Learning reading group:
www.csc.kth.se/cvap/cvg/rg/
Deep learning algorithms have drawn the attention of researchers working in the fields of computer vision, speech recognition, malware detection, pattern recognition and natural language processing. In this paper, we present an overview of deep learning techniques: the convolutional neural network, the deep belief network, the autoencoder, the restricted Boltzmann machine and the recurrent neural network. We then survey current work applying deep learning algorithms to malware detection, give justified suggestions for future research, and present an experimental analysis demonstrating the importance of deep learning techniques.
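As an illustrative sketch of one of the architectures named above (not code from the paper), here is a minimal linear autoencoder trained by gradient descent in NumPy; the data, sizes and learning rate are all invented for the example:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 200 points that actually live on a 2-D subspace of R^8,
# so a 2-unit bottleneck can reconstruct them well.
Z_true = rng.normal(size=(200, 2))
A = rng.normal(size=(2, 8))
X = Z_true @ A

d, k, lr = 8, 2, 0.05
W_enc = rng.normal(0, 0.1, (d, k))  # encoder weights
W_dec = rng.normal(0, 0.1, (k, d))  # decoder weights

def loss(X):
    Z = X @ W_enc       # encode: project to the bottleneck
    X_hat = Z @ W_dec   # decode: reconstruct the input
    return np.mean((X - X_hat) ** 2)

initial = loss(X)
for _ in range(500):
    Z = X @ W_enc
    X_hat = Z @ W_dec
    E = X_hat - X                       # reconstruction error
    grad_dec = Z.T @ E / len(X)         # dL/dW_dec (constants folded into lr)
    grad_enc = X.T @ (E @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
final = loss(X)
print(initial, final)  # reconstruction error before and after training
```

A practical autoencoder adds non-linearities and deeper stacks, but the training loop (encode, decode, penalize reconstruction error) is the same.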
Roughly 30 years ago, AI was not only a topic for science-fiction writers but also a major research field surrounded by huge hopes and investments. The over-inflated expectations ended in a crash, followed by a period of absent funding and interest: the so-called AI winter. The last three years, however, have changed everything again. Deep learning, a machine learning technique inspired by the human brain, has crushed one benchmark after another, and tech companies like Google, Facebook and Microsoft have started to invest billions in AI research. "The pace of progress in artificial general intelligence is incredibly fast" (Elon Musk, CEO of Tesla & SpaceX), leading to an AI that "would be either the best or the worst thing ever to happen to humanity" (Stephen Hawking, physicist).
What sparked this new hype? How is deep learning different from previous approaches? Are the advancing AI technologies really a threat to humanity? Let's look behind the curtain and unravel the reality. This talk will explore why Sundar Pichai (CEO of Google) recently announced that "machine learning is a core transformative way by which Google is rethinking everything they are doing" and explain why "deep learning is probably one of the most exciting things that is happening in the computer industry" (Jen-Hsun Huang, CEO of NVIDIA).
Either a new AI "winter is coming" (Ned Stark, House Stark), or this new wave of innovation might turn out to be the "last invention humans ever need to make" (Nick Bostrom, AI philosopher). Or maybe it's just another great technology helping humans achieve more.
This covers end-to-end material on neural networks, CNN internals, TensorFlow and Keras basics, intuition on object detection and face recognition, and AI on Android x86.
Machine learning in science and industry - day 4 (arogozhnikov)
- tabular data approach to machine learning and when it didn't work
- convolutional neural networks and their application
- deep learning: history and today
- generative adversarial networks
- finding optimal hyperparameters
- joint embeddings
UNSUPERVISED LEARNING MODELS OF INVARIANT FEATURES IN IMAGES: RECENT DEVELOPM... - ijscai
Object detection and recognition are important problems in the computer vision and pattern recognition domain. Human beings are able to detect and classify objects effortlessly, but replicating this ability in computer-based systems has proved to be a non-trivial task. In particular, despite significant research efforts focused on meta-heuristic object detection and recognition, robust and reliable real-time object recognition systems remain elusive. Here we present a survey of one particular approach that has proved very promising for invariant feature recognition and which is a key initial stage of multi-stage network architecture methods for the high-level task of object recognition.
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with question bank - Asst. Prof. M. Gokilavani
UNIT I INTRODUCTION
Neural Networks; Application Scope of Neural Networks; Artificial Neural Network: An Introduction; Evolution of Neural Networks; Basic Models of Artificial Neural Network; Important Terminologies of ANNs; Supervised Learning Network.
Deep learning is now bringing artificial intelligence close to human capability. Machine learning and deep artificial neural networks model aspects of the human brain. This success is due to large-scale storage and computation, combined with efficient algorithms that can handle more behavioral and cognitive problems.
Deep Learning - The Past, Present and Future of Artificial Intelligence - Lukas Masuch
In the last couple of years, deep learning techniques have transformed the world of artificial intelligence. One by one, the abilities and techniques that humans once imagined were uniquely our own have begun to fall to the onslaught of ever more powerful machines. Deep neural networks are now better than humans at tasks such as face recognition and object recognition. They've mastered the ancient game of Go and thrashed the best human players. "The pace of progress in artificial general intelligence is incredibly fast" (Elon Musk, CEO of Tesla & SpaceX), leading to an AI that "would be either the best or the worst thing ever to happen to humanity" (Stephen Hawking, physicist).
What sparked this new hype? How is deep learning different from previous approaches? Let's look behind the curtain and unravel the reality. This talk will introduce the core concepts of deep learning, explore why Sundar Pichai (CEO of Google) recently announced that "machine learning is a core transformative way by which Google is rethinking everything they are doing", and explain why "deep learning is probably one of the most exciting things that is happening in the computer industry" (Jen-Hsun Huang, CEO of NVIDIA).
1. Deep Neural Networks Tutorial
Andrey Filchenkov
Computer Technology Chair
Computer Technologies Lab
ITMO University
afilchenkov@corp.ifmo.ru
AINL FRUCT'16, St. Petersburg, Russia
2. Tutorial topics
Very brief introduction to AI, ML and ANN
What is ANN and how to learn it
DNN and standard DNN architectures
Beyond discriminative models
3. Next topic
Very brief introduction to AI, ML and ANN
What is ANN and how to learn it
DNN and standard DNN architectures
Beyond discriminative models
4. Artificial intelligence
Strong AI (Artificial General Intelligence): functionality is similar to the human
brain or better.
Weak AI: good at solving certain well-formulated tasks.
Machine learning is a part of Weak AI.
Many people have long thought that artificial neural
networks are a path to Strong AI.
Many people now think that deep learning networks are a
path to Strong AI.
5. Neural networks as a machine learning algorithm
The neural paradigm is not only about machine learning
(computer architecture, computations, etc.).
Machine learning is about creating algorithms that can learn
patterns, regularities and rules from given data.
The biggest part of machine learning is supervised learning:
we are given a set of objects, each with a label, and we want to
learn how to find these labels for objects we have never seen.
7. Brief early history of artificial neural networks
1943 Artificial neuron by McCulloch and Pitts
1949 Neuron learning rule by Hebb
1957 Perceptron by Rosenblatt
1960 Perceptron learning rule by Widrow and Hoff
1968 Group Method of Data Handling to learn multilayered
networks by Ivakhnenko
1969 Perceptrons by Minsky and Papert
1974 Backpropagation algorithm by Werbos and by Galushkin
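Rosenblatt's perceptron and its mistake-driven weight update from the timeline above can be sketched in a few lines of NumPy; the toy data and hyperparameters below are illustrative assumptions:

```python
import numpy as np

def train_perceptron(X, y, epochs=100, lr=0.1):
    """Perceptron learning rule: on each mistake, nudge the weights
    toward the misclassified example. Labels y must be in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:  # misclassified (or on the boundary)
                w += lr * yi * xi
                b += lr * yi
    return w, b

# Toy linearly separable data: class +1 lies above the line x1 + x2 = 0,
# with points too close to the separator removed to guarantee a margin.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X = X[np.abs(X.sum(axis=1)) > 0.3]
y = np.where(X.sum(axis=1) > 0, 1, -1)

w, b = train_perceptron(X, y)
pred = np.where(X @ w + b > 0, 1, -1)
print((pred == y).mean())  # prints 1.0 on this separable data
```

The perceptron convergence theorem guarantees a finite number of mistakes on linearly separable data, which is why the margin filter above matters: without a margin, convergence can be very slow.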
8. Brief modern history of ANN
1980 Convolutional NN by Fukushima
1982 Recurrent NN by Hopfield
1991 Vanishing gradient problem identified by Hochreiter
1997 Long short term memory network by Hochreiter and
Schmidhuber
1998 Gradient descent for convolutional NN by LeCun et al.
2006 Deep model by Hinton, Osindero and Teh
2012 DNN started to become mainstream in ML and AI
9. Next topic
Very brief introduction to AI, ML and ANN
What is ANN and how to learn it
DNN and standard DNN architectures
Beyond discriminative models
9 / 64
10. Two sources of knowledge
Experts
we need to ask them wisely and process the answers
Data
we need to process and apply machine learning algorithms
How do we obtain knowledge?
10 / 64
11. Algorithms whose performance grows with experience
The most popular task is prediction
Algorithms require data and labels (for prediction)
Learning in these algorithms means minimizing the error rate in
prediction, or maximizing the similarity to the known answers
Machine learning
11 / 64
12. Each object is represented
as a feature vector. Each
object thus is a point in a
multidimensional space.
Vector representation of objects
12 / 64
16. Next topic
Very brief introduction to AI, ML and ANN
What is ANN and how to learn it
DNN and standard DNN architectures
• Deep learning introduction and best practices
• Deep Boltzmann Machines (DBM) and Deep Belief Network
(DBN)
• Convolutional Neural Network (CNN)
• Autoencoders
• Recurrent Neural Network (RNN) and Long Short-Term
Memory (LSTM)
Beyond discriminative models
16 / 64
17. Deep architecture
Definition: Deep architectures are composed of multiple levels of
non-linear operations, such as neural nets with many hidden
layers
Most machine learning algorithms have shallow (1–3 layers)
architecture (SVM, PCA, kNN, Logistic Regression, etc.)
Goal: Deep learning methods aim at:
Learning feature hierarchies, no more feature engineering!
Where features from higher levels of the hierarchy are formed
by lower level features.
17 / 64
18. Why go deep?
Some functions cannot be efficiently represented (in terms of
number of tunable elements) by architectures that are too shallow
Functions that can be compactly represented by a depth k
architecture might require an exponential number of computational
elements to be represented by a depth k−1 architecture
Deep Representations might allow non-local generalization and
comprehensibility
Deep learning gets state of the art results in many fields (vision,
audio, NLP, etc.)!
18 / 64
19. DNN best practices: ReLU, PReLU
ReLU
19 / 64
PReLU
Sigmoid and hyperbolic tangent activation functions suffer from
vanishing gradients and tend to overfit
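Both activations can be sketched directly (a minimal NumPy sketch; in a real network the PReLU slope `alpha` is learned per channel, here it is fixed for illustration):

```python
import numpy as np

def relu(x):
    # ReLU: max(0, x); the gradient is 1 for x > 0, so it does not vanish
    return np.maximum(0.0, x)

def prelu(x, alpha=0.25):
    # PReLU: like ReLU, but the negative side has a (learnable) slope alpha
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
r, p = relu(x), prelu(x)
```

Unlike sigmoid or tanh, neither function saturates for large positive inputs, which is what keeps gradients flowing in deep stacks.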
20. DNN best practices: Data augmentation
The easiest and most common method to reduce overfitting on
image data is to artificially enlarge the dataset using label-
preserving transformations.
Types of data augmentation:
Image translation
Horizontal/vertical reflections + cropping
Changing RGB intensities
20 / 64
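The three augmentation types above can be sketched with plain array operations (a minimal NumPy sketch on a dummy image; sizes, offsets and jitter ranges are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3))          # dummy 32x32 RGB image

flipped = img[:, ::-1, :]                              # horizontal reflection
top, left = 2, 3
crop = img[top:top + 24, left:left + 24, :]            # translated 24x24 crop
shift = rng.integers(-10, 11, size=3)                  # per-channel RGB jitter
shifted = np.clip(img + shift, 0, 255)                 # keep valid pixel range
```

Each transform keeps the label unchanged while producing a new training example, which is why the dataset is "artificially enlarged".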
21. DNN best practices: Dropout
Dropout: set the output of each hidden neuron to zero with probability 0.5
The neurons which are "dropped out" in this way do not contribute
to the forward pass and do not participate in backpropagation
So every time an input is presented, the neural network samples a
different architecture, but all these architectures share weights
This technique reduces complex co-adaptations of neurons, since a
neuron cannot rely on the presence of particular other neurons
It is, therefore, forced to learn more robust features that are useful
in conjunction with many different random subsets of the other
neurons
Without dropout, a network exhibits substantial overfitting
Dropout roughly doubles the number of iterations required to
converge
21 / 64
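The random masking above can be sketched in a few lines (a minimal NumPy sketch of the now-common "inverted" variant, which rescales at train time; the original formulation instead halves the outputs at test time, and the two are equivalent in expectation):

```python
import numpy as np

def dropout(h, p=0.5, train=True, rng=np.random.default_rng(0)):
    # Train time: zero each activation independently with probability p,
    # and scale the survivors by 1/(1-p) so the expected activation
    # matches test time, when the layer is simply the identity.
    if not train:
        return h
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

h = np.ones(1000)
out = dropout(h)   # roughly half the entries are zeroed, the rest become 2.0
```

Because a fresh mask is drawn for every input, the network effectively samples a different architecture each time, with all architectures sharing weights.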
22. Greedy Layer-Wise Training (1/2)
1. Train first layer using your data without the labels (unsupervised)
• Since there are no targets at this level, labels don't help. Could
also use the more abundant unlabeled data which is not part of
the training set (i.e. self-taught learning).
2. Then freeze the first layer parameters and start training the second
layer using the output of the first layer as the unsupervised input
to the second layer
3. Repeat this for as many layers as desired
• This builds our set of robust features
4. Use the outputs of the final layer as inputs to a supervised
layer/model and train the last supervised layer(s) (leave early
weights frozen)
5. Unfreeze all weights and fine tune the full network by training with
a supervised approach, given the pre-processed weight settings
22 / 64
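The five steps can be sketched as a loop over layers (a minimal NumPy sketch; the unsupervised step is stood in for by PCA via SVD, a simplifying assumption — a linear autoencoder converges to the same subspace — not the tutorial's prescribed method):

```python
import numpy as np

def train_layer_unsupervised(X, n_hidden):
    # Stand-in for unsupervised layer training: take the top principal
    # directions of X as the layer's weights.
    X = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:n_hidden].T                      # weights to freeze

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                  # unlabeled training data

# Steps 1-3: train each layer on the output of the previous (frozen) one
weights, H = [], X
for n_hidden in (10, 5):
    W = train_layer_unsupervised(H, n_hidden)
    weights.append(W)                           # freeze this layer
    H = np.tanh(H @ W)                          # feed activations forward

# Step 4: H would now feed a supervised layer/model;
# step 5 would unfreeze all weights and fine-tune the whole stack.
```

The key point the sketch preserves is that labels are never touched until step 4, so abundant unlabeled data can drive steps 1–3.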
23. Greedy Layer-Wise Training (2/2)
Greedy layer-wise training avoids many of the problems of trying to
train a deep net in a supervised fashion:
• Each layer gets full learning focus in its turn since it is the only
current "top" layer
• Can take advantage of unlabeled data
• When you finally tune the entire network with supervised training
the network weights have already been adjusted so that you are in
a good error basin and just need fine tuning. This helps with
problems of:
• Ineffective early layer learning
• Deep network local minima
23 / 64
24. Restricted Boltzmann machine
Two types of nodes: hidden and visible.
Training minimizes the system energy by updating the
weights as new objects are propagated through the network.
The probability distribution over the visible and hidden
layers is a Gibbs distribution.
24 / 64
25. Deep Belief Network
25 / 64
First train, unsupervised, several (here two)
levels of RBMs (or autoencoders).
Then train the next layers, supervised,
one after another.
Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313:504-507.
27. Convolutional Neural Network (CNN)
Core concepts:
Local perception – each neuron sees a small part of the
object. Use kernels (filters) to capture 1-D or 2-D structure of
objects. For instance, capture all pixel neighbors for an image.
Weight sharing – use the same small set of kernels for all
objects; this reduces the number of adjustable
parameters compared to an MLP
Subsampling/pooling – use dimensionality reduction for
images in order to provide invariance to scale
27 / 64
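All three core concepts can be sketched without a framework (a minimal NumPy sketch; the kernel and image sizes are illustrative assumptions):

```python
import numpy as np

def conv2d(img, kernel):
    # Weight sharing: the SAME small kernel is slid over every position
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Local perception: each output sees only a kh x kw patch
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, k=2):
    # Subsampling/pooling: keep the max of each k x k block
    H, W = x.shape
    return x[:H - H % k, :W - W % k].reshape(H // k, k, W // k, k).max(axis=(1, 3))

img = np.arange(36.0).reshape(6, 6)
edge = np.array([[1.0, -1.0]])       # tiny horizontal-difference kernel
fmap = conv2d(img, edge)             # feature map, shape (6, 5)
pooled = max_pool(fmap)              # downsampled map, shape (3, 2)
```

One kernel here uses 2 parameters regardless of image size; a fully connected layer over the same image would need one weight per pixel per output.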
31. What do trained kernels look like?
31 / 64
(Figure: low-level features → mid-level features → high-level features)
Each kernel composes a local patch of lower-level features
into high level representation
32. Levels of abstraction
Hierarchical Learning:
Natural progression from low
level to high level structure
as seen in natural complexity
Easier to monitor what is
being learnt and to guide the
machine to better subspaces
A good lower level
representation can be used
for many distinct tasks
32 / 64
33. LeNet
33 / 64
LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the
IEEE 86.11 (1998): 2278-2324.
34. GoogLeNet
34 / 64
Szegedy, Christian, et al. "Going deeper with convolutions." arXiv preprint
arXiv:1409.4842 (2014).
39. Autoencoders (1/3)
Autoencoder: a feed-forward neural
network trained to reproduce its input
at the output layer
Does non-linear dimensionality
reduction
Trained via backpropagation
A 1-layer autoencoder gets results
similar to PCA
39 / 64
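The PCA connection can be demonstrated with a minimal tied-weight linear autoencoder trained by backpropagation (a sketch under simplifying assumptions: linear units, tied encoder/decoder weights, plain gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))            # 200 objects, 8 features

# One hidden layer of 3 units: encode with W, decode with W.T (tied weights)
W = rng.normal(scale=0.1, size=(8, 3))
lr = 0.01
for _ in range(500):
    err = X @ W @ W.T - X                # reconstruction error at the output
    # Gradient of the squared reconstruction loss w.r.t. the tied weights
    grad = X.T @ err @ W + err.T @ X @ W
    W -= lr * grad / len(X)

loss = np.mean((X @ W @ W.T - X) ** 2)   # falls toward the PCA residual
```

With linear units the network can do no better than projecting onto the top 3 principal directions, which is exactly the "similar results to PCA" claim; non-linear hidden layers are what let deep autoencoders beat PCA.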
42. Autoencoders in bioinformatics
42 / 64
Fakoor, Rasool, et al. "Using deep learning to enhance cancer diagnosis and classification." Proceedings of the International
Conference on Machine Learning. 2013.
43. Deep autoencoders: document processing
We can use an autoencoder to find low-dimensional codes for
documents that allow fast and accurate retrieval of similar
documents from a large set.
We start by converting each document into a "bag of
words". This is a 2000-dimensional vector that contains the
counts for each of the 2000 commonest words.
43 / 64
44. Deep autoencoders: document retrieval
We train the neural network to
reproduce its input vector as
its output
This forces it to compress as
much information as possible
into the 10 numbers in the
central bottleneck.
These 10 numbers are then a
good way to compare
documents.
44 / 64
(Architecture: 2000 word counts → 500 neurons → 250 neurons → 10-unit bottleneck → 250 neurons → 500 neurons → 2000 reconstructed counts)
45. Deep autoencoders: document visualization (1/2)
Instead of using codes to
retrieve documents, we can
use 2-D codes to visualize sets
of documents.
This works much better than
2-D PCA
45 / 64
(Architecture: 2000 word counts → 500 neurons → 250 neurons → 2-unit bottleneck → 250 neurons → 500 neurons → 2000 reconstructed counts)
49. Long Short-Term Memory (LSTM)
LSTM: a special case of RNN capable of learning long-term
dependencies
There are four neural network layers in repeating module
49 / 64
Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780.
50. LSTM: Cell state
Cell state: runs straight down the entire chain, with only
some minor linear interactions
LSTM does have the ability to remove or add information
to the cell state, carefully regulated by structures called
gates. Gates are a way to optionally let information
through.
The sigmoid layer outputs numbers
between zero and one, describing
how much of each component
should be let through. LSTM has
3 gates.
50 / 64
51. LSTM: Forget gate layer
It looks at h_{t−1} and x_t, and outputs a number between 0 and
1 for each number in the cell state C_{t−1}. 1 represents
"completely keep this" while 0 represents "completely get rid
of this".
51 / 64
52. LSTM: Input gate layer (1/2)
How to decide what new information to store in the cell state?
First, a sigmoid layer called the "input gate layer" decides
which values should be updated.
Next, a tanh layer creates a vector of new candidate values, C̃_t,
that could be added to the state. In the next step, we'll
combine these two to create an update to the state
52 / 64
53. LSTM: Input gate layer (2/2)
It's now time to update the old cell state, C_{t−1}, into the new
cell state C_t. The previous steps already decided what to do,
we just need to actually do it
We multiply the old state by f_t, forgetting the things we
decided to forget earlier. Then we add i_t ⋅ C̃_t. This is the
new candidate values, scaled by how much we decided to
update each state value
53 / 64
54. LSTM: Output gate layer
The output will be based on cell state, but will be a filtered
version. First, we run a sigmoid layer which decides what
parts of the cell state we're going to output. Then, we put the
cell state through tanh (to push the values to be between −1
and 1) and multiply it by the output of the sigmoid gate, so
that we only output the parts we decided to.
54 / 64
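The three gates described above can be combined into one step function (a minimal NumPy sketch; packing all four layers into a single weight matrix `W`, and the variable names, are illustrative conventions, not the paper's notation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    # One repeating module: W maps [h_{t-1}, x_t] to all four layers at once
    z = np.concatenate([h_prev, x_t]) @ W + b
    f, i, o, g = np.split(z, 4)
    f = sigmoid(f)                    # forget gate: what to keep of C_{t-1}
    i = sigmoid(i)                    # input gate: which candidates to admit
    o = sigmoid(o)                    # output gate: what part of the state to emit
    C_tilde = np.tanh(g)              # candidate values
    C_t = f * C_prev + i * C_tilde    # new cell state
    h_t = o * np.tanh(C_t)            # output: filtered, squashed cell state
    return h_t, C_t

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(scale=0.1, size=(n_hid + n_in, 4 * n_hid))
b = np.zeros(4 * n_hid)
h, C = np.zeros(n_hid), np.zeros(n_hid)
h, C = lstm_step(rng.normal(size=n_in), h, C, W, b)
```

Note how the cell state update `f * C_prev + i * C_tilde` contains only elementwise operations, which is the "straight line with minor linear interactions" that lets gradients survive long sequences.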
56. Deep learning analysis: advantages
An extremely powerful model, which can potentially solve any
machine learning problem.
An already trained model can be reused: multi-task support
Many good models are already known which are state-of-the-
art for many tasks:
• image recognition;
• speech recognition;
• natural language processing;
• etc.
56 / 64
57. Deep learning analysis: disadvantages
The deeper the net is
• the more data you need;
• the more time you need;
• the more powerful processors you need.
Usually no intuition how it works exactly;
Usually you work with DNN as a black box;
Prone to overfitting: regularization must be used.
57 / 64
58. Next topic
Very brief introduction to AI, ML and ANN
What is ANN and how to learn it
DNN and standard DNN architectures
Beyond discriminative models
58 / 64
59. Reverse the network and make it predict images given labels
Image synthesis
Dosovitskiy, A., Tobias Springenberg, J., & Brox, T. (2015). Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (pp. 1538-1546). 59 / 64
60. Keep the inner representation of an image
(the Gram matrices G_l of the convolutional layers)
Then start from a new random image and
optimize it to have an inner representation
similar to the one we have kept.
Texture synthesis
Gatys, L., Ecker, A. S., & Bethge, M. (2015). Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing
Systems (pp. 262-270). 60 / 64
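The kept "inner representation" per layer is a Gram matrix of channel correlations; computing it is short (a minimal NumPy sketch on random activations; the channel/spatial sizes and the 1/(H·W) normalization are illustrative assumptions):

```python
import numpy as np

def gram_matrix(features):
    # features: (channels, height, width) activations of one conv layer.
    # G[i, j] correlates channels i and j with all spatial positions
    # summed out -- discarding layout is what makes G encode "texture".
    C, H, W = features.shape
    F = features.reshape(C, H * W)
    return F @ F.T / (H * W)

F = np.random.default_rng(0).normal(size=(16, 8, 8))
G = gram_matrix(F)    # (16, 16), symmetric
```

Texture synthesis then minimizes the difference between the Gram matrices of the random image and of the kept image, layer by layer.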
61. Style = texture.
Content = the image,
represented by the
last convolutional
layer.
We will learn an
image that is similar
both to the style and
to the content.
Style transmission
Gatys, L. A., Ecker, A. S., & Bethge, M. (2015). A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576.
61 / 64
62. DeepArt was created in 2015:
https://deepart.io/
They implemented the algorithm
described before.
DeepArt and Prisma
62 / 64
Prisma was created in June
2016.
They optimized the algorithm,
brought it to mobile, and offer
preselected filters (instead of
arbitrary styles)
63. Materials
Presentation was prepared using:
1. http://avisingh599.github.io/deeplearning/visual-qa/
2. http://colah.github.io/posts/2015-08-Understanding-LSTMs/
3. https://class.coursera.org/ml-003/lecture
4. K. Vorontsov's machine learning course (in Russian)
63 / 64