Evolution of Deep Learning:
New Methods and Applications
Chitta Ranjan, Ph.D.
Pandora Media.
Feb 15, 2018
nk.chitta.ranjan@gmail.com
1
Evolution of Deep Learning
Outline
• Background
• Challenges
• Solutions
2
Evolution of Deep Learning
How does our brain work?
• How do we know where the ball
will fall?
3
Evolution of Deep Learning
How does our brain work?
• How do we know where the ball
will fall?
• Do we solve these equations in our
head? No.
4
$$h = \frac{v_0^2 \sin^2\theta}{2g}, \qquad R = \frac{v_0^2 \sin 2\theta}{g}, \qquad t = \frac{2 v_0 \sin\theta}{g}$$
Evolution of Deep Learning
How does our brain work?
• How do we know where the ball
will fall?
• Do we solve these equations in our
head? No.
• Perhaps we break the problem into
pieces and solve it.
5
Evolution of Deep Learning
Traditional block model
One model for the whole problem
6
• One solver to solve it all.
• Has limitations for complex
problems.
$$h = \frac{v_0^2 \sin^2\theta}{2g}, \qquad R = \frac{v_0^2 \sin 2\theta}{g}, \qquad t = \frac{2 v_0 \sin\theta}{g}$$
Evolution of Deep Learning
Neural Network
7
• A neuron solves a piece of the big
problem.
• Understand the inter-relationships
between the pieces.
• Merge the small solutions to find
the solution.
Evolution of Deep Learning
Neural Network
8
• Can we have bidirectional
connections?
Evolution of Deep Learning
Neural Network
9
• Can we have bidirectional
connections?
• Can we have edges connecting
neurons in the same layer?
Evolution of Deep Learning
Neural Network
10
• Can we have bidirectional
connections?
• Can we have edges connecting
neurons in the same layer?
• Is Neural Network an Ensemble
model?
Evolution of Deep Learning
Birth of Neural Network
11
Evolution of Deep Learning
Perceptron (1958)
12
Rosenblatt, F. (1960). Perceptron simulation experiments. Proceedings of the IRE, 48(3), 301-309.
Evolution of Deep Learning
Perceptron (1958)
[Figure: Perceptron. Inputs $x_1, x_2, x_3$ with weights $w_1, w_2, w_3$ feed a summation $\sum_i w_i x_i$, followed by a non-linear threshold that outputs $+1$ or $-1$.]
13
• A non-linear computation cell.
• Non-linear cells became the
building block of Neural Networks.
Rosenblatt, F. (1960). Perceptron simulation experiments. Proceedings of the IRE, 48(3), 301-309.
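As a concrete illustration (not part of the original slides), a minimal NumPy sketch of the perceptron's forward pass, assuming three inputs, hand-picked weights, and a sign threshold:

```python
import numpy as np

def perceptron(x, w, b=0.0):
    """Weighted sum of inputs followed by a non-linear +1/-1 threshold."""
    s = np.dot(w, x) + b          # linear part: sum_i w_i * x_i + b
    return 1 if s >= 0 else -1    # non-linear part: hard threshold

# Example with hypothetical values.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.4, 0.3, 0.9])
print(perceptron(x, w))           # -> 1
```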
Evolution of Deep Learning
Multi-layer Perceptron (1986)
14
• Nodes are Perceptrons.
• Layers of Perceptrons.
• Relationships (weights on arcs)
found using newly-developed
Backpropagation.
The nonlinear part is critical. Without it, it is equivalent to the big block model.
Rumelhart, David E., Geoffrey E. Hinton, and R. J. Williams. "Learning Internal Representations by
Error Propagation". David E. Rumelhart, James L. McClelland, and the PDP research group.
(editors), Parallel distributed processing: Explorations in the microstructure of cognition, Volume 1:
Foundation. MIT Press, 1986.
Evolution of Deep Learning
Multi-layer Perceptron (1986)
15
• Nodes are Perceptrons.
• Layers of Perceptrons.
• Relationships (weights on arcs)
found using newly-developed
Backpropagation.
The nonlinear part is critical. Without it, it is equivalent to the big block model.
Evolution of Deep Learning
Multi-layer Perceptron (1986)
16
• Nodes are Perceptrons.
• Layers of Perceptrons.
• Relationships (weights on arcs)
found using newly-developed
Backpropagation.
The nonlinear part is critical. Without it, it is equivalent to the big block model.
Rumelhart, David E., Geoffrey E. Hinton, and R. J. Williams. "Learning Internal Representations by
Error Propagation". David E. Rumelhart, James L. McClelland, and the PDP research group.
(editors), Parallel distributed processing: Explorations in the microstructure of cognition, Volume 1:
Foundation. MIT Press, 1986.
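For intuition, a minimal sketch of a multi-layer perceptron in Keras (an illustrative choice of library and layer sizes, not the 1986 formulation): each Dense layer is a layer of perceptron-like units with a non-linear activation, and fitting the compiled model trains the weights (arcs) with backpropagation.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical sizes: 10 input features, two hidden layers, binary output.
mlp = keras.Sequential([
    keras.Input(shape=(10,)),
    layers.Dense(16, activation="tanh"),    # a layer of non-linear units
    layers.Dense(16, activation="tanh"),
    layers.Dense(1, activation="sigmoid"),  # output unit
])
mlp.compile(optimizer="sgd", loss="binary_crossentropy")
```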
Evolution of Deep Learning
Some definitions
17
[Figure: a labeled network diagram illustrating these terms]
• Activation function
• Neuron / node
• Layer
• Network depth
• Network width
• Weight / connection / arc
• Input
• Output
Evolution of Deep Learning
We learned..
18
Evolution of Deep Learning
So far we learned
• Problem to be broken into pieces (at
nodes).
• Non-linear decision makers.
19
Evolution of Deep Learning
Timeline
20
Evolution of Deep Learning
[Timeline figure: number of layers vs. year]
• 1958: Perceptron
• 1969: Perceptron criticized—XOR problem
• AI Winter I (1974–80)
• 1980: CNN—Neocognitron
• 1986: Multilayer Perceptron—Backpropagation
• AI Winter II (1987–93)
• 1997: LSTM
• 1998: CNN for handwritten images (MNIST)
• 2006: DBN—Faster learning
• 2012: Dropout, ReLU, AlexNet (8 layers)
• 2014: VGG Net (19 layers), GoogLeNet (22 layers*)
• 2015: ResNet (152 layers)
• 2017: Capsules, SeLU
*The overall number of layers (independent
building blocks) used for the construction of
the network is about 100.
21
Evolution of Deep Learning
Challenges
Computation
GPU!
22
Evolution of Deep Learning
Challenges
Computation
GPU!
23
Evolution of Deep Learning
Challenges
Estimation
Overfitting
Vanishing gradient
Dropout
Activation functions
24
Evolution of Deep Learning
Dropout
25
Evolution of Deep Learning
Let’s take a step back..
26
• Learning becomes difficult in large
networks.
• Off-the-shelf L1/L2 regularization
was used.
• They did not work.
Evolution of Deep Learning
Silenced by L1 (L2)
• Regularization happens based on the
predictive/information capability of a node.
27
Evolution of Deep Learning
Silenced by L1 (L2)
• Regularization happens based on the
predictive/information capability of a node.
• The weak nodes are always
(deterministically) thrown out.
• Weak nodes do not get a say.
28
*Loosely speaking
Evolution of Deep Learning
Co-adaptation
• Nodes co-adapt.
• They rely on the presence of other nodes.
• A few nodes do the heavy lifting while others
do nothing.
29
Evolution of Deep Learning 30
Wide networks don’t really help.
Evolution of Deep Learning
Dropout (2014)
• The presence of a node is a matter of chance
31
Silencing Co-adaptation
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I.,& Salakhutdinov, R. (2014).
Dropout: A simple way to prevent neural networks from overfitting. The Journal of
Machine Learning Research, 15(1), 1929-1958.
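A minimal sketch (assumed details, not the paper's code) of "inverted" dropout during training: each activation survives only by chance, so no node can rely on a specific partner being present.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, keep_prob=0.8, training=True):
    """Randomly silence activations; scale survivors so the expected value is unchanged."""
    if not training:
        return a                               # at test time, use all nodes
    mask = rng.random(a.shape) < keep_prob     # Bernoulli(keep_prob) gate per node
    return a * mask / keep_prob                # inverted-dropout scaling

a = np.array([0.3, 1.2, -0.7, 2.1])
print(dropout(a))
```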
Evolution of Deep Learning
Dropout with Gaussian gate (2017)
• Regular dropout: multiply activations
with Bernoulli RVs.
• Generalization: Multiply with any RV.
32
!"
!#
!$
!%
~'(!)(+)
~'(!)(+)
~'(!)(+)
~'(!)(+)
Molchanov, D., Ashukha, A.,&Vetrov, D. (2017). Variational dropout sparsifies deep
neural networks. arXiv preprint arXiv:1701.05369.
Evolution of Deep Learning
Dropout with Gaussian gate (2017)
• Regular dropout: multiply activations
with Bernoulli RVs.
• Generalization: Multiply with any RV.
• Gaussian gates are found to improve
dropout’s performance.
33
!"
!#
!$
!%
~'(!)(+)
~'(!)(+)
~'(!)(+)
~'(!)(+)
~-(0,1)
~-(0,1)
~-(0,1)
~-(0,1)
Molchanov, D., Ashukha, A.,&Vetrov, D. (2017). Variational dropout sparsifies deep
neural networks. arXiv preprint arXiv:1701.05369.
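Continuing the earlier dropout sketch (an illustration under assumed parameters, not the authors' code), the generalization simply swaps the Bernoulli gate for another random variable, e.g. a Gaussian one with mean 1 so the expected activation is preserved:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_dropout(a, stddev=0.5):
    """Multiply each activation by a Gaussian gate with mean 1 (so E[output] = a)."""
    gate = rng.normal(loc=1.0, scale=stddev, size=a.shape)
    return a * gate

a = np.array([0.3, 1.2, -0.7, 2.1])
print(gaussian_dropout(a))
```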
Evolution of Deep Learning
Activation functions
34
Evolution of Deep Learning
Vanishing Gradient in Deep Networks
35
• Learning was still difficult in large
networks.
• Activation functions at the time
caused the gradient to vanish in
lower layers.
• Difficult to learn weights.
Backpropagation
Evolution of Deep Learning 36
Deep networks don’t really help.
Evolution of Deep Learning
Vanishing gradient
• Because sigmoid and tanh functions had saturation regions on both
sides.
37
[Figure: the sigmoid (left) and tanh (right) activation functions, each with saturation regions on both sides.]
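To see why saturation matters, a small numeric illustration (not from the slides): the sigmoid's derivative is at most 0.25, so backpropagating through many sigmoid layers multiplies many small factors together and the gradient shrinks toward zero in the lower layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)        # at most 0.25, near zero in the saturated tails

print(sigmoid_grad(0.0))      # 0.25 (best case)
print(sigmoid_grad(5.0))      # ~0.0066 (saturated)
print(0.25 ** 20)             # gradient factor after ~20 sigmoid layers: ~9e-13
```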
Evolution of Deep Learning
New Activations
Resolving vanishing gradient
Rectified Linear Unit (ReLU), 2013
38
Maas, A. L., Hannun, A. Y.,&Ng, A. Y. (2013, June). Rectifier nonlinearities improve
neural network acoustic models. In Proc. icml (Vol. 30, No. 1, p. 3).
Clevert, D. A., Unterthiner, T.,&Hochreiter, S. (2015). Fast and accurate deep network learning
by exponential linear units (elus). arXiv preprint arXiv:1511.07289.
Exponential Linear Unit (ELU), 2016
Saturation region on only one side (left) for these activations.
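For reference, the two activations as simple NumPy functions (a sketch; $\alpha = 1$ is the commonly used ELU default):

```python
import numpy as np

def relu(z):
    """max(0, z): no saturation on the positive side."""
    return np.maximum(0.0, z)

def elu(z, alpha=1.0):
    """z for z > 0, alpha*(exp(z)-1) for z <= 0: saturates only on the left."""
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

z = np.linspace(-3, 3, 7)
print(relu(z))
print(elu(z))
```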
Evolution of Deep Learning
We learned..
39
Evolution of Deep Learning
So far we learned
• Problem to be broken into pieces (at
nodes).
• Non-linear decision makers.
• Challenges met
• Overfitting: Dropout
• Vanishing gradient: New
activations
40
Evolution of Deep Learning
Model types
41
Evolution of Deep Learning
Types of Models
• Unsupervised
• Deep Belief Networks (DBN)
• Supervised
• Feed-forward Neural Network (FNN)
• Recurrent Neural Network (RNN)
• Convolutional Neural Network (CNN)
42
Evolution of Deep Learning
Deep Belief Networks (DBN)
43
Evolution of Deep Learning
Restricted Boltzmann Machine (RBM)
• Has two layers
• Visible: Think of input data
• Hidden: Think of latent factors
• Learn features from data that can
generate the same training data.
44
[Figure: RBM with a Visible layer (data) and a Hidden layer (features).]
Evolution of Deep Learning
Restricted Boltzmann Machine (RBM)
• Has two layers
• Visible: Think of input data
• Hidden: Think of latent factors
• Learn features from data that can
generate the same training data.
• Bi-directional node relationship.
45
[Figure: RBM with a Visible layer (data) and a Hidden layer (features), with bi-directional connections.]
Evolution of Deep Learning
Deep Belief Nets (2006)
Stacked RBMs/Autoencoders
46
• Fast greedy algorithm—learn one layer at a
time.
• Feature extraction and unsupervised pre-
training.
• MNIST digit classification: yielded much better
accuracy.
• Used on sensor data.
• Became a dying technology after the vanishing
gradient problem was resolved with the new ReLU
and ELU activations.
Hinton, G. E., Osindero, S.,&Teh, Y. W. (2006). A fast learning
algorithm for deep belief nets. Neural computation, 18(7),
1527-1554.
Evolution of Deep Learning
Multimodal Modeling (2012)
Comeback of DBN
Image data Text data
47
Yellow,
flower
+
• Used to create fused representations by
combining features across modalities.
• Representations useful for classification
and information retrieval.
• Works even if
• Some data modalities are missing, e.g. image-
text.
• Different observation frequencies, e.g. sensor
data.
Srivastava, N.,& Salakhutdinov, R. R. (2012). Multimodal learning with deep
boltzmann machines. In Advances in neural information processing systems (pp.
2222-2230).
Liu, Z., Zhang, W., Quek, T. Q.,&Lin, S. (2017, March). Deep fusion of heterogeneous
sensor data. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International
Conference on (pp. 5965-5969). IEEE.
Evolution of Deep Learning
Feed-forward Neural Network (FNN)
48
Evolution of Deep Learning
FNN
49
• One of the earliest types of NN—the
Multilayer Perceptron (MLP).
• No success story—learning networks more
than 4 layers deep was difficult.
• Typically only used as last (top) layers in
other networks.
• Then came SELU activation.
Evolution of Deep Learning
Scaled Exponential Linear Units (SELU), 2017
Self-normalizing Neural Networks. New life for FNNs.
50
Klambauer, G., Unterthiner, T., Mayr, A.,&Hochreiter, S. (2017). Self-normalizing neural networks.
In Advances in Neural Information Processing Systems (pp. 972-981).
• Activations automatically converge to zero
mean and unit variance.
• Converges in presence of noise and
perturbations.
• Allows us to
• train deep networks with many layers,
• employ strong regularization schemes, and
• make learning highly robust.
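A minimal Keras sketch (illustrative, with hypothetical layer sizes) of a self-normalizing FNN; the SELU paper pairs the activation with LeCun-normal weight initialization.

```python
from tensorflow import keras
from tensorflow.keras import layers

snn = keras.Sequential([
    keras.Input(shape=(20,)),                     # assumed number of input features
    *[layers.Dense(64, activation="selu",
                   kernel_initializer="lecun_normal")
      for _ in range(8)],                         # a deep stack of SELU layers
    layers.Dense(1, activation="sigmoid"),
])
snn.compile(optimizer="sgd", loss="binary_crossentropy")
```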
Evolution of Deep Learning
Recurrent Neural Network (RNN)
51
Evolution of Deep Learning
RNN
Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
52
• For temporal data.
• An RNN passes a message to a successor.
Evolution of Deep Learning
RNN
Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
53
• For temporal data.
• An RNN passes a message to a successor.
• Learns dependencies with past.
Evolution of Deep Learning
RNN
Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
54
*Bengio, Y., Simard, P.,&Frasconi, P. (1994). Learning long-term dependencies with gradient
descent is difficult. IEEE transactions on neural networks, 5(2), 157-166.
• For temporal data.
• An RNN passes a message to a successor.
• Learns dependencies with past.
• Failed to learn long-term dependencies*.
• Then came LSTM.
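A bare-bones NumPy sketch of the recurrence (shapes and weights are assumed): each step combines the current input with the message (hidden state) passed from the previous step.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 3, 4                        # hypothetical sizes
Wx = rng.normal(size=(d_hidden, d_in))       # input-to-hidden weights
Wh = rng.normal(size=(d_hidden, d_hidden))   # hidden-to-hidden ("message") weights
b  = np.zeros(d_hidden)

def rnn_step(x_t, h_prev):
    """One recurrent step: mix the new input with the state passed from the past."""
    return np.tanh(Wx @ x_t + Wh @ h_prev + b)

h = np.zeros(d_hidden)
for x_t in rng.normal(size=(5, d_in)):       # a toy sequence of 5 time steps
    h = rnn_step(x_t, h)
print(h)
```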
Evolution of Deep Learning
Long short-term memory (LSTM), 1997
55
Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
RNN
LSTM
Hochreiter, S.,&Schmidhuber, J. (1997). Long short-term memory. Neural
computation, 9(8), 1735-1780.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk,
H.,&Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for
statistical machine translation. arXiv preprint arXiv:1406.1078.
• A special kind of RNN capable of learning long-term
dependencies.
• The added gates regulate what information is
added to or removed from the passing state.
• Found powerful in:
• natural language processing,
• unsegmented connected handwriting recognition
• speech recognition
• Gated Recurrent Units (GRUs), 2014
• Fewer parameters than LSTM.
• Performance comparable to or lower than LSTM (so far).
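A minimal Keras sketch (hypothetical sequence length, feature count, and unit sizes) of an LSTM over temporal data; the gating is handled inside the layer.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sequences of 50 time steps with 8 features each; one prediction per sequence.
model = keras.Sequential([
    keras.Input(shape=(50, 8)),
    layers.LSTM(32),          # gated recurrent layer; layers.GRU(32) would give a GRU
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```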
Evolution of Deep Learning
Attention Based Model (2015)
• CNN together with LSTM.
• Automatically learns
• to fix gaze on salient objects.
• Object alignments.
• Object relationships with sequence of
words.
56
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., ...&Bengio, Y.
(2015, June). Show, attend and tell: Neural image caption generation with visual
attention. In International Conference on Machine Learning (pp. 2048-2057).
Fig. 1. Attention model architecture.
Fig. 2. Examples of attending to the correct object (white
indicates the attended regions, underlines indicate the
corresponding word).
Evolution of Deep Learning
Convolutional Neural Network (CNN)
57
Evolution of Deep Learning
CNN
• The workhorse of Deep Learning
• CNN revolution started with LeCun
(1998)—outperformed other
methods on handwritten digit
MNIST data.
58
LeCun, Y., Bottou, L., Bengio, Y.,&Haffner, P. (1998). Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
Fig. 1. LeCun (1998) architecture.
Evolution of Deep Learning
CNN
• The workhorse of Deep Learning
• CNN revolution started with LeCun
(1998)—outperformed other
methods on handwritten digit
MNIST data.
• CNN learns object defining features.
59
LeCun, Y., Bottou, L., Bengio, Y.,& Haffner, P. (1998). Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
Fig. 1. LeCun (1998) architecture.
Fig. 2. Feature learning in CNN.
Evolution of Deep Learning
AlexNet (2012)
New estimation techniques
60
• Performed best on ImageNet data—
ILSVRC 2012 winner.
• A difficult dataset with 1,000
categories (labels).
• Similar to LeNet-5 with 5 conv and 3
dense layers. But with
• Max Pooling
• ReLU nonlinearity
• Dropout regularization
• Data augmentation.
Krizhevsky, A., Sutskever, I.,& Hinton, G. E. (2012). Imagenet classification with deep convolutional
neural networks. In Advances in neural information processing systems (pp. 1097-1105).
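Not AlexNet itself, but a small Keras sketch showing the same ingredients the slide lists (convolutions with ReLU, max pooling, dropout) on an assumed input size and class count:

```python
from tensorflow import keras
from tensorflow.keras import layers

cnn = keras.Sequential([
    keras.Input(shape=(64, 64, 3)),                  # assumed image size
    layers.Conv2D(32, (3, 3), activation="relu"),    # ReLU nonlinearity
    layers.MaxPooling2D((2, 2)),                     # max pooling
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                             # dropout regularization
    layers.Dense(10, activation="softmax"),          # assumed 10 classes
])
cnn.compile(optimizer="sgd", loss="categorical_crossentropy")
```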
Evolution of Deep Learning
GoogLeNet (2014)
Inception module
• Introduced the idea that CNN layers can be
stacked in serial and parallel.
• A 22-layer CNN; the winner of ILSVRC 2014.
• Lets the model decide on the conv. size, e.g.
3x3 or 5x5.
• Puts each candidate convolution in parallel.
• Concatenates the resulting feature maps
before going to the next layer.
61
Image source: http://slazebni.cs.illinois.edu/spring17/lec01_cnn_architectures.pdf
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ...& Rabinovich, A.
(2015). Going deeper with convolutions (2014). arXiv preprint arXiv:1409.4842, 7.
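A minimal sketch of the parallel-then-concatenate idea (filter counts are hypothetical; the real Inception module also uses 1x1 reductions and a pooling branch):

```python
from tensorflow import keras
from tensorflow.keras import layers

def inception_block(x, filters=32):
    """Run several convolution sizes in parallel, then concatenate their feature maps."""
    branch1 = layers.Conv2D(filters, (1, 1), padding="same", activation="relu")(x)
    branch3 = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    branch5 = layers.Conv2D(filters, (5, 5), padding="same", activation="relu")(x)
    return layers.Concatenate()([branch1, branch3, branch5])

inputs = keras.Input(shape=(32, 32, 16))   # assumed feature-map shape
outputs = inception_block(inputs)
block = keras.Model(inputs, outputs)
```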
Evolution of Deep Learning
Microsoft’s ResNet (2015)
Residual Network
• Went aggressive on adding layers.
• Evaluated depth up to 152 layers on
ImageNet—8x deeper than VGG nets but
still lower complexity.
• How deep can we go?
62
Evolution of Deep Learning
Microsoft’s ResNet (2015)
Residual Network
• How deep can we go? With more layers
• Training and test accuracy drop.
• Degradation due to difficulty in optimization.
• Introduced the Residual Network
• Residual network idea: add the learned transformation
F(x) to the input x and pass F(x) + x to the next
layer.
• Traditional CNNs: each layer learns a completely new
transformation F(x) and passes only F(x) on for further
transformation.
• The authors found the residual form easier to optimize
in very deep networks.
63
Fig. 1. Training error (left) and test error (right) on CIFAR-10
with 20- and 56-layer "plain" networks. The deeper network
has higher training error, and thus higher test error.
Fig. 2. Residual learning: a building block.
He, K., Zhang, X., Ren, S.,&Sun, J. (2016). Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer vision and
pattern recognition (pp. 770-778).
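A minimal sketch of a residual block (filter counts are assumed, and the input channel count must match the output so the addition is valid): the block learns F(x) and passes F(x) + x on.

```python
from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, filters=16):
    """Learn F(x) with two convolutions, then pass F(x) + x to the next layer."""
    fx = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    fx = layers.Conv2D(filters, (3, 3), padding="same")(fx)
    out = layers.Add()([fx, x])              # the shortcut (identity) connection
    return layers.Activation("relu")(out)

inputs = keras.Input(shape=(32, 32, 16))     # channel count matches `filters`
outputs = residual_block(inputs)
block = keras.Model(inputs, outputs)
```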
Evolution of Deep Learning
Capsules (2017)
Going to the next level
• CNNs do not understand spatial relationships
between features.
• Enter Capsules
• preserve hierarchical pose relationships
between object parts.
• make the model understand that a new view is
just another view of the same object.
• Performance
• Cut the error rate by 45%.
• Used a fraction of the data compared to a CNN.
64
Fig. 1. For a CNN, the positions of features do not matter.
Image source: https://medium.com/ai³-theory-practice-business/understanding-
hintons-capsule-networks-part-i-intuition-b4b559d1159b
Sabour, S., Frosst, N.,&Hinton, G. E. (2017). Dynamic routing between capsules.
In Advances in Neural Information Processing Systems (pp. 3859-3869).
Fig. 2. Capsules understand all images are the same object.
Evolution of Deep Learning
We learned..
65
Evolution of Deep Learning
In summary, we learned
• Problem to be broken into pieces (at
nodes).
• Non-linear decision makers.
• Challenges met
• Overfitting: Dropout
• Vanishing gradient: New activations
• Scaled Exponential Linear Units—will
bring FNNs to the forefront.
• Capsules—closer to how the brain
works.
66
Evolution of Deep Learning
In summary, we learned
• Multimodal models with DBM.
• LSTM+CNN for attention based model.
• Inception: let the model figure out the conv size.
• Residual networks: can learn deeper models.
67
Yellow,
flower
Evolution of Deep Learning
Thank you!
68
Evolution of Deep Learning
Why is non-linear activation required?
69
!"
!#
!$
%&
Given !
' " = ) " ! + + "
, " = - " (' " )
' #
= ) #
, "
+ + #
, #
= - #
(' #
)
Layer-1
Layer-2
' # = ) # , " + + #
= ) # ' " + + #
= ) #
() "
! + + "
) + + #
= ) #
) "
! + () #
+ "
+ + #
)
= )′! + +′
⇒ ' #
~	!
⇒ ' 4 ~	!
⋮ Any number of layers
collapse to one.
Processed information
transfer due to non-linear
activation
If this activation is linear, i.e. , "
= ' "
,
then it becomes equivalent to passing
the original input ! to the next layer.
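A quick numeric check of this collapse (a sketch with random matrices of assumed sizes, not from the slides): two purely linear layers are exactly equivalent to a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

# Two linear layers, no activation.
z2 = W2 @ (W1 @ x + b1) + b2

# The equivalent single layer with W' = W2 W1 and b' = W2 b1 + b2.
W_prime, b_prime = W2 @ W1, W2 @ b1 + b2
print(np.allclose(z2, W_prime @ x + b_prime))   # True: the layers collapsed to one
```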
