Deep Learning
Pierre de Lacaze
rpl@lispnyc.org
Lisp NYC
Tuesday, June 20th, 2017
Jane Street Capital
Overview
Principal Topics
1. Convolutional Neural Networks (CNNs)
2. Recurrent Neural Networks (RNNs)
Time permitting…
1. Generative Adversarial Networks (GANs)
2. Differentiable Neural Computers (DNCs)
3. Deep Reinforcement Learning (DRL)
Deep Neural Networks
• A deep neural network is a neural network with multiple
layers of hidden units.
– E.g. Multi-Layered Perceptrons (MLPs)
• Convolutional Neural Nets (CNNs)
– Biologically-inspired variants of MLPs
– Successfully used in image recognition, speech recognition
• Recurrent Neural Nets (RNNs)
– Cyclic graphs where later layers feed back into earlier layers
– Allow for a window of time into past data
– Successfully used for natural language processing.
Application: Combining CNNs & RNNs
GENERATING IMAGE DESCRIPTIONS
Together with convolutional Neural Networks, RNNs have been used as part of a model to generate
descriptions for unlabeled images. It’s quite amazing how well this seems to work. The combined model even
aligns the generated words with features found in the images.
Deep Visual-Semantic Alignments for Generating Image Descriptions. Source: http://cs.stanford.edu/people/karpathy/deepimagesent
Part 0
ANN Review &
Multi-Layered Perceptrons
(MLPs)
Multi Layered Perceptrons (MLPs) are fully
connected feed forward networks with several
layers of hidden units.
Linear Units and Perceptrons
• Linear Unit: A linear combination of weighted inputs (real-valued)
• Perceptron: Thresholded Linear Unit (discrete-valued)
Note: w0 is a bias whose purpose is to move the threshold of the activation function.
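The two unit types above can be sketched in a few lines of Python. This is only an illustration: the function names and the AND weights are hypothetical, and the perceptron is assumed to output +1 or -1.

```python
def linear_unit(weights, inputs):
    """Weighted sum w0 + w1*x1 + ... + wn*xn; weights[0] is the bias w0."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))


def perceptron(weights, inputs):
    """Thresholded linear unit: +1 if the weighted sum is positive, else -1."""
    return 1 if linear_unit(weights, inputs) > 0 else -1


# Hypothetical weights that implement logical AND over inputs in {0, 1}.
# The bias w0 = -1.5 moves the threshold so that both inputs must be on.
and_weights = [-1.5, 1.0, 1.0]
```

Changing only w0 (for example to -0.5) turns the same unit into a logical OR, which is what "moving the threshold" means in practice.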
Multi Layered Perceptrons
• These are fully connected Deep Feed Forward Networks
• Every output from previous layer is connected to every unit in the next layer
• They are typically trained using the Backpropagation Algorithm
• Backpropagation is effectively Gradient Descent applied to every unit in the network.
Image Credit: Michael Nielsen, Neural Networks and Deep Learning, Chapter 2.
Gradient Descent Motivation
Weight Space Error Surface
ANN Backpropagation Algorithm
(Using incremental gradient descent)
1. Initialize weights to small random numbers
2. Until the termination criterion is met, for each training example:
a. Compute the network outputs for the training example
b. For each output unit k compute its error:
$\delta_k = o_k (1 - o_k)(t_k - o_k)$
c. For each hidden unit h compute its error:
$\delta_h = o_h (1 - o_h) \sum_k w_{hk}\, \delta_k$
d. Update each network weight $w_{ij}$ (from unit i into unit j, with input $x_{ij}$):
$w_{ij} \leftarrow w_{ij} + \eta\, \delta_j\, x_{ij}$
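The steps above can be sketched in plain Python for a network with one hidden layer of sigmoid units. This is a minimal sketch, not the code shown in the talk: the function names, the learning rate of 0.3, and the initial weight range are assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    # One weight row per unit; index 0 of each row is the bias weight w0.
    return [[rng.uniform(-0.05, 0.05) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def forward(layer, inputs):
    return [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], inputs)))
            for w in layer]


def train_step(hidden, output, x, t, eta):
    """One incremental gradient-descent update for one example (x, t)."""
    h = forward(hidden, x)                      # step a: forward pass
    o = forward(output, h)
    # step b: error term of each output unit k
    delta_k = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # step c: error term of each hidden unit, using the weights before update
    delta_h = [hj * (1 - hj) *
               sum(output[k][j + 1] * delta_k[k] for k in range(len(output)))
               for j, hj in enumerate(h)]
    # step d: update every weight by eta * delta * input
    for k, w in enumerate(output):
        w[0] += eta * delta_k[k]
        for j, hj in enumerate(h):
            w[j + 1] += eta * delta_k[k] * hj
    for j, w in enumerate(hidden):
        w[0] += eta * delta_h[j]
        for i, xi in enumerate(x):
            w[i + 1] += eta * delta_h[j] * xi
    return sum((tk - ok) ** 2 for tk, ok in zip(t, o)) / 2


def train(examples, n_hidden, eta=0.3, epochs=500, seed=0):
    """Train on (input, target) pairs; returns both layers and the error history."""
    rng = random.Random(seed)
    n_in, n_out = len(examples[0][0]), len(examples[0][1])
    hidden = init_layer(n_in, n_hidden, rng)
    output = init_layer(n_hidden, n_out, rng)
    history = []
    for _ in range(epochs):
        history.append(sum(train_step(hidden, output, x, t, eta)
                           for x, t in examples))
    return hidden, output, history
```

Calling `train(examples, n_hidden=3)` on eight one-hot vectors paired with themselves sets up an 8 x 3 x 8 network; the returned history lets you watch the summed squared error fall from epoch to epoch.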
Thoughtful Reminder Slide
Show Code
Examples
Identity Function Example
• Tom Mitchell, Machine Learning, Chapter 4, 1st edition.
(def if-td
[[[1 0 0 0 0 0 0 0] [1 0 0 0 0 0 0 0]]
[[0 1 0 0 0 0 0 0] [0 1 0 0 0 0 0 0]]
[[0 0 1 0 0 0 0 0] [0 0 1 0 0 0 0 0]]
[[0 0 0 1 0 0 0 0] [0 0 0 1 0 0 0 0]]
[[0 0 0 0 1 0 0 0] [0 0 0 0 1 0 0 0]]
[[0 0 0 0 0 1 0 0] [0 0 0 0 0 1 0 0]]
[[0 0 0 0 0 0 1 0] [0 0 0 0 0 0 1 0]]
[[0 0 0 0 0 0 0 1] [0 0 0 0 0 0 0 1]]])
• Ran three MLPs on the identity function:
– A 1 hidden layer MLP: 8 x 3 x 8
– A 2 hidden layer MLP: 8 x 3 x 3 x 8
– A 3 hidden layer MLP: 8 x 3 x 3 x 3 x 8
MLP Training Comparisons
❶ MLP with 1 hidden layer of 3 hidden units: 4,500 iterations to converge
❷ MLP with 2 hidden layers of 3 hidden units: 28,000 iterations to converge
❸ MLP with 3 hidden layers of 3 hidden units: 1,000,000+ iterations to converge
Part 1
Convolutional Neural Nets
(CNNs)
Convolutional Neural Networks are
biologically-inspired variants of Multi Layered
Perceptrons (MLPs)
History of CNNs
• Research dates back to the 1970’s
• Seminal Paper on CNNs:
– Gradient-based learning applied to document recognition,
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, 1998
• Really took off in 2012
– ILSVRC (ImageNet Large-Scale Visual Recognition Challenge)
– 2012 ILSVRC: AlexNet, Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton
– 2013 ILSVRC: ZF Net, Matthew Zeiler and Rob Fergus, NYU
– 2014: VGG Net, Karen Simonyan and Andrew Zisserman, University of Oxford
CNN Overview
• A CNN typically consists of one or more convolutional and subsampling (pooling) layers
followed by one or more fully connected layers.
• Specifically designed to exploit 2D input such as an image or a speech signal
• Faster to train than fully connected networks.
• Sparse Connectivity
– CNNs exploit spatially-local correlation using a local connectivity pattern between units of adjacent layers.
– These are called local receptive fields
• Shared Weights
– Replicated units share the same parameterization (weight vector and bias) and form a feature map.
• Max Pooling
– A form of non-linear down-sampling. Max pooling partitions the input image into a set of non-overlapping
rectangles and, for each such sub-region, outputs the maximum value.
Local Receptive Fields
• In a fully connected network, every input in the input layer is connected to
every hidden unit.
• This prevents the network from learning spatial features of the image.
• The idea is to map (connect) small rectangular sections of the image (inputs)
to different hidden units.
• The region of the input connected to a hidden unit is called its local receptive
field. This results in sparse connectivity between the input layer and the first hidden layer.
• The stride length is the amount by which we shift the rectangular sections.
Typically the rectangular sections are shifted by 1 pixel.
• Different sets of local receptive fields form feature maps, each of which
represents a potentially different feature.
Feature Maps
• Each hidden unit in a feature map shares the same set of weights and bias
but applies them to a different spatial area of the input.
• This allows that layer to learn the same feature but for
different regions of the image.
• The complete hidden layer will in fact consist of several
feature maps. This is called a convolutional layer.
• The shared bias and weights in each feature map are often
called filters or kernels.
How Feature Maps Work
The amount by which
the local receptive field
is shifted is called the
stride length.
A stride length of 1 is
common.
All hidden units in a
feature map share the
same weights and bias.
This greatly reduces the
number of parameters in
a layer.
Image credit: Michael Nielsen’s Neural Networks and Deep Learning, Chapter 6.
Why Do Feature Maps Learn Different Features?
• From Quora: Andy Thomas
• Two reasons:
– The weights of the filters are randomly initialized
– Different feature maps reduce the cost function
• Random initialization of the weights will likely ensure each filter
converges to different local minima in the cost function. It is very
unlikely that each filter would begin to resemble other filters, as
that would almost certainly result in an increase of the cost
function and therefore no gradient descent algorithm would head
in that direction.
• Some feature maps may learn the same feature.
The Convolution Operator
• A Convolution is a simple mathematical operation common to many image
processing operators.
• Provides a way of “multiplying” two arrays of numbers of different sizes
but same dimensionality
• If the input image has M rows and N columns, and the kernel has m rows and n
columns, the output image will have M - m + 1 rows and N - n + 1 columns.
• The purpose of Convolution in a CNN is to extract features from the input image.
• Convolution preserves the spatial relationship between pixels by learning image
features using small squares of input data
Output of the Convolutional Layer
• For each hidden unit in each feature map, only take
into account pixels in the local receptive field (sparse
connectivity)
• For each feature map, for the (j, k)-th hidden unit in
that feature map, assuming a 5×5 filter (aka kernel),
the output of that unit is given by:
– $\sigma\left(b + \sum_{l=0}^{4} \sum_{m=0}^{4} w_{l,m}\, a_{j+l,k+m}\right)$
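As a rough illustration of the formula above, the sketch below slides one shared filter over an input with a stride of 1 and applies the bias and sigmoid at each position. The function name `feature_map` is hypothetical, and plain nested lists stand in for image arrays.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def feature_map(a, w, b):
    """Apply one shared filter w and bias b across input a with stride 1.

    a is an M x N input, w is an m x n filter; the result has
    (M - m + 1) rows and (N - n + 1) columns.
    """
    M, N = len(a), len(a[0])
    m, n = len(w), len(w[0])
    return [[sigmoid(b + sum(w[l][p] * a[j + l][k + p]
                             for l in range(m) for p in range(n)))
             for k in range(N - n + 1)]
            for j in range(M - m + 1)]
```

With a 28×28 input and a 5×5 filter this produces a 24×24 feature map, and every one of those 576 units reuses the same 25 weights and single bias.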
Pooling
• A pooling layer typically follows a convolutional layer.
• Intuitively it is a down sampling of the previous layer.
• Max pooling is a technique that selects the maximum
activation from a set of units from the convolutional
layer.
• Effectively, it takes each feature map from the convolutional
layer and produces a reduced feature map.
• Other pooling techniques:
– L2 Pooling
• Takes the square root of the sum of the squares of a set of units
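Both pooling variants can be sketched with one helper that walks over non-overlapping square regions. The names `pool`, `max_pool` and `l2_pool` are hypothetical, and a 2×2 region is assumed as the default.

```python
import math


def pool(feature_map, size, reduce_fn):
    """Reduce each non-overlapping size x size region to a single value."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [[reduce_fn([feature_map[r + dr][c + dc]
                        for dr in range(size) for dc in range(size)])
             for c in range(0, cols - size + 1, size)]
            for r in range(0, rows - size + 1, size)]


def max_pool(feature_map, size=2):
    # Keep the largest activation in each region.
    return pool(feature_map, size, max)


def l2_pool(feature_map, size=2):
    # Square root of the sum of squares of the activations in each region.
    return pool(feature_map, size,
                lambda values: math.sqrt(sum(v * v for v in values)))
```

Applied to a 24×24 feature map with 2×2 regions, either function returns a 12×12 map. No weights are involved, so there is nothing for the layer to learn.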
How Pooling Works
• Pooling is a form of statistical aggregation or downsampling of the previous layer.
• Pooling layers do not learn anything
• While it is common, it is not required to have a pooling layer after a convolutional layer
Image Credit: Michael Nielsen, Neural Networks and Deep Learning, Chapter 6
Backpropagation in CNNs Overview
• Applying backpropagation to a convolutional layer is very similar to
applying backpropagation to a fully connected layer, except that errors and
gradients are computed separately for each filter.
• Applying backpropagation to a pooling layer involves using an
upsampling function which propagates the error over the sampling
function using its derivatives.
• Backpropagation for a fully connected layer is exactly the same as
for MLPs.
• Yoshua Bengio on Quora: “There is a general recipe for obtaining a
back-propagation algorithm associated with ANY computational
graph. You can find it described in my book, for example, in the
feedforward nets (mlp) chapter (6): DEEP LEARNING”
Backpropagation in CNNs
• Error and gradient for fully connected layers
• Error and gradient for convolutional layer
• (k indexes the filter number, and upsample propagates the error through the pooling layer)
Slides from Hiroshi Kuwajima (visiting scholar at Stanford)
MNIST Data Set
• National Institute of Standards and Technology (NIST)
• Modified NIST Data Set maintained by Yann LeCun
• MNIST Data in CSV format
A Simple Architecture for MNIST
Image Credit: Michael Nielsen, Neural Networks and Deep Learning, Chapter 6.
• Input layer: 784 inputs encode the 28×28 MNIST image
• Convolutional layer: 1728 units representing 3 feature maps
• Max-Pooling layer: 432 units representing 3 feature maps
• Output layer: 10 units, one for each digit in the MNIST dataset
Shared Weights and Training CNNs
• CNN
– 28×28 = 784 input neurons
– 20 feature maps, each with 5×5 = 25 shared weights plus 1 shared bias: 20×26 = 520
– Total of 520 parameters to learn.
• MLP
– 28×28 = 784 inputs,
– 30 hidden units,
– Total of 784×30 = 23,520 weights,
– Total of 30 biases,
– Total of 23,550 parameters to learn.
• A single fully-connected layer would have more than 40 times as
many parameters as the convolutional layer.
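The arithmetic behind this comparison can be checked with two small helpers. The function names are hypothetical, and the convolutional count assumes 5×5 filters with one shared bias per feature map.

```python
def conv_layer_params(n_maps, filter_rows, filter_cols):
    """Each feature map has one shared filter plus one shared bias."""
    return n_maps * (filter_rows * filter_cols + 1)


def dense_layer_params(n_inputs, n_units):
    """Every input connects to every unit, and each unit has its own bias."""
    return n_inputs * n_units + n_units


conv_total = conv_layer_params(20, 5, 5)       # 20 * (25 + 1)
dense_total = dense_layer_params(28 * 28, 30)  # 784 * 30 + 30
ratio = dense_total / conv_total
```

Note that the convolutional count does not depend on the image size at all, only on the filter size and the number of feature maps.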
A CNN Architecture for MNIST
Image Credit: Michael Nielsen, Neural Networks and Deep Learning, Chapter 6
• 9,967 test images correctly classified out of 10,000
• Very similar to LeNet-5 architecture
• Softmax Regression aka Multi-class Logistic Regression is a generalization of
logistic regression that is used for multi-class classification and is based on the
softmax function.
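For reference, the softmax function turns a list of raw scores into probabilities that sum to 1. The sketch below is a plain-Python version; subtracting the maximum score first is a common numerical-stability trick and does not change the result.

```python
import math


def softmax(scores):
    """Map raw scores to a probability distribution over classes."""
    shift = max(scores)  # avoids overflow in exp() for large scores
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

For MNIST the output layer would feed ten scores into this function, and the predicted digit is the index of the largest probability.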
Incorrectly Classified MNIST Images
Of the 10,000 MNIST test images, 9,967 were correctly classified and 33 were incorrectly classified
What features are learned?
• The images above show the type of features the convolutional layer learns.
• Lighter regions mean a smaller, typically negative weight,
• Darker regions mean a larger weight
• Many of the features have distinguishable sub-regions of light and dark
• It’s clear that it’s learning “stuff” related to spatial structure
Performance Enhancements
• Regularization Terms to help with overfitting
– Regularization is a technique that adds a penalty term
(e.g. on large weights) to your loss function.
• Ensemble methods
– Train several nets and have them vote on the output.
• Generated (expanded) data sets
– Basically apply distortions to the original data set
– E.g. 50,000 images → 250,000 images
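One simple way to get a five-fold expansion is to keep each image and add copies shifted by one pixel in each of the four directions. The sketch below assumes that this is the distortion used; the slide does not say which distortions were applied, and the function names are hypothetical.

```python
def shift_image(image, dr, dc, fill=0):
    """Shift a 2D image down by dr rows and right by dc columns, padding with fill."""
    rows, cols = len(image), len(image[0])
    return [[image[r - dr][c - dc]
             if 0 <= r - dr < rows and 0 <= c - dc < cols else fill
             for c in range(cols)]
            for r in range(rows)]


def expand(dataset):
    """Return each (image, label) pair plus four one-pixel shifts of it."""
    shifts = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
    return [(shift_image(image, dr, dc), label)
            for image, label in dataset
            for dr, dc in shifts]
```

The labels are unchanged by the shift, so the expanded set can be used for training exactly like the original one.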
Expanded Generated Data Sets
Image credit: Tijmen Tieleman, University of Toronto
CNN Summary
• There are four main operations in a CNN:
– Convolution
– Non Linearity (ReLU)
– Pooling or Sub Sampling
– Classification (Fully Connected Layer)
• These operations are the basic building blocks of every CNN.
• CNNs are faster to train than MLPs because fewer parameters need to be learned.
• Work well with two-dimensional data in which locality is meaningful,
– e.g. object recognition in images.
• CNNs can also be used with higher dimensional data
– e.g. MRI Images
• Additional convolutional layers provide higher level features (meta features)
• Pooling layers progressively reduce the spatial size of the representation to reduce the number of features and the
computational complexity of the network
• Fully Connected layer at the end provides the classifier
• Networks using Rectified Linear Units (ReLU) typically outperform networks using sigmoidal activation functions (sigmoid or
tanh).
Part 2
Recurrent Neural Nets
(RNNs)
Recurrent Neural Networks are a family of
Neural Networks for processing sequential data.
Recurrent Neural Nets Overview
• Leverage the ideas of
– unfolding computational graphs
– parameter-sharing to abstract away input position
• “In 2009 I visited Nepal” vs “I visited Nepal in 2009”
• RNNs represent cyclical graphs so information flows in both directions through the
network.
– They are networks with loops in them, allowing information to persist.
• Different flavors of RNNs
– An output at each time-step and recurrent connections between hidden units
– An output at each time-step and recurrent connections only from output units
– An output only after the entire sequence is fed into the network and connections between
hidden units.
• RNNs can simulate a Turing Machine and can represent any computable function
– Siegelmann and Sontag, 1995.
– Used an RNN of finite size consisting of 886 units
RNNs in Practice
• Types of RNN used in Practice
– Vanilla RNNs
– Bidirectional RNNs
– Deep Bidirectional RNNs
– Long Short-Term Memory (LSTM)
• Practical Applications of RNNs
– Language Modeling And Generating Text
– Machine Translation
– Speech Recognition
– Generating Image Descriptions
Computational Graphs
• Computational Graph: Formalization of the
structure of a set of computations.
• Unfolding a recursive computation into a
graph with repetitive structure results in
parameter sharing across a deep network
structure.
• Any function involving a recurrence is an RNN
• Hidden Units in RNN:
– h(t) = f(h(t-1), x(t), θ)
– Notice that θ is the same at each time step.
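The recurrence above can be unfolded with a simple loop. This sketch assumes a single scalar hidden unit with a tanh activation; the parameter tuple `theta = (w_h, w_x, b)` and the function names are hypothetical.

```python
import math


def rnn_step(h_prev, x, theta):
    """h(t) = f(h(t-1), x(t), theta), here with f = tanh of a weighted sum."""
    w_h, w_x, b = theta
    return math.tanh(w_h * h_prev + w_x * x + b)


def unfold(xs, theta, h0=0.0):
    """Apply the same step, with the same theta, at every time step."""
    states, h = [], h0
    for x in xs:
        h = rnn_step(h, x, theta)
        states.append(h)
    return states
```

Because `theta` never changes inside the loop, a sequence of any length is handled by the same three parameters, which is the parameter sharing the slide refers to.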
Unfolding an RNN
Training RNNs
• Backpropagation in Computational Graphs
– Backpropagation can be derived for any computational graph by recursively applying the chain
rule. (Deep Learning, Chapter 6)
– The backpropagation algorithm consists of performing a Jacobian-gradient product for each
operation in the graph
– In vector calculus, the Jacobian matrix is the matrix of all first-order partial derivatives of a
vector-valued function
• Backpropagation Through Time (BPTT).
– Gradient at each output depends not only on the calculations of the current time step, but
also the previous time steps.
– Vanilla RNNs trained with BPTT have difficulty learning long-term dependencies, i.e.
dependencies between steps (words) that are far apart.
• “I grew up in France… I speak fluent French”
– They suffer from the vanishing/exploding gradient problem.
• Vanishing gradient: gradients get smaller and smaller in magnitude as you backpropagate through earlier
layers (or through time).
• The derivative of the sigmoid function lies in (0, 0.25] and that of tanh in (0, 1], so a product of many such factors
easily causes the gradient to vanish in earlier layers.
• Exploding gradient: more of an issue with recurrent networks, where the opposite happens because the repeated
Jacobian factors have norm greater than 1.
– Certain types of RNNs (like LSTMs) were specifically designed to get around these problems.
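A scalar toy example makes both effects visible. The sketch below tracks how much the final hidden state depends on the initial one by multiplying the per-step derivatives together; the recurrences and function names are assumptions chosen for illustration, not a model from the talk.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_gradient_through_time(w, steps, h0=0.5, x=0.0):
    """For h(t) = sigmoid(w * h(t-1) + x), return |dh(T) / dh(0)|."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = sigmoid(w * h + x)
        grad *= w * h * (1.0 - h)  # chain rule: w times sigmoid'(z) = h(1-h)
    return abs(grad)


def linear_gradient_through_time(w, steps):
    """For the linear recurrence h(t) = w * h(t-1), |dh(T) / dh(0)| = |w|^T."""
    return abs(w) ** steps
```

With `w = 1.0` the sigmoid version shrinks by a factor of at most 0.25 per step, so after 50 steps the gradient is effectively zero. The linear version with `w = 1.5` grows without bound instead.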
Long Short Term Memory (LSTM)
• LSTMs are a special kind of RNN, capable of learning long-term dependencies.
• Successful in handwriting recognition, speech recognition, image captioning and machine
translation
• Type of gated network
• Introduced by Hochreiter & Schmidhuber (1997)
– Added self-loops which allowed gradient to flow for long durations.
– Weight on the self-loop based on context rather than fixed. (Gers et al., 2000)
– Based on the idea of creating paths through the network in which the gradient neither vanishes nor
explodes.
• Leaky units allowed information to accumulate over a long duration
• LSTMs generalize leaky units by allowing connection weights to change over time.
• LSTMs allow the network to decide when to forget information.
• A single hidden unit in an LSTM is replaced with a recurrent network cell consisting of 4
components that interact with each other.
Gated Network Cells
• Gated network cells replace the hidden units of RNNs
• The input feature is computed by a regular artificial neuron unit.
• The input can be accumulated into the state if the input gate allows it.
• The state has a self-loop whose weight is controlled by the forget gate.
• The output can be turned off by the output gate.
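The four components listed above can be sketched as one step of a scalar LSTM cell. This follows the commonly used LSTM equations; the dictionary layout of the parameters and the gate names are hypothetical choices made for readability.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    params maps each component name to a (w_x, w_h, b) tuple.
    Returns the new output h and the new cell state c.
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    g = math.tanh(pre("input"))      # input feature from a regular unit
    i = sigmoid(pre("input_gate"))   # how much of g may accumulate
    f = sigmoid(pre("forget_gate"))  # weight on the state's self-loop
    o = sigmoid(pre("output_gate"))  # how much of the state is output
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c
```

When the forget gate is close to 1 and the input gate close to 0, the state `c` is carried forward almost unchanged, which gives the gradient a path along which it neither vanishes nor explodes.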
LSTM in NLP Generation
Image credit: Google Research Blog
LSTM Summary
• A type of RNN architecture that addresses the
vanishing/exploding gradient problem.
• LSTMs allow the learning of long-term
dependencies which is crucial for sequences
of inputs.
• Recently achieved state-of-the-art
performance in speech recognition, language
modeling, translation, image captioning
Additional Topics…
• Generative Adversarial Networks (GANs)
• Deep Reinforcement Learning (DRL)
• Differentiable Neural Computers (DNCs)
Part 3
Generative Adversarial
Networks
(GANs)
Generative Adversarial Networks are an example of generative
models. GANs focus primarily on sample generation, though it is
possible to design GANs that can estimate the probability
distribution.
GAN Framework
• Based on the idea of a two player game
– Player 1: Generator
– Player 2: Discriminator
• The generator generates samples and tries to
fool the discriminator
• The discriminator tries to determine whether a given
sample is real or generated (fake)
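The two-player game can be made concrete through the losses each player minimizes. The sketch below uses the cross-entropy form commonly found in the GAN literature, with the non-saturating variant for the generator; `d_real` and `d_fake` are hypothetical names for the discriminator's output probabilities on a real and a generated sample.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Cross-entropy for labelling real samples 1 and generated samples 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake):
    """Non-saturating generator loss: low when the discriminator is fooled."""
    return -math.log(d_fake)
```

A discriminator that cannot tell the two apart outputs 0.5 for everything, which gives it a loss of 2 ln 2. Training alternates between lowering each loss in turn.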
Why GANs are useful
• When predicting the next frame in a video, using the Mean Squared Error
(MSE) causes an averaging over many possible futures. In the pictured example
this makes the ear disappear and blurs the eyes.
• The adversarial version does a much better job of preserving the ear and not
blurring the eyes.
Image credit: Ian Goodfellow, GANs Tutorial, NIPS 2016
GANs Summary
• GANs are generative models that use
supervised learning to approximate an
intractable cost function
• GANs require finding Nash equilibria in high
dimensional, continuous, non-convex games.
• GANs are crucial to many different state of the
art image generation and manipulation
systems.
Part 4
Deep Reinforcement Learning
(DRL)
Deep Reinforcement Learning combines both Deep Learning and
Reinforcement Learning by using Deep Learning techniques to learn values
for the Q Function in Reinforcement Learning. This is described in Google
DeepMind’s Atari paper and exemplified by the AlphaGo program.
Deep Reinforcement Learning
• Combines Reinforcement Learning with Deep Learning
• A form of model-free learning, driven by rewards rather than labeled examples
• Uses Neural Nets to estimate Q Values.
• Very new field. No Wikipedia Page on this topic.
• Idea is to feed states and actions into the network to predict Q values.
• Neural networks are exceptionally good in coming up with good features
for highly structured data.
• This is the technology used by Google DeepMind’s AlphaGo program.
Reinforcement Learning Revisited
• Definitions
– Policy π is a way of selecting an action given a state
– Value function Qπ (s,a) is the expected total reward for
performing action a from state s given policy π
• Different Approaches
– Policy Based RL
• Search for the optimal policy in space of policies
– Value-based RL
• Estimate optimal value function Q*(s,a)
– Model-based RL
• Build a model of the environment and use look ahead
The Many States Problem
• In the Nature DeepMind Atari paper:
• Take the last four screen images, resize them to 84×84 and
convert them to gray scale with 256 gray levels.
• This yields 256^(84×84×4) ≈ 10^67970 possible game states.
• This means 10^67970 rows in our imaginary Q-table.
• That is more than the number of atoms in the known
universe!
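The size of that number is easy to verify by working with logarithms rather than the number itself. The helper name below is hypothetical; the defaults are the figures quoted above.

```python
import math


def q_table_rows_log10(levels=256, width=84, height=84, frames=4):
    """Base-10 logarithm of the number of distinct states, levels^(w*h*frames)."""
    return width * height * frames * math.log10(levels)


exponent = q_table_rows_log10()  # about 67970.16
```

For comparison, the number of atoms in the observable universe is commonly quoted as roughly 10^80, so a lookup table is out of the question and a function approximator is needed instead.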
Deep-Q Architecture
Deep Q-Learning Error & Gradient
• Represent Q function using a deep network.
• Error function
• Gradient
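The formulas on this slide did not survive extraction. As a stand-in, the sketch below shows the squared temporal-difference error that is commonly used for deep Q-learning, for a single transition. It is an assumption about what the slide showed, not a reconstruction of it; the function names, the 0.5 factor, and the default discount `gamma` are illustrative.

```python
def td_target(reward, next_q_values, gamma=0.99, terminal=False):
    """Target for Q(s, a): r + gamma * max over a' of Q(s', a')."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


def td_error_loss(q_sa, target):
    """Squared error between the network's Q(s, a) and the target."""
    return 0.5 * (target - q_sa) ** 2


def td_error_grad(q_sa, target):
    """Derivative of the loss with respect to Q(s, a), treating the target as fixed."""
    return -(target - q_sa)
```

In a full implementation this derivative would be backpropagated through the network that produced `q_sa`, exactly as the output error is in ordinary supervised training.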
Strategies & Tricks
• Experience Replay
– During gameplay all the experiences <s,a,r,s′> are stored in a replay memory.
– When training the network, random samples from the replay memory are
used instead of the most recent transition.
– This breaks the similarity of subsequent training samples, which otherwise
might drive the network into a local minimum.
– Also experience replay makes the training task more similar to usual
supervised learning, which simplifies debugging and testing the algorithm.
– One could actually collect all those experiences from human gameplay and then
train the network on these.
• Exploration-Exploitation
– ε-greedy exploration
– with probability ε choose a random action, otherwise go with the “greedy”
action with the highest Q-value.
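Both tricks fit in a few lines. The sketch below is a minimal illustration with hypothetical names: a bounded replay memory that returns random samples, and an ε-greedy action selector over a list of Q-values.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of <s, a, r, s'> experiences."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest experiences drop out first
        self.rng = random.Random(seed)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Random samples break the similarity of consecutive transitions.
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

A common design choice is to start with a large ε and lower it over time, so that the agent explores early and exploits what it has learned later.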
Deep Q-Learning Algorithm
DeepMind Atari Deep-Q Network
References (1)
• Neural Nets & Deep Learning
– http://neuralnetworksanddeeplearning.com/chap2.html
– http://deeplearning.net/tutorial/deeplearning.pdf
• Convolutional Neural Networks
– http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf
– http://neuralnetworksanddeeplearning.com/chap6.html
– http://cs231n.github.io/convolutional-networks/
– Visualizing and Understanding Convolutional Networks
– Convolutional Neural Networks backpropagation: from intuition to derivation
– An Intuitive Explanation of Convolutional Neural Networks
– Backpropagation in Convolutional Neural Networks
• Recurrent Neural Nets
– http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf
– http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
– http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-
model-rnn-with-python-numpy-and-theano/
– http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-
time-and-vanishing-gradients/
– http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-
with-python-and-theano/
References (2)
• Generative Adversarial Networks
– NIPS 2016 Tutorial: Generative Adversarial Networks
• Deep Reinforcement Learning
– http://www0.cs.ucl.ac.uk/staff/d.silver/web/Resources_files/deep_rl.pdf
– http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/
• Differentiable Neural Computers
– https://deepmind.com/blog/differentiable-neural-computers/
• Google DeepMind AlphaGo Nature Paper
– https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf
Questions
• Goodfellow quote on BP on Quora
• Vanishing / exploding gradient

Deep Learning

  • 1.
    Deep Learning Pierre deLacaze rpl@lispnyc.org Lisp NYC Tuesday, June 20th, 2017 Jane Street Capital
  • 2.
    Overview Principal Topics 1. ConvolutionalNeural Networks (CNNs) 2. Recurrent Neural Networks (RNNs) Time permitting… 1. Generative Adversarial Networks (GANs) 2. Differentiable Neural Computers (DNCs) 3. Deep Reinforcement Learning (DRL)
  • 3.
    Deep Neural Networks •A deep neural network is a neural network with multiple layers of hidden units. – E.g. MLPs: Multi-Layered Perceptrons (MLPs) • Convolutional Neural Nets (CNNs) – Biologically-inspired variants of MLPs – Successfully used in image recognition, speech recognition • Recurrent Neural Nets (RNN) – Cyclic graphs where next layers feeds into previous layers – Allow for a window of time into past data – Successfully used or Natural Language processing.
  • 4.
    Application: Combining CNNs& RNNs GENERATING IMAGE DESCRIPTIONS Together with convolutional Neural Networks, RNNs have been used as part of a model to generate descriptions for unlabeled images. It’s quite amazing how well this seems to work. The combined model even aligns the generated words with features found in the images. Deep Visual-Semantic Alignments for Generating Image Descriptions. Source: http://cs.stanford.edu/people/karpathy/deepimagesent
  • 5.
    Part 0 ANN Review& Multi-Layered Perceptrons (MLPs) Multi Layered Perceptrons (MLPs) are fully connected feed forward networks with several layers of hidden units.
  • 6.
    Linear Units andPerceptrons • Linear Unit: A linear combination of weighted inputs (real-valued) • Perceptron: Thresholded Linear Unit (discrete-valued) Note: w0 is a bias whose purpose is to move the threshold of the activation function.
  • 7.
    Multi Layered Perceptrons •These are fully connected Deep Feed Forward Networks • Every output from previous layer is connected to every unit in the next layer • They are typically trained using the Backprogation Algorithm • Backprogation is effectively Gradient Descent applied to every unit in the network. Image Credit: Michael Bernstein, Neural Networks and Deep Learning, Chapter 2.
  • 8.
  • 9.
    ANN Backpropagation Algorithm (Usingincremental gradient descent) 1. Initial weights to small random numbers 2. Until termination criteria for each training example a. Compute the network outputs for the training example b. For each output unit k compute its error: δk = ok (1 – ok) (tk – ok) c. For each hidden unit h compute its error: δh = oh (1 – oh) Σ (whk δk ) k d. Update each network weight wij wij = wij + η δh xij
  • 10.
  • 11.
    Identity Function Example •Tom Mitchell, Machine Learning, Chpt 4., 1st edition. (def if-td [[[1 0 0 0 0 0 0 0] [1 0 0 0 0 0 0 0]] [[0 1 0 0 0 0 0 0] [0 1 0 0 0 0 0 0]] [[0 0 1 0 0 0 0 0] [0 0 1 0 0 0 0 0]] [[0 0 0 1 0 0 0 0] [0 0 0 1 0 0 0 0]] [[0 0 0 0 1 0 0 0] [0 0 0 0 1 0 0 0]] [[0 0 0 0 0 1 0 0] [0 0 0 0 0 1 0 0]] [[0 0 0 0 0 0 1 0] [0 0 0 0 0 0 1 0]] [[0 0 0 0 0 0 0 1] [0 0 0 0 0 0 0 1]]]) • Ran 3 examples of MLPs on Identity function. – A 1 hidden layer MLP: 8 x 3 x 8 – A 2 hidden layer MLP: 8 x 3 x 3 x 8 – A 3 hidden layer MLP: 8 x 3 x 3 x 3 x 8
  • 12.
    MLP Training Comparisons ❶MLP with 1 hidden layer of 3 hidden units: 4,500 iterations to converge ❷ MLP with 2 hidden layers of 3 hidden units: 28,000 iteration to converge ❸ MLP with 3 hidden layers of 3 hidden units: 1,000,000+ iterations to converge
  • 13.
    Part 1 Convolutional NeuralNets (CNNs) Convolutional Neural Networks are biologically-inspired variants of Multi Layered Perceptrons (MLPs)
  • 14.
    History of CNNs •Research dates back to the 1970’s • Seminal Paper on CNNs: – Gradient-based learning applied to document recognition, Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, 1998 • Really took off in 2012 – ILSVRC (ImageNet Large-Scale Visual Recognition Challenge) – 2012 ILSBRC: AlexNet , Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton – 2013 ILBSRC: ZF Net, Matthew Zeiler and Rob Fergus , NYU – 2014: VGG Net, Karen Simonyan and Andrew Zisserman, University of Oxford
  • 15.
    CNN Overview • ACNN typically consists of one or more convolutional and sampling layers followed by one or more fully connected layers. • Specifically designed to exploit 2D input such an image or speech input • Faster to to train than fully connected networks. • Sparse Connectivity – CNNs exploit spatially-local correlation using local connectivity pattern between units of adjacent layers. – These are called local receptive fields • Shared Weights – Replicated units share the same parameterization (weight vector and bias) and form a feature map. • Max Pooling – A form of non-linear down-sampling. Max pooling partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum value.
  • 16.
    Local Receptive Fields •In a fully connected network, every input in the input layer is connected to every hidden unit. • This prevents the network from learning spatial features of the image. • The idea is to map (connect) small rectangular sections of the image (inputs) to different hidden units. • These hidden units are called local receptive fields and result in a sparse connectivity between the input layer and the first hidden layer. • The stride length is the amount by which we shift the rectangular sections. Typically use rectangular sections shifted over 1 pixel • Different sets of local receptive fields form feature maps each of which represent a potentially different feature.
  • 17.
    Feature Maps • Eachhidden unit shares the same set of weights and bias but for a different spatial area of the input. • This allows that layer to learn the same feature but for different regions of the image. • The complete hidden layer will in fact consist of several feature maps. This is called a convolutional layer. • The shared bias and weights in each feature map are often called filters or kernels.
  • 18.
    How Feature MapsWork The amount by which the local receptive field is shifted is called the stride length. A stride length of 1 is common. All hidden units in a feature map share the same weights and bias. This greatly reduces the number of parameters in a layer. Image credit: Michael Nielsen’s Neural Networks and Deep Learning, Chapter 6.
  • 19.
    Why Do FeatureMaps Learn Different Features? • From Quora: Andy Thomas • Two reasons: – The weights of the filters are randomly initialized – Different feature maps reduce the cost function • Random initialization of the weights will likely ensure each filter converges to different local minima in the cost function. It is very unlikely that each filter would begin to resemble other filters, as that would almost certainly result in an increase of the cost function and therefore no gradient descent algorithm would head in that direction. • Some feature maps may learn the same feature.
  • 20.
    The Convolution Operator •A Convolution is a simple mathematical operation common to many image processing operators. • Provides a way of “multiplying” two arrays of numbers of different sizes but same dimensionality • Input image has M rows and N columns, and the kernel has m rows and n columns, • The output image will have M - m + 1 rows, and N - n + 1 columns. • The purpose of Convolution in a CNN is to extract features from the input image. • Convolution preserves the spatial relationship between pixels by learning image features using small squares of input data
  • 21.
    Output of theConvolutional Layer • For each hidden unit in each feature map, only take into account pixels in the local receptive field (sparse connectivity) • For each feature map, for the jth ,kth hidden unit in that feature map, assuming a 5x5 filter (aka kernel), the output of that unit is given by: –σ (b + ∑ l=0,4 ∑ m=0,4 wl,m a j+l,k+m)
  • 22.
    Pooling • A poolinglayer typically follows a convolutional layer. • Intuitively it is a down sampling of the previous layer. • Max pooling is technique that selects the maximum activation from a set of units from the convolutional layer. • Effectively take each feature map from convolutional layer and produce a reduced feature map. • Other pooling techniques: – L2 Pooling • Takes the square root of the sum of the squares of a set of units
  • 23.
    How Pooling Works •Pooling is a form of statistical aggregation or downsampling of the previous layer. • Pooling layers do not learn anything • While it is common, it is not required to have a pooling after a convolutional layer Image Credit: Michael Nielsen, Neural Networks and Deep Learning, Chapter 6
  • 24.
    Backpropagation in CNNsOverview • Applying backprogation to a convolutional layer is very similar to applying backprogation to a fully connected except that errors and gradients are computed separately for each filter. • Applying backpropagation to a pooling layer involves using an upsampling function which propagates the error over the sampling function using its derivatives. • Backpropagation for a fully connected layer is exactly the same as for MLPs. • Yoshua Bengio on Quora: “There is a general recipe for obtaining a back-propagation algorithm associated with ANY computational graph. You can find it described in my book, for example, in the feedforward nets (mlp) chapter (6): DEEP LEARNING”
  • 25.
    Backpropagation in CNNs
    • Error and gradient for fully connected layers
    • Error and gradient for convolutional layer
    • (k indexes the filter number and upsample propagates the error through the pooling layer)
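For max pooling, one common choice of upsample function routes each pooled error back to the position that produced the maximum and sends zero everywhere else. The sketch below assumes non-overlapping 2×2 windows; it is one possible reading of "upsample", not necessarily the formulation used on the slide, and both function names are hypothetical.

```python
def max_pool_with_argmax(feature_map, size=2):
    """Max-pool and remember where each maximum came from."""
    pooled, argmax = [], []
    for r in range(0, len(feature_map) - size + 1, size):
        pooled_row, argmax_row = [], []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [(feature_map[r + i][c + j], (r + i, c + j))
                      for i in range(size) for j in range(size)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            argmax_row.append(position)
        pooled.append(pooled_row)
        argmax.append(argmax_row)
    return pooled, argmax


def upsample_max(pooled_error, argmax, shape):
    """Send each pooled error to the unit that won the max; others get 0."""
    rows, cols = shape
    error = [[0.0] * cols for _ in range(rows)]
    for i, row in enumerate(pooled_error):
        for j, delta in enumerate(row):
            r, c = argmax[i][j]
            error[r][c] = delta
    return error


fm = [[1, 2, 0, 0],
      [3, 4, 0, 3],
      [0, 0, 5, 0],
      [0, 6, 0, 12]]
pooled, argmax = max_pool_with_argmax(fm)
conv_error = upsample_max([[0.1, 0.2], [0.3, 0.4]], argmax, (4, 4))
for row in conv_error:
    print(row)
```

Only the units that were selected in the forward pass receive a non-zero error, because the derivative of max with respect to every other unit in the window is zero.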
  • 26.
    Slides from Hiroshi Kuwajima (visiting scholar at Stanford)
  • 27.
    MNIST Data Set
    • National Institute of Standards and Technology (NIST)
    • Modified NIST Data Set maintained by Yann LeCun
    • MNIST Data in CSV format
  • 28.
    A Simple Architecture for MNIST
    Image Credit: Michael Nielsen, Neural Networks and Deep Learning, Chapter 6.
    • Input layer: 784 inputs (28×28) encode the MNIST image
    • Convolutional layer: 1728 units representing 3 feature maps (3×24×24)
    • Max-pooling layer: 432 units representing 3 feature maps (3×12×12)
    • Output layer: 10 units, one for each digit (MNIST dataset)
  • 29.
    Shared Weights and Training CNNs
    • CNN
      – 28×28 = 784 input neurons
      – 20 feature maps × 26 parameters each (5×5 shared weights + 1 shared bias) = 520
      – Total of 520 parameters to learn.
    • MLP
      – 28×28 = 784 inputs,
      – 30 hidden units,
      – Total of 784×30 = 23,520 weights,
      – Total of 30 biases,
      – Total of 23,550 parameters to learn.
    • A single fully connected layer would have more than 40 times as many parameters as the convolutional layer.
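The parameter counts on this slide can be reproduced with a few lines of arithmetic. The 5×5 filter size is an assumption carried over from the earlier convolutional-layer slide.

```python
# Convolutional layer: every feature map shares one 5x5 filter and one bias.
filter_size = 5
num_feature_maps = 20
params_per_map = filter_size * filter_size + 1    # 25 weights + 1 bias = 26
cnn_params = num_feature_maps * params_per_map    # 20 * 26 = 520

# Fully connected layer: every input connects to every hidden unit.
num_inputs = 28 * 28                              # 784
num_hidden = 30
mlp_params = num_inputs * num_hidden + num_hidden # 23,520 weights + 30 biases

ratio = mlp_params / cnn_params
print(cnn_params, mlp_params, round(ratio, 1))
```

The ratio comes out at roughly 45, consistent with the "more than 40 times" claim.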
  • 30.
    A CNN Architecture for MNIST
    Image Credit: Michael Nielsen, Neural Networks and Deep Learning, Chapter 6
    • 9,967 test images correctly classified out of 10,000
    • Very similar to the LeNet-5 architecture
    • Softmax regression, aka multi-class logistic regression, is a generalization of logistic regression that is used for multi-class classification and is based on the softmax function.
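The softmax function mentioned above turns a vector of raw scores into a probability distribution over classes. A minimal sketch follows; the example scores are made up, and subtracting the maximum is a standard numerical-stability trick that does not change the result.

```python
import math


def softmax(scores):
    """Map raw class scores to probabilities that sum to 1."""
    shift = max(scores)                       # avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]                      # hypothetical output-layer scores
probs = softmax(scores)
predicted_class = max(range(len(probs)), key=lambda i: probs[i])
print(probs, predicted_class)
```

For MNIST the output layer would have 10 such scores, one per digit, and the predicted digit is the class with the highest probability.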
  • 31.
    Incorrectly Classified MNIST Images
    Of the 10,000 MNIST test images, 9,967 were correctly classified and 33 incorrectly classified.
  • 32.
    What features are learned?
    • The images above show the type of features the convolutional layer learns.
    • Lighter regions mean a smaller, typically negative, weight.
    • Darker regions mean a larger weight.
    • Many of the features have distinguishable sub-regions of light and dark.
    • It’s clear that it’s learning “stuff” related to spatial structure.
  • 33.
    Performance Enhancements
    • Regularization terms to help with overfitting
      – Regularization is a technique that allows you to penalize your loss function.
    • Ensemble methods
      – Train several nets and have them vote on the output.
    • Generative expanded data sets
      – Basically apply distortions to the original data set
      – E.g. 50,000 images → 250,000 images
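One simple way to get a five-fold expansion such as 50,000 → 250,000 is to keep each image and add four copies shifted by one pixel (up, down, left, right). The slide only says "distortions", so the one-pixel shift is an assumption here, and the helper names `shift` and `expand` are hypothetical.

```python
def shift(image, dr, dc, fill=0):
    """Shift an image by dr rows and dc columns, padding with `fill`."""
    rows, cols = len(image), len(image[0])
    return [
        [
            image[r - dr][c - dc]
            if 0 <= r - dr < rows and 0 <= c - dc < cols
            else fill
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def expand(images):
    """Return each image plus four one-pixel-shifted copies."""
    offsets = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
    return [shift(img, dr, dc) for img in images for dr, dc in offsets]


tiny = [[1, 2],
        [3, 4]]
expanded = expand([tiny, tiny, tiny])
print(len(expanded))          # 5 times the number of input images
print(shift(tiny, 1, 0))      # shifted down by one row
```

The labels are unchanged by such small shifts, so each new image reuses the label of its source image.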
  • 34.
    Expanded Generated Data Sets
    Image credit: Tijmen Tieleman, University of Toronto
  • 35.
    CNN Summary
    • There are four main operations in a CNN:
      – Convolution
      – Non-linearity (ReLU)
      – Pooling or subsampling
      – Classification (fully connected layer)
    • These operations are the basic building blocks of every CNN.
    • CNNs are faster to train than MLPs because fewer parameters need to be learned.
    • They work well with two-dimensional data in which locality is meaningful,
      – e.g. object recognition in images.
    • CNNs can also be used with higher-dimensional data
      – e.g. MRI images
    • Additional convolutional layers provide higher-level features (meta features).
    • Pooling layers progressively reduce the spatial size of the representation to reduce the number of features and the computational complexity of the network.
    • The fully connected layer at the end provides the classifier.
    • Networks using rectified linear units (ReLU) typically outperform networks based on sigmoidal activation functions (sigmoid or tanh).
  • 36.
    Part 2
    Recurrent Neural Nets (RNNs)
    Recurrent Neural Networks are a family of neural networks for processing sequential data.
  • 37.
    Recurrent Neural Nets Overview
    • Leverage the ideas of
      – unfolding computational graphs
      – parameter sharing to abstract away input position
        • “In 2009 I visited Nepal” vs “I visited Nepal in 2009”
    • RNNs represent cyclical graphs, so information flows in both directions through the network.
      – They are networks with loops in them, allowing information to persist.
    • Different flavors of RNNs
      – An output at each time step and recurrent connections between hidden units
      – An output at each time step and recurrent connections only from output units
      – An output only after the entire sequence is fed into the network and connections between hidden units
    • RNNs can simulate a Turing machine and can represent any computable function
      – Siegelmann and Sontag, 1995.
      – Used an RNN of finite size consisting of 886 units
  • 38.
    RNNs in Practice
    • Types of RNN used in practice
      – Vanilla RNNs
      – Bidirectional RNNs
      – Deep bidirectional RNNs
      – Long Short-Term Memory (LSTM)
    • Practical applications of RNNs
      – Language modeling and generating text
      – Machine translation
      – Speech recognition
      – Generating image descriptions
  • 39.
    Computational Graphs
    • Computational graph: a formalization of the structure of a set of computations.
    • Unfolding a recursive computation into a graph with repetitive structure results in parameter sharing across a deep network structure.
    • Essentially any function involving a recurrence can be considered an RNN.
    • Hidden units in an RNN:
      – h(t) = f(h(t-1), x(t), θ)
      – Notice that θ is the same at each time step.
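The recurrence h(t) = f(h(t-1), x(t), θ) can be unrolled in a loop that reuses the same θ at every step. The sketch below uses a scalar hidden state and assumes f is a tanh of a weighted sum; that choice of f and the names `rnn_step` and `unroll` are illustrative assumptions.

```python
import math


def rnn_step(h_prev, x_t, theta):
    """One application of f; theta = (w_h, w_x, b) is shared across time."""
    w_h, w_x, b = theta
    return math.tanh(w_h * h_prev + w_x * x_t + b)


def unroll(xs, theta, h0=0.0):
    """Unfold the recurrence over an input sequence."""
    h = h0
    states = []
    for x_t in xs:
        h = rnn_step(h, x_t, theta)   # same theta at every time step
        states.append(h)
    return states


theta = (0.5, 1.0, 0.0)               # hypothetical parameter values
states = unroll([1.0, 0.0, -1.0], theta)
print(states)
```

Because θ does not depend on t, the same function handles sequences of any length, which is the parameter sharing the slide refers to.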
  • 40.
  • 41.
    Training RNNs
    • Backpropagation in Computational Graphs
      – Backpropagation can be derived for any computational graph by recursively applying the chain rule (Deep Learning, Chapter 6).
      – The backpropagation algorithm consists of performing a Jacobian-gradient product for each operation in the graph.
      – In vector calculus, the Jacobian matrix is the matrix of all first-order partial derivatives of a vector-valued function.
    • Backpropagation Through Time (BPTT)
      – The gradient at each output depends not only on the calculations of the current time step, but also on the previous time steps.
      – Vanilla RNNs trained with BPTT have difficulty learning long-term dependencies, i.e. dependencies between steps (words) that are far apart.
        • “I grew up in France… I speak fluent French”
      – BPTT suffers from the vanishing/exploding gradient problem.
        • Vanishing gradient: your gradients get smaller and smaller in magnitude as you backpropagate through earlier layers (or through time).
        • Activation functions like the sigmoid have derivatives in the range (0, 0.25] (for tanh, (0, 1]), so multiplying many of them together easily causes the gradient to vanish in earlier layers.
        • Exploding gradient: more of an issue with recurrent networks, where the opposite can happen when the Jacobian that is multiplied in at every step has a norm greater than 1.
      – Certain types of RNNs (like LSTMs) were specifically designed to get around these problems.
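A scalar toy model shows how quickly the gradient shrinks or grows. In a one-unit recurrent net with a sigmoid activation, backpropagating through one time step multiplies the gradient by w · σ′(z), and σ′(z) never exceeds 0.25. The sketch assumes the pre-activation z stays at 0 (where σ′ is largest); the function names and weight values are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)              # maximum value 0.25, reached at z = 0


def gradient_factor(w, steps, z=0.0):
    """Product of the per-step factors w * sigmoid'(z) over `steps` steps."""
    factor = 1.0
    for _ in range(steps):
        factor *= w * sigmoid_prime(z)
    return factor


vanishing = gradient_factor(w=1.0, steps=10)   # (0.25) ** 10
exploding = gradient_factor(w=5.0, steps=10)   # (1.25) ** 10
print(vanishing, exploding)
```

With a recurrent weight of 1 the gradient is scaled by 0.25 at each step and is nearly gone after ten steps; with a weight of 5 the per-step factor is 1.25 and the gradient grows instead.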
  • 42.
    Long Short-Term Memory (LSTM)
    • LSTMs are a special kind of RNN, capable of learning long-term dependencies.
    • Successful in handwriting recognition, speech recognition, image captioning and machine translation
    • Type of gated network
    • Introduced by Hochreiter & Schmidhuber (1997)
      – Added self-loops, which allowed the gradient to flow for long durations.
      – Weight on the self-loop based on context rather than fixed (Gers et al., 2000).
      – Based on the idea of creating paths through the network in which the gradient neither vanishes nor explodes.
    • Leaky units allowed information to accumulate over a long duration.
    • LSTMs generalize leaky units by allowing connection weights to change over time.
    • LSTMs allow the network to decide when to forget information.
    • A single hidden unit in an LSTM is replaced with a recurrent network cell consisting of 4 components that interact with each other.
  • 43.
    Gated Network Cells
    • Gated network cells replace the hidden units of RNNs.
    • The input feature is computed using the ANN unit.
    • The input can be accumulated if the input gate allows it.
    • The state has a self-loop controlled by the forget gate.
    • The output can be turned off by the output gate.
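The four interacting components (input feature, input gate, forget gate, output gate) can be sketched for a single scalar cell. This follows the commonly used LSTM formulation rather than anything shown on the slide; the parameter layout, one `(w, u, b)` triple per component, and the function name `lstm_step` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell.

    `params` maps each component name to (w, u, b): input weight,
    recurrent weight and bias. The names are illustrative only.
    """
    def pre(name):
        w, u, b = params[name]
        return w * x + u * h_prev + b

    g = math.tanh(pre("feature"))     # input feature from an ordinary unit
    i = sigmoid(pre("input"))         # input gate: may the feature be accumulated?
    f = sigmoid(pre("forget"))        # forget gate: weight on the state's self-loop
    o = sigmoid(pre("output"))        # output gate: may the state be emitted?

    c = f * c_prev + i * g            # state update with gated self-loop
    h = o * math.tanh(c)              # gated output
    return h, c


zero = (0.0, 0.0, 0.0)
params = {"feature": zero, "input": zero, "forget": zero, "output": zero}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, params=params)
print(h, c)
```

With all parameters at zero every gate sits at 0.5, so half of the previous state is kept. During training the forget gate can move toward 1, which gives the nearly constant self-loop that lets gradients flow over long durations.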
  • 44.
    LSTM in NLP Generation
    Image credit: Google Research Blog
  • 45.
    LSTM Summary
    • A type of RNN architecture that addresses the vanishing/exploding gradient problem.
    • LSTMs allow the learning of long-term dependencies, which is crucial for sequences of inputs.
    • Recently achieved state-of-the-art performance in speech recognition, language modeling, translation and image captioning.
  • 46.
    Additional Topics…
    • Generative Adversarial Networks (GANs)
    • Deep Reinforcement Learning (DRL)
    • Differentiable Neural Computers (DNCs)
  • 47.
    Part 3
    Generative Adversarial Networks (GANs)
    Generative Adversarial Networks are an example of generative models. GANs focus primarily on sample generation, though it is possible to design GANs that can estimate the probability distribution.
  • 48.
    GAN Framework
    • Based on the idea of a two-player game
      – Player 1: Generator
      – Player 2: Discriminator
    • The generator generates samples and tries to fool the discriminator.
    • The discriminator tries to determine whether each sample it sees is real or generated (fake).
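The two-player game is usually written as a minimax problem. The formulation below is the standard one from the original GAN paper, not something shown on this slide; D(x) is the discriminator's estimate that x is real, G(z) is a sample generated from noise z, and the distribution names are the usual notation.

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator maximizes V by assigning high probability to real samples and low probability to generated ones, while the generator minimizes V by producing samples that the discriminator scores as real.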
  • 49.
    Why GANs are useful
    • When predicting the next frame in a video, using the Mean Squared Error (MSE) causes an averaging over many possible futures, which makes the ear disappear and blurs the eyes.
    • The adversarial version does a much better job of preserving the ear and not blurring the eyes.
    Image credit: Ian Goodfellow, GANs Tutorial, NIPS 2016
  • 50.
    GANs Summary
    • GANs are generative models that use supervised learning to approximate an intractable cost function.
    • GANs require finding Nash equilibria in high-dimensional, continuous, non-convex games.
    • GANs are crucial to many different state-of-the-art image generation and manipulation systems.
  • 51.
    Part 4
    Deep Reinforcement Learning (DRL)
    Deep Reinforcement Learning combines both Deep Learning and Reinforcement Learning by using Deep Learning techniques to learn values for the Q function in Reinforcement Learning. This is described in Google DeepMind’s Atari paper and exemplified by the AlphaGo program.
  • 52.
    Deep Reinforcement Learning
    • Combines Reinforcement Learning with Deep Learning
    • A form of model-free or unsupervised learning
    • Uses neural nets to estimate Q values.
    • Very new field. No Wikipedia page on this topic.
    • The idea is to feed states and actions into the network to predict Q values.
    • Neural networks are exceptionally good at coming up with good features for highly structured data.
    • This is the technology used by Google DeepMind’s AlphaGo program.
  • 53.
    Reinforcement Learning Revisited
    • Definitions
      – A policy π is a way of selecting an action given a state.
      – The value function Qπ(s,a) is the expected total reward for performing action a from state s given policy π.
    • Different approaches
      – Policy-based RL
        • Search for the optimal policy in the space of policies
      – Value-based RL
        • Estimate the optimal value function Q*(s,a)
      – Model-based RL
        • Build a model of the environment and use look-ahead
  • 54.
    The Many States Problem
    • In the Nature DeepMind Atari paper:
    • Take the last four screen images, resize them to 84×84 and convert them to gray scale with 256 gray levels.
    • This yields $256^{84 \times 84 \times 4} \approx 10^{67970}$ possible game states.
    • This means $10^{67970}$ rows in our imaginary Q-table.
    • That is more than the number of atoms in the known universe!
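The exponent can be checked without computing the huge number itself, by working with base-10 logarithms. A short sketch:

```python
import math

pixels = 84 * 84 * 4                      # four stacked 84x84 frames
gray_levels = 256

# log10(256 ** pixels) = pixels * log10(256)
log10_states = pixels * math.log10(gray_levels)
exponent = round(log10_states)
print(pixels, exponent)                   # 28224 67970
```

So the state count is about 10 to the power 67970, far too many rows for a tabular Q function, which motivates approximating Q with a network instead.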
  • 55.
  • 56.
    Deep Q-Learning Error & Gradient
    • Represent the Q function using a deep network.
    • Error function
    • Gradient
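The slide's equations are not reproduced in this transcript. For orientation, a commonly used form of the deep Q-learning error and its gradient, as found in the tutorials listed in the references, is sketched below. The symbols are assumptions: w denotes the network weights, γ the discount factor, and the target term is treated as a constant when differentiating.

```latex
L(w) = \tfrac{1}{2}\Bigl( r + \gamma \max_{a'} Q(s', a'; w) - Q(s, a; w) \Bigr)^{2}

\frac{\partial L(w)}{\partial w}
  = -\Bigl( r + \gamma \max_{a'} Q(s', a'; w) - Q(s, a; w) \Bigr)
    \frac{\partial Q(s, a; w)}{\partial w}
```

The bracketed term is the difference between the bootstrapped target and the current prediction, so the update moves Q(s, a) toward the target.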
  • 57.
    Strategies & Tricks
    • Experience Replay
      – During gameplay all the experiences <s,a,r,s′> are stored in a replay memory.
      – When training the network, random samples from the replay memory are used instead of the most recent transition.
      – This breaks the similarity of subsequent training samples, which otherwise might drive the network into a local minimum.
      – Experience replay also makes the training task more similar to usual supervised learning, which simplifies debugging and testing the algorithm.
      – One could actually collect all those experiences from human gameplay and then train the network on these.
    • Exploration-Exploitation
      – ε-greedy exploration: with probability ε choose a random action, otherwise go with the “greedy” action with the highest Q-value.
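Both tricks fit in a few lines of Python. The sketch below is a minimal illustration; the class name `ReplayMemory`, the capacity, and the toy transitions are assumptions made for the example.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of <s, a, r, s'> experiences."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest experiences drop out

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Random minibatch instead of the most recent transitions.
        return random.sample(list(self.buffer), batch_size)


def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore, otherwise pick the best action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


memory = ReplayMemory(capacity=100)
for step in range(150):                        # toy transitions
    memory.store(s=step, a=step % 3, r=1.0, s_next=step + 1)

batch = memory.sample(batch_size=32)
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
print(len(memory.buffer), len(batch), action)
```

In practice ε is often started high and decreased over time, so the agent explores early and exploits what it has learned later; that schedule is a common convention, not something stated on the slide.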
  • 58.
  • 59.
  • 60.
    References (1)
    • Neural Nets & Deep Learning
      – http://neuralnetworksanddeeplearning.com/chap2.html
      – http://deeplearning.net/tutorial/deeplearning.pdf
    • Convolutional Neural Networks
      – http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf
      – http://neuralnetworksanddeeplearning.com/chap6.html
      – http://cs231n.github.io/convolutional-networks/
      – Visualizing and Understanding Convolutional Networks
      – Convolutional Neural Networks backpropagation: from intuition to derivation
      – An Intuitive Explanation of Convolutional Neural Networks
      – Backpropagation in Convolutional Neural Networks
    • Recurrent Neural Nets
      – http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf
      – http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
      – http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/
      – http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/
      – http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/
  • 61.
    References (2)
    • Generative Adversarial Networks
      – NIPS 2016 Tutorial: Generative Adversarial Networks
    • Deep Reinforcement Learning
      – http://www0.cs.ucl.ac.uk/staff/d.silver/web/Resources_files/deep_rl.pdf
      – http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/
    • Differentiable Neural Computers
      – https://deepmind.com/blog/differentiable-neural-computers/
    • Google DeepMind DRL Atari Paper
      – https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf
  • 62.
    Questions
    • Goodfellow quote on BP on Quora
    • Vanishing / exploding gradient