Presented by
Dr. Gnanasekar.A.K
Professor & Head
Department of Computer Science & Engineering
P T Lee Chengalvaraya Naicker College of Engineering & Technology, Kanchipuram
hodcse@ptleecncet.com
9677338977
 Neurons and the Brain
 Neural Networks
 Perceptrons
 Multi-layer Networks
 Applications
 The Hopfield Network
 Multilayer Neural Networks
 Deep Convolutional Nets
 Recurrent Nets
 Deep Belief Networks
 Dropout
 A model of reasoning based on the human brain
 Complex networks of simple computing elements
 Capable of learning from examples
 with appropriate learning methods
 Collection of simple elements performs high-level
operations
 The human brain incorporates nearly 10 billion
neurons and 60 trillion connections between them.
 Our brain can be considered as a highly complex,
non-linear and parallel information-processing
system.
 Learning is a fundamental and essential
characteristic of biological neural networks.
[Russell & Norvig, 1995]
Year      | Neural Network                  | Designer          | Description
1943      | McCulloch and Pitts Neuron      | McCulloch & Pitts | Logic gates
1949      | Hebb                            | Hebb              | Strength increases if neurons are active
1958-1988 | Perceptron                      | Rosenblatt        | Weights of path can be adjusted
1960      | Adaline                         | Widrow and Hoff   | Mean squared error
1972      | SOM (self-organizing map)       | Kohonen           | Clustering
1982      | Hopfield                        | John Hopfield     | Associative memory nets
1986      | Back Propagation                | Rumelhart         | Multilayer
1987-90   | ART (Adaptive Resonance Theory) | Carpenter         | Used for both binary and analog
[Russell & Norvig, 1995]
•Basic parts: Body(soma), Dendrites(inputs), Axons (outputs),
Synapses (connections)
[Russell & Norvig, 1995]
•Input summing function
•Activation function
•Weighted inputs
•Output
[Russell & Norvig, 1995]
[Russell & Norvig, 1995]
[Russell & Norvig, 1995]
 Recall: the processing phase of a NN, whose
objective is to retrieve stored information; the
process of computing the output o for a given input x
 Basic forms of neural information
processing
 Auto association
 Hetero association
 Classification
 Set of patterns can be
stored in the network
 If a pattern similar to
a member of the
stored set is
presented, an
association with the
input of closest stored
pattern is made
 Associations between
pairs of patterns are
stored
 Distorted input
pattern may cause
correct
heteroassociation at
the output
 Set of input patterns
is divided into a
number of classes or
categories
 In response to an
input pattern from
the set, the classifier
is supposed to recall
the information
regarding class
membership of the
input pattern.
 Weights
 Bias
 Threshold
 Learning rate
 Momentum factor
 Vigilance parameter
 Notations used in ANN
 Each neuron is connected to every other
neuron by means of directed links
 Links are associated with weights
 Weights contain information about the input
signal and are represented as a matrix
 Weight matrix also called connection matrix
$$W = \begin{bmatrix} w_1^T \\ w_2^T \\ w_3^T \\ \vdots \\ w_n^T \end{bmatrix}
= \begin{bmatrix}
w_{11} & w_{12} & w_{13} & \cdots & w_{1m} \\
w_{21} & w_{22} & w_{23} & \cdots & w_{2m} \\
\vdots & \vdots & \vdots &        & \vdots \\
w_{n1} & w_{n2} & w_{n3} & \cdots & w_{nm}
\end{bmatrix}$$
 wij is the weight from processing element "i"
(source node) to processing element "j" (destination node)
(Figure: a single neuron Yj receiving inputs x1, ..., xi, ..., xn through weights w1j, wij, wnj, plus a bias bj.)

The net input to neuron j is

$$y_{in_j} = \sum_{i=0}^{n} x_i w_{ij} = x_0 w_{0j} + x_1 w_{1j} + x_2 w_{2j} + \cdots + x_n w_{nj}$$

and, with $x_0 = 1$ and $w_{0j} = b_j$,

$$y_{in_j} = b_j + \sum_{i=1}^{n} x_i w_{ij}$$
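To make the net-input formula concrete, here is a minimal NumPy sketch; the input values, weights and biases are made-up illustrations, not taken from the slides:

```python
import numpy as np

# Net input to each neuron j:  y_in_j = b_j + sum_i x_i * w_ij
x = np.array([0.5, 1.0, -0.3])      # inputs x_1..x_n (hypothetical values)
W = np.array([[0.2, -0.1],          # W[i, j] = weight from input i to neuron j
              [0.4,  0.3],
              [0.1, -0.5]])
b = np.array([0.1, -0.2])           # bias b_j for each neuron j

net_input = b + x @ W               # vector of y_in_j for all neurons j
print(net_input)
```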
 Used to calculate the output response of
a neuron.
 Sum of the weighted input signal is
applied with an activation to obtain the
response.
 Activation functions can be linear or non-linear
 Already dealt with:
 Identity function
 Single/binary step function
 Discrete/continuous sigmoidal function.
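A small illustrative sketch of the activation functions named above (identity, binary step with threshold, continuous sigmoid); the test values are arbitrary:

```python
import numpy as np

def identity(net):
    # Identity function: output equals the net input
    return net

def binary_step(net, theta=0.0):
    # Single/binary step function with threshold theta
    return np.where(net >= theta, 1.0, 0.0)

def sigmoid(net):
    # Continuous sigmoidal function: squashes the net input into (0, 1)
    return 1.0 / (1.0 + np.exp(-net))

net = np.array([-2.0, 0.0, 1.5])
print(identity(net), binary_step(net), sigmoid(net))
```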
 Bias is like another weight. It is included by
adding a component x0=1 to the input vector
X.
 X=(1,X1,X2…Xi,…Xn)
 Bias is of two types
 Positive bias: increase the net input
 Negative bias: decrease the net input
 The relationship between input and output is given by
the equation of a straight line, y = mx + c
(Figure: input X mapped to output Y, with the bias C acting as the intercept in y = mx + C.)
 Set value based upon which the final output of
the network may be calculated
 Used in activation function
 The activation function using threshold can be
defined as
$$f(net) = \begin{cases} 1 & \text{if } net \geq \theta \\ -1 & \text{if } net < \theta \end{cases}$$
 Denoted by α.
 Used to control the amount of weight
adjustment at each step of training
 Learning rate ranging from 0 to 1 determines
the rate of learning in each time step
oSupervised:
oAdaline, Madaline
oPerceptron
oBack Propagation
omultilayer perceptrons
oRadial Basis Function Networks
oUnsupervised
oCompetitive Learning
oKohonen self organizing map
oLearning vector quantization
oHebbian learning
No one‐fits‐all formula
Overfitting can occur if a "good" training set is
not chosen
What constitutes a “good” training set?
 Samples must represent the general
population.
Samples must contain members of each class.
Samples in each class must contain a wide
range of variations or noise effect.
The size of the training set is related to the
number of hidden neurons
Supervised learning
Unsupervised learning
Reinforcement learning
This is what we have seen so far!
A network is fed with a set of training
samples (inputs and corresponding
output), and it uses these samples to
learn the general relationship between
the inputs and the outputs.
This relationship is represented by the
values of the weights of the trained
network.
No desired output is associated with
the training data!
Faster than supervised learning
Used to find out structures within data:
 Clustering
 Compression
Like supervised learning, but:
Weight adjustment is not directly related to
the error value.
The error value is used to randomly shuffle the
weights!
Relatively slow learning due to the
'randomness'.
Function approximation
including time series prediction and modelling.
Classification
including patterns and sequences recognition,
novelty detection and sequential decision making.
(radar systems, face identification, handwritten text
recognition)
Data processing
including filtering, clustering, blind source
separation and compression.
(data mining, e-mail Spam filtering)
Advantages
Adapt to unknown situations
Powerful, it can model complex functions.
Ease of use, learns by example, and very
little user domain‐specific expertise needed
Disadvantages
Forgets
Not exact
Large complexity of the network structure
[Russell & Norvig, 1995]
This vastly simplified model of real neurons is also
known as a Threshold Logic Unit
A set of connections brings in activations from other
neurons.
A processing unit sums the inputs, and then applies a non-
linear activation function (i.e. squashing/ Transfer/
threshold function).
An output line transmits the result to other neurons
y = F(w1x1+ w2x2 - b)
[Russell & Norvig, 1995]
Binary
perceptrons
Continuous
perceptrons
1st Neural Network: AND function
Inputs X1, X2 feed Y with weights 1 and 1; Threshold(Y) = 2
1st Neural Network: OR function
Inputs X1, X2 feed Y with weights 2 and 2; Threshold(Y) = 2
1st Neural Network: AND NOT function
Inputs X1, X2 feed Y with weights 2 and -1; Threshold(Y) = 2
1st Neural Network: XOR function
Inputs X1, X2 feed hidden units Z1 and Z2 (each an AND NOT unit with weights 2 and -1), which feed Y with weights 2 and 2; Threshold(Y) = 2
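The gate networks above can be reproduced in a few lines of code. The sketch below assumes threshold units that fire when the weighted sum reaches the threshold of 2, as in the figures; the helper names are ours:

```python
import numpy as np

def mcculloch_pitts(inputs, weights, threshold=2):
    # Fires (outputs 1) when the weighted sum reaches the threshold
    return 1 if np.dot(inputs, weights) >= threshold else 0

def AND(x1, x2):     return mcculloch_pitts([x1, x2], [1, 1])    # weights 1, 1
def OR(x1, x2):      return mcculloch_pitts([x1, x2], [2, 2])    # weights 2, 2
def AND_NOT(x1, x2): return mcculloch_pitts([x1, x2], [2, -1])   # weights 2, -1

def XOR(x1, x2):
    # XOR needs hidden units: z1 = x1 AND NOT x2, z2 = x2 AND NOT x1, y = z1 OR z2
    z1 = AND_NOT(x1, x2)
    z2 = AND_NOT(x2, x1)
    return OR(z1, z2)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b), AND_NOT(a, b), XOR(a, b))
```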
 Weighted inputs are summed up by the input function
 The (nonlinear) activation function calculates the activation value, which
determines the output
[Russell & Norvig, 1995]
 Stept(x) = 1 if x >= t, else 0
 Sign(x) = +1 if x >= 0, else –1
 Sigmoid(x) = 1/(1 + e^(-x))
[Russell & Norvig, 1995]
 simple neurons can act as logic gates
 appropriate choice of activation function, threshold, and
weights
 step function as activation function
[Russell & Norvig, 1995]
layered structures
 networks are arranged into layers
 interconnections mostly between two layers
 some networks may have feedback
connections
 single layer, feed-
forward network
 historically one of
the first types of
neural networks
 late 1950s
 the output is
calculated as a step
function applied to
the weighted sum of
inputs
 capable of learning
simple functions
 linearly separable
[Russell & Norvig, 1995]
[Russell & Norvig, 1995]
 perceptrons can deal with linearly separable functions
 some simple functions are not linearly separable
 XOR function
0,0
0,1
1,0
1,1
0,0
0,1
1,0
1,1
AND XOR
 linear separability can be extended to more than two dimensions
 more difficult to visualize
[Russell & Norvig, 1995]
 This is done by making small adjustments in the weights
 to reduce the difference between the actual and desired
outputs of the Perceptron.
 The initial weights are randomly assigned
 usually in the range [-0.5, 0.5], or [0, 1]
 Then they are updated to obtain an output
consistent with the training examples.
 perceptrons can learn from examples through a simple
learning rule. For each example row (iteration), do the
following:
 calculate the error of a unit Erri as the difference between
the correct output Ti and the calculated output Oi
Erri = Ti - Oi
 adjust the weight Wij of the input Iij such that the error
decreases
Wij = Wij + α * Iij * Erri
 α is the learning rate, a positive constant less than unity.
 this is a gradient descent search through the weight space
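As a worked illustration of this learning rule, the sketch below trains a single perceptron on the AND function; the learning rate, epoch count and treatment of the threshold as a bias weight are illustrative choices, not taken from the slides:

```python
import numpy as np

alpha = 0.1                                    # learning rate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
T = np.array([0, 0, 0, 1])                     # desired outputs for AND
w = np.random.uniform(-0.5, 0.5, size=2)       # initial weights in [-0.5, 0.5]
b = np.random.uniform(-0.5, 0.5)               # bias (threshold folded in as a weight)

for epoch in range(50):
    for x, t in zip(X, T):
        o = 1 if np.dot(w, x) + b >= 0 else 0  # step activation
        err = t - o                            # Err = T - O
        w += alpha * err * x                   # W <- W + alpha * I * Err
        b += alpha * err

print(w, b)
```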
 The network consists of an input layer of source
neurons, at least one middle or hidden layer of
computational neurons, and an output layer of
computational neurons.
 The input signals are propagated in a forward
direction on a layer-by-layer basis
 feedforward neural network
 the back-propagation learning algorithm can be
used for learning in multi-layer networks
 two-layer network
 input units Ik
 usually not counted as a
separate layer
 hidden units aj
 output units Oi
 usually all nodes of one
layer have weighted
connections to all nodes
of the next layer
Ik
aj
Oi
Wji
Wkj
 Learning in a multilayer network proceeds the
same way as for a perceptron.
 A training set of input patterns is presented to
the network.
 The network computes its output pattern, and if
there is an error, or in other words a difference
between actual and desired output patterns,
the weights are adjusted to reduce this error.
 proceeds from the output layer to the hidden layer(s)
 updates the weights of the units leading to the layer
 In a back-propagation neural network, the learning
algorithm has two phases.
 First, a training input pattern is presented to the
network input layer. The network propagates the input
pattern from layer to layer until the output pattern is
generated by the output layer.
 If this pattern is different from the desired output, an
error is calculated and then propagated backwards
through the network from the output layer to the input
layer. The weights are modified as the error is
propagated.
Three-layer network for solving the
Exclusive-OR Operation
Final results of three-layer network learning

Inputs (x1, x2) | Desired output yd | Actual output y5 | Error e
1, 1            | 0                 | 0.0155           | -0.0155
0, 1            | 1                 | 0.9849           |  0.0151
1, 0            | 1                 | 0.9849           |  0.0151
0, 0            | 0                 | 0.0175           | -0.0175
Sum of squared errors e = 0.0010
Network for solving the Exclusive-OR operation
(a) Decision boundary constructed by hidden neuron 3;
(b) Decision boundary constructed by hidden neuron 4;
(c) Decision boundaries constructed by the complete
three-layer network
(Figures, in the (x1, x2) plane: (a) decision boundary x1 + x2 - 1.5 = 0; (b) decision boundary x1 + x2 - 0.5 = 0; (c) the combined decision boundaries of the three-layer network.)
 expressiveness
 weaker than predicate logic
 good for continuous inputs and outputs
 computational efficiency
 training time can be exponential in the number of inputs
 depends critically on parameters like the learning rate
 local minima are problematic
 can be overcome by simulated annealing, at additional cost
 generalization
 works reasonably well for some functions (classes of
problems)
 no formal characterization of these functions
 sensitivity to noise
 very tolerant
 they perform nonlinear regression
 transparency
 neural networks are essentially black boxes
 there is no explanation or trace for a particular answer
 tools for the analysis of networks are very limited
 some limited methods to extract rules from networks
 prior knowledge
 very difficult to integrate since the internal representation of
the networks is not easily accessible
 domains and tasks where neural networks
are successfully used
 recognition
 control problems
 series prediction
 weather, financial forecasting
 categorization
 sorting of items (fruit, characters, …)
 Neural networks were designed on analogy with
the brain.
 The brain’s memory, however, works by
association.
 For example, we can recognise a familiar face even in an
unfamiliar environment within 100-200 ms.
 We can also recall a complete sensory experience,
including sounds and scenes, when we hear only a few bars
of music.
 The brain routinely associates one thing with
another.
The Hopfield Network
 Multilayer neural networks trained with the back-
propagation algorithm are used for pattern
recognition problems.
 However, to emulate the human memory’s
associative characteristics we need a different type
of network: a recurrent neural network.
 A recurrent neural network has feedback loops
from its outputs to its inputs.
(Figure: single-layer n-neuron Hopfield network; input signals x1, x2, ..., xi, ..., xn feed neurons 1, 2, ..., i, ..., n, which produce output signals y1, y2, ..., yi, ..., yn.)
 The stability of recurrent networks was solved only in
1982, when John Hopfield formulated the physical
principle of storing information in a dynamically
stable network.
[Russell & Norvig, 1995]
o Objective function, averaged over all
training examples is a hilly landscape in
high-dimensional space of weight values
o Negative gradient vector indicates
direction of steepest descent taking it
closer to a minimum
• Goal is to learn the weights w from a
labelled set of training samples
• Learning procedure has two stages
1. Evaluate derivatives of error function
∇E(w) with respect to weights w1,..wT
2. Use derivatives to compute adjustments to
weights
w(1)
 w( )
E(w( )
)
• Gradient descent
• Update the weights using
$$w^{(\tau + 1)} = w^{(\tau)} - \eta \nabla E(w^{(\tau)})$$
• Where the gradient vector ∇E(w^(τ)) consists of the vector of
derivatives evaluated using back-propagation:
$$\nabla E(w) = \frac{dE}{dw} = \left[ \frac{\partial E}{\partial w_{11}^{(1)}}, \ldots, \frac{\partial E}{\partial w_{MD}^{(1)}}, \frac{\partial E}{\partial w_{11}^{(2)}}, \ldots, \frac{\partial E}{\partial w_{KM}^{(2)}} \right]^{T}$$
There are W = M(D+1) + K(M+1) elements in the vector; the gradient ∇E(w^(τ)) is a W x 1 vector.
Most practitioners use SGD for DL
Consists of showing the input vector for a few examples,
computing the outputs and the errors,
Computing the average gradient for those examples,
and adjusting the weights accordingly.
Process repeated for many small sets of examples
from the training set until the average of the objective function stops decreasing.
Called stochastic because each small set of examples
gives a noisy estimate of the average gradient over all examples.
Usually finds a good set of weights quickly compared to elaborate optimization techniques.
After training, performance is measured
on a different test set:
tests generalization ability of the machine
— its ability to produce sensible answers on new inputs
never seen during training.
(Figure: fluctuations in the objective as gradient steps are taken in mini-batches.)
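A minimal sketch of this mini-batch SGD loop; grad_E is a hypothetical function standing in for the back-propagation gradient computation, and the batch size, learning rate and epoch count are arbitrary choices:

```python
import numpy as np

def sgd(w, data, targets, grad_E, eta=0.01, batch_size=32, epochs=10):
    # data, targets: NumPy arrays; grad_E returns the average gradient over a batch
    n = len(data)
    for epoch in range(epochs):
        order = np.random.permutation(n)            # shuffle the training set
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            g = grad_E(w, data[idx], targets[idx])  # noisy estimate of the full gradient
            w = w - eta * g                         # w <- w - eta * grad E
    return w
```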
• Backpropagation Formula
• The value of δ for a particular hidden unit can be obtained by
propagating the δ's backward from units higher up in the network:
$$\delta_j = h'(a_j) \sum_k w_{kj} \delta_k$$
1. Apply input vector x_n to the network and forward propagate through
the network using $a_j = \sum_i w_{ji} z_i$ and $z_j = h(a_j)$
2. Evaluate δ_k for all output units using $\delta_k = y_k - t_k$
3. Backpropagate the δ's using $\delta_j = h'(a_j) \sum_k w_{kj} \delta_k$
to obtain δ_j for each hidden unit
4. Use $\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i$ to evaluate the required derivatives
(Figure: hidden unit j connected to units k in the layer above.)
• Two-layer network
• Sum-of-squared error
• Output units: linear activation functions, i.e., multiple regression
$$y_k = a_k$$
• Hidden units have logistic sigmoid activation function h(a) = tanh(a), where
$$\tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$$
with the simple form for the derivative $h'(a) = 1 - h(a)^2$
• Standard sum-of-squared error
$$E_n = \frac{1}{2}\sum_{k=1}^{K} (y_k - t_k)^2$$
where y_k is the activation of output unit k and t_k is the corresponding target for input x_k.
Simple Example: Forward and Backward Prop
For each input in the training set:
• Forward Propagation
$$a_j = \sum_{i=0}^{D} w_{ji}^{(1)} x_i, \qquad z_j = \tanh(a_j), \qquad y_k = \sum_{j=0}^{M} w_{kj}^{(2)} z_j$$
• Output differences
$$\delta_k = y_k - t_k$$
• Backward Propagation (δ's for hidden units), using $h'(a) = 1 - h(a)^2$
$$\delta_j = (1 - z_j^2) \sum_{k=1}^{K} w_{kj} \delta_k$$
• Derivatives wrt first-layer and second-layer weights
$$\frac{\partial E_n}{\partial w_{ji}^{(1)}} = \delta_j x_i, \qquad \frac{\partial E_n}{\partial w_{kj}^{(2)}} = \delta_k z_j$$
• Batch method
$$\frac{\partial E}{\partial w_{ji}} = \sum_n \frac{\partial E_n}{\partial w_{ji}}$$
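The forward and backward passes above translate almost line-for-line into NumPy. This is a sketch under the stated assumptions (tanh hidden units, linear outputs, biases folded in as the zeroth weights with x_0 = z_0 = 1); shapes and initial values are illustrative:

```python
import numpy as np

def forward_backward(W1, W2, x, t):
    # W1 is M x (D+1), W2 is K x (M+1); column 0 holds the bias weights
    x = np.concatenate(([1.0], x))            # prepend x_0 = 1
    a = W1 @ x                                # a_j = sum_i w_ji^(1) x_i
    z = np.concatenate(([1.0], np.tanh(a)))   # z_j = tanh(a_j), plus z_0 = 1
    y = W2 @ z                                # y_k = sum_j w_kj^(2) z_j

    delta_k = y - t                           # output deltas
    # hidden deltas: (1 - z_j^2) * sum_k w_kj delta_k (skip the bias unit)
    delta_j = (1.0 - z[1:] ** 2) * (W2[:, 1:].T @ delta_k)

    grad_W2 = np.outer(delta_k, z)            # dEn/dw_kj^(2) = delta_k z_j
    grad_W1 = np.outer(delta_j, x)            # dEn/dw_ji^(1) = delta_j x_i
    return y, grad_W1, grad_W2

D, M, K = 2, 3, 1
W1 = np.random.randn(M, D + 1) * 0.1
W2 = np.random.randn(K, M + 1) * 0.1
y, g1, g2 = forward_backward(W1, W2, np.array([0.5, -1.0]), np.array([1.0]))
```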
2 Hidden Layers, 1 Output Layer
Each layer is a module through which one can back-propagate
gradients.
At each layer, we
first compute the
total input z to each
unit,
which is a weighted sum of
the outputs
of the units in the layer
below.
Then a non-linear function f(.)
is applied to z to get the
output of the unit.
A small change Δx in x gets transformed
first into a small change Δy in y by getting
multiplied by ∂y/∂x (that is, the definition of
partial derivative). Similarly, the
change Δy creates a change Δz in z.
Substituting one equation into the other
gives the chain rule of derivatives:
how Δx gets turned into Δz through
multiplication by the product of ∂y/∂x
and ∂z/∂y.
It also works when x, y and z are vectors
(and the derivatives are Jacobian
matrices).
Tells us how two small effects (that of a small change of x on y, and that of y on z)
are composed.
Two hidden layers; layers are labelled i, j, k, l.
At each hidden layer we compute
the error derivative wrt the output of each unit,
which is a weighted sum of the error derivatives
wrt the total inputs to the units in the layer above.
Convert error derivative wrt the output into the error
derivative wrt the input
by multiplying it by the gradient of f(z).
At the output layer, the error derivative wrt
the output of a unit is computed
by differentiating the cost function.
This gives $y_l - t_l$ if the cost function for unit l is
$0.5(y_l - t_l)^2$, where $t_l$ is the target value.
Once the ∂E/∂zk is known,
the error-derivative for the weight wjk
on the connection from
unit j in the layer below is just yj ∂E/∂zk.
• Most popular today is Rectified Linear Unit(ReLU)
• a half-wave rectifier
f(z)=max(z,0)
• In past decades, neural nets used smoother
non-linearities such as
tanh(z) or 1/(1+exp(-z))
• ReLU learns faster in networks with many layers
• Allowing training of deep supervised network
without unsupervised pre-training
• Rectified linear unit (ReLU)
f(z) = max(0,z)
commonly used in recent years,
• More conventional sigmoids:
• hyperbolic tangent,
f(z) = (exp(z) − exp(−z))/(exp(z) + exp(−z)) and
• logistic function,
f(z) = 1/(1 + exp(−z))
With 2 inputs, 2 hidden units, 1 output unit
• By distorting the input space
• Note that the grid is also distorted
• Input data (red and blue curves) are not linearly separable
• The network makes them linearly separable
(Figure: inputs x1, x2 mapped to hidden outputs y1, y2 and output z.)
• Linear classifiers can only carve the input space
into very simple regions
• Image and speech recognition require input-output
function to be insensitive to irrelevant variations of
the input,
• e.g., position, orientation and illumination of an object
• Variations in pitch or accent of speech
• While being sensitive to minute variations, e.g., white wolf
and breed of wolf-like white dog called Samoyed
• At pixel level two Samoyeds in different positions may be
very different, whereas a Samoyed and a wolf in the
same position and background may be very similar
• Shallow classifiers need a good feature extractor
• One that produces representations that are:
• selective to aspects of image important for
discrimination
• but invariant to irrelevant aspects such as pose of the
animal
• Generic features (e.g., Gaussian kernel)
do not generalize well far from training
examples
• Hand-designing good feature extractors
requires engineering skill and domain
expertise
• Deep learning learns features automatically
• Multilayer stack of simple modules
• All modules (or most) subject to :
• Learning
• Non-linear input-output mappings
• Each module transforms input to improve both
selectivity and invariance of the
representation
• With depth of 5 to 20 layers can implement
extremely intricate functions of input
• Sensitive to minute details
• Distinguish Samoyeds from white wolves
• Insensitive to irrelevant variations
• Background, pose, lighting, surrounding objects
• It was thought that simple gradient descent would get
trapped in local minima
• Rarely a problem with large networks
• Regardless of initial conditions, system always reaches
solutions of similar quality
• Landscape is packed with a combinatorially large number of
saddle points where gradient is zero
• Almost all have similar values of objective function
$$w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E(w^{(\tau)})$$
• Unsupervised learning to create layers of
feature detectors
• No need for labelled data
• Objective of learning in each layer of
feature detectors:
• Ability to reconstruct or model activities of feature detectors
(or raw inputs) in layer below
• By pre-training weights of a deep network could be initialized
to sensible values
• Final layer of output units added at top of
network and whole deep system fine-tuned using
back- propagation
• Worked well in handwritten digit recognition
• When data was limited
• First major application was in speech recognition
• Made possible by advent of fast GPUs
• Allowed networks to be trained 10 or 20 times faster
• Record-breaking results on speech recognition
benchmarks
• It turned out that pre-training stage was
only needed for small data sets
• Convolutional neural networks
• Type of deep feedforward network
• Much easier to train and generalized
much better than networks with full
connectivity between adjacent layers
• Designed to process data that come in the form
of multiple arrays
• E.g., a color image composed of three 2D arrays of pixel
intensities in three color channels
• Many data modalities are in the form of
multiple arrays:
• 1D for signals and sequences, including language
• 2D for images and audio spectrograms
• 3D for video or volumetric images
• Take advantage of properties of natural signals
1. Local connections
2. Shared weights
3. Pooling
4. Use of many layers
• Need substantial number of training samples
• Slow learning (convergence times)
• Inadequate parameter selection techniques that
lead to poor minima
• Network should exhibit invariance to
translation, scaling and elastic
deformations
• A large training set can take care of this
• It ignores a key property of images
• Nearby pixels are more strongly correlated than distant
ones
• Modern computer vision approaches exploit this
property
• Information can be merged at later stages
to get higher order features and about
whole image
• Classic notions of simple cells and complex cells
• Architecture similar to LGN-V1-V2-V4-IT hierarchy in
visual cortex ventral pathway
• LGN: lateral geniculate nucleus receives input from retina
• 30 different areas of visual cortex: V1 and V2 are principal
• Infero-Temporal cortex performs object recognition
1. Local Receptive Fields
2. Subsampling
3. Weight Sharing
• Instead of treating input to a fully
connected network
• Two layers of Neural networks are
used
1. Layer of convolutional units
• which consider
overlapping regions
2. Layer of subsampling units
• Several feature maps and sub-
sampling
• Gradual reduction of spatial resolution
compensated by increasing no. of
features
• Final layer has softmax output
• Whole network trained
using backpropagation
Each pixel patch is 5 x 5. This plane has 10 x 10 = 100
neural network units (called a feature map). Weights are
the same for all units within a plane (but differ between
planes), so only 25 weights are needed. Due to weight
sharing this is equivalent to convolution. Different
features have different feature maps.
(Figure: input image; convolutional layer of 10 x 10 units taking 5 x 5 pixel patches; sub-sampling layer of 5 x 5 units taking 2 x 2 unit regions.)
• Reducing the complexity of a network
• Encouraging groups of weights to have similar
values
• Only applicable when form of the network
can be specified in advance
• Division of weights into groups, mean weight
value for each group and spread of values are
determined during the learning process
Outputs (not filters) of each layer (horizontally).
Each rectangular image is a feature map corresponding to the output for one of the learned features,
detected at each of the image positions.
Lower-level features act as oriented edge detectors
• Structured as a series of stages
• First few stages are composed of two types of
layers and a non-linearity:
1. Convolutional layer
• To detect local conjunctions of features from previous
layer
2. Non-linearity
• ReLU
3. Pooling layer
• To merge semantically similar features into one
• Organized in feature maps
• Each unit connected to local patches in
feature maps of previous layer through
weights (called a filter bank)
• Result is passed through a ReLU
• Computes maximum of a local patch of units
in one feature map
• Neighboring pooling units take input from
patches that are shifted by more than one
row or column
• Thereby reducing the dimension of the
representation
• Creating invariance to small shifts and distortions
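To make weight sharing and pooling concrete, here is a small NumPy sketch of one shared 5 x 5 filter followed by a ReLU and non-overlapping 2 x 2 max pooling; the 14 x 14 input size is an assumption chosen so that the feature-map sizes match the earlier figure (10 x 10, then 5 x 5):

```python
import numpy as np

def conv2d(image, kernel):
    # Slide the same kernel over every position: weight sharing = convolution
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)  # same 25 weights everywhere
    return out

def max_pool(fmap, size=2):
    # Take the maximum of each non-overlapping size x size patch
    H, W = fmap.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i * size:(i + 1) * size, j * size:(j + 1) * size].max()
    return out

image = np.random.rand(14, 14)       # toy input image
kernel = np.random.randn(5, 5)       # shared filter: only 25 weights
feature_map = np.maximum(conv2d(image, kernel), 0.0)   # convolution + ReLU
pooled = max_pool(feature_map)
print(feature_map.shape, pooled.shape)                 # (10, 10) -> (5, 5)
```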
• As simple as through a regular
deep network
• Allow all weights in all filter
banks to be trained
• Neocognitron
• Similar architecture
• Did not have end-to-end supervised
learning using Backprop
• ConvNet with probabilistic model
was used for OCR and handwritten
check reading
• Applied with great success to images: detection,
segmentation and recognition of objects and regions
• Tasks where abundant data was available
• Traffic sign recognition
• Segmentation of biological images
• Connectomics
• Detection of faces, text, pedestrians, human bodies in natural
images
• Major recent practical success is face recognition
• Images can be labeled at pixel level
• Applications in autonomous mobile robots, and self-driving cars
• Other applications gaining importance
• Natural language understanding
• Speech recognition
• Data set of 1 million images from the web
• Contained 1,000 different classes
• Error rate
• Halved the error rate of best competing computer vision approaches
• Success came from:
• Efficient use of GPUs
• ReLUs
• New regularization technique called dropout
• Techniques to generate new training samples by deforming existing ones
• Combining ConvNets and Recurrent Net Modules
• Caption generated by a recurrent neural network (RNN)
taking as input:
1. Representation generated by a deep CNN
2. RNN trained to translate high-level representations of
images into captions
• Different focus (lighter patches given more
attention)
• As it generates each word (bold), it exploits it
to achieve better translation of images to
captions
1. 10 to 20 layers of ReLUs
2. Hundreds of millions of weights
3. Billions of connections between units
4. Training time
• Would have taken weeks a couple of years ago
• Advances in hardware, software and parallelization
reduces it to a few hours
• ConvNets are easily amenable to efficient
hardware implementations
• NVIDIA, Mobileye, Intel, Qualcomm and Samsung are
developing ConvNet chips for smartphones, cameras,
robots and self-driving cars
• Deep nets have two different exponential advantages
over classic learning algorithms. Both advantages
arise from
• power of composition
• Depend on underlying data-generating distribution having an
appropriate compositional structure
1. Learning distributed representations enable
generalization to new combinations of the values of
learned features beyond those seen during training
• E.g., 2^n combinations are possible with n binary features
2. Composing layers of representations brings another
advantage: exponential in depth
• Predicting the next word from local context of earlier
words
• Each word presented as a 1-of-N vector
• In the first layer each word creates a different word vector
• In the language model, other next layers learn to convert
input word vector to output word vector for the predicted word
(Figure, left: word representations for modelling language, non-linearly projected to 2-D using the t-SNE algorithm; semantically similar words are mapped nearby. Right: 2-D representation of phrases learnt by an English-to-French encoder-decoder RNN.)
Learnt using backpropagation that jointly
learns representation for each word and
function that predicts a target quantity (next
word or sequence of words for translation)
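A minimal sketch of that first layer: a 1-of-N vector multiplied by an embedding matrix picks out the word's real-valued feature vector. The toy vocabulary and dimensions are made up for illustration:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]   # toy vocabulary (N = 5)
N, d = len(vocab), 3                         # d = dimension of the word vectors
E = np.random.randn(N, d) * 0.1              # embedding matrix, learned by backprop

def one_hot(word):
    # 1-of-N encoding of a word
    v = np.zeros(N)
    v[vocab.index(word)] = 1.0
    return v

word_vector = one_hot("cat") @ E             # equivalent to looking up row E[1]
print(word_vector)
```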
• Logic-inspired paradigm uses
• An instance of a symbol is something for which the
only property is that it is either identical or non-
identical to other symbol instances
• It has no internal structure relevant to its use
• To reason with symbols they must be bound to
variables in judiciously chosen rules of inference
• Neural networks use
• big activity vectors, big weight matrices and
scalar non- linearities
• to perform fast intuitive inference that
underpins commonsense reasoning
 Standard statistical models count frequencies
of short symbol sequences of length up to N
 No. of possible sequences is V^N, where
V is the vocabulary size
 So taking context of more than a handful
of words would require very large corpora
 N-grams treat each word as an atomic unit,
so they cannot generalize across
semantically related sequences
 Neural models can because they associate
each word with a vector of real-valued features
• Exciting early use of backpropagation was for training RNNs
• For tasks involving sequential inputs
• e.g., speech and language, it is better to use RNNs
• RNNs process input sequence one element at a time
• maintain in their hidden units a state vector that implicitly contains history
of past elements in the sequence
• Same parameters (matrices U, V, W) are used at each time step
The backpropagation algorithm is applied to the unfolded graph of the computational
network to compute the derivative of the total error (log-probability of generating the
right sequence) wrt the states si and all the parameters
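A minimal sketch of this unrolled computation, with hypothetical matrices U (input-to-hidden), W (hidden-to-hidden) and V (hidden-to-output) reused at every time step; the dimensions and tanh non-linearity are illustrative assumptions:

```python
import numpy as np

def rnn_forward(x_seq, U, W, V, s0):
    # Process the sequence one element at a time; s_t summarizes the history
    s = s0
    outputs, states = [], []
    for x_t in x_seq:
        s = np.tanh(U @ x_t + W @ s)   # new state from current input and previous state
        o = V @ s                      # output at time t
        states.append(s)
        outputs.append(o)
    return outputs, states             # backprop is applied to this unfolded graph

d_in, d_hid, d_out, T = 4, 8, 3, 5
U = np.random.randn(d_hid, d_in) * 0.1
W = np.random.randn(d_hid, d_hid) * 0.1
V = np.random.randn(d_out, d_hid) * 0.1
x_seq = [np.random.randn(d_in) for _ in range(T)]
outputs, states = rnn_forward(x_seq, U, W, V, np.zeros(d_hid))
```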
• DBNs are Generative Models
• Provide estimates of both p(x|Ck) and p(Ck|x)
• Conventional neural networks are
discriminative
• Directly estimate p(Ck|x)
• Consist of several layers of Restricted
Boltzmann Machines (RBM)
• RBM
• A form of Markov Random Field
• Named after Boltzmann Distribution (Or Gibbs
Distribution)
• Gives probability that a system will be in a certain state given its
temperature
• Where E is energy and kT (constant of the distribution) is a product of
Boltzmann’s constant and thermodynamic temperature
• Energy of Boltzmann network
• Training
• Generative stochastic ANN
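For reference, the Boltzmann (Gibbs) distribution referred to above has the standard form, with E the energy of a state, kT the product of Boltzmann's constant and the thermodynamic temperature, and Z the normalizing partition function:

$$p(\text{state}) = \frac{e^{-E/kT}}{Z}, \qquad Z = \sum_{\text{states}} e^{-E/kT}$$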
• What is dropout?
• Dropout as an ensemble method
• Mask for dropout training
• Bagging vs Dropout
• Prediction intractability
• Deep nets have many non-linear hidden layers
– Making them very expressive to learn complicated
relationships between inputs and outputs
– But with limited training data, many complicated
relationships will be the result of training noise
• So they will exist in the training set and not in test
set even if drawn from same distribution
• Many methods developed to reduce overfitting
– Early stopping with a validation set
– Weight penalties (L1 and L2 regularization)
– Soft weight sharing
• Best way to regularize a fixed size model is:
– Average the predictions of all possible settings of
the parameters
– Weighting each setting with the posterior probability
given the training data
• This would be the Bayesian approach
• Dropout does this using considerably less
computation
– By approximating an equally weighted geometric
mean of the predictions of an exponential number
of learned models that share parameters
• Bagging is a method of averaging over several
models to improve generalization
• Impractical to train many neural networks since
it is expensive in time and memory
– Dropout makes it practical to apply bagging to very
many large neural networks
• It is a method of bagging applied to neural networks
• Dropout is an inexpensive but powerful method
of regularizing a broad family of models
• Dropout trains an ensemble of all subnetworks
– Subnetworks formed by removing non-output units
from an underlying base network
• We can effectively remove units by multiplying
its output value by zero
– For networks based on performing a series of affine
transformations or non-linearities
– Needs some modification for radial basis functions
based on difference between unit state and a
reference value
• A simple way to prevent neural net overfitting
(a) A standard neural net with
two hidden layers
(b) A thinned net produced by
applying dropout, crossed units
have been dropped
Drop hidden and
visible units from net,
i.e., temporarily remove
it from the network with
all input/output connections.
Choice of units to drop is
random, determined by a
probability p, chosen by a
validation set, or equal to 0.5
Deep net in Keras
Validated on the CIFAR-10 dataset
The network built had three convolution layers of
size 64, 128 and 256 followed by two densely
connected layers of size 512 and an output
dense layer of size 10
(Figures: accuracy vs dropout rate; loss vs dropout rate.)
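A hedged Keras sketch of the kind of network described above; the kernel sizes, pooling layers, optimizer and dropout rate p are assumptions, since the slide only gives the layer widths (64, 128, 256 convolutions, two dense layers of 512, and a 10-way output):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

p = 0.5  # dropout rate (assumed; the experiment varies this)
model = Sequential([
    Conv2D(64, (3, 3), activation="relu", input_shape=(32, 32, 3)),  # CIFAR-10 images
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Conv2D(256, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(512, activation="relu"),
    Dropout(p),                       # randomly drops units during training
    Dense(512, activation="relu"),
    Dropout(p),
    Dense(10, activation="softmax"),  # one output per CIFAR-10 class
])
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
```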
• In bagging we define k different models,
construct k different data sets by sampling from
the dataset with replacement, and train model i
on dataset i
• Dropout aims to approximate this process, but
with an exponentially large no. of neural
networks
• Remove non-output units
from base network.
• Remaining 4 units
yield 16 networks
• Here many networks have
no path from input to output
• Problem insignificant with
large networks
• To train with dropout we use minibatch
based learning algorithm that takes
small steps such as SGD
• At each step randomly sample a binary
mask
– Probability of including a unit is a
hyperparameter
• 0.5 for hidden units and 0.8 for input units
• We run forward & backward
propagation as usual
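A minimal sketch of one such step for a tiny two-layer network: binary masks are sampled for the input and hidden units (keep probabilities 0.8 and 0.5 as above), applied multiplicatively, and the thinned network is then used for the forward pass (the same masks would be reused in the backward pass). All shapes and values are illustrative:

```python
import numpy as np

p_input, p_hidden = 0.8, 0.5        # probability of *keeping* a unit

def dropout_forward(x, W1, W2):
    mu_in = (np.random.rand(x.shape[0]) < p_input).astype(float)   # input mask
    h = np.maximum(W1 @ (x * mu_in), 0.0)                          # hidden layer (ReLU)
    mu_hid = (np.random.rand(h.shape[0]) < p_hidden).astype(float) # hidden mask
    y = W2 @ (h * mu_hid)                                          # output layer
    return y, (mu_in, mu_hid)       # return masks for reuse in the backward pass

x = np.random.randn(6)
W1 = np.random.randn(4, 6) * 0.1
W2 = np.random.randn(2, 4) * 0.1
y, masks = dropout_forward(x, W1, W2)
```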
(Figure: feed-forward network.)
• Network with binary vector μ
whose elements correspond to
input and hidden units
• Elements of μ are sampled with the probability
of being 1 a hyperparameter:
• 0.5 for hidden units
• 0.8 for input units
• Each unit is multiplied by the
corresponding mask element
• Forward prop as usual
• Equivalent to randomly selecting one of the
subnetworks of previous slide
• Suppose that mask vector μ specifies which
units to include
• Cost of the model is specified by J(θ,μ)
• Dropout training consists of minimizing Eμ[J(θ,μ)]
• Expected value contains exponential no. of
terms
• We can get an unbiased estimate of its gradient
by sampling values of μ
• Dropout training not same as bagging training
– In bagging, the models are all independent
– In dropout, models share parameters
• Models inherit subsets of parameters from parent
network
• Parameter sharing allows an exponential no. of
models with a tractable amount of memory
• In bagging each model is trained to
convergence on its respective training set
– In dropout, most models are not explicitly trained
• Fraction of sub-networks are trained for a single step
• Parameter sharing allows good parameter settings
• Bagging:
– Ensemble accumulates votes of members
– Process is referred to as inference
• Assume model needs to output a probability distribution
• In bagging, model i produces p(i)(y|x)
• Prediction of ensemble is the mean
• Dropout:
– Submodel defined by mask vector μ defines a
probability distribution p(y|x,μ)
– Arithmetic mean over all masks is
• Where p(μ) is the distribution used to sample μ at
training time
$$\frac{1}{k}\sum_{i=1}^{k} p^{(i)}(y \mid x) \quad \text{(bagging)} \qquad\qquad \sum_{\mu} p(\mu)\, p(y \mid x, \mu) \quad \text{(dropout)}$$
• Dropout prediction is $\sum_{\mu} p(\mu)\, p(y \mid x, \mu)$
 It is intractable to evaluate due to
an exponential no. of terms
 We can approximate inference using sampling
 By averaging together the output from many
masks
 10-20 masks are sufficient for good performance
 Even better approach, at the cost of a single
forward propagation:
 use the geometric mean rather than the arithmetic mean
of the ensemble members' predicted distributions
THANK YOU!

More Related Content

Similar to ACUMENS ON NEURAL NET AKG 20 7 23.pptx

ANNs have been widely used in various domains for: Pattern recognition Funct...
ANNs have been widely used in various domains for: Pattern recognition  Funct...ANNs have been widely used in various domains for: Pattern recognition  Funct...
ANNs have been widely used in various domains for: Pattern recognition Funct...
vijaym148
 
Neuralnetwork 101222074552-phpapp02
Neuralnetwork 101222074552-phpapp02Neuralnetwork 101222074552-phpapp02
Neuralnetwork 101222074552-phpapp02
Deepu Gupta
 

Similar to ACUMENS ON NEURAL NET AKG 20 7 23.pptx (20)

NEURALNETWORKS_DM_SOWMYAJYOTHI.pdf
NEURALNETWORKS_DM_SOWMYAJYOTHI.pdfNEURALNETWORKS_DM_SOWMYAJYOTHI.pdf
NEURALNETWORKS_DM_SOWMYAJYOTHI.pdf
 
Artificial neural networks
Artificial neural networks Artificial neural networks
Artificial neural networks
 
MNN
MNNMNN
MNN
 
Neural-Networks.ppt
Neural-Networks.pptNeural-Networks.ppt
Neural-Networks.ppt
 
071bct537 lab4
071bct537 lab4071bct537 lab4
071bct537 lab4
 
Artificial neural networks (2)
Artificial neural networks (2)Artificial neural networks (2)
Artificial neural networks (2)
 
ANNs have been widely used in various domains for: Pattern recognition Funct...
ANNs have been widely used in various domains for: Pattern recognition  Funct...ANNs have been widely used in various domains for: Pattern recognition  Funct...
ANNs have been widely used in various domains for: Pattern recognition Funct...
 
ai7.ppt
ai7.pptai7.ppt
ai7.ppt
 
6
66
6
 
Ffnn
FfnnFfnn
Ffnn
 
ai7.ppt
ai7.pptai7.ppt
ai7.ppt
 
Artificial Neural Networks ppt.pptx for final sem cse
Artificial Neural Networks  ppt.pptx for final sem cseArtificial Neural Networks  ppt.pptx for final sem cse
Artificial Neural Networks ppt.pptx for final sem cse
 
8_Neural Networks in artificial intelligence.ppt
8_Neural Networks in artificial intelligence.ppt8_Neural Networks in artificial intelligence.ppt
8_Neural Networks in artificial intelligence.ppt
 
Neuralnetwork 101222074552-phpapp02
Neuralnetwork 101222074552-phpapp02Neuralnetwork 101222074552-phpapp02
Neuralnetwork 101222074552-phpapp02
 
10-Perceptron.pdf
10-Perceptron.pdf10-Perceptron.pdf
10-Perceptron.pdf
 
Classification by back propagation, multi layered feed forward neural network...
Classification by back propagation, multi layered feed forward neural network...Classification by back propagation, multi layered feed forward neural network...
Classification by back propagation, multi layered feed forward neural network...
 
Multi Layer Network
Multi Layer NetworkMulti Layer Network
Multi Layer Network
 
Neural network final NWU 4.3 Graphics Course
Neural network final NWU 4.3 Graphics CourseNeural network final NWU 4.3 Graphics Course
Neural network final NWU 4.3 Graphics Course
 
tutorial.ppt
tutorial.ppttutorial.ppt
tutorial.ppt
 
Anfis (1)
Anfis (1)Anfis (1)
Anfis (1)
 

More from gnans Kgnanshek

More from gnans Kgnanshek (19)

MICROCONTROLLER EMBEDDED SYSTEM IOT 8051 TIMER.pptx
MICROCONTROLLER EMBEDDED SYSTEM IOT 8051 TIMER.pptxMICROCONTROLLER EMBEDDED SYSTEM IOT 8051 TIMER.pptx
MICROCONTROLLER EMBEDDED SYSTEM IOT 8051 TIMER.pptx
 
EDC SLIDES PRESENTATION PRINCIPAL ON WEDNESDAY.pptx
EDC SLIDES PRESENTATION PRINCIPAL ON WEDNESDAY.pptxEDC SLIDES PRESENTATION PRINCIPAL ON WEDNESDAY.pptx
EDC SLIDES PRESENTATION PRINCIPAL ON WEDNESDAY.pptx
 
types of research.pptx
types of research.pptxtypes of research.pptx
types of research.pptx
 
CS8601 4 MC NOTES.pdf
CS8601 4 MC NOTES.pdfCS8601 4 MC NOTES.pdf
CS8601 4 MC NOTES.pdf
 
CS8601 3 MC NOTES.pdf
CS8601 3 MC NOTES.pdfCS8601 3 MC NOTES.pdf
CS8601 3 MC NOTES.pdf
 
CS8601 2 MC NOTES.pdf
CS8601 2 MC NOTES.pdfCS8601 2 MC NOTES.pdf
CS8601 2 MC NOTES.pdf
 
CS8601 1 MC NOTES.pdf
CS8601 1 MC NOTES.pdfCS8601 1 MC NOTES.pdf
CS8601 1 MC NOTES.pdf
 
Lecture_3_Gradient_Descent.pptx
Lecture_3_Gradient_Descent.pptxLecture_3_Gradient_Descent.pptx
Lecture_3_Gradient_Descent.pptx
 
Batch_Normalization.pptx
Batch_Normalization.pptxBatch_Normalization.pptx
Batch_Normalization.pptx
 
33.-Multi-Layer-Perceptron.pdf
33.-Multi-Layer-Perceptron.pdf33.-Multi-Layer-Perceptron.pdf
33.-Multi-Layer-Perceptron.pdf
 
11_NeuralNets.pdf
11_NeuralNets.pdf11_NeuralNets.pdf
11_NeuralNets.pdf
 
NN-Ch7.PDF
NN-Ch7.PDFNN-Ch7.PDF
NN-Ch7.PDF
 
NN-Ch6.PDF
NN-Ch6.PDFNN-Ch6.PDF
NN-Ch6.PDF
 
NN-Ch5.PDF
NN-Ch5.PDFNN-Ch5.PDF
NN-Ch5.PDF
 
NN-Ch3.PDF
NN-Ch3.PDFNN-Ch3.PDF
NN-Ch3.PDF
 
NN-Ch2.PDF
NN-Ch2.PDFNN-Ch2.PDF
NN-Ch2.PDF
 
unit-1 MANAGEMENT AND ORGANIZATIONS.pptx
unit-1 MANAGEMENT AND ORGANIZATIONS.pptxunit-1 MANAGEMENT AND ORGANIZATIONS.pptx
unit-1 MANAGEMENT AND ORGANIZATIONS.pptx
 
POM all 5 Units.pptx
POM all 5 Units.pptxPOM all 5 Units.pptx
POM all 5 Units.pptx
 
3 AGRI WORK final review slides.pptx
3 AGRI WORK final review slides.pptx3 AGRI WORK final review slides.pptx
3 AGRI WORK final review slides.pptx
 

Recently uploaded

The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 
Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPSSpellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
AnaAcapella
 

Recently uploaded (20)

What is 3 Way Matching Process in Odoo 17.pptx
What is 3 Way Matching Process in Odoo 17.pptxWhat is 3 Way Matching Process in Odoo 17.pptx
What is 3 Way Matching Process in Odoo 17.pptx
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPSSpellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
OS-operating systems- ch05 (CPU Scheduling) ...
OS-operating systems- ch05 (CPU Scheduling) ...OS-operating systems- ch05 (CPU Scheduling) ...
OS-operating systems- ch05 (CPU Scheduling) ...
 
AIM of Education-Teachers Training-2024.ppt
AIM of Education-Teachers Training-2024.pptAIM of Education-Teachers Training-2024.ppt
AIM of Education-Teachers Training-2024.ppt
 
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdfFICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
 
PANDITA RAMABAI- Indian political thought GENDER.pptx
PANDITA RAMABAI- Indian political thought GENDER.pptxPANDITA RAMABAI- Indian political thought GENDER.pptx
PANDITA RAMABAI- Indian political thought GENDER.pptx
 
Play hard learn harder: The Serious Business of Play
Play hard learn harder:  The Serious Business of PlayPlay hard learn harder:  The Serious Business of Play
Play hard learn harder: The Serious Business of Play
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Tatlong Kwento ni Lola basyang-1.pdf arts
Tatlong Kwento ni Lola basyang-1.pdf artsTatlong Kwento ni Lola basyang-1.pdf arts
Tatlong Kwento ni Lola basyang-1.pdf arts
 
UGC NET Paper 1 Unit 7 DATA INTERPRETATION.pdf
UGC NET Paper 1 Unit 7 DATA INTERPRETATION.pdfUGC NET Paper 1 Unit 7 DATA INTERPRETATION.pdf
UGC NET Paper 1 Unit 7 DATA INTERPRETATION.pdf
 
Economic Importance Of Fungi In Food Additives
Economic Importance Of Fungi In Food AdditivesEconomic Importance Of Fungi In Food Additives
Economic Importance Of Fungi In Food Additives
 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptx
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
OSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsOSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & Systems
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 

ACUMENS ON NEURAL NET AKG 20 7 23.pptx

  • 1. Presented by Dr. Gnanasekar.A.K Professor & Head Department of Computer Science & Engineering P T Lee Chengalvaraya Naicker College of Engineering & Technology, Kanchipuram hodcse@ptleecncet.com 9677338977
  • 2.  Neurons and the Brain  Neural Networks  Perceptions  Multi-layer Networks  Applications  The Hopfield Network  Multilayer Neural Networks  Deep Convolutional Nets  Recurrent Nets  Deep Belief Networks  Dropout
  • 3.  A model of reasoning based on the human brain  Complex networks of simple computing elements  Capable of learning from examples  with appropriate learning methods  Collection of simple elements performs high-level operations
  • 4.  The human brain incorporates nearly 10 billion neurons and 60 trillion connections between them.  Our brain can be considered as a highly complex, non-linear and parallel information-processing system.  Learning is a fundamental and essential characteristic of biological neural networks.
  • 5. [Russell & Norvig, 1995] Year Neural Network Designer Description 1943 McCulloch and Pitts Neuron Mc Culloch Pitts Logic gates 1949 Hebb Hebb Strength increases if neurons are active 1958-1988 Perceptron Rosenblatt Weights of path can be adjusted 1960 Adaline Widrow and Hoff Mean squared error 1972 SOM(self-organizing map) Kohonen Clustering 1982 Hopfield John Hopfield Associative memory nets 1986 Back Propagation Rumelhard Multilayer 1987-90 ART(Adaptive Resonance Theory) Carpenter Used for both binary and analog
  • 6. [Russell & Norvig, 1995] •Basic parts: Body(soma), Dendrites(inputs), Axons (outputs), Synapses (connections)
  • 7. [Russell & Norvig, 1995] •Input summing function •Activation function •Weighted inputs •Output
  • 11.  Recall:- processing phase for a NN and its objective is to retrieve the information. The process of computing o for a given x  Basic forms of neural information processing  Auto association  Hetero association  Classification
  • 12.  Set of patterns can be stored in the network  If a pattern similar to a member of the stored set is presented, an association with the input of closest stored pattern is made
  • 13.  Associations between pairs of patterns are stored  Distorted input pattern may cause correct heteroassociation at the output
  • 14.  Set of input patterns is divided into a number of classes or categories  In response to an input pattern from the set, the classifier is supposed to recall the information regarding class membership of the input pattern.
  • 15.  Weights  Bias  Threshold  Learning rate  Momentum factor  Vigilance parameter  Notations used in ANN
  • 16.  Each neuron is connected to every other neuron by means of directed links  Links are associated with weights  Weights contain information about the input signal and is represented as a matrix  Weight matrix also called connection matrix
  • 17. W= 1 2 3 . . . . . T T T T n w w w w                                   11 12 13 1 21 22 23 2 1 2 3 ... ... .................. ................... ... m m n n n nm w w w w w w w w w w w w                   =
  • 18.  wij –is the weight from processing element ”i” (source node) to processing element “j” (destination node) X1 1 Xi Yj Xn w 1j wij wnj bj 0 0 0 1 1 2 2 0 1 1 .... n i ij inj i j j j n nj n j i ij i n j i ij inj i y xw x w xw x w x w w xw y b xw                
  • 19.  Used to calculate the output response of a neuron.  Sum of the weighted input signal is applied with an activation to obtain the response.  Activation functions can be linear or non linear  Already dealt  Identity function  Single/binary step function  Discrete/continuous sigmoidal function.
  • 20.  Bias is like another weight. Its included by adding a component x0=1 to the input vector X.  X=(1,X1,X2…Xi,…Xn)  Bias is of two types  Positive bias: increase the net input  Negative bias: decrease the net input
  • 21.  The relationship between input and output given by the equation of straight line y=mx+c X Y Input C(bias) y=mx+C
  • 22.  Set value based upon which the final output of the network may be calculated  Used in activation function  The activation function using threshold can be defined as             ifnet ifnet net f 1 1 ) (
  • 23.  Denoted by α.  Used to control the amount of weight adjustment at each step of training  Learning rate ranging from 0 to 1 determines the rate of learning in each time step
  • 24. oSupervised: oAdaline, Madaline oPerceptron oBack Propagation omultilayer perceptrons oRadial Basis Function Networks oUnsupervised oCompetitive Learning oKohonen self organizing map oLearning vector quantization oHebbian learning
  • 25. No one‐fits‐all formula Over fitting can occur if a “good” training set is not chosen What constitutes a “good” training set?  Samples must represent the general population. Samples must contain members of each class. Samples in each class must contain a wide range of variations or noise effect. The size of the training set is related to the number of hidden neurons
  • 27. This is what we have seen so far! A network is fed with a set of training samples (inputs and corresponding output), and it uses these samples to learn the general relationship between the inputs and the outputs. This relationship is represented by the values of the weights of the trained network.
  • 28. No desired output is associated with the training data! Faster than supervised learning Used to find out structures within data:  Clustering  Compression
  • 29. Like supervised learning, but: Weights adjusting is not directly related to the error value. The error value is used to randomly, shuffle weights! Relatively slow learning due to ‘randomness’.
  • 30. Function approximation including time series prediction and modelling. Classification including patterns and sequences recognition, novelty detection and sequential decision making. (radar systems, face identification, handwritten text recognition) Data processing including filtering, clustering blinds source separation and compression. (data mining, e-mail Spam filtering)
  • 31. Advantages Adapt to unknown situations Powerful, it can model complex functions. Ease of use, learns by example, and very little user domain‐specific expertise needed Disadvantages Forgets Not exact Large complexity of the network structure
  • 32. [Russell & Norvig, 1995] This vastly simplified model of real neurons is also known as a Threshold Logic Unit A set of connections brings in activations from other neurons. A processing unit sums the inputs, and then applies a non- linear activation function (i.e. squashing/ Transfer/ threshold function). An output line transmits the result to other neurons y = F(w1x1+ w2x2 - b)
  • 33. [Russell & Norvig, 1995] Binary perceptrons Continuous perceptrons
  • 34. 1 2 Y 1 2 Y 1st Neural Network: AND function Threshold(Y) = 2 X1 Y X2 1 1 1st Neural Network: OR function X1 Y X2 2 2 Threshold(Y) = 2
  • 35. 1 2 Y 1 Y 2 1st Neural Network: AND not function Threshold(Y) = 2 X1 Y X2 2 -1 1st Neural Network: XOR function X1 Y X2 2 -1 Z2 Z1 2 -1 2 2 Threshold(Y) = 2
  • 36.
  • 37.  Weighted inputs are summed up by the input function  The (nonlinear) activation function calculates the activation value, which determines the output [Russell & Norvig, 1995]
  • 38.  Stept(x) = 1 if x >= t, else 0  Sign(x) = +1 if x >= 0, else –1  Sigmoid(x)= 1/(1+e-x) [Russell & Norvig, 1995]
  • 39.  simple neurons can act as logic gates  appropriate choice of activation function, threshold, and weights  step function as activation function [Russell & Norvig, 1995]
  • 40. layered structures  networks are arranged into layers  interconnections mostly between two layers  some networks may have feedback connections
  • 41.  single layer, feed- forward network  historically one of the first types of neural networks  late 1950s  the output is calculated as a step function applied to the weighted sum of inputs  capable of learning simple functions  linearly separable [Russell & Norvig, 1995]
  • 42. [Russell & Norvig, 1995]  perceptrons can deal with linearly separable functions  some simple functions are not linearly separable  XOR function 0,0 0,1 1,0 1,1 0,0 0,1 1,0 1,1 AND XOR
  • 43.  linear separability can be extended to more than two dimensions  more difficult to visualize [Russell & Norvig, 1995]
  • 44.  This is done by making small adjustments in the weights  to reduce the difference between the actual and desired outputs of the Perceptron.  The initial weights are randomly assigned  usually in the range [0.5, 0.5], or [0, 1]  Then the they are updated to obtain the output consistent with the training examples.
  • 45.  perceptrons can learn from examples through a simple learning rule. For each example row (iteration), do the following:  calculate the error of a unit Erri as the difference between the correct output Ti and the calculated output Oi Erri = Ti - Oi  adjust the weight Wj of the input Ij such that the error decreases Wij = Wij +  *Iij * Errij   is the learning rate, a positive constant less than unity.  this is a gradient descent search through the weight space
  • 46.
  • 47.  The network consists of an input layer of source neurons, at least one middle or hidden layer of computational neurons, and an output layer of computational neurons.  The input signals are propagated in a forward direction on a layer-by-layer basis  feedforward neural network  the back-propagation learning algorithm can be used for learning in multi-layer networks
  • 48.  two-layer network  input units Ik  usually not counted as a separate layer  hidden units aj  output units Oi  usually all nodes of one layer have weighted connections to all nodes of the next layer Ik aj Oi Wji Wkj
  • 49.
  • 50.  Learning in a multilayer network proceeds the same way as for a perceptron.  A training set of input patterns is presented to the network.  The network computes its output pattern, and if there is an error  or in other words a difference between actual and desired output patterns  the weights are adjusted to reduce this error.  proceeds from the output layer to the hidden layer(s)  updates the weights of the units leading to the layer
  • 51.  In a back-propagation neural network, the learning algorithm has two phases.  First, a training input pattern is presented to the network input layer. The network propagates the input pattern from layer to layer until the output pattern is generated by the output layer.  If this pattern is different from the desired output, an error is calculated and then propagated backwards through the network from the output layer to the input layer. The weights are modified as the error is propagated.
  • 52.
  • 53. Three-layer network for solving the Exclusive-OR Operation
  • 54. Final results of three-layer network learning Inputs x1 x2 1 0 1 0 1 1 0 0 0 1 1 Desired output yd 0 0.0155 Actual output y5 Y Error e Sum of squared errors e 0.9849 0.9849 0.0175 0.0155 0.0151 0.0151 0.0175 0.0010
  • 55. Network for solving the Exclusive-OR operation
  • 56. (a) Decision boundary constructed by hidden neuron 3; (b) Decision boundary constructed by hidden neuron 4; (c) Decision boundaries constructed by the complete three-layer network x1 x2 1 (a) 1 x2 1 1 (b) 0 0 x1 + x2 – 1.5 = 0 x1 + x2 – 0.5 = 0 x1 x1 x2 1 1 (c) 0 Decision boundaries
  • 57.  expressiveness  weaker than predicate logic  good for continuous inputs and outputs  computational efficiency  training time can be exponential in the number of inputs  depends critically on parameters like the learning rate  local minima are problematic  can be overcome by simulated annealing, at additional cost  generalization  works reasonably well for some functions (classes of problems)  no formal characterization of these functions
  • 58.  sensitivity to noise  very tolerant  they perform nonlinear regression  transparency  neural networks are essentially black boxes  there is no explanation or trace for a particular answer  tools for the analysis of networks are very limited  some limited methods to extract rules from networks  prior knowledge  very difficult to integrate since the internal representation of the networks is not easily accessible
  • 59.  domains and tasks where neural networks are successfully used  recognition  control problems  series prediction  weather, financial forecasting  categorization  sorting of items (fruit, characters, …)
  • 60.  Neural networks were designed on analogy with the brain.  The brain’s memory, however, works by association.  For example, we can recognise a familiar face even in an unfamiliar environment within 100-200 ms.  We can also recall a complete sensory experience, including sounds and scenes, when we hear only a few bars of music.  The brain routinely associates one thing with another. The Hopfield Network
  • 61.  Multilayer neural networks trained with the back- propagation algorithm are used for pattern recognition problems.  However, to emulate the human memory’s associative characteristics we need a different type of network: a recurrent neural network.  A recurrent neural network has feedback loops from its outputs to its inputs.
  • 62. Single-layer n-neuron Hopfield network (inputs x1 ... xn feed neurons 1 ... n, which produce outputs y1 ... yn)  The stability of recurrent networks was solved only in 1982, when John Hopfield formulated the physical principle of storing information in a dynamically stable network.
  • 63. [Russell & Norvig, 1995] o Objective function, averaged over all training examples is a hilly landscape in high-dimensional space of weight values o Negative gradient vector indicates direction of steepest descent taking it closer to a minimum
  • 64. • Goal is to learn the weights w from a labelled set of training samples • Learning procedure has two stages 1. Evaluate derivatives of the error function ∇E(w) with respect to the weights w1, ..., wT 2. Use the derivatives to compute adjustments to the weights: w(τ+1) = w(τ) − η∇E(w(τ))
  • 65. • Gradient descent • Update the weights using w(τ+1) = w(τ) − η∇E(w(τ)) • The gradient vector ∇E(w) consists of the derivatives ∂E/∂w(1)ji and ∂E/∂w(2)kj evaluated using back-propagation • It is a W x 1 vector: there are W = M(D+1) + K(M+1) elements in the vector
  • 66. Most practitioners use SGD for DL. It consists of showing the input vector for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. Called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. Usually finds a good set of weights quickly compared to elaborate optimization techniques. After training, performance is measured on a different test set: this tests the generalization ability of the machine, its ability to produce sensible answers on new inputs never seen during training. Fluctuations in the objective occur as gradient steps are taken on mini-batches.
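A generic SGD loop as a sketch; grad_fn and the toy least-squares problem are placeholders standing in for the real objective and its back-propagated gradient:

```python
import numpy as np

def sgd(w, grad_fn, data, eta=0.01, batch_size=32, epochs=10, rng=None):
    """Stochastic gradient descent: average the gradient over a small batch
    of examples and take a step; repeat until the objective stops improving."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(data)
    for _ in range(epochs):
        order = rng.permutation(n)                   # visit examples in random order
        for start in range(0, n, batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            # noisy estimate of the full gradient from a few examples
            g = np.mean([grad_fn(w, example) for example in batch], axis=0)
            w = w - eta * g                          # w(tau+1) = w(tau) - eta * grad
    return w

# Toy usage: fit y = w*x by least squares; gradient of 0.5*(w*x - y)^2 is (w*x - y)*x
data = [(x, 3.0 * x) for x in np.linspace(-1, 1, 100)]
grad = lambda w, ex: (w * ex[0] - ex[1]) * ex[0]
print(round(float(sgd(np.array(0.0), grad, data, eta=0.5, epochs=50)), 2))  # ~3.0
```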
  • 67. Backpropagation formulas (the value of δ for a particular hidden unit is obtained by propagating the δ's backward from units higher up in the network): 1. Apply input vector xn to the network and forward propagate through the network using aj = Σi wji zi and zj = h(aj) 2. Evaluate δk for all output units using δk = yk − tk 3. Backpropagate the δ's using δj = h'(aj) Σk wkj δk to obtain δj for each hidden unit 4. Use ∂En/∂wji = δj zi to evaluate the required derivatives
  • 68.
  • 69. • Two-layer network • Sum-of-squared error • Output units: linear activation functions, i.e., multiple regression yk = ak • Hidden units have tanh activation function h(a) = tanh(a) = (exp(a) − exp(−a))/(exp(a) + exp(−a)), which has the simple derivative h'(a) = 1 − h(a)^2 • Standard sum-of-squared error: En = (1/2) Σk=1..K (yk − tk)^2, where yk is the activation of output unit k and tk is the corresponding target
  • 70. Simple example: forward and backward prop • Forward propagation: aj = Σi=0..D wji(1) xi, zj = tanh(aj), yk = Σj=0..M wkj(2) zj • Output differences: δk = yk − tk • Backward propagation (δ's for hidden units, using h'(a) = 1 − h(a)^2): δj = (1 − zj^2) Σk=1..K wkj δk • Derivatives wrt first-layer and second-layer weights: ∂En/∂wji(1) = δj xi, ∂En/∂wkj(2) = δk zj • Batch method: ∂E/∂wji = Σn ∂En/∂wji
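The same forward and backward pass written out in NumPy; the bias handling via a prepended constant 1 and the toy sizes D=2, M=3, K=1 are assumptions consistent with the notation above:

```python
import numpy as np

def forward_backward(x, t, W1, W2):
    """One example: forward prop, then backprop the deltas to get the gradients.
    W1 is M x (D+1), W2 is K x (M+1); a leading 1 plays the role of the bias input."""
    x_tilde = np.concatenate(([1.0], x))          # x_0 = 1 (bias)
    a = W1 @ x_tilde                              # a_j = sum_i w_ji^(1) x_i
    z = np.tanh(a)                                # z_j = tanh(a_j)
    z_tilde = np.concatenate(([1.0], z))          # z_0 = 1 (bias)
    y = W2 @ z_tilde                              # y_k = sum_j w_kj^(2) z_j (linear outputs)

    delta_k = y - t                               # output deltas: delta_k = y_k - t_k
    # hidden deltas: delta_j = (1 - z_j^2) * sum_k w_kj delta_k  (h'(a) = 1 - h(a)^2)
    delta_j = (1.0 - z ** 2) * (W2[:, 1:].T @ delta_k)

    dE_dW2 = np.outer(delta_k, z_tilde)           # dEn/dw_kj^(2) = delta_k z_j
    dE_dW1 = np.outer(delta_j, x_tilde)           # dEn/dw_ji^(1) = delta_j x_i
    return y, dE_dW1, dE_dW2

# Toy check with D=2 inputs, M=3 hidden units, K=1 output
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 3)), rng.normal(size=(1, 4))
y, g1, g2 = forward_backward(np.array([0.5, -1.0]), np.array([0.2]), W1, W2)
print(y, g1.shape, g2.shape)   # gradients have the same shapes as W1 and W2
```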
  • 71. 2 Hidden Layers, 1 Output Layer Each layer is a module through which one can back-propagate gradients. At each layer, we first compute the total input z to each unit, which is a weighted sum of the outputs of the units in the layer below. Then a non-linear function f(.) is applied to z to get the output of the unit.
  • 72. A small change Δx in x gets transformed first into a small change Δy in y by getting multiplied by ∂y/∂x (that is, the definition of partial derivative). Similarly, the change Δy creates a change Δz in z. Substituting one equation into the other gives the chain rule of derivatives: how Δx gets turned into Δz through multiplication by the product of ∂y/∂x and ∂z/∂y. It also works when x, y and z are vectors (and the derivatives are Jacobian matrices). It tells us how two small effects (that of a small change of x on y, and that of y on z) are composed.
  • 73. Two hidden layers; layers are labelled i, j, k, l. At each hidden layer we compute the error derivative wrt the output of each unit, which is a weighted sum of the error derivatives wrt the total inputs to the units in the layer above. We convert the error derivative wrt the output into the error derivative wrt the input by multiplying it by the gradient of f(z). At the output layer, the error derivative wrt the output of a unit is computed by differentiating the cost function. This gives yl − tl if the cost function for unit l is 0.5(yl − tl)^2, where tl is the target value. Once ∂E/∂zk is known, the error derivative for the weight wjk on the connection from unit j in the layer below is just yj ∂E/∂zk.
  • 74. • Most popular today is the Rectified Linear Unit (ReLU) • a half-wave rectifier f(z) = max(z, 0) • In past decades, neural nets used smoother non-linearities such as tanh(z) or 1/(1+exp(−z)) • ReLU learns faster in networks with many layers • allowing training of deep supervised networks without unsupervised pre-training
  • 75. • Rectified linear unit (ReLU), f(z) = max(0, z), commonly used in recent years • More conventional sigmoids: • hyperbolic tangent, f(z) = (exp(z) − exp(−z))/(exp(z) + exp(−z)) and • logistic function, f(z) = 1/(1 + exp(−z))
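The three non-linearities side by side in NumPy, for reference:

```python
import numpy as np

def relu(z):       # half-wave rectifier, f(z) = max(0, z)
    return np.maximum(0.0, z)

def tanh(z):       # hyperbolic tangent, (e^z - e^-z)/(e^z + e^-z)
    return np.tanh(z)

def logistic(z):   # logistic sigmoid, 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z), tanh(z), logistic(z), sep="\n")
```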
  • 76. Network with 2 inputs, 2 hidden units, 1 output unit • Works by distorting the input space (note that the grid is also distorted) • Input data (red and blue curves) are not linearly separable • The network's hidden representation makes them linearly separable
  • 77. • Linear classifiers can only carve the input space into very simple regions • Image and speech recognition require the input-output function to be insensitive to irrelevant variations of the input • e.g., position, orientation and illumination of an object • variations in pitch or accent of speech • while being sensitive to particular minute variations • e.g., between a white wolf and a breed of wolf-like white dog called a Samoyed • At the pixel level two Samoyeds in different positions may be very different, whereas a Samoyed and a wolf in the same position and background may be very similar
  • 78. • Shallow classifiers need a good feature extractor • One that produces representations that are: • selective to aspects of image important for discrimination • but invariant to irrelevant aspects such as pose of the animal • Generic features (e.g., Gaussian kernel) do not generalize well far from training examples • Hand-designing good feature extractors requires engineering skill and domain expertise • Deep learning learns features automatically
  • 79. • Multilayer stack of simple modules • All modules (or most) subject to: • Learning • Non-linear input-output mappings • Each module transforms its input to improve both selectivity and invariance of the representation • With a depth of 5 to 20 layers, can implement extremely intricate functions of the input • Sensitive to minute details • Distinguish Samoyeds from white wolves • Insensitive to irrelevant variations • Background, pose, lighting, surrounding objects
  • 80. • It was thought that simple gradient descent would get trapped in local minima • Rarely a problem with large networks • Regardless of initial conditions, system always reaches solutions of similar quality • Landscape is packed with a combinatorially large number of saddle points where gradient is zero • Almost all have similar values of objective function w(τ+1) = w(τ ) − η∇ E (w(τ ) )
  • 81. • Unsupervised learning to create layers of feature detectors • No need for labelled data • Objective of learning in each layer of feature detectors: • Ability to reconstruct or model activities of feature detectors (or raw inputs) in the layer below • By pre-training, weights of a deep network could be initialized to sensible values • Final layer of output units added at the top of the network and the whole deep system fine-tuned using back-propagation • Worked well in handwritten digit recognition • When data was limited
  • 82. • First major application was in speech recognition • Made possible by the advent of fast GPUs • Allowed networks to be trained 10 or 20 times faster • Record-breaking results on a speech recognition benchmark
  • 83. • It turned out that pre-training stage was only needed for small data sets • Convolutional neural networks • Type of deep feedforward network • Much easier to train and generalized much better than networks with full connectivity between adjacent layers
  • 84. • Designed to process data that come in the form of multiple arrays • E.g., a color image composed of three 2D arrays of pixel intensities in three color channels • Many data modalities are in the form of multiple arrays: • 1D for signals and sequences, including language • 2D for images and audio spectrograms • 3D for video or volumetric images
  • 85. • Take advantage of properties of natural signals 1. Local connections 2. Shared weights 3. Pooling 4. Use of many layers
  • 86. • Need substantial number of training samples • Slow learning (convergence times) • Inadequate parameter selection techniques that lead to poor minima
  • 87. • Network should exhibit invariance to translation, scaling and elastic deformations • A large training set can take care of this • But it ignores a key property of images • Nearby pixels are more strongly correlated than distant ones • Modern computer vision approaches exploit this property • Information can be merged at later stages to get higher-order features and information about the whole image
  • 88. • Classic notions of simple cells and complex cells • Architecture similar to LGN-V1-V2-V4-IT hierarchy in visual cortex ventral pathway • LGN: lateral geniculate nucleus receives input from retina • 30 different areas of visual cortex: V1 and V2 are principal • Infero-Temporal cortex performs object recognition
  • 89. 1. Local Receptive Fields 2. Subsampling 3. Weight Sharing
  • 90. • Instead of treating the input with a fully connected network, two layers of neural network units are used: 1. Layer of convolutional units • which consider overlapping 5 x 5 pixel patches of the input image • each plane of 10 x 10 = 100 units is called a feature map • all units within a feature map share the same weights, so only 25 weights are needed per map • due to this weight sharing the operation is equivalent to convolution • different features have different feature maps 2. Layer of subsampling units (e.g., 2 x 2) • Several stages of feature maps and sub-sampling follow • Gradual reduction of spatial resolution is compensated by increasing the number of features • Final layer has a softmax output • Whole network is trained using backpropagation
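A tiny NumPy illustration of the weight sharing: one 5 x 5 kernel (25 weights) slid over the input produces a whole feature map; the 14 x 14 input size is an assumption chosen so that 5 x 5 patches yield the 10 x 10 map mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(14, 14))       # assumed input size: 5x5 patches give a 10x10 map
kernel = rng.normal(size=(5, 5))        # the 25 shared weights of one feature map

feature_map = np.empty((10, 10))
for r in range(10):
    for c in range(10):
        patch = image[r:r + 5, c:c + 5]            # overlapping 5x5 pixel patch
        feature_map[r, c] = np.sum(patch * kernel) # same 25 weights at every position

print(feature_map.shape)                 # (10, 10): 100 units, but only 25 weights
print(kernel.size, "shared weights vs", 100 * image.size, "for a fully connected layer")
```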
  • 91. • Reducing the complexity of a network • Encouraging groups of weights to have similar values • Only applicable when form of the network can be specified in advance • Division of weights into groups, mean weight value for each group and spread of values are determined during the learning process
  • 92. Outputs (not filters) of each layer (shown horizontally). Each rectangular image is a feature map corresponding to the output for one of the learned features, detected at each of the image positions. Lower-level features act as oriented edge detectors
  • 93. • Structured as a series of stages • First few stages are composed of two types of layers and a non-linearity: 1. Convolutional layer • To detect local conjunctions of features from previous layer 2. Non-linearity • ReLU 3. Pooling layer • To merge semantically similar features into one
  • 94. • Organized in feature maps • Each unit connected to local patches in feature maps of previous layer through weights (called a filter bank) • Result is passed through a ReLU
  • 95. • Computes maximum of a local patch of units in one feature map • Neighboring pooling units take input from patches that are shifted by more than one row or column • Thereby reducing the dimension of the representation • Creating invariance to small shifts and distortions
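A sketch of one such stage (convolution with a small filter bank, ReLU, then 2 x 2 max pooling) in plain NumPy; the image and filter sizes are illustrative:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution (really cross-correlation, as in most deep nets)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def max_pool2x2(fmap):
    """Max over non-overlapping 2x2 patches: merges similar features, adds shift invariance."""
    H, W = fmap.shape
    return fmap[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.normal(size=(12, 12))
filter_bank = rng.normal(size=(4, 3, 3))        # 4 filters -> 4 feature maps

stage = [max_pool2x2(np.maximum(0.0, conv2d_valid(image, k)))   # conv -> ReLU -> pool
         for k in filter_bank]
print(len(stage), stage[0].shape)                # 4 feature maps of size 5 x 5
```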
  • 96. • As simple as through a regular deep network • Allow all weights in all filter banks to be trained
  • 97. • Neocognitron • Similar architecture • Did not have end-to-end supervised learning using Backprop • ConvNet with probabilistic model was used for OCR and handwritten check reading
  • 98. • Applied with great success to images: detection, segmentation and recognition of objects and regions • Tasks where abundant data was available • Traffic sign recognition • Segmentation of biological images • Connectomics • Detection of faces, text, pedestrians, human bodies in natural images • Major recent practical success is face recognition • Images can be labeled at pixel level • Applications in autonomous mobile robots, and self-driving cars • Other applications gaining importance • Natural language understanding • Speech recognition
  • 99. • Data set of 1 million images from the web • Contained 1,000 different classes • Error rate • Halved the error rate of best competing computer vision approaches • Success came from: • Efficient use of GPUs • ReLUs • New regularization technique called dropout • Techniques to generate new training samples by deforming existing ones
  • 100. • Combining ConvNets and Recurrent Net Modules • Caption generated by a recurrent neural network (RNN) taking as input: 1. Representation generated by a deep CNN 2. RNN trained to translate high-level representations of images into captions
  • 101. • Different focus (lighter patches given more attention) • As it generates each word (bold), it exploits it to achieve better translation of images to captions
  • 102. 1. 10 to 20 layers of ReLUs 2. Hundreds of millions of weights 3. Billions of connections between units 4. Training time • Would have taken weeks a couple of years ago • Advances in hardware, software and parallelization reduces it to a few hours • ConvNets are easily amenable to efficient hardware implementations • NVIDIA, Mobileye, Intel, Qualcomm and Samsung are developing ConvNet chips for smartphones, cameras, robots and self-driving cars
  • 103. • Deep nets have two different exponential advantages over classic learning algorithms. Both advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate compositional structure 1. Learning distributed representations enables generalization to new combinations of the values of learned features beyond those seen during training • e.g., 2^n combinations are possible with n binary features 2. Composing layers of representations brings another advantage: exponential in depth
  • 104. • Predicting the next word from the local context of earlier words • Each word is presented as a 1-of-N vector • In the first layer each word creates a different word vector • In the language model, the next layers learn to convert the input word vectors into an output word vector for the predicted word • Figures: word representations for modeling language, non-linearly projected to 2-D using the t-SNE algorithm (semantically similar words are mapped nearby); 2-D representation of phrases learnt by an English-French encoder-decoder RNN • Learnt using backpropagation that jointly learns a representation for each word and a function that predicts a target quantity (next word, or sequence of words for translation)
  • 105. • Logic-inspired paradigm: • an instance of a symbol is something for which the only property is that it is either identical or non-identical to other symbol instances • it has no internal structure relevant to its use • to reason with symbols they must be bound to variables in judiciously chosen rules of inference • Neural networks use • big activity vectors, big weight matrices and scalar non-linearities • to perform fast intuitive inference that underpins commonsense reasoning
  • 106.  Standard statistical models count frequencies of short symbol sequences of length up to N  The number of possible sequences is V^N, where V is the vocabulary size  So taking a context of more than a handful of words would require very large corpora  N-grams treat each word as an atomic unit, so they cannot generalize across semantically related sequences  Neural language models can, because they associate each word with a vector of real-valued features
  • 107. • An exciting early use of backpropagation was for training RNNs • For tasks involving sequential inputs, e.g., speech and language, it is better to use RNNs • RNNs process an input sequence one element at a time • maintain in their hidden units a state vector that implicitly contains the history of past elements in the sequence • The same parameters (matrices U, V, W) are used at each time step • The backpropagation algorithm is applied to the unfolded graph of the computational network to compute the derivative of the total error (log-probability of generating the right sequence) wrt the states si and all the parameters
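A minimal sketch of the unrolled computation with shared parameters U, W, V; the tanh state update and softmax output are common choices assumed here, not taken from the slide:

```python
import numpy as np

def rnn_forward(xs, U, W, V, s0=None):
    """Unroll a simple RNN: s_t = tanh(U x_t + W s_{t-1}), o_t = softmax(V s_t).
    The same U, W, V are reused at every time step."""
    softmax = lambda a: np.exp(a - a.max()) / np.exp(a - a.max()).sum()
    s = np.zeros(W.shape[0]) if s0 is None else s0
    states, outputs = [], []
    for x in xs:                       # process the sequence one element at a time
        s = np.tanh(U @ x + W @ s)     # hidden state summarises the history so far
        states.append(s)
        outputs.append(softmax(V @ s))
    return states, outputs

rng = np.random.default_rng(0)
D, H, K = 4, 8, 3                      # input, state and output sizes (illustrative)
U, W, V = rng.normal(size=(H, D)), rng.normal(size=(H, H)), rng.normal(size=(K, H))
xs = [rng.normal(size=D) for _ in range(5)]
states, outputs = rnn_forward(xs, U, W, V)
print(len(states), outputs[0].round(3))
```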
  • 108. • DBNs are Generative Models • Provide estimates of both p(x|Ck) and p(Ck|x) • Conventional neural networks are discriminative • Directly estimate p(Ck|x) • Consist of several layers of Restricted Boltzmann Machines (RBM) • RBM • A form of Markov Random Field
  • 109.
  • 110. • Named after Boltzmann Distribution (Or Gibbs Distribution) • Gives probability that a system will be in a certain state given its temperature • Where E is energy and kT (constant of the distribution) is a product of Boltzmann’s constant and thermodynamic temperature • Energy of Boltzmann network • Training
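For reference, the Boltzmann (Gibbs) distribution the slide alludes to, together with the standard RBM energy function; the RBM energy is the usual textbook form, assumed here since the slide's own formula is an image:

```latex
% Boltzmann (Gibbs) distribution: probability of a state with energy E at temperature T
p(\text{state}) \;=\; \frac{1}{Z}\, e^{-E/kT},
\qquad Z = \sum_{\text{states}} e^{-E/kT}

% Standard energy of a restricted Boltzmann machine with visible units v and hidden units h
E(\mathbf{v},\mathbf{h}) \;=\; -\sum_i b_i v_i \;-\; \sum_j c_j h_j \;-\; \sum_{i,j} v_i\, w_{ij}\, h_j
```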
  • 112. • What is dropout? • Dropout as an ensemble method • Mask for dropout training • Bagging vs Dropout • Prediction intractability
  • 113. • Deep nets have many non-linear hidden layers – Making them very expressive to learn complicated relationships between inputs and outputs – But with limited training data, many complicated relationships will be the result of training noise • So they will exist in the training set and not in test set even if drawn from same distribution • Many methods developed to reduce overfitting – Early stopping with a validation set – Weight penalties (L1 and L2 regularization) – Soft weight sharing
  • 114. • Best way to regularize a fixed size model is: – Average the predictions of all possible settings of the parameters – Weighting each setting with the posterior probability given the training data • This would be the Bayesian approach • Dropout does this using considerably less computation – By approximating an equally weighted geometric mean of the predictions of an exponential number of learned models that share parameters
  • 115. • Bagging is a method of averaging over several models to improve generalization • Impractical to train many neural networks since it is expensive in time and memory – Dropout makes it practical to apply bagging to very many large neural networks • It is a method of bagging applied to neural networks • Dropout is an inexpensive but powerful method of regularizing a broad family of models
  • 116. • Dropout trains an ensemble of all subnetworks – Subnetworks formed by removing non-output units from an underlying base network • We can effectively remove a unit by multiplying its output value by zero – For networks based on performing a series of affine transformations or non-linearities – Needs some modification for radial basis functions based on the difference between the unit state and a reference value
  • 117. • A simple way to prevent neural net overfitting (a) A standard neural net with two hidden layers (b) A thinned net produced by applying dropout; crossed units have been dropped. Drop hidden and visible units from the net, i.e., temporarily remove them from the network along with all their input/output connections. The choice of units to drop is random, determined by a probability p, chosen using a validation set or simply set equal to 0.5
  • 118.
  • 119. Deep net in Keras, validated on the CIFAR-10 dataset. The network built had three convolution layers of size 64, 128 and 256, followed by two densely connected layers of size 512 and an output dense layer of size 10. (Plots: accuracy vs dropout, loss vs dropout)
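A minimal Keras sketch of the kind of network described above; only the layer widths (convolutions of 64, 128 and 256 filters, two dense layers of 512, dense output of 10) come from the slide, while the kernel sizes, pooling, optimizer and placement of the dropout layers are assumptions:

```python
# A sketch, assuming tensorflow.keras; only the layer widths come from the slide.
from tensorflow.keras import layers, models

def build_net(dropout_rate=0.5):
    model = models.Sequential([
        layers.Input(shape=(32, 32, 3)),                       # CIFAR-10 images
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(256, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dropout(dropout_rate),          # vary this rate to plot accuracy/loss vs dropout
        layers.Dense(512, activation="relu"),
        layers.Dropout(dropout_rate),
        layers.Dense(10, activation="softmax"),                # 10 CIFAR-10 classes
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

build_net(0.5).summary()
```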
  • 120. • In bagging we define k different models, construct k different data sets by sampling from the dataset with replacement, and train model i on dataset i • Dropout aims to approximate this process, but with an exponentially large no. of neural networks
  • 121. • Remove non-output units from base network. • Remaining 4 units yield 16 networks • Here many networks have no path from input to output • Problem insignificant with large networks
  • 122. • To train with dropout we use minibatch based learning algorithm that takes small steps such as SGD • At each step randomly sample a binary mask – Probability of including a unit is a hyperparameter • 0.5 for hidden units and 0.8 for input units • We run forward & backward propagation as usual
  • 123. Feed-forward network with dropout • The network uses a binary vector μ whose elements correspond to input and hidden units • Elements of μ are sampled independently, with the probability of being 1 a hyperparameter • 0.5 for hidden units • 0.8 for input units • Each unit is multiplied by the corresponding mask element, then forward prop proceeds as usual • Equivalent to randomly selecting one of the subnetworks of the previous slide
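A sketch of one such dropout forward pass with the inclusion probabilities quoted above (0.8 for inputs, 0.5 for hidden units); the two-layer ReLU network itself is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda a: np.maximum(0.0, a)

W1 = rng.normal(size=(8, 5))    # 5 inputs -> 8 hidden units (illustrative sizes)
W2 = rng.normal(size=(3, 8))    # 8 hidden units -> 3 outputs

def forward_with_dropout(x, p_input=0.8, p_hidden=0.5):
    """One dropout forward pass: sample the binary mask mu and multiply it in.
    Each sampled mask selects one of the exponentially many subnetworks."""
    mu_in = rng.binomial(1, p_input, size=x.shape)       # keep an input unit with prob 0.8
    h = relu(W1 @ (x * mu_in))
    mu_hid = rng.binomial(1, p_hidden, size=h.shape)     # keep a hidden unit with prob 0.5
    return W2 @ (h * mu_hid)

x = rng.normal(size=5)
print(forward_with_dropout(x))   # a different subnetwork (and output) on every call
```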
  • 124. • Suppose that the mask vector μ specifies which units to include • The cost of the model is specified by J(θ, μ) • Dropout training consists of minimizing Eμ[J(θ, μ)] • The expected value contains an exponential number of terms • We can get an unbiased estimate of its gradient by sampling values of μ
  • 125. • Dropout training is not the same as bagging training – In bagging, the models are all independent – In dropout, models share parameters • Models inherit subsets of parameters from the parent network • Parameter sharing allows an exponential number of models with a tractable amount of memory • In bagging each model is trained to convergence on its respective training set – In dropout, most models are not explicitly trained • A fraction of the sub-networks are trained for a single step • Parameter sharing allows the remaining sub-networks to arrive at good parameter settings
  • 126. • Bagging: – The ensemble accumulates the votes of its members – The process is referred to as inference • Assume each model needs to output a probability distribution • In bagging, model i produces p(i)(y|x) • The prediction of the ensemble is the mean (1/k) Σi=1..k p(i)(y|x) • Dropout: – The submodel defined by mask vector μ defines a probability distribution p(y|x, μ) – The arithmetic mean over all masks is Σμ p(μ) p(y|x, μ) • where p(μ) is the distribution used to sample μ at training time
  • 127. • The dropout prediction Σμ p(μ) p(y|x, μ) is intractable to evaluate due to an exponential number of terms  We can approximate inference using sampling  By averaging together the output from many masks  10-20 masks are sufficient for good performance  An even better approach, at the cost of a single forward propagation:  use the geometric mean rather than the arithmetic mean of the ensemble members' predicted distributions
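A sketch of both prediction strategies on a small illustrative network: Monte Carlo averaging of p(y|x, μ) over sampled masks, and the cheap single-pass approximation that scales hidden activities by the inclusion probability (the rule motivated by the geometric-mean view); the network and its sizes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda a: np.maximum(0.0, a)
softmax = lambda a: np.exp(a - a.max()) / np.exp(a - a.max()).sum()

W1, W2 = rng.normal(size=(8, 5)), rng.normal(size=(3, 8))
p_hidden = 0.5                              # inclusion probability used at training time

def predict_mc(x, n_masks=20):
    """Arithmetic mean of p(y|x, mu) over sampled masks (10-20 is usually enough)."""
    preds = []
    for _ in range(n_masks):
        mu = rng.binomial(1, p_hidden, size=8)
        preds.append(softmax(W2 @ (relu(W1 @ x) * mu)))
    return np.mean(preds, axis=0)

def predict_weight_scaled(x):
    """Single forward pass with hidden activities scaled by p: the cheap
    approximation motivated by the geometric mean of the ensemble."""
    return softmax(W2 @ (relu(W1 @ x) * p_hidden))

x = rng.normal(size=5)
print(predict_mc(x).round(3), predict_weight_scaled(x).round(3))
```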