ACUMENS ON NEURAL NET AKG 20 7 23.pptx

Presented by
Dr. Gnanasekar.A.K
Professor & Head
Department of Computer Science & Engineering
P T Lee Chengalvaraya Naicker College of Engineering & Technology, Kanchipuram
hodcse@ptleecncet.com
9677338977

 Neurons and the Brain
 Neural Networks
 Perceptrons
 Multi-layer Networks
 Applications
 The Hopfield Network
 Multilayer Neural Networks
 Deep Convolutional Nets
 Recurrent Nets
 Deep Belief Networks
 Dropout

 A model of reasoning based on the human brain
 Complex networks of simple computing elements
 Capable of learning from examples
 with appropriate learning methods
 Collection of simple elements performs high-level
operations

 The human brain incorporates nearly 10 billion
neurons and 60 trillion connections between them.
 Our brain can be considered as a highly complex,
non-linear and parallel information-processing
system.
 Learning is a fundamental and essential
characteristic of biological neural networks.

[Russell & Norvig, 1995]
Year       Neural Network                   Designer             Description
1943       McCulloch-Pitts neuron           McCulloch and Pitts  Logic gates
1949       Hebb                             Hebb                 Strength increases if neurons are active
1958-1988  Perceptron                       Rosenblatt           Weights of path can be adjusted
1960       Adaline                          Widrow and Hoff      Mean squared error
1972       SOM (self-organizing map)        Kohonen              Clustering
1982       Hopfield                         John Hopfield        Associative memory nets
1986       Back Propagation                 Rumelhart            Multilayer
1987-90    ART (Adaptive Resonance Theory)  Carpenter            Used for both binary and analog

•Basic parts: Body(soma), Dendrites(inputs), Axons (outputs),
Synapses (connections)

•Input summing function
•Activation function
•Weighted inputs
•Output

 Recall: the processing phase of a NN, whose objective is to retrieve information; it is the process of computing the output o for a given input x
 Basic forms of neural information processing
 Auto association
 Hetero association
 Classification

 Set of patterns can be stored in the network
 If a pattern similar to a member of the stored set is presented, an association with the input of closest stored pattern is made

 Associations between pairs of patterns are stored
 Distorted input pattern may cause correct heteroassociation at the output

 Set of input patterns is divided into a number of classes or categories
 In response to an input pattern from the set, the classifier is supposed to recall the information regarding class membership of the input pattern.

 Weights
 Bias
 Threshold
 Learning rate
 Momentum factor
 Vigilance parameter
 Notations used in ANN

 Each neuron is connected to every other
neuron by means of directed links
 Links are associated with weights
 Weights contain information about the input
signal and are represented as a matrix
 Weight matrix also called connection matrix

\[
W =
\begin{bmatrix} w_1^T \\ w_2^T \\ w_3^T \\ \vdots \\ w_n^T \end{bmatrix}
=
\begin{bmatrix}
w_{11} & w_{12} & w_{13} & \cdots & w_{1m} \\
w_{21} & w_{22} & w_{23} & \cdots & w_{2m} \\
\vdots & \vdots & \vdots &        & \vdots \\
w_{n1} & w_{n2} & w_{n3} & \cdots & w_{nm}
\end{bmatrix}
\]

 wij –is the weight from processing element ”i”
(source node) to processing element “j” (destination node)
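[Figure: inputs X1, ..., Xi, ..., Xn with weights w1j, ..., wij, ..., wnj, and a constant input 1 with bias bj, feeding output unit Yj]
\[
y_{inj} = \sum_{i=0}^{n} x_i w_{ij}
        = x_0 w_{0j} + x_1 w_{1j} + x_2 w_{2j} + \cdots + x_n w_{nj}
        = w_{0j} + \sum_{i=1}^{n} x_i w_{ij}
\]
\[
y_{inj} = b_j + \sum_{i=1}^{n} x_i w_{ij}
\]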

 Used to calculate the output response of
a neuron.
 The sum of the weighted input signals is passed through an
activation function to obtain the response.
 Activation functions can be linear or non-linear
 Already dealt
 Identity function
 Single/binary step function
 Discrete/continuous sigmoidal function.

 Bias is like another weight. It is included by
adding a component x0 = 1 to the input vector X.
 X=(1,X1,X2…Xi,…Xn)
 Bias is of two types
 Positive bias: increase the net input
 Negative bias: decrease the net input
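The treatment of bias as an extra weight can be checked with a small sketch. The input and weight values below are made up for illustration, and the function names are hypothetical; the only point is that adding x0 = 1 with weight b gives the same net input as adding b separately, and that a positive or negative bias raises or lowers it.

```python
def net_input(x, w, b):
    """Net input y_in = b + sum_i x_i * w_i for a single neuron."""
    return b + sum(xi * wi for xi, wi in zip(x, w))


def net_input_augmented(x, w, b):
    """Same quantity, with the bias stored as weight w_0 on a constant input x_0 = 1."""
    x_aug = [1.0] + list(x)
    w_aug = [b] + list(w)
    return sum(xi * wi for xi, wi in zip(x_aug, w_aug))


x = [0.5, -1.0, 2.0]   # illustrative inputs (assumed values)
w = [0.4, 0.3, 0.1]    # illustrative weights (assumed values)

plain = net_input(x, w, b=0.0)
positive = net_input(x, w, b=0.5)    # positive bias increases the net input
negative = net_input(x, w, b=-0.5)   # negative bias decreases the net input
```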

 The relationship between input and output given by
the equation of straight line y=mx+c
[Figure: input X passes through a unit with bias C to give output Y, where y = mx + C]

 Set value based upon which the final output of
the network may be calculated
 Used in activation function
 The activation function using a threshold θ can be
defined as
\[
f(net) =
\begin{cases}
1 & \text{if } net \ge \theta \\
-1 & \text{if } net < \theta
\end{cases}
\]

 Denoted by α.
 Used to control the amount of weight
adjustment at each step of training
 Learning rate ranging from 0 to 1 determines
the rate of learning in each time step

o Supervised:
   o Adaline, Madaline
   o Perceptron
   o Back Propagation
   o Multilayer perceptrons
   o Radial Basis Function Networks
o Unsupervised:
   o Competitive Learning
   o Kohonen self-organizing map
   o Learning vector quantization
   o Hebbian learning

No one‐fits‐all formula
Overfitting can occur if a “good” training set is
not chosen
What constitutes a “good” training set?
 Samples must represent the general
population.
Samples must contain members of each class.
Samples in each class must contain a wide
range of variations or noise effect.
The size of the training set is related to the
number of hidden neurons

Supervised learning
Unsupervised learning
Reinforcement learning

This is what we have seen so far!
A network is fed with a set of training
samples (inputs and corresponding
output), and it uses these samples to
learn the general relationship between
the inputs and the outputs.
This relationship is represented by the
values of the weights of the trained
network.

No desired output is associated with
the training data!
Faster than supervised learning
Used to find out structures within data:
 Clustering
 Compression

Like supervised learning, but:
Weights adjusting is not directly related to
the error value.
The error value is used to randomly shuffle
the weights!
Relatively slow learning due to
‘randomness’.

Function approximation
including time series prediction and modelling.
Classification
including patterns and sequences recognition,
novelty detection and sequential decision making.
(radar systems, face identification, handwritten text
recognition)
Data processing
including filtering, clustering, blind source
separation and compression.
(data mining, e-mail Spam filtering)

Advantages
Adapt to unknown situations
Powerful, it can model complex functions.
Ease of use, learns by example, and very
little user domain‐specific expertise needed
Disadvantages
Forgets
Not exact
Large complexity of the network structure

This vastly simplified model of real neurons is also
known as a Threshold Logic Unit
A set of connections brings in activations from other
neurons.
A processing unit sums the inputs, and then applies a non-
linear activation function (i.e. squashing/ Transfer/
threshold function).
An output line transmits the result to other neurons
y = F(w1x1+ w2x2 - b)

Binary
perceptrons
Continuous
perceptrons

First neural networks: logic functions built from threshold units, with Threshold(Y) = 2 in every case
Function   Connections and weights
AND        X1 → Y: 1, X2 → Y: 1
OR         X1 → Y: 2, X2 → Y: 2
AND NOT    X1 → Y: 2, X2 → Y: -1
XOR        X1 → Z1: 2, X2 → Z1: -1; X1 → Z2: -1, X2 → Z2: 2; Z1 → Y: 2, Z2 → Y: 2
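The gate networks on these slides can be reproduced in a few lines. The sketch below uses the weights and the output threshold of 2 shown on the slides; applying the same threshold of 2 to the hidden units Z1 and Z2 in the XOR network is an assumption, as are the function names.

```python
def mp_neuron(inputs, weights, threshold=2):
    """Threshold unit: outputs 1 when the weighted sum reaches the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0


def and_gate(x1, x2):
    return mp_neuron((x1, x2), (1, 1))


def or_gate(x1, x2):
    return mp_neuron((x1, x2), (2, 2))


def and_not_gate(x1, x2):
    """x1 AND (NOT x2)."""
    return mp_neuron((x1, x2), (2, -1))


def xor_gate(x1, x2):
    z1 = mp_neuron((x1, x2), (2, -1))   # x1 AND NOT x2
    z2 = mp_neuron((x1, x2), (-1, 2))   # x2 AND NOT x1
    return mp_neuron((z1, z2), (2, 2))  # z1 OR z2


PAIRS = [(0, 0), (0, 1), (1, 0), (1, 1)]
truth_table = {
    "AND": [and_gate(a, b) for a, b in PAIRS],
    "OR": [or_gate(a, b) for a, b in PAIRS],
    "AND NOT": [and_not_gate(a, b) for a, b in PAIRS],
    "XOR": [xor_gate(a, b) for a, b in PAIRS],
}
```

XOR needs the intermediate units Z1 and Z2 because no single threshold unit can separate its outputs.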

 Weighted inputs are summed up by the input function
 The (nonlinear) activation function calculates the activation value, which
determines the output

 Step_t(x) = 1 if x >= t, else 0
 Sign(x) = +1 if x >= 0, else –1
 Sigmoid(x) = 1/(1 + e^(-x))
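As a minimal sketch, the three activation functions translate directly into code. The function names and the default threshold t = 0 are choices made here for illustration.

```python
import math


def step(x, t=0.0):
    """Step_t(x): 1 if x >= t, else 0."""
    return 1 if x >= t else 0


def sign(x):
    """Sign(x): +1 if x >= 0, else -1."""
    return 1 if x >= 0 else -1


def sigmoid(x):
    """Sigmoid(x) = 1 / (1 + e^(-x)); a smooth squashing function with values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))
```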

 simple neurons can act as logic gates
 appropriate choice of activation function, threshold, and
weights
 step function as activation function

layered structures
 networks are arranged into layers
 interconnections mostly between two layers
 some networks may have feedback
connections

 single-layer, feed-forward network
 historically one of the first types of neural networks
 late 1950s
 the output is calculated as a step function applied to the weighted sum of inputs
 capable of learning simple functions
 linearly separable

 perceptrons can deal with linearly separable functions
 some simple functions are not linearly separable
 XOR function
[Figure: the four input points (0,0), (0,1), (1,0) and (1,1) plotted for AND and for XOR]

 linear separability can be extended to more than two dimensions
 more difficult to visualize

 This is done by making small adjustments in the weights
 to reduce the difference between the actual and desired
outputs of the Perceptron.
 The initial weights are randomly assigned
 usually in the range [-0.5, 0.5], or [0, 1]
 Then they are updated to obtain the output
consistent with the training examples.

 perceptrons can learn from examples through a simple
learning rule. For each example row (iteration), do the
following:
 calculate the error of a unit Erri as the difference between
the correct output Ti and the calculated output Oi
Erri = Ti - Oi
 adjust the weight Wij of the input Ij such that the error
decreases
Wij = Wij + α * Ij * Erri
 α is the learning rate, a positive constant less than unity.
 this is a gradient descent search through the weight space
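The learning rule can be tried on the AND function, which is linearly separable. This is a sketch under stated assumptions: weights start at zero rather than at random values so the run is reproducible, the learning rate is set to 0.5, the bias is updated like a weight on a constant input of 1, and training stops after an epoch with no errors. All names are hypothetical.

```python
def step(x):
    return 1 if x >= 0 else 0


def train_perceptron(samples, alpha=0.5, max_epochs=100):
    """Apply W_j <- W_j + alpha * I_j * Err until an epoch produces no errors."""
    n_inputs = len(samples[0][0])
    w = [0.0] * n_inputs   # assumption: zero start instead of random weights
    b = 0.0                # bias, treated as a weight on a constant input 1
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, target in samples:
            output = step(b + sum(wi * xi for wi, xi in zip(w, inputs)))
            err = target - output
            if err != 0:
                mistakes += 1
                w = [wi + alpha * xi * err for wi, xi in zip(w, inputs)]
                b += alpha * err
        if mistakes == 0:
            break
    return w, b


AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND_SAMPLES)
predictions = [
    step(b + sum(wi * xi for wi, xi in zip(w, x))) for x, _ in AND_SAMPLES
]
```

Running the same loop on XOR samples would never reach an error-free epoch, since XOR is not linearly separable.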

 The network consists of an input layer of source
neurons, at least one middle or hidden layer of
computational neurons, and an output layer of
computational neurons.
 The input signals are propagated in a forward
direction on a layer-by-layer basis
 feedforward neural network
 the back-propagation learning algorithm can be
used for learning in multi-layer networks

 two-layer network
 input units Ik
 usually not counted as a separate layer
 hidden units aj
 output units Oi
 usually all nodes of one layer have weighted connections to all nodes of the next layer
[Figure: input units Ik connect to hidden units aj through weights Wkj; hidden units aj connect to output units Oi through weights Wji]

 Learning in a multilayer network proceeds the
same way as for a perceptron.
 A training set of input patterns is presented to
the network.
 The network computes its output pattern, and if
there is an error (in other words, a difference
between actual and desired output patterns),
the weights are adjusted to reduce this error.
 proceeds from the output layer to the hidden layer(s)
 updates the weights of the units leading to the layer

 In a back-propagation neural network, the learning
algorithm has two phases.
 First, a training input pattern is presented to the
network input layer. The network propagates the input
pattern from layer to layer until the output pattern is
generated by the output layer.
 If this pattern is different from the desired output, an
error is calculated and then propagated backwards
through the network from the output layer to the input
layer. The weights are modified as the error is
propagated.

Three-layer network for solving the
Exclusive-OR Operation

Final results of three-layer network learning
Inputs      Desired output   Actual output   Error      Sum of squared
x1   x2     yd               y5              e          errors
1    1      0                0.0155          -0.0155    0.0010
0    1      1                0.9849           0.0151
1    0      1                0.9849           0.0151
0    0      0                0.0175          -0.0175

Network for solving the Exclusive-OR operation

(a) Decision boundary constructed by hidden neuron 3;
(b) Decision boundary constructed by hidden neuron 4;
(c) Decision boundaries constructed by the complete
three-layer network
[Figure: panels (a), (b) and (c) plot x2 against x1 on the unit square; the boundary in (a) is x1 + x2 – 1.5 = 0 and the boundary in (b) is x1 + x2 – 0.5 = 0]
Decision boundaries
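The two hidden-neuron boundaries can be combined into a working XOR network with step activations. In this sketch the hidden neurons use the boundaries given for panels (a) and (b); the output neuron's weights (-2 and 1) and threshold (0.5) are assumptions chosen here so that the output fires only between the two lines, not values taken from the slides.

```python
def step(net):
    return 1 if net >= 0 else 0


def xor_network(x1, x2):
    y3 = step(x1 + x2 - 1.5)   # hidden neuron 3: fires only above the line x1 + x2 = 1.5
    y4 = step(x1 + x2 - 0.5)   # hidden neuron 4: fires above the line x1 + x2 = 0.5
    # Assumed output neuron: active when neuron 4 fires and neuron 3 does not.
    return step(-2 * y3 + y4 - 0.5)


outputs = {(x1, x2): xor_network(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
```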

 expressiveness
 weaker than predicate logic
 good for continuous inputs and outputs
 computational efficiency
 training time can be exponential in the number of inputs
 depends critically on parameters like the learning rate
 local minima are problematic
 can be overcome by simulated annealing, at additional cost
 generalization
 works reasonably well for some functions (classes of
problems)
 no formal characterization of these functions

 sensitivity to noise
 very tolerant
 they perform nonlinear regression
 transparency
 neural networks are essentially black boxes
 there is no explanation or trace for a particular answer
 tools for the analysis of networks are very limited
 some limited methods to extract rules from networks
 prior knowledge
 very difficult to integrate since the internal representation of
the networks is not easily accessible

 domains and tasks where neural networks
are successfully used
 recognition
 control problems
 series prediction
 weather, financial forecasting
 categorization
 sorting of items (fruit, characters, …)

 Neural networks were designed on analogy with
the brain.
 The brain’s memory, however, works by
association.
 For example, we can recognise a familiar face even in an
unfamiliar environment within 100-200 ms.
 We can also recall a complete sensory experience,
including sounds and scenes, when we hear only a few bars
of music.
 The brain routinely associates one thing with
another.
The Hopfield Network

 Multilayer neural networks trained with the back-
propagation algorithm are used for pattern
recognition problems.
 However, to emulate the human memory’s
associative characteristics we need a different type
of network: a recurrent neural network.
 A recurrent neural network has feedback loops
from its outputs to its inputs.

Single-layer n-neuron Hopfield network
[Figure: input signals x1, x2, ..., xi, ..., xn feed neurons 1, 2, ..., i, ..., n, which produce output signals y1, y2, ..., yi, ..., yn; the outputs are fed back to the inputs]
 The problem of the stability of recurrent networks was
solved only in 1982, when John Hopfield formulated the
physical principle of storing information in a dynamically
stable network.

o Objective function, averaged over all
training examples is a hilly landscape in
high-dimensional space of weight values
o Negative gradient vector indicates
direction of steepest descent taking it
closer to a minimum

• Goal is to learn the weights w from a
labelled set of training samples
• Learning procedure has two stages
1. Evaluate derivatives of error function
∇E(w) with respect to weights w1, ..., wT
2. Use derivatives to compute adjustments to
weights
[Figure: error function E(w) with the weight vector moving from w(1) to w(τ)]

• Gradient descent
• Update the weights using
\[
w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E(w^{(\tau)})
\]
• Where the gradient vector ∇E(w(τ)) consists of the vector of
derivatives evaluated using back-propagation
\[
\nabla E(w) = \frac{d}{dw} E(w) =
\begin{bmatrix}
\dfrac{\partial E}{\partial w^{(1)}_{11}} \\
\vdots \\
\dfrac{\partial E}{\partial w^{(1)}_{MD}} \\
\dfrac{\partial E}{\partial w^{(2)}_{11}} \\
\vdots \\
\dfrac{\partial E}{\partial w^{(2)}_{KM}}
\end{bmatrix}
\]
• There are W = M(D+1) + K(M+1) elements in the vector
• The gradient ∇E(w(τ)) is a W x 1 vector
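The update rule w(τ+1) = w(τ) − η∇E(w(τ)) can be illustrated on a toy error function. The quadratic E(w) = (w1 − 3)^2 + 2(w2 + 1)^2, the step size η = 0.1 and the number of steps are all assumptions chosen for the sketch; in a real network the gradient would come from back-propagation instead of a closed form.

```python
def grad_E(w):
    """Gradient of the assumed toy error E(w) = (w1 - 3)^2 + 2 * (w2 + 1)^2."""
    return [2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)]


def gradient_descent(w, eta=0.1, steps=200):
    """Repeat w <- w - eta * grad E(w) for a fixed number of steps."""
    for _ in range(steps):
        g = grad_E(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w


w_final = gradient_descent([0.0, 0.0])   # approaches the minimum at (3, -1)
```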

Most practitioners use SGD for DL
Consists of showing the input vector for a few examples,
computing the outputs and the errors,
Computing the average gradient for those examples,
and adjusting the weights accordingly.
Process repeated for many small sets of examples
from the training set until the average of the objective function stops decreasing.
Called stochastic because each small set of examples
gives a noisy estimate of the average gradient over all examples.
Usually finds a good set of weights quickly compared to elaborate optimization techniques.
After training, performance is measured
on a different test set:
tests generalization ability of the machine
— its ability to produce sensible answers on new inputs
never seen during training.
[Figure: fluctuations in the objective as gradient steps are taken in mini-batches]
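A minimal sketch of the mini-batch procedure, fitting a straight line with squared error. The data (y = 2x + 1 without noise), batch size, learning rate and fixed epoch count are assumptions made for the example; the slide's stopping criterion (the objective no longer decreasing) is replaced by a fixed number of epochs to keep the code short.

```python
import random


def sgd_linear_fit(xs, ys, lr=0.1, batch_size=10, epochs=200, seed=0):
    """Fit y ~ w * x + b by averaging the gradient over small random batches."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                       # new random batches each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient of 0.5 * (w*x + b - y)^2 over the batch.
            grad_w = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum((w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


xs = [-1.0 + 2.0 * i / 99 for i in range(100)]   # assumed toy inputs in [-1, 1]
ys = [2.0 * x + 1.0 for x in xs]                 # assumed targets
w, b = sgd_linear_fit(xs, ys)
```

Each batch gradient is only an estimate of the gradient over all 100 examples, which is the sense in which the method is stochastic.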

• Backpropagation Formula
\[
\delta_j = h'(a_j) \sum_k w_{kj} \delta_k
\]
• Value of δ for a particular hidden unit can be obtained
by propagating the δ's backward from units higher up
in the network
[Figure: δk propagated backward from unit k to unit j]
1. Apply input vector xn to network and
forward propagate through network using
\[
a_j = \sum_i w_{ji} z_i \quad \text{and} \quad z_j = h(a_j)
\]
2. Evaluate δk for all output units using
\[
\delta_k = y_k - t_k
\]
3. Backpropagate the δ's using
\[
\delta_j = h'(a_j) \sum_k w_{kj} \delta_k
\]
to obtain δj for each hidden unit
4. Use
\[
\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i
\]
to evaluate required derivatives

• Two-layer network
• Sum-of-squared error
• Output units: linear activation
functions, i.e., multiple regression
\[
y_k = a_k
\]
• Hidden units have a sigmoidal (tanh)
activation function
\[
h(a) = \tanh(a), \quad \text{where} \quad \tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}
\]
• Simple form for derivative
\[
h'(a) = 1 - h(a)^2
\]
• Standard Sum of Squared Error
\[
E_n = \frac{1}{2} \sum_{k=1}^{K} (y_k - t_k)^2
\]
yk: activation of output unit k; tk: corresponding target
for input xn

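Simple Example: Forward and Backward Prop
For each input in training set:
• Forward Propagation
\[
a_j = \sum_{i=0}^{D} w^{(1)}_{ji} x_i, \qquad
z_j = \tanh(a_j), \qquad
y_k = \sum_{j=0}^{M} w^{(2)}_{kj} z_j
\]
• Output differences
\[
\delta_k = y_k - t_k
\]
• Backward Propagation (δ's for hidden
units), using h'(a) = 1 − h(a)^2
\[
\delta_j = (1 - z_j^2) \sum_{k=1}^{K} w^{(2)}_{kj} \delta_k
\]
• Derivatives wrt first layer and second layer
weights
\[
\frac{\partial E_n}{\partial w^{(1)}_{ji}} = \delta_j x_i, \qquad
\frac{\partial E_n}{\partial w^{(2)}_{kj}} = \delta_k z_j
\]
• Batch method
\[
\frac{\partial E}{\partial w_{ji}} = \sum_n \frac{\partial E_n}{\partial w_{ji}}
\]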
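The forward and backward passes for this two-layer tanh network can be checked against finite differences. The sketch below is an illustration under stated assumptions: the layer sizes (D = 2, M = 3, K = 2), the random weights, and the single input and target are made up, and the bias terms are carried by constant units x0 = 1 and z0 = 1.

```python
import math
import random


def forward(x, W1, W2):
    """x includes x_0 = 1. Returns hidden outputs z (with z_0 = 1) and outputs y."""
    a = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    z = [1.0] + [math.tanh(aj) for aj in a]
    y = [sum(w * zj for w, zj in zip(row, z)) for row in W2]
    return z, y


def error(x, t, W1, W2):
    """Sum-of-squares error E_n = 0.5 * sum_k (y_k - t_k)^2."""
    _, y = forward(x, W1, W2)
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))


def backprop(x, t, W1, W2):
    z, y = forward(x, W1, W2)
    delta_k = [yk - tk for yk, tk in zip(y, t)]
    # delta_j = (1 - z_j^2) * sum_k w_kj * delta_k for hidden units j = 1..M
    delta_j = [
        (1.0 - z[j] ** 2) * sum(W2[k][j] * delta_k[k] for k in range(len(W2)))
        for j in range(1, len(z))
    ]
    grad_W1 = [[dj * xi for xi in x] for dj in delta_j]   # dE/dw(1)_ji = delta_j * x_i
    grad_W2 = [[dk * zj for zj in z] for dk in delta_k]   # dE/dw(2)_kj = delta_k * z_j
    return grad_W1, grad_W2


rng = random.Random(0)
D, M, K = 2, 3, 2                                   # assumed layer sizes
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(D + 1)] for _ in range(M)]
W2 = [[rng.uniform(-0.5, 0.5) for _ in range(M + 1)] for _ in range(K)]
x = [1.0, 0.3, -0.7]                                # x_0 = 1 plus two assumed inputs
t = [0.2, -0.1]                                     # assumed targets

grad_W1, grad_W2 = backprop(x, t, W1, W2)


def numeric_grad(W, r, c, eps=1e-6):
    """Central-difference estimate of dE/dW[r][c]."""
    old = W[r][c]
    W[r][c] = old + eps
    e_plus = error(x, t, W1, W2)
    W[r][c] = old - eps
    e_minus = error(x, t, W1, W2)
    W[r][c] = old
    return (e_plus - e_minus) / (2 * eps)


diffs = [abs(grad_W1[r][c] - numeric_grad(W1, r, c))
         for r in range(M) for c in range(D + 1)]
diffs += [abs(grad_W2[r][c] - numeric_grad(W2, r, c))
          for r in range(K) for c in range(M + 1)]
max_diff = max(diffs)
```

The number of derivatives checked is M(D+1) + K(M+1) = 17, one per weight.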

2 Hidden Layers, 1 Output Layer
Each layer is a module through which one can back-propagate
gradients.
At each layer, we
first compute the
total input z to each
unit,
which is a weighted sum of
the outputs
of the units in the layer
below.
Then a non-linear function f(.)
is applied to z to get the
output of the unit.

A small change Δx in x gets transformed
first into a small change Δy in y by getting
multiplied by ∂y/∂x (that is, the definition of
partial derivative). Similarly, the
change Δy creates a change Δz in z.
Substituting one equation into the other
gives the chain rule of derivatives:
how Δx gets turned into Δz through
multiplication by the product of ∂y/∂x
and ∂z/∂y.
It also works when x, y and z are vectors
(and the derivatives are Jacobian
matrices).
Tells us how two small effects (that of a small change of x on y, and that of y on z)
are composed.

Two hidden layers. Layers are labelled i, j, k, l.
At each hidden layer we compute the error derivative wrt the output of each unit,
which is a weighted sum of the error derivatives wrt the total inputs to the units in the layer above.
Convert error derivative wrt the output into the error derivative wrt the input
by multiplying it by the gradient of f(z).
At the output layer, the error derivative wrt the output of a unit is computed
by differentiating the cost function.
This gives yl − tl if the cost function for unit l is 0.5(yl − tl)^2,
where tl is the target value.
Once the ∂E/∂zk is known, the error-derivative for the weight wjk
on the connection from unit j in the layer below is just yj ∂E/∂zk.

• Most popular today is Rectified Linear Unit(ReLU)
• a half-wave rectifier
f(z)=max(z,0)
• In past decades, neural nets used smoother
non-linearities
tanh(z) or 1/(1+exp(-z))
• ReLU learns faster in networks with many layers
• Allowing training of deep supervised network
without unsupervised pre-training

• Rectified linear unit (ReLU)
f(z) = max(0,z)
commonly used in recent years,
• More conventional sigmoids:
• hyperbolic tangent,
f(z) = (exp(z) − exp(−z))/(exp(z) + exp(−z)) and
• logistic function,
f(z) = 1/(1 + exp(−z))

Network with 2 inputs, 2 hidden units, 1 output unit
• Input data (red and blue curves) are not linearly separable
• Network makes them linearly separable
• By distorting the input space
• Note that grid is also distorted
[Figure: input space (x1, x2) mapped to hidden-unit space (y1, y2), followed by output z]

• Linear classifiers can only carve the input space
into very simple regions
• Image and speech recognition require input-output
function to be insensitive to irrelevant variations of
the input,
• e.g., position, orientation and illumination of an object
• Variations in pitch or accent of speech
• While being sensitive to minute variations, e.g., white wolf
and breed of wolf-like white dog called Samoyed
• At pixel level two Samoyeds in different positions may be
very different, whereas a Samoyed and a wolf in the
same position and background may be very similar

• Shallow classifiers need a good feature extractor
• One that produces representations that are:
• selective to aspects of image important for
discrimination
• but invariant to irrelevant aspects such as pose of the
animal
• Generic features (e.g., Gaussian kernel)
do not generalize well far from training
examples
• Hand-designing good feature extractors
requires engineering skill and domain
expertise
• Deep learning learns features automatically

• Multilayer stack of simple modules
• All modules (or most) subject to :
• Learning
• Non-linear input-output mappings
• Each module transforms input to improve both
selectivity and invariance of the
representation
• With depth of 5 to 20 layers can implement
extremely intricate functions of input
• Sensitive to minute details
• Distinguish Samoyeds from white wolves
• Insensitive to irrelevant variations
• Background, pose, lighting, surrounding objects

• It was thought that simple gradient descent would get
trapped in local minima
• Rarely a problem with large networks
• Regardless of initial conditions, system always reaches
solutions of similar quality
• Landscape is packed with a combinatorially large number of
saddle points where gradient is zero
• Almost all have similar values of objective function
\[
w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E(w^{(\tau)})
\]

Srihari
• Unsupervised learning to create layers of
feature detectors
• No need for labelled data
• Objective of learning in each layer of
feature detectors:
• Ability to reconstruct or model activities of feature detectors
(or raw inputs) in layer below
• By pre-training weights of a deep network could be initialized
to sensible values
• Final layer of output units added at top of
network and whole deep system fine-tuned using
backpropagation
• Worked well in handwritten digit recognition
• When data was limited

• First major application was in speech recognition
• Made possible by advent of fast GPUs
• Allowed networks to be trained 10 or 20 times faster
• Record-breaking results on a speech recognition
benchmark

• It turned out that pre-training stage was
only needed for small data sets
• Convolutional neural networks
• Type of deep feedforward network
• Much easier to train and generalized
much better than networks with full
connectivity between adjacent layers

• Designed to process data that come in the form
of multiple arrays
• E.g., a color image composed of three 2D arrays of pixel
intensities in three color channels
• Many data modalities are in the form of
multiple arrays:
• 1D for signals and sequences, including language
• 2D for images and audio spectrograms
• 3D for video or volumetric images

• Take advantage of properties of natural signals
1. Local connections
2. Shared weights
3. Pooling
4. Use of many layers

• Need substantial number of training samples
• Slow learning (convergence times)
• Inadequate parameter selection techniques that
lead to poor minima

• Network should exhibit invariance to
translation, scaling and elastic
deformations
• A large training set can take care of this
• It ignores a key property of images
• Nearby pixels are more strongly correlated than distant
ones
• Modern computer vision approaches exploit this
property
• Information can be merged at later stages
to get higher order features and about
whole image

• Classic notions of simple cells and complex cells
• Architecture similar to LGN-V1-V2-V4-IT hierarchy in
visual cortex ventral pathway
• LGN: lateral geniculate nucleus receives input from retina
• 30 different areas of visual cortex: V1 and V2 are principal
• Infero-Temporal cortex performs object recognition

1. Local Receptive Fields
2. Subsampling
3. Weight Sharing

• Instead of passing the input to a fully
connected network
• Two kinds of layers of neural network units are
used
1. Layer of convolutional units
• which consider
overlapping regions
2. Layer of subsampling units
• Several feature maps and sub-
sampling
• Gradual reduction of spatial resolution
compensated by increasing no. of
features
• Final layer has softmax output
• Whole network trained
using backpropagation
[Figure: a 5 x 5 pixel patch of the input image feeds each unit of a 10 x 10 feature map; 2 x 2 blocks of units are subsampled into a 5 x 5 layer of units]
Each pixel patch is 5 x 5.
This plane has 10 x 10 = 100 neural network units (called a feature map).
Weights are the same for all units in the plane, so only 25 weights are needed.
Due to weight sharing this is equivalent to convolution.
Different features have different feature maps.
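The weight-sharing arithmetic can be made concrete with a small sketch. A 14 x 14 input is assumed here because, with a 5 x 5 patch and stride 1, it yields the 10 x 10 feature map mentioned on the slide; the input size, the averaging kernel and the average-based subsampling are illustrative choices, not details from the slides. As in most neural-network code, the kernel is applied without flipping.

```python
def convolve_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def subsample(feature_map, size=2):
    """Average non-overlapping size x size blocks of units."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            sum(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size)) / (size * size)
            for c in range(cols)
        ]
        for r in range(rows)
    ]


image = [[(r * 14 + c) % 7 for c in range(14)] for r in range(14)]  # assumed toy input
kernel = [[1.0 / 25] * 5 for _ in range(5)]    # the 25 shared weights
feature_map = convolve_valid(image, kernel)    # 10 x 10 units
reduced = subsample(feature_map)               # 5 x 5 units
n_shared_weights = sum(len(row) for row in kernel)
```

All 100 units of the feature map reuse the same 25 weights, whereas a fully connected layer from 14 x 14 inputs to 100 units would need 19,600.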

• Reducing the complexity of a network
• Encouraging groups of weights to have similar
values
• Only applicable when form of the network
can be specified in advance
• Division of weights into groups, mean weight
value for each group and spread of values are
determined during the learning process

Outputs (not filters) of each layer (horizontally).
Each rectangular image is a feature map corresponding to output for one of the learned features,
detected at each of the image positions.
Lower-level features act as oriented edge detectors

• Structured as a series of stages
• First few stages are composed of two types of
layers and a non-linearity:
1. Convolutional layer
• To detect local conjunctions of features from previous
layer
2. Non-linearity
• ReLU
3. Pooling layer
• To merge semantically similar features into one

• Organized in feature maps
• Each unit connected to local patches in
feature maps of previous layer through
weights (called a filter bank)
• Result is passed through a ReLU

• Computes maximum of a local patch of units
in one feature map
• Neighboring pooling units take input from
patches that are shifted by more than one
row or column
• Thereby reducing the dimension of the
representation
• Creating invariance to small shifts and distortions
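A minimal sketch of max pooling over one feature map. The 4 x 4 values, the 2 x 2 patch and the stride of 2 are assumptions chosen to keep the example small.

```python
def max_pool(feature_map, size=2, stride=2):
    """Take the maximum over size x size patches shifted by `stride` rows/columns."""
    rows = (len(feature_map) - size) // stride + 1
    cols = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(feature_map[r * stride + i][c * stride + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 1],
]
pooled = max_pool(fmap)   # 4 x 4 map reduced to 2 x 2
```

Because each output keeps only the largest value in its patch, moving a strong response by one position inside the patch leaves the pooled output unchanged.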

• As simple as through a regular
deep network
• Allow all weights in all filter
banks to be trained

• Neocognitron
• Similar architecture
• Did not have end-to-end supervised
learning using Backprop
• ConvNet with probabilistic model
was used for OCR and handwritten
check reading

• Applied with great success to images: detection,
segmentation and recognition of objects and regions
• Tasks where abundant data was available
• Traffic sign recognition
• Segmentation of biological images
• Connectomics
• Detection of faces, text, pedestrians, human bodies in natural
images
• Major recent practical success is face recognition
• Images can be labeled at pixel level
• Applications in autonomous mobile robots, and self-driving cars
• Other applications gaining importance
• Natural language understanding
• Speech recognition

• Data set of 1 million images from the web
• Contained 1,000 different classes
• Error rate
• Halved the error rate of best competing computer vision approaches
• Success came from:
• Efficient use of GPUs
• ReLUs
• New regularization technique called dropout
• Techniques to generate new training samples by deforming existing ones

• Combining ConvNets and Recurrent Net Modules
• Caption generated by a recurrent neural network (RNN)
taking as input:
1. Representation generated by a deep CNN
2. RNN trained to translate high-level representations of
images into captions

• Different focus (lighter patches given more
attention)
• As it generates each word (bold), it exploits it
to achieve better translation of images to
captions

1. 10 to 20 layers of ReLUs
2. Hundreds of millions of weights
3. Billions of connections between units
4. Training time
• Would have taken weeks a couple of years ago
• Advances in hardware, software and parallelization
reduce it to a few hours
• ConvNets are easily amenable to efficient
hardware implementations
• NVIDIA, Mobileye, Intel, Qualcomm and Samsung are
developing ConvNet chips for smartphones, cameras,
robots and self-driving cars

• Deep nets have two different exponential advantages
over classic learning algorithms. Both advantages
arise from
• power of composition
• Depend on underlying data-generating distribution having an
appropriate compositional structure
1. Learning distributed representations enable
generalization to new combinations of the values of
learned features beyond those seen during training
1. E.g., 2^n combinations are possible with n binary features
2. Composing layers of representations brings another
advantage: exponential in depth

• Predicting the next word from local context of earlier
words
• Each word presented as a 1-of-N vector
• In the first layer each word creates a different word vector
• In the language model, other next layers learn to convert
input word vector to output word vector for the predicted word
[Figure, panel "Words": word representations for modeling language, non-linearly projected to 2-D using the t-SNE algorithm. Semantically similar words are mapped nearby.]
[Figure, panel "Phrases": 2-D representation of phrases learnt by an English-French encoder-decoder RNN.]
Learnt using backpropagation that jointly
learns representation for each word and
function that predicts a target quantity (next
word or sequence of words for translation)

• Logic-inspired paradigm uses
• Instance of a symbol is something for which the
only property is that it is either identical or non-
identical to other symbol instances
• It has no internal structure relevant to its use
• To reason with symbols they must be bound to
variables in judiciously chosen rules of inference
• Neural networks use
• big activity vectors, big weight matrices and
scalar non-linearities
• to perform fast intuitive inference that
underpins commonsense reasoning

 Standard statistical models count frequencies
of short symbol sequences of length up to N
 Number of possible sequences is V^N, where
V is vocabulary size
 So taking context of more than a handful
of words would require very large corpora
 N-grams treat each word as an atomic unit,
so they cannot generalize across
semantically related sequences
 Neural models can because they associate
each word with a vector of real-valued features

• Exciting early use of backpropagation was for training RNNs
• For tasks involving sequential inputs
• e.g., speech and language, it is better to use RNNs
• RNNs process input sequence one element at a time
• maintain in their hidden units a state vector that implicitly contains history
of past elements in the sequence
• Same parameters (matrices U, V, W) are used at each time step
The backpropagation algorithm is applied to the unfolded computational graph of the
network to compute the derivative of the total error (log-probability of generating
the right sequence) wrt the states si and all the parameters
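The idea of one state vector updated with the same parameters at every step can be sketched as a forward pass. The roles assigned to the matrices here (U maps input to hidden state, W maps the previous state to the new state, V maps state to output), the tanh non-linearity and all numeric values are assumptions for illustration.

```python
import math


def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def rnn_forward(xs, U, W, V, s0):
    """s_t = tanh(U x_t + W s_{t-1}), o_t = V s_t, with U, W, V shared across steps."""
    s = s0
    states, outputs = [], []
    for x in xs:
        pre = [u + w for u, w in zip(matvec(U, x), matvec(W, s))]
        s = [math.tanh(p) for p in pre]
        states.append(s)
        outputs.append(matvec(V, s))
    return states, outputs


U = [[0.5], [-0.3]]             # assumed 2 x 1 input-to-state weights
W = [[0.1, 0.2], [0.0, 0.4]]    # assumed 2 x 2 state-to-state weights
V = [[1.0, -1.0]]               # assumed 1 x 2 state-to-output weights
s0 = [0.0, 0.0]

# An input pulse at the first step followed by zeros: later states remain
# non-zero because the state vector carries the history of the sequence.
states, outputs = rnn_forward([[1.0], [0.0], [0.0]], U, W, V, s0)
silent_states, _ = rnn_forward([[0.0], [0.0], [0.0]], U, W, V, s0)
```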

• DBNs are Generative Models
• Provide estimates of both p(x|Ck) and p(Ck|x)
• Conventional neural networks are
discriminative
• Directly estimate p(Ck|x)
• Consist of several layers of Restricted
Boltzmann Machines (RBM)
• RBM
• A form of Markov Random Field

• Named after Boltzmann Distribution (Or Gibbs
Distribution)
• Gives probability that a system will be in a certain state given its
temperature
• Where E is energy and kT (constant of the distribution) is a product of
Boltzmann’s constant and thermodynamic temperature
• Energy of Boltzmann network
• Training
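The formulas these bullets refer to did not survive in this transcript. For reference, the standard form of the Boltzmann (Gibbs) distribution, and the energy function commonly used for a Restricted Boltzmann Machine, can be written as follows. The symbols s (state), Z (normalising constant), v and h (visible and hidden units), a and b (biases) and W (weights) are notation assumed here and may differ from what the original slide used.

```latex
p(s) = \frac{1}{Z} \exp\!\left(-\frac{E(s)}{kT}\right),
\qquad
Z = \sum_{s'} \exp\!\left(-\frac{E(s')}{kT}\right)

E(v, h) = -a^{\top} v - b^{\top} h - v^{\top} W h
```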

• What is dropout?
• Dropout as an ensemble method
• Mask for dropout training
• Bagging vs Dropout
• Prediction intractability

• Deep nets have many non-linear hidden layers
– Making them very expressive to learn complicated
relationships between inputs and outputs
– But with limited training data, many complicated
relationships will be the result of training noise
• So they will exist in the training set and not in test
set even if drawn from same distribution
• Many methods developed to reduce overfitting
– Early stopping with a validation set
– Weight penalties (L1 and L2 regularization)
– Soft weight sharing

• Best way to regularize a fixed size model is:
– Average the predictions of all possible settings of
the parameters
– Weighting each setting with the posterior probability
given the training data
• This would be the Bayesian approach
• Dropout does this using considerably less
computation
– By approximating an equally weighted geometric
mean of the predictions of an exponential number
of learned models that share parameters

• Bagging is a method of averaging over several
models to improve generalization
• Impractical to train many neural networks since
it is expensive in time and memory
– Dropout makes it practical to apply bagging to very
many large neural networks
• It is a method of bagging applied to neural networks
• Dropout is an inexpensive but powerful method
of regularizing a broad family of models

• Dropout trains an ensemble of all subnetworks
– Subnetworks formed by removing non-output units
from an underlying base network
• We can effectively remove a unit by multiplying
its output value by zero
– For networks based on performing a series of affine
transformations and non-linearities
– Needs some modification for radial basis functions
based on difference between unit state and a
reference value

• A simple way to prevent neural net overfitting
(a) A standard neural net with
two hidden layers
(b) A thinned net produced by
applying dropout, crossed units
have been dropped
Drop hidden and
visible units from the net,
i.e., temporarily remove
them from the network with
all input/output connections.
The choice of units to drop is
random, determined by a
probability p, chosen using a
validation set, or equal to 0.5

Deep net in Keras
Validate on CIFAR-10 dataset
Network built had three convolution layers of
size 64, 128 and 256 followed by two densely
connected layers of size 512 and a dense
output layer of size 10
[Figure: accuracy vs dropout; loss vs dropout]

• In bagging we define k different models,
construct k different data sets by sampling from
the dataset with replacement, and train model i
on dataset i
• Dropout aims to approximate this process, but
with an exponentially large no. of neural
networks

• Remove non-output units
from base network.
• Remaining 4 units
yield 16 networks
• Here many networks have
no path from input to output
• Problem insignificant with
large networks
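The count of 16 sub-networks can be reproduced by enumerating every keep/drop mask over four units. The base network assumed here (two inputs x1, x2 and two hidden units h1, h2, fully connected to a single output) is an illustration; the slide states only that 4 units give 16 networks.

```python
from itertools import product

units = ["x1", "x2", "h1", "h2"]   # assumed base network: 2 inputs, 2 hidden units
masks = list(product([0, 1], repeat=len(units)))   # every keep/drop combination


def has_path(mask):
    """With fully connected layers, an input-to-output path exists only if
    at least one input unit and at least one hidden unit are kept."""
    keep = dict(zip(units, mask))
    return bool((keep["x1"] or keep["x2"]) and (keep["h1"] or keep["h2"]))


connected = [m for m in masks if has_path(m)]
disconnected = [m for m in masks if not has_path(m)]
```

In this small network a large share of the masks cut every path; with wide layers the chance of dropping all units in a layer becomes negligible.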

• To train with dropout we use minibatch
based learning algorithm that takes
small steps such as SGD
• At each step randomly sample a binary
mask
– Probability of including a unit is a
hyperparameter
• 0.5 for hidden units and 0.8 for input units
• We run forward & backward
propagation as usual
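Sampling the binary mask for one training step might look like the sketch below. The inclusion probabilities 0.8 and 0.5 come from the slide; the activation values, the fixed random seed and the helper names are assumptions for illustration.

```python
import random


def sample_mask(n_units, keep_prob, rng):
    """Binary mask: each unit is included (1) with probability keep_prob."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(n_units)]


def apply_mask(values, mask):
    """Dropped units output zero; kept units are unchanged."""
    return [v * m for v, m in zip(values, mask)]


rng = random.Random(0)                       # fixed seed so the sketch is repeatable
inputs = [0.2, -1.3, 0.7, 0.5, 1.1]          # assumed input values
hidden = [0.9, -0.4, 0.3, 0.8]               # assumed hidden activations

input_mask = sample_mask(len(inputs), 0.8, rng)    # 0.8 for input units
hidden_mask = sample_mask(len(hidden), 0.5, rng)   # 0.5 for hidden units
dropped_inputs = apply_mask(inputs, input_mask)
dropped_hidden = apply_mask(hidden, hidden_mask)
```

A fresh mask is drawn for every minibatch step, so each step trains a different sub-network that shares its weights with all the others.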

• Feed-forward network with binary mask vector μ
whose elements correspond to input and hidden units
• Elements of μ
• Probability of an element being 1 is a hyperparameter
• 0.5 for hidden
• 0.8 for input
• Each unit is multiplied by the corresponding mask element
• Forward prop as usual
• Equivalent to randomly selecting one of the
subnetworks of previous slide

• Suppose that mask vector μ specifies which
units to include
• Cost of the model is specified by J(θ,μ)
• Dropout training consists of minimizing E_μ[J(θ,μ)]
• Expected value contains exponential no. of
terms
• We can get an unbiased estimate of its gradient
by sampling values of μ

• Dropout training not same as bagging training
– In bagging, the models are all independent
– In dropout, models share parameters
• Models inherit subsets of parameters from parent
network
• Parameter sharing allows an exponential no. of
models with a tractable amount of memory
• In bagging each model is trained to
convergence on its respective training set
– In dropout, most models are not explicitly trained
• Fraction of sub-networks are trained for a single step
• Parameter sharing allows good parameter settings

• Bagging:
– Ensemble accumulates votes of members
– Process is referred to as inference
• Assume model needs to output a probability distribution
• In bagging, model i produces p(i)(y|x)
• Prediction of ensemble is the arithmetic mean
\[
\frac{1}{k} \sum_{i=1}^{k} p^{(i)}(y \mid x)
\]
• Dropout:
– Submodel defined by mask vector μ defines a
probability distribution p(y|x,μ)
– Arithmetic mean over all masks is
\[
\sum_{\mu} p(\mu)\, p(y \mid x, \mu)
\]
• Where p(μ) is the distribution used to sample μ at
training time

• Dropout prediction is
\[
\sum_{\mu} p(\mu)\, p(y \mid x, \mu)
\]
 It is intractable to evaluate due to
an exponential no. of terms
 We can approximate inference using sampling
 By averaging together the output from many
masks
 10-20 masks are sufficient for good performance
 Even better approach, at the cost of a single
forward propagation:
 use geometric mean rather than arithmetic mean
of the ensemble members' predicted distributions
