This document discusses neural networks and their applications. It begins with an overview of neurons and the brain, then describes the basic components of neural networks including layers, nodes, weights, and learning algorithms. Examples are given of early neural network designs from the 1940s-1980s and their applications. The document also summarizes backpropagation learning in multi-layer networks and discusses common network architectures like perceptrons, Hopfield networks, and convolutional networks. In closing, it notes the strengths and limitations of neural networks along with domains where they have proven useful, such as recognition, control, prediction, and categorization tasks.
ACUMENS ON NEURAL NET AKG 20 7 23.pptx
1. Presented by
Dr. Gnanasekar.A.K
Professor & Head
Department of Computer Science & Engineering
P T Lee Chengalvaraya Naicker College of Engineering & Technology, Kanchipuram
hodcse@ptleecncet.com
9677338977
2. Neurons and the Brain
Neural Networks
Perceptrons
Multi-layer Networks
Applications
The Hopfield Network
Multilayer Neural Networks
Deep Convolutional Nets
Recurrent Nets
Deep Belief Networks
Dropout
3. A model of reasoning based on the human brain
Complex networks of simple computing elements
Capable of learning from examples
with appropriate learning methods
Collection of simple elements performs high-level
operations
4. The human brain incorporates nearly 10 billion
neurons and 60 trillion connections between them.
Our brain can be considered as a highly complex,
non-linear and parallel information-processing
system.
Learning is a fundamental and essential
characteristic of biological neural networks.
5. [Russell & Norvig, 1995]
Year | Neural Network | Designer | Description
1943 | McCulloch and Pitts neuron | McCulloch, Pitts | Logic gates
1949 | Hebb | Hebb | Strength increases if neurons are active
1958-1988 | Perceptron | Rosenblatt | Weights of path can be adjusted
1960 | Adaline | Widrow and Hoff | Mean squared error
1972 | SOM (self-organizing map) | Kohonen | Clustering
1982 | Hopfield | John Hopfield | Associative memory nets
1986 | Back Propagation | Rumelhart | Multilayer
1987-90 | ART (Adaptive Resonance Theory) | Carpenter | Used for both binary and analog
11. Recall: the processing phase of a NN, whose
objective is to retrieve information. It is the
process of computing the output o for a given input x
Basic forms of neural information
processing
Auto association
Hetero association
Classification
12. Set of patterns can be
stored in the network
If a pattern similar to
a member of the
stored set is
presented, an
association with the
input of closest stored
pattern is made
13. Associations between
pairs of patterns are
stored
Distorted input
pattern may cause
correct
heteroassociation at
the output
14. Set of input patterns
is divided into a
number of classes or
categories
In response to an
input pattern from
the set, the classifier
is supposed to recall
the information
regarding class
membership of the
input pattern.
15. Weights
Bias
Threshold
Learning rate
Momentum factor
Vigilance parameter
Notations used in ANN
16. Each neuron is connected to every other
neuron by means of directed links
Links are associated with weights
Weights contain information about the input
signal and are represented as a matrix
Weight matrix also called connection matrix
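17. The weight matrix, written by rows and by elements:
\[
W =
\begin{bmatrix} w_1^T \\ w_2^T \\ w_3^T \\ \vdots \\ w_n^T \end{bmatrix}
=
\begin{bmatrix}
w_{11} & w_{12} & w_{13} & \cdots & w_{1m} \\
w_{21} & w_{22} & w_{23} & \cdots & w_{2m} \\
\vdots & \vdots & \vdots &        & \vdots \\
w_{n1} & w_{n2} & w_{n3} & \cdots & w_{nm}
\end{bmatrix}
\]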
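18. wij is the weight from processing element "i"
(source node) to processing element "j" (destination node)
[Figure: inputs X1, ..., Xi, ..., Xn connect to unit Yj through weights w1j, ..., wij, ..., wnj; a constant input 1 connects through the bias bj]
With x_0 = 1 and w_{0j} = b_j, the net input to unit Yj is
\[
y_{inj} = \sum_{i=0}^{n} x_i w_{ij}
        = x_0 w_{0j} + x_1 w_{1j} + x_2 w_{2j} + \dots + x_n w_{nj}
        = w_{0j} + \sum_{i=1}^{n} x_i w_{ij}
\]
\[
y_{inj} = b_j + \sum_{i=1}^{n} x_i w_{ij}
\]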
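The net input of a unit, y_inj = b_j + sum over i of x_i * w_ij, can be sketched in a few lines of Python. This is only an illustration: the function names and the numbers are made up for the example, and the second function shows the equivalent form in which the bias is treated as an extra weight on a constant input x_0 = 1.

```python
def net_input(x, w, b):
    """Net input of one unit: bias plus the weighted sum of its inputs."""
    return b + sum(xi * wi for xi, wi in zip(x, w))


def net_input_bias_as_weight(x, w, b):
    """Same quantity, with the bias folded in as weight w_0 on input x_0 = 1."""
    x_ext = [1.0] + list(x)   # prepend the constant input x_0 = 1
    w_ext = [b] + list(w)     # prepend the bias as w_0
    return sum(xi * wi for xi, wi in zip(x_ext, w_ext))


# Illustrative values (not taken from the slides).
x = [0.5, -1.0, 2.0]
w = [0.4, 0.3, -0.2]
b = 0.1

y_in = net_input(x, w, b)
y_in_alt = net_input_bias_as_weight(x, w, b)
```

Both forms give the same value, which is why the bias is often described as just another weight.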
19. Used to calculate the output response of
a neuron.
An activation function is applied to the
weighted sum of the input signals to obtain
the response.
Activation functions can be linear or
non-linear
Already dealt
Identity function
Single/binary step function
Discrete/continuous sigmoidal function.
20. Bias is like another weight. It is included by
adding a component x0=1 to the input vector
X.
X=(1,X1,X2…Xi,…Xn)
Bias is of two types
Positive bias: increase the net input
Negative bias: decrease the net input
21. The relationship between input and output given by
the equation of straight line y=mx+c
X Y
Input
C(bias)
y=mx+C
22. Set value based upon which the final output of
the network may be calculated
Used in activation function
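The activation function using threshold can be
defined as
\[
f(net) =
\begin{cases}
 1 & \text{if } net \ge \theta \\
-1 & \text{if } net < \theta
\end{cases}
\]
where \theta is the threshold.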
23. Denoted by α.
Used to control the amount of weight
adjustment at each step of training
Learning rate ranging from 0 to 1 determines
the rate of learning in each time step
25. No one‐fits‐all formula
Overfitting can occur if a “good” training set is
not chosen
What constitutes a “good” training set?
Samples must represent the general
population.
Samples must contain members of each class.
Samples in each class must contain a wide
range of variations or noise effect.
The size of the training set is related to the
number of hidden neurons
27. This is what we have seen so far!
A network is fed with a set of training
samples (inputs and corresponding
output), and it uses these samples to
learn the general relationship between
the inputs and the outputs.
This relationship is represented by the
values of the weights of the trained
network.
28. No desired output is associated with
the training data!
Faster than supervised learning
Used to find out structures within data:
Clustering
Compression
29. Like supervised learning, but:
Weights adjusting is not directly related to
the error value.
The error value is used to randomly shuffle
weights!
Relatively slow learning due to
‘randomness’.
30. Function approximation
including time series prediction and modelling.
Classification
including pattern and sequence recognition,
novelty detection and sequential decision making.
(radar systems, face identification, handwritten text
recognition)
Data processing
including filtering, clustering, blind source
separation and compression.
(data mining, e-mail Spam filtering)
31. Advantages
Adapt to unknown situations
Powerful, it can model complex functions.
Ease of use, learns by example, and very
little user domain‐specific expertise needed
Disadvantages
Forgets
Not exact
Large complexity of the network structure
32. [Russell & Norvig, 1995]
This vastly simplified model of real neurons is also
known as a Threshold Logic Unit
A set of connections brings in activations from other
neurons.
A processing unit sums the inputs, and then applies a non-
linear activation function (i.e. squashing/ Transfer/
threshold function).
An output line transmits the result to other neurons
y = F(w1x1+ w2x2 - b)
34. 1st Neural Network: AND function
[Figure: X1 and X2 each feed Y with weight 1]
Threshold(Y) = 2
1st Neural Network: OR function
[Figure: X1 and X2 each feed Y with weight 2]
Threshold(Y) = 2
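35. 1st Neural Network: AND not function
[Figure: X1 feeds Y with weight 2, X2 feeds Y with weight -1]
Threshold(Y) = 2
1st Neural Network: XOR function
[Figure: X1 and X2 feed hidden units Z1 and Z2 with weights 2 and -1; Z1 and Z2 each feed Y with weight 2]
Threshold(Y) = 2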
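The gate networks on these two slides can be checked with a small threshold-unit sketch. The weights (1, 1 for AND; 2, 2 for OR; 2, -1 for AND NOT) and the threshold of 2 come from the slides. The XOR wiring is an assumption read off the figure labels: Z1 computes "X1 AND NOT X2", Z2 computes "X2 AND NOT X1", and Y is the OR of the two. The function names are illustrative.

```python
def threshold_unit(inputs, weights, threshold=2):
    """Fire (1) when the weighted input sum reaches the threshold, else 0."""
    net = sum(x * w for x, w in zip(inputs, weights))
    return 1 if net >= threshold else 0


def and_gate(x1, x2):
    return threshold_unit((x1, x2), (1, 1))


def or_gate(x1, x2):
    return threshold_unit((x1, x2), (2, 2))


def and_not_gate(x1, x2):
    # x1 AND (NOT x2): the -1 weight lets x2 inhibit the unit
    return threshold_unit((x1, x2), (2, -1))


def xor_gate(x1, x2):
    # Assumed wiring: two hidden AND-NOT units combined by an OR unit.
    z1 = threshold_unit((x1, x2), (2, -1))
    z2 = threshold_unit((x1, x2), (-1, 2))
    return threshold_unit((z1, z2), (2, 2))


pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
truth_tables = {
    "AND": [and_gate(a, b) for a, b in pairs],
    "OR": [or_gate(a, b) for a, b in pairs],
    "AND NOT": [and_not_gate(a, b) for a, b in pairs],
    "XOR": [xor_gate(a, b) for a, b in pairs],
}
```

XOR is the only gate here that needs a hidden layer, since a single threshold unit can only separate its inputs with one straight line.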
37. Weighted inputs are summed up by the input function
The (nonlinear) activation function calculates the activation value, which
determines the output
[Russell & Norvig, 1995]
38. Step_t(x) = 1 if x >= t, else 0
Sign(x) = +1 if x >= 0, else -1
Sigmoid(x) = 1/(1 + e^(-x))
[Russell & Norvig, 1995]
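The three activation functions on this slide (step with threshold t, sign, and sigmoid) translate directly into code. A minimal sketch, with function names chosen only for illustration:

```python
import math


def step(x, t=0.0):
    """Step_t(x): 1 if x >= t, else 0."""
    return 1 if x >= t else 0


def sign(x):
    """Sign(x): +1 if x >= 0, else -1."""
    return 1 if x >= 0 else -1


def sigmoid(x):
    """Sigmoid(x) = 1 / (1 + e^(-x)), a smooth version of the step."""
    return 1.0 / (1.0 + math.exp(-x))


samples = [-2.0, 0.0, 2.0]
step_values = [step(x, t=0.5) for x in samples]
sign_values = [sign(x) for x in samples]
sigmoid_values = [sigmoid(x) for x in samples]
```

The step and sign functions jump at the threshold, while the sigmoid moves smoothly from 0 to 1, which is what makes it usable with gradient-based learning.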
39. simple neurons can act as logic gates
appropriate choice of activation function, threshold, and
weights
step function as activation function
[Russell & Norvig, 1995]
40. layered structures
networks are arranged into layers
interconnections mostly between two layers
some networks may have feedback
connections
41. single layer, feed-
forward network
historically one of
the first types of
neural networks
late 1950s
the output is
calculated as a step
function applied to
the weighted sum of
inputs
capable of learning
simple functions
linearly separable
[Russell & Norvig, 1995]
42. [Russell & Norvig, 1995]
perceptrons can deal with linearly separable functions
some simple functions are not linearly separable
XOR function
0,0
0,1
1,0
1,1
0,0
0,1
1,0
1,1
AND XOR
43. linear separability can be extended to more than two dimensions
more difficult to visualize
[Russell & Norvig, 1995]
44. This is done by making small adjustments in the weights
to reduce the difference between the actual and desired
outputs of the Perceptron.
The initial weights are randomly assigned,
usually in the range [-0.5, 0.5], or [0, 1]
Then they are updated to obtain the output
consistent with the training examples.
45. perceptrons can learn from examples through a simple
learning rule. For each example row (iteration), do the
following:
calculate the error of a unit Erri as the difference between
the correct output Ti and the calculated output Oi
Erri = Ti - Oi
adjust the weight Wj of the input Ij such that the error
decreases
Wj = Wj + α * Ij * Erri
α is the learning rate, a positive constant less than unity.
this is a gradient descent search through the weight space
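The learning rule (error = target minus output, then each weight moves by learning rate times input times error) can be sketched on the AND function, which is linearly separable. Several choices below are assumptions made for the example: the weights start at zero rather than at random values so the run is reproducible, the bias is handled as a weight on a constant input of 1, the unit fires when its net input is >= 0, and the learning rate is 0.5.

```python
def step(net):
    """Step activation: output 1 when the net input is non-negative."""
    return 1 if net >= 0 else 0


def predict(weights, inputs):
    x = [1.0] + list(inputs)  # constant input 1 carries the bias weight
    return step(sum(w * xi for w, xi in zip(weights, x)))


def train_perceptron(samples, alpha=0.5, epochs=20):
    """Perceptron rule: W_j <- W_j + alpha * I_j * Err for every example."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * (n_inputs + 1)  # assumption: zero start, not random
    for _ in range(epochs):
        for inputs, target in samples:
            err = target - predict(weights, inputs)  # Err = T - O
            x = [1.0] + list(inputs)
            weights = [w + alpha * xi * err for w, xi in zip(weights, x)]
    return weights


and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights = train_perceptron(and_samples)
predictions = [predict(weights, inputs) for inputs, _ in and_samples]
```

Running the same loop on XOR targets would never settle, because no single line separates the XOR classes.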
47. The network consists of an input layer of source
neurons, at least one middle or hidden layer of
computational neurons, and an output layer of
computational neurons.
The input signals are propagated in a forward
direction on a layer-by-layer basis
feedforward neural network
the back-propagation learning algorithm can be
used for learning in multi-layer networks
48. two-layer network
input units Ik
usually not counted as a
separate layer
hidden units aj
output units Oi
usually all nodes of one
layer have weighted
connections to all nodes
of the next layer
Ik
aj
Oi
Wji
Wkj
50. Learning in a multilayer network proceeds the
same way as for a perceptron.
A training set of input patterns is presented to
the network.
The network computes its output pattern, and if
there is an error or in other words a difference
between actual and desired output patterns
the weights are adjusted to reduce this error.
proceeds from the output layer to the hidden layer(s)
updates the weights of the units leading to the layer
51. In a back-propagation neural network, the learning
algorithm has two phases.
First, a training input pattern is presented to the
network input layer. The network propagates the input
pattern from layer to layer until the output pattern is
generated by the output layer.
If this pattern is different from the desired output, an
error is calculated and then propagated backwards
through the network from the output layer to the input
layer. The weights are modified as the error is
propagated.
57. expressiveness
weaker than predicate logic
good for continuous inputs and outputs
computational efficiency
training time can be exponential in the number of inputs
depends critically on parameters like the learning rate
local minima are problematic
can be overcome by simulated annealing, at additional cost
generalization
works reasonably well for some functions (classes of
problems)
no formal characterization of these functions
58. sensitivity to noise
very tolerant
they perform nonlinear regression
transparency
neural networks are essentially black boxes
there is no explanation or trace for a particular answer
tools for the analysis of networks are very limited
some limited methods to extract rules from networks
prior knowledge
very difficult to integrate since the internal representation of
the networks is not easily accessible
59. domains and tasks where neural networks
are successfully used
recognition
control problems
series prediction
weather, financial forecasting
categorization
sorting of items (fruit, characters, …)
60. Neural networks were designed on analogy with
the brain.
The brain’s memory, however, works by
association.
For example, we can recognise a familiar face even in an
unfamiliar environment within 100-200 ms.
We can also recall a complete sensory experience,
including sounds and scenes, when we hear only a few bars
of music.
The brain routinely associates one thing with
another.
The Hopfield Network
61. Multilayer neural networks trained with the back-
propagation algorithm are used for pattern
recognition problems.
However, to emulate the human memory’s
associative characteristics we need a different type
of network: a recurrent neural network.
A recurrent neural network has feedback loops
from its outputs to its inputs.
62. Single-layer n-neuron Hopfield network
[Figure: input signals x1, x2, ..., xi, ..., xn feed neurons 1, 2, ..., i, ..., n, which produce output signals y1, y2, ..., yi, ..., yn]
The stability of recurrent networks was solved only in
1982, when John Hopfield formulated the physical
principle of storing information in a dynamically
stable network.
63. [Russell & Norvig, 1995]
o Objective function, averaged over all
training examples is a hilly landscape in
high-dimensional space of weight values
o Negative gradient vector indicates
direction of steepest descent taking it
closer to a minimum
64. • Goal is to learn the weights w from a
labelled set of training samples
• Learning procedure has two stages
1. Evaluate derivatives of error function
∇E(w) with respect to weights w1,..wT
2. Use derivatives to compute adjustments to
weights
[Figure labels: w^(1), w^(τ), E(w^(τ))]
65. • Gradient descent
• Update the weights using
\[
w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E(w^{(\tau)})
\]
• Where the gradient vector \nabla E(w^{(\tau)}) consists of the vector of
derivatives evaluated using back-propagation
\[
\nabla E(w) = \frac{dE}{dw} =
\begin{bmatrix}
\dfrac{\partial E}{\partial w^{(1)}_{11}} \\
\vdots \\
\dfrac{\partial E}{\partial w^{(1)}_{MD}} \\
\dfrac{\partial E}{\partial w^{(2)}_{11}} \\
\vdots \\
\dfrac{\partial E}{\partial w^{(2)}_{KM}}
\end{bmatrix}
\]
• There are W = M(D+1) + K(M+1) elements in the vector
• Gradient \nabla E(w^{(\tau)}) is a W x 1 vector
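The update rule "new weights = old weights minus eta times the gradient of E" can be tried on a toy error function whose minimum is known. The quadratic E(w) = (w1 - 3)^2 + (w2 + 1)^2, the step size eta = 0.1 and the starting point are assumptions chosen for illustration; in a network the gradient would come from back-propagation instead of a closed-form expression.

```python
def error(w):
    """Toy error surface with its minimum at w = (3, -1)."""
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2


def gradient(w):
    """Vector of partial derivatives of the toy error."""
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


def gradient_descent(w, eta=0.1, steps=100):
    """Repeat w <- w - eta * grad E(w)."""
    for _ in range(steps):
        g = gradient(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w


w_start = [0.0, 0.0]
w_final = gradient_descent(w_start)
```

Each step here shrinks the distance to the minimum by a constant factor, so the iterate ends very close to (3, -1).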
66. Most practitioners use SGD for DL
Consists of showing the input vector for a few examples,
computing the outputs and the errors,
Computing the average gradient for those examples,
and adjusting the weights accordingly.
Process repeated for many small sets of examples
from the training set until the average of the objective function stops decreasing.
Called stochastic because each small set of examples
gives a noisy estimate of the average gradient over all examples.
Usually finds a good set of weights quickly compared to elaborate optimization techniques.
After training, performance is measured
on a different test set:
tests generalization ability of the machine
— its ability to produce sensible answers on new inputs
never seen during training.
[Figure: fluctuations in the objective as gradient steps are taken in mini-batches]
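The procedure described on this slide (take a few examples, average their gradient, adjust the weights, repeat over many small sets) can be sketched for the simplest possible model, a line y = w*x + b fitted with squared error. Everything concrete here is an assumption for the example: the noise-free data generated from y = 2x + 1, the batch size of 4, the learning rate of 0.1 and the fixed random seed.

```python
import random


def sgd_linear(xs, ys, lr=0.1, batch_size=4, epochs=500, seed=0):
    """Mini-batch SGD for y = w*x + b under mean squared error."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # a new random split into mini-batches
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient over the few examples in this mini-batch.
            grad_w = sum(2.0 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2.0 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Illustrative training set: 21 points on the line y = 2x + 1.
xs = [i / 10.0 for i in range(-10, 11)]
ys = [2.0 * x + 1.0 for x in xs]
w_fit, b_fit = sgd_linear(xs, ys)
```

Each mini-batch gives only a noisy estimate of the full gradient, yet the fitted slope and intercept should end up close to 2 and 1.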
67. 1. Apply input vector x_n to network and
forward propagate through network using
\[
a_j = \sum_i w_{ji} z_i \quad \text{and} \quad z_j = h(a_j)
\]
2. Evaluate \delta_k for all output units using
\[
\delta_k = y_k - t_k
\]
3. Backpropagate the \delta's using
\[
\delta_j = h'(a_j) \sum_k w_{kj} \delta_k
\]
to obtain \delta_j for each hidden unit
4. Use
\[
\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i
\]
to evaluate required derivatives
• Backpropagation Formula
\[
\delta_j = h'(a_j) \sum_k w_{kj} \delta_k
\]
• Value of \delta for a particular
hidden unit can be obtained
by propagating the \delta's
backward from units higher
up in the network
[Figure: unit j, receiving z_i, feeds unit k]
69. • Two-layer network
• Sum-of-squared error
• Output units: linear activation
functions, i.e., multiple regression
\[
y_k = a_k
\]
• Hidden units have a sigmoidal
activation function
\[
h(a) = \tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}
\]
with a simple form for the derivative
\[
h'(a) = 1 - h(a)^2
\]
Standard Sum of Squared Error
\[
E_n = \frac{1}{2} \sum_{k=1}^{K} (y_k - t_k)^2
\]
y_k: activation of output unit k; t_k: corresponding target
for input x_n
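70. Simple Example: Forward and Backward Prop
For each input in training set:
• Forward Propagation
\[
a_j = \sum_{i=0}^{D} w^{(1)}_{ji} x_i, \qquad
z_j = \tanh(a_j), \qquad
y_k = \sum_{j=0}^{M} w^{(2)}_{kj} z_j
\]
• Output differences
\[
\delta_k = y_k - t_k
\]
• Backward Propagation (\delta's for hidden
units), from \delta_j = h'(a_j) \sum_k w_{kj} \delta_k with h'(a) = 1 - h(a)^2
\[
\delta_j = (1 - z_j^2) \sum_{k=1}^{K} w_{kj} \delta_k
\]
• Derivatives wrt first layer and second layer
weights
\[
\frac{\partial E_n}{\partial w^{(1)}_{ji}} = \delta_j x_i, \qquad
\frac{\partial E_n}{\partial w^{(2)}_{kj}} = \delta_k z_j
\]
• Batch method
\[
\frac{\partial E}{\partial w_{ji}} = \sum_n \frac{\partial E_n}{\partial w_{ji}}
\]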
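The forward and backward passes for this two-layer example (tanh hidden units, linear outputs, sum-of-squares error) can be sketched and then checked against finite differences. The deltas follow the formulas delta_k = y_k - t_k and delta_j = (1 - z_j^2) * sum_k w_kj * delta_k. The network size, weight values and input are assumptions made up for the example, and biases are handled with a constant first entry x_0 = z_0 = 1.

```python
import math


def forward(x, W1, W2):
    """x and z both carry a leading bias entry equal to 1."""
    a = [sum(w * xi for w, xi in zip(row, x)) for row in W1]   # a_j
    z = [1.0] + [math.tanh(aj) for aj in a]                     # z_0 = 1, z_j = tanh(a_j)
    y = [sum(w * zj for w, zj in zip(row, z)) for row in W2]   # linear outputs y_k
    return z, y


def error(x, t, W1, W2):
    _, y = forward(x, W1, W2)
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))


def backprop(x, t, W1, W2):
    z, y = forward(x, W1, W2)
    delta_k = [yk - tk for yk, tk in zip(y, t)]                 # output deltas
    # Hidden deltas; index 0 of z is the bias unit and gets no delta.
    delta_j = [(1.0 - z[j] ** 2) * sum(W2[k][j] * delta_k[k] for k in range(len(W2)))
               for j in range(1, len(z))]
    grad_W1 = [[dj * xi for xi in x] for dj in delta_j]         # dE/dw1_ji = delta_j * x_i
    grad_W2 = [[dk * zj for zj in z] for dk in delta_k]         # dE/dw2_kj = delta_k * z_j
    return grad_W1, grad_W2


def numeric_grad(x, t, W1, W2, layer, r, c, eps=1e-5):
    """Central finite difference for one weight, used only as a check."""
    W = W1 if layer == 1 else W2
    old = W[r][c]
    W[r][c] = old + eps
    e_plus = error(x, t, W1, W2)
    W[r][c] = old - eps
    e_minus = error(x, t, W1, W2)
    W[r][c] = old
    return (e_plus - e_minus) / (2.0 * eps)


# Illustrative network: D = 2 inputs, M = 2 hidden units, K = 1 output.
x = [1.0, 0.5, -0.3]
t = [0.2]
W1 = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
W2 = [[0.05, 0.6, -0.7]]

g1, g2 = backprop(x, t, W1, W2)
max_diff = max(
    [abs(g1[r][c] - numeric_grad(x, t, W1, W2, 1, r, c)) for r in range(2) for c in range(3)]
    + [abs(g2[r][c] - numeric_grad(x, t, W1, W2, 2, r, c)) for r in range(1) for c in range(3)]
)
```

The finite-difference comparison is a common sanity check: if the back-propagated derivatives are right, the two sets of numbers agree to several decimal places.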
71. 2 Hidden Layers, 1 Output Layer
Each layer is a module through which one can back-propagate
gradients.
At each layer, we
first compute the
total input z to each
unit,
which is a weighted sum of
the outputs
of the units in the layer
below.
Then a non-linear function f(.)
is applied to z to get the
output of the unit.
72. A small change Δx in x gets transformed
first into a small change Δy in y by getting
multiplied by ∂y/∂x (that is, the definition of
partial derivative). Similarly, the
change Δy creates a change Δz in z.
Substituting one equation into the other
gives the chain rule of derivatives:
how Δx gets turned into Δz through
multiplication by the product of ∂y/∂x
and ∂z/∂y.
It also works when x, y and z are vectors
(and the derivatives are Jacobian
matrices).
Tells us how two small effects (that of a small change of x on y, and that of y on z)
are composed.
73. Two hidden layers
Layers are labelled i, j, k, l
At each hidden layer we compute
the error derivative wrt the output of each unit,
which is a weighted sum of the error derivatives
wrt the total inputs to the units in the layer above.
Convert error derivative wrt the output into the error
derivative wrt the input
by multiplying it by the gradient of f(z).
At the output layer, the error derivative wrt
the output of a unit is computed
by differentiating the cost function.
This gives y_l − t_l if the cost function for unit l is
0.5(y_l − t_l)^2,
where t_l is the target value.
Once the ∂E/∂zk is known,
the error-derivative for the weight wjk
on the connection from
unit j in the layer below is just yj ∂E/∂zk.
74. • Most popular today is Rectified Linear Unit(ReLU)
• a half-wave rectifier
f(z)=max(z,0)
• In past decades, neural nets used smoother
non-linearities
tanh(z) or 1/(1+exp(-z))
• ReLU learns faster in networks with many layers
• Allowing training of deep supervised network
without unsupervised pre-training
75. • Rectified linear unit (ReLU)
f(z) = max(0,z)
commonly used in recent years,
• More conventional sigmoids:
• hyperbolic tangent,
f(z) = (exp(z) − exp(−z))/(exp(z) + exp(−z)) and
• logistic function,
f(z) = 1/(1 + exp(−z))
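The three non-linearities listed here can be written out directly from their formulas. The sketch below uses illustrative function names and also checks a standard identity linking the two sigmoids, tanh(z) = 2 * logistic(2z) - 1, which shows that they differ only by a shift and rescaling.

```python
import math


def relu(z):
    """Rectified linear unit: f(z) = max(0, z)."""
    return max(0.0, z)


def tanh_unit(z):
    """Hyperbolic tangent written out from its exponential definition."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))


def logistic(z):
    """Logistic function: f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


zs = [-2.0, -0.5, 0.0, 0.5, 2.0]
relu_values = [relu(z) for z in zs]
identity_gap = max(abs(tanh_unit(z) - (2.0 * logistic(2.0 * z) - 1.0)) for z in zs)
```

ReLU passes positive inputs unchanged and zeroes out the rest, so unlike the two sigmoids it does not saturate for large positive z.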
76. Network with 2 inputs, 2 hidden units, 1 output unit
• Input data (red and blue curves) are not linearly separable
• Network makes them linearly separable
• By distorting the input space
• Note that grid is also distorted
[Figure: input space (x1, x2) mapped by the hidden units to (y1, y2), then to output z]
77. • Linear classifiers can only carve the input space
into very simple regions
• Image and speech recognition require input-output
function to be insensitive to irrelevant variations of
the input,
• e.g., position, orientation and illumination of an object
• Variations in pitch or accent of speech
• While being sensitive to minute variations, e.g., white wolf
and breed of wolf-like white dog called Samoyed
• At pixel level two Samoyeds in different positions may be
very different, whereas a Samoyed and a wolf in the
same position and background may be very similar
78. • Shallow classifiers need a good feature extractor
• One that produces representations that are:
• selective to aspects of image important for
discrimination
• but invariant to irrelevant aspects such as pose of the
animal
• Generic features (e.g., Gaussian kernel)
do not generalize well far from training
examples
• Hand-designing good feature extractors
requires engineering skill and domain
expertise
• Deep learning learns features automatically
79. • Multilayer stack of simple modules
• All modules (or most) subject to :
• Learning
• Non-linear input-output mappings
• Each module transforms input to improve both
selectivity and invariance of the
representation
• With depth of 5 to 20 layers can implement
extremely intricate functions of input
• Sensitive to minute details
• Distinguish Samoyeds from white wolves
• Insensitive to irrelevant variations
• Background, pose, lighting, surrounding objects
80. • It was thought that simple gradient descent would get
trapped in local minima
• Rarely a problem with large networks
• Regardless of initial conditions, system always reaches
solutions of similar quality
• Landscape is packed with a combinatorially large number of
saddle points where gradient is zero
• Almost all have similar values of objective function
w(τ+1)
= w(τ )
− η∇ E (w(τ )
)
81. Srihari
• Unsupervised learning to create layers of
feature detectors
• No need for labelled data
• Objective of learning in each layer of
feature detectors:
• Ability to reconstruct or model activities of feature detectors
(or raw inputs) in layer below
• By pre-training weights of a deep network could be initialized
to sensible values
• Final layer of output units added at top of
network and whole deep system fine-tuned using
back-propagation
• Worked well in handwritten digit recognition
• When data was limited
82. • First major application was in speech recognition
• Made possible by advent of fast GPUs
• Allowed networks to be trained 10 or 20 times faster
• Record-breaking results on speech recognition
benchmark
83. • It turned out that pre-training stage was
only needed for small data sets
• Convolutional neural networks
• Type of deep feedforward network
• Much easier to train and generalized
much better than networks with full
connectivity between adjacent layers
84. • Designed to process data that come in the form
of multiple arrays
• E.g., a color image composed of three 2D arrays of pixel
intensities in three color channels
• Many data modalities are in the form of
multiple arrays:
• 1D for signals and sequences, including language
• 2D for images and audio spectrograms
• 3D for video or volumetric images
85. • Take advantage of properties of natural signals
1. Local connections
2. Shared weights
3. Pooling
4. Use of many layers
86. • Need substantial number of training samples
• Slow learning (convergence times)
• Inadequate parameter selection techniques that
lead to poor minima
87. • Network should exhibit invariance to
translation, scaling and elastic
deformations
• A large training set can take care of this
• It ignores a key property of images
• Nearby pixels are more strongly correlated than distant
ones
• Modern computer vision approaches exploit this
property
• Information can be merged at later stages
to get higher order features and about
whole image
88. • Classic notions of simple cells and complex cells
• Architecture similar to LGN-V1-V2-V4-IT hierarchy in
visual cortex ventral pathway
• LGN: lateral geniculate nucleus receives input from retina
• 30 different areas of visual cortex: V1 and V2 are principal
• Infero-Temporal cortex performs object recognition
90. • Instead of treating input to a fully
connected network
• Two layers of Neural networks are
used
1. Layer of convolutional units
• which consider
overlapping regions
2. Layer of subsampling units
• Several feature maps and sub-
sampling
• Gradual reduction of spatial resolution
compensated by increasing no. of
features
• Final layer has softmax output
• Whole network trained
using backpropagation
Each pixel patch is 5 x 5.
This plane has 10 x 10 = 100 neural network units (called a feature map).
Weights are same for different planes, so only 25 weights are needed.
Due to weight sharing this is equivalent to convolution.
Different features have different feature maps.
[Figure: input image with 5 x 5 pixel patches, convolutional layer of 10 x 10 units, 2 x 2 subsampling, subsampling layer of 5 x 5 units]
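Weight sharing can be made concrete with a small sketch: one 5 x 5 filter (25 weights) is slid over every position of the input to produce a 10 x 10 feature map, which a 2 x 2 subsampling step then reduces to 5 x 5. The 14 x 14 input size is an assumption chosen so that a 5 x 5 patch yields a 10 x 10 map, the averaging form of subsampling is also an assumption, and the pixel and filter values are made up.

```python
def convolve_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1).

    Strictly this is cross-correlation, the form used in most ConvNet code.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def subsample(feature_map, size=2):
    """Average each non-overlapping size x size block of units."""
    return [[sum(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size)) / (size * size)
             for c in range(0, len(feature_map[0]), size)]
            for r in range(0, len(feature_map), size)]


# Illustrative 14 x 14 "image" and a 5 x 5 averaging filter.
image = [[float((r * 14 + c) % 7) for c in range(14)] for r in range(14)]
kernel = [[1.0 / 25.0] * 5 for _ in range(5)]

feature_map = convolve_valid(image, kernel)   # 10 x 10 units, all sharing 25 weights
pooled = subsample(feature_map)               # 5 x 5 units
n_shared_weights = sum(len(row) for row in kernel)
```

Without sharing, 100 units with 25 inputs each would need 2,500 separate weights; with sharing the whole feature map uses the same 25.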
91. • Reducing the complexity of a network
• Encouraging groups of weights to have similar
values
• Only applicable when form of the network
can be specified in advance
• Division of weights into groups, mean weight
value for each group and spread of values are
determined during the learning process
92. Outputs (not filters) of each layer (horizontally).
Each rectangular image is a feature map corresponding to output for one of the learned features,
detected at each of the image positions.
Lower-level features act as oriented edge detectors
93. • Structured as a series of stages
• First few stages are composed of two types of
layers and a non-linearity:
1. Convolutional layer
• To detect local conjunctions of features from previous
layer
2. Non-linearity
• ReLU
3. Pooling layer
• To merge semantically similar features into one
94. • Organized in feature maps
• Each unit connected to local patches in
feature maps of previous layer through
weights (called a filter bank)
• Result is passed through a ReLU
95. • Computes maximum of a local patch of units
in one feature map
• Neighboring pooling units take input from
patches that are shifted by more than one
row or column
• Thereby reducing the dimension of the
representation
• Creating invariance to small shifts and distortions
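A pooling unit that takes the maximum over a local patch, with neighbouring units shifted by more than one row or column, can be sketched as follows. The 2 x 2 window, the stride of 2 and the 4 x 4 feature map are assumptions picked for illustration.

```python
def max_pool(feature_map, window=2, stride=2):
    """Each pooling unit outputs the maximum of one local patch."""
    rows = range(0, len(feature_map) - window + 1, stride)
    cols = range(0, len(feature_map[0]) - window + 1, stride)
    return [[max(feature_map[r + i][c + j]
                 for i in range(window) for j in range(window))
             for c in cols]
            for r in rows]


# Illustrative 4 x 4 feature map.
feature_map = [
    [1, 2, 0, 0],
    [3, 4, 0, 1],
    [0, 0, 5, 6],
    [0, 1, 7, 8],
]
pooled = max_pool(feature_map)  # 2 x 2 output: [[4, 1], [1, 8]]
```

Because only the largest value in each patch survives, moving a strong activation by one position inside its patch leaves the pooled output unchanged, which is the source of the small-shift invariance.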
96. • As simple as through a regular
deep network
• Allow all weights in all filter
banks to be trained
97. • Neocognitron
• Similar architecture
• Did not have end-to-end supervised
learning using Backprop
• ConvNet with probabilistic model
was used for OCR and handwritten
check reading
98. • Applied with great success to images: detection,
segmentation and recognition of objects and regions
• Tasks where abundant data was available
• Traffic sign recognition
• Segmentation of biological images
• Connectomics
• Detection of faces, text, pedestrians, human bodies in natural
images
• Major recent practical success is face recognition
• Images can be labeled at pixel level
• Applications in autonomous mobile robots, and self-driving cars
• Other applications gaining importance
• Natural language understanding
• Speech recognition
99. • Data set of 1 million images from the web
• Contained 1,000 different classes
• Error rate
• Halved the error rate of best competing computer vision approaches
• Success came from:
• Efficient use of GPUs
• ReLUs
• New regularization technique called dropout
• Techniques to generate new training samples by deforming existing ones
100. • Combining ConvNets and Recurrent Net Modules
• Caption generated by a recurrent neural network (RNN)
taking as input:
1. Representation generated by a deep CNN
2. RNN trained to translate high-level representations of
images into captions
101. • Different focus (lighter patches given more
attention)
• As it generates each word (bold), it exploits it
to achieve better translation of images to
captions
102. 1. 10 to 20 layers of ReLUs
2. Hundreds of millions of weights
3. Billions of connections between units
4. Training time
• Would have taken weeks a couple of years ago
• Advances in hardware, software and parallelization
reduces it to a few hours
• ConvNets are easily amenable to efficient
hardware implementations
• NVIDIA, Mobileye, Intel, Qualcomm and Samsung are
developing ConvNet chips for smartphones, cameras,
robots and self-driving cars
103. • Deep nets have two different exponential advantages
over classic learning algorithms. Both advantages
arise from
• power of composition
• Depend on underlying data-generating distribution having an
appropriate compositional structure
1. Learning distributed representations enables
generalization to new combinations of the values of
learned features beyond those seen during training
• E.g., 2^n combinations are possible with n binary features
2. Composing layers of representations brings another
advantage: exponential in depth
104. • Predicting the next word from local context of earlier
words
• Each word presented as a 1-of-N vector
• In the first layer each word creates a different word vector
• In the language model, other next layers learn to convert
input word vector to output word vector for the predicted word
[Figure, panel "Words": word representation for modeling language, non-linearly projected to 2-D using the t-SNE algorithm. Semantically similar words are mapped nearby.]
[Figure, panel "Phrases": 2-D representation of phrases learnt by an English-French encoder-decoder RNN.]
Learnt using backpropagation that jointly
learns representation for each word and
function that predicts a target quantity (next
word or sequence of words for translation)
105. • Logic-inspired paradigm uses
• Instance of a symbol is something for which the
only property is that it is either identical or
non-identical to other symbol instances
• It has no internal structure relevant to its use
• To reason with symbols they must be bound to
variables in judiciously chosen rules of inference
• Neural networks use
• big activity vectors, big weight matrices and
scalar non-linearities
• to perform fast intuitive inference that
underpins commonsense reasoning
106. Standard statistical models count frequencies
of short symbol sequences of length up to N
No. of possible sequences is V^N, where
V is vocabulary size
So taking context of more than a handful
of words would require very large corpora
N-grams treat each word as an atomic unit,
so they cannot generalize across
semantically related sequences
Neural models can because they associate
each word with a vector of real-valued features
107. • Exciting early use of backpropagation was for training RNNs
• For tasks involving sequential inputs
• e.g., speech and language, it is better to use RNNs
• RNNs process input sequence one element at a time
• maintain in their hidden units a state vector that implicitly contains history
of past elements in the sequence
• Same parameters (matrices U, V, W) are used at each time step
Backpropagation algorithm is applied to the unfolded graph of the computational
network to compute the derivative of total error (log-probability of generating
right sequence) wrt states si and all the parameters
108. • DBNs are Generative Models
• Provide estimates of both p(x|Ck) and p(Ck|x)
• Conventional neural networks are
discriminative
• Directly estimate p(Ck|x)
• Consist of several layers of Restricted
Boltzmann Machines (RBM)
• RBM
• A form of Markov Random Field
110. • Named after Boltzmann Distribution (Or Gibbs
Distribution)
• Gives probability that a system will be in a certain state given its
temperature
• Where E is energy and kT (constant of the distribution) is a product of
Boltzmann’s constant and thermodynamic temperature
• Energy of Boltzmann network
• Training
112. • What is dropout?
• Dropout as an ensemble method
• Mask for dropout training
• Bagging vs Dropout
• Prediction intractability
113. • Deep nets have many non-linear hidden layers
– Making them very expressive to learn complicated
relationships between inputs and outputs
– But with limited training data, many complicated
relationships will be the result of training noise
• So they will exist in the training set and not in test
set even if drawn from same distribution
• Many methods developed to reduce overfitting
– Early stopping with a validation set
– Weight penalties (L1 and L2 regularization)
– Soft weight sharing
114. • Best way to regularize a fixed size model is:
– Average the predictions of all possible settings of
the parameters
– Weighting each setting with the posterior probability
given the training data
• This would be the Bayesian approach
• Dropout does this using considerably less
computation
– By approximating an equally weighted geometric
mean of the predictions of an exponential number
of learned models that share parameters
115. • Bagging is a method of averaging over several
models to improve generalization
• Impractical to train many neural networks since
it is expensive in time and memory
– Dropout makes it practical to apply bagging to very
many large neural networks
• It is a method of bagging applied to neural networks
• Dropout is an inexpensive but powerful method
of regularizing a broad family of models
116. • Dropout trains an ensemble of all subnetworks
– Subnetworks formed by removing non-output units
from an underlying base network
• We can effectively remove a unit by multiplying
its output value by zero
– For networks based on performing a series of affine
transformations and non-linearities
– Needs some modification for radial basis functions
based on difference between unit state and a
reference value
117. • A simple way to prevent neural net overfitting
(a) A standard neural net with
two hidden layers
(b) A thinned net produced by
applying dropout, crossed units
have been dropped
Drop hidden and
visible units from net,
i.e., temporarily remove
it from the network with
all input/output connections.
Choice of units to drop is
random, determined by a
probability p, chosen by a
validation set, or equal to 0.5
119. Deep net in Keras
Validate on CIFAR-10 dataset
Network built had three convolution layers of
size 64, 128 and 256 followed by two densely
connected layers of size 512 and a dense output
layer of size 10
[Figures: accuracy vs dropout; loss vs dropout]
120. • In bagging we define k different models,
construct k different data sets by sampling from
the dataset with replacement, and train model i
on dataset i
• Dropout aims to approximate this process, but
with an exponentially large no. of neural
networks
121. • Remove non-output units
from base network.
• Remaining 4 units
yield 16 networks
• Here many networks have
no path from input to output
• Problem insignificant with
large networks
122. • To train with dropout we use minibatch
based learning algorithm that takes
small steps such as SGD
• At each step randomly sample a binary
mask
– Probability of including a unit is a
hyperparameter
• 0.5 for hidden units and 0.8 for input units
• We run forward & backward
propagation as usual
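One training-time forward pass with dropout can be sketched as below: a binary mask is sampled for the input units (kept with probability 0.8) and for the hidden units (kept with probability 0.5), each unit's value is multiplied by its mask entry, and propagation then proceeds as usual. The network shape, the weights and the ReLU hidden layer are assumptions for illustration, and the backward pass is omitted.

```python
import random


def sample_mask(n_units, keep_prob, rng):
    """Binary mask: 1 keeps a unit, 0 drops it for this training step."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(n_units)]


def forward_with_dropout(x, W1, W2, rng, keep_input=0.8, keep_hidden=0.5):
    mask_in = sample_mask(len(x), keep_input, rng)
    x_kept = [xi * m for xi, m in zip(x, mask_in)]          # drop input units
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x_kept))) for row in W1]
    mask_hid = sample_mask(len(hidden), keep_hidden, rng)
    h_kept = [h * m for h, m in zip(hidden, mask_hid)]      # drop hidden units
    y = [sum(w * v for w, v in zip(row, h_kept)) for row in W2]
    return y, mask_in, mask_hid


# Illustrative network: 3 inputs, 4 hidden units, 1 output.
rng = random.Random(0)
x = [1.0, -0.5, 2.0]
W1 = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2], [-0.3, 0.8, 0.1], [0.1, 0.1, 0.1]]
W2 = [[0.6, -0.4, 0.3, 0.2]]

y, mask_in, mask_hid = forward_with_dropout(x, W1, W2, rng)
# Keeping every unit (probability 1.0) reproduces the ordinary forward pass.
y_full, full_in, full_hid = forward_with_dropout(x, W1, W2, rng, 1.0, 1.0)
```

A fresh mask is drawn at every step, so each mini-batch effectively trains a different thinned sub-network that shares its weights with all the others.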
123. Feed-forward network
• Network with binary vector μ
whose elements correspond to
input and hidden units
• Elements of μ
• With probability of 1 being a hyperparameter
• 0.5 for hidden
• 0.8 for input
• Each unit is
• Multiplied by corresponding mask
• Forward prop as usual
• Equivalent to randomly selecting one of the
subnetworks of previous slide
124. • Suppose that mask vector μ specifies which
units to include
• Cost of the model is specified by J(θ,μ)
• Dropout training consists of minimizing E_μ[J(θ,μ)]
• Expected value contains exponential no. of
terms
• We can get an unbiased estimate of its gradient
by sampling values of μ
125. • Dropout training not same as bagging training
– In bagging, the models are all independent
– In dropout, models share parameters
• Models inherit subsets of parameters from parent
network
• Parameter sharing allows an exponential no. of
models with a tractable amount of memory
• In bagging each model is trained to
convergence on its respective training set
– In dropout, most models are not explicitly trained
• Fraction of sub-networks are trained for a single step
• Parameter sharing allows good parameter settings
126. • Bagging:
– Ensemble accumulates votes of members
– Process is referred to as inference
• Assume model needs to output a probability distribution
• In bagging, model i produces p^{(i)}(y|x)
• Prediction of ensemble is the mean
\[
\frac{1}{k} \sum_{i=1}^{k} p^{(i)}(y \mid x)
\]
• Dropout:
– Submodel defined by mask vector μ defines a
probability distribution p(y|x,μ)
– Arithmetic mean over all masks is
\[
\sum_{\mu} p(\mu)\, p(y \mid x, \mu)
\]
• Where p(μ) is the distribution used to sample μ at
training time
127. • Dropout prediction is
\[
\sum_{\mu} p(\mu)\, p(y \mid x, \mu)
\]
• It is intractable to evaluate due to
an exponential no. of terms
• We can approximate inference using sampling
– By averaging together the output from many
masks
– 10-20 masks are sufficient for good performance
• Even better approach, at the cost of a single
forward propagation:
– use geometric mean rather than arithmetic mean
of the ensemble members' predicted distributions
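For a model small enough to enumerate every mask, the two ways of combining sub-model predictions can be compared directly. The sketch below assumes a softmax classifier applied straight to three maskable inputs, each kept with probability 0.5, so all 8 masks are equally likely; the weights and input are made up. For this particular model, the renormalized geometric mean is known to coincide with a single forward pass in which each input is scaled by its keep probability, which the last line computes for comparison.

```python
import itertools
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x, mask, W, b):
    """p(y | x, mask): softmax over a linear function of the masked input."""
    kept = [xi * m for xi, m in zip(x, mask)]
    return softmax([sum(w * v for w, v in zip(row, kept)) + bk
                    for row, bk in zip(W, b)])


# Illustrative 2-class model with 3 inputs.
W = [[0.5, -1.0, 0.3], [-0.2, 0.8, 0.1]]
b = [0.1, -0.1]
x = [1.0, 2.0, -1.5]

masks = list(itertools.product([0, 1], repeat=len(x)))  # all 2^3 = 8 masks
p_mask = 1.0 / len(masks)                               # keep prob 0.5 -> uniform
dists = [predict(x, m, W, b) for m in masks]

# Arithmetic mean: sum over masks of p(mask) * p(y | x, mask).
arithmetic = [sum(p_mask * d[k] for d in dists) for k in range(len(b))]

# Geometric mean, renormalized so that it sums to 1.
geo_raw = [math.exp(sum(p_mask * math.log(d[k]) for d in dists))
           for k in range(len(b))]
geometric = [g / sum(geo_raw) for g in geo_raw]

# Single forward pass with every input scaled by its keep probability.
scaled = predict([0.5 * xi for xi in x], [1] * len(x), W, b)
```

With n maskable units the enumeration needs 2^n forward passes, which is why sampling a handful of masks or using the single scaled pass is used in practice.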