3. It's very hard to write programs that solve problems like recognizing a 3D
object from a novel viewpoint.
Even if we could write such a program, it would be very complicated.
It's hard to write a program that detects credit card fraud:
There are no rules that are both simple and reliable. We need to combine a large
number of weak rules.
Fraud is a moving target, so the program needs to keep changing.
4. Instead of writing a program for a specific task, we collect a lot of examples
that specify the correct output for a given input.
The machine learning algorithm then takes these examples and produces a
program that does the job.
The learned program looks very different from a typical hand-written program.
The program works well for new cases as well as the ones we trained it on.
If the data changes, the program can change by training on the new data.
Massive amounts of computation are now cheaper than paying someone to write the code.
5. To study how the brain actually works:
It is very big and complicated, so we need to use computer simulation.
To understand a style of parallel computation inspired by neurons and
their adaptive connections:
Very different from sequential computation.
Should be good at things the brain is good at. Ex. Vision.
Should be bad at things the brain is bad at. Ex. Computing 24 × 44.
To solve practical problems by using novel learning algorithms inspired by the
brain.
6. Revolutionary Idea: think of neural tissue as circuits performing mathematical
computation
7. Linear weighted sum of inputs
Non-linear, possibly stochastic transfer function
Learning rule
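A minimal sketch of this idealization in Python (the logistic transfer function and all names here are illustrative choices, not prescribed by the notes):

    import math

    def neuron_output(inputs, weights, bias):
        # Linear weighted sum of the inputs plus a bias term
        z = bias + sum(x * w for x, w in zip(inputs, weights))
        # Non-linear transfer function (here: the logistic function)
        return 1.0 / (1.0 + math.exp(-z))

    print(neuron_output([1.0, 0.5], [0.2, -0.4], 0.1))  # ~0.525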
8. Gross physical structure:
There is one axon that branches.
There is a dendritic tree that collects input from other neurons.
Axons typically contact dendritic trees at synapses.
A spike of activity in the axon causes charge to be injected into the post-synaptic neuron.
Spike generation:
There is an axon hillock that generates outgoing spikes whenever enough charge has flowed in at
the synapses to depolarize the cell membrane.
9. To model things we have to idealize them (ex. atoms).
Idealization removes complicated details that are not essential for understanding the
main principles.
It allows us to apply mathematics and to make analogies to other familiar systems.
It's worth understanding models that are known to be wrong. Ex. Neurons that
communicate real values rather than discrete spikes of activity.
10. These are simple but computationally limited
If we can make them learn, we may get insight into more complicated neurons
11. First compute the weighted sum of the inputs.
Then send out a fixed spike of activity if the weighted sum exceeds a threshold.
There are two equivalent ways to write the equations for a binary threshold neuron:
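Using a threshold θ, or equivalently a bias b = −θ (these are the standard forms):

$$z = \sum_i x_i w_i, \qquad y = \begin{cases} 1 & \text{if } z \ge \theta \\ 0 & \text{otherwise} \end{cases}$$

$$z = b + \sum_i x_i w_i, \qquad y = \begin{cases} 1 & \text{if } z \ge 0 \\ 0 & \text{otherwise} \end{cases}$$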
12. Also called linear threshold neurons.
They compute a linear weighted sum of their inputs.
The output is a non-linear function of the total input:
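In symbols (the standard rectified linear form):

$$z = b + \sum_i x_i w_i, \qquad y = \begin{cases} z & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}$$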
13. Gives a real-valued output that is a smooth and bounded function of the total input.
Typically, they use the logistic function.
They have nice derivatives, which make learning easy:
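The logistic unit and its derivative (standard forms):

$$z = b + \sum_i x_i w_i, \qquad y = \frac{1}{1 + e^{-z}}, \qquad \frac{dy}{dz} = y\,(1 - y)$$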
14. They use the same equations as a logistic unit.
They treat the output of the logistic as the probability of producing a spike in a short time window.
15. Supervised Learning
Learn to predict the output when given an input vector
Reinforcement Learning
Learn to select an action to maximize payoff
Unsupervised Learning
Discover a good internal representation of the input
16. Each training case consists of an input vector x and a target output t.
Regression: the target output is a real number or a whole vector of real numbers.
Ex. The price of a stock in six months' time.
Ex. The temperature at noon tomorrow.
Classification: the target output is a class label.
The simplest case is a choice between 1 and 0.
We can also have multiple alternative labels.
Working: we start by choosing a model class y = f(x; w).
A model class f is a way of using some numerical parameters w to map each input vector x
to a predicted output y.
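Learning then means adjusting the parameters w to reduce the discrepancy between the target t and the prediction y. A standard choice of discrepancy (my example; the notes do not fix one) is the squared error:

$$E = \frac{1}{2} \sum_{\text{training cases}} (t - y)^2$$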
17. In reinforcement learning, the output is an action or sequence of actions, and the only
supervisory signal is an occasional scalar reward.
The goal in selecting each action is to maximize the expected sum of future rewards.
Reinforcement learning is difficult:
The rewards are typically delayed, so it's hard to know where we went wrong.
A scalar reward does not supply much information.
18. The architecture of a neural network is the way in which the neurons are connected
to each other.
19. The most common type.
The first layer is the input and the last layer is the output.
If there is more than one hidden layer, we call it a
"deep" neural network.
They compute a series of transformations that
change the similarities between cases.
The activities of the neurons in each layer are a non-linear
function of the activities in the layer below.
20. These have directed cycles in their connection graph.
This means that you can sometimes get back to where you started by
following the arrows.
They can have complicated dynamics, and this can make them very
difficult to train.
They are more biologically realistic.
They have a natural way to model sequential data:
Equivalent to deep nets with one hidden layer per time slice.
They use the same weights at every time slice and get input at every time slice.
They have the ability to remember information in their hidden state
for a long time, but it's hard to train them to use this potential.
21. Like recurrent networks, but the connections between units are symmetrical (they have the same
weight in both directions).
Much easier to analyze than recurrent networks.
More restricted in what they can do, because they obey an energy function.
Ex. They cannot model cycles.
Symmetrically connected nets without hidden units are called Hopfield nets.
25. The space has one dimension for each weight.
A point in the space represents a particular setting of all the weights.
Each training case represents a hyperplane.
The weights must lie on one side of this hyperplane to get the answer correct.
30. Theorem:
If a problem is linearly separable, then
A perceptron will learn it
In a finite number of steps
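A minimal Python sketch of the perceptron learning procedure (the data and names are illustrative; the theorem guarantees this loop terminates when the data are linearly separable):

    def train_perceptron(cases, epochs=100):
        # cases: list of (input_vector, target) pairs with target 0 or 1
        n = len(cases[0][0])
        w, b = [0.0] * n, 0.0
        for _ in range(epochs):
            errors = 0
            for x, t in cases:
                y = 1 if b + sum(xi * wi for xi, wi in zip(x, w)) >= 0 else 0
                if y != t:
                    errors += 1
                    # Wrongly output 0: add the input to the weights.
                    # Wrongly output 1: subtract the input from the weights.
                    for i in range(n):
                        w[i] += (t - y) * x[i]
                    b += (t - y)
            if errors == 0:
                break  # for linearly separable data this is guaranteed
        return w, b

    # Ex. logical AND, which is linearly separable
    w, b = train_perceptron([([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)])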
34. Works fine for a single layer of trainable weights, but what about multi-layer networks?
38. In a perceptron, the weights are always getting closer to a good set of weights.
In a linear neuron, the outputs are always getting closer to the target outputs.
Why can't the perceptron convergence procedure be generalized to hidden layers?
The perceptron learning algorithm works by ensuring that every time the weights change, they
get closer to a generously feasible set of weights.
This type of guarantee cannot be extended to more complex networks.
For multi-layer nets we therefore show instead that the actual output values get closer to the
target values. This need not hold for perceptron learning, where the outputs can move away
from the target outputs even as the weights improve.
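For a linear neuron, the standard way to make the outputs approach the targets is the delta rule (my statement of it; consistent with what this slide describes), which changes each weight in proportion to the error:

$$\Delta w_i = \varepsilon\, x_i\, (t - y)$$

where ε is the learning rate, x_i the input, t the target output and y the actual output.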
39. Does the learning procedure eventually get the right answer?
There may be no perfect answer.
By making the learning rate small enough, we can get very close to the desired answer.
How quickly do the weights converge?
It can be very slow if the input dimensions are highly correlated.
41. Optimization is the selection of parameter values that are optimal in some desired sense.
Ex. Minimize an objective function over a dataset.
The parameters are the weights and biases.
Training neural nets is iterative and time consuming, hence it is in our interest to
reduce training time.
Methods (a gradient descent sketch follows below):
Gradient descent
Line search
Conjugate gradient search
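A minimal gradient descent sketch in Python, minimizing the squared error of a linear neuron on a toy dataset (the dataset and learning rate are illustrative assumptions):

    # Toy data: targets follow t = 2*x, so the optimal weight is w = 2
    cases = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

    w, rate = 0.0, 0.05
    for step in range(100):
        # Gradient of E = 1/2 * sum((t - w*x)^2) with respect to w
        grad = sum(-(t - w * x) * x for x, t in cases)
        w -= rate * grad  # step downhill along the steepest descent direction

    print(w)  # approaches 2.0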
47. The horizontal axes correspond to the weights and the vertical axis to the error.
For a linear neuron with squared error, the error surface is a quadratic bowl.
Vertical cross-sections are parabolas.
Horizontal cross-sections are ellipses.
48. The gradient is big in the direction in which we only want to travel a small distance.
The gradient is small in the direction in which we want to travel a large distance.
49. If the learning rate is big, the weights slosh to and fro across the ravine.
If the learning rate is too big, this oscillation diverges.
What we would like to achieve:
Move quickly in directions with small but consistent gradients.
Move slowly in directions with big, inconsistent gradients.
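One standard way to get this behaviour is the momentum method (mentioned later in these notes in the context of initialization): update the weights with a velocity that averages recent gradients, so that consistent gradients accumulate while oscillating ones cancel. In symbols (my addition, not on the slide):

$$v \leftarrow \alpha v - \varepsilon \frac{\partial E}{\partial w}, \qquad w \leftarrow w + v$$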
50. Straightforward, iterative, tractable, locally optimal descent in error.
Cannot avoid local minima and cannot escape them; it may overshoot them.
Cannot guarantee a scalable bound on time complexity.
The search direction is only locally optimal.
51. Escaping local minima is possible by random perturbation.
Stochastic gradient descent is a form of injecting randomness into gradient descent.
53. When applying machine learning to sequences, we often want to turn an input sequence into an
output sequence that lives in a different domain.
Ex. Turn a sequence of sound pressures into a sequence of word identities.
When there is no separate target sequence, we can get a teaching signal by trying to
predict the next term in the input sequence.
The target output sequence is then the input sequence with an advance of one step.
It's like predicting one pixel of an image from the other pixels, or one patch of an image from
another.
54. Autoregressive models
The output depends linearly on its own previous values (see the formula below).
Take previous terms and predict the next.
A weighted average of previous terms.
Feed-forward neural networks
Take in a few terms, put them through some hidden units and predict the next term.
The connections between units do not form a directed cycle.
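For a window of p previous terms, a linear autoregressive model has the standard form (the symbols here are illustrative):

$$\hat{x}_t = b + \sum_{k=1}^{p} a_k\, x_{t-k}$$

where the coefficients a_k are the learned weights.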
55. Recurrent means feeding back on itself.
Recurrent networks are powerful because they combine two properties:
Distributed hidden state: several different units can be active at once, hence they
can remember multiple values at once.
Non-linear dynamics: allows the hidden state to be updated in complicated ways.
56. They can oscillate: good for motor control.
They can settle to point attractors: good for retrieving memories.
They can behave chaotically: bad for information processing.
They can implement many small programs in parallel.
57. Recurrent backpropagation networks
Discrete time
Simple Recurrent Network – Elman net
Jordan net
Fixed-point attractor networks
Continuous time
Spin-glass models – Hopfield, Boltzmann
Interactive Activation Model – cognitive modeling
Competitive networks – self-organizing feature maps
61. Assume that there is a time delay of
one in using each connection
The recurrent net is just a layered
net that keeps reusing the same
weights
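A minimal sketch of this unrolling in Python: the same weights are reused at every time slice (the scalar hidden unit, the tanh nonlinearity and the names w_in, w_rec are illustrative assumptions):

    import math

    def rnn_forward(inputs, w_in, w_rec, h0=0.0):
        # One scalar hidden unit unrolled over time: every time slice is a
        # "layer" that reuses the same two weights, w_in and w_rec.
        h, states = h0, []
        for x in inputs:
            h = math.tanh(w_in * x + w_rec * h)
            states.append(h)
        return states

    print(rnn_forward([1.0, 0.0, 1.0], w_in=0.5, w_rec=0.9))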
62. We can specify inputs in several ways:
Specify the initial states of all the units.
Specify the initial states of a subset of the units.
Specify the states of the same subset of the units at every time step.
We can specify targets in several ways:
Specify the desired final activities of all the units.
Specify the desired activities of all the units for the last few steps.
Specify the desired activities of a subset of the units.
63. LIMITATIONS
The maximum number of digits must be decided in advance.
This cannot be generalized to long numbers, because it uses different weights for
each digit position.
65. The network has two input units and one output unit.
It is given two input digits at each time step.
The desired output at each step is the output for the column
that was provided as input two time steps ago:
It takes one time step to update the hidden units based on the input.
It takes another time step for the hidden units to cause the output.
66. There is a big difference between the forward pass and the backward pass.
In the forward pass, we use squashing functions (like the logistic) to prevent the
activity vectors from exploding.
The backward pass is completely linear: if you double the error derivatives at the
final layer, all the error derivatives will double.
67. What happens to the magnitude of the gradients as we backpropagate?
If the weights are small, the gradients shrink exponentially.
If the weights are big, the gradients grow exponentially.
Typical feed-forward nets can cope with these factors because they only have a few hidden
layers.
In an RNN trained on long sequences, the gradients can easily explode or vanish.
This can be avoided by initializing the weights very carefully.
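A tiny numerical illustration (my example): repeatedly multiplying a gradient by the same recurrent weight during backpropagation scales it exponentially with the number of time steps.

    for w in (0.5, 1.1):
        grad = 1.0
        for step in range(50):
            grad *= w  # one backprop step through the same recurrent weight
        print(w, grad)  # 0.5 -> ~8.9e-16 (vanishes), 1.1 -> ~117 (explodes)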
69. Long Short-Term Memory:
Make the RNN out of little modules that are designed to hold values for a long time.
Hessian-Free Optimization:
Deals with the vanishing gradient problem.
Ex. the HF optimizer.
Echo State Networks:
Initialize the connections so that the hidden state has a huge reservoir of weakly coupled oscillators.
Good initialization with momentum:
Initialize as in echo state networks, but then learn all the connections using momentum.
70. The dynamic state of a neural network is a short-term memory, which has to be converted into a
long-term memory to make the data last.
Very successful for tasks like recognizing handwriting.
Example considered: getting an RNN to remember things for a long time (like hundreds of
time steps).
Uses logistic and linear units:
Write gate: information gets in.
Keep gate: information is stored.
Read gate: information is extracted.
71. The circuit implements an analog memory cell.
A linear unit with a self-link of weight 1 will maintain its state.
Activate the write gate to store information.
Activate the read gate to retrieve information.
Backprop through the cell is possible because the logistic has nice derivatives.
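A minimal sketch of such a memory cell (the gates here are hard 0/1 switches for clarity; a real LSTM uses logistic gates so everything stays differentiable; all names are illustrative):

    class MemoryCell:
        def __init__(self):
            self.state = 0.0  # linear unit with a self-link of weight 1

        def step(self, value, write, keep, read):
            # keep gate: the self-link maintains the stored state
            # write gate: lets new information in
            self.state = self.state * keep + value * write
            # read gate: the stored value is only visible when reading
            return self.state * read

    cell = MemoryCell()
    cell.step(0.7, write=1, keep=0, read=0)          # store 0.7
    cell.step(0.0, write=0, keep=1, read=0)          # keep it for a step
    print(cell.step(0.0, write=0, keep=1, read=1))   # 0.7 is retrieved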
73. Perceptron
Make the early layers random and fixed.
We learn only the last layer, which is a linear model.
It uses the transformed inputs to predict the output.
Echo state network
Fix the input->hidden and hidden->hidden connections at random values.
Learn only the hidden->output connections.
Choose the random connections carefully.
80. The output units are said to be in competition for input patterns.
During training, the output unit that gives the highest activation for a given pattern is
declared the winner, and its weights are moved closer to the input pattern.
This is unsupervised learning.
Also called winner-takes-all:
One neuron wins over all others.
Only the winning neuron learns.
Hard learning: the weights of only the winner are updated.
Soft learning: the weights of the winner and its close associates are updated.
82. Produces a mapping from a multi-dimensional input space onto a lattice of clusters.
The mapping is topology-preserving.
Typically organized as a 1D or 2D lattice.
SOMs have a strong neurological basis:
Topology is preserved in cortical maps. Ex. If we touch parts of the body that are close together,
groups of cells that are also close together will fire.
The Kohonen SOM results from the synergy of three basic processes:
Competition
Cooperation
Adaptation
83. Each neuron in a SOM is assigned a weight vector
with the same dimensionality N as the input space.
Any given input pattern is compared to the weight vector of each neuron,
and the closest neuron is declared the winner.
The Euclidean norm is usually used to measure distance.
84. The activation of winning neuron is spread to
neurons in its immediate neighbourhood
This allows topologically close neurons to become
sensitive to similar patterns
The size of neighbourhood is initially large, but
shrinks over time
Large neighbourhood promotes a topology
preserving mapping
Smaller neighbourhood allows neurons to
specialize in later stages of training
85. During training, the winning neuron and its
topological neighbours are adapted to make their
weight vectors more similar to the input pattern
that caused the activation.
Neurons that are closer to the winner adapt
more heavily than neurons far away.
The magnitude of adaptation is controlled by the learning rate.
86. A neuron learns by shifting its weights from its inactive input nodes to its active input nodes.
The change Δw_ij applied to synaptic weight w_ij is

$$\Delta w_{ij} = \begin{cases} \alpha\,(x_i - w_{ij}), & \text{if neuron } j \text{ wins the competition} \\ 0, & \text{if neuron } j \text{ loses the competition} \end{cases}$$

where x_i is the input signal and α is the learning rate parameter.
The overall effect of the rule is to move the synaptic weight vector of the winning neuron
towards the input pattern.
The matching criterion is the Euclidean distance.
87. The Euclidean distance between the input vector X and the weight vector W_j is given by

$$d_j = \|X - W_j\| = \left[\sum_{i=1}^{n} (x_i - w_{ij})^2\right]^{1/2}$$

where x_i and w_ij are the ith elements of the vectors X and W_j, respectively.
To identify the winning neuron j_X that best matches the input vector X, we may apply the
following condition:

$$j_X = \arg\min_j \|X - W_j\|, \qquad j = 1, 2, \ldots, m$$
88. Suppose a 2D input vector X is presented to a three-neuron Kohonen network:

$$X = \begin{bmatrix} 0.52 \\ 0.12 \end{bmatrix}$$

The initial weight vectors are given by

$$W_1 = \begin{bmatrix} 0.27 \\ 0.81 \end{bmatrix}, \qquad W_2 = \begin{bmatrix} 0.42 \\ 0.70 \end{bmatrix}, \qquad W_3 = \begin{bmatrix} 0.43 \\ 0.21 \end{bmatrix}$$
89. We find the winning neuron using the minimum-distance Euclidean criterion:

$$d_1 = \sqrt{(x_1 - w_{11})^2 + (x_2 - w_{21})^2} = \sqrt{(0.52 - 0.27)^2 + (0.12 - 0.81)^2} = 0.73$$

$$d_2 = \sqrt{(x_1 - w_{12})^2 + (x_2 - w_{22})^2} = \sqrt{(0.52 - 0.42)^2 + (0.12 - 0.70)^2} = 0.59$$

$$d_3 = \sqrt{(x_1 - w_{13})^2 + (x_2 - w_{23})^2} = \sqrt{(0.52 - 0.43)^2 + (0.12 - 0.21)^2} = 0.13$$

Neuron 3 is the winner, and its weight vector is updated according to the competitive
learning rule (with learning rate α = 0.1):

$$\Delta w_{13} = \alpha\,(x_1 - w_{13}) = 0.1\,(0.52 - 0.43) = 0.01$$

$$\Delta w_{23} = \alpha\,(x_2 - w_{23}) = 0.1\,(0.12 - 0.21) = -0.01$$
90. The updated weight vector at iteration (p + 1) is determined as

$$W_3(p+1) = W_3(p) + \Delta W_3(p) = \begin{bmatrix} 0.43 \\ 0.21 \end{bmatrix} + \begin{bmatrix} 0.01 \\ -0.01 \end{bmatrix} = \begin{bmatrix} 0.44 \\ 0.20 \end{bmatrix}$$

The weight vector W_3 of the winning neuron 3 becomes closer to the input vector X with
each iteration.
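A minimal Python sketch reproducing this worked example (winner-takes-all only, with no neighbourhood function):

    import math

    x = [0.52, 0.12]                                  # input vector X
    ws = [[0.27, 0.81], [0.42, 0.70], [0.43, 0.21]]   # W1, W2, W3
    alpha = 0.1                                       # learning rate

    # Competition: the neuron whose weight vector is closest to X wins
    dists = [math.dist(x, w) for w in ws]
    winner = dists.index(min(dists))                  # index 2, i.e. neuron 3

    # Adaptation: move the winner's weight vector towards the input
    ws[winner] = [w + alpha * (xi - w) for xi, w in zip(x, ws[winner])]
    print(winner + 1, ws[winner])                     # 3 [~0.44, ~0.20]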
91. A Kohonen network with 100 neurons arranged in the form of a 2D lattice with 10 rows and
10 columns.
The network is required to classify 2D input vectors: each neuron should respond only to
input vectors occurring in its own region.
The network is trained with 1000 2D input vectors generated randomly in a square
region in the interval between -1 and +1.
The learning rate parameter is 0.1.
96. Serves as a content-addressable memory system with binary threshold nodes.
Provides a model for understanding human memory.
Used for storing memories as distributed patterns of activity.
The stable states are fixed-point attractors.
97. Two ways of updating:
Asynchronous: pick one neuron, calculate its weighted sum and update it immediately. Can be
done in a fixed order, or neurons can be picked at random.
Synchronous: the weighted sums are calculated for all neurons without updating any of them;
then all neurons are set to their new values.
Conditions on the weight matrix:
Symmetry: w_ij = w_ji
No self-connections: w_ii = 0
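A minimal sketch of asynchronous updating for a Hopfield net with binary threshold nodes (the ±1 states and the tiny weight matrix are illustrative assumptions):

    import random

    # Symmetric weights with zero diagonal (w_ij = w_ji, w_ii = 0)
    W = [[0, 1, -1],
         [1, 0, -1],
         [-1, -1, 0]]
    s = [1, -1, 1]  # initial binary states (+1/-1)

    for _ in range(20):
        i = random.randrange(len(s))              # pick one neuron at random
        total = sum(W[i][j] * s[j] for j in range(len(s)))
        s[i] = 1 if total >= 0 else -1            # update it immediately

    print(s)  # settles into a stable state, i.e. an energy minimum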
98. The global energy is a sum of contributions, each of which depends on one connection weight
and the binary states of two neurons:

$$E = -\sum_i s_i b_i - \sum_{i<j} s_i s_j w_{ij}$$

Here w_ij is the weight between the two neurons, s_i and s_j are the activities of the two
connected neurons, and b_i is the bias term.
100. Memories could be energy minima of a neural net.
The binary threshold decision rule can then be used to clean up incomplete or corrupted
memories.
Using energy minima to represent memories gives a content-addressable memory:
An item can be accessed by just knowing part of its content.