1/120
Supervised Learning I:
Perceptrons and LMS
Instructor: Tai-Yue (Jason) Wang
Department of Industrial and Information Management
Institute of Information Management
2/120
Two Fundamental Learning
Paradigms
 Non-associative
 an organism acquires the properties of a single
repetitive stimulus.
 Associative
 an organism acquires knowledge about the
relationship of either one stimulus to another, or one
stimulus to the organism’s own behavioral response to
that stimulus.
3/120
Examples of Associative
Learning(1/2)
 Classical conditioning
 Association of an unconditioned stimulus (US) with a
conditioned stimulus (CS).
 CS’s such as a flash of light or a sound tone produce
weak responses.
 US’s such as food or a shock to the leg produce a
strong response.
 With repeated presentation of the CS followed by the US, the CS begins to evoke the response of the US.
 Example: If a flash of light is always followed by
serving of meat to a dog, after a number of learning
trials the light itself begins to produce salivation.
4/120
Examples of Associative
Learning(2/2)
 Operant conditioning
 Formation of a predictive relationship between a
stimulus and a response.
 Example:
Place a hungry rat in a cage which has a lever on
one of its walls. Measure the spontaneous rate at
which the rat presses the lever by virtue of its
random movements around the cage. If the rat is
promptly presented with food when the lever is
pressed, the spontaneous rate of lever pressing
increases!
5/120
Reflexive and Declarative Learning
 Reflexive learning
 repetitive learning is involved and recall does not
involve any awareness or conscious evaluation.
 Declarative learning
 established by a single trial or experience and
involves conscious reflection and evaluation for its
recall.
 Constant repetition of declarative knowledge often manifests itself in reflexive form.
6/120
 Two distinct stages:
 short-term memory (STM)
 long-term memory (LTM)
 Inputs to the brain are
processed into STMs
which last at the most for a
few minutes.
Important Aspects of Human
Memory(1/4)
Recall process of these memories is distinct from the memories themselves.
[Figure: Input stimulus → short-term memory (STM) → download → long-term memory (LTM); a separate recall process retrieves the memory recalled]
7/120
 Information is
downloaded into
LTMs for more
permanent storage:
days, months, and
years.
 Capacity of LTMs is
very large.
Important Aspects of Human
Memory(2/4)
Recall process of these memories is distinct from the memories themselves.
[Figure: Input stimulus → short-term memory (STM) → download → long-term memory (LTM); a separate recall process retrieves the memory recalled]
8/120
Important Aspects of Human
Memory(3/4)
 Recall of recent memories is more easily
disrupted than that of older memories.
 Memories are dynamic and undergo
continual change and with time all
memories fade.
 STM results in
 physical changes in sensory receptors.
 simultaneous and cohesive reverberation of
neuron circuits.
9/120
Important Aspects of Human
Memory(4/4)
 Long-term memory involves
 plastic changes in the brain which take the form of
strengthening or weakening of existing synapses
 the formation of new synapses.
 Learning mechanism distributes the memory
over different areas
 Makes the memory robust to damage
 Permits the brain to work easily from partially
corrupted information.
 Reflexive and declarative memories may
actually involve different neuronal circuits.
10/120
Learning Algorithms(1/2)
 Define an architecture-dependent procedure to
encode pattern information into weights
 Learning proceeds by modifying connection
strengths.
 Learning is data driven:
 A set of input–output patterns derived from a
(possibly unknown) probability distribution.
 Output pattern might specify a desired system response for
a given input pattern
 Learning involves approximating the unknown function as
described by the given data.
11/120
Learning Algorithms(2/2)
 Learning is data driven:
 Alternatively, the data might comprise patterns that
naturally cluster into some number of unknown
classes
 Learning problem involves generating a suitable
classification of the samples.
12/120
Supervised Learning(1/2)
 Data comprises a set
of discrete samples
drawn from the
pattern space where
each sample relates
an input vector Xk ∈
Rn to an output vector
Dk ∈ Rp.
T = {(Xk, Dk)},  k = 1, …, Q
[Figure: An example function, f(x) = 3x^5 − 1.2x^4 − 12.27x^3 + 3.288x^2 + 7.182x, described by a set of noisy data points]
13/120
Supervised Learning(2/2)
 The set of samples
describe the behavior
of an unknown
function f : Rn → Rp
which is to be
characterized.
[Figure: An example function, f(x) = 3x^5 − 1.2x^4 − 12.27x^3 + 3.288x^2 + 7.182x, described by a set of noisy data points]
14/120
The Supervised Learning Procedure
 We want the system to generate an output Dk in
response to an input Xk, and we say that the system has
learnt the underlying map if a stimulus Xk’ close to Xk
elicits a response Sk’ which is sufficiently close to Dk.
The result is a continuous function estimate.
[Figure: Supervised learning loop in which the neural network maps input Xk to output Sk, and the error between the desired output Dk and Sk is fed back for network adaptation]
15/120
Unsupervised Learning(1/2)
 Unsupervised learning provides the system with
an input Xk, and allows it to self-organize its
weights to generate internal prototypes of sample
vectors. Note: There is no teaching input involved
here.
 The system attempts to represent the entire data
set by employing a small number of prototypical
vectors—enough to allow the system to retain a
desired level of discrimination between samples.
16/120
Unsupervised Learning(2/2)
 As new samples continuously buffer the system,
the prototypes will be in a state of constant flux.
 This kind of learning is often called adaptive
vector quantization
17/120
Clustering and Classification(1/3)
 Given a set of data
samples {Xi}, Xi ∈ Rn,
is it possible to identify
well defined “clusters”,
where each cluster
defines a class of
vectors which are
similar in some broad
sense?
[Figure: Two clusters of data points with their cluster centroids marked]
18/120
Clustering and Classification(2/3)
 Clusters help establish a
classification structure
within a data set that has
no categories defined in
advance.
 Classes are derived from
clusters by appropriate
labeling.
[Figure: Two clusters of data points with their cluster centroids marked]
19/120
Clustering and Classification(3/3)
 The goal of pattern
classification is to
assign an input pattern
to one of a finite
number of classes.
 Quantization vectors
are called codebook
vectors.
[Figure: Two clusters of data points with their cluster centroids marked]
20/120
Characteristics of Supervised and
Unsupervised Learning
21/120
General Philosophy of Learning:
Principle of Minimal Disturbance
 Adapt to reduce the output error for the
current training pattern, with minimal
disturbance to responses already learned.
22/120
Error Correction and Gradient
Descent Rules
 Error correction rules alter the weights of a
network using a linear error measure to reduce
the error in the output generated in response to
the present input pattern.
 Gradient rules alter the weights of a network
during each pattern presentation by employing
gradient information with the objective of
reducing the mean squared error (usually
averaged over all training patterns).
23/120
Learning Objective for TLNs(1/4)
 Augmented Input and
Weight vectors
 Objective: To design the
weights of a TLN to
correctly classify a given
set of patterns.
[Figure: TLN with augmented input: +1, X1, X2, …, Xn weighted by W0, W1, W2, …, Wn and summed at S]
Xk = (x0^k, x1^k, …, xn^k)^T ∈ R^(n+1), with x0^k = 1
Wk = (w0^k, w1^k, …, wn^k)^T ∈ R^(n+1)
24/120
Learning Objective for TLNs(2/4)
 Assumption: A training
set of following form is
given
 Each pattern Xk is tagged
to one of two classes C0
or C1 denoted by the
desired output dk being 0
or 1 respectively.
[Figure: TLN with augmented input: +1, X1, X2, …, Xn weighted by W0, W1, W2, …, Wn and summed at S]
T = {(Xk, dk)},  k = 1, …, Q,  Xk ∈ R^(n+1),  dk ∈ {0, 1}
25/120
Learning Objective for TLNs(3/4)
 Two classes identified by two possible signal
states of the TLN
 C0 by a signal S(yk) = 0, C1 by a signal S(yk) = 1.
 Given two sets of vectors X0 and X1
belonging to classes C0 and C1 respectively
the learning procedure searches a solution
weight vector WS that correctly classifies the
vectors into their respective classes.
26/120
Learning Objective for TLNs(4/4)
 Context: TLNs
 Find a weight vector WS such that for all Xk ∈ X1, S(yk) = 1; and for all Xk ∈ X0, S(yk) = 0.
 Positive inner products translate to a +1 signal and negative inner products to a 0 signal.
 Translates to saying that for all Xk ∈ X1, Xk^T WS > 0; and for all Xk ∈ X0, Xk^T WS < 0.
27/120
Pattern Space(1/2)
 Points that satisfy XTWS
= 0 define a separating
hyperplane in pattern
space.
 Two dimensional case:
 Pattern space points on
one side of this
hyperplane (with an
orientation indicated by
the arrow) yield positive
inner products with WS
and thus generate a +1
neuron signal.
28/120
Pattern Space(2/2)
 Two dimensional case:
 Pattern space points on
the other side of the
hyperplane generate a
negative inner product
with WS and consequently
a neuron signal equal to 0.
 Points in C0 and C1 are
thus correctly classified
by such a placement of
the hyperplane.
29/120
A Different View: Weight Space
(1/2)
 Weight vector is a variable
vector.
 WTXk = 0 represents a
hyperplane in weight space
 Always passes through the
origin since W = 0 is a
trivial solution of WTXk =
0.
30/120
A Different View: Weight Space
(2/2)
 Called the pattern
hyperplane of pattern
Xk.
 Locus of all points W such that WTXk = 0.
 Divides the weight space into two parts: one which generates a positive inner product WTXk > 0, and the other a negative inner product WTXk < 0.
31/120
Identifying a Solution Region from
Orientated Pattern Hyperplanes(1/2)
 For each pattern Xk in
pattern space there is a
corresponding
hyperplane in weight
space.
 For every point in
weight space there is a
corresponding
hyperplane in pattern
space.
[Figure: Weight space (W1, W2) with the pattern hyperplanes of X1, X2, X3, X4 and the solution region]
32/120
Identifying a Solution Region from
Orientated Pattern Hyperplanes(2/2)
 A solution region in
weight space with
four pattern
hyperplanes
 X1 = {X1, X2}
 X0 = {X3, X4}
[Figure: Weight space (W1, W2) with the four pattern hyperplanes and the solution region]
33/120
Requirements of the Learning
Procedure
 Linear separability guarantees the existence of a
solution region.
 Points to be kept in mind in the design of an
automated weight update procedure :
 It must consider each pattern in turn to assess the correctness of
the present classification.
 It must subsequently adjust the weight vector to eliminate a
classification error, if any.
 Since the set of all solution vectors forms a convex cone, the
weight update procedure should terminate as soon as it penetrates
the boundary of this cone (solution region).
34/120
Design in Weight Space(1/3)
 Assume: Xk ∈ X1 and Wk^T Xk is erroneously non-positive.
 For correct
classification, shift the
weight vector to some
position Wk+1 where
the inner product is
positive.
[Figure: Weight space showing the pattern hyperplane of Xk, with Wk on the Wk^T Xk < 0 side and the corrected Wk+1 on the Wk^T Xk > 0 side]
35/120
Design in Weight Space(2/3)
 The smallest perturbation in Wk that produces the desired change is the perpendicular distance from Wk onto the pattern hyperplane.
36/120
Design in Weight Space(3/3)
 In weight space, the
direction
perpendicular to the
pattern hyperplane is
none other than that
of Xk itself.
37/120
Simple Weight Change Rule:
Perceptron Learning Law
 If Xk ∈ X1 and Wk^T Xk < 0, add a fraction of the pattern to the weight Wk if one wishes the inner product Wk^T Xk to increase.
 Alternatively, if Xk ∈ X0 and Wk^T Xk is erroneously non-negative, we will subtract a fraction of the pattern from the weight Wk in order to reduce this inner product.
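A minimal sketch of this rule for one presented pattern (the variable names are illustrative assumptions, not from the slides): x is an augmented pattern, d its desired class in {0, 1}, w the current weight vector and eta the fraction added or subtracted.

% One Perceptron correction step (illustrative sketch)
s = double(x'*w > 0);        % TLN signal: 1 for a positive inner product, 0 otherwise
if d == 1 && s == 0
    w = w + eta*x;           % add a fraction of the pattern to increase x'*w
elseif d == 0 && s == 1
    w = w - eta*x;           % subtract a fraction of the pattern to reduce x'*w
end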
38/120
Weight Space Trajectory
 The weight space
trajectory
corresponding to the
sequential presentation
of four patterns with
pattern hyperplanes as
indicated:
 X1 = {X1, X2} and X0 = {X3, X4}
[Figure: Weight space trajectory generated by sequential presentation of the four patterns, ending in the solution region]
39/120
Linear Containment
 Consider the set X0', in which each element of X0 is negated.
 Given a weight vector Wk, for any Xk ∈ X1 ∪ X0', Xk^T Wk > 0 implies correct classification and Xk^T Wk < 0 implies incorrect classification.
 X' = X1 ∪ X0' is called the adjusted training set.
 Assumption of linear separability guarantees the existence of a solution weight vector WS, such that Xk^T WS > 0 ∀ Xk ∈ X'.
 We say X' is a linearly contained set.
40/120
Recast of Perceptron Learning with
Linearly Contained Data
 Since Xk ∈ X', a misclassification of Xk will add ηk Xk to Wk.
 For Xk ∈ X0', Xk actually represents the negative of the original vector.
 Therefore addition of ηk Xk to Wk actually amounts to subtraction of the original vector from Wk.
41/120
Perceptron Algorithm:
Operational Summary
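The tabular operational summary on this slide did not survive extraction. The following MATLAB sketch is a hedged reconstruction of the procedure it summarizes (X, d, eta and max_epochs are assumed to be supplied by the user): cycle through the training patterns, correcting the weights on every misclassification, until a full pass produces no errors.

% Perceptron learning loop (sketch): X holds augmented patterns as columns, d holds 0/1 labels
w = zeros(size(X,1),1);                     % initial weight vector
for epoch = 1:max_epochs
    errors = 0;
    for k = 1:size(X,2)
        s = double(X(:,k)'*w > 0);          % TLN output for pattern k
        if s ~= d(k)
            w = w + eta*(d(k) - s)*X(:,k);  % +eta*Xk when d=1,s=0; -eta*Xk when d=0,s=1
            errors = errors + 1;
        end
    end
    if errors == 0, break; end              % all patterns correctly classified: stop
end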
42/120
Perceptron Convergence
Theorem
 Given: A linearly contained training set X’ and
any initial weight vectorW1.
 Let SW be the weight vector sequence generated
in response to presentation of a training
sequence SX upon application of Perceptron
learning law. Then for some finite index k0 we
have: Wk0 = Wk0+1 = Wk0+2 = ・ ・ ・ = WS
as a solution vector.
 See the text for detailed proofs.
43/120
Hand-worked Example
[Figure: Binary threshold neuron with inputs +1, X1, X2 and weights W0, W1, W2 feeding the summation/threshold unit S]
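The worked example on this slide is an image and did not survive extraction. As an illustrative stand-in (an assumption, not necessarily the example worked in class), the same two-input binary threshold neuron can be trained on the AND function:

% Hypothetical hand-worked setup: AND function with augmented patterns [1; x1; x2]
X = [1 1 1 1; 0 0 1 1; 0 1 0 1];            % columns: (0,0), (0,1), (1,0), (1,1)
d = [0 0 0 1];                              % AND truth values
w = [0; 0; 0]; eta = 1;
for epoch = 1:10
    for k = 1:4
        s = double(X(:,k)'*w > 0);          % threshold neuron output
        w = w + eta*(d(k) - s)*X(:,k);      % Perceptron correction (no change when s == d)
    end
end
disp(w')   % with this presentation order the run settles at w = [-2 2 1]

Any weight vector with a sufficiently negative bias W0 separates AND; the run above is just one such solution.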
44/120
Classroom Exercise(1/2)
45/120
Classroom Exercise(2/2)
46/120
MATLAB Simulation
[Figure (a): Hyperplane movement in pattern space during Perceptron learning for the points (0,0), (1,0), (0,1), (1,1), shown at iterations k = 5, 15, 25, 35]
[Figure (b): Weight space trajectory from W0 = (0, 0, 0) to the solution WS = (−4, 3, 2)]
47/120
Perceptron Learning and Non-
separable Sets
 Theorem:
 Given a finite set of training patterns X, there exists a number M such that if we run the Perceptron learning algorithm beginning with any initial set of weights, W1, then any weight vector Wk produced in the course of the algorithm will satisfy ||Wk|| ≤ ||W1|| + M.
48/120
Two Corollaries(1/2)
 If, in a finite set of training patterns X, each pattern Xk has integer (or rational) components xi^k, then the Perceptron learning algorithm will visit a finite set of distinct weight vectors Wk.
49/120
Two Corollaries(2/2)
 For a finite set of training patterns X, with individual patterns Xk having integer (or rational) components xi^k, the Perceptron learning algorithm will, in finite time, produce a weight vector that correctly classifies all training patterns iff X is linearly separable, or leave and re-visit a specific weight vector iff X is linearly non-separable.
50/120
Handling Linearly Non-separable
Sets: The Pocket Algorithm(1/2)
 Philosophy: Incorporate positive reinforcement in
a way to reward weights that yield a low error
solution.
 Pocket algorithm works by remembering the
weight vector that yields the largest number of
correct classifications on a consecutive run.
51/120
Handling Linearly Non-separable
Sets: The Pocket Algorithm(2/2)
 This weight vector is kept in the “pocket”, and
we denote it as Wpocket .
 While updating the weights in accordance with
Perceptron learning, if a weight vector is
discovered that has a longer run of consecutively
correct classifications than the one in the pocket,
it replaces the weight vector in the pocket.
52/120
Pocket Algorithm:
Operational Summary(1/2)
53/120
Pocket Algorithm:
Operational Summary(2/2)
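The operational summary tables for the pocket algorithm are images and are not reproduced here. A hedged MATLAB sketch of the idea described above (X, d, eta and max_iters assumed):

% Pocket algorithm (sketch): Perceptron updates plus a pocket that remembers the
% weights with the longest run of consecutive correct classifications so far
w = zeros(size(X,1),1); w_pocket = w;
run_len = 0; best_run = 0;
for iter = 1:max_iters
    i = randi(size(X,2));                    % present a randomly chosen pattern
    s = double(X(:,i)'*w > 0);
    if s == d(i)
        run_len = run_len + 1;
        if run_len > best_run                % longer correct run than the pocketed weights
            best_run = run_len;
            w_pocket = w;                    % replace the weight vector in the pocket
        end
    else
        w = w + eta*(d(i) - s)*X(:,i);       % usual Perceptron correction
        run_len = 0;
    end
end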
54/120
Pocket Convergence Theorem
 Given a finite set of training examples, X, and a probability p < 1, there exists an integer k0 such that after any k > k0 iterations of the pocket algorithm, the probability that the pocket weight vector Wpocket is optimal exceeds p.
55/120
Linear Neurons and Linear Error
 Consider a training set of the form T = {Xk, dk}, Xk ∈ Rn+1, dk ∈ R.
 To allow the desired output to vary smoothly or continuously over some interval, consider a linear signal function: sk = yk = Xk^T Wk
 The linear error ek due to a presented training pair (Xk, dk) is the difference between the desired output dk and the neuronal signal sk: ek = dk − sk = dk − Xk^T Wk
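In code form this is a one-line computation per pattern (names assumed, not from the slides):

sk = Xk'*Wk;     % linear signal for the augmented pattern Xk
ek = dk - sk;    % linear error driving the LMS weight updates that follow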
56/120
Operational Details of α-LMS
 α-LMS error correction is proportional to the error itself.
 Each iteration reduces the error by a factor of η.
 η controls the stability and speed of convergence.
 Stability is ensured if 0 < η < 2.
[Figure: Linear neuron with inputs +1, X1^k, …, Xn^k, weights W0, W1^k, …, Wn^k and linear signal Sk = Xk^T Wk]
57/120
α-LMS Works with Normalized Training Patterns
[Figure: Weight space geometry of the α-LMS update: the correction from Wk to Wk+1 is made along the direction of the presented pattern xk]
58/120
α-LMS: Operational Summary
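The α-LMS operational summary table is not reproduced here; a hedged sketch of a single α-LMS iteration (names assumed) is:

% alpha-LMS step: error correction proportional to the error, applied along the
% pattern direction normalized by its squared length; stable for 0 < eta < 2
sk = Xk'*Wk;                          % linear signal
ek = dk - sk;                         % linear error
Wk = Wk + eta*ek*Xk/(Xk'*Xk);         % update along Xk, normalized by ||Xk||^2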
59/120
MATLAB Simulation Example
 The synthetic data set shown in the figure is generated by artificially scattering points around the straight line y = 0.5x + 0.333, producing a scatter of 200 points in a ±0.1 interval in the y direction.
[Figure: The fifth-order function f = 3x(x−1)(x−1.9)(x+0.7)(x+1.8) with a scatter of noisy data points]
60/120
MATLAB Simulation Example
 This is achieved by first
generating a random
scatter in the interval
[0,1].
 Then stretching it to the
interval [−1, 1], and
finally scaling it to ±0.1
61/120
Computer Simulation of α-LMS Algorithm
[Figure: Straight-line fit to the scattered data after iteration 1, iteration 10, and iteration 50 (MSE = 0.04)]
62/120
A Stochastic Setting
 Assumption that the training set T is well defined
in advance is incorrect when the setting is
stochastic.
 In such a situation, instead of deterministic
patterns, we have a sequence of samples {(Xk,
dk)} assumed to be drawn from a statistically
stationary population or process.
 For adjustment of the neuron weights in response
to some pattern-dependent error measure, the
error computation has to be based on the
expectation of the error over the ensemble.
63/120
Definition of Mean Squared Error
(MSE)(1/2)
 We introduce the square error on a pattern Xk as
 Assumption: The weights are held fixed at Wk
while computing the expectation.
εk = (1/2) ek² = (1/2) (dk − Xk^T Wk)²   (1)
   = (1/2) dk² − dk Xk^T Wk + (1/2) Wk^T Xk Xk^T Wk   (2)
64/120
Definition of Mean Squared Error
(MSE)(2/2)
 The mean-squared error can now be computed by
taking the expectation on both sides of (2):
ε = E[εk] = (1/2) E[dk²] − E[dk Xk^T] Wk + (1/2) Wk^T E[Xk Xk^T] Wk   (3)
65/120
 To find the optimal weight vector that
minimizes the mean-square error.
Our problem…
66/120
Cross Correlations(1/2)
 For convenience of expression we define the
pattern vector P as the cross-correlation between
the desired scalar output, dk, and the input vector,
Xk
 and the input correlation matrix, R, as
P = E[dk Xk] = E[(dk, dk x1^k, …, dk xn^k)^T]   (4)
R = E[Xk Xk^T], the (n+1) × (n+1) matrix whose (i, j)-th entry is E[xi^k xj^k], with x0^k = 1   (5)
67/120
Cross Correlations (2/2)
 Using Eqns. (4)-(5), we rewrite the MSE
expression of Eqn. (3) succinctly as
 Note that since the MSE is a quadratic
function of the weights it represents a bowl
shaped surface in the (n+1) x 1 weight—
MSE cross space.
ε = (1/2) E[dk²] − P^T Wk + (1/2) Wk^T R Wk   (6)
68/120
 First compute the gradient by straightforward
differentiation which is a linear function of
weights
 To find the optimal set of weights, Ŵ, simply set ∇ε = 0, which yields
Finding the Minimum Error(1/2)
∇ε = (∂ε/∂w0, …, ∂ε/∂wn)^T = −P + R W   (7)
R Ŵ = P   (8)
69/120
 This system of equations (8) is called the Wiener-Hopf system and its solution is the Wiener solution or the Wiener filter:
Ŵ = R⁻¹ P   (9)
 Ŵ is the point in weight space that represents the minimum mean-square error εmin.
Finding the Minimum Error (2/2)
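Given sample estimates of R and P, the Wiener solution is a single linear solve; a minimal sketch (the full simulation later in the deck performs the same computation with inv(R)*P):

W_hat = R \ P;                        % solve R*W_hat = P for the Wiener solution
e_min = 0.5*Ed2 - 0.5*P'*W_hat;       % Eqn. (12), with Ed2 an estimate of E[dk^2] (assumed)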
70/120
Computing the Optimal Filter(1/2)
 First compute R⁻¹ and P. Substituting Ŵ from Eqn. (9) into Eqn. (6) yields:
εmin = (1/2) E[dk²] − P^T Ŵ + (1/2) Ŵ^T R Ŵ   (10)
     = (1/2) E[dk²] − P^T R⁻¹ P + (1/2) (R⁻¹ P)^T R (R⁻¹ P)   (11)
     = (1/2) E[dk²] − (1/2) P^T Ŵ   (12)
71/120
Computing the Optimal Filter(2/2)
 For the treatment of weight update procedures we reformulate the expression for mean-square error in terms of the deviation V = W − Ŵ of the weight vector from the Wiener solution.
72/120
 Substituting W = V + Ŵ into Eqn. (6):
ε = (1/2) E[dk²] − P^T (V + Ŵ) + (1/2) (V + Ŵ)^T R (V + Ŵ)   (13)
  = (1/2) E[dk²] − P^T Ŵ − P^T V + (1/2) (V^T R V + V^T R Ŵ + Ŵ^T R V + Ŵ^T R Ŵ)   (14)
  = εmin + (1/2) V^T R V   (15)
  = εmin + (1/2) (W − Ŵ)^T R (W − Ŵ)   (16)
Computing R(1/2)
73/120
 Note that since the mean-square error is non-negative, we must have V^T R V ≥ 0. This implies that R is positive semi-definite. Usually, however, R is positive definite.
Computing R(2/2)
74/120
 Assume that R has distinct eigenvalues λi. Then we can construct a matrix Q whose columns are the corresponding eigenvectors ξi of R:
Q = (ξ0, ξ1, …, ξn)   (17)
 R can be diagonalized using an orthogonal similarity transformation as follows. Having constructed Q, and knowing that
RQ = (λ0 ξ0, λ1 ξ1, …, λn ξn)   (18)
we have
Q⁻¹ R Q = D = diag(λ0, λ1, …, λn)   (19)
Diagonalization of R
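A quick numerical check of this diagonalization (sketch; R is assumed symmetric, as any correlation matrix E[Xk Xk^T] is):

[Q, D] = eig(R);             % columns of Q are eigenvectors, D = diag(lambda_0,...,lambda_n)
max(max(abs(Q'*R*Q - D)))    % should be numerically zero, confirming Q'*R*Q = D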
75/120
 It is usual to choose the eigenvectors of R to be orthonormal, so that QQ^T = I and Q⁻¹ = Q^T.
 Then R = QDQ⁻¹ = QDQ^T.
 From Eqn. (15) we know that the shape of ε is a bowl-shaped paraboloid with a minimum at the Wiener solution (V = 0, the origin of V-space).
 Slices of ε parallel to the W space yield elliptic constant-error contours which define the weights in weight space that share the specific value of the square error (say εc) at which the slice is made:
εc = εmin + (1/2) V^T R V = constant   (20)
Some Observations
76/120
 Also note from Eqn. (15) that we can compute the MSE gradient in V-space as
∇ε = R V   (21)
which defines a family of vectors in V-space.
 Exactly n+1 of these pass through the origin of V-space, and these are the principal axes of the ellipse.
Eigenvectors of R(1/2)
77/120
 However, vectors passing through the origin must take the form λV'. Therefore, for the principal axes V',
R V' = λ V'   (22)
 Clearly, V' is an eigenvector of R.
Eigenvectors of R(2/2)
78/120
Steepest Descent Search with Exact
Gradient Information
 Steepest descent search
uses exact gradient
information available
from the mean-square
error surface to direct
the search in weight
space.
 The figure shows a projection of the square-error function onto a single weight coordinate wi^k.
[Figure: Square error ε versus the weight component wi^k, with the minimum at ŵi]
79/120
Steepest Descent Procedure
Summary(1/2)
 Provide an appropriate weight increment to wi^k to push the error towards the minimum, which occurs at ŵi.
 Perturb the weight in a direction that depends on which side of the optimal weight ŵi the current weight value wi^k lies.
 If the weight component wi^k lies to the left of ŵi, say at wi1^k, where the error gradient is negative (as indicated by the tangent), we need to increase wi^k.
80/120
Steepest Descent Procedure
Summary(2/2)
 If wi^k is on the right of ŵi, say at wi2^k, where the error gradient is positive, we need to decrease wi^k.
 This rule is summarized in the following statement:
If ∂ε/∂wi^k > 0, then wi^k > ŵi; decrease wi^k.
If ∂ε/∂wi^k < 0, then wi^k < ŵi; increase wi^k.
81/120
 It follows logically therefore, that the weight component should be updated in proportion with the negative of the gradient:
wi^(k+1) = wi^k + η (−∂ε/∂wi^k),   i = 0, 1, …, n   (27)
Weight Update Procedure(1/3)
82/120
 Vectorially we may write
Wk+1 = Wk + η (−∇ε)   (28)
where we now re-introduce the iteration time dependence into the weight components:
∇ε = (∂ε/∂w0^k, …, ∂ε/∂wn^k)^T   (29)
Weight Update Procedure(2/3)
83/120
 Equation (28) is the steepest descent update
procedure. Note that steepest descent uses
exact gradient information at each step to
decide weight changes.
Weight Update Procedure(3/3)
84/120
Convergence of Steepest Descent –
1(1/2)
 Question: What can one say about the stability of the algorithm? Does it converge for all values of η?
 To answer this question consider the following series of substitutions and transformations. From Eqns. (28) and (21):
Wk+1 = Wk + η (−∇ε)   (30)
     = Wk − η R Vk   (31)
     = Wk − η R (Wk − Ŵ)   (32)
     = (I − η R) Wk + η R Ŵ   (33)
85/120
Convergence of Steepest Descent –
1(2/2)
 Transforming Eqn. (33) into V-space (the principal coordinate system) by substituting Vk + Ŵ for Wk yields:
Vk+1 = (I − η R) Vk   (34)
86/120
 Rotation to the principal axes of the elliptic contours can be effected by using V = Q V':
Q V'k+1 = (I − η R) Q V'k   (35)
or
V'k+1 = Q⁻¹ (I − η R) Q V'k   (36)
      = (I − η D) V'k   (37)
Steepest Descent Convergence – 2(1/2)
87/120
 where D is the diagonal eigenvalue matrix. Recursive application of Eqn. (37) yields:
V'k = (I − η D)^k V'0   (38)
 It follows from this that for stability and convergence of the algorithm:
lim(k→∞) (I − η D)^k = 0   (39)
Steepest Descent Convergence – 2(2/2)
88/120
This requires that
lim(k→∞) (1 − η λmax)^k = 0   (40)
λmax being the largest eigenvalue of R, or equivalently
0 < η < 2/λmax   (41)
Steepest Descent Convergence – 3(1/2)
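The bound of Eqn. (41) is easy to compute once R is known (sketch):

lambda_max = max(eig(R));    % largest eigenvalue of the input correlation matrix
eta_max = 2/lambda_max;      % steepest descent converges for 0 < eta < eta_max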
89/120
If this condition is satisfied then we have
lim(k→∞) V'k = 0   (42)
or
lim(k→∞) Wk = lim(k→∞) (Q V'k + Ŵ) = Ŵ   (43)
 Steepest descent is guaranteed to converge to the Wiener solution as long as the learning rate is maintained within the limits defined by Eqn. (41).
Steepest Descent Convergence – 3(2/2)
90/120
 This simulation example employs the fifth-order function data scatter with the data shifted in the y direction by 0.5. Consequently, the values of R, P and the Wiener solution are respectively:
R = [ 1       0.500
      0.500   1.61 ],   P = (0.8386, 1.4625)^T   (44)
Ŵ = R⁻¹ P = (1.5303, −1.3834)^T   (45)
Computer Simulation Example(1/2)
91/120
 Exact gradient information is available
since the correlation matrix R and the cross-
correlation matrix P are known.
 The weights are updated using the equation:
Computer Simulation
Example(2/2)
Wk+1 = Wk + 2η R (Ŵ − Wk)   (46)
92/120
eta = .01; % Set learning rate
R=zeros(2,2); % Initialize correlation matrix
X = [ones(1,max_points);x]; % Augment input vectors
P = (sum([d.*X(1,:); d.*X(2,:)],2))/max_points; % Cross-correlations
D = (sum(d.^2))/max_points; % Expectation of the squared targets, E[d^2]
for k =1:max_points
R = R + X(:,k)*X(:,k)'; % Compute R
end
R = R/max_points;
weiner=inv(R)*P; % Compute the Wiener solution
errormin = D - P'*inv(R)*P; % Find the minimum error
MATLAB Simulation
Example(1/3)
93/120
shift1 = linspace(-12,12, 21); % Generate a weight space matrix
shift2 = linspace(-9,9, 21);
for i = 1:21 % Compute a weight matrix about
shiftwts(1,i) = weiner(1)+shift1(i); % Wiener solution
shiftwts(2,i) = weiner(2)+shift2(i);
end
for i=1:21 % Compute the error matrix
for j = 1:21 % to plot the error contours
error(i,j) = sum((d - (shiftwts(1,i) + x.*shiftwts(2,j))).^2);
end
end
error = error/max_points;
figure; hold on; % Plot the error contours
plot(weiner(1),weiner(2),'*k') % Labelling statements not shown
MATLAB Simulation Example (2/3)
94/120
w = 10*(2*rand(2,1)-1); % Randomize weights
w0 = w; % Remember the initial weights
for loop = 1:500 % Perform descent for 500 iters
w = w + eta*(-2*(R*w-P));
wts1(loop)=w(1); wts2(loop)=w(2);
end
% Set up weights for plotting
wts1=[w0(1) wts1]; wts2=[w0(2) wts2];
plot(wts1,wts2,'r') % Plot the weight trajectory
MATLAB Simulation Example (3/3)
95/120
Smooth Trajectory towards the Wiener Solution
 Steepest descent uses exact gradient information to search for the Wiener solution in weight space.
[Figure: Error contours in (w0, w1) space with a smooth steepest descent trajectory from W0 = (−3.9, 6.27)^T to the Wiener solution (1.53, −1.38)]
96/120
μ-LMS: Approximate Gradient Descent(1/2)
 The problem with steepest descent is that true gradient information is only available in situations where the data set is completely specified in advance.
 It is then possible to compute R and P exactly, and thus the true gradient at iteration k: ∇εk = R Wk − P
97/120
μ-LMS: Approximate Gradient Descent(2/2)
 However, when the data set comprises a
random stream of patterns (drawn from a
stationary distribution), R and P cannot be
computed accurately. To find a correct
approximation one might have to examine
the data stream for a reasonably large period
of time and keep averaging out.
 How long should we examine the stream to
get reliable estimates of R and P ?
98/120
Definition
 The μ-LMS algorithm is convergent in the
mean if the average of the weight vector Wk
approaches the optimal solution Wopt as the
number of iterations k, approaches infinity:
 E[Wk] → Wopt as k → ∞
99/120
 The gradient computation modifies to:
∇̃εk = (∂εk/∂w0^k, …, ∂εk/∂wn^k)^T = ek ∂ek/∂Wk = −ek Xk   (47)
 where ek = dk − sk and sk = Xk^T Wk, since we are dealing with linear neurons. Note therefore that the recursive update equation then becomes
Wk+1 = Wk + μ (−∇̃εk) = Wk + μ (dk − sk) Xk   (48)
     = Wk + μ ek Xk   (49)
μ-LMS employs ∇̃εk for ∇εk (1/2)
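A hedged sketch of Eqn. (49) applied to a stream of presented patterns (X, d, mu and max_iters assumed; one update per pattern, using the instantaneous gradient estimate −ek Xk):

% mu-LMS loop (sketch)
W = zeros(size(X,1),1);
for k = 1:max_iters
    j  = mod(k-1, size(X,2)) + 1;     % cycle through (or stream) the patterns
    sk = X(:,j)'*W;                   % linear signal for the presented pattern
    ek = d(j) - sk;                   % instantaneous linear error
    W  = W + mu*ek*X(:,j);            % Eqn. (49): Wk+1 = Wk + mu*ek*Xk
end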
100/120
 What value does the long-term average of ∇̃εk converge to? Taking the expectation of both sides of Eqn. (47):
E[∇̃εk] = −E[ek Xk]   (50)
        = −E[(dk − Xk^T Wk) Xk]   (51)
        = R Wk − P = ∇ε   (52)
μ-LMS employs ∇̃εk for ∇εk (2/2)
101/120
 Since the long-term average of ∇̃εk approaches ∇ε, we can safely use ∇̃εk as an unbiased estimate. That's what makes μ-LMS work!
 Since ∇̃εk approaches ∇ε in the long run, one could keep collecting ∇̃εk for a sufficiently large number of iterations (while keeping the weights fixed), and then make a weight change collectively for all those iterations together.
Observations(1/8)
102/120
 If the data set is finite (deterministic), then one can compute ∇ε accurately by first collecting the different gradients ∇̃εk over all training patterns Xk for the same set of weights. This accurate measure of the gradient could then be used to change the weights. In this situation μ-LMS is identical to the steepest descent algorithm.
Observations(2/8)
103/120
 Even if the data set is deterministic, we still use ∇̃εk to update the weights. After all, if the data set becomes large, collection of all the gradients becomes expensive in terms of storage. It is much easier to just go ahead and use ∇̃εk.
 Be clear about the approximation made: we are estimating the true gradient (which should be computed from E[εk]) by a gradient computed from the instantaneous sample error εk. Although this may seem to be a rather drastic approximation, it works.
Observations(3/8)
104/120
 In the deterministic case we can justify this as follows: if the learning rate μ is kept small, the weight change in each iteration will be small, and consequently the weight vector W will remain “somewhat constant” over Q iterations, where Q is the number of patterns in the training set.
Observations(4/8)
105/120
 Of course this is provided that Q is a small number! To see this, observe the total weight change ΔW over Q iterations following the k-th iteration:
ΔW = −μ Σ(i=0 to Q−1) ∂εk+i/∂Wk+i   (53)
   ≈ −μ Σ(i=0 to Q−1) ∂εk+i/∂Wk   (54)
   = −μ Q ∂/∂Wk [ (1/Q) Σ(i=0 to Q−1) εk+i ]   (55)
   ≈ −μ Q ∂ε/∂Wk   (56)
Observations(5/8)
106/120
 where ε denotes the mean-square error. Thus the weight updates follow the true gradient on average.
Observations(6/8)
107/120
 Observe that steepest descent search is guaranteed to find the Wiener solution provided the learning rate condition (41) is satisfied.
 Now since ∇̃εk is an unbiased estimate of ∇ε, one can expect that μ-LMS too will search out the Wiener solution. Indeed it does, but not smoothly. This is to be expected since we are only using an estimate of the true gradient for immediate use.
Observations(7/8)
108/120
 Although -LMS and - LMS are similar
algorithms, -LMS works on the normalizing
training set. What this simply means is that -
LMS also uses gradient information, and will
eventually search out the Weiner solution-of the
normalized training set. However, in one case
the two algorithms are identical: the case when
input vectors are bipolar. (Why?)
Observations(8/8)
109/120
 Definition 0.1 The μ-LMS algorithm is convergent in the mean if the average of the weight vector Wk approaches the optimal solution Ŵ as the number of iterations k approaches infinity:
E[Wk] → Ŵ as k → ∞   (57)
 Definition 0.2 The μ-LMS algorithm is convergent in the mean square if the average of the squared error εk approaches a constant as the number of iterations k approaches infinity:
E[εk] → constant as k → ∞   (58)
μ-LMS Algorithm: Convergence in the Mean (1)(1/2)
110/120
 Convergence in the mean square is a
stronger criterion than convergence in the
mean. In this section we discuss
convergence in the mean and merely state
the result for convergence in the mean
square.
μ-LMS Algorithm:
Convergence in the Mean (1)(2/2)
111/120
 Consider the -LMS weight update equation:
-LMS Algorithm:
Convergence in the Mean (2)(1/2)
 
 
  (63)
-
I
(62)
-
(61)
(60)
(59)
1
k
k
k
T
k
k
k
k
k
T
k
k
k
k
k
T
k
k
k
k
k
k
T
k
k
k
k
k
k
k
k
X
d
W
X
X
X
d
W
X
X
W
X
W
X
X
d
W
X
W
X
d
W
X
s
d
W
W






















112/120
 Taking the expectation of both sides of Eqn. (63) yields
E[Wk+1] = (I − μ E[Xk Xk^T]) E[Wk] + μ E[dk Xk]   (64)
        = (I − μ R) E[Wk] + μ P   (65)
where P and R are as already defined.
μ-LMS Algorithm: Convergence in the Mean (2)(2/2)
113/120
 Appropriate substitution (R = QDQ^T and P = R Ŵ = QDQ^T Ŵ) yields:
E[Wk+1] = (I − μ QDQ^T) E[Wk] + μ QDQ^T Ŵ   (66)
 Pre-multiplication throughout by Q^T results in:
Q^T E[Wk+1] = (I − μ D) Q^T E[Wk] + μ D Q^T Ŵ   (67)
μ-LMS Algorithm: Convergence in the Mean (3)(1/2)
114/120
 And subtraction of Q^T Ŵ from both sides gives:
Q^T E[Wk+1] − Q^T Ŵ = (I − μ D) Q^T E[Wk] + μ D Q^T Ŵ − Q^T Ŵ   (68)
                    = (I − μ D) Q^T (E[Wk] − Ŵ)   (69)
 We will re-write Eqn. (69) in familiar terms:
Q^T Ṽk+1 = (I − μ D) Q^T Ṽk   (70)
μ-LMS Algorithm: Convergence in the Mean (3)(2/2)
115/120
And
Ṽ'k+1 = (I − μ D) Ṽ'k   (71)
where Ṽk = E[Wk] − Ŵ and Ṽ'k = Q^T Ṽk. Eqn. (71) represents a set of n+1 decoupled difference equations:
ṽ'i^(k+1) = (1 − μ λi) ṽ'i^k,   i = 0, 1, …, n   (72)
μ-LMS Algorithm: Convergence in the Mean (4)(1/2)
116/120
Recursive application of Eqn. (72) yields,
ṽ'i^k = (1 − μ λi)^k ṽ'i^0,   i = 0, 1, …, n   (73)
 To ensure convergence in the mean, ṽ'i^k → 0 as k → ∞, since this condition requires that the deviation of E[Wk] from Ŵ should tend to 0.
μ-LMS Algorithm: Convergence in the Mean (4)(2/2)
117/120
 Therefore from Eqn. (73):
|1 − μ λi| < 1,   i = 0, 1, …, n   (74)
 If this condition is satisfied for the largest eigenvalue λmax then it will be satisfied for all other eigenvalues. We therefore conclude that if
0 < μ < 2/λmax   (75)
then the μ-LMS algorithm is convergent in the mean.
μ-LMS Algorithm: Convergence in the Mean (5)(1/2)
118/120
 Further, since tr[R] = Σ(i=0 to n) λi ≥ λmax (where tr[R] is the trace of R), convergence is assured if
0 < μ < 2 / tr[R]   (76)
μ-LMS Algorithm: Convergence in the Mean (5)(2/2)
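Since tr[R] is just the sum of the mean squared values of the (augmented) inputs, the bound (76) can be estimated directly from data without an eigendecomposition (sketch; X assumed to hold augmented patterns as columns):

R_hat = (X*X')/size(X,2);             % sample estimate of the input correlation matrix R
mu_max = 2/trace(R_hat);              % mu-LMS is convergent in the mean for 0 < mu < mu_max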
119/120
Random Walk towards the Wiener Solution
 Assume the familiar fifth-order function.
 μ-LMS uses a local estimate of the gradient to search for the Wiener solution in weight space.
[Figure: Error contours in (w0, w1) space with the noisy μ-LMS weight trajectory; Wiener solution (0.339, −1.881), μ-LMS solution (0.5373, −2.3311)]
120/120
Adaptive Noise Cancellation

More Related Content

Similar to Supervised learning in artificial neural networks

BP219 class 4 04 2011
BP219 class 4 04 2011BP219 class 4 04 2011
BP219 class 4 04 2011
waddling
 
Artificial neural networks
Artificial neural networks Artificial neural networks
Artificial neural networks
Institute of Technology Telkom
 
chapter10
chapter10chapter10
chapter10
butest
 
neural networksNnf
neural networksNnfneural networksNnf
neural networksNnf
Sandilya Sridhara
 
Artificial Neural Networks Lect7: Neural networks based on competition
Artificial Neural Networks Lect7: Neural networks based on competitionArtificial Neural Networks Lect7: Neural networks based on competition
Artificial Neural Networks Lect7: Neural networks based on competition
Mohammed Bennamoun
 
Bp219 2011
Bp219 2011Bp219 2011
Bp219 2011
waddling
 
Bp219 2011-4.13
Bp219 2011-4.13Bp219 2011-4.13
Bp219 2011-4.13
robertstroud
 
Max net
Max netMax net
Annintro
AnnintroAnnintro
Annintro
kaushaljha009
 
3. convolution fourier
3. convolution fourier3. convolution fourier
3. convolution fourier
skysunilyadav
 
03 neural network
03 neural network03 neural network
03 neural network
Tianlu Wang
 
Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...
Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...
Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...
Marina Santini
 
NIPS2017 Few-shot Learning and Graph Convolution
NIPS2017 Few-shot Learning and Graph ConvolutionNIPS2017 Few-shot Learning and Graph Convolution
NIPS2017 Few-shot Learning and Graph Convolution
Kazuki Fujikawa
 
A baisc ideas of statistical physics.pptx
A baisc ideas of statistical physics.pptxA baisc ideas of statistical physics.pptx
A baisc ideas of statistical physics.pptx
abhilasha7888
 
Unit 2 signal &amp;system
Unit 2 signal &amp;systemUnit 2 signal &amp;system
Unit 2 signal &amp;system
sushant7dare
 
MD_course.ppt
MD_course.pptMD_course.ppt
MD_course.ppt
Gold681565
 
Dominance-Based Pareto-Surrogate for Multi-Objective Optimization
Dominance-Based Pareto-Surrogate for Multi-Objective OptimizationDominance-Based Pareto-Surrogate for Multi-Objective Optimization
Dominance-Based Pareto-Surrogate for Multi-Objective Optimization
Ilya Loshchilov
 
2 classical field theories
2 classical field theories2 classical field theories
2 classical field theories
Solo Hermelin
 
Lec1 01
Lec1 01Lec1 01
Introduction
IntroductionIntroduction
Introduction
butest
 

Similar to Supervised learning in artificial neural networks (20)

BP219 class 4 04 2011
BP219 class 4 04 2011BP219 class 4 04 2011
BP219 class 4 04 2011
 
Artificial neural networks
Artificial neural networks Artificial neural networks
Artificial neural networks
 
chapter10
chapter10chapter10
chapter10
 
neural networksNnf
neural networksNnfneural networksNnf
neural networksNnf
 
Artificial Neural Networks Lect7: Neural networks based on competition
Artificial Neural Networks Lect7: Neural networks based on competitionArtificial Neural Networks Lect7: Neural networks based on competition
Artificial Neural Networks Lect7: Neural networks based on competition
 
Bp219 2011
Bp219 2011Bp219 2011
Bp219 2011
 
Bp219 2011-4.13
Bp219 2011-4.13Bp219 2011-4.13
Bp219 2011-4.13
 
Max net
Max netMax net
Max net
 
Annintro
AnnintroAnnintro
Annintro
 
3. convolution fourier
3. convolution fourier3. convolution fourier
3. convolution fourier
 
03 neural network
03 neural network03 neural network
03 neural network
 
Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...
Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...
Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...
 
NIPS2017 Few-shot Learning and Graph Convolution
NIPS2017 Few-shot Learning and Graph ConvolutionNIPS2017 Few-shot Learning and Graph Convolution
NIPS2017 Few-shot Learning and Graph Convolution
 
A baisc ideas of statistical physics.pptx
A baisc ideas of statistical physics.pptxA baisc ideas of statistical physics.pptx
A baisc ideas of statistical physics.pptx
 
Unit 2 signal &amp;system
Unit 2 signal &amp;systemUnit 2 signal &amp;system
Unit 2 signal &amp;system
 
MD_course.ppt
MD_course.pptMD_course.ppt
MD_course.ppt
 
Dominance-Based Pareto-Surrogate for Multi-Objective Optimization
Dominance-Based Pareto-Surrogate for Multi-Objective OptimizationDominance-Based Pareto-Surrogate for Multi-Objective Optimization
Dominance-Based Pareto-Surrogate for Multi-Objective Optimization
 
2 classical field theories
2 classical field theories2 classical field theories
2 classical field theories
 
Lec1 01
Lec1 01Lec1 01
Lec1 01
 
Introduction
IntroductionIntroduction
Introduction
 

Recently uploaded

一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
ecqow
 
Supermarket Management System Project Report.pdf
Supermarket Management System Project Report.pdfSupermarket Management System Project Report.pdf
Supermarket Management System Project Report.pdf
Kamal Acharya
 
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
Gino153088
 
Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...
Prakhyath Rai
 
Null Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAMNull Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAM
Divyanshu
 
OOPS_Lab_Manual - programs using C++ programming language
OOPS_Lab_Manual - programs using C++ programming languageOOPS_Lab_Manual - programs using C++ programming language
OOPS_Lab_Manual - programs using C++ programming language
PreethaV16
 
UNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICS
UNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICSUNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICS
UNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICS
vmspraneeth
 
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
upoux
 
Digital Twins Computer Networking Paper Presentation.pptx
Digital Twins Computer Networking Paper Presentation.pptxDigital Twins Computer Networking Paper Presentation.pptx
Digital Twins Computer Networking Paper Presentation.pptx
aryanpankaj78
 
P5 Working Drawings.pdf floor plan, civil
P5 Working Drawings.pdf floor plan, civilP5 Working Drawings.pdf floor plan, civil
P5 Working Drawings.pdf floor plan, civil
AnasAhmadNoor
 
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
ijseajournal
 
Data Driven Maintenance | UReason Webinar
Data Driven Maintenance | UReason WebinarData Driven Maintenance | UReason Webinar
Data Driven Maintenance | UReason Webinar
UReason
 
Generative AI Use cases applications solutions and implementation.pdf
Generative AI Use cases applications solutions and implementation.pdfGenerative AI Use cases applications solutions and implementation.pdf
Generative AI Use cases applications solutions and implementation.pdf
mahaffeycheryld
 
Blood finder application project report (1).pdf
Blood finder application project report (1).pdfBlood finder application project report (1).pdf
Blood finder application project report (1).pdf
Kamal Acharya
 
Accident detection system project report.pdf
Accident detection system project report.pdfAccident detection system project report.pdf
Accident detection system project report.pdf
Kamal Acharya
 
SCALING OF MOS CIRCUITS m .pptx
SCALING OF MOS CIRCUITS m                 .pptxSCALING OF MOS CIRCUITS m                 .pptx
SCALING OF MOS CIRCUITS m .pptx
harshapolam10
 
Pressure Relief valve used in flow line to release the over pressure at our d...
Pressure Relief valve used in flow line to release the over pressure at our d...Pressure Relief valve used in flow line to release the over pressure at our d...
Pressure Relief valve used in flow line to release the over pressure at our d...
cannyengineerings
 
Open Channel Flow: fluid flow with a free surface
Open Channel Flow: fluid flow with a free surfaceOpen Channel Flow: fluid flow with a free surface
Open Channel Flow: fluid flow with a free surface
Indrajeet sahu
 
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
MadhavJungKarki
 
Applications of artificial Intelligence in Mechanical Engineering.pdf
Applications of artificial Intelligence in Mechanical Engineering.pdfApplications of artificial Intelligence in Mechanical Engineering.pdf
Applications of artificial Intelligence in Mechanical Engineering.pdf
Atif Razi
 

Recently uploaded (20)

一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
 
Supermarket Management System Project Report.pdf
Supermarket Management System Project Report.pdfSupermarket Management System Project Report.pdf
Supermarket Management System Project Report.pdf
 
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
 
Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...
 
Null Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAMNull Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAM
 
OOPS_Lab_Manual - programs using C++ programming language
OOPS_Lab_Manual - programs using C++ programming languageOOPS_Lab_Manual - programs using C++ programming language
OOPS_Lab_Manual - programs using C++ programming language
 
UNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICS
UNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICSUNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICS
UNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICS
 
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
 
Digital Twins Computer Networking Paper Presentation.pptx
Digital Twins Computer Networking Paper Presentation.pptxDigital Twins Computer Networking Paper Presentation.pptx
Digital Twins Computer Networking Paper Presentation.pptx
 
P5 Working Drawings.pdf floor plan, civil
P5 Working Drawings.pdf floor plan, civilP5 Working Drawings.pdf floor plan, civil
P5 Working Drawings.pdf floor plan, civil
 
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
 
Data Driven Maintenance | UReason Webinar
Data Driven Maintenance | UReason WebinarData Driven Maintenance | UReason Webinar
Data Driven Maintenance | UReason Webinar
 
Generative AI Use cases applications solutions and implementation.pdf
Generative AI Use cases applications solutions and implementation.pdfGenerative AI Use cases applications solutions and implementation.pdf
Generative AI Use cases applications solutions and implementation.pdf
 
Blood finder application project report (1).pdf
Blood finder application project report (1).pdfBlood finder application project report (1).pdf
Blood finder application project report (1).pdf
 
Accident detection system project report.pdf
Accident detection system project report.pdfAccident detection system project report.pdf
Accident detection system project report.pdf
 
SCALING OF MOS CIRCUITS m .pptx
SCALING OF MOS CIRCUITS m                 .pptxSCALING OF MOS CIRCUITS m                 .pptx
SCALING OF MOS CIRCUITS m .pptx
 
Pressure Relief valve used in flow line to release the over pressure at our d...
Pressure Relief valve used in flow line to release the over pressure at our d...Pressure Relief valve used in flow line to release the over pressure at our d...
Pressure Relief valve used in flow line to release the over pressure at our d...
 
Open Channel Flow: fluid flow with a free surface
Open Channel Flow: fluid flow with a free surfaceOpen Channel Flow: fluid flow with a free surface
Open Channel Flow: fluid flow with a free surface
 
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
 
Applications of artificial Intelligence in Mechanical Engineering.pdf
Applications of artificial Intelligence in Mechanical Engineering.pdfApplications of artificial Intelligence in Mechanical Engineering.pdf
Applications of artificial Intelligence in Mechanical Engineering.pdf
 

Supervised learning in artificial neural networks

  • 1. 1/120 Supervised Learning I: Perceptrons and LMS Instructor: Tai-Yue (Jason) Wang Department of Industrial and Information Management Institute of Information Management
  • 2. 2/120 Two Fundamental Learning Paradigms  Non-associative  an organism acquires the properties of a single repetitive stimulus.  Associative  an organism acquires knowledge about the relationship of either one stimulus to another, or one stimulus to the organism’s own behavioral response to that stimulus.
  • 3. 3/120 Examples of Associative Learning(1/2)  Classical conditioning  Association of an unconditioned stimulus (US) with a conditioned stimulus (CS).  CS’s such as a flash of light or a sound tone produce weak responses.  US’s such as food or a shock to the leg produce a strong response.  Repeated presentation of the CS followed by the US, the CS begins to evoke the response of the US.  Example: If a flash of light is always followed by serving of meat to a dog, after a number of learning trials the light itself begins to produce salivation.
  • 4. 4/120 Examples of Associative Learning(2/2)  Operant conditioning  Formation of a predictive relationship between a stimulus and a response.  Example: Place a hungry rat in a cage which has a lever on one of its walls. Measure the spontaneous rate at which the rat presses the lever by virtue of its random movements around the cage. If the rat is promptly presented with food when the lever is pressed, the spontaneous rate of lever pressing increases!
  • 5. 5/120 Reflexive and Declarative Learning  Reflexive learning  repetitive learning is involved and recall does not involve any awareness or conscious evaluation.  Declarative learning  established by a single trial or experience and involves conscious reflection and evaluation for its recall.  Constant repitition of declarative knowledge often manifest itself in reflexive form.
  • 6. 6/120  Two distinct stages:  short-term memory (STM)  long-term memory (LTM)  Inputs to the brain are processed into STMs which last at the most for a few minutes. Important Aspects of Human Memory(1/4) Recall process of these memories is distinct from the memories themselves. Memory recalled SHORT TERM MEMORY (STM) LONG TERM MEMORY (LTM) Download Input stimulus Recall Process
  • 7. 7/120  Information is downloaded into LTMs for more permanent storage: days, months, and years.  Capacity of LTMs is very large. Important Aspects of Human Memory(2/4) Recall process of these memories is distinct from the memories themselves. Memory recalled SHORT TERM MEMORY (STM) LONG TERM MEMORY (LTM) Download Input stimulus Recall Process
  • 8. 8/120 Important Aspects of Human Memory(3/4)  Recall of recent memories is more easily disrupted than that of older memories.  Memories are dynamic and undergo continual change and with time all memories fade.  STM results in  physical changes in sensory receptors.  simultaneous and cohesive reverberation of neuron circuits.
  • 9. 9/120 Important Aspects of Human Memory(4/4)  Long-term memory involves  plastic changes in the brain which take the form of strengthening or weakening of existing synapses  the formation of new synapses.  Learning mechanism distributes the memory over different areas  Makes robust to damage  Permits the brain to work easily from partially corrupted information.  Reflexive and declarative memories may actually involve different neuronal circuits.
  • 10. 10/120 Learning Algorithms(1/2)  Define an architecture-dependent procedure to encode pattern information into weights  Learning proceeds by modifying connection strengths.  Learning is data driven:  A set of input–output patterns derived from a (possibly unknown) probability distribution.  Output pattern might specify a desired system response for a given input pattern  Learning involves approximating the unknown function as described by the given data.
  • 11. 11/120 Learning Algorithms(2/2)  Learning is data driven:  Alternatively, the data might comprise patterns that naturally cluster into some number of unknown classes  Learning problem involves generating a suitable classification of the samples.
  • 12. 12/120 Supervised Learning(1/2)  Data comprises a set of discrete samples drawn from the pattern space where each sample relates an input vector Xk ∈ Rn to an output vector Dk ∈ Rp. -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -20 -15 -10 -5 0 5 10 15 20 x f(x) 3 x 5 -1.2 x 4 -12.27 x 3 +3.288 x 2 +7.182 x    Q k k k D X T 1 ,   An example function described by a set of noisy data points
  • 13. 13/120 Supervised Learning(2/2)  The set of samples describe the behavior of an unknown function f : Rn → Rp which is to be characterized. -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -20 -15 -10 -5 0 5 10 15 20 x f(x) 3 x 5 -1.2 x 4 -12.27 x 3 +3.288 x 2 +7.182 x An example function described by a set of noisy data points
  • 14. 14/120 The Supervised Learning Procedure  We want the system to generate an output Dk in response to an input Xk, and we say that the system has learnt the underlying map if a stimulus Xk’ close to Xk elicits a response Sk’ which is sufficiently close to Dk. The result is a continuous function estimate. Error information fed back for network adaptation Error Sk Neural Network Dx Xk
  • 15. 15/120 Unsupervised Learning(1/2)  Unsupervised learning provides the system with an input Xk, and allow it to self-organize its weights to generate internal prototypes of sample vectors. Note: There is no teaching input involved here.  The system attempts to represent the entire data set by employing a small number of prototypical vectors—enough to allow the system to retain a desired level of discrimination between samples.
  • 16. 16/120 Unsupervised Learning(2/2)  As new samples continuously buffer the system, the prototypes will be in a state of constant flux.  This kind of learning is often called adaptive vector quantization
  • 17. 17/120 Clustering and Classification(1/3)  Given a set of data samples {Xi}, Xi ∈ Rn, is it possible to identify well defined “clusters”, where each cluster defines a class of vectors which are similar in some broad sense? -1.5 -1 -0.5 0 0.5 1 1.5 2 2.5 -0.5 0 0.5 1 1.5 2 2.5 Cluster centroids Cluster 1 Cluster 2
  • 18. 18/120 Clustering and Classification(2/3)  Clusters help establish a classification structure within a data set that has no categories defined in advance.  Classes are derived from clusters by appropriate labeling. -1.5 -1 -0.5 0 0.5 1 1.5 2 2.5 -0.5 0 0.5 1 1.5 2 2.5 Cluster centroids Cluster 1 Cluster 2
  • 19. 19/120 Clustering and Classification(3/3)  The goal of pattern classification is to assign an input pattern to one of a finite number of classes.  Quantization vectors are called codebook vectors. -1.5 -1 -0.5 0 0.5 1 1.5 2 2.5 -0.5 0 0.5 1 1.5 2 2.5 Cluster centroids Cluster 1 Cluster 2
  • 20. 20/120 Characteristics of Supervised and Unsupervised Learning
  • 21. 21/120 General Philosophy of Learning: Principle of Minimal Disturbance  Adapt to reduce the output error for the current training pattern, with minimal disturbance to responses already learned.
  • 22. 22/120 Error Correction and Gradient Descent Rules  Error correction rules alter the weights of a network using a linear error measure to reduce the error in the output generated in response to the present input pattern.  Gradient rules alter the weights of a network during each pattern presentation by employing gradient information with the objective of reducing the mean squared error (usually averaged over all training patterns).
  • 23. 23/120 Learning Objective for TLNs(1/4)  Augmented input and weight vectors: X_k = (x_0^k, x_1^k, \ldots, x_n^k)^T \in R^{n+1}, W_k = (w_0^k, w_1^k, \ldots, w_n^k)^T \in R^{n+1}, with the bias input x_0^k = 1.  Objective: To design the weights of a TLN to correctly classify a given set of patterns. (Figure: a TLN with bias input +1 and inputs X1, ..., Xn, weights W0, W1, ..., Wn, and output signal S.)
  • 24. 24/120 Learning Objective for TLNs(2/4)  Assumption: A training set of the following form is given: T = \{X_k, d_k\}_{k=1}^{Q}, X_k ∈ R^{n+1}, d_k ∈ \{0, 1\}.  Each pattern Xk is tagged to one of two classes C0 or C1, denoted by the desired output dk being 0 or 1 respectively. (Figure: the same TLN with bias input +1 and weights W0, W1, ..., Wn.)
  • 25. 25/120 Learning Objective for TLNs(3/4)  Two classes identified by two possible signal states of the TLN:  C0 by a signal S(yk) = 0, C1 by a signal S(yk) = 1.  Given two sets of vectors X0 and X1 belonging to classes C0 and C1 respectively, the learning procedure searches for a solution weight vector WS that correctly classifies the vectors into their respective classes.
  • 26. 26/120 Learning Objective for TLNs(4/4)  Context: TLNs.  Find a weight vector WS such that for all Xk ∈ X1, S(yk) = 1; and for all Xk ∈ X0, S(yk) = 0.  Positive inner products translate to a +1 signal and negative inner products to a 0 signal.  This translates to saying that for all Xk ∈ X1, Xk^T WS > 0; and for all Xk ∈ X0, Xk^T WS < 0.
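In code this classification condition is just a sign test on the inner products (a sketch with a hypothetical weight vector and augmented patterns; here the chosen weights happen to realize a logical OR):

% Sketch: TLN signal for augmented patterns (first component is the bias input 1)
W = [-0.5; 1; 1];                 % arbitrary augmented weight vector [w0; w1; w2]
X = [1 1 1 1; 0 0 1 1; 0 1 0 1];  % augmented patterns as columns
S = double(X' * W > 0);           % S(y_k) = 1 if X_k' * W > 0, else 0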
  • 27. 27/120 Pattern Space(1/2)  Points that satisfy X^T WS = 0 define a separating hyperplane in pattern space.  Two dimensional case:  Pattern space points on one side of this hyperplane (with an orientation indicated by the arrow) yield positive inner products with WS and thus generate a +1 neuron signal. (Figure: the separating hyperplane in pattern space, with an arrow indicating the activation side.)
  • 28. 28/120 Pattern Space(2/2)  Two dimensional case:  Pattern space points on the other side of the hyperplane generate a negative inner product with WS and consequently a neuron signal equal to 0.  Points in C0 and C1 are thus correctly classified by such a placement of the hyperplane. (Figure: the same separating hyperplane with its activation-side orientation.)
  • 29. 29/120 A Different View: Weight Space (1/2)  Weight vector is a variable vector.  W^T Xk = 0 represents a hyperplane in weight space.  Always passes through the origin since W = 0 is a trivial solution of W^T Xk = 0.
  • 30. 30/120 A Different View: Weight Space (2/2)  Called the pattern hyperplane of pattern Xk.  Locus of all points W such that W^T Xk = 0.  Divides the weight space into two parts: one which generates a positive inner product W^T Xk > 0, and the other a negative inner product W^T Xk < 0.
  • 31. 31/120 Identifying a Solution Region from Orientated Pattern Hyperplanes(1/2)  For each pattern Xk in pattern space there is a corresponding hyperplane in weight space.  For every point in weight space there is a corresponding hyperplane in pattern space. (Figure: weight space (w1, w2) with pattern hyperplanes X1, X2, X3, X4 bounding the solution region.)
  • 32. 32/120 Identifying a Solution Region from Orientated Pattern Hyperplanes(2/2)  A solution region in weight space with four pattern hyperplanes:  X1 = {X1, X2}  X0 = {X3, X4} (Figure: the weight-space plot showing the solution region bounded by the four pattern hyperplanes.)
  • 33. 33/120 Requirements of the Learning Procedure  Linear separability guarantees the existence of a solution region.  Points to be kept in mind in the design of an automated weight update procedure:  It must consider each pattern in turn to assess the correctness of the present classification.  It must subsequently adjust the weight vector to eliminate a classification error, if any.  Since the set of all solution vectors forms a convex cone, the weight update procedure should terminate as soon as it penetrates the boundary of this cone (solution region).
  • 34. 34/120 Design in Weight Space(1/3)  Assume: Xk ∈ X1 and Wk^T Xk is erroneously non-positive.  For correct classification, shift the weight vector to some position Wk+1 where the inner product is positive. (Figure: weight space showing Wk on the W^T Xk < 0 side of the pattern hyperplane of Xk, and Wk+1 on the Wk^T Xk > 0 side.)
  • 35. 35/120 Design in Weight Space(2/3)  The smallest perturbation in Wk that produces the desired change is the perpendicular distance from Wk onto the pattern hyperplane. (Figure: the same weight-space construction.)
  • 36. 36/120 Design in Weight Space(3/3)  In weight space, the direction perpendicular to the pattern hyperplane is none other than that of Xk itself. (Figure: the same weight-space construction.)
  • 37. 37/120 Simple Weight Change Rule: Perceptron Learning Law  If Xk ∈ X1 and Wk^T Xk < 0, add a fraction of the pattern to the weight Wk if one wishes the inner product Wk^T Xk to increase.  Alternatively, if Xk ∈ X0 and Wk^T Xk is erroneously non-negative, we will subtract a fraction of the pattern from the weight Wk in order to reduce this inner product.
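A minimal sketch of this rule (assuming augmented patterns with desired labels in {0, 1} and a small fraction eta; the AND-gate data set and the epoch count are illustrative):

% Sketch: Perceptron learning law on a small linearly separable set (AND function)
X = [1 1 1 1; 0 0 1 1; 0 1 0 1];    % augmented patterns (bias input 1 in the first row)
d = [0 0 0 1];                      % desired classes
W = zeros(3, 1); eta = 0.5;
for epoch = 1:20
    for k = 1:size(X, 2)
        y = W' * X(:, k);
        if d(k) == 1 && y <= 0      % misclassified C1 pattern: increase the inner product
            W = W + eta * X(:, k);
        elseif d(k) == 0 && y >= 0  % misclassified C0 pattern: decrease the inner product
            W = W - eta * X(:, k);
        end
    end
end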
  • 38. 38/120 Weight Space Trajectory  The weight space trajectory corresponding to the sequential presentation of four patterns with pattern hyperplanes as indicated:  X1 = {X1, X2} and X0 = {X3, X4} (Figure: the weight-space trajectory in the (w1, w2) plane moving into the solution region bounded by the four pattern hyperplanes.)
  • 39. 39/120 Linear Containment  Consider the set X0’, in which each element of X0 is negated.  Given a weight vector Wk, for any Xk ∈ X1 ∪ X0’, Xk^T Wk > 0 implies correct classification and Xk^T Wk < 0 implies incorrect classification.  X’ = X1 ∪ X0’ is called the adjusted training set.  The assumption of linear separability guarantees the existence of a solution weight vector WS, such that Xk^T WS > 0 ∀ Xk ∈ X’.  We say X’ is a linearly contained set.
  • 40. 40/120 Recast of Perceptron Learning with Linearly Contained Data  Since Xk ∈ X’, a misclassification of Xk will add ηk Xk to Wk.  For Xk ∈ X0’, Xk actually represents the negative of the original vector.  Therefore addition of ηk Xk to Wk actually amounts to subtraction of the original vector from Wk.
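A sketch of this recast (same illustrative data as in the previous sketch; after negating the class-0 patterns once, a single additive update handles every misclassification):

% Sketch: Perceptron learning on the adjusted (linearly contained) training set
X  = [1 1 1 1; 0 0 1 1; 0 1 0 1];        % augmented patterns (columns)
d  = [0 0 0 1];                          % class labels
Xa = X;  Xa(:, d == 0) = -X(:, d == 0);  % adjusted set X': negate class-0 patterns
W = zeros(3, 1); eta = 0.5;
for epoch = 1:20
    for k = 1:size(Xa, 2)
        if W' * Xa(:, k) <= 0            % misclassification in the adjusted set
            W = W + eta * Xa(:, k);      % the single additive rule covers both classes
        end
    end
end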
  • 42. 42/120 Perceptron Convergence Theorem  Given: a linearly contained training set X’ and any initial weight vector W1.  Let SW be the weight vector sequence generated in response to presentation of a training sequence SX upon application of the Perceptron learning law. Then for some finite index k0 we have: Wk0 = Wk0+1 = Wk0+2 = ... = WS, a solution vector.  See the text for detailed proofs.
  • 46. 46/120 MATLAB Simulation (Figure: (a) hyperplane movement during Perceptron learning at k = 5, 15, 25, 35 over the patterns (0,0), (1,0), (0,1), (1,1) in R^2; (b) weight space trajectories from W0 = (0, 0, 0) to the solution WS = (-4, 3, 2).)
  • 47. 47/120 Perceptron Learning and Non-separable Sets  Theorem:  Given a finite set of training patterns X, there exists a number M such that if we run the Perceptron learning algorithm beginning with any initial set of weights, W1, then any weight vector Wk produced in the course of the algorithm will satisfy ||Wk|| ≤ ||W1|| + M.
  • 48. 48/120 Two Corollaries(1/2)  If, in a finite set of training patterns X, each pattern Xk has integer (or rational) components x_i^k, then the Perceptron learning algorithm will visit a finite set of distinct weight vectors Wk.
  • 49. 49/120 Two Corollaries(2/2)  For a finite set of training patterns X, with individual patterns Xk having integer (or rational) components x_i^k, the Perceptron learning algorithm will, in finite time, produce a weight vector that correctly classifies all training patterns iff X is linearly separable, or leave and re-visit a specific weight vector iff X is linearly non-separable.
  • 50. 50/120 Handling Linearly Non-separable Sets: The Pocket Algorithm(1/2)  Philosophy: incorporate positive reinforcement in a way that rewards weights yielding a low-error solution.  The pocket algorithm works by remembering the weight vector that yields the largest number of correct classifications on a consecutive run.
  • 51. 51/120 Handling Linearly Non-separable Sets: The Pocket Algorithm(2/2)  This weight vector is kept in the “pocket”, and we denote it as Wpocket.  While updating the weights in accordance with Perceptron learning, if a weight vector is discovered that has a longer run of consecutively correct classifications than the one in the pocket, it replaces the weight vector in the pocket.
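A minimal sketch of the pocket bookkeeping (the tiny adjusted set and the random presentation order are illustrative assumptions; the run-length accounting on top of plain Perceptron learning is the point):

% Sketch: pocket algorithm - remember the weights with the longest correct run
Xa = [1 1 1 1; 0 0 1 1; 0 1 0 1];        % augmented patterns (columns)
Xa(:, 1) = -Xa(:, 1);                    % adjusted set: the first pattern belongs to class C0
W = zeros(3, 1); Wpocket = W;
run = 0; best_run = 0; eta = 0.5;
for iter = 1:1000
    k = randi(size(Xa, 2));              % present a randomly chosen pattern
    if W' * Xa(:, k) > 0                 % correct: extend the current run
        run = run + 1;
        if run > best_run                % longer run than the pocket: replace Wpocket
            best_run = run;  Wpocket = W;
        end
    else                                 % misclassified: Perceptron update and reset the run
        W = W + eta * Xa(:, k);
        run = 0;
    end
end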
  • 54. 54/120 Pocket Convergence Theorem  Given a finite set of training examples, X, and a probability p < 1, there exists an integer k0 such that after any k > k0 iterations of the pocket algorithm, the probability that the pocket weight vector Wpocket is optimal exceeds p.
  • 55. 55/120 Linear Neurons and Linear Error  Consider a training set of the form T = {Xk, dk}, Xk ∈ R^{n+1}, dk ∈ R.  To allow the desired output to vary smoothly or continuously over some interval, consider a linear signal function: sk = yk = Xk^T Wk.  The linear error ek due to a presented training pair (Xk, dk) is the difference between the desired output dk and the neuronal signal sk: ek = dk − sk = dk − Xk^T Wk.
  • 56. 56/120 Operational Details of α-LMS  α-LMS error correction is proportional to the error itself.  Each iteration reduces the error by a factor of η.  η controls the stability and speed of convergence.  Stability is ensured if 0 < η < 2. (Figure: a linear neuron with bias input +1, inputs x_1^k, ..., x_n^k and weights w_0^k, ..., w_n^k, producing sk = Xk^T Wk.)
  • 57. 57/120 α-LMS Works with Normalized Training Patterns (Figure: in (x1, x2, x3) pattern space, the update moves Wk to Wk+1 along the direction of the presented pattern xk.)
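A minimal sketch of the α-LMS update described above (the straight-line training data, pass count, and variable names are illustrative assumptions; the division by Xk'*Xk is the normalization the slide title refers to):

% Sketch: alpha-LMS error correction with a normalized update
eta = 0.5;                                  % 0 < eta < 2 for stability
X = [ones(1, 50); linspace(-1, 1, 50)];     % augmented inputs (bias row of ones)
d = 0.5*X(2, :) + 0.333;                    % desired outputs (noise-free here)
W = zeros(2, 1);
for pass = 1:100
    for k = 1:size(X, 2)
        e = d(k) - W' * X(:, k);            % linear error e_k = d_k - s_k
        W = W + eta * e * X(:, k) / (X(:, k)' * X(:, k));   % normalized correction
    end
end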
  • 59. 59/120 MATLAB Simulation Example  The synthetic data set shown in the figure is generated by artificially scattering points around the straight line y = 0.5x + 0.333: a scatter of 200 points in a ±0.1 interval in the y direction. (Figure: data scattered around the fifth-order function f = 3x(x-1)(x-1.9)(x+0.7)(x+1.8).)
  • 60. 60/120 MATLAB Simulation Example  This is achieved by first generating a random scatter in the interval [0, 1], then stretching it to the interval [−1, 1], and finally scaling it to ±0.1. (Figure: the same fifth-order function data plot.)
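A sketch of that construction (my reading of the three steps; the choice of abscissae is an assumption):

% Sketch: scatter 200 points about y = 0.5x + 0.333 with a +/-0.1 vertical spread
n = 200;
x = linspace(-1, 1, n);             % abscissae (illustrative choice of range)
r = rand(1, n);                     % step 1: random scatter in [0, 1]
r = 2*r - 1;                        % step 2: stretch to [-1, 1]
r = 0.1*r;                          % step 3: scale to +/-0.1
y = 0.5*x + 0.333 + r;              % noisy samples around the straight line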
  • 61. 61/120 Computer Simulation of α-LMS Algorithm (Figure: the fitted straight line after iteration 1, iteration 10, and iteration 50 (MSE = 0.04), plotted over the scattered (x, y) data.)
  • 62. 62/120 A Stochastic Setting  The assumption that the training set T is well defined in advance is incorrect when the setting is stochastic.  In such a situation, instead of deterministic patterns, we have a sequence of samples {(Xk, dk)} assumed to be drawn from a statistically stationary population or process.  For adjustment of the neuron weights in response to some pattern-dependent error measure, the error computation has to be based on the expectation of the error over the ensemble.
  • 63. 63/120 Definition of Mean Squared Error (MSE)(1/2)  We introduce the square error on a pattern Xk as
\epsilon_k = \frac{1}{2} e_k^2 = \frac{1}{2} (d_k - X_k^T W_k)^2    (1)
           = \frac{1}{2} (d_k^2 - 2 d_k X_k^T W_k + W_k^T X_k X_k^T W_k)    (2)
 Assumption: The weights are held fixed at Wk while computing the expectation.
  • 64. 64/120 Definition of Mean Squared Error (MSE)(2/2)  The mean-squared error can now be computed by taking the expectation on both sides of (2):
\epsilon = E[\epsilon_k] = \frac{1}{2} E[d_k^2] - E[d_k X_k^T] W_k + \frac{1}{2} W_k^T E[X_k X_k^T] W_k    (3)
  • 65. 65/120 Our problem…  To find the optimal weight vector that minimizes the mean-square error.
  • 66. 66/120 Cross Correlations(1/2)  For convenience of expression we define the pattern vector P as the cross-correlation between the desired scalar output, dk, and the input vector, Xk,
P \triangleq E[d_k X_k] = E[(d_k, d_k x_1^k, \ldots, d_k x_n^k)^T]    (4)
 and the input correlation matrix R as
R \triangleq E[X_k X_k^T], \text{ the } (n+1) \times (n+1) \text{ matrix whose } (i,j)\text{-th entry is } E[x_i^k x_j^k], \text{ with } x_0^k = 1    (5)
  • 67. 67/120 Cross Correlations (2/2)  Using Eqns. (4)-(5), we rewrite the MSE expression of Eqn. (3) succinctly as
\epsilon = \frac{1}{2} E[d_k^2] - P^T W_k + \frac{1}{2} W_k^T R W_k    (6)
 Note that since the MSE is a quadratic function of the weights it represents a bowl-shaped surface in the cross space of the (n+1) × 1 weight vector and the MSE.
  • 68. 68/120 Finding the Minimum Error(1/2)  First compute the gradient by straightforward differentiation; it is a linear function of the weights:
\nabla\epsilon = \left( \frac{\partial \epsilon}{\partial w_0}, \ldots, \frac{\partial \epsilon}{\partial w_n} \right)^T = -P + R W    (7)
 To find the optimal set of weights, \hat{W}, simply set \nabla\epsilon = 0, which yields
R \hat{W} = P    (8)
  • 69. 69/120 Finding the Minimum Error (2/2)  This system of equations (8) is called the Wiener-Hopf system and its solution is the Wiener solution or the Wiener filter:
\hat{W} = R^{-1} P    (9)
 \hat{W} is the point in weight space that represents the minimum mean-square error \epsilon_{min}.
  • 70. 70/120 Computing the Optimal Filter(1/2)  First compute R^{-1} and P. Substituting \hat{W} from Eqn. (9) into Eqn. (6) yields:
\epsilon_{min} = \frac{1}{2} E[d_k^2] - P^T \hat{W} + \frac{1}{2} \hat{W}^T R \hat{W}    (10)
             = \frac{1}{2} E[d_k^2] - P^T R^{-1} P + \frac{1}{2} P^T R^{-1} R R^{-1} P    (11)
             = \frac{1}{2} E[d_k^2] - \frac{1}{2} P^T \hat{W}    (12)
  • 71. 71/120 Computing the Optimal Filter(2/2)  For the treatment of weight update procedures we reformulate the expression for mean-square error in terms of the deviation V = W - \hat{W} of the weight vector from the Wiener solution.
  • 72. 72/120 Computing R(1/2)  Substituting W = V + \hat{W} into Eqn. (6):
\epsilon = \frac{1}{2} E[d_k^2] - P^T (V + \hat{W}) + \frac{1}{2} (V + \hat{W})^T R (V + \hat{W})    (13)
        = \frac{1}{2} E[d_k^2] - P^T V - P^T \hat{W} + \frac{1}{2} V^T R V + \frac{1}{2} \hat{W}^T R \hat{W} + V^T R \hat{W}    (14)
        = \epsilon_{min} + \frac{1}{2} V^T R V    (15)  (since R\hat{W} = P, the terms -P^T V and V^T R \hat{W} cancel)
        = \epsilon_{min} + \frac{1}{2} (W - \hat{W})^T R (W - \hat{W})    (16)
  • 73. 73/120 Computing R(2/2)  Note that since the mean-square error is non-negative, we must have V^T R V ≥ 0. This implies that R is positive semi-definite. Usually, however, R is positive definite.
  • 74. 74/120 Diagonalization of R  Assume that R has distinct eigenvalues \lambda_0, \lambda_1, \ldots, \lambda_n. Then we can construct a matrix Q whose columns are the corresponding eigenvectors of R:
Q = (q_0, q_1, \ldots, q_n)    (17)
 R can be diagonalized using an orthogonal similarity transformation as follows. Having constructed Q, and knowing that
R Q = Q \, diag(\lambda_0, \lambda_1, \ldots, \lambda_n)    (18)
 we have
Q^{-1} R Q = D = diag(\lambda_0, \lambda_1, \lambda_2, \ldots, \lambda_n)    (19)
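This diagonalization can be checked numerically (a sketch; the symmetric matrix below is illustrative, and MATLAB's eig returns the eigenvector matrix Q and the diagonal eigenvalue matrix D):

% Sketch: verify Q^(-1) R Q = D for a symmetric correlation-like matrix
R = [1 0.5; 0.5 1.61];          % a symmetric positive definite matrix (illustrative)
[Q, D] = eig(R);                % columns of Q are eigenvectors, D is diagonal
check = Q \ (R * Q);            % should reproduce D (equivalently Q' * R * Q for orthonormal Q)
disp(norm(check - D));          % close to zero up to rounding error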
  • 75. 75/120 Some Observations  It is usual to choose the eigenvectors of R to be orthonormal, so that Q Q^T = I and Q^{-1} = Q^T.  Then R = Q D Q^{-1} = Q D Q^T.  From Eqn. (15) we know that the shape of \epsilon is a bowl-shaped paraboloid with a minimum at the Wiener solution (V = 0, the origin of V-space).  Slices of \epsilon parallel to the W space yield elliptic constant-error contours, which define the weights in weight space that share the specific value of the square error (say \epsilon_c) at which the slice is made:
\frac{1}{2} V^T R V = \epsilon_c - \epsilon_{min} = \text{constant}    (20)
  • 76. 76/120 Eigenvectors of R(1/2)  Also note from Eqn. (15) that we can compute the MSE gradient in V-space as
\nabla\epsilon = R V    (21)
 which defines a family of vectors in V-space.  Exactly n+1 of these pass through the origin of V-space, and these are the principal axes of the ellipse.
  • 77. 77/120 Eigenvectors of R(2/2)  However, vectors passing through the origin must take the form \lambda V'. Therefore, for the principal axes V':
\lambda V' = R V'    (22)
 Clearly, V' is an eigenvector of R.
  • 78. 78/120 Steepest Descent Search with Exact Gradient Information  Steepest descent search uses exact gradient information available from the mean-square error surface to direct the search in weight space.  The figure shows a projection of the square-error function onto the w_i^k axis.
  • 79. 79/120 Steepest Descent Procedure Summary(1/2)  Provide an appropriate weight increment to w_i^k to push the error towards the minimum, which occurs at \hat{w}_i.  Perturb the weight in a direction that depends on which side of the optimal weight \hat{w}_i the current weight value w_i^k lies.  If the weight component w_i^k lies to the left of \hat{w}_i, say at w_i^{k_1}, where the error gradient is negative (as indicated by the tangent), we need to increase w_i^k.
  • 80. 80/120 Steepest Descent Procedure Summary(2/2)  If w_i^k is on the right of \hat{w}_i, say at w_i^{k_2}, where the error gradient is positive, we need to decrease w_i^k.  This rule is summarized in the following statement:
If \partial\epsilon/\partial w_i^k > 0 (w_i^k > \hat{w}_i), decrease w_i^k; if \partial\epsilon/\partial w_i^k < 0 (w_i^k < \hat{w}_i), increase w_i^k.
  • 81. 81/120 Weight Update Procedure(1/3)  It follows logically, therefore, that the weight component should be updated in proportion with the negative of the gradient:
w_i^{k+1} = w_i^k + \eta \left( -\frac{\partial \epsilon}{\partial w_i^k} \right), \quad i = 0, 1, \ldots, n    (27)
  • 82. 82/120 Weight Update Procedure(2/3)  Vectorially we may write
W^{k+1} = W^k + \eta (-\nabla\epsilon)    (28)
 where we now re-introduce the iteration time dependence into the weight components:
\nabla\epsilon = \left( \frac{\partial \epsilon}{\partial w_0^k}, \ldots, \frac{\partial \epsilon}{\partial w_n^k} \right)^T    (29)
  • 83. 83/120  Equation (28) is the steepest descent update procedure. Note that steepest descent uses exact gradient information at each step to decide weight changes. Weight Update Procedure(3/3)
  • 84. 84/120 Convergence of Steepest Descent – 1(1/2)  Question: What can one say about the stability of the algorithm? Does it converge for all values of \eta?  To answer this question consider the following series of substitutions and transformations. From Eqns. (28) and (21):
W^{k+1} = W^k + \eta (-\nabla\epsilon)    (30)
       = W^k - \eta R V^k    (31)
       = W^k - \eta R (W^k - \hat{W})    (32)
       = (I - \eta R) W^k + \eta R \hat{W}    (33)
  • 85. 85/120 Convergence of Steepest Descent – 1(2/2)  Transforming Eqn. (33) into V-space (the principal coordinate system) by substituting V^k + \hat{W} for W^k yields:
V^{k+1} = (I - \eta R) V^k    (34)
  • 86. 86/120 Steepest Descent Convergence – 2(1/2)  Rotation to the principal axes of the elliptic contours can be effected by using V = Q V':
Q V'^{k+1} = (I - \eta R) Q V'^k    (35)
 or
V'^{k+1} = Q^{-1} (I - \eta R) Q V'^k    (36)
        = (I - \eta D) V'^k    (37)
  • 87. 87/120 Steepest Descent Convergence – 2(2/2)  where D is the diagonal eigenvalue matrix. Recursive application of Eqn. (37) yields:
V'^k = (I - \eta D)^k V'^0    (38)
 It follows from this that for stability and convergence of the algorithm:
\lim_{k \to \infty} (I - \eta D)^k = 0    (39)
  • 88. 88/120 Steepest Descent Convergence – 3(1/2)  This requires that
\lim_{k \to \infty} (1 - \eta \lambda_{max})^k = 0    (40)
 or
0 < \eta < \frac{2}{\lambda_{max}}    (41)
 \lambda_{max} being the largest eigenvalue of R.
  • 89. 89/120 Steepest Descent Convergence – 3(2/2)  If this condition is satisfied then we have
\lim_{k \to \infty} V'^k = \lim_{k \to \infty} Q^{-1} (W^k - \hat{W}) = 0    (42)
 or
\lim_{k \to \infty} W^k = \hat{W}    (43)
 Steepest descent is guaranteed to converge to the Wiener solution as long as the learning rate is maintained within the limits defined by Eqn. (41).
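Once R is known, the admissible learning-rate range of Eqn. (41) is easy to compute (a sketch; the matrix shown is the R quoted in the simulation example that follows):

% Sketch: learning-rate stability bound for steepest descent, 0 < eta < 2/lambda_max
R = [1 0.5; 0.5 1.61];          % correlation matrix from the simulation example
lambda_max = max(eig(R));       % largest eigenvalue of R
eta_max = 2 / lambda_max;       % upper limit on the learning rate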
  • 90. 90/120 Computer Simulation Example(1/2)  This simulation example employs the fifth-order function data scatter with the data shifted in the y direction by 0.5. Consequently, the values of R, P and the Wiener solution are respectively:
R = \begin{pmatrix} 1 & 0.500 \\ 0.500 & 1.61 \end{pmatrix}, \quad P = \begin{pmatrix} 0.8386 \\ -1.4625 \end{pmatrix}    (44)
\hat{W} = \begin{pmatrix} 1.5303 \\ -1.3834 \end{pmatrix}    (45)
  • 91. 91/120 Computer Simulation Example(2/2)  Exact gradient information is available since the correlation matrix R and the cross-correlation vector P are known.  The weights are updated using the equation:
W^{k+1} = W^k - 2 \eta R (W^k - \hat{W})    (46)
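As a quick numerical check on Eqns. (44)-(45) (a sketch; the backslash operator solves R*W = P directly):

% Sketch: verify the Wiener solution for the quoted R and P
R = [1 0.500; 0.500 1.61];
P = [0.8386; -1.4625];
W_hat = R \ P                   % approximately [1.5303; -1.3834]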
  • 92. 92/120 MATLAB Simulation Example(1/3)
eta = .01;                     % Set learning rate
R = zeros(2,2);                % Initialize correlation matrix
X = [ones(1,max_points); x];   % Augment input vectors
P = (sum([d.*X(1,:); d.*X(2,:)], 2))/max_points;   % Cross-correlations
D = (sum(d.^2))/max_points;    % Expectation of the squared targets
for k = 1:max_points
    R = R + X(:,k)*X(:,k)';    % Accumulate R
end
R = R/max_points;
weiner = inv(R)*P;             % Compute the Wiener solution
errormin = D - P'*inv(R)*P;    % Find the minimum error
  • 93. 93/120 MATLAB Simulation Example (2/3)
shift1 = linspace(-12,12, 21); % Generate a weight space grid
shift2 = linspace(-9,9, 21);
for i = 1:21                   % Compute a weight matrix about the Wiener solution
    shiftwts(1,i) = weiner(1) + shift1(i);
    shiftwts(2,i) = weiner(2) + shift2(i);
end
for i = 1:21                   % Compute the error matrix
    for j = 1:21               % to plot the error contours
        error(i,j) = sum((d - (shiftwts(1,i) + x.*shiftwts(2,j))).^2);
    end
end
error = error/max_points;
figure; hold on;               % Plot the error contours
plot(weiner(1), weiner(2), '*k')
% Labelling statements not shown
  • 94. 94/120 MATLAB Simulation Example (3/3)
w = 10*(2*rand(2,1)-1);        % Randomize weights
w0 = w;                        % Remember the initial weights
for loop = 1:500               % Perform descent for 500 iterations
    w = w + eta*(-2*(R*w - P));
    wts1(loop) = w(1);
    wts2(loop) = w(2);
end
wts1 = [w0(1) wts1];           % Set up weights for plotting
wts2 = [w0(2) wts2];
plot(wts1, wts2, 'r')          % Plot the weight trajectory
  • 95. 95/120 Smooth Trajectory towards the Wiener Solution  Steepest descent uses exact gradient information to search for the Wiener solution in weight space. (Figure: error contours in the (w0, w1) plane with a smooth weight trajectory from W0 = (-3.9, 6.27)^T to the Wiener solution (1.53, -1.38).)
  • 96. 96/120 μ-LMS: Approximate Gradient Descent(1/2)  The problem with steepest descent is that true gradient information is only available in situations where the data set is completely specified in advance.  It is then possible to compute R and P exactly, and thus the true gradient at iteration k: \nabla\epsilon = R W^k - P.
  • 97. 97/120 μ-LMS: Approximate Gradient Descent(2/2)  However, when the data set comprises a random stream of patterns (drawn from a stationary distribution), R and P cannot be computed accurately. To find a correct approximation one might have to examine the data stream for a reasonably large period of time and keep averaging out.  How long should we examine the stream to get reliable estimates of R and P?
  • 98. 98/120 Definition  The μ-LMS algorithm is convergent in the mean if the average of the weight vector Wk approaches the optimal solution Wopt as the number of iterations, k, approaches infinity:  E[Wk] → Wopt as k → ∞
  • 99. 99/120 μ-LMS employs \tilde{\nabla}\epsilon_k for \nabla\epsilon_k (1/2)  The gradient computation modifies to:
\tilde{\nabla}\epsilon_k = \left( \frac{\partial \epsilon_k}{\partial w_0^k}, \ldots, \frac{\partial \epsilon_k}{\partial w_n^k} \right)^T = e_k \left( \frac{\partial e_k}{\partial w_0^k}, \ldots, \frac{\partial e_k}{\partial w_n^k} \right)^T = -e_k X_k    (47)
 where e_k = d_k - s_k, and s_k = X_k^T W^k since we are dealing with linear neurons. Note therefore that the recursive update equation then becomes
W^{k+1} = W^k + \mu (-\tilde{\nabla}\epsilon_k)    (48)
       = W^k + \mu (d_k - s_k) X_k = W^k + \mu e_k X_k    (49)
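A minimal sketch of the recursion in Eqn. (49) (the synthetic stationary stream and the value of mu are illustrative assumptions):

% Sketch: mu-LMS stochastic gradient update W_{k+1} = W_k + mu * e_k * X_k
mu = 0.01;
W = zeros(2, 1);
for k = 1:5000
    x = 4*rand - 2;                          % draw a sample input from a stationary source
    Xk = [1; x];                             % augmented input
    dk = 0.5*x + 0.333 + 0.1*(2*rand - 1);   % noisy desired output
    e = dk - W' * Xk;                        % instantaneous linear error
    W = W + mu * e * Xk;                     % step along the instantaneous gradient estimate
end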
  • 100. 100/120 μ-LMS employs \tilde{\nabla}\epsilon_k for \nabla\epsilon_k (2/2)  What value does the long term average of \tilde{\nabla}\epsilon_k converge to? Taking the expectation of both sides of Eqn. (47):
E[\tilde{\nabla}\epsilon_k] = -E[e_k X_k]    (50)
                          = -E[(d_k - X_k^T W) X_k]    (51)
                          = R W - P = \nabla\epsilon    (52)
  • 101. 101/120 Observations(1/8)  Since the long term average of \tilde{\nabla}\epsilon_k approaches \nabla\epsilon, we can safely use \tilde{\nabla}\epsilon_k as an unbiased estimate. That’s what makes μ-LMS work!  Since \tilde{\nabla}\epsilon_k approaches \nabla\epsilon in the long run, one could keep collecting \tilde{\nabla}\epsilon_k for a sufficiently large number of iterations (while keeping the weights fixed), and then make a weight change collectively for all those iterations together.
  • 102. 102/120 Observations(2/8)  If the data set is finite (deterministic), then one can compute \nabla\epsilon accurately by first collecting the different gradients \tilde{\nabla}\epsilon_k over all training patterns Xk for the same set of weights. This accurate measure of the gradient could then be used to change the weights. In this situation μ-LMS is identical to the steepest descent algorithm.
  • 103. 103/120 Observations(3/8)  Even if the data set is deterministic, we still use \tilde{\nabla}\epsilon_k to update the weights. After all, if the data set becomes large, collection of all the gradients becomes expensive in terms of storage. It is much easier to just go ahead and use \tilde{\nabla}\epsilon_k.  Be clear about the approximation made: we are estimating the true gradient (which should be computed from \epsilon = E[\epsilon_k]) by a gradient computed from the instantaneous sample error \epsilon_k. Although this may seem to be a rather drastic approximation, it works.
  • 104. 104/120 Observations(4/8)  In the deterministic case we can justify this as follows: if the learning rate \mu is kept small, the weight change in each iteration will be small and consequently the weight vector W will remain “somewhat constant” over Q iterations, where Q is the number of patterns in the training set.
  • 105. 105/120 Observations(5/8)  Of course this is provided that Q is a small number! To see this, observe the total weight change \Delta W over Q iterations from the k-th iteration:
\Delta W = \sum_{i=0}^{Q-1} \mu \left( -\frac{\partial \epsilon_{k+i}}{\partial W^{k+i}} \right)    (53)
        \approx \mu Q \left( -\frac{1}{Q} \sum_{i=0}^{Q-1} \frac{\partial \epsilon_{k+i}}{\partial W^k} \right)    (54)
        = -\mu Q \, \frac{\partial}{\partial W^k} \left( \frac{1}{Q} \sum_{i=0}^{Q-1} \epsilon_{k+i} \right)    (55)
        \approx -\mu Q \, \frac{\partial \bar{\epsilon}}{\partial W^k}    (56)
  • 106. 106/120 Observations(6/8)  where \bar{\epsilon} denotes the mean-square error. Thus the weight updates follow the true gradient on average.
  • 107. 107/120 Observations(7/8)  Observe that steepest descent search is guaranteed to find the Wiener solution provided the learning rate condition (41) is satisfied.  Now since \tilde{\nabla}\epsilon_k is an unbiased estimate of \nabla\epsilon, one can expect that μ-LMS too will search out the Wiener solution. Indeed it does, but not smoothly. This is to be expected since we are only using an estimate of the true gradient for immediate use.
  • 108. 108/120 Observations(8/8)  Although μ-LMS and α-LMS are similar algorithms, α-LMS works on the normalized training set. What this simply means is that α-LMS also uses gradient information, and will eventually search out the Wiener solution of the normalized training set. However, in one case the two algorithms are identical: the case when input vectors are bipolar. (Why?)
  • 109. 109/120 μ-LMS Algorithm: Convergence in the Mean (1)(1/2)  Definition 0.1: The μ-LMS algorithm is convergent in the mean if the average of the weight vector W^k approaches the optimal solution \hat{W} as the number of iterations k approaches infinity:
E[W^k] \to \hat{W} \text{ as } k \to \infty    (57)
 Definition 0.2: The μ-LMS algorithm is convergent in the mean square if the average of the squared error \epsilon_k approaches a constant as the number of iterations k approaches infinity:
E[\epsilon_k] \to \text{constant as } k \to \infty    (58)
  • 110. 110/120 μ-LMS Algorithm: Convergence in the Mean (1)(2/2)  Convergence in the mean square is a stronger criterion than convergence in the mean. In this section we discuss convergence in the mean and merely state the result for convergence in the mean square.
  • 111. 111/120 μ-LMS Algorithm: Convergence in the Mean (2)(1/2)  Consider the μ-LMS weight update equation:
W^{k+1} = W^k + \mu (d_k - s_k) X_k    (59)
       = W^k + \mu (d_k - X_k^T W^k) X_k    (60)
       = W^k + \mu d_k X_k - \mu (X_k^T W^k) X_k    (61)
       = W^k + \mu d_k X_k - \mu X_k X_k^T W^k    (62)
       = (I - \mu X_k X_k^T) W^k + \mu d_k X_k    (63)
  • 112. 112/120 μ-LMS Algorithm: Convergence in the Mean (2)(2/2)  Taking the expectation of both sides of Eqn. (63) yields
E[W^{k+1}] = (I - \mu E[X_k X_k^T]) E[W^k] + \mu E[d_k X_k]    (64)
          = (I - \mu R) E[W^k] + \mu P    (65)
 where P and R are as already defined.
  • 113. 113/120 μ-LMS Algorithm: Convergence in the Mean (3)(1/2)  Appropriate substitution (R = Q D Q^T, P = R\hat{W} = Q D Q^T \hat{W}) yields:
E[W^{k+1}] = (I - \mu Q D Q^T) E[W^k] + \mu Q D Q^T \hat{W}    (66)
 Pre-multiplication throughout by Q^T results in:
Q^T E[W^{k+1}] = (I - \mu D) Q^T E[W^k] + \mu D Q^T \hat{W}    (67)
  • 114. 114/120 μ-LMS Algorithm: Convergence in the Mean (3)(2/2)  And subtraction of Q^T \hat{W} from both sides gives:
Q^T E[W^{k+1}] - Q^T \hat{W} = (I - \mu D) Q^T E[W^k] + \mu D Q^T \hat{W} - Q^T \hat{W}    (68)
                           = (I - \mu D) Q^T (E[W^k] - \hat{W})    (69)
 We will re-write Eqn. (69) in familiar terms:
Q^T \tilde{V}^{k+1} = (I - \mu D) Q^T \tilde{V}^k    (70)
  • 115. 115/120 μ-LMS Algorithm: Convergence in the Mean (4)(1/2)  And
\tilde{V}'^{k+1} = (I - \mu D) \tilde{V}'^k    (71)
 where \tilde{V}^k = E[W^k] - \hat{W} and \tilde{V}'^k = Q^T \tilde{V}^k. Eqn. (71) represents a set of n+1 decoupled difference equations:
\tilde{v}_i'^{k+1} = (1 - \mu \lambda_i) \tilde{v}_i'^k, \quad i = 0, 1, \ldots, n    (72)
  • 116. 116/120 μ-LMS Algorithm: Convergence in the Mean (4)(2/2)  Recursive application of Eqn. (72) yields
\tilde{v}_i'^k = (1 - \mu \lambda_i)^k \tilde{v}_i'^0, \quad i = 0, 1, \ldots, n    (73)
 To ensure convergence in the mean, \tilde{v}_i'^k \to 0 as k \to \infty, since this condition requires that the deviation of E[W^k] from \hat{W} should tend to 0.
  • 117. 117/120 μ-LMS Algorithm: Convergence in the Mean (5)(1/2)  Therefore from Eqn. (73):
|1 - \mu \lambda_i| < 1, \quad i = 0, 1, \ldots, n    (74)
 If this condition is satisfied for the largest eigenvalue \lambda_{max} then it will be satisfied for all other eigenvalues. We therefore conclude that if
0 < \mu < \frac{2}{\lambda_{max}}    (75)
 then the μ-LMS algorithm is convergent in the mean.
  • 118. 118/120 μ-LMS Algorithm: Convergence in the Mean (5)(2/2)  Further, since tr(R) = \sum_{i=0}^{n} \lambda_i \ge \lambda_{max} (where tr(R) is the trace of R), convergence is assured if
0 < \mu < \frac{2}{tr(R)}    (76)
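Because the trace is just the sum of the diagonal entries of R, this bound needs no eigenvalue computation (a sketch reusing the illustrative correlation matrix from earlier):

% Sketch: conservative learning-rate bound from the trace of R, 0 < mu < 2/tr(R)
R = [1 0.5; 0.5 1.61];
mu_max = 2 / trace(R);          % more conservative than 2/lambda_max, but cheaper to obtain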
  • 119. 119/120 Random Walk towards the Wiener Solution  Assume the familiar fifth-order function.  μ-LMS uses a local estimate of the gradient to search for the Wiener solution in weight space. (Figure: error contours in the (w0, w1) plane with a jagged μ-LMS weight trajectory; Wiener solution (0.339, -1.881), μ-LMS solution (0.5373, -2.3311).)