2. Outline
• Foundations of trainable decision-making
networks to be formulated
– Input space to output space (classification space)
• Focus on the classification of linearly separable
classes of patterns
– Linear discriminant functions and a simple correction
rule
– Continuous error function minimization
• Explanation and justification of perceptron and
delta training rules
3. Classification Model, Features,
and Decision Regions
• A pattern is the quantitative description of an
object, event, or phenomenon
– Spatial patterns: weather maps, fingerprints …
– Temporal patterns: speech signals …
• Pattern classification/recognition
– Assign the input data (a physical object, event, or
phenomenon) to one of the pre-specified classes
(categories)
– Discriminate the input data within the object population
via the search for attributes that are invariant among
members of the population
4. Classification Model, Features,
and Decision Regions (cont.)
• The block diagram of the recognition and
classification system
[Figure: block diagram — raw input patterns pass through dimensionality reduction (feature extraction) before classification; a neural network can serve both for classification and for feature extraction]
5. Classification Model, Features,
and Decision Regions (cont.)
• More about Feature Extraction
– Data compressed from the input patterns while still
possessing the salient information
– E.g.
• Speech vowel sounds analyzed in 16-channel filterbanks can
provide 16 spectral vectors, which can be further transformed
into two dimensions
– Tone height (high-low) and retraction (front-back)
• Input patterns to be projected and reduced to lower
dimensions
7. Classification Model, Features,
and Decision Regions (cont.)
• Two simple ways to generate the pattern vectors for
cases of spatial and temporal objects to be classified
• A pattern classifier maps input pattern vectors in $E^n$ space into
numbers in $E^1$ that specify the class membership:
$$j = i_0(\mathbf{x}), \quad j = 1, 2, \ldots, R$$
8. Classification Model, Features,
and Decision Regions (cont.)
• Classification described in geometric terms
$$i_0(\mathbf{x}) = j, \ \text{ for all } \mathbf{x} \in \mathcal{X}_j, \quad j = 1, 2, \ldots, R$$
– Decision regions $\mathcal{X}_j$ (the decision surfaces in the illustrated example are curved lines)
– Decision surfaces: in general, the decision surfaces for n-dimensional patterns are (n−1)-dimensional hypersurfaces
9. Discriminant Functions
• Membership in a category is determined by the classifier
through comparison of R discriminant functions g₁(x), g₂(x), …, g_R(x)
– Pattern x lies within region 𝒳_k if g_k(x) has the largest value:
$$i_0(\mathbf{x}) = k \quad \text{if } g_k(\mathbf{x}) > g_j(\mathbf{x}) \ \text{ for } k, j = 1, 2, \ldots, R, \ k \neq j$$
[Figure: classifier block diagram — input pattern x = (x₁, …, xₙ) feeds R discriminators g₁, …, g_R; their outputs g₁(x), …, g_R(x) enter a maximum selector; the training patterns x₁, x₂, …, x_P satisfy P ≫ n; the classifier is assumed to have already been designed]
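As a concrete sketch of the maximum-selector rule above, the snippet below picks the class whose gᵢ(x) is largest; the linear form of the discriminants and the weight values are illustrative assumptions, not values from the text:

```python
import numpy as np

# Maximum-selector sketch: class = argmax_i g_i(x).
# Linear discriminants g_i(x) = w_i . x + b_i with assumed weights;
# any R discriminant functions would work the same way.
W = np.array([[ 1.0,  2.0],    # weights of g_1
              [ 0.5, -1.0],    # weights of g_2
              [-2.0,  0.0]])   # weights of g_3
b = np.array([0.0, 1.0, 0.5])

def i0(x):
    g = W @ x + b                    # evaluate g_1(x), ..., g_R(x)
    return int(np.argmax(g)) + 1     # 1-based class index k

print(i0(np.array([1.0, 1.0])))      # -> 1: g_1 is largest here
```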
10. Discriminant Functions (cont.)
• Example 3.1: Decision surface equation:
$$g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x}) = -2x_1 + x_2 + 2$$
$$g(\mathbf{x}) > 0: \ \text{class 1}, \qquad g(\mathbf{x}) < 0: \ \text{class 2}$$
– The decision surface does not uniquely specify the
discriminant functions
– A classifier that sorts patterns into two classes or categories
is called a dichotomizer (from the Greek roots for "two" and "cut")
14. Discriminant Functions (cont.)
– The design of the discriminator for this case is not
straightforward
– The discriminant functions may turn out to be nonlinear
functions of x₁ and x₂
15. Bayes’ Decision Theory
• Decision-making based on both posterior knowledge
obtained from specific observation data and prior
knowledge of the categories
– Prior class probabilities P(ω_i), ∀ class i
– Class-conditional probabilities P(x|ω_i), ∀ class i
$$k = \arg\max_i P(\omega_i \mid \mathbf{x}) = \arg\max_i \frac{P(\mathbf{x} \mid \omega_i)\,P(\omega_i)}{P(\mathbf{x})} = \arg\max_i \frac{P(\mathbf{x} \mid \omega_i)\,P(\omega_i)}{\sum_{j=1}^{R} P(\mathbf{x} \mid \omega_j)\,P(\omega_j)}$$
Since the denominator is common to all classes:
$$k = \arg\max_i P(\omega_i \mid \mathbf{x}) = \arg\max_i P(\mathbf{x} \mid \omega_i)\,P(\omega_i)$$
16. Bayes’ Decision Theory (cont.)
• Bayes' decision rule is designed to minimize the
overall risk involved in making a decision
– The expected loss (conditional risk) when making decision δ_i, with the zero-one loss function:
$$R(\delta_i \mid \mathbf{x}) = \sum_j l(\delta_i \mid \omega_j, \mathbf{x})\,P(\omega_j \mid \mathbf{x}), \quad \text{where } l(\delta_i \mid \omega_j, \mathbf{x}) = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases}$$
$$R(\delta_i \mid \mathbf{x}) = \sum_{j \neq i} P(\omega_j \mid \mathbf{x}) = 1 - P(\omega_i \mid \mathbf{x})$$
• The overall risk (Bayes' risk):
$$R = \int_{-\infty}^{\infty} R\left(\delta(\mathbf{x}) \mid \mathbf{x}\right)p(\mathbf{x})\,d\mathbf{x}, \quad \delta(\mathbf{x}): \text{the decision selected for sample } \mathbf{x}$$
– Minimize the overall risk (classification error) by
computing the conditional risks and selecting the decision
δ_i for which the conditional risk R(δ_i|x) is minimum, i.e.,
for which P(ω_i|x) is maximum (the minimum-error-rate decision rule)
17. Bayes’ Decision Theory (cont.)
• Two-class pattern classification
$$g_1(\mathbf{x}) = P(\omega_1 \mid \mathbf{x}) \cong P(\mathbf{x} \mid \omega_1)\,P(\omega_1), \qquad g_2(\mathbf{x}) = P(\omega_2 \mid \mathbf{x}) \cong P(\mathbf{x} \mid \omega_2)\,P(\omega_2)$$
Bayes' classifier:
$$P(\mathbf{x} \mid \omega_1)\,P(\omega_1) \ \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \ P(\mathbf{x} \mid \omega_2)\,P(\omega_2)$$
Likelihood ratio or log-likelihood ratio:
$$l(\mathbf{x}) = \frac{P(\mathbf{x} \mid \omega_1)}{P(\mathbf{x} \mid \omega_2)} \ \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \ \frac{P(\omega_2)}{P(\omega_1)}$$
$$\log l(\mathbf{x}) = \log P(\mathbf{x} \mid \omega_1) - \log P(\mathbf{x} \mid \omega_2) \ \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \ \log P(\omega_2) - \log P(\omega_1)$$
Classification error:
$$p(\text{error}) = P(\mathbf{x} \in R_1, \omega_2) + P(\mathbf{x} \in R_2, \omega_1) = P(\mathbf{x} \in R_1 \mid \omega_2)\,P(\omega_2) + P(\mathbf{x} \in R_2 \mid \omega_1)\,P(\omega_1)$$
$$= \int_{R_1} P(\mathbf{x} \mid \omega_2)\,P(\omega_2)\,d\mathbf{x} + \int_{R_2} P(\mathbf{x} \mid \omega_1)\,P(\omega_1)\,d\mathbf{x}$$
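To make the likelihood-ratio test concrete, here is a minimal sketch assuming one-dimensional Gaussian class-conditional densities; the means, variances, and priors are illustrative assumptions, not values from the text:

```python
import numpy as np

# Two-class Bayes decision via the log-likelihood ratio (1-D sketch).
# All parameters below are illustrative assumptions.
mu1, var1, P1 = 0.0, 1.0, 0.5   # class omega_1
mu2, var2, P2 = 2.0, 1.0, 0.5   # class omega_2

def log_gauss(x, mu, var):
    # log of the univariate Gaussian density
    return -0.5*np.log(2*np.pi*var) - (x - mu)**2/(2*var)

def decide(x):
    log_l = log_gauss(x, mu1, var1) - log_gauss(x, mu2, var2)
    return 1 if log_l > np.log(P2) - np.log(P1) else 2

print(decide(0.5), decide(1.5))   # -> 1 2 (the threshold sits at x = 1 here)
```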
18. Bayes’ Decision Theory (cont.)
• When the environment is multivariate Gaussian,
the Bayes’ classifier reduces to a linear classifier
– The same form taken by the perceptron
– But the linear nature of the perceptron is not
contingent on the assumption of Gaussianity
$$P(\mathbf{x} \mid \omega) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^t\,\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})\right)$$
Assumptions:
$$\text{Class } \omega_1:\ E[\mathbf{X}] = \boldsymbol{\mu}_1,\quad E\left[(\mathbf{X} - \boldsymbol{\mu}_1)(\mathbf{X} - \boldsymbol{\mu}_1)^t\right] = \Sigma$$
$$\text{Class } \omega_2:\ E[\mathbf{X}] = \boldsymbol{\mu}_2,\quad E\left[(\mathbf{X} - \boldsymbol{\mu}_2)(\mathbf{X} - \boldsymbol{\mu}_2)^t\right] = \Sigma$$
$$P(\omega_1) = P(\omega_2) = \frac{1}{2}$$
19. Bayes’ Decision Theory (cont.)
• When the environment is Gaussian, the Bayes’
classifier reduces to a linear classifier (cont.)
$$\log l(\mathbf{x}) = \log P(\mathbf{x} \mid \omega_1) - \log P(\mathbf{x} \mid \omega_2)$$
$$= -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)^t\,\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu}_1) + \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_2)^t\,\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu}_2)$$
$$= (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^t\,\Sigma^{-1}\mathbf{x} + \frac{1}{2}\left(\boldsymbol{\mu}_2^t\,\Sigma^{-1}\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1^t\,\Sigma^{-1}\boldsymbol{\mu}_1\right) = \mathbf{w}\mathbf{x} + b$$
$$\therefore \ \log l(\mathbf{x}) = \mathbf{w}\mathbf{x} + b \ \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \ 0$$
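The reduction above can be checked numerically; a minimal sketch, assuming equal priors and two illustrative means sharing one covariance matrix:

```python
import numpy as np

# With equal priors and a shared covariance, the Bayes classifier is
# the linear rule: decide omega_1 if w.x + b > 0. Parameters assumed.
mu1 = np.array([ 1.0,  1.0])
mu2 = np.array([-1.0, -1.0])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 1.0]])

Sinv = np.linalg.inv(Sigma)
w = Sinv @ (mu1 - mu2)                         # (mu1 - mu2)^t Sigma^{-1}
b = 0.5*(mu2 @ Sinv @ mu2 - mu1 @ Sinv @ mu1)  # the constant term above

def decide(x):
    return 1 if w @ x + b > 0 else 2           # sign of log l(x)

print(decide(np.array([0.5, 0.5])), decide(np.array([-2.0, 0.0])))  # -> 1 2
```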
21. Linear Machine and Minimum Distance
Classification
• Find the linear-form discriminant function for two-
class classification when the class prototypes are
known
• Example 3.1: Select the decision hyperplane that
contains the midpoint of the line segment
connecting the center points of the two classes
22. Linear Machine and Minimum Distance
Classification (cont.)
The dichotomizer's discriminant function g(x):
$$(\mathbf{x}_1 - \mathbf{x}_2)^t\left(\mathbf{x} - \frac{\mathbf{x}_1 + \mathbf{x}_2}{2}\right) = 0$$
$$(\mathbf{x}_1 - \mathbf{x}_2)^t\,\mathbf{x} + \frac{1}{2}\left(\|\mathbf{x}_2\|^2 - \|\mathbf{x}_1\|^2\right) = 0$$
Taken as
$$\begin{bmatrix} \mathbf{w} \\ w_{n+1} \end{bmatrix}^t \begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix} = 0 \ \text{(augmented input pattern)}, \quad \text{where } \mathbf{w} = \mathbf{x}_1 - \mathbf{x}_2, \ \ w_{n+1} = \frac{1}{2}\left(\|\mathbf{x}_2\|^2 - \|\mathbf{x}_1\|^2\right)$$
It is a simple minimum-distance classifier.
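A minimal sketch of this dichotomizer, using the two prototype points that appear in part (c) of a later problem (x₁ = (2, 5)ᵗ, x₂ = (−1, −3)ᵗ):

```python
import numpy as np

# Minimum-distance dichotomizer from two class prototypes:
# g(x) = (x1 - x2)^t x + 0.5*(||x2||^2 - ||x1||^2); g(x) > 0 -> class 1.
p1 = np.array([ 2.0,  5.0])   # prototype of class 1
p2 = np.array([-1.0, -3.0])   # prototype of class 2

w     = p1 - p2                    # weight vector, here (3, 8)
w_np1 = 0.5*(p2 @ p2 - p1 @ p1)    # augmented weight, here -9.5

def g(x):
    return w @ x + w_np1

print(g(p1) > 0, g(p2) > 0)        # -> True False
```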
23. Linear Machine and Minimum Distance
Classification (cont.)
• The linear-form discriminant functions for multi-
class classification
– There are up to R(R-1)/2 decision hyperplanes for R
pairwise separable classes
– Some classes may not be contiguous
[Figure: three pattern classes (o, x, Δ) scattered in the plane with their pairwise decision lines]
24. Linear Machine and Minimum Distance
Classification (cont.)
• Linear machine or minimum-distance classifier
– Assume the class prototypes are known for all classes
• Euclidean distance between input pattern x and the center of
class i, xi :
$$\|\mathbf{x} - \mathbf{x}_i\| = \sqrt{(\mathbf{x} - \mathbf{x}_i)^t(\mathbf{x} - \mathbf{x}_i)}$$
• Minimizing $\|\mathbf{x} - \mathbf{x}_i\|^2 = \mathbf{x}^t\mathbf{x} - 2\mathbf{x}_i^t\mathbf{x} + \mathbf{x}_i^t\mathbf{x}_i$ is equal to maximizing $\mathbf{x}_i^t\mathbf{x} - \frac{1}{2}\mathbf{x}_i^t\mathbf{x}_i$ (the term $\mathbf{x}^t\mathbf{x}$ is the same for all classes)
– Set the discriminant function for each class i to be
$$g_i(\mathbf{x}) = \mathbf{x}_i^t\mathbf{x} - \frac{1}{2}\mathbf{x}_i^t\mathbf{x}_i$$
– In augmented form, $g_i(\mathbf{x}) = \mathbf{w}_i^t\mathbf{y}$:
$$g_i(\mathbf{x}) = \begin{bmatrix} \mathbf{w}_i \\ w_{i,n+1} \end{bmatrix}^t \begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix}, \quad \text{where } \mathbf{w}_i = \mathbf{x}_i, \ \ w_{i,n+1} = -\frac{1}{2}\,\mathbf{x}_i^t\mathbf{x}_i$$
25. Linear Machine and Minimum Distance
Classification (cont.)
• This approach is also called correlation classification
– A 1 is appended as the (n+1)-th component of the
augmented input pattern
$$g_i(\mathbf{x}) = \mathbf{x}_i^t\mathbf{x} - \frac{1}{2}\mathbf{x}_i^t\mathbf{x}_i = \mathbf{w}_i^t\mathbf{y}$$
26. Linear Machine and Minimum Distance
Classification (cont.)
• Example 3.2: with prototype points x₁ = (10, 2)ᵗ, x₂ = (2, −5)ᵗ, x₃ = (−5, 5)ᵗ, the weight vectors are
$$\mathbf{w}_1 = \begin{bmatrix} 10 \\ 2 \\ -52 \end{bmatrix}, \quad \mathbf{w}_2 = \begin{bmatrix} 2 \\ -5 \\ -14.5 \end{bmatrix}, \quad \mathbf{w}_3 = \begin{bmatrix} -5 \\ 5 \\ -25 \end{bmatrix}$$
using $g_i(\mathbf{x}) = \mathbf{x}_i^t\mathbf{x} - \frac{1}{2}\mathbf{x}_i^t\mathbf{x}_i$:
$$g_1(\mathbf{x}) = 10x_1 + 2x_2 - 52, \quad g_2(\mathbf{x}) = 2x_1 - 5x_2 - 14.5, \quad g_3(\mathbf{x}) = -5x_1 + 5x_2 - 25$$
Decision surfaces (where $g_i(\mathbf{x}) = g_j(\mathbf{x})$):
$$S_{12}: \ 8x_1 + 7x_2 - 37.5 = 0, \qquad S_{13}: \ -15x_1 + 3x_2 + 27 = 0, \qquad S_{23}: \ -7x_1 + 10x_2 - 10.5 = 0$$
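The discriminants of Example 3.2 can be reproduced directly from the prototype points with the formula gᵢ(x) = xᵢᵗx − ½xᵢᵗxᵢ; a short sketch:

```python
import numpy as np

# Linear machine of Example 3.2: prototypes (10,2), (2,-5), (-5,5)
# yield g_1(x) = 10x1 + 2x2 - 52, etc., via g_i(x) = x_i.x - 0.5*x_i.x_i.
P = np.array([[10.0,  2.0],
              [ 2.0, -5.0],
              [-5.0,  5.0]])
bias = -0.5*np.sum(P**2, axis=1)        # -> [-52. , -14.5, -25. ]

def classify(x):
    g = P @ x + bias                    # the three discriminant values
    return int(np.argmax(g)) + 1

print(classify(np.array([10.0, 2.0])))  # the class-1 prototype -> 1
```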
27. Linear Machine and Minimum Distance
Classification (cont.)
• If R linear discriminant functions exist for a set of
patterns such that
$$g_i(\mathbf{x}) > g_j(\mathbf{x}) \quad \text{for } \mathbf{x} \in \text{Class } i, \quad i = 1, 2, \ldots, R, \ \ j = 1, 2, \ldots, R, \ \ i \neq j$$
– the classes are linearly separable
29. Linear Machine and Minimum Distance
Classification (cont.)
(a) 2x₁ − x₂ + 2 = 0: the decision surface is a line
(b) 2x₁ − x₂ + 2 = 0: the decision surface is a plane
(c) x₁ = (2, 5)ᵗ, x₂ = (−1, −3)ᵗ ⇒ the decision surface of the minimum-distance classifier is
$$(\mathbf{x}_1 - \mathbf{x}_2)^t\mathbf{x} + \tfrac{1}{2}\left(\|\mathbf{x}_2\|^2 - \|\mathbf{x}_1\|^2\right) = 0 \ \Rightarrow\ 3x_1 + 8x_2 - \tfrac{19}{2} = 0$$
(d) [Figure: panels sketching the decision surfaces of parts (a)–(c) on the x₁, x₂ (and x₃) axes, with points such as (−1, 0), (0, 2), (0, 0), and (19/6, 0) marked]
30. Linear Machine and Minimum Distance
Classification (cont.)
• Examples 3.1 and 3.2 have shown that the
coefficients (weights) of the linear
discriminant functions can be determined if
a priori information about the sets of
patterns and their class membership is
known
31. Linear Machine and Minimum Distance
Classification (cont.)
• An example of linearly non-separable patterns
33. Discrete Perceptron Training Algorithm
- Geometrical Representations
• Examine neural network classifiers that
derive/train their weights based on the error-
correction scheme
$$g(\mathbf{y}) = \mathbf{w}^t\mathbf{y}, \qquad \text{Class 1: } \mathbf{w}^t\mathbf{y} > 0, \quad \text{Class 2: } \mathbf{w}^t\mathbf{y} < 0$$
(y: the augmented input pattern)
[Figure: vector representations in the weight space]
34. Discrete Perceptron Training Algorithm
- Geometrical Representations (cont.)
• Devise an analytic approach based on the
geometrical representations
– E.g., the decision surface for the training pattern y₁ in the weight space:
$$\nabla_{\mathbf{w}}\left(\mathbf{w}^t\mathbf{y}_1\right) = \mathbf{y}_1 \quad \text{(the gradient: the direction of steepest increase)}$$
– If y₁ is in Class 1: $\mathbf{w}' = \mathbf{w}^1 + c\,\mathbf{y}_1$
– If y₁ is in Class 2: $\mathbf{w}' = \mathbf{w}^1 - c\,\mathbf{y}_1$
– c (> 0) is the correction increment (two times the learning
constant introduced before); it controls the size of the adjustment
[Figure: weight-space diagrams of the correction for y₁ in Class 1 and for y₁ in Class 2]
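A sketch of this single correction step (the constant c = 1 is an assumed default):

```python
import numpy as np

# One discrete-perceptron correction in weight space: a misclassified
# class-1 pattern pulls w along +y; a misclassified class-2 pattern
# pushes w along -y; correctly classified patterns leave w unchanged.
def correct(w, y, cls, c=1.0):
    if cls == 1 and w @ y <= 0:      # should satisfy w.y > 0
        return w + c*y
    if cls == 2 and w @ y >= 0:      # should satisfy w.y < 0
        return w - c*y
    return w                          # no correction needed
```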
35. Discrete Perceptron Training Algorithm
- Geometrical Representations (cont.)
• Weight adjustments for three augmented training patterns y₁, y₂, y₃, shown in the weight space:
$$\mathbf{y}_1 \in C_1, \quad \mathbf{y}_2 \in C_1, \quad \mathbf{y}_3 \in C_2$$
– Weights in the shaded region are the solutions
– The three lines labeled y₁, y₂, y₃ are fixed during training
[Figure: weight-space diagram of the adjustment sequence]
36. Discrete Perceptron Training Algorithm
- Geometrical Representations (cont.)
• More about the correction increment c
– It need not be merely a constant; it can be related to the
current training pattern
– Select c from the distance between w¹ and the corrected weight
vector w′: requiring the corrected weight to land on the decision plane gives
$$\left(\mathbf{w}^1 \pm c\,\mathbf{y}\right)^t\mathbf{y} = 0 \ \Rightarrow\ c = \mp\frac{\mathbf{w}^{1t}\mathbf{y}}{\mathbf{y}^t\mathbf{y}} = \frac{\left|\mathbf{w}^{1t}\mathbf{y}\right|}{\|\mathbf{y}\|^2}, \ \text{ because } c > 0$$
$$\Rightarrow\ c\,\mathbf{y} = \frac{\left|\mathbf{w}^{1t}\mathbf{y}\right|}{\|\mathbf{y}\|^2}\,\mathbf{y}$$
(the distance from w¹ to the plane $\mathbf{w}^t\mathbf{y} = 0$ is $p = \left|\mathbf{w}^{1t}\mathbf{y}\right| / \|\mathbf{y}\|$)
37. Discrete Perceptron Training Algorithm
- Geometrical Representations (cont.)
• For the fixed correction rule with c = constant, the
correction of weights is always the same fixed
portion of the current training vector
– The weights can be initialized at any value
$$\mathbf{w}' = \mathbf{w} \pm c\,\mathbf{y}, \quad \text{or equivalently} \quad \mathbf{w}' = \mathbf{w} + \Delta\mathbf{w}, \ \ \Delta\mathbf{w} = \frac{c}{2}\left[d - \mathrm{sgn}\left(\mathbf{w}^t\mathbf{y}\right)\right]\mathbf{y}$$
• For the dynamic correction rule with c dependent
on the distance from the weight (i.e., the weight
vector) to the decision surface in the weight space:
$$c\,\mathbf{y} = \frac{\left|\mathbf{w}^{1t}\mathbf{y}\right|}{\|\mathbf{y}\|^2}\,\mathbf{y}$$
– The initial weights must be different from 0
38. Discrete Perceptron Training Algorithm
- Geometrical Representations (cont.)
• Dynamic correction rule with c dependent
on the distance, scaled by a ratio λ:
$$c = \lambda\,\frac{\left|\mathbf{w}^{1t}\mathbf{y}\right|}{\|\mathbf{y}\|^2}, \qquad c\,\mathbf{y} = \lambda\,\frac{\left|\mathbf{w}^{1t}\mathbf{y}\right|}{\|\mathbf{y}\|}\cdot\frac{\mathbf{y}}{\|\mathbf{y}\|}$$
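In code, the dynamic increment might look like this (a sketch; λ is the ratio parameter above):

```python
import numpy as np

# Dynamic correction: the adjustment c*y moves the weight a fraction
# lambda of its weight-space distance |w.y|/||y|| to the plane w.y = 0
# (lambda = 1 lands on the plane, lambda = 2 reflects across it).
def dynamic_correction(w, y, lam=1.0):
    c = lam * abs(w @ y) / (y @ y)
    return c * y      # added to or subtracted from w, depending on the class
```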
39. Discrete Perceptron Training Algorithm
- Geometrical Representations (cont.)
• Example 3.3: four augmented training patterns
$$\mathbf{y}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \ \mathbf{y}_2 = \begin{bmatrix} -0.5 \\ 1 \end{bmatrix}, \ \mathbf{y}_3 = \begin{bmatrix} 3 \\ 1 \end{bmatrix}, \ \mathbf{y}_4 = \begin{bmatrix} -2 \\ 1 \end{bmatrix}$$
$$\mathbf{y}_1, \mathbf{y}_3 \in C_1, \qquad \mathbf{y}_2, \mathbf{y}_4 \in C_2$$
$$\Delta\mathbf{w}^k = \frac{c}{2}\left[d_k - \mathrm{sgn}\left(\mathbf{w}^{kt}\mathbf{y}_j\right)\right]\mathbf{y}_j$$
– What if $\mathbf{w}^{kt}\mathbf{y}_j = 0$? It is interpreted as a mistake
and followed by a correction
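A runnable sketch of this training run; the initial weight and c = 1 are assumptions, and wᵗy = 0 is treated as a mistake as noted above:

```python
import numpy as np

# Discrete perceptron training on the four patterns of Example 3.3,
# using dw = (c/2)*(d - sgn(w.y))*y; initial weight and c are assumed.
Y = [np.array([ 1.0, 1.0]), np.array([-0.5, 1.0]),   # y1 (C1), y2 (C2)
     np.array([ 3.0, 1.0]), np.array([-2.0, 1.0])]   # y3 (C1), y4 (C2)
D = [1.0, -1.0, 1.0, -1.0]                           # desired TLU outputs

w, c = np.array([-2.5, 1.75]), 1.0
for _ in range(20):                          # cycle through the training set
    for y, d in zip(Y, D):
        net = w @ y
        s = np.sign(net) if net != 0 else -d # w.y = 0 counts as a mistake
        w = w + 0.5*c*(d - s)*y
print(w, [np.sign(w @ y) for y in Y])        # final w separates the classes
```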
40. Continuous Perceptron
Training Algorithm
• Replace the TLU (threshold logic unit) with the
sigmoid activation function for two reasons:
– To gain finer control over the training procedure
– To facilitate differential characteristics, enabling
computation of the error gradient
$$\hat{\mathbf{w}} = \mathbf{w} - \eta\,\nabla E(\mathbf{w}) \qquad (\eta: \text{learning constant}, \ \nabla E: \text{error gradient})$$
41. Continuous Perceptron
Training Algorithm (cont.)
• The new weights are obtained by moving in the
direction of the negative gradient along the
multidimensional error surface
42. Continuous Perceptron
Training Algorithm (cont.)
• Define the error as the squared difference
between the desired output and the actual
output:
$$E = \frac{1}{2}(d - o)^2, \quad \text{or} \quad E = \frac{1}{2}\left[d - f\left(\mathbf{w}^t\mathbf{y}\right)\right]^2 = \frac{1}{2}\left[d - f(net)\right]^2$$
$$\nabla E(\mathbf{w}) = \frac{1}{2}\nabla\left(\left[d - f(net)\right]^2\right) = -(d - o)\,f'(net)\,\frac{\partial(net)}{\partial\mathbf{w}} = -(d - o)\,f'(net)\,\mathbf{y}$$
where
$$\nabla E(\mathbf{w}) \triangleq \begin{bmatrix} \partial E/\partial w_1 \\ \partial E/\partial w_2 \\ \vdots \\ \partial E/\partial w_{n+1} \end{bmatrix}, \qquad \frac{\partial(net)}{\partial\mathbf{w}} = \begin{bmatrix} \partial(net)/\partial w_1 \\ \partial(net)/\partial w_2 \\ \vdots \\ \partial(net)/\partial w_{n+1} \end{bmatrix} = \mathbf{y}$$
43. Continuous Perceptron
Training Algorithm (cont.)
• Bipolar continuous activation function:
$$f(net) = \frac{2}{1 + \exp(-\lambda \cdot net)} - 1$$
$$f'(net) = \frac{2\lambda\exp(-\lambda \cdot net)}{\left[1 + \exp(-\lambda \cdot net)\right]^2} = \frac{\lambda}{2}\left\{1 - \left[f(net)\right]^2\right\} = \frac{\lambda}{2}\left(1 - o^2\right)$$
$$\hat{\mathbf{w}} = \mathbf{w} + \frac{1}{2}\,\eta\lambda\,(d - o)\left(1 - o^2\right)\mathbf{y}$$
• Unipolar continuous activation function:
$$f(net) = \frac{1}{1 + \exp(-\lambda \cdot net)}$$
$$f'(net) = \frac{\lambda\exp(-\lambda \cdot net)}{\left[1 + \exp(-\lambda \cdot net)\right]^2} = \lambda\,f(net)\left[1 - f(net)\right] = \lambda\,o(1 - o)$$
$$\hat{\mathbf{w}} = \mathbf{w} + \eta\lambda\,(d - o)\,o(1 - o)\,\mathbf{y}$$
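A sketch of one delta-rule step with the bipolar activation, following the update above (η, λ, and the sample values are assumptions):

```python
import numpy as np

# One continuous-perceptron (delta rule) update, bipolar sigmoid:
# w_hat = w + 0.5*eta*lam*(d - o)*(1 - o^2)*y, with o = f(w.y).
def f_bipolar(net, lam=1.0):
    return 2.0/(1.0 + np.exp(-lam*net)) - 1.0

def delta_step(w, y, d, eta=0.5, lam=1.0):
    o = f_bipolar(w @ y, lam)                     # actual output
    return w + 0.5*eta*lam*(d - o)*(1.0 - o**2)*y # move against grad E

w = delta_step(np.array([-2.5, 1.75]), np.array([1.0, 1.0]), d=1.0)
print(w)   # the weight has moved in the negative-gradient direction
```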
45. Continuous Perceptron
Training Algorithm (cont.)
• Example 3.3: the total error surface, and trajectories started from four arbitrary initial weights
[Figure: error surface over the weight plane with four training trajectories]
46. Continuous Perceptron
Training Algorithm (cont.)
• Treat the last fixed component of input pattern
vector as the neuron activation threshold
47. Continuous Perceptron
Training Algorithm (cont.)
• R-category linear classifier using R discrete
bipolar perceptrons
– Goal: the i-th TLU responds with +1 to patterns of
class i, and all other TLUs respond with −1
("local representation")
$$\hat{\mathbf{w}}_i = \mathbf{w}_i + \frac{1}{2}\,c\,(d_i - o_i)\,\mathbf{y}$$
$$d_i = 1, \quad d_j = -1, \ \text{ for } j = 1, 2, \ldots, R, \ j \neq i$$
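A sketch of one training step for the bank of R perceptrons (values assumed; wᵗy = 0 would again be treated as a mistake in a full implementation):

```python
import numpy as np

# One update of R discrete bipolar perceptrons (local representation):
# for a pattern of class i, d_i = +1 and d_j = -1 for all j != i.
def train_step(W, y, i, c=1.0):
    d = -np.ones(W.shape[0]); d[i] = 1.0    # bipolar one-hot target
    o = np.sign(W @ y)                      # the R TLU responses
    return W + 0.5*c*np.outer(d - o, y)     # row-wise perceptron rule
```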
49. Continuous Perceptron
Training Algorithm (cont.)
• R-category linear classifier using R continuous
bipolar perceptrons
• R-category linear classifier using R continuous
bipolar perceptrons:
$$\hat{\mathbf{w}}_i = \mathbf{w}_i + \frac{1}{2}\,\eta\lambda\,(d_i - o_i)\left(1 - o_i^2\right)\mathbf{y}, \quad \text{for } i = 1, 2, \ldots, R$$
$$d_i = 1, \quad d_j = -1, \ \text{ for } j = 1, 2, \ldots, R, \ j \neq i$$
50. Continuous Perceptron
Training Algorithm (cont.)
• The error function is dependent on the difference
vector d − o
51. Bayes' Classifier vs. Perceptron
• The perceptron operates on the premise that the patterns to
be classified are linearly separable (otherwise the training
algorithm will oscillate), while the Bayes' classifier assumes
that the (Gaussian) distributions of the two classes do
overlap each other
• The perceptron is nonparametric, while the Bayes'
classifier is parametric (its derivation is contingent on the
assumption of the underlying distributions)
• The perceptron is simple and adaptive, and needs little
storage, while the Bayes' classifier could be made
adaptive, but at the expense of increased storage and
more complex computations