Pattern Recognition: Classifiers Based on Bayes Decision Theory
Sergios Theodoridis, Konstantinos Koutroumbas
Version 3
PATTERN RECOGNITION
 Typical application areas
 Machine vision
 Character recognition (OCR)
 Computer aided diagnosis
 Speech recognition
 Face recognition
 Biometrics
 Image Data Base retrieval
 Data mining
 Bioinformatics
 The task: Assign unknown objects – patterns – into the correct
class. This is known as classification.
 Features: These are measurable quantities obtained from
the patterns, and the classification task is based on their
respective values.
 Feature vectors: A number of features $x_1, \dots, x_\ell$
constitute the feature vector
$$x = [x_1, \dots, x_\ell]^T \in \mathbb{R}^\ell$$
Feature vectors are treated as random vectors.
An example: (figure)
 The classifier consists of a set of functions whose values,
computed at $x$, determine the class to which the
corresponding pattern belongs.
 Classification system overview:
Patterns → sensor → feature generation → feature selection → classifier design → system evaluation
 Supervised – unsupervised – semisupervised pattern
recognition:
The major directions of learning are:
 Supervised: Patterns whose class is known a-priori
are used for training.
 Unsupervised: The number of classes/groups is (in
general) unknown and no training patterns are
available.
 Semisupervised: A mixed type of patterns is
available. For some of them the corresponding class
is known, and for the rest it is not.
CLASSIFIERS BASED ON BAYES DECISION
THEORY
 Statistical nature of feature vectors:
$$x = [x_1, x_2, \dots, x_\ell]^T$$
 Assign the pattern represented by feature vector $x$
to the most probable of the available classes
$\omega_1, \omega_2, \dots, \omega_M$. That is,
$$x \rightarrow \omega_i : \; P(\omega_i \mid x) \text{ maximum}$$
 Computation of a-posteriori probabilities.
 Assume known:
• the a-priori probabilities
$$P(\omega_1), P(\omega_2), \dots, P(\omega_M)$$
• the class-conditional densities
$$p(x \mid \omega_i), \quad i = 1, 2, \dots, M$$
This is also known as the likelihood of $\omega_i$ w.r.t. $x$.
 The Bayes rule ($M = 2$):
$$P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, P(\omega_i)}{p(x)}$$
where
$$p(x) = \sum_{i=1}^{2} p(x \mid \omega_i)\, P(\omega_i)$$
 The Bayes classification rule (for two classes, $M = 2$):
 Given $x$, classify it according to the rule
$$\text{If } P(\omega_1 \mid x) > P(\omega_2 \mid x) \;\Rightarrow\; x \in \omega_1$$
$$\text{If } P(\omega_2 \mid x) > P(\omega_1 \mid x) \;\Rightarrow\; x \in \omega_2$$
 Equivalently: classify $x$ according to the rule
$$p(x \mid \omega_1)\, P(\omega_1) \;\gtrless\; p(x \mid \omega_2)\, P(\omega_2)$$
 For equiprobable classes the test becomes
$$p(x \mid \omega_1) \;\gtrless\; p(x \mid \omega_2)$$
$$R_1 \rightarrow \omega_1 \quad \text{and} \quad R_2 \rightarrow \omega_2$$
(figure)
 Equivalently in words: divide space into two regions:
$$\text{If } x \in R_1 \Rightarrow x \in \omega_1, \qquad \text{If } x \in R_2 \Rightarrow x \in \omega_2$$
 Probability of error (equiprobable classes, threshold $x_0$):
$$P_e = \frac{1}{2}\int_{-\infty}^{x_0} p(x \mid \omega_2)\,dx + \frac{1}{2}\int_{x_0}^{+\infty} p(x \mid \omega_1)\,dx$$
 Total shaded area. (figure)
 The Bayesian classifier is OPTIMAL with respect to
minimising the classification error probability!
Indeed: moving the threshold, the total shaded
area INCREASES by the extra "grey" area.
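The rule and the error integral above are easy to check numerically. Below is a minimal sketch for two hypothetical one-dimensional Gaussian classes; the densities, priors, and threshold are illustrative values, not part of the slides.

```python
# A minimal numeric check of the Bayes rule and its error probability for two
# hypothetical 1-D Gaussian classes (densities and priors are assumptions).
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

p1 = lambda x: norm.pdf(x, loc=0.0, scale=1.0)   # p(x|w1), assumed
p2 = lambda x: norm.pdf(x, loc=2.0, scale=1.0)   # p(x|w2), assumed
P1 = P2 = 0.5                                    # equiprobable classes

def classify(x):
    # Bayes rule: pick the class with the larger a-posteriori probability;
    # p(x) cancels, so comparing p(x|wi)P(wi) suffices.
    return 1 if p1(x) * P1 > p2(x) * P2 else 2

x0 = 1.0  # for equal priors and variances the threshold is midway between the means
Pe = P2 * quad(p2, -np.inf, x0)[0] + P1 * quad(p1, x0, np.inf)[0]
print(classify(0.3), classify(1.7))  # -> 1, 2
print(f"P_e = {Pe:.4f}")             # the total shaded area
```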
 The Bayes classification rule for many ($M > 2$) classes:
 Given $x$, classify it to $\omega_i$ if
$$P(\omega_i \mid x) > P(\omega_j \mid x) \quad \forall j \neq i$$
Such a choice also minimizes the classification error
probability.
 Minimizing the average risk:
 For each wrong decision a penalty term is assigned, since
some decisions are more sensitive than others.
For $M = 2$:
• Define the loss matrix
$$L = \begin{pmatrix} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{pmatrix}$$
• $\lambda_{12}$ is the penalty term for deciding class $\omega_2$,
although the pattern belongs to $\omega_1$, etc.
Risk with respect to $\omega_1$:
$$r_1 = \lambda_{11}\int_{R_1} p(x \mid \omega_1)\,dx + \lambda_{12}\int_{R_2} p(x \mid \omega_1)\,dx$$
Risk with respect to $\omega_2$:
$$r_2 = \lambda_{21}\int_{R_1} p(x \mid \omega_2)\,dx + \lambda_{22}\int_{R_2} p(x \mid \omega_2)\,dx$$
Average risk:
$$r = r_1 P(\omega_1) + r_2 P(\omega_2)$$
 Probabilities of wrong decisions,
weighted by the penalty terms.
 Choose $R_1$ and $R_2$ so that $r$ is minimized.
 Then assign $x$ to $\omega_1$ if
$$\ell_1 \equiv \lambda_{11}\, p(x \mid \omega_1) P(\omega_1) + \lambda_{21}\, p(x \mid \omega_2) P(\omega_2)
< \ell_2 \equiv \lambda_{12}\, p(x \mid \omega_1) P(\omega_1) + \lambda_{22}\, p(x \mid \omega_2) P(\omega_2)$$
 Equivalently: assign $x$ to $\omega_1$ ($\omega_2$) if
$$\ell_{12} \equiv \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \; > \; (<) \;\;
\frac{P(\omega_2)}{P(\omega_1)} \cdot \frac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}$$
$\ell_{12}$: likelihood ratio.
 If $P(\omega_1) = P(\omega_2) = \frac{1}{2}$ and $\lambda_{11} = \lambda_{22} = 0$:
$$x \rightarrow \omega_1 \;\text{ if }\; p(x \mid \omega_1) > \frac{\lambda_{21}}{\lambda_{12}}\, p(x \mid \omega_2)$$
$$x \rightarrow \omega_2 \;\text{ if }\; p(x \mid \omega_2) > \frac{\lambda_{12}}{\lambda_{21}}\, p(x \mid \omega_1)$$
 If $\lambda_{21} = \lambda_{12}$, this reduces to the minimum
classification error probability rule.
 An example:
$$p(x \mid \omega_1) = \frac{1}{\sqrt{\pi}} \exp(-x^2), \qquad
p(x \mid \omega_2) = \frac{1}{\sqrt{\pi}} \exp(-(x-1)^2)$$
$$P(\omega_1) = P(\omega_2) = \frac{1}{2}, \qquad
L = \begin{pmatrix} 0 & 0.5 \\ 1.0 & 0 \end{pmatrix}$$
Then the threshold value is:
Threshold $x_0$ for minimum $P_e$:
$$x_0: \; \exp(-x^2) = \exp(-(x-1)^2) \;\Rightarrow\; x_0 = \frac{1}{2}$$
Threshold $\hat{x}_0$ for minimum $r$:
$$\hat{x}_0: \; \exp(-x^2) = 2\exp(-(x-1)^2) \;\Rightarrow\; \hat{x}_0 = \frac{1 - \ln 2}{2} < \frac{1}{2}$$
Thus $\hat{x}_0$ moves to the left of $x_0 = \frac{1}{2}$
(WHY?)
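A quick numeric check of this example is shown below; the densities and loss values come from the slides, and brentq is simply one way to locate the crossing points.

```python
# Verifying the example's two thresholds with the given densities and loss
# matrix (a sketch; brentq just finds where the decision quantities cross).
import numpy as np
from scipy.optimize import brentq

p1 = lambda x: np.exp(-x**2) / np.sqrt(np.pi)        # p(x|w1)
p2 = lambda x: np.exp(-(x - 1)**2) / np.sqrt(np.pi)  # p(x|w2)

x0 = brentq(lambda x: p1(x) - p2(x), -1, 2)          # minimum-P_e threshold
xhat0 = brentq(lambda x: p1(x) - 2 * p2(x), -1, 2)   # minimum-r (lambda21=1, lambda12=0.5)
print(x0)                           # 0.5
print(xhat0, (1 - np.log(2)) / 2)   # both approx 0.1534, i.e. left of 0.5
```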
DISCRIMINANT FUNCTIONS
DECISION SURFACES
 If the regions $R_i$, $R_j$ are contiguous:
$$R_i: \; P(\omega_i \mid x) > P(\omega_j \mid x), \qquad
R_j: \; P(\omega_j \mid x) > P(\omega_i \mid x)$$
then
$$g(x) \equiv P(\omega_i \mid x) - P(\omega_j \mid x) = 0$$
is the surface separating the regions. On the one
side it is positive (+), on the other negative (−). It is
known as the Decision Surface.
 If $f(\cdot)$ is monotonically increasing, the rule remains the same if we use:
$$x \rightarrow \omega_i \;\text{ if }\; f(P(\omega_i \mid x)) > f(P(\omega_j \mid x)) \quad \forall j \neq i$$
 $g_i(x) \equiv f(P(\omega_i \mid x))$ is a discriminant function.
 In general, discriminant functions can be defined independently of the
Bayesian rule. They lead to suboptimal solutions, yet, if chosen
appropriately, they can be computationally more tractable. Moreover,
in practice, they may also lead to better solutions. This, for example,
may be the case if the nature of the underlying pdf's is unknown.
THE GAUSSIAN DISTRIBUTION
 The one-dimensional case:
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
where $\mu$ is the mean value, i.e.:
$$\mu = E[x] = \int_{-\infty}^{+\infty} x\, p(x)\, dx$$
and $\sigma^2$ is the variance:
$$\sigma^2 = E\left[(x-\mu)^2\right] = \int_{-\infty}^{+\infty} (x-\mu)^2\, p(x)\, dx$$
 The multivariate (multidimensional) case:
$$p(x) = \frac{1}{(2\pi)^{\ell/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$
where $\mu = E[x]$ is the mean value
and $\Sigma$ is known as the covariance matrix, defined as:
$$\Sigma = E\left[(x-\mu)(x-\mu)^T\right]$$
 An example: the two-dimensional case:
$$p(x) = p(x_1, x_2) = \frac{1}{2\pi |\Sigma|^{1/2}}
\exp\left(-\frac{1}{2} \begin{bmatrix} x_1 - \mu_1 & x_2 - \mu_2 \end{bmatrix}
\Sigma^{-1} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}\right)$$
where
$$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} = \begin{bmatrix} E[x_1] \\ E[x_2] \end{bmatrix}, \qquad
\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix}, \qquad
\sigma_{12} = E\left[(x_1-\mu_1)(x_2-\mu_2)\right]$$
BAYESIAN CLASSIFIER FOR NORMAL
DISTRIBUTIONS
 Multivariate Gaussian pdf:
$$p(x \mid \omega_i) = \frac{1}{(2\pi)^{\ell/2} |\Sigma_i|^{1/2}}
\exp\left(-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right)$$
where $\mu_i = E[x]$, for $x \in \omega_i$, is an $\ell \times 1$ vector, and
$$\Sigma_i = E\left[(x-\mu_i)(x-\mu_i)^T\right]$$
is the $\ell \times \ell$ covariance matrix.
 $\ln(\cdot)$ is monotonic. Define:
$$g_i(x) \equiv \ln\left(p(x \mid \omega_i) P(\omega_i)\right) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$$
$$g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) + \ln P(\omega_i) + C_i$$
$$C_i = -\frac{\ell}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma_i|$$
 Example:
$$\Sigma_i = \begin{bmatrix} \sigma_i^2 & 0 \\ 0 & \sigma_i^2 \end{bmatrix}$$
$$g_i(x) = -\frac{1}{2\sigma_i^2}(x_1^2 + x_2^2) + \frac{1}{\sigma_i^2}(\mu_{i1} x_1 + \mu_{i2} x_2)
- \frac{1}{2\sigma_i^2}(\mu_{i1}^2 + \mu_{i2}^2) + \ln P(\omega_i) + C_i$$
That is, $g_i(x)$ is quadratic and the surfaces
$$g_i(x) - g_j(x) = 0$$
are quadrics: ellipsoids, parabolas, hyperbolas,
pairs of lines.
 Example 1: (figure)
 Example 2: (figure)
 Decision Hyperplanes
The quadratic terms $x^T \Sigma_i^{-1} x$:
If ALL $\Sigma_i = \Sigma$ (the same), the quadratic
terms are not of interest. They are not
involved in the comparisons. Then, equivalently,
we can write:
$$g_i(x) = w_i^T x + w_{i0}$$
$$w_i = \Sigma^{-1} \mu_i, \qquad w_{i0} = \ln P(\omega_i) - \frac{1}{2}\, \mu_i^T \Sigma^{-1} \mu_i$$
The discriminant functions are LINEAR.
 Let in addition $\Sigma = \sigma^2 I$. Then:
$$g_i(x) = \frac{1}{\sigma^2}\, \mu_i^T x + w_{i0}$$
$$g_{ij}(x) \equiv g_i(x) - g_j(x) = w^T (x - x_0) = 0$$
where
$$w = \mu_i - \mu_j$$
$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \sigma^2 \ln\left(\frac{P(\omega_i)}{P(\omega_j)}\right)
\frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|^2}$$
Remark:
• If $P(\omega_1) = P(\omega_2)$, then
$$x_0 = \frac{1}{2}(\mu_1 + \mu_2)$$
• If $P(\omega_1) \neq P(\omega_2)$, the linear classifier moves towards the
class with the smaller probability. (figure)
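A numeric sketch of the formulas above follows; the means, variance, and priors are made-up values, chosen so the shift of $x_0$ towards the less probable class is visible.

```python
# A sketch of the shared-covariance linear discriminant (Sigma = sigma^2 I);
# means, sigma^2 and priors below are illustrative assumptions.
import numpy as np

mu_i, mu_j = np.array([0.0, 0.0]), np.array([2.0, 2.0])
sigma2 = 1.0
Pi, Pj = 0.7, 0.3                     # unequal priors

w = mu_i - mu_j
d = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - sigma2 * np.log(Pi / Pj) * d / (d @ d)

g_ij = lambda x: w @ (x - x0)         # decide w_i if positive, w_j if negative
print(x0)                             # [1.212, 1.212]: shifted towards mu_j
print(g_ij(np.array([0.5, 0.5])) > 0) # True  -> w_i
print(g_ij(np.array([1.8, 1.8])) > 0) # False -> w_j
```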
Nondiagonal $\Sigma$:
• $$g_{ij}(x) = w^T (x - x_0) = 0$$
• $$w = \Sigma^{-1} (\mu_i - \mu_j)$$
• $$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \ln\left(\frac{P(\omega_i)}{P(\omega_j)}\right)
\frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|^2_{\Sigma^{-1}}}$$
where
$$\|x\|_{\Sigma^{-1}} \equiv \left(x^T \Sigma^{-1} x\right)^{1/2}$$
The decision hyperplane is normal to $\Sigma^{-1}(\mu_i - \mu_j)$,
not to $\mu_i - \mu_j$.
 Minimum Distance Classifiers
 For equiprobable classes, $P(\omega_i) = \frac{1}{M}$:
$$g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma^{-1} (x-\mu_i)$$
 For $\Sigma = \sigma^2 I$: assign $x \rightarrow \omega_i$ with the smaller
Euclidean distance:
$$d_E = \|x - \mu_i\|$$
 For nondiagonal $\Sigma$: assign $x \rightarrow \omega_i$ with the smaller
Mahalanobis distance:
$$d_m = \left((x-\mu_i)^T \Sigma^{-1} (x-\mu_i)\right)^{1/2}$$
(figures)
 Example: Given $P(\omega_1) = P(\omega_2)$ and
$$p(x \mid \omega_1) = N(\mu_1, \Sigma), \qquad p(x \mid \omega_2) = N(\mu_2, \Sigma)$$
$$\mu_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad
\mu_2 = \begin{bmatrix} 3 \\ 3 \end{bmatrix}, \quad
\Sigma = \begin{bmatrix} 1.1 & 0.3 \\ 0.3 & 1.9 \end{bmatrix}$$
classify the vector $x = \begin{bmatrix} 1.0 \\ 2.2 \end{bmatrix}$ using Bayesian classification:
$$\Sigma^{-1} = \begin{bmatrix} 0.95 & -0.15 \\ -0.15 & 0.55 \end{bmatrix}$$
Compute the Mahalanobis distances from $\mu_1$, $\mu_2$:
$$d^2_{m,1} = [1.0,\; 2.2]\, \Sigma^{-1} \begin{bmatrix} 1.0 \\ 2.2 \end{bmatrix} = 2.952, \qquad
d^2_{m,2} = [-2.0,\; -0.8]\, \Sigma^{-1} \begin{bmatrix} -2.0 \\ -0.8 \end{bmatrix} = 3.672$$
Classify $x \rightarrow \omega_1$. Observe that $d_{E,2} < d_{E,1}$.
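The arithmetic of this example is easy to reproduce; the short check below uses exactly the numbers from the slides.

```python
# Verification of the slides' Mahalanobis example (all numbers from the slides).
import numpy as np

mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
Sigma = np.array([[1.1, 0.3], [0.3, 1.9]])
Si = np.linalg.inv(Sigma)            # [[0.95, -0.15], [-0.15, 0.55]]
x = np.array([1.0, 2.2])

d2 = lambda v: v @ Si @ v            # squared Mahalanobis distance
print(d2(x - mu1), d2(x - mu2))      # 2.952, 3.672 -> classify to w1
# Euclidean distances would favor w2 instead:
print(np.linalg.norm(x - mu1), np.linalg.norm(x - mu2))  # 2.417 > 2.154
```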
ESTIMATION OF UNKNOWN PROBABILITY
DENSITY FUNCTIONS
 Maximum Likelihood
 Let $x_1, x_2, \dots, x_N$ be known and independent, drawn from
$p(x)$, which is known within an unknown vector parameter $\theta$:
$$p(x) \equiv p(x; \theta)$$
The likelihood of $\theta$ w.r.t. $X = \{x_1, x_2, \dots, x_N\}$ is
$$p(X; \theta) \equiv p(x_1, x_2, \dots, x_N; \theta) = \prod_{k=1}^{N} p(x_k; \theta)$$
 The method:
$$\hat{\theta}_{ML} := \arg\max_{\theta} \prod_{k=1}^{N} p(x_k; \theta)$$
$$L(\theta) \equiv \ln p(X; \theta) = \sum_{k=1}^{N} \ln p(x_k; \theta)$$
$$\hat{\theta}_{ML}: \quad \frac{\partial L(\theta)}{\partial \theta}
= \sum_{k=1}^{N} \frac{1}{p(x_k; \theta)}\, \frac{\partial p(x_k; \theta)}{\partial \theta} = 0$$
(figure)
Asymptotically unbiased and consistent:
If, indeed, there is a $\theta_0$ such that $p(x) = p(x; \theta_0)$, then
$$\lim_{N \to \infty} E[\hat{\theta}_{ML}] = \theta_0, \qquad
\lim_{N \to \infty} E\left[\|\hat{\theta}_{ML} - \theta_0\|^2\right] = 0$$
 Example: $p(x)$: $N(\mu, \Sigma)$, with $\mu$ unknown and $x_1, x_2, \dots, x_N$ given:
$$p(x_k; \mu) = \frac{1}{(2\pi)^{\ell/2} |\Sigma|^{1/2}}
\exp\left(-\frac{1}{2}(x_k-\mu)^T \Sigma^{-1} (x_k-\mu)\right)$$
$$L(\mu) \equiv \ln \prod_{k=1}^{N} p(x_k; \mu)
= C - \frac{1}{2} \sum_{k=1}^{N} (x_k-\mu)^T \Sigma^{-1} (x_k-\mu)$$
$$\frac{\partial L(\mu)}{\partial \mu} = \sum_{k=1}^{N} \Sigma^{-1} (x_k - \mu) = 0
\;\Rightarrow\; \hat{\mu}_{ML} = \frac{1}{N} \sum_{k=1}^{N} x_k$$
Remember: if $A = A^T$, then
$$\frac{\partial (x^T A x)}{\partial x} = 2 A x$$
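The closed-form result, that the ML estimate of a Gaussian mean is the sample mean, can be seen on synthetic data; the true mean and covariance below are invented for the demo.

```python
# Tiny illustration that the ML estimate of a Gaussian mean is the sample
# mean (synthetic data; the "true" parameters are assumptions for the demo).
import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
X = rng.multivariate_normal(mu_true, Sigma, size=5000)

mu_ml = X.mean(axis=0)   # (1/N) * sum_k x_k
print(mu_ml)             # close to [1.0, -2.0]
```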
 Maximum A-posteriori Probability Estimation
 In the ML method, $\theta$ was considered a parameter.
Here we shall look at $\theta$ as a random vector
described by a pdf $p(\theta)$, assumed to be known.
 Given $X = \{x_1, x_2, \dots, x_N\}$,
compute the maximum of $p(\theta \mid X)$.
 From the Bayes theorem:
$$p(\theta)\, p(X \mid \theta) = p(X)\, p(\theta \mid X)
\quad \text{or} \quad
p(\theta \mid X) = \frac{p(\theta)\, p(X \mid \theta)}{p(X)}$$
The method:
$$\hat{\theta}_{MAP} = \arg\max_{\theta} p(\theta)\, p(X \mid \theta)
\quad \text{or} \quad
\hat{\theta}_{MAP}: \; \frac{\partial}{\partial \theta}\left(p(\theta)\, p(X \mid \theta)\right) = 0$$
If $p(\theta)$ is uniform or broad enough:
$$\hat{\theta}_{MAP} \approx \hat{\theta}_{ML}$$
(figure)
 Example: $p(x)$: $N(\mu, \sigma^2 I)$, with $\mu$ unknown and $X = \{x_1, \dots, x_N\}$:
$$p(\mu) = \frac{1}{(2\pi)^{\ell/2} \sigma_\mu^\ell}
\exp\left(-\frac{\|\mu - \mu_0\|^2}{2\sigma_\mu^2}\right)$$
The MAP estimate is given by
$$\frac{\partial}{\partial \mu} \ln\left(p(\mu) \prod_{k=1}^{N} p(x_k \mid \mu)\right) = 0, \;\text{ or}$$
$$\frac{1}{\sigma^2} \sum_{k=1}^{N} (x_k - \hat{\mu}_{MAP})
- \frac{1}{\sigma_\mu^2}(\hat{\mu}_{MAP} - \mu_0) = 0
\;\Rightarrow\;
\hat{\mu}_{MAP} = \frac{\mu_0 + \frac{\sigma_\mu^2}{\sigma^2} \sum_{k=1}^{N} x_k}
{1 + \frac{\sigma_\mu^2}{\sigma^2} N}$$
For $\frac{\sigma_\mu^2}{\sigma^2} \gg 1$, or for $N \to \infty$:
$$\hat{\mu}_{MAP} \approx \hat{\mu}_{ML} = \frac{1}{N} \sum_{k=1}^{N} x_k$$
(figure)
 Bayesian Inference
 ML and MAP compute a single estimate for $\theta$.
Here a different route is followed.
 Given $X = \{x_1, \dots, x_N\}$, $p(x \mid \theta)$, and the prior $p(\theta)$,
the goal is to estimate $p(x \mid X)$. How?
$$p(x \mid X) = \int p(x \mid \theta)\, p(\theta \mid X)\, d\theta$$
$$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)}
= \frac{p(X \mid \theta)\, p(\theta)}{\int p(X \mid \theta)\, p(\theta)\, d\theta}$$
$$p(X \mid \theta) = \prod_{k=1}^{N} p(x_k \mid \theta)$$
A bit more insight via an example. Let
$$p(x \mid \mu): \; N(\mu, \sigma^2), \qquad p(\mu): \; N(\mu_0, \sigma_0^2)$$
It turns out that:
$$p(\mu \mid X): \; N(\mu_N, \sigma_N^2)$$
where
$$\mu_N = \frac{N \sigma_0^2\, \bar{x} + \sigma^2 \mu_0}{N \sigma_0^2 + \sigma^2}, \qquad
\sigma_N^2 = \frac{\sigma_0^2\, \sigma^2}{N \sigma_0^2 + \sigma^2}, \qquad
\bar{x} = \frac{1}{N} \sum_{k=1}^{N} x_k$$
The previous formulae correspond to a sequence
of Gaussians for different values of $N$.
 Example: Prior information: $\mu_0 = 0$, $\sigma_0^2 = 8$;
true mean $\mu = 2$. (figure)
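The posterior-update formulas above can be watched converging numerically; the sketch below uses the slides' prior ($\mu_0 = 0$, $\sigma_0^2 = 8$, true mean 2), with the data noise level chosen for illustration.

```python
# Numeric sketch of the Gaussian posterior updates above; prior values follow
# the slides' example, the noise level sigma is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma = 2.0, 1.0            # data model N(mu, sigma^2); mu unknown
mu0, sigma0 = 0.0, np.sqrt(8.0)      # prior N(mu0, sigma0^2)

for N in (1, 10, 100, 1000):
    x = rng.normal(mu_true, sigma, size=N)
    xbar = x.mean()
    muN = (N * sigma0**2 * xbar + sigma**2 * mu0) / (N * sigma0**2 + sigma**2)
    sigmaN2 = (sigma0**2 * sigma**2) / (N * sigma0**2 + sigma**2)
    print(N, round(muN, 3), round(sigmaN2, 4))  # posterior narrows around 2.0
```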
 Maximum Entropy Method
 Compute the pdf so as to be maximally non-
committal to the unavailable information, while
constrained to respect the available information.
 The above is equivalent to maximizing uncertainty,
i.e., entropy, subject to the available constraints.
 Entropy:
$$H = -\int p(x) \ln p(x)\, dx$$
$$\hat{p}(x): \; H \text{ maximum, subject to the available constraints}$$
Example: $x$ is nonzero in the interval $x_1 \leq x \leq x_2$
and zero otherwise. Compute the ME pdf.
• The constraint:
$$\int_{x_1}^{x_2} p(x)\, dx = 1$$
• Lagrange multipliers:
$$H_L = H + \lambda\left(\int_{x_1}^{x_2} p(x)\, dx - 1\right)$$
• Maximizing yields
$$\hat{p}(x) = \exp(\lambda - 1)$$
$$\hat{p}(x) = \begin{cases} \dfrac{1}{x_2 - x_1}, & x_1 \leq x \leq x_2 \\ 0, & \text{otherwise} \end{cases}$$
• This is most natural. The "most random"
pdf is the uniform one. This complies
with the Maximum Entropy rationale.
It turns out that, if the constraints are the
mean value and the variance, then the
Maximum Entropy estimate is the Gaussian pdf.
That is, the Gaussian pdf is the "most
random" one among all pdfs with the
same mean and variance.
 Mixture Models
$$p(x) = \sum_{j=1}^{J} p(x \mid j)\, P_j, \qquad
\sum_{j=1}^{J} P_j = 1, \qquad \int_x p(x \mid j)\, dx = 1$$
 Assume parametric modeling, i.e., $p(x \mid j; \theta)$.
The goal is to estimate $\theta$ and $P_1, P_2, \dots, P_J$,
given a set $X = \{x_1, x_2, \dots, x_N\}$.
 Why not ML, as before?
$$\max_{\theta, P_1, \dots, P_J} \prod_{k=1}^{N} p(x_k; \theta, P_1, \dots, P_J)$$
This is a nonlinear problem due to the missing
label information. It is a typical problem with
an incomplete data set.
The Expectation-Maximisation (EM) algorithm:
• General formulation:
Let $y \in Y \subseteq R^m$ be the complete data set, with pdf $p_y(y; \theta)$,
which is not observed directly.
We observe
$$x = g(y) \in X_{ob} \subseteq R^\ell, \; \ell < m, \quad \text{with pdf } p_x(x; \theta)$$
a many-to-one transformation.
• Let
$$p_x(x; \theta) = \int_{Y(x)} p_y(y; \theta)\, dy$$
where $Y(x)$ is the set of all $y$'s mapping to a specific $x$.
• What we need is to compute
$$\hat{\theta}_{ML}: \; \sum_k \frac{\partial \ln p_y(y_k; \theta)}{\partial \theta} = 0$$
• But the $y_k$'s are not observed. Here comes the EM:
maximize the expectation of the loglikelihood,
conditioned on the observed samples and the
current iteration estimate of $\theta$.
The algorithm:
• E-step:
$$Q(\theta; \theta(t)) = E\left[\sum_k \ln p_y(y_k; \theta) \,\Big|\, X; \theta(t)\right]$$
• M-step:
$$\theta(t+1): \; \frac{\partial Q(\theta; \theta(t))}{\partial \theta} = 0$$
Application to the mixture modeling problem:
• Complete data: $(x_k, j_k)$, $k = 1, 2, \dots, N$
• Observed data: $x_k$, $k = 1, 2, \dots, N$
• $$p(x_k, j_k; \theta) = p(x_k \mid j_k; \theta)\, P_{j_k}$$
• Assuming mutual independence:
$$L(\theta) = \sum_{k=1}^{N} \ln\left(p(x_k \mid j_k; \theta)\, P_{j_k}\right)$$
• Unknown parameters:
$$\Theta^T = [\theta^T, P^T], \qquad P^T = [P_1, P_2, \dots, P_J]$$
• E-step:
$$Q(\Theta; \Theta(t)) = E\left[\sum_{k=1}^{N} \ln\left(p(x_k \mid j_k; \theta)\, P_{j_k}\right)\right]
= \sum_{k=1}^{N} \sum_{j=1}^{J} P(j \mid x_k; \Theta(t)) \ln\left(p(x_k \mid j; \theta)\, P_j\right)$$
$$P(j \mid x_k; \Theta(t)) = \frac{p(x_k \mid j; \theta(t))\, P_j(t)}{p(x_k; \Theta(t))}, \qquad
p(x_k; \Theta(t)) = \sum_{j=1}^{J} p(x_k \mid j; \theta(t))\, P_j(t)$$
• M-step:
$$\frac{\partial Q}{\partial \theta} = 0, \qquad
\frac{\partial Q}{\partial P_j} = 0, \; j = 1, 2, \dots, J, \;\text{ subject to } \sum_j P_j = 1$$
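Below is a compact sketch of these E- and M-steps for a one-dimensional Gaussian mixture; the data, component count, and initialization are made up for illustration, and the closed-form M-step updates are the standard ones for Gaussian components.

```python
# Compact EM sketch for a 1-D Gaussian mixture, following the E/M steps above
# (two components; all starting values are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1.5, 700)])  # synthetic

J = 2
P = np.full(J, 1 / J)          # mixing probabilities P_j
mu = np.array([-1.0, 1.0])     # component means (initial guesses)
var = np.array([1.0, 1.0])     # component variances

for t in range(100):
    # E-step: responsibilities P(j | x_k; Theta(t))
    dens = np.exp(-(x[:, None] - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    resp = dens * P
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: closed-form updates for Gaussian components
    Nj = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / Nj
    var = (resp * (x[:, None] - mu)**2).sum(axis=0) / Nj
    P = Nj / len(x)

print(P.round(3), mu.round(3), var.round(3))  # approx [0.3, 0.7], [-2, 3], [1, 2.25]
```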
 Nonparametric Estimation
$$\hat{p}(x) \approx \hat{p}(\hat{x}) = \frac{1}{h} \cdot \frac{k_N}{N},
\qquad \hat{x} - \frac{h}{2} \leq x \leq \hat{x} + \frac{h}{2}$$
where $k_N$ is the number of points inside the interval, out of $N$ in total.
 In words: place a segment of length $h$ at $\hat{x}$
and count the points inside it.
 If $p(x)$ is continuous, $\hat{p}(x) \to p(x)$ as $N \to \infty$, if
$$h_N \to 0, \qquad k_N \to \infty, \qquad \frac{k_N}{N} \to 0$$
 Parzen Windows
 Place at $x$ a hypercube of side length $h$ and count the
points inside it.
 Define
$$\phi(x_i) = \begin{cases} 1, & |x_{ij}| \leq \frac{1}{2} \\ 0, & \text{otherwise} \end{cases}$$
• That is, it is 1 inside a unit-side hypercube centered
at 0.
• $$\hat{p}(x) = \frac{1}{h^\ell} \cdot \frac{1}{N} \sum_{i=1}^{N} \phi\left(\frac{x_i - x}{h}\right)$$
• In words:
$$\hat{p}(x) = \frac{1}{\text{volume}} \cdot \frac{1}{N} \cdot
\left(\text{number of points inside an } h\text{-side hypercube centered at } x\right)$$
• The problem: $p(x)$ is continuous, but $\phi(\cdot)$, and hence $\hat{p}(x)$, is discontinuous.
• Parzen windows - kernels - potential functions: take $\phi(\cdot)$ smooth, with
$$\phi(x) \geq 0 \quad \text{and} \quad \int_x \phi(x)\, dx = 1$$
Mean value:
• $$E[\hat{p}(x)] = \frac{1}{h^\ell} \cdot \frac{1}{N} \sum_{i=1}^{N}
E\left[\phi\left(\frac{x_i - x}{h}\right)\right]
= \frac{1}{h^\ell} \int_{x'} \phi\left(\frac{x' - x}{h}\right) p(x')\, dx'$$
• As $h \to 0$, the width of $\frac{1}{h^\ell}\phi\left(\frac{x'-x}{h}\right)$ tends to 0, while
$$\frac{1}{h^\ell} \int_{x'} \phi\left(\frac{x' - x}{h}\right) dx' = 1$$
• Hence
$$\frac{1}{h^\ell}\, \phi\left(\frac{x' - x}{h}\right) \xrightarrow{h \to 0} \delta(x' - x)$$
• $$E[\hat{p}(x)] \to \int_{x'} \delta(x' - x)\, p(x')\, dx' = p(x)$$
Hence the estimate is unbiased in the limit.
Variance:
• The smaller the $h$, the higher the variance.
(figures: $h=0.1$, $N=1000$; $h=0.8$, $N=1000$; $h=0.1$, $N=10000$)
The higher the $N$, the better the accuracy.
If
$$h \to 0, \qquad N \to \infty, \qquad hN \to \infty$$
the estimate is asymptotically unbiased.
The method:
• Remember:
$$\ell_{12} \equiv \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \;\gtrless\;
\frac{P(\omega_2)}{P(\omega_1)} \cdot \frac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}$$
• $$\ell_{12} \approx
\frac{\dfrac{1}{N_1 h^\ell} \sum_{i=1}^{N_1} \phi\left(\dfrac{x_i - x}{h}\right)}
{\dfrac{1}{N_2 h^\ell} \sum_{i=1}^{N_2} \phi\left(\dfrac{x_i - x}{h}\right)}$$
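A minimal Parzen-window estimator is sketched below; it uses a smooth Gaussian kernel for $\phi(\cdot)$, and the bandwidth $h$ and sample size are illustrative choices.

```python
# Minimal 1-D Parzen-window density estimate with a Gaussian kernel
# (bandwidth h is an assumption; data drawn from a known pdf for comparison).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
data = rng.normal(0.0, 1.0, 2000)    # samples from the (here known) true pdf

def parzen(x, samples, h):
    # p_hat(x) = (1/(N h)) * sum_i phi((x_i - x)/h), with smooth Gaussian phi
    return norm.pdf((samples - x) / h).sum() / (len(samples) * h)

for x in (-1.0, 0.0, 1.0):
    print(x, round(parzen(x, data, h=0.3), 3), round(norm.pdf(x), 3))
```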
 CURSE OF DIMENSIONALITY
In all the methods, so far, we saw that the higher
the number of points, N, the better the resulting
estimate.
If, in the one-dimensional space, an interval filled
with $N$ points is adequate (for good estimation), then in
the two-dimensional space the corresponding square
will require $N^2$ points, and in the ℓ-dimensional space the
ℓ-dimensional cube will require $N^\ell$ points.
The exponential increase in the number of necessary
points is known as the curse of dimensionality. This
is a major problem one is confronted with in high
An Example :
66
67
 NAIVE – BAYES CLASSIFIER
Let $x \in \mathbb{R}^\ell$; the goal is to estimate $p(x \mid \omega_i)$,
$i = 1, 2, \dots, M$. For a "good" estimate of the pdf
one would need, say, $N^\ell$ points.
Assume $x_1, x_2, \dots, x_\ell$ to be mutually independent. Then:
$$p(x \mid \omega_i) = \prod_{j=1}^{\ell} p(x_j \mid \omega_i)$$
In this case, one would require, roughly, $N$ points
for each pdf. Thus, a number of points of the
order $N \cdot \ell$ would suffice.
It turns out that the Naïve – Bayes classifier
works reasonably well even in cases that violate
the independence assumption.
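The independence assumption turns the $\ell$-dimensional estimation problem into $\ell$ one-dimensional ones, as in this bare-bones Gaussian Naive-Bayes sketch; the two-class synthetic data and equal priors are assumptions for the demo.

```python
# Bare-bones Gaussian Naive-Bayes: per-class, per-feature 1-D Gaussians
# multiplied under the independence assumption (synthetic two-class data).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
X1 = rng.normal([0, 0], 1.0, (200, 2))   # class w1 samples (assumed)
X2 = rng.normal([2, 2], 1.0, (200, 2))   # class w2 samples (assumed)

def fit(X):
    # per-feature mean and std for one class
    return X.mean(axis=0), X.std(axis=0)

stats = [fit(X1), fit(X2)]
priors = [0.5, 0.5]

def predict(x):
    # p(x|wi) approx prod_j p(x_j|wi); compare p(x|wi) P(wi)
    scores = [P * norm.pdf(x, m, s).prod() for (m, s), P in zip(stats, priors)]
    return int(np.argmax(scores)) + 1

print(predict(np.array([0.2, -0.1])), predict(np.array([1.9, 2.3])))  # 1, 2
```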
 K Nearest Neighbor Density Estimation
In Parzen:
• The volume is constant
• The number of points in the volume is varying
Now:
• Keep the number of points $k_N = k$ constant
• Leave the volume to be varying
• $$\hat{p}(x) = \frac{k}{N\, V(x)}$$
• $$\ell_{12} \approx \frac{\dfrac{k}{N_1 V_1(x)}}{\dfrac{k}{N_2 V_2(x)}}
= \frac{N_2 V_2(x)}{N_1 V_1(x)}$$
(figure)
 The Nearest Neighbor Rule
 Choose $k$ out of the $N$ training vectors: identify the $k$
nearest ones to $x$.
 Out of these $k$, identify $k_i$ that belong to class $\omega_i$.
 $$\text{Assign } x \rightarrow \omega_i: \; k_i > k_j \quad \forall i \neq j$$
 The simplest version: $k = 1$!
 For large $N$ this is not bad. It can be shown that
if $P_B$ is the optimal Bayesian error probability, then:
$$P_B \leq P_{NN} \leq P_B\left(2 - \frac{M}{M-1}\, P_B\right) \leq 2 P_B$$
$$P_B \leq P_{kNN} \leq P_B + \sqrt{\frac{2 P_{NN}}{k}}$$
$$k \to \infty \;\Rightarrow\; P_{kNN} \to P_B$$
 For small $P_B$:
$$P_{NN} \approx 2 P_B, \qquad P_{3NN} \approx P_B + 3 (P_B)^2$$
 An example: (figure)
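The rule itself is a few lines of code; below is a brute-force sketch on synthetic two-class data with hypothetical labels.

```python
# Tiny k-NN classifier sketch (brute-force distances; data are synthetic).
import numpy as np

rng = np.random.default_rng(5)
Xtr = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
ytr = np.array([1] * 100 + [2] * 100)   # two classes, hypothetical labels

def knn_predict(x, k=3):
    # find the k nearest training vectors to x, then take a majority vote
    idx = np.argsort(np.linalg.norm(Xtr - x, axis=1))[:k]
    return np.bincount(ytr[idx]).argmax()

print(knn_predict(np.array([0.5, 0.0])), knn_predict(np.array([2.8, 3.2])))  # 1, 2
```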
 Voronoi tessellation: (figure)
$$R_i = \left\{x : d(x, x_i) < d(x, x_j), \; \forall j \neq i\right\}$$
BAYESIAN NETWORKS
 Bayes Probability Chain Rule:
$$p(x_1, x_2, \dots, x_\ell) = p(x_\ell \mid x_{\ell-1}, \dots, x_1)\,
p(x_{\ell-1} \mid x_{\ell-2}, \dots, x_1) \cdots p(x_2 \mid x_1)\, p(x_1)$$
Assume now that the conditional dependence for
each $x_i$ is limited to a subset of the features
appearing in each of the product terms. That is:
$$p(x_1, x_2, \dots, x_\ell) = p(x_1) \prod_{i=2}^{\ell} p(x_i \mid A_i)$$
where
$$A_i \subseteq \{x_{i-1}, x_{i-2}, \dots, x_1\}$$
For example, if $\ell = 6$, then we could assume:
$$p(x_6 \mid x_5, \dots, x_1) = p(x_6 \mid x_5, x_4)$$
Then:
$$A_6 = \{x_5, x_4\} \subset \{x_5, \dots, x_1\}$$
The above is a generalization of the Naïve – Bayes.
For the Naïve – Bayes the assumption is:
$$A_i = \emptyset, \quad \text{for } i = 1, 2, \dots, \ell$$
A graphical way to portray conditional dependencies
is given below. (figure)
According to this figure we have that:
• $x_6$ is conditionally dependent on $x_4$, $x_5$
• $x_5$ on $x_4$
• $x_4$ on $x_1$, $x_2$
• $x_3$ on $x_2$
• $x_1$, $x_2$ are conditionally
independent of the other variables.
For this case:
$$p(x_1, x_2, \dots, x_6) = p(x_6 \mid x_5, x_4)\, p(x_5 \mid x_4)\,
p(x_4 \mid x_1, x_2)\, p(x_3 \mid x_2)\, p(x_2)\, p(x_1)$$
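The factorization above maps directly to code: the joint probability of any configuration is just the product of the local tables. Below is a sketch for binary variables with this exact graph structure; every CPT number is made up for illustration.

```python
# Sketch of the factorized joint above for binary variables; all conditional
# probability table (CPT) values are invented for illustration.
import itertools

p1 = {0: 0.6, 1: 0.4}                                  # p(x1)
p2 = {0: 0.7, 1: 0.3}                                  # p(x2)
p3 = {(0,): {0: 0.8, 1: 0.2}, (1,): {0: 0.3, 1: 0.7}}  # p(x3 | x2)
p4 = {(a, b): {0: 0.9 - 0.3 * (a + b), 1: 0.1 + 0.3 * (a + b)}
      for a in (0, 1) for b in (0, 1)}                 # p(x4 | x1, x2)
p5 = {(0,): {0: 0.5, 1: 0.5}, (1,): {0: 0.2, 1: 0.8}}  # p(x5 | x4)
p6 = {(a, b): {0: 0.6, 1: 0.4} if a == b else {0: 0.1, 1: 0.9}
      for a in (0, 1) for b in (0, 1)}                 # p(x6 | x5, x4)

def joint(x1, x2, x3, x4, x5, x6):
    # p(x1..x6) = p(x6|x5,x4) p(x5|x4) p(x4|x1,x2) p(x3|x2) p(x2) p(x1)
    return (p6[(x5, x4)][x6] * p5[(x4,)][x5] * p4[(x1, x2)][x4]
            * p3[(x2,)][x3] * p2[x2] * p1[x1])

# sanity check: the joint sums to 1 over all 2^6 configurations
print(sum(joint(*c) for c in itertools.product((0, 1), repeat=6)))  # 1.0
```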
 Bayesian Networks
Definition: A Bayesian Network is a directed acyclic
graph (DAG) where the nodes correspond to random
variables. Each node is associated with a set of
conditional probabilities (densities), p(xi|Ai), where xi
is the variable associated with the node and Ai is the
set of its parents in the graph.
A Bayesian Network is specified by:
• The marginal probabilities of its root nodes.
• The conditional probabilities of the non-root nodes,
given their parents, for ALL possible values of the
involved variables.
The figure below is an example of a Bayesian
Network corresponding to a paradigm from the
medical applications field.
This Bayesian network
models conditional
dependencies for an
example concerning
smokers (S),
tendencies to develop
cancer (C) and heart
disease (H), together
with variables
corresponding to heart
(H1, H2) and cancer
(C1, C2) medical tests.
Once a DAG has been constructed, the joint
probability can be obtained by multiplying the
marginal (root nodes) and the conditional (non-root
nodes) probabilities.
Training: Once a topology is given, probabilities are
estimated via the training data set. There are also
methods that learn the topology.
Probability Inference: This is the most common task
that Bayesian networks help us to solve efficiently.
Given the values of some of the variables in the
graph, known as evidence, the goal is to compute
the conditional probabilities for some of the other
variables, given the evidence.
 Example: Consider the Bayesian network of the
figure:
a) If x is measured to be x=1 (x1), compute
P(w=0|x=1) [P(w0|x1)].
b) If w is measured to be w=1 (w1) compute
P(x=0|w=1) [ P(x0|w1)].
For a), a set of calculations is required that
propagates from node x to node w. It turns out that
P(w0|x1) = 0.63.
For b), the propagation is reversed in direction. It
turns out that P(x0|w1) = 0.4.
In general, the required inference information is
computed via a combined process of “message
passing” among the nodes of the DAG.
Complexity:
For singly connected graphs, message passing
algorithms amount to a complexity linear in the
number of nodes.