i2ml-chap5-v1-1.ppt
- 3. Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Multivariate Data
Multiple measurements (sensors)
d inputs/features/attributes: d-variate
N instances/observations/examples
The data can be collected in an N \times d data matrix, where row t holds the d measurements of instance t:

X = \begin{bmatrix}
X_1^1 & X_2^1 & \cdots & X_d^1 \\
X_1^2 & X_2^2 & \cdots & X_d^2 \\
\vdots &        &        & \vdots \\
X_1^N & X_2^N & \cdots & X_d^N
\end{bmatrix}
- 4.
Multivariate Parameters
Mean: E[x] = \mu = [\mu_1, \ldots, \mu_d]^T

Covariance: \sigma_{ij} \equiv \mathrm{Cov}(X_i, X_j)

Correlation: \mathrm{Corr}(X_i, X_j) \equiv \rho_{ij} = \dfrac{\sigma_{ij}}{\sigma_i \sigma_j}

\Sigma \equiv \mathrm{Cov}(X) = E\big[(X - \mu)(X - \mu)^T\big] = \begin{bmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\
\sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\
\vdots &        &        & \vdots \\
\sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2
\end{bmatrix}
- 5.
Parameter Estimation
Sample mean m: m_i = \dfrac{\sum_{t=1}^N x_i^t}{N}, \quad i = 1, \ldots, d

Covariance matrix S: s_i^2 = \dfrac{\sum_{t=1}^N (x_i^t - m_i)^2}{N}, \quad s_{ij} = \dfrac{\sum_{t=1}^N (x_i^t - m_i)(x_j^t - m_j)}{N}

Correlation matrix R: r_{ij} = \dfrac{s_{ij}}{s_i s_j}
- 6.
Estimation of Missing Values
What to do if certain instances have missing attributes?
Ignore those instances: not a good idea if the sample is small
Use 'missing' as an attribute: may give information
Imputation: fill in the missing value
  Mean imputation: use the most likely value (e.g., the mean)
  Imputation by regression: predict based on other attributes
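Mean imputation can be sketched in a few lines of NumPy; the data matrix below is a made-up toy example, with NaN marking missing entries:

```python
import numpy as np

# Toy data matrix (N instances x d features); np.nan marks missing entries.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan],
              [5.0, 6.0]])

# Mean imputation: replace each missing value with the column (feature) mean
# computed over the observed entries only.
col_means = np.nanmean(X, axis=0)          # per-feature means, ignoring NaNs
missing = np.isnan(X)
X_imputed = np.where(missing, col_means, X)
print(X_imputed)
```

Imputation by regression would instead fit a model predicting the missing attribute from the observed ones and fill in the prediction.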
- 7.
Multivariate Normal Distribution
x \sim N_d(\mu, \Sigma)

p(x) = \dfrac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[-\dfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right]
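The density above translates directly into code. A minimal sketch (the function name and the sanity check are mine, not from the slides):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density of N_d(mu, Sigma) at x, straight from the formula above."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    expo = -0.5 * diff @ np.linalg.solve(Sigma, diff)
    return np.exp(expo) / norm

# Sanity check against the univariate case: d = 1, mu = 0, sigma^2 = 1,
# where the density at 0 is 1/sqrt(2*pi).
p = mvn_pdf(np.array([0.0]), np.array([0.0]), np.array([[1.0]]))
print(p)
```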
- 8.
Multivariate Normal Distribution
Mahalanobis distance: (x - \mu)^T \Sigma^{-1} (x - \mu)
measures the distance from x to \mu in terms of \Sigma (normalizes for differences in variances and correlations)
Bivariate: d = 2, with z_i = (x_i - \mu_i)/\sigma_i:

p(x_1, x_2) = \dfrac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left[-\dfrac{1}{2(1-\rho^2)} \left(z_1^2 - 2\rho z_1 z_2 + z_2^2\right)\right]
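The normalization the Mahalanobis distance performs is easy to see numerically. A small sketch (the covariance values are a made-up example):

```python
import numpy as np

def mahalanobis2(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = x - mu
    return float(diff @ np.linalg.solve(Sigma, diff))

mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 0.0],
                  [0.0, 1.0]])   # x1 has variance 4, x2 has variance 1
# The point (2, 0) is one standard deviation away along x1, so its squared
# Mahalanobis distance is 1 even though its squared Euclidean distance is 4.
d2 = mahalanobis2(np.array([2.0, 0.0]), mu, Sigma)
print(d2)
```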
- 9.
Bivariate Normal
- 10.
- 11.
Independent Inputs: Naive Bayes
If the x_i are independent, the off-diagonals of \Sigma are 0, and the Mahalanobis distance reduces to a weighted (by 1/\sigma_i) Euclidean distance:

p(x) = \prod_{i=1}^d p_i(x_i) = \dfrac{1}{(2\pi)^{d/2} \prod_{i=1}^d \sigma_i} \exp\left[-\dfrac{1}{2} \sum_{i=1}^d \left(\dfrac{x_i - \mu_i}{\sigma_i}\right)^2\right]

If the variances are also equal, it reduces to the Euclidean distance.
- 12.
Parametric Classification
If p(x \mid C_i) \sim N_d(\mu_i, \Sigma_i):

p(x \mid C_i) = \dfrac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left[-\dfrac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right]

Discriminant functions are

g_i(x) = \log p(x \mid C_i) + \log P(C_i)
       = -\dfrac{d}{2} \log 2\pi - \dfrac{1}{2} \log |\Sigma_i| - \dfrac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \log P(C_i)
- 13.
Estimation of Parameters
With r_i^t = 1 if x^t \in C_i and 0 otherwise:

\hat{P}(C_i) = \dfrac{\sum_t r_i^t}{N}

m_i = \dfrac{\sum_t r_i^t x^t}{\sum_t r_i^t}

S_i = \dfrac{\sum_t r_i^t (x^t - m_i)(x^t - m_i)^T}{\sum_t r_i^t}

g_i(x) = -\dfrac{1}{2} \log |S_i| - \dfrac{1}{2} (x - m_i)^T S_i^{-1} (x - m_i) + \log \hat{P}(C_i)
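The estimators above are a few lines of NumPy. A sketch on a made-up two-class sample (the data and labels are mine, chosen only for illustration):

```python
import numpy as np

# Toy labelled sample: two classes in d = 2 dimensions.
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],   # class 0
              [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

priors, means, covs = [], [], []
for c in np.unique(y):
    Xc = X[y == c]
    priors.append(len(Xc) / len(X))      # P_hat(C_i) = N_i / N
    m = Xc.mean(axis=0)                  # m_i: per-class sample mean
    means.append(m)
    D = Xc - m
    covs.append(D.T @ D / len(Xc))       # S_i: maximum-likelihood estimate (divide by N_i)
print(priors, means[0])
```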
- 14.
Different Si
Quadratic discriminant:

g_i(x) = -\dfrac{1}{2} \log |S_i| - \dfrac{1}{2} (x - m_i)^T S_i^{-1} (x - m_i) + \log \hat{P}(C_i)
       = x^T W_i x + w_i^T x + w_{i0}

where

W_i = -\dfrac{1}{2} S_i^{-1}
w_i = S_i^{-1} m_i
w_{i0} = -\dfrac{1}{2} m_i^T S_i^{-1} m_i - \dfrac{1}{2} \log |S_i| + \log \hat{P}(C_i)
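The quadratic form can be assembled directly from m_i, S_i, and the prior. A sketch (the function names and the identity-covariance check are mine):

```python
import numpy as np

def quadratic_discriminant(m, S, prior):
    """Return (W, w, w0) for g(x) = x^T W x + w^T x + w0, following the slide."""
    S_inv = np.linalg.inv(S)
    W = -0.5 * S_inv
    w = S_inv @ m
    w0 = (-0.5 * m @ S_inv @ m
          - 0.5 * np.log(np.linalg.det(S))
          + np.log(prior))
    return W, w, w0

def g(x, W, w, w0):
    return x @ W @ x + w @ x + w0

# With identity covariance and mean at the origin, g(x) reduces to
# -||x||^2 / 2 + log prior.
W, w, w0 = quadratic_discriminant(np.zeros(2), np.eye(2), 0.5)
val = g(np.array([1.0, 1.0]), W, w, w0)
print(val)
```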
- 15.
(Figure: likelihoods; posterior for C_1; discriminant: P(C_1 | x) = 0.5)
- 16.
Common Covariance Matrix S
Shared common sample covariance S:

S = \sum_i \hat{P}(C_i) S_i

Discriminant reduces to

g_i(x) = -\dfrac{1}{2} (x - m_i)^T S^{-1} (x - m_i) + \log \hat{P}(C_i)

which is a linear discriminant:

g_i(x) = w_i^T x + w_{i0}

where

w_i = S^{-1} m_i
w_{i0} = -\dfrac{1}{2} m_i^T S^{-1} m_i + \log \hat{P}(C_i)
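With a shared covariance, the discriminant is linear, and with equal priors the decision boundary is the perpendicular bisector of the two means. A sketch (means, covariance, and priors below are made-up values):

```python
import numpy as np

def linear_discriminant(m, S, prior):
    """Return (w, w0) for g(x) = w^T x + w0 with shared covariance S."""
    S_inv = np.linalg.inv(S)
    w = S_inv @ m
    w0 = -0.5 * m @ S_inv @ m + np.log(prior)
    return w, w0

# Two classes sharing S = I with equal priors.
w1, b1 = linear_discriminant(np.array([0.0, 0.0]), np.eye(2), 0.5)
w2, b2 = linear_discriminant(np.array([2.0, 0.0]), np.eye(2), 0.5)
x = np.array([1.0, 0.0])          # midpoint between the two means
print(w1 @ x + b1, w2 @ x + b2)   # equal: the point lies on the boundary
```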
- 17.
Common Covariance Matrix S
- 18.
Diagonal S
When the x_j, j = 1, \ldots, d, are independent, \Sigma is diagonal:

p(x \mid C_i) = \prod_j p_j(x_j \mid C_i)   (Naive Bayes' assumption)

g_i(x) = -\dfrac{1}{2} \sum_{j=1}^d \left(\dfrac{x_j^t - m_{ij}}{s_j}\right)^2 + \log \hat{P}(C_i)

Classify based on weighted Euclidean distance (in s_j units) to the nearest mean.
- 19.
Diagonal S
(variances may be different)
- 20.
Diagonal S, equal variances
Nearest mean classifier: classify based on Euclidean distance to the nearest mean.
Each mean can be considered a prototype or template, and this is template matching.
g_i(x) = -\dfrac{\|x - m_i\|^2}{2s^2} + \log \hat{P}(C_i) = -\dfrac{1}{2s^2} \sum_{j=1}^d \left(x_j^t - m_{ij}\right)^2 + \log \hat{P}(C_i)
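The nearest mean classifier is a one-liner over the class templates. A sketch (the means and query points below are made-up examples):

```python
import numpy as np

def nearest_mean_classify(x, means):
    """Assign x to the class whose mean (template) is closest in Euclidean distance."""
    dists = [np.linalg.norm(x - m) for m in means]
    return int(np.argmin(dists))

# Two class templates.
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
a = nearest_mean_classify(np.array([1.0, 1.0]), means)
b = nearest_mean_classify(np.array([3.0, 3.5]), means)
print(a, b)
```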
- 21.
Diagonal S, equal variances
- 22.
Model Selection
As we increase complexity (less restricted S), bias decreases and variance increases.
Assume simple models (allow some bias) to control variance (regularization).
Assumption                   Covariance matrix         No. of parameters
Shared, hyperspheric         S_i = S = s^2 I           1
Shared, axis-aligned         S_i = S, with s_{ij} = 0  d
Shared, hyperellipsoidal     S_i = S                   d(d+1)/2
Different, hyperellipsoidal  S_i                       K \cdot d(d+1)/2
- 23.
Discrete Features
Binary features: p_{ij} \equiv p(x_j = 1 \mid C_i)

If the x_j are independent (Naive Bayes'):

p(x \mid C_i) = \prod_{j=1}^d p_{ij}^{x_j} (1 - p_{ij})^{1 - x_j}

the discriminant is linear:

g_i(x) = \log p(x \mid C_i) + \log P(C_i)
       = \sum_j \left[x_j \log p_{ij} + (1 - x_j) \log(1 - p_{ij})\right] + \log P(C_i)

Estimated parameters:

\hat{p}_{ij} = \dfrac{\sum_t x_j^t r_i^t}{\sum_t r_i^t}
- 24.
Discrete Features
Multinomial (1-of-n_j) features: x_j \in \{v_1, v_2, \ldots, v_{n_j}\}

with z_{jk} = 1 if x_j = v_k and 0 otherwise:

p_{ijk} \equiv p(z_{jk} = 1 \mid C_i) = p(x_j = v_k \mid C_i)

If the x_j are independent:

p(x \mid C_i) = \prod_{j=1}^d \prod_{k=1}^{n_j} p_{ijk}^{z_{jk}}

g_i(x) = \sum_j \sum_k z_{jk} \log p_{ijk} + \log P(C_i)

\hat{p}_{ijk} = \dfrac{\sum_t z_{jk}^t r_i^t}{\sum_t r_i^t}
- 25.
Multivariate Regression
Multivariate linear model:

g(x^t \mid w_0, w_1, \ldots, w_d) = w_0 + w_1 x_1^t + w_2 x_2^t + \cdots + w_d x_d^t

E(w_0, w_1, \ldots, w_d \mid X) = \dfrac{1}{2} \sum_t \left[r^t - \left(w_0 + w_1 x_1^t + \cdots + w_d x_d^t\right)\right]^2

Multivariate polynomial model: define new higher-order variables

z_1 = x_1, \quad z_2 = x_2, \quad z_3 = x_1^2, \quad z_4 = x_2^2, \quad z_5 = x_1 x_2

and use the linear model in this new z space (basis functions, kernel trick, SVM: Chapter 10).
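Minimizing the squared-error function above is an ordinary least-squares problem. A sketch on synthetic noiseless data (the generating weights 1, 2, 3 are made up, chosen so the fit should recover them exactly):

```python
import numpy as np

# Toy data generated from a known linear model r = 1 + 2*x1 + 3*x2 (no noise),
# so least squares should recover the weights exactly.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))
r = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Augment with a column of ones for w0 and solve min_w ||r - Z w||^2,
# which minimizes the error function on the slide.
Z = np.hstack([np.ones((len(X), 1)), X])
w, *_ = np.linalg.lstsq(Z, r, rcond=None)
print(np.round(w, 6))
```

For the polynomial model, one would simply append columns for z_3 = x_1^2, z_4 = x_2^2, z_5 = x_1 x_2 to Z and solve the same way.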