The document discusses sharing slides from a lecture on Gaussian Bayes classifiers. It notes that the original slides are available and encourages others to use and modify the slides for their own teaching needs. Users are asked to include attribution to the original source if using a significant portion of the slides.
The document proposes a heart attack prediction system using fuzzy C-means clustering. The system takes in a patient's medical attributes like age, blood pressure, and artery thickness from their records. It then uses a fuzzy C-means algorithm to cluster this data and predict the patient's risk of a heart attack. The system is intended to help doctors make earlier diagnoses compared to only relying on their experience and a patient's records.
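The clustering step described above can be sketched with a minimal fuzzy C-means implementation. The toy "patient" records, the fuzzifier m = 2, and the two-cluster setup are illustrative assumptions, not values from the slides:

```python
import numpy as np

# Minimal fuzzy C-means sketch: each point gets a membership degree in every
# cluster rather than a hard label (toy data and parameters are assumed).
def fuzzy_c_means(X, c=2, m=2.0, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)              # memberships sum to 1 per point
    for _ in range(n_iter):
        W = U ** m                                  # fuzzified memberships
        V = (W.T @ X) / W.sum(axis=0)[:, None]      # membership-weighted centres
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)    # standard FCM membership update
    return U, V

# Hypothetical patients: two attributes (age, blood pressure), two risk groups.
X = np.array([[45., 120.], [50., 125.], [48., 118.],
              [70., 180.], [72., 175.], [68., 185.]])
U, V = fuzzy_c_means(X)
labels = U.argmax(axis=1)                           # hard labels for inspection
```

Unlike hard clustering, the membership matrix `U` lets a borderline patient belong partly to both risk groups.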
Lazy learning is a machine learning method where generalization of training data is delayed until a query is made, unlike eager learning which generalizes before queries. K-nearest neighbors and case-based reasoning are examples of lazy learners, which store training data and classify new data based on similarity. Case-based reasoning specifically stores prior problem solutions to solve new problems by combining similar past case solutions.
k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
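The nearest-mean assignment can be sketched in a few lines; the toy two-blob data set is an assumption for illustration:

```python
import numpy as np

# Minimal k-means sketch: assign each point to the nearest mean (its Voronoi
# cell), then recompute the means, and repeat.
def kmeans(X, k=2, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                   # nearest-mean assignment
        centers = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])     # keep old centre if cluster empties
    return labels, centers

X = np.array([[1., 1.], [1.2, 0.9], [0.8, 1.1],
              [8., 8.], [8.1, 7.9], [7.9, 8.2]])
labels, centers = kmeans(X)
```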
Prediction of heart disease using machine learning.pptx (kumari36)
1. The document discusses using machine learning techniques to predict heart disease by evaluating large datasets to identify patterns that can help predict, prevent, and manage conditions like heart attacks.
2. It proposes using data analytics based on support vector machines and genetic algorithms to diagnose heart disease, claiming genetic algorithms provide the best optimized prediction models.
3. The key modules described are uploading training data, pre-processing the heart disease data, using machine learning to predict heart disease, and generating graphical representations of the analyses.
Disease Prediction And Doctor Appointment system (KOYELMAJUMDAR1)
This document outlines a disease prediction and doctor appointment system using machine learning. The objectives are to provide quick medical diagnosis to rural patients and enhance access to medical specialists. Five machine learning algorithms - Decision Tree, Random Forest, Naive Bayes, K-Nearest Neighbors, and Support Vector Machine - are used for disease prediction. The system displays predicted diseases and accuracy scores for each algorithm. Users can then book appointments with specialist doctors for their predicted disease.
The document discusses different types of data models and their evolution. It describes hierarchical, network, relational, entity relationship, and object oriented models. Each new model aimed to improve on limitations of previous approaches. The models can be classified at different levels of abstraction, from external views specific to business units to conceptual and internal representations within the database.
Pandas is a powerful Python library for data analysis and manipulation. It provides rich data structures for working with structured and time series data easily. Pandas allows for data cleaning, analysis, modeling, and visualization. It builds on NumPy and provides data frames for working with tabular data similarly to R's data frames, as well as time series functionality and tools for plotting, merging, grouping, and handling missing data.
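A small sketch of the capabilities listed above (the patient table and column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical tabular data illustrating cleaning, grouping and merging.
df = pd.DataFrame({
    "patient": ["a", "b", "c", "d"],
    "age": [45, 50, np.nan, 70],
    "group": ["low", "low", "high", "high"],
})
df["age"] = df["age"].fillna(df["age"].mean())     # handle missing data
means = df.groupby("group")["age"].mean()          # split-apply-combine

labels = pd.DataFrame({"group": ["low", "high"], "risk": [0, 1]})
merged = df.merge(labels, on="group")              # R-style data-frame join
```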
This presentation discusses the following topics:
Hadoop Distributed File System (HDFS)
How does HDFS work?
HDFS Architecture
Features of HDFS
Benefits of using HDFS
Examples: Target Marketing
HDFS data replication
We predict heart disease from 14 medical parameters using two data mining techniques: a decision tree (faster) and the k-nearest neighbours algorithm (slower). The dataset is also visualized. An output of 1 indicates a higher chance of heart attack; an output of 0 indicates a lower chance.
This document discusses Naive Bayes classifiers. It begins with an overview of probabilistic classification and the Naive Bayes approach. The Naive Bayes classifier makes a strong independence assumption that features are conditionally independent given the class. It then presents the algorithm for Naive Bayes classification with discrete and continuous features. An example of classifying whether to play tennis is used to illustrate the learning and classification phases. The document concludes with a discussion of some relevant issues and a high-level summary of Naive Bayes.
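The learning and classification phases for discrete features can be sketched as counting plus the independence assumption. The tiny weather table below is a hand-made stand-in for the classic play-tennis data, not the full 14-row set from the slides:

```python
from collections import Counter

# Tiny hypothetical subset of the "play tennis" style data: (outlook, humidity, play).
data = [
    ("sunny", "high", "no"), ("sunny", "normal", "yes"),
    ("overcast", "high", "yes"), ("rain", "high", "no"),
    ("rain", "normal", "yes"), ("overcast", "normal", "yes"),
]

def nb_predict(x, data, alpha=1.0):
    """Pick the class maximizing P(c) * prod_i P(x_i | c), with Laplace smoothing."""
    classes = Counter(row[-1] for row in data)
    n = len(data)
    best, best_p = None, -1.0
    for c, nc in classes.items():
        p = nc / n                                  # prior P(c) from class counts
        for i, v in enumerate(x):                   # naive independence assumption
            match = sum(1 for row in data if row[-1] == c and row[i] == v)
            vals = len({row[i] for row in data})    # smoothing over feature values
            p *= (match + alpha) / (nc + alpha * vals)
        if p > best_p:
            best, best_p = c, p
    return best

pred = nb_predict(("sunny", "normal"), data)
```

For continuous features the per-class likelihoods would instead be modelled with a density such as a Gaussian, but the argmax structure stays the same.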
This document summarizes the Iris flower data set, which contains measurements of 150 iris flowers from three species. It describes the four attributes measured (sepal length and width, petal length and width) and explains that one species is linearly separable from the other two. Various visualizations and analyses are proposed to better understand the relationships between attributes and species, including box plots, correlation matrices, and evaluating classification algorithms using different feature combinations. Accuracy results are presented for models trained and tested on the data split in various ways.
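A sketch of the kind of evaluation the summary describes, assuming scikit-learn is available (the 70/30 split and k-NN classifier are illustrative choices, not necessarily those used in the slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Train on one split, score on a held-out split.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
acc = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr).score(X_te, y_te)

# The linear-separability claim is visible in petal length (column 2):
# every setosa (class 0) petal is shorter than any versicolor/virginica petal.
setosa_max = X[y == 0, 2].max()
others_min = X[y != 0, 2].min()
```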
This document provides an overview of data mining techniques and concepts. It defines data mining as the process of discovering interesting patterns and knowledge from large amounts of data. The key steps involved are data cleaning, integration, selection, transformation, mining, evaluation, and presentation. Common data mining techniques include classification, clustering, association rule mining, and anomaly detection. The document also discusses data sources, major applications of data mining, and challenges.
This document provides an overview of indexing and hashing techniques for database systems. It discusses ordered indices like B-trees which store index entries in sorted order, and hash indices which distribute entries uniformly using a hash function. The key topics covered are basic indexing concepts, ordered indices, B-tree index files, hashing techniques, performance metrics for evaluating indices, and updating indices for insertions and deletions. B-tree indices are highlighted as an efficient structure that automatically reorganizes with updates while avoiding the need to periodically reorganize entire files like indexed sequential files.
Cloud computing provides a way for organizations to share distributed resources over a network. However, data security is a major concern in cloud computing since data is stored remotely. The document discusses several techniques used for data security in cloud computing including authentication, encryption, data masking, and data traceability. The latest technologies discussed are a cloud information gateway that can control data transmission and secure logic migration that transfers applications to an internal sandbox for secure execution.
The document discusses using web 2.0 tools for collaboration in the cloud. It defines collaboration 2.0 as adding distributed computing and collaboration platforms that allow for distance and asynchronicity. Benefits include social networks functioning as professional networks and blending synchronous and asynchronous work. Various categories of tools are covered, including social calendars, networking sites, bookmarking, desktops, wikis and documents. Examples like Google Docs, Dropbox and PBWorks are provided. The document advocates using these tools for projects, communication, organizing information and backups.
Review Paper on Implementation Technology to Repair Pothole Using Waste Plastic (IRJET Journal)
The document describes a system for repairing potholes using waste plastic. Potholes form due to factors like water absorption in cracks and heavy vehicle traffic. The system would detect potholes using ultrasonic sensors, melt shredded waste plastic using induction heating, and pour the molten plastic into potholes. Using plastic to fill potholes could extend road lifespan from 3 to over 5 years. It addresses both the problems of pothole formation and plastic waste accumulation, providing a cheaper and longer-lasting alternative to traditional pothole repair methods.
This document discusses k-nearest neighbor (k-NN) machine learning algorithms. It explains that k-NN is an instance-based, lazy learning method that stores all training data and classifies new examples based on their similarity to stored examples. The key steps are: (1) calculate the distance between a new example and all stored examples, (2) find the k nearest neighbors, (3) assign the new example the most common class of its k nearest neighbors. Important considerations include the distance metric, value of k, and voting scheme for classification.
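The three key steps map directly onto a few lines of code; the toy training set below is an assumption for illustration:

```python
from collections import Counter
import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    d = np.linalg.norm(X_train - x, axis=1)        # (1) distance to every stored example
    nearest = np.argsort(d)[:k]                    # (2) indices of the k nearest
    votes = Counter(y_train[i] for i in nearest)   # (3) majority vote among neighbours
    return votes.most_common(1)[0][0]

X_train = np.array([[1., 1.], [1.5, 1.2], [0.9, 0.8],
                    [6., 6.], [6.2, 5.9], [5.8, 6.1]])
y_train = np.array(["a", "a", "a", "b", "b", "b"])
pred = knn_classify(np.array([1.1, 1.0]), X_train, y_train, k=3)
```

Swapping the Euclidean distance for another metric, changing k, or weighting votes by distance are exactly the considerations the summary lists.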
Weka is a collection of machine learning algorithms and data pre-processing tools developed at the University of Waikato. It contains tools for data pre-processing, classification, regression, clustering, association rule mining, and visualization. Weka is open source, free to use, and popular for research and applications. It has a graphical user interface and supports a variety of data formats including ARFF files.
Naive Bayes is a classifier based on Bayes' theorem. It predicts membership probabilities for each class, such as the probability that a given record or data point belongs to a particular class.
This document discusses dimensionality reduction techniques for data mining. It begins with an introduction to dimensionality reduction and reasons for using it. These include dealing with high-dimensional data issues like the curse of dimensionality. It then covers major dimensionality reduction techniques of feature selection and feature extraction. Feature selection techniques discussed include search strategies, feature ranking, and evaluation measures. Feature extraction maps data to a lower-dimensional space. The document outlines applications of dimensionality reduction like text mining and gene expression analysis. It concludes with trends in the field.
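Feature extraction can be sketched with principal component analysis via the SVD; the synthetic low-rank data set is an assumption used to make the effect visible:

```python
import numpy as np

# Synthetic 3-D data whose variance lies almost entirely along one direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1)) @ np.array([[2.0, 1.0, 0.5]])   # rank-1 structure
X += 0.01 * rng.normal(size=X.shape)                          # small noise

# PCA by SVD of the centred data: project onto the top principal directions.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                                  # map 3-D data to a 2-D space
explained = (S ** 2) / (S ** 2).sum()              # variance explained per component
```

Here the first component captures nearly all the variance, so the 2-D representation loses almost nothing, which is the point of feature extraction.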
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHM (amiteshg)
This document describes using a Naive Bayes classifier to predict the likelihood of heart disease. It discusses how a web-based application would take in a user's medical information and use a trained dataset to compare and retrieve hidden data to diagnose heart disease. The document provides an example of using Bayes' theorem to calculate the probability of breast cancer based on a positive mammogram. It explains the implementation of the Naive Bayes classifier and concludes that the model could help practitioners make accurate clinical decisions to diagnose and treat heart disease.
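The mammogram example can be worked through directly from Bayes' theorem. The numbers below are commonly used illustrative values, assumed here rather than taken from the slides:

```python
# Assumed illustrative figures: prior P(cancer) = 1%,
# sensitivity P(positive | cancer) = 80%,
# false-positive rate P(positive | no cancer) = 9.6%.
p_c, p_pos_c, p_pos_nc = 0.01, 0.80, 0.096

p_pos = p_pos_c * p_c + p_pos_nc * (1 - p_c)       # total probability of a positive test
p_c_pos = p_pos_c * p_c / p_pos                    # Bayes' theorem: P(cancer | positive)
```

With these inputs the posterior is only about 7.8%, far below the 80% sensitivity, which is the usual lesson of the example.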
This document discusses approximate inference in Bayesian networks using sampling methods. It introduces random number generation, which is important for sampling algorithms. Random number generators in programming languages typically generate uniform random numbers, but different distributions are needed for sampling Bayesian networks. The document covers generating random numbers from univariate and multivariate distributions to estimate probabilities for approximate inference in Bayesian networks.
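The step from uniform random numbers to other distributions can be sketched with inverse-transform sampling; the exponential target and rate are illustrative choices:

```python
import numpy as np

# Inverse-transform sampling: push uniform draws through the inverse CDF.
rng = np.random.default_rng(0)
u = rng.random(100_000)                  # uniform draws on [0, 1)
lam = 2.0
x = -np.log(1.0 - u) / lam               # inverse CDF of Exponential(lam)
mean = x.mean()                          # should approach 1/lam = 0.5
```

Sampling algorithms for Bayesian networks build on the same idea: each node's conditional distribution is sampled using uniform draws transformed appropriately.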
The document summarizes a disease prediction system for rural health services presented by two students. The key points are:
1. The system aims to provide quick medical diagnosis to rural patients using machine learning algorithms like SVM, RF, DT, NB, ANN, KNN, and LR to recognize diseases from symptoms.
2. It seeks to enhance access to medical specialists for rural communities and improve quality of healthcare.
3. The expected outcomes are conducting experiments to evaluate the performance of using 7 machine learning algorithms to predict diseases from symptoms and having doctors select the correct diagnosis from the predictions.
The KDD process involves several steps: data cleaning to remove noise, data integration of multiple sources, data selection of relevant data, data transformation into appropriate forms for mining, applying data mining techniques to extract patterns, evaluating patterns for interestingness, and representing mined knowledge visually. The KDD process aims to discover useful knowledge from various data types including databases, data warehouses, transactional data, time series, sequences, streams, spatial, multimedia, graphs, engineering designs, and web data.
This Presentation is about NoSQL which means Not Only SQL. This presentation covers the aspects of using NoSQL for Big Data and the differences from RDBMS.
Heart disease prediction using machine learning algorithm (Kedar Damkondwar)
The document summarizes a seminar presentation on predicting heart disease using machine learning algorithms. It introduces the problem of heart disease prediction and the motivation to develop an automated system to assist in diagnosis and treatment. It reviews several existing studies that used methods like decision trees, naive Bayes, neural networks, and support vector machines to predict heart disease risk factors. The objectives of the presented model are to develop a predictive system using machine learning techniques to analyze heart data and help reduce medical costs and human biases. The proposed model and applications in medical institutions and hospitals are also discussed.
What Is Data Science? | Introduction to Data Science | Data Science For Begin... (Simplilearn)
This Data Science presentation will help you understand what Data Science is, why we need it, the prerequisites for learning it, what a Data Scientist does, the Data Science lifecycle with an example, and career opportunities in the Data Science domain. You will also learn the differences between Data Science and Business Intelligence. The role of a data scientist has been called one of the sexiest jobs of the century. Demand for data scientists is high and growing: companies are looking for more skilled data scientists every day, and studies project a continued shortfall of qualified candidates to fill these roles. So, let us dive deep into Data Science and understand what it is all about.
This Data Science Presentation will cover the following topics:
1. Need for Data Science?
2. What is Data Science?
3. Data Science vs Business intelligence
4. Prerequisites for learning Data Science
5. What does a Data scientist do?
6. Data Science life cycle with use case
7. Demand for Data scientists
This Data Science with Python course will establish your mastery of data science and analytics techniques using Python. With this Python for Data Science Course, you’ll learn the essential concepts of Python programming and become an expert in data analytics, machine learning, data visualization, web scraping and natural language processing. Python is a required skill for many data science positions, so jumpstart your career with this interactive, hands-on course.
Why learn Data Science?
Data scientists are being deployed in all kinds of industries, creating a huge demand for skilled professionals. Data scientist is the pinnacle rank in an analytics organization. Glassdoor ranked data scientist first in its 25 Best Jobs for 2016, and good data scientists are scarce and in great demand. As a data scientist you will be required to understand the business problem, design the analysis, collect and format the required data, apply algorithms or techniques using the correct tools, and finally make recommendations backed by data.
The Data Science with Python course is recommended for:
1. Analytics professionals who want to work with Python
2. Software professionals looking to get into the field of analytics
3. IT professionals interested in pursuing a career in analytics
4. Graduates looking to build a career in analytics and data science
5. Experienced professionals who would like to harness data science in their fields
This document provides an executive summary and introduction to Bayes networks. It contains 9 slides that describe Bayes networks at a high level, provide simple illustrative examples, and discuss how Bayes networks can be built from expert knowledge or data. Real-world examples of Bayes networks in applications such as medical diagnosis, manufacturing systems, and information retrieval are also briefly mentioned.
The document discusses the need for balance between public and private housing in Niagara Falls. While public housing is necessary to provide affordable options given the low median income, concentrating only public housing can depress areas and reduce the city's tax base. The city struggles with balancing this need for public housing with generating tax revenue, as subsidized housing does not contribute to taxes and abandoned buildings and neighborhoods also provide little tax income.
We are predicting Heart Disease by Taking 14 Medical Parameters as an inputs through 2 data Minning Techniques(Decision Tree(Faster) And KNN neighbour Algorithms(Slower)).
And Visualizing The dataset.If the output 1 then it means Higher Chances of getting Heart Attack ,if 0 then it means Less chances of Heart Attack.
This document discusses Naive Bayes classifiers. It begins with an overview of probabilistic classification and the Naive Bayes approach. The Naive Bayes classifier makes a strong independence assumption that features are conditionally independent given the class. It then presents the algorithm for Naive Bayes classification with discrete and continuous features. An example of classifying whether to play tennis is used to illustrate the learning and classification phases. The document concludes with a discussion of some relevant issues and a high-level summary of Naive Bayes.
This document summarizes the Iris flower data set, which contains measurements of 150 iris flowers from three species. It describes the four attributes measured (sepal length and width, petal length and width) and explains that one species is linearly separable from the other two. Various visualizations and analyses are proposed to better understand the relationships between attributes and species, including box plots, correlation matrices, and evaluating classification algorithms using different feature combinations. Accuracy results are presented for models trained and tested on the data split in various ways.
This document provides an overview of data mining techniques and concepts. It defines data mining as the process of discovering interesting patterns and knowledge from large amounts of data. The key steps involved are data cleaning, integration, selection, transformation, mining, evaluation, and presentation. Common data mining techniques include classification, clustering, association rule mining, and anomaly detection. The document also discusses data sources, major applications of data mining, and challenges.
This document provides an overview of indexing and hashing techniques for database systems. It discusses ordered indices like B-trees which store index entries in sorted order, and hash indices which distribute entries uniformly using a hash function. The key topics covered are basic indexing concepts, ordered indices, B-tree index files, hashing techniques, performance metrics for evaluating indices, and updating indices for insertions and deletions. B-tree indices are highlighted as an efficient structure that automatically reorganizes with updates while avoiding the need to periodically reorganize entire files like indexed sequential files.
Cloud computing provides a way for organizations to share distributed resources over a network. However, data security is a major concern in cloud computing since data is stored remotely. The document discusses several techniques used for data security in cloud computing including authentication, encryption, data masking, and data traceability. The latest technologies discussed are a cloud information gateway that can control data transmission and secure logic migration that transfers applications to an internal sandbox for secure execution.
The document discusses using web 2.0 tools for collaboration in the cloud. It defines collaboration 2.0 as adding distributed computing and collaboration platforms that allow for distance and asynchronicity. Benefits include social networks functioning as professional networks and blending synchronous and asynchronous work. Various categories of tools are covered, including social calendars, networking sites, bookmarking, desktops, wikis and documents. Examples like Google Docs, Dropbox and PBWorks are provided. The document advocates using these tools for projects, communication, organizing information and backups.
Review Paper on Implementation Technology to Repair Pothole Using Waste PlasticIRJET Journal
The document describes a system for repairing potholes using waste plastic. Potholes form due to factors like water absorption in cracks and heavy vehicle traffic. The system would detect potholes using ultrasonic sensors, melt shredded waste plastic using induction heating, and pour the molten plastic into potholes. Using plastic to fill potholes could extend road lifespan from 3 to over 5 years. It addresses both the problems of pothole formation and plastic waste accumulation, providing a cheaper and longer-lasting alternative to traditional pothole repair methods.
This document discusses k-nearest neighbor (k-NN) machine learning algorithms. It explains that k-NN is an instance-based, lazy learning method that stores all training data and classifies new examples based on their similarity to stored examples. The key steps are: (1) calculate the distance between a new example and all stored examples, (2) find the k nearest neighbors, (3) assign the new example the most common class of its k nearest neighbors. Important considerations include the distance metric, value of k, and voting scheme for classification.
Weka is a collection of machine learning algorithms and data pre-processing tools developed at the University of Waikato. It contains tools for data pre-processing, classification, regression, clustering, association rule mining, and visualization. Weka is open source, free to use, and popular for research and applications. It has a graphical user interface and supports a variety of data formats including ARFF files.
Naive Bayes is a kind of classifier which uses the Bayes Theorem. It predicts membership probabilities for each class such as the probability that given record or data point belongs to a particular class.
This document discusses dimensionality reduction techniques for data mining. It begins with an introduction to dimensionality reduction and reasons for using it. These include dealing with high-dimensional data issues like the curse of dimensionality. It then covers major dimensionality reduction techniques of feature selection and feature extraction. Feature selection techniques discussed include search strategies, feature ranking, and evaluation measures. Feature extraction maps data to a lower-dimensional space. The document outlines applications of dimensionality reduction like text mining and gene expression analysis. It concludes with trends in the field.
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHMamiteshg
This document describes using a Naive Bayes classifier to predict the likelihood of heart disease. It discusses how a web-based application would take in a user's medical information and use a trained dataset to compare and retrieve hidden data to diagnose heart disease. The document provides an example of using Bayes' theorem to calculate the probability of breast cancer based on a positive mammogram. It explains the implementation of the Naive Bayes classifier and concludes that the model could help practitioners make accurate clinical decisions to diagnose and treat heart disease.
This document discusses approximate inference in Bayesian networks using sampling methods. It introduces random number generation, which is important for sampling algorithms. Random number generators in programming languages typically generate uniform random numbers, but different distributions are needed for sampling Bayesian networks. The document covers generating random numbers from univariate and multivariate distributions to estimate probabilities for approximate inference in Bayesian networks.
The document summarizes a disease prediction system for rural health services presented by two students. The key points are:
1. The system aims to provide quick medical diagnosis to rural patients using machine learning algorithms like SVM, RF, DT, NB, ANN, KNN, and LR to recognize diseases from symptoms.
2. It seeks to enhance access to medical specialists for rural communities and improve quality of healthcare.
3. The expected outcomes are conducting experiments to evaluate the performance of using 7 machine learning algorithms to predict diseases from symptoms and having doctors select the correct diagnosis from the predictions.
The KDD process involves several steps: data cleaning to remove noise, data integration of multiple sources, data selection of relevant data, data transformation into appropriate forms for mining, applying data mining techniques to extract patterns, evaluating patterns for interestingness, and representing mined knowledge visually. The KDD process aims to discover useful knowledge from various data types including databases, data warehouses, transactional data, time series, sequences, streams, spatial, multimedia, graphs, engineering designs, and web data.
This Presentation is about NoSQL which means Not Only SQL. This presentation covers the aspects of using NoSQL for Big Data and the differences from RDBMS.
Heart disease prediction using machine learning algorithm Kedar Damkondwar
The document summarizes a seminar presentation on predicting heart disease using machine learning algorithms. It introduces the problem of heart disease prediction and the motivation to develop an automated system to assist in diagnosis and treatment. It reviews several existing studies that used methods like decision trees, naive Bayes, neural networks, and support vector machines to predict heart disease risk factors. The objectives of the presented model are to develop a predictive system using machine learning techniques to analyze heart data and help reduce medical costs and human biases. The proposed model and applications in medical institutions and hospitals are also discussed.
What Is Data Science? | Introduction to Data Science | Data Science For Begin...Simplilearn
This Data Science Presentation will help you in understanding what is Data Science, why we need Data Science, prerequisites for learning Data Science, what does a Data Scientist do, Data Science lifecycle with an example and career opportunities in Data Science domain. You will also learn the differences between Data Science and Business intelligence. The role of a data scientist is one of the sexiest jobs of the century. The demand for data scientists is high, and the number of opportunities for certified data scientists is increasing. Every day, companies are looking out for more and more skilled data scientists and studies show that there is expected to be a continued shortfall in qualified candidates to fill the roles. So, let us dive deep into Data Science and understand what is Data Science all about.
This Data Science Presentation will cover the following topics:
1. Need for Data Science?
2. What is Data Science?
3. Data Science vs Business intelligence
4. Prerequisites for learning Data Science
5. What does a Data scientist do?
6. Data Science life cycle with use case
7. Demand for Data scientists
This Data Science with Python course will establish your mastery of data science and analytics techniques using Python. With this Python for Data Science Course, you’ll learn the essential concepts of Python programming and become an expert in data analytics, machine learning, data visualization, web scraping and natural language processing. Python is a required skill for many data science positions, so jumpstart your career with this interactive, hands-on course.
Why learn Data Science?
Data Scientists are being deployed in all kinds of industries, creating a huge demand for skilled professionals. Data scientist is the pinnacle rank in an analytics organization. Glassdoor has ranked data scientist first in the 25 Best Jobs for 2016, and good data scientists are scarce and in great demand. As a data you will be required to understand the business problem, design the analysis, collect and format the required data, apply algorithms or techniques using the correct tools, and finally make recommendations backed by data.
The Data Science with Python course is recommended for:
1. Analytics professionals who want to work with Python
2. Software professionals looking to get into the field of analytics
3. IT professionals interested in pursuing a career in analytics
4. Graduates looking to build a career in analytics and data science
5. Experienced professionals who would like to harness data science in their fields
This document provides an executive summary and introduction to Bayes networks. It contains 9 slides that describe Bayes networks at a high level, provide simple illustrative examples, and discuss how Bayes networks can be built from expert knowledge or data. Real-world examples of Bayes networks in applications such as medical diagnosis, manufacturing systems, and information retrieval are also briefly mentioned.
The document discusses the need for balance between public and private housing in Niagara Falls. While public housing is necessary to provide affordable options given the low median income, concentrating only public housing can depress areas and reduce the city's tax base. The city struggles with balancing this need for public housing with generating tax revenue, as subsidized housing does not contribute to taxes and abandoned buildings and neighborhoods also provide little tax income.
The document discusses support vector machines (SVMs) and how they find the maximum margin linear classifier to classify data. Specifically, it explains that SVMs:
1) Find the linear decision boundary that maximizes the margin or distance between the boundary and the closest data points of each class.
2) The maximum margin classifier is the simplest type of SVM called a linear SVM (LSVM).
3) The margin is computed in terms of the weights w and bias b that define the decision boundary. Maximizing this margin leads to the optimal separating hyperplane.
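As an illustrative sketch (not taken from the slides), the margin geometry can be checked numerically for a hand-picked separating hyperplane; the data, weights w, and bias b below are invented for illustration, and finding the truly optimal w would require a quadratic-program solver:

```python
import math

# Toy linearly separable data: class +1 and class -1 points in 2-D.
pos = [(2.0, 2.0), (3.0, 3.0)]
neg = [(0.0, 0.0), (1.0, 1.0)]

# A hand-chosen separating hyperplane w.x + b = 0 (illustrative only;
# the QP-optimal SVM solution would need a quadratic-program solver).
w = (1.0, 1.0)
b = -3.0

def decision(x):
    return w[0] * x[0] + w[1] * x[1] + b

# Every point is classified correctly: the sign of the decision
# function matches the class label.
assert all(decision(x) > 0 for x in pos)
assert all(decision(x) < 0 for x in neg)

# With the canonical scaling |w.x + b| = 1 at the closest points
# (true here: decision is +1 at (2,2) and -1 at (1,1)), the margin
# between the two supporting hyperplanes is 2 / ||w||.
margin = 2.0 / math.hypot(*w)
print(round(margin, 4))
```

The key point the summary makes is visible in the last line: maximizing the margin 2/||w|| is equivalent to minimizing ||w|| subject to correct classification.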
The document provides information about the SME Team at the International Digital Laboratory (IDL) including contacts and services available to help small and medium-sized enterprises (SMEs). It introduces several members of the SME Team, their backgrounds and areas of expertise. The team can help SMEs with opportunities to collaborate with researchers, access to facilities, networking and training events, and knowledge transfer partnerships.
Predicting Real-valued Outputs: An introduction to regression (guestfee8698)
This document provides an introduction to regression analysis, which is used to predict real-valued outputs. It discusses single and multivariate linear regression, including how to calculate the maximum likelihood estimate of the regression coefficient(s). It also covers extensions such as adding a constant term, handling varying noise levels in the data, and nonlinear regression models. The goal is to estimate the parameters that best predict the output values given the input features.
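For the simplest single-variable case the summary mentions, the maximum likelihood estimate has a one-line closed form; the data below are made up for illustration:

```python
# Maximum-likelihood estimate for the no-constant linear model
# y = w*x + Gaussian noise: the MLE minimizes squared error, giving
#   w_mle = sum(x_i * y_i) / sum(x_i^2).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # roughly y = 2x

w_mle = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(round(w_mle, 3))
```

Adding a constant term, varying noise levels, or nonlinear basis functions (as the document's extensions do) changes the features, but the fit remains a least-squares problem.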
An interim executive can provide several key benefits to organizations:
1. They can quickly fill important leadership openings like Chief Sales Officer or Chief Marketing Officer to ensure business continues uninterrupted and avoid delays from a prolonged search for a new permanent hire.
2. They can successfully complete special projects or strategic initiatives without the overhead costs of a permanent hire and are often available on short notice to help a company move quickly.
3. They allow companies to take the time needed to find the best permanent hire for a key position while still meeting that role's responsibilities in the interim.
Universal Windows apps let you develop for both Windows Phone and Windows 8 while sharing common components. This introductory presentation covers their key aspects.
- The document discusses how much people earn from their time based on hours worked and occupations. Some work 40 hours and earn under $40,000 while others work over 40 hours and earn over $100,000. A few work less than 40 hours and earn over $500,000.
- It introduces a multi-level marketing company called FHTM that allows people to build a business and earn residual income by gathering customers and introducing others to do the same. The compensation plan is described as very generous.
- Details are provided on products/services offered, the business model, leadership bonuses, and a car program for high achievers. Building a team of 3 who each build a team of 3 is emphasized as the simple path to growing the business.
The document discusses missed opportunities in Niagara Falls, NY across three areas: the Upper River Parkway, the Lower River Gorge, and Downtown. The Upper River Parkway separates historic homes from the river and has blocked commercial development for years. The Lower River Gorge, with spectacular views, has never reached its potential as a tourist attraction. Downtown has 90,000 square feet of empty retail space in the heart of the tourist district that could boost the city's tourism. Neglecting historic buildings is a lost opportunity to enhance visitors' experiences.
- The document discusses how well people are paid based on the hours they work and their job type, with some people earning over $100,000 working more than 40 hours per week while others earn less than $40,000 working 40 hours.
- It then contrasts employees, who make up 95% of the population but only 5% of total wealth, with business owners, who make up just 5% of the population but 95% of total wealth. Business owners' time is leveraged through business ownership while employees trade their time.
- The document is promoting a multi-level marketing company called FHTM, outlining its business model, compensation plan showing potential earnings at different levels, and products/services offered.
The Digital Lab provides knowledge services and technologies to small and medium sized businesses through their SME Lab program. This includes demonstrators of leading edge technologies, seminars on digital technologies, and analysis sessions to identify relevant solutions for businesses. They also offer strategy discussions, collaborative research projects, and access to research in areas like e-business, e-security, experiential engineering, and more. Experts in these areas describe challenges they are addressing in fields like securing data, trust management, high-integrity testing, and applying human factors research to product and environment design.
The document discusses the concept of PAC (Probably Approximately Correct) learning. It begins by describing a learning scenario where a hidden hypothesis is chosen by nature, and a learner tries to approximate this hypothesis based on randomly generated training data. It then defines what it means for a learned hypothesis to be "bad" or have high test error, and shows that by choosing a large enough random training set, the probability of learning a bad hypothesis can be bounded. Finally, it provides the formula for calculating the minimum size of the random training set needed to guarantee this probability bound.
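The standard sample-size formula for a finite hypothesis class, which matches the bound the summary describes, can be computed directly (the example numbers are arbitrary):

```python
import math

def pac_sample_size(h_size, epsilon, delta):
    """Training-set size sufficient so that, with probability >= 1 - delta,
    a learner that outputs any hypothesis consistent with the data (from a
    finite class of h_size hypotheses) has true error <= epsilon:
      m >= (ln|H| + ln(1/delta)) / epsilon
    """
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Example: |H| = 2^10 hypotheses, want error <= 5% with 99% confidence.
m = pac_sample_size(2 ** 10, 0.05, 0.01)
print(m)
```

Note how the bound grows only logarithmically in the number of hypotheses and in 1/δ, but linearly in 1/ε.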
The document discusses the Vapnik-Chervonenkis (VC) dimension, which is a measure of the "power" or capacity of a learning machine or classifier. The VC dimension allows one to estimate the error of a classifier on future data based only on its training error and VC dimension. Specifically, with high probability the test error is bounded above by the training error plus an additional term involving the VC dimension. The document also introduces the concept of a classifier "shattering" a set of points, which relates to calculating the VC dimension.
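The test-error bound the summary refers to is the classic Vapnik bound; a small sketch evaluates it for invented numbers (VC dimension, sample size, and confidence level below are arbitrary examples):

```python
import math

def vc_error_bound(train_err, n, h, eta):
    """With probability >= 1 - eta, the test error of a classifier with
    VC dimension h trained on n examples satisfies:
      test_err <= train_err + sqrt((h*(ln(2n/h) + 1) - ln(eta/4)) / n)
    """
    penalty = math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)
    return train_err + penalty

# A classifier with VC dimension 10, trained on 10,000 examples
# with 5% training error, at 95% confidence:
bound = vc_error_bound(0.05, 10_000, 10, 0.05)
print(round(bound, 4))
```

The added term shrinks as n grows and grows with the VC dimension h, which is exactly the capacity-control trade-off the document describes.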
This document contains slides from a lecture on Hidden Markov Models given by Andrew W. Moore. The slides introduce Markov systems as having a set of states and discrete time steps, with the system occupying exactly one state at each time step chosen randomly based on the previous state. The slides provide examples of state transition probabilities in a Markov system and note that the Markov property means the next state depends only on the current state.
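A minimal simulation of such a Markov system (state names and transition probabilities below are invented, not taken from the slides) makes the Markov property concrete:

```python
import random

random.seed(0)

# A 3-state Markov system: each row gives the next-state probabilities
# conditioned on the current state, and each row sums to 1.
STATES = ["sunny", "rainy", "foggy"]
P = {
    "sunny": [0.8, 0.1, 0.1],
    "rainy": [0.3, 0.6, 0.1],
    "foggy": [0.4, 0.3, 0.3],
}

def simulate(start, steps):
    """Walk the chain: the next state is drawn using only the current
    state's row of P -- the Markov property in action."""
    state, path = start, [start]
    for _ in range(steps):
        state = random.choices(STATES, weights=P[state])[0]
        path.append(state)
    return path

path = simulate("sunny", 10)
assert all(abs(sum(row) - 1.0) < 1e-9 for row in P.values())
print(path)
```

A hidden Markov model adds one more layer: the states themselves are not observed, only outputs emitted from each state.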
- The document discusses K-means clustering and hierarchical clustering.
- It provides an overview of the K-means clustering algorithm, including how it aims to optimize clustering by minimizing distortion and finding cluster centroids.
- The K-means algorithm involves assigning points to centroids, updating centroids to be the mean of each cluster, and repeating until convergence.
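The assign/update loop just described can be sketched in a few lines of pure Python (the points and the deliberately simple initialization are illustrative):

```python
# A minimal k-means sketch: k = 2, 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),   # cluster near (1, 1)
          (8.0, 8.0), (9.0, 9.5), (8.5, 8.0)]   # cluster near (8.5, 8.5)

def sq_dist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def mean(cluster):
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

centroids = [points[0], points[3]]      # naive initialization, for illustration
for _ in range(10):
    clusters = [[], []]
    for p in points:                    # 1. assign each point to nearest centroid
        k = min(range(2), key=lambda i: sq_dist(p, centroids[i]))
        clusters[k].append(p)
    new_centroids = [mean(c) for c in clusters]
    if new_centroids == centroids:      # 2. stop once the centroids no longer move
        break
    centroids = new_centroids

print([tuple(round(v, 3) for v in c) for c in centroids])
```

Each iteration can only decrease the total distortion (sum of squared distances to assigned centroids), which is why the loop converges.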
This document provides instructions for other teachers to use and modify slides from a lecture on clustering with Gaussian mixtures given by Andrew W. Moore. It notes that the PowerPoint originals are available and encourages comments and corrections. Users are asked to include attribution if using a significant portion of the slides.
A Short Intro to Naive Bayesian Classifiers (guestfee8698)
This document introduces Naive Bayes classifiers and their use in document classification. It begins with an overview of Naive Bayes theory and classifiers. Examples are then provided to illustrate how to estimate probabilities for the classifier from sample training data and how to perform classification of new documents. The assumptions and advantages of the Naive Bayes approach are discussed. In particular, it notes that Naive Bayes classifiers can be efficiently constructed, even with many attributes, and generally perform well despite their "naivety".
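A toy version of the document-classification setup just described (the four-document "corpus" below is invented) shows how the probabilities are estimated and combined:

```python
import math
from collections import Counter

# Tiny invented training corpus: (document words, class label).
train = [
    ("cheap pills buy now".split(), "spam"),
    ("buy cheap watches".split(), "spam"),
    ("meeting agenda attached".split(), "ham"),
    ("lunch meeting tomorrow".split(), "ham"),
]

classes = {"spam", "ham"}
word_counts = {c: Counter() for c in classes}
doc_counts = Counter()
for words, c in train:
    word_counts[c].update(words)
    doc_counts[c] += 1

vocab = {w for words, _ in train for w in words}

def log_posterior(words, c):
    """log P(c) + sum_i log P(word_i | c), with Laplace (add-one) smoothing.
    The "naive" step: words are assumed independent given the class."""
    total = sum(word_counts[c].values())
    lp = math.log(doc_counts[c] / len(train))
    for w in words:
        lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    return lp

def classify(text):
    return max(classes, key=lambda c: log_posterior(text.split(), c))

print(classify("buy cheap pills"))
```

Working in log space avoids underflow, and the smoothing keeps unseen words from zeroing out a class, two of the practical reasons the approach scales to many attributes.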
This document contains slides from a lecture on Bayes net structure learning given by Andrew W. Moore. The slides introduce Bayes net structure learning as an additional machine learning method. They cover scoring Bayes net structures based on a Bayesian Information Criterion, and searching over possible structures to find the one with the best score. The purpose is to teach students about learning the structure of Bayesian networks from data.
- Bayesian networks can model conditional independencies between variables based on the network structure. Each variable is conditionally independent of its non-descendants given its parents.
- The d-separation algorithm allows determining if two variables are conditionally independent given some evidence by checking if all paths between them are "blocked".
- For trees/forests where each node has at most one parent, inference can be done efficiently in linear time by decomposing probabilities and passing messages between nodes.
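The probability decomposition behind that efficient inference can be shown on the smallest possible chain, A → B → C, with invented conditional probability tables:

```python
# Inference in a tiny chain-structured Bayes net A -> B -> C by
# decomposing the joint: P(a, b, c) = P(a) * P(b | a) * P(c | b).
# Message passing on trees computes the same sums in linear time.
P_a = {True: 0.3, False: 0.7}
P_b_given_a = {True: {True: 0.9, False: 0.1},   # P_b_given_a[a][b]
               False: {True: 0.2, False: 0.8}}
P_c_given_b = {True: {True: 0.7, False: 0.3},   # P_c_given_b[b][c]
               False: {True: 0.1, False: 0.9}}

# Marginal P(C = true), summing the factored joint over a and b.
p_c = sum(P_a[a] * P_b_given_a[a][b] * P_c_given_b[b][True]
          for a in (True, False) for b in (True, False))
print(round(p_c, 4))
```

On a chain of length n the brute-force sum has 2^(n-1) terms, while pushing each sum inward (the message-passing trick) reduces the work to O(n).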
This document discusses Bayes networks for representing and reasoning about uncertainty. It begins by noting the benefits of using joint distributions to describe uncertain worlds, but also the problem that full joint distributions grow unmanageably large. Bayes networks allow building joint distributions in manageable chunks by representing conditional independence relationships between variables. The document then discusses representing uncertainty using probability and key concepts such as conditional probability and Bayes' rule, working through examples to demonstrate their application.
The document discusses various machine learning algorithms including polynomial regression, quadratic regression, radial basis functions, and robust regression. It provides mathematical formulas and visual examples to explain how each algorithm works. The key ideas are that polynomial regression fits nonlinear functions of inputs, quadratic regression extends linear regression by including quadratic terms, radial basis functions use kernel functions centered at data points to perform nonlinear regression, and robust regression aims to fit data robustly by down-weighting outliers.
Instance-based learning (aka Case-based or Memory-based or non-parametric) (guestfee8698)
This document provides an overview of instance-based learning techniques. It begins by introducing 1-nearest neighbor classification and regression, which makes predictions based on the single closest training example. It then discusses how k-nearest neighbor addresses some of the issues with 1-NN by considering the average output of the k closest examples. The document also covers kernel regression, which weights all training examples based on their distance from the query point. It demonstrates how varying the kernel width parameter and query point affects the predictions. Instance-based learning relies on storing past examples and making predictions by comparing new examples to similar stored examples.
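The kernel-regression idea, and the effect of the kernel width, can be sketched directly (the five training points below are made up):

```python
import math

# Nadaraya-Watson kernel regression: the prediction at a query point is
# a weighted average of all training outputs, weighted by a Gaussian
# kernel of the distance to each training input.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 0.8, 0.9, 0.1, -0.7]   # roughly sin(x)

def kernel_predict(xq, width):
    weights = [math.exp(-((xq - x) ** 2) / (2 * width ** 2)) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

# A narrow kernel tracks nearby points; a very wide kernel smooths
# everything toward the overall mean of the outputs.
narrow = kernel_predict(1.0, 0.3)
wide = kernel_predict(1.0, 10.0)
print(round(narrow, 3), round(wide, 3))
```

This mirrors the k-NN trade-off in the summary: kernel width (like k) controls how local the prediction is.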
1) The document discusses linear regression and how it can be used to model the relationship between input variables (x) and output variables (y).
2) Linear regression finds the best fitting linear relationship by minimizing the sum of squared errors between the actual y values and the predicted y values from the linear model.
3) The maximum likelihood estimate of the parameters for linear regression can be found in closed form as a function of the input and output data.
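For the one-input case with an intercept, the closed-form solution mentioned in point 3 reduces to two familiar formulas; the data below are chosen so the fit is exact:

```python
# Closed-form maximum-likelihood (least-squares) fit of y = w0 + w1 * x:
#   w1 = cov(x, y) / var(x),   w0 = mean(y) - w1 * mean(x).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # exactly y = 1 + 2x, so the fit recovers it

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
w1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
      / sum((x - mx) ** 2 for x in xs))
w0 = my - w1 * mx

print(w0, w1)
```

With many inputs the same derivation gives the matrix normal equations w = (XᵀX)⁻¹Xᵀy, of which this is the 2-parameter special case.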
Cross-validation is a method for detecting and preventing overfitting in machine learning models. It involves randomly splitting a dataset into a training set and a test set. Models are trained on the training set and their performance is evaluated on the held-out test set, allowing models to be selected based on their expected generalization error rather than just their in-sample fit. The document describes using linear regression, quadratic regression, and nonparametric regression on simulated datasets to demonstrate how cross-validation can be used to select the model that will best predict future data.
The document discusses maximum likelihood estimation for learning the parameters of univariate Gaussian distributions from data. It shows that the maximum likelihood estimate (MLE) of the mean (μ) is simply the sample mean of the data. The MLE of the variance (σ²) is the sample variance with divisor n, which is slightly biased; the familiar unbiased estimator divides by n − 1 instead. Maximum likelihood estimation is a fundamental technique in statistical data analysis, and learning Gaussian distributions lays the groundwork for more advanced methods.
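Both estimates are one-liners on a small invented sample:

```python
# MLE for a univariate Gaussian: mu_hat is the sample mean, and the
# MLE variance divides by n (the unbiased sample variance divides
# by n - 1, so the two differ slightly on small samples).
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)

mu_mle = sum(data) / n
var_mle = sum((x - mu_mle) ** 2 for x in data) / n            # biased, /n
var_unbiased = sum((x - mu_mle) ** 2 for x in data) / (n - 1)

print(mu_mle, var_mle, round(var_unbiased, 4))
```

As n grows, the n versus n − 1 distinction vanishes, which is why the MLE is still said to be consistent.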
This document is a set of slides about Gaussians and their use in data mining. It begins with an introduction explaining why Gaussians are important tools. It then covers the entropy of a probability density function, univariate and multivariate Gaussians, and how Gaussians are used with Bayes' rule and maximum likelihood estimation. The slides provide definitions, visual examples, and key properties of Gaussian distributions. The author encourages others to use and modify the slides for teaching purposes and requests attribution if significant portions are used.
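The entropy result the slides cover has the closed form H = ½ ln(2πeσ²) for a univariate Gaussian; as a sanity check (with an arbitrary σ), it can be compared against a direct numerical integral of −p(x) ln p(x):

```python
import math

# Differential entropy of a univariate Gaussian: H = 0.5 * ln(2*pi*e*sigma^2),
# checked against a numerical integral of -p(x) * ln p(x).
sigma = 1.5
closed_form = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def pdf(x):
    return math.exp(-x * x / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Riemann sum over a wide interval that captures essentially all the mass.
dx = 0.001
numeric = -sum(pdf(x) * math.log(pdf(x)) * dx
               for x in (i * dx for i in range(-20_000, 20_000)))

assert abs(closed_form - numeric) < 1e-3
print(round(closed_form, 4))
```

Notably, the entropy depends only on σ, not on the mean: shifting a Gaussian does not change its uncertainty.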
This document provides an introduction to probabilistic and Bayesian analytics through a series of slides from a lecture by Andrew W. Moore. The key points covered include:
- Probability is used to represent uncertainty and is quantified by the fraction of possible worlds where an event occurs.
- The axioms of probability are introduced and interpreted visually, including that probabilities must be between 0 and 1 and the addition rule for mutually exclusive events.
- Important theorems are derived from the axioms, such as the probability of the complement of an event.
- Conditional probability is defined as the probability of one event given another using a visual representation.
- Bayes' rule for updating probabilities based on new information is derived and applied to examples.
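A worked instance of that final point, using the classic diagnostic-test setup (all numbers hypothetical), shows the update in action:

```python
# Bayes' rule: P(disease | positive) =
#   P(positive | disease) * P(disease) / P(positive),
# where P(positive) is expanded over both possible worlds.
p_disease = 0.01              # prior: 1% of the population has the disease
p_pos_given_disease = 0.95    # test sensitivity
p_pos_given_healthy = 0.05    # false-positive rate

p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 4))
```

Despite the accurate-sounding test, the posterior stays modest because the prior is so low, a standard illustration of why the prior term in Bayes' rule matters.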
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers (akankshawande)
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Main news related to the CCS TSI 2023 (2023/1695) (Jakub Marek)
An English 🇬🇧 translation of the slides accompanying a talk I gave about the main changes introduced by CCS TSI 2023 at the biggest Czech conference on communications and signalling systems on railways, held in the Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). It was attended by around 500 participants and 200 online followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
How to Get CNIC Information System with Paksim Ga.pptx (danishmna97)
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Driving Business Innovation: Latest Generative AI Advancements & Success Story (Safe Software)
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
What do a Lego brick and the XZ backdoor have in common? (Speck&Tech)
ABSTRACT: At first glance, what a Lego brick and the XZ backdoor have in common might seem to be only that both are building blocks, or dependencies, of creative and software projects. In reality, a Lego brick and the XZ backdoor case have much more in common than that.
Join the presentation to dive into a story of interoperability, standards, and open formats, and then discuss the important role contributors play in a sustainable open source community.
BIO: An advocate of free software and of standard, open formats. She has been an active member of the Fedora and openSUSE projects and co-founded the LibreItalia Association, where she was involved in several LibreOffice-related events, migrations, and training courses. She previously worked on LibreOffice migrations and training for several public administrations and private organizations. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager; when not pursuing her passion for computers and for Geeko, she cultivates her curiosity about astronomy (hence her nickname, deneb_alpha).
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf (Chart Kalyan)
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Project Management Semester Long Project - Acuity (jpupo2018)
Acuity is an innovative learning app designed to transform the way you engage with knowledge. Powered by AI technology, Acuity takes complex topics and distills them into concise, interactive summaries that are easy to read & understand. Whether you're exploring the depths of quantum mechanics or seeking insight into historical events, Acuity provides the key information you need without the burden of lengthy texts.
Fueling AI with Great Data with Airbyte Webinar (Zilliz)
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Webinar: Designing a schema for a Data Warehouse (Federico Razzoli)
Are you new to data warehouses (DWH)? Do you need to check whether your data warehouse follows the best practices for a good design? In both cases, this webinar is for you.
A data warehouse is a central relational database that contains all measurements about a business or an organisation. This data comes from a variety of heterogeneous data sources, which includes databases of any type that back the applications used by the company, data files exported by some applications, or APIs provided by internal or external services.
But designing a data warehouse correctly is a hard task, which requires gathering information about the business processes that need to be analysed in the first place. These processes must be translated into so-called star schemas, which means, denormalised databases where each table represents a dimension or facts.
We will discuss these topics:
- How to gather information about a business;
- Understanding dictionaries and how to identify business entities;
- Dimensions and facts;
- Setting a table granularity;
- Types of facts;
- Types of dimensions;
- Snowflakes and how to avoid them;
- Expanding existing dimensions and facts.
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
Skybuffer SAM4U tool for SAP license adoption (Tatiana Kojar)
Manage and optimize your license adoption and consumption with SAM4U, SAP's free software asset management tool for customers.
SAM4U delivers a detailed and well-structured overview of license inventory and usage through a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring a fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
Taking AI to the Next Level in Manufacturing.pdf (ssuserfac0301)
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
5. Ideas and approaches to help build your organization's AI strategy.
Best 20 SEO Techniques To Improve Website Visibility In SERP (Pixlogix Infotech)
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Threats to mobile devices are increasingly prevalent and growing in scope and complexity. Users want to take full advantage of the features available on their devices, but many of those features trade security for convenience and capability. This best-practices guide outlines steps users can take to better protect personal devices and information.