Naïve Bayes Classifier
Thomas Bayes
1702 - 1761
We will start off with a visual intuition, before looking at the math…
[Scatter plot: Antenna Length vs. Abdomen Length for Grasshoppers and Katydids]
Remember this example? Let’s get lots more data…
With a lot of data, we can build a histogram. Let us just build one for “Antenna Length” for now…
[Histogram: Antenna Length counts for Katydids and Grasshoppers]
We can leave the histograms as they are, or we can summarize them with two normal distributions. Let us use two normal distributions for ease of visualization in the following slides…
p(cj | d) = probability of class cj, given that we have observed d
Antennae length is 3
• We want to classify an insect we have found. Its antennae are 3 units long. How can we classify it?
• We can just ask ourselves: given the distributions of antennae lengths we have seen, is it more probable that our insect is a Grasshopper or a Katydid?
• There is a formal way to discuss the most probable classification…
With 10 Grasshoppers and 2 Katydids at this antenna length:
P(Grasshopper | 3) = 10 / (10 + 2) = 0.833
P(Katydid | 3) = 2 / (10 + 2) = 0.166
Antennae length is 3
p(cj | d) = probability of class cj, given that we have observed d
With 3 Grasshoppers and 9 Katydids at this antenna length:
P(Grasshopper | 7) = 3 / (3 + 9) = 0.250
P(Katydid | 7) = 9 / (3 + 9) = 0.750
Antennae length is 7
p(cj | d) = probability of class cj, given that we have observed d
With 6 Grasshoppers and 6 Katydids at this antenna length:
P(Grasshopper | 5) = 6 / (6 + 6) = 0.500
P(Katydid | 5) = 6 / (6 + 6) = 0.500
Antennae length is 5
p(cj | d) = probability of class cj, given that we have observed d
Bayes Classifiers Example
Find out the probability of the previously unseen instance belonging to each class, then simply pick the most probable class.

P(C|X) = P(X|C) P(C) / P(X)

Posterior = (Likelihood × Prior) / Evidence
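To make this rule concrete, here is a minimal sketch (not part of the original deck) that turns the histogram counts from the previous slides into posteriors and picks the larger one.

```python
# Minimal sketch: posteriors from the antenna-length histogram counts above.
hist = {
    "Grasshopper": {3: 10, 5: 6, 7: 3},   # insects of this class at this antenna length
    "Katydid":     {3: 2,  5: 6, 7: 9},
}

def classify(antenna_length):
    counts = {c: hist[c].get(antenna_length, 0) for c in hist}
    total = sum(counts.values())
    posteriors = {c: n / total for c, n in counts.items()}
    return max(posteriors, key=posteriors.get), posteriors

print(classify(3))   # ('Grasshopper', {'Grasshopper': 0.833..., 'Katydid': 0.166...})
print(classify(7))   # ('Katydid', ...)
```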
Probability Basics
• Prior, conditional and joint probability
– Prior probability: P(X)
– Conditional probability: P(X1 | X2), P(X2 | X1)
– Joint probability: X = (X1, X2), P(X) = P(X1, X2)
– Relationship: P(X1, X2) = P(X2 | X1) P(X1) = P(X1 | X2) P(X2)
– Independence: P(X2 | X1) = P(X2), P(X1 | X2) = P(X1), P(X1, X2) = P(X1) P(X2)
• Bayesian Rule
– P(C|X) = P(X|C) P(C) / P(X), i.e. Posterior = (Likelihood × Prior) / Evidence
Probabilistic Classification
• Establishing a probabilistic model for classification
– Discriminative model: P(C|X), C = c1, …, cL, X = (X1, …, Xn)
– Generative model: P(X|C), C = c1, …, cL, X = (X1, …, Xn)
• MAP classification rule
– MAP: Maximum A Posteriori
– Assign x to c* if P(C = c* | X = x) > P(C = c | X = x) for all c ≠ c*, c = c1, …, cL
• Generative classification with the MAP rule
– Apply the Bayesian rule to convert: P(C|X) = P(X|C) P(C) / P(X) ∝ P(X|C) P(C)
Naïve Bayes
• Bayes classification
P(C|X) ∝ P(X|C) P(C) = P(X1, …, Xn | C) P(C)
Difficulty: learning the joint probability P(X1, …, Xn | C)
• Naïve Bayes classification
– Making the assumption that all input attributes are conditionally independent given the class:
P(X1, X2, …, Xn | C) = P(X1 | X2, …, Xn; C) P(X2, …, Xn | C)
= P(X1 | C) P(X2, …, Xn | C)
= P(X1 | C) P(X2 | C) ··· P(Xn | C)
– MAP classification rule: for x = (x1, …, xn), assign x to c* if
[P(x1 | c*) ··· P(xn | c*)] P(c*) > [P(x1 | c) ··· P(xn | c)] P(c), for all c ≠ c*, c = c1, …, cL
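This factorization amounts to a one-line scoring rule. Below is a minimal sketch (an assumed implementation, not from the slides) where `prior` and `cond` are hypothetical lookup tables produced by a learning phase.

```python
from math import prod

def nb_map(x, classes, prior, cond):
    """Naive Bayes MAP rule: prior[c] = P(C=c); cond[c][i][v] = P(X_i = v | C = c)."""
    scores = {c: prior[c] * prod(cond[c][i][v] for i, v in enumerate(x))
              for c in classes}
    return max(scores, key=scores.get), scores

# Hypothetical one-feature usage (the "Drew" name table that appears later in the deck):
prior = {"male": 3/8, "female": 5/8}
cond = {"male": [{"drew": 1/3}], "female": [{"drew": 2/5}]}
print(nb_map(("drew",), ["male", "female"], prior, cond))
# ('female', {'male': 0.125, 'female': 0.25})
```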
Example
• Example: Play Tennis

Example
• Learning Phase

Outlook   Play=Yes  Play=No
Sunny     2/9       3/5
Overcast  4/9       0/5
Rain      3/9       2/5

Temperature  Play=Yes  Play=No
Hot          2/9       2/5
Mild         4/9       2/5
Cool         3/9       1/5

Humidity  Play=Yes  Play=No
High      3/9       4/5
Normal    6/9       1/5

Wind    Play=Yes  Play=No
Strong  3/9       3/5
Weak    6/9       2/5

P(Play=Yes) = 9/14   P(Play=No) = 5/14
Example
• Test Phase
– Given a new instance,
x’ = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables:
P(Outlook=Sunny|Play=Yes) = 2/9        P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9     P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9        P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9          P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14                     P(Play=No) = 5/14
– MAP rule
P(Yes|x’): [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No|x’): [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206
Given that P(Yes|x’) < P(No|x’), we label x’ as “No”.
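The same arithmetic can be checked in a few lines of Python; this sketch simply re-multiplies the looked-up values and is not part of the original slides.

```python
# Test-phase scores for x' = (Sunny, Cool, High, Strong), from the learning-phase tables.
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes) P(Yes)
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # same factors conditioned on No
print(round(p_yes, 4), round(p_no, 4))            # 0.0053 0.0206
print("No" if p_no > p_yes else "Yes")            # No
```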
Naïve Bayes Example

Probability      positive  negative
P(Y)             0.5       0.5
P(small | Y)     0.4       0.4
P(medium | Y)    0.1       0.2
P(large | Y)     0.5       0.4
P(red | Y)       0.9       0.3
P(blue | Y)      0.05      0.3
P(green | Y)     0.05      0.4
P(square | Y)    0.05      0.4
P(triangle | Y)  0.05      0.3
P(circle | Y)    0.9       0.3

Test Instance: <medium, red, circle>
Naïve Bayes Example

Probability    positive  negative
P(Y)           0.5       0.5
P(medium | Y)  0.1       0.2
P(red | Y)     0.9       0.3
P(circle | Y)  0.9       0.3

Test Instance: <medium, red, circle>

P(positive | X) = P(positive) P(medium | positive) P(red | positive) P(circle | positive) / P(X)
                = 0.5 × 0.1 × 0.9 × 0.9 / P(X) = 0.0405 / P(X) = 0.0405 / 0.0495 = 0.8181
P(negative | X) = P(negative) P(medium | negative) P(red | negative) P(circle | negative) / P(X)
                = 0.5 × 0.2 × 0.3 × 0.3 / P(X) = 0.009 / P(X) = 0.009 / 0.0495 = 0.1818

P(positive | X) + P(negative | X) = 0.0405 / P(X) + 0.009 / P(X) = 1, so P(X) = 0.0405 + 0.009 = 0.0495
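A short sketch (not from the slides) of that normalization step: because both unnormalized scores share the same evidence term P(X), dividing each by their sum gives the posteriors directly.

```python
# Unnormalized class scores for <medium, red, circle>.
score_pos = 0.5 * 0.1 * 0.9 * 0.9   # P(positive) P(medium|pos) P(red|pos) P(circle|pos)
score_neg = 0.5 * 0.2 * 0.3 * 0.3   # P(negative) P(medium|neg) P(red|neg) P(circle|neg)
evidence = score_pos + score_neg    # = P(X) = 0.0495
print(score_pos / evidence, score_neg / evidence)   # 0.8181..., 0.1818...
```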
Assume that we have two classes: c1 = male and c2 = female.
We have a person whose sex we do not know, say “drew” or d.
Classifying drew as male or female is equivalent to asking which is more probable: that drew is male or that drew is female, i.e. which is greater, p(male | drew) or p(female | drew)?
p(male | drew) = p(drew | male) p(male) / p(drew)
(Note: “Drew” can be a male or a female name.)
What is the probability of being called “drew” given that you are a male? What is the probability of being a male? What is the probability of being named “drew”? (Actually irrelevant, since it is the same for all classes.)
Drew Carey
Drew Barrymore
p(cj | d) = p(d | cj) p(cj) / p(d)
Officer Drew

Name     Sex
Drew     Male
Claudia  Female
Drew     Female
Drew     Female
Alberto  Male
Karin    Female
Nina     Female
Sergio   Male

This is Officer Drew. Is Officer Drew a Male or a Female?
Luckily, we have a small database with names and sex.
We can use it to apply Bayes rule…
Officer Drew
p(cj | d) = p(d | cj) p(cj) / p(d)

Since p(drew) = 3/8 is the same for both classes, we need only compare the numerators:
p(male | drew) ∝ p(drew | male) p(male) = 1/3 × 3/8 = 0.125
p(female | drew) ∝ p(drew | female) p(female) = 2/5 × 5/8 = 0.250
Name Sex
Drew Male
Claudia Female
Drew Female
Drew Female
Alberto Male
Karin Female
Nina Female
Sergio Male
Officer Drew is more
likely to be a Female.
Officer Drew IS a female
Officer Drew
Name     Over 170cm  Eye    Hair length  Sex
Drew     No          Blue   Short        Male
Claudia  Yes         Brown  Long         Female
Drew     No          Blue   Long         Female
Drew     No          Blue   Long         Female
Alberto  Yes         Brown  Short        Male
Karin    No          Blue   Long         Female
Nina     Yes         Brown  Short        Female
Sergio   Yes         Blue   Long         Male

p(cj | d) = p(d | cj) p(cj) / p(d)

So far we have only considered Bayes classification when we have one attribute (the “antennae length”, or the “name”). But we may have many features. How do we use all the features?
• To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate
p(d|cj) = p(d1|cj) × p(d2|cj) × … × p(dn|cj)
The probability of class cj generating instance d equals the probability of class cj generating the observed value for feature 1, multiplied by the probability of class cj generating the observed value for feature 2, multiplied by…
• To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate
p(d|cj) = p(d1|cj) × p(d2|cj) × … × p(dn|cj)
p(officer drew|cj) = p(over_170cm = yes|cj) × p(eye = blue|cj) × …
Officer Drew is blue-eyed, over 170cm tall, and has long hair.
p(officer drew | Female) = 2/5 × 3/5 × …
p(officer drew | Male) = 2/3 × 2/3 × …
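As a sanity check, the per-class likelihoods can be recomputed directly from the eight-row table. This sketch is not part of the original deck; the third factor (hair length) is not shown on the slide but follows from the same table.

```python
# Recompute p(over_170cm|sex) * p(eye|sex) * p(hair|sex) from the table above.
rows = [  # (over_170cm, eye, hair, sex)
    ("No",  "Blue",  "Short", "Male"),
    ("Yes", "Brown", "Long",  "Female"),
    ("No",  "Blue",  "Long",  "Female"),
    ("No",  "Blue",  "Long",  "Female"),
    ("Yes", "Brown", "Short", "Male"),
    ("No",  "Blue",  "Long",  "Female"),
    ("Yes", "Brown", "Short", "Female"),
    ("Yes", "Blue",  "Long",  "Male"),
]

def likelihood(sex, over170, eye, hair):
    group = [r for r in rows if r[3] == sex]
    def p(i, v):  # fraction of this class whose feature i has value v
        return sum(r[i] == v for r in group) / len(group)
    return p(0, over170) * p(1, eye) * p(2, hair)

print(likelihood("Female", "Yes", "Blue", "Long"))  # 2/5 * 3/5 * 4/5
print(likelihood("Male",   "Yes", "Blue", "Long"))  # 2/3 * 2/3 * 1/3
```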
[Diagram: class node cj with arrows to feature nodes p(d1|cj), p(d2|cj), …, p(dn|cj)]
The Naïve Bayes classifier is often represented as this type of graph… Note the direction of the arrows, which state that each class causes certain features, with a certain probability.
Naïve Bayes is fast and space efficient.
We can look up all the probabilities with a single scan of the database and store them in a (small) table…

Sex     Over 190cm
Male    Yes  0.15
        No   0.85
Female  Yes  0.01
        No   0.99

Sex     Long Hair
Male    Yes  0.05
        No   0.95
Female  Yes  0.70
        No   0.30

Sex
Male
Female
Naïve Bayes is NOT sensitive to irrelevant features...
Suppose we are trying to classify a person’s sex based on several features, including eye color. (Of course, eye color is completely irrelevant to a person’s gender.)
p(Jessica | cj) = p(eye = brown|cj) × p(wears_dress = yes|cj) × …
p(Jessica | Female) = 9,000/10,000 × 9,975/10,000 × …
p(Jessica | Male)   = 9,001/10,000 × 2/10,000 × …
Almost the same!
However, this assumes that we have good enough estimates of the probabilities, so the more data the better.
An obvious point: I have used a simple two-class problem, with two possible values per feature, in my previous examples. However, we can have an arbitrary number of classes, or feature values.

Animal  Mass > 10kg
Cat     Yes  0.15
        No   0.85
Dog     Yes  0.91
        No   0.09
Pig     Yes  0.99
        No   0.01

Animal  Color
Cat     Black  0.33
        White  0.23
        Brown  0.44
Dog     Black  0.97
        White  0.03
        Brown  0.90
Pig     Black  0.04
        White  0.01
        Brown  0.95

Animal
Cat
Dog
Pig
Naïve Bayesian Classifier
Problem! Naïve Bayes assumes independence of features…

Sex     Over 6 foot
Male    Yes  0.15
        No   0.85
Female  Yes  0.01
        No   0.99

Sex     Over 200 pounds
Male    Yes  0.11
        No   0.80
Female  Yes  0.05
        No   0.95
Naïve Bayesian Classifier
Solution: consider the relationships between attributes…

Sex     Over 6 foot
Male    Yes  0.15
        No   0.85
Female  Yes  0.01
        No   0.99

Sex     Over 200 pounds
Male    Yes and Over 6 foot      0.11
        No and Over 6 foot       0.59
        Yes and NOT Over 6 foot  0.05
        No and NOT Over 6 foot   0.35
Female  Yes and Over 6 foot      0.01
        …
Relevant Issues
• Violation of Independence Assumption
– For many real-world tasks, P(X1, …, Xn | C) ≠ P(X1|C) ··· P(Xn|C)
– Nevertheless, naïve Bayes works surprisingly well anyway!
• Zero conditional probability Problem
– If no training example contains the attribute value Xj = ajk, then P̂(Xj = ajk | C = ci) = 0
– In this circumstance, P̂(x1|ci) ··· P̂(ajk|ci) ··· P̂(xn|ci) = 0 during test
– For a remedy, conditional probabilities are estimated with
P̂(Xj = ajk | C = ci) = (nc + m·p) / (n + m)
nc: number of training examples for which Xj = ajk and C = ci
n: number of training examples for which C = ci
p: prior estimate (usually, p = 1/t for t possible values of Xj)
m: weight given to the prior (number of “virtual” examples, m ≥ 1)
Estimating Probabilities
• Normally, probabilities are estimated based on observed frequencies in the training data.
• If D contains nk examples in category yk, and nijk of these nk examples have the jth value for feature Xi, xij, then:
P(Xi = xij | Y = yk) = nijk / nk
• However, estimating such probabilities from small training sets is error-prone.
– If, due only to chance, a rare feature Xi is always false in the training data, then for all yk: P(Xi=true | Y=yk) = 0.
– If Xi=true then occurs in a test example X, the result is that for all yk: P(X | Y=yk) = 0, and hence for all yk: P(Y=yk | X) = 0.
Probability Estimation Example

Ex  Size   Color  Shape     Category
1   small  red    circle    positive
2   large  red    circle    positive
3   small  red    triangle  negative
4   large  blue   circle    negative

Probability      positive  negative
P(Y)             0.5       0.5
P(small | Y)     0.5       0.5
P(medium | Y)    0.0       0.0
P(large | Y)     0.5       0.5
P(red | Y)       1.0       0.5
P(blue | Y)      0.0       0.5
P(green | Y)     0.0       0.0
P(square | Y)    0.0       0.0
P(triangle | Y)  0.0       0.5
P(circle | Y)    1.0       0.5

Test Instance X: <medium, red, circle>
P(positive | X) = 0.5 × 0.0 × 1.0 × 1.0 / P(X) = 0
P(negative | X) = 0.5 × 0.0 × 0.5 × 0.5 / P(X) = 0
Smoothing
• To account for estimation from small samples, probability estimates are adjusted or smoothed.
• Laplace smoothing using an m-estimate assumes that each feature is given a prior probability, p, that is assumed to have been previously observed in a “virtual” sample of size m:
P(Xi = xij | Y = yk) = (nijk + m·p) / (nk + m)
• For binary features, p is simply assumed to be 0.5.
Laplace Smoothing Example
• Assume the training set contains 10 positive examples:
– 4: small
– 0: medium
– 6: large
• Estimate parameters as follows (with m=1, p=1/3):
– P(small | positive) = (4 + 1/3) / (10 + 1) = 0.394
– P(medium | positive) = (0 + 1/3) / (10 + 1) = 0.03
– P(large | positive) = (6 + 1/3) / (10 + 1) = 0.576
– P(small or medium or large | positive) = 1.0
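The m-estimate can be wrapped in a tiny helper; this sketch (not from the slides) reproduces the three numbers above with m = 1 and p = 1/3.

```python
# m-estimate: (n_c + m*p) / (n + m)
def m_estimate(n_c, n, p, m=1):
    return (n_c + m * p) / (n + m)

for name, count in [("small", 4), ("medium", 0), ("large", 6)]:
    print(name, round(m_estimate(count, n=10, p=1/3), 3))
# small 0.394, medium 0.03, large 0.576 -- and the three estimates sum to 1.0
```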
Numerical Stability
• It is often the case that machine learning algorithms need to work with very small numbers
– Imagine computing the probability of 2000 independent coin flips
– MATLAB thinks that (0.5)^2000 = 0
Stochastic Language Models
• Models the probability of generating strings (each word in turn) in the language (commonly all strings over ∑). E.g., a unigram model M:

0.2   the
0.1   a
0.01  guy
0.01  fruit
0.03  said
0.02  likes
…

the   guy   likes  the   fruit
0.2   0.01  0.02   0.2   0.01     (multiply)

P(s | M) = 0.00000008
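A minimal sketch (not part of the slides) of that unigram computation, with the per-word probabilities copied from the table above.

```python
# Probability of a string under a unigram model: product of per-word probabilities.
from math import prod

model_m = {"the": 0.2, "a": 0.1, "guy": 0.01, "fruit": 0.01,
           "said": 0.03, "likes": 0.02}

sentence = "the guy likes the fruit".split()
print(prod(model_m[w] for w in sentence))   # ~8e-08, i.e. 0.00000008
```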
Numerical Stability
• Instead of comparing P(Y=5|X1,…,Xn) with P(Y=6|X1,…,Xn),
– Compare their logarithms
Underflow Prevention
• Multiplying lots of probabilities, which are
between 0 and 1 by definition, can result in
floating-point underflow.
• Since log(xy) = log(x) + log(y), it is better to
perform all computations by summing logs of
probabilities rather than multiplying
probabilities.
• Class with highest final un-normalized log
probability score is still the most probable.
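A brief sketch (not from the slides) of scoring in log space; it reuses the Play Tennis numbers from earlier, and the decision is unchanged because the logarithm is monotonic.

```python
# Sum log-probabilities instead of multiplying probabilities to avoid underflow.
from math import log

def log_score(prior, likelihoods):
    return log(prior) + sum(log(p) for p in likelihoods)

yes = log_score(9/14, [2/9, 3/9, 3/9, 3/9])   # Play Tennis numbers from earlier
no  = log_score(5/14, [3/5, 1/5, 4/5, 3/5])
print("No" if no > yes else "Yes")            # No
```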
Relevant Issues
• Continuous-valued Input Attributes
– An attribute can take innumerable values
– Conditional probability is modeled with the normal distribution:
P̂(Xj | C = ci) = 1 / (√(2π) σji) · exp( −(Xj − μji)² / (2 σji²) )
μji: mean (average) of attribute values Xj of examples for which C = ci
σji: standard deviation of attribute values Xj of examples for which C = ci
– Learning Phase: for X = (X1, …, Xn), C = c1, …, cL
Output: n × L normal distributions and P(C = ci), i = 1, …, L
– Test Phase: for X’ = (X1’, …, Xn’)
• Calculate conditional probabilities with all the normal distributions
• Apply the MAP rule to make a decision
Data with Numeric Attributes

sex     height (feet)  weight (lbs)  foot size (inches)
male    6              180           12
male    5.92 (5'11")   190           11
male    5.58 (5'7")    170           12
male    5.92 (5'11")   165           10
female  5              100           6
female  5.5 (5'6")     150           8
female  5.42 (5'5")    130           7
female  5.75 (5'9")    150           9

sex     mean (height)  variance (height)  mean (weight)  variance (weight)  mean (foot size)  variance (foot size)
male    5.855          3.5033e-02         176.25         1.2292e+02         11.25             9.1667e-01
female  5.4175         9.7225e-02         132.5          5.5833e+02         7.5               1.6667e+00
Data with Numeric Attributes

sex     height (feet)  weight (lbs)  foot size (inches)
sample  6              130           8

Since the posterior numerator is greater in the female case, we predict the sample is female.
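A sketch (not part of the original slides) of the full numeric-attribute pipeline: per-class Gaussians built from the mean/variance table above, combined under the MAP rule. The equal 0.5 priors are an assumption made here, not stated on the slides.

```python
# Gaussian naive Bayes on the sex/height/weight/foot-size example.
from math import pi, sqrt, exp

stats = {  # (mean, variance) per attribute, copied from the table above
    "male":   {"height": (5.855, 3.5033e-2), "weight": (176.25, 1.2292e+2),
               "foot":   (11.25, 9.1667e-1)},
    "female": {"height": (5.4175, 9.7225e-2), "weight": (132.5, 5.5833e+2),
               "foot":   (7.5, 1.6667e+0)},
}
prior = {"male": 0.5, "female": 0.5}   # assumed equal priors

def gaussian(x, mean, var):
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def posterior_numerator(sex, sample):
    num = prior[sex]
    for attr, value in sample.items():
        num *= gaussian(value, *stats[sex][attr])
    return num

sample = {"height": 6, "weight": 130, "foot": 8}
print(posterior_numerator("male", sample), posterior_numerator("female", sample))
# The female numerator is larger, so the sample is classified as female.
```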
Advantages/Disadvantages of Naïve Bayes
• Advantages:
– Fast to train (single scan). Fast to classify
– Not sensitive to irrelevant features
– Handles real and discrete data
– Handles streaming data well
• Disadvantages:
– Assumes independence of features
Conclusions
• Naïve Bayes is based on the independence assumption
• Training is very easy and fast; it requires only considering each attribute in each class separately
• Testing is straightforward; it only requires looking up tables or calculating conditional probabilities with normal distributions
• A popular generative model
• Performance is competitive with most state-of-the-art classifiers even when the independence assumption is violated
• Many successful applications, e.g., spam mail filtering
• Apart from classification, naïve Bayes can do more…
Conclusions
• Naïve Bayes is:
– Really easy to implement and often works well
– Often a good first thing to try
– Commonly used as a “punching bag” for smarter
algorithms
Acknowledgements
Material in these slides has been taken from the following resources:
• Introduction to Machine Learning, Alpaydin
• Statistical Pattern Recognition: A Review – A.K. Jain et al., PAMI (22), 2000
• Pattern Recognition and Analysis Course – A.K. Jain, MSU
• “Pattern Classification” by Duda et al., John Wiley & Sons
• http://www.doc.ic.ac.uk/~sgc/teaching/pre2012/v231/lecture13.html
• http://en.wikipedia.org/wiki/Naive_Bayes_classifier
• Some material adopted from Dr. Adam Prugel-Bennett