Naive Bayes Classifiers
Connectionist and Statistical Language Processing

Frank Keller
keller@coli.uni-sb.de
Computerlinguistik
Universität des Saarlandes

Overview

  • Sample data set with frequencies and probabilities
  • Classification based on Bayes rule
  • Maximum a posteriori and maximum likelihood
  • Properties of Bayes classifiers
  • Naive Bayes classifiers
  • Parameter estimation, properties, example
  • Dealing with sparse data
  • Application: email classification

Literature: Witten and Frank (2000: ch. 4), Mitchell (1997: ch. 6).




A Sample Data Set

Fictional data set that describes the weather conditions for playing
some unspecified game.

 outlook    temp.   humidity   windy   play
 sunny      hot     high       false   no
 sunny      hot     high       true    no
 overcast   hot     high       false   yes
 rainy      mild    high       false   yes
 rainy      cool    normal     false   yes
 rainy      cool    normal     true    no
 overcast   cool    normal     true    yes
 sunny      mild    high       false   no
 sunny      cool    normal     false   yes
 rainy      mild    normal     false   yes
 sunny      mild    normal     true    yes
 overcast   mild    high       true    yes
 overcast   hot     normal     false   yes
 rainy      mild    high       true    no

Frequencies and Probabilities

Frequencies and probabilities for the weather data:

 outlook            temperature       humidity           windy            play
            yes no            yes no            yes no          yes no    yes  no
 sunny       2   3   hot       2   2   high      3   4   false   6   2     9    5
 overcast    4   0   mild      4   2   normal    6   1   true    3   3
 rainy       3   2   cool      3   1

            yes  no           yes  no           yes  no         yes  no   yes   no
 sunny      2/9 3/5  hot      2/9 2/5  high     3/9 4/5  false  6/9 2/5   9/14 5/14
 overcast   4/9 0/5  mild     4/9 2/5  normal   6/9 1/5  true   3/9 3/5
 rainy      3/9 2/5  cool     3/9 1/5
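These counts can be reproduced mechanically. Below is a minimal Python sketch (not part of the original slides) that tallies the class-conditional frequencies from the 14 instances transcribed above; the tuple encoding of the data set is an assumption made for illustration.

```python
from collections import Counter, defaultdict

# The 14 weather instances, transcribed from the table above:
# (outlook, temperature, humidity, windy, play)
data = [
    ("sunny",    "hot",  "high",   False, "no"),
    ("sunny",    "hot",  "high",   True,  "no"),
    ("overcast", "hot",  "high",   False, "yes"),
    ("rainy",    "mild", "high",   False, "yes"),
    ("rainy",    "cool", "normal", False, "yes"),
    ("rainy",    "cool", "normal", True,  "no"),
    ("overcast", "cool", "normal", True,  "yes"),
    ("sunny",    "mild", "high",   False, "no"),
    ("sunny",    "cool", "normal", False, "yes"),
    ("rainy",    "mild", "normal", False, "yes"),
    ("sunny",    "mild", "normal", True,  "yes"),
    ("overcast", "mild", "high",   True,  "yes"),
    ("overcast", "hot",  "normal", False, "yes"),
    ("rainy",    "mild", "high",   True,  "no"),
]
attributes = ["outlook", "temperature", "humidity", "windy"]

# Class frequencies: play = yes occurs 9 times, no occurs 5 times
class_counts = Counter(row[-1] for row in data)

# Attribute-value frequencies conditioned on the class
cond_counts = defaultdict(Counter)
for *values, play in data:
    for attr, value in zip(attributes, values):
        cond_counts[(attr, value)][play] += 1

# Relative frequencies, e.g. P(outlook=sunny | play=yes) = 2/9
for (attr, value), counts in sorted(cond_counts.items()):
    for cls in ("yes", "no"):
        print(f"P({attr}={value} | play={cls}) = {counts[cls]}/{class_counts[cls]}")
```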
Classifying an Unseen Example

Now assume that we have to classify the following new instance:

 outlook   temp.   humidity   windy   play
 sunny     cool    high       true    ?

Key idea: compute a probability for each class based on the probability
distribution in the training data.

First take into account the probability of each attribute. Treat all
attributes as equally important, i.e., multiply the probabilities:

P(yes) = 2/9 · 3/9 · 3/9 · 3/9 = 0.0082
P(no) = 3/5 · 1/5 · 4/5 · 3/5 = 0.0577

Now take into account the overall probability of a given class and
multiply it with the product of the attribute probabilities:

P(yes) = 0.0082 · 9/14 = 0.0053
P(no) = 0.0577 · 5/14 = 0.0206

Now choose the class that maximizes this probability. This means that
the new instance will be classified as no.
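The two steps above can be traced in a few lines of Python; a minimal sketch, assuming the relative frequencies read off the tables on the previous slide:

```python
# Relative frequencies read off the tables above (play = yes / no)
p_yes, p_no = 9/14, 5/14
cond_yes = {"outlook=sunny": 2/9, "temp=cool": 3/9,
            "humidity=high": 3/9, "windy=true": 3/9}
cond_no  = {"outlook=sunny": 3/5, "temp=cool": 1/5,
            "humidity=high": 4/5, "windy=true": 3/5}

# Step 1: multiply the attribute probabilities for each class
score_yes = 1.0
for p in cond_yes.values():
    score_yes *= p            # 2/9 * 3/9 * 3/9 * 3/9 ≈ 0.0082

score_no = 1.0
for p in cond_no.values():
    score_no *= p             # 3/5 * 1/5 * 4/5 * 3/5 ≈ 0.0577

# Step 2: multiply in the class priors
score_yes *= p_yes            # ≈ 0.0053
score_no  *= p_no             # ≈ 0.0206

# Step 3: pick the class with the larger score
print("yes" if score_yes > score_no else "no")   # -> no
```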




Bayes Rule

This procedure is based on Bayes Rule, which says: if you have a
hypothesis h and data D which bears on the hypothesis, then:

(1)   P(h|D) = P(D|h)P(h) / P(D)

P(h): independent probability of h: prior probability
P(D): independent probability of D
P(D|h): conditional probability of D given h: likelihood
P(h|D): conditional probability of h given D: posterior probability

Maximum A Posteriori

Based on Bayes Rule, we can compute the maximum a posteriori hypothesis
for the data:

(2)   hMAP = arg max_{h ∈ H} P(h|D)
           = arg max_{h ∈ H} P(D|h)P(h) / P(D)
           = arg max_{h ∈ H} P(D|h)P(h)

H: set of all hypotheses

Note that we can drop P(D), as the probability of the data is constant
(and independent of the hypothesis).
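A toy numeric check of the last step (the numbers are hypothetical, chosen only for the demonstration): dividing every score by the same constant P(D) leaves the arg max unchanged.

```python
# Hypothetical likelihoods P(D|h) and priors P(h) for three hypotheses
hypotheses = {
    "h1": {"likelihood": 0.20, "prior": 0.5},
    "h2": {"likelihood": 0.60, "prior": 0.3},
    "h3": {"likelihood": 0.10, "prior": 0.2},
}
p_data = sum(h["likelihood"] * h["prior"] for h in hypotheses.values())  # P(D)

# Full posterior P(h|D) = P(D|h) P(h) / P(D)
posterior = {name: h["likelihood"] * h["prior"] / p_data
             for name, h in hypotheses.items()}
# Unnormalised score P(D|h) P(h)
score = {name: h["likelihood"] * h["prior"] for name, h in hypotheses.items()}

# Same hMAP either way, because P(D) scales all scores equally
assert max(posterior, key=posterior.get) == max(score, key=score.get)
print(max(score, key=score.get))   # -> h2
```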
Maximum Likelihood

Now assume that all hypotheses are equally probable a priori, i.e.,
P(h_i) = P(h_j) for all h_i, h_j ∈ H.

This is called assuming a uniform prior. It simplifies computing the
posterior:

(3)   hML = arg max_{h ∈ H} P(D|h)

This hypothesis is called the maximum likelihood hypothesis.

Properties of Bayes Classifiers

  • Incrementality: with each training example, the prior and the
    likelihood can be updated dynamically: flexible and robust to errors.
  • Combines prior knowledge and observed data: the prior probability of
    a hypothesis is multiplied with the probability of the hypothesis
    given the training data.
  • Probabilistic hypotheses: outputs not only a classification, but a
    probability distribution over all classes.
  • Meta-classification: the outputs of several classifiers can be
    combined, e.g., by multiplying the probabilities that all classifiers
    predict for a given class.




Naive Bayes Classifier

Assumption: the training set consists of instances described as
conjunctions of attribute values; the target classification is based on
a finite set of classes V.

The task of the learner is to predict the correct class for a new
instance a_1, a_2, ..., a_n.

Key idea: assign the most probable class vMAP using Bayes Rule.

(4)   vMAP = arg max_{v_j ∈ V} P(v_j | a_1, a_2, ..., a_n)
           = arg max_{v_j ∈ V} P(a_1, a_2, ..., a_n | v_j) P(v_j) / P(a_1, a_2, ..., a_n)
           = arg max_{v_j ∈ V} P(a_1, a_2, ..., a_n | v_j) P(v_j)

Naive Bayes: Parameter Estimation

Estimating P(v_j) is simple: compute the relative frequency of each
target class in the training set.

Estimating P(a_1, a_2, ..., a_n | v_j) is difficult: there are typically
not enough instances for each attribute combination in the training set:
the sparse data problem.

Independence assumption: attribute values are conditionally independent
given the target value: naive Bayes.

(5)   P(a_1, a_2, ..., a_n | v_j) = ∏_i P(a_i | v_j)

Hence we get the following classifier:

(6)   vNB = arg max_{v_j ∈ V} P(v_j) ∏_i P(a_i | v_j)
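A minimal Python sketch of the classifier in equation (6), assuming categorical attributes and raw relative-frequency estimates (smoothing is discussed under sparse data below); the class and method names are choices made for illustration, not part of the slides.

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Equation (6): vNB = arg max_v P(v) * prod_i P(a_i | v),
    with all probabilities estimated as raw relative frequencies."""

    def fit(self, instances, labels):
        self.total = len(labels)
        self.class_counts = Counter(labels)              # count of each class v_j
        # counts[v_j][(i, a)] = number of class-v_j instances with value a at position i
        self.counts = defaultdict(Counter)
        for attrs, label in zip(instances, labels):
            for i, a in enumerate(attrs):
                self.counts[label][(i, a)] += 1
        return self

    def predict(self, attrs):
        best_class, best_score = None, -1.0
        for v, n_v in self.class_counts.items():
            score = n_v / self.total                      # prior P(v_j)
            for i, a in enumerate(attrs):
                score *= self.counts[v][(i, a)] / n_v     # likelihood P(a_i | v_j)
            if score > best_score:
                best_class, best_score = v, score
        return best_class
```

Trained on the `data` list from the sample-data sketch (attribute tuples as instances, the play column as labels), `predict(("sunny", "cool", "high", True))` should return no, matching the worked example on the next slides.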
Naive Bayes: Properties

  • Estimating P(a_i | v_j) instead of P(a_1, a_2, ..., a_n | v_j) greatly
    reduces the number of parameters (and data sparseness).
  • The learning step in Naive Bayes consists of estimating P(a_i | v_j)
    and P(v_j) based on the frequencies in the training data.
  • There is no explicit search during training (as opposed to decision
    trees).
  • An unseen instance is classified by computing the class that
    maximizes the posterior.
  • When conditional independence is satisfied, Naive Bayes corresponds
    to MAP classification.

Naive Bayes: Example

Apply Naive Bayes to the weather training data. The hypothesis space is
V = {yes, no}. Classify the following new instance:

 outlook   temp.   humidity   windy   play
 sunny     cool    high       true    ?

vNB = arg max_{v_j ∈ {yes,no}} P(v_j) ∏_i P(a_i | v_j)
    = arg max_{v_j ∈ {yes,no}} P(v_j) P(outlook = sunny | v_j) P(temp = cool | v_j)
                               P(humidity = high | v_j) P(windy = true | v_j)

Compute priors:
P(play = yes) = 9/14     P(play = no) = 5/14




Naive Bayes: Example (continued)

Compute conditionals (examples):
P(windy = true | play = yes) = 3/9
P(windy = true | play = no) = 3/5

Then compute the best class:

P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(true|yes)
  = 9/14 · 2/9 · 3/9 · 3/9 · 3/9 = 0.0053

P(no) P(sunny|no) P(cool|no) P(high|no) P(true|no)
  = 5/14 · 3/5 · 1/5 · 4/5 · 3/5 = 0.0206

Now classify the unseen instance:

vNB = arg max_{v_j ∈ {yes,no}} P(v_j) P(sunny|v_j) P(cool|v_j) P(high|v_j) P(true|v_j)
    = no

Naive Bayes: Sparse Data

Conditional probabilities can be estimated directly as relative
frequencies:

    P(a_i | v_j) = n_c / n

where n is the total number of training instances with class v_j, and
n_c is the number of instances with attribute value a_i and class v_j.

Problem: this provides a poor estimate if n_c is very small.
Extreme case: if n_c = 0, then the whole posterior will be zero.
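A short numeric illustration of the extreme case on the weather data: outlook = overcast never occurs with play = no, so the raw estimate zeroes out the posterior for no on any overcast instance, whatever the other attributes say. The conditionals below are read off the probability tables above.

```python
# The zero-count problem: no "no" instance has outlook = overcast, so the
# relative-frequency estimate P(overcast | no) = 0/5 wipes out the whole product.
p_no = 5/14
cond_no = {"outlook=overcast": 0/5, "temp=mild": 2/5,
           "humidity=normal": 1/5, "windy=false": 2/5}

score_no = p_no
for p in cond_no.values():
    score_no *= p
print(score_no)   # -> 0.0, regardless of the other attribute values
```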
Naive Bayes: Sparse Data (continued)

Solution: use the m-estimate of probabilities:

    P(a_i | v_j) = (n_c + m·p) / (n + m)

p: prior estimate of the probability
m: equivalent sample size (a constant)

In the absence of other information, assume a uniform prior: p = 1/k,
where k is the number of values that the attribute a_i can take.
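A minimal sketch of the m-estimate (the helper name, default arguments, and the choice m = 3 are assumptions for illustration), applied to the zero count from the weather data:

```python
def m_estimate(n_c, n, m=1.0, p=None, k=None):
    """m-estimate of P(a_i | v_j): (n_c + m*p) / (n + m).

    n_c: instances with class v_j and attribute value a_i
    n:   instances with class v_j
    m:   equivalent sample size
    p:   prior estimate; if None, use the uniform prior 1/k
    k:   number of values the attribute a_i can take
    """
    if p is None:
        p = 1.0 / k
    return (n_c + m * p) / (n + m)

# P(outlook = overcast | play = no): n_c = 0, n = 5; the raw estimate 0/5
# zeroes the posterior, whereas the m-estimate stays positive.
print(m_estimate(0, 5, m=3, k=3))   # (0 + 3 * 1/3) / (5 + 3) = 0.125
```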
                                                                                                              v j ∈{spam,nospam}           i
                                                                                                   = arg            max            P(v j )P(a1 = get|v j )P(a2 = rich|v j )
                                                                                                              v j ∈{spam,nospam}

                                                          Naive Bayes Classifiers – p.17/22                                                                     Naive Bayes Classifiers – p.18/22




Application: Email Classification (continued)

Using naive Bayes means we assume that words are independent of each
other. This is clearly incorrect, but it does not hurt much for our task.

The classifier uses P(a_i = w_k | v_j), i.e., the probability that the
i-th word in the email is the k-th word in our vocabulary, given that
the email has been classified as v_j.

Simplify by assuming that position is irrelevant: estimate P(w_k | v_j),
i.e., the probability that word w_k occurs in the email, given class v_j.

Create a vocabulary: make a list of all words in the training corpus and
discard words with very high or very low frequency.

Training: estimate the priors:

    P(v_j) = n / N

Estimate the likelihoods using the m-estimate:

    P(w_k | v_j) = (n_k + 1) / (n + |Vocabulary|)

N: total number of words in all emails
n: number of words in emails with class v_j
n_k: number of times word w_k occurs in emails with class v_j
|Vocabulary|: size of the vocabulary

Testing: to classify a new email, assign it the class with the highest
posterior probability. Ignore unknown words.
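Putting the pieces together, here is a compact sketch of such a spam classifier. The training corpus is invented for illustration, and the scores are computed in log space for numerical stability, which is an implementation detail that leaves the arg max unchanged.

```python
import math
from collections import Counter

def train(messages):
    """messages: list of (word_list, label) pairs, label in {'spam', 'nospam'}.
    Uses the estimates from the slides: P(v_j) = n/N and
    P(w_k | v_j) = (n_k + 1) / (n + |Vocabulary|)."""
    vocab = {w for words, _ in messages for w in words}
    word_counts = {}                                    # n_k: word frequencies per class
    for words, label in messages:
        word_counts.setdefault(label, Counter()).update(words)
    n = {v: sum(c.values()) for v, c in word_counts.items()}   # words per class
    N = sum(n.values())                                         # words in all emails
    priors = {v: n[v] / N for v in n}                            # P(v_j)

    def likelihood(w, v):                               # (n_k + 1) / (n + |Vocabulary|)
        return (word_counts[v][w] + 1) / (n[v] + len(vocab))

    def classify(words):
        scores = {v: math.log(priors[v]) +
                     sum(math.log(likelihood(w, v)) for w in words if w in vocab)
                  for v in priors}                      # unknown words are ignored
        return max(scores, key=scores.get)

    return classify

# Tiny invented training corpus, for illustration only
corpus = [
    ("get rich quick get rich".split(), "spam"),
    ("cheap pills special offer".split(), "spam"),
    ("meeting agenda for monday".split(), "nospam"),
    ("lecture notes attached see agenda".split(), "nospam"),
]
classify = train(corpus)
print(classify("get rich".split()))   # -> 'spam' on this toy corpus
```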
Summary

  • The Bayes classifier combines prior knowledge with observed data: it
    assigns a posterior probability to a class based on its prior
    probability and its likelihood given the training data.
  • It computes the maximum a posteriori (MAP) hypothesis or the maximum
    likelihood (ML) hypothesis.
  • The naive Bayes classifier assumes conditional independence between
    attributes and assigns the MAP class to new instances.
  • Likelihoods can be estimated based on frequencies. Problem: sparse
    data. Solution: use the m-estimate (adding a constant).

References

Mitchell, Tom M. 1997. Machine Learning. New York: McGraw-Hill.

Witten, Ian H., and Eibe Frank. 2000. Data Mining: Practical Machine
Learning Tools and Techniques with Java Implementations. San Diego, CA:
Morgan Kaufmann.
