Naive Bayes Presentation

- 1. Naive Bayes
  Md Enamul Haque Chowdhury, ID: CSE013083972D, University of Luxembourg (based on presentations by Ke Chen and Ashraf Uddin)
- 2. Contents
  - Background
  - Bayes Theorem
  - Bayesian Classifier
  - Naive Bayes
  - Uses of Naive Bayes classification
  - Relevant Issues
  - Advantages and Disadvantages
  - Some NBC Applications
  - Conclusions
- 3. Background
  There are three ways to build a classifier:
  - a) Model a classification rule directly. Examples: k-NN, decision trees, perceptron, SVM.
  - b) Model the probability of class membership given the input data. Example: perceptron with the cross-entropy cost.
  - c) Build a probabilistic model of the data within each class. Examples: Naive Bayes, model-based classifiers.

  a) and b) are examples of discriminative classification; c) is an example of generative classification; b) and c) are both examples of probabilistic classification.
- 4. Bayes Theorem
  Given a hypothesis h and data D which bears on the hypothesis:
  - P(h): probability of h independent of D (prior probability)
  - P(D): probability of D independent of h
  - P(D|h): conditional probability of D given h (likelihood)
  - P(h|D): conditional probability of h given D (posterior probability)
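For reference, the four quantities on this slide are tied together by Bayes theorem. The formula itself is not in the transcript; it is written here in the slide's own notation:

```latex
P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}
```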
- 5. Maximum A Posteriori
  Based on Bayes theorem, we can compute the maximum a posteriori (MAP) hypothesis for the data. We are interested in the best hypothesis from some space H (the set of all hypotheses) given the observed training data D:
  $$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$$
  Note that we can drop P(D) because the probability of the data is constant (independent of the hypothesis).
- 6. Maximum Likelihood
  Now assume that all hypotheses are equally probable a priori, i.e. P(h_i) = P(h_j) for all h_i, h_j in H. This is called assuming a uniform prior. It simplifies the MAP computation, because the constant P(h) can be dropped as well:
  $$h_{ML} = \arg\max_{h \in H} P(D \mid h)$$
  This hypothesis is called the maximum likelihood hypothesis.
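The difference between the MAP and maximum likelihood hypotheses can be seen in a small sketch. The two hypotheses and all probabilities below are made up for illustration and do not come from the slides; the function names are likewise hypothetical.

```python
# Hypothetical two-hypothesis example; all numbers are invented for illustration.
prior = {"h1": 0.9, "h2": 0.1}        # P(h)
likelihood = {"h1": 0.2, "h2": 0.7}   # P(D | h) for one fixed data set D


def map_hypothesis(prior, likelihood):
    """Return argmax_h P(D|h) * P(h); P(D) is dropped because it is constant."""
    return max(prior, key=lambda h: likelihood[h] * prior[h])


def ml_hypothesis(likelihood):
    """Return argmax_h P(D|h), which equals the MAP choice under a uniform prior."""
    return max(likelihood, key=likelihood.get)


h_map = map_hypothesis(prior, likelihood)
h_ml = ml_hypothesis(likelihood)
print("MAP:", h_map, "ML:", h_ml)
```

With these numbers the strong prior on h1 outweighs the higher likelihood of h2, so the two rules disagree; with a uniform prior they would pick the same hypothesis.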
- 7. Bayesian Classifier
  The classification problem may be formalized using a posteriori probabilities: P(C|X) is the probability that the sample tuple X = <x1, …, xk> is of class C, e.g. P(class=N | outlook=sunny, windy=true, …).
  Idea: assign to sample X the class label C such that P(C|X) is maximal.
- 8. Estimating a posteriori probabilities
  - Bayes theorem: P(C|X) = P(X|C)·P(C) / P(X)
  - P(X) is constant for all classes
  - P(C) = relative frequency of class C samples
  - The C for which P(C|X) is maximum is therefore the C for which P(X|C)·P(C) is maximum
  - Problem: computing P(X|C) directly is infeasible!
- 9. Naive Bayes
  - Bayes classification:
    $$P(C \mid \mathbf{X}) \propto P(\mathbf{X} \mid C)\,P(C) = P(X_1, \dots, X_n \mid C)\,P(C)$$
    Difficulty: learning the joint probability $P(X_1, \dots, X_n \mid C)$.
  - Naive Bayes classification: assume that all input features are conditionally independent given the class:
    $$\begin{aligned} P(X_1, X_2, \dots, X_n \mid C) &= P(X_1 \mid X_2, \dots, X_n, C)\,P(X_2, \dots, X_n \mid C) \\ &= P(X_1 \mid C)\,P(X_2, \dots, X_n \mid C) \\ &= P(X_1 \mid C)\,P(X_2 \mid C) \cdots P(X_n \mid C) \end{aligned}$$
  - MAP classification rule: for $\mathbf{x} = (x_1, x_2, \dots, x_n)$, assign the label $c^*$ if
    $$[P(x_1 \mid c^*) \cdots P(x_n \mid c^*)]\,P(c^*) > [P(x_1 \mid c) \cdots P(x_n \mid c)]\,P(c), \quad c \neq c^*,\ c = c_1, \dots, c_L$$
- 10. Naive Bayes Algorithm: Discrete-Valued Features
  - Learning phase: given a training set S,
    - for each target value $c_i$ ($c_i = c_1, \dots, c_L$): $\hat{P}(C = c_i) \leftarrow$ estimate $P(C = c_i)$ with examples in S;
    - for every feature value $x_{jk}$ of each feature $X_j$ ($j = 1, \dots, n$; $k = 1, \dots, N_j$): $\hat{P}(X_j = x_{jk} \mid C = c_i) \leftarrow$ estimate $P(X_j = x_{jk} \mid C = c_i)$ with examples in S.
    - Output: conditional probability tables; for feature $X_j$, a table with $N_j \times L$ elements.
  - Test phase: given an unknown instance $\mathbf{X}' = (a'_1, \dots, a'_n)$, look up the tables and assign the label $c^*$ to $\mathbf{X}'$ if
    $$[\hat{P}(a'_1 \mid c^*) \cdots \hat{P}(a'_n \mid c^*)]\,\hat{P}(c^*) > [\hat{P}(a'_1 \mid c) \cdots \hat{P}(a'_n \mid c)]\,\hat{P}(c), \quad c \neq c^*,\ c = c_1, \dots, c_L$$
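The learning and test phases for discrete features can be sketched in a few lines of Python. This is a minimal illustration using plain relative frequencies, not a reference implementation; the function names `train` and `predict` and the four-example toy data set are assumptions made for this sketch.

```python
from collections import Counter, defaultdict


def train(samples, labels):
    """Learning phase: estimate P(C=c) and P(X_j=x | C=c) by relative frequency."""
    class_counts = Counter(labels)
    n = len(labels)
    priors = {c: class_counts[c] / n for c in class_counts}
    # cond_counts[(j, c)][x] = number of class-c examples whose feature j equals x
    cond_counts = defaultdict(Counter)
    for x, c in zip(samples, labels):
        for j, value in enumerate(x):
            cond_counts[(j, c)][value] += 1
    cond = {
        key: {v: cnt / class_counts[key[1]] for v, cnt in counter.items()}
        for key, counter in cond_counts.items()
    }
    return priors, cond


def predict(priors, cond, x):
    """Test phase: return the class c maximizing P(c) * prod_j P(x_j | c)."""
    def score(c):
        s = priors[c]
        for j, value in enumerate(x):
            s *= cond[(j, c)].get(value, 0.0)  # unseen value gives a zero estimate
        return s
    return max(priors, key=score)


# Hypothetical toy data (outlook, wind), invented for illustration only.
samples = [("sunny", "strong"), ("sunny", "weak"), ("rain", "weak"), ("rain", "weak")]
labels = ["no", "no", "yes", "yes"]
priors, cond = train(samples, labels)
print(predict(priors, cond, ("rain", "weak")))
```

The lookup tables are stored as nested dictionaries keyed by (feature index, class), which mirrors the per-feature conditional probability tables described in the learning phase.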
- 11. Example
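- 12. Example: Learning Phase

  | Outlook | Play=Yes | Play=No |
  |---|---|---|
  | Sunny | 2/9 | 3/5 |
  | Overcast | 4/9 | 0/5 |
  | Rain | 3/9 | 2/5 |

  | Temperature | Play=Yes | Play=No |
  |---|---|---|
  | Hot | 2/9 | 2/5 |
  | Mild | 4/9 | 2/5 |
  | Cool | 3/9 | 1/5 |

  | Humidity | Play=Yes | Play=No |
  |---|---|---|
  | High | 3/9 | 4/5 |
  | Normal | 6/9 | 1/5 |

  | Wind | Play=Yes | Play=No |
  |---|---|---|
  | Strong | 3/9 | 3/5 |
  | Weak | 6/9 | 2/5 |

  P(Play=Yes) = 9/14, P(Play=No) = 5/14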
- 13. Example: Test Phase
  - Given a new instance, predict its label: x´ = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
  - Look up the tables obtained in the learning phase:
    - P(Outlook=Sunny|Play=Yes) = 2/9, P(Temperature=Cool|Play=Yes) = 3/9, P(Humidity=High|Play=Yes) = 3/9, P(Wind=Strong|Play=Yes) = 3/9, P(Play=Yes) = 9/14
    - P(Outlook=Sunny|Play=No) = 3/5, P(Temperature=Cool|Play=No) = 1/5, P(Humidity=High|Play=No) = 4/5, P(Wind=Strong|Play=No) = 3/5, P(Play=No) = 5/14
  - Decision making with the MAP rule (scores are proportional to the posteriors, since P(x´) is dropped):
    - P(Yes|x´) ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
    - P(No|x´) ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206
  - Since P(Yes|x´) < P(No|x´), we label x´ as “No”.
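The arithmetic of this test phase can be checked with a short script. The probabilities are copied from the learning-phase tables; the variable names are only illustrative, and the final normalization step is an addition that the slide does not perform.

```python
from fractions import Fraction as F
from math import prod

# Conditional probabilities copied from the learning-phase tables for
# x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong).
likelihoods = {
    "Yes": [F(2, 9), F(3, 9), F(3, 9), F(3, 9)],
    "No": [F(3, 5), F(1, 5), F(4, 5), F(3, 5)],
}
priors = {"Yes": F(9, 14), "No": F(5, 14)}

# Unnormalized posterior score per class: [prod_j P(x_j | c)] * P(c)
scores = {c: prod(likelihoods[c]) * priors[c] for c in priors}
label = max(scores, key=scores.get)

# Dividing by the sum of the scores recovers a normalized posterior.
posterior_no = scores["No"] / sum(scores.values())

for c, s in scores.items():
    print(c, round(float(s), 4))
print("label:", label)
```

Using exact fractions avoids rounding until the final print, so the scores round to the 0.0053 and 0.0206 shown on the slide.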
- 14. Naive Bayes Algorithm: Continuous-valued Features
  - A continuous feature can take infinitely many values, so lookup tables no longer apply.
  - The conditional probability is often modeled with the normal distribution:
    $$\hat{P}(X_j \mid C = c_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{ji}} \exp\left(-\frac{(X_j - \mu_{ji})^2}{2\sigma_{ji}^2}\right)$$
    - $\mu_{ji}$: mean (average) of the feature values $X_j$ of examples for which $C = c_i$
    - $\sigma_{ji}$: standard deviation of the feature values $X_j$ of examples for which $C = c_i$
  - Learning phase: for $\mathbf{X} = (X_1, \dots, X_n)$ and $C = c_1, \dots, c_L$, output $n \times L$ normal distributions and $P(C = c_i)$, $i = 1, \dots, L$.
  - Test phase: given an unknown instance $\mathbf{X}' = (a'_1, \dots, a'_n)$,
    - instead of looking up tables, calculate the conditional probabilities with all the normal distributions obtained in the learning phase;
    - apply the MAP rule to make a decision.
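- 15. Naive Bayes Example: Continuous-valued Features
  - Temperature is naturally continuous-valued.
    - Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
    - No: 27.3, 30.1, 17.4, 29.5, 15.1
  - Estimate the mean and variance for each class:
    $$\mu = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad \sigma^2 = \frac{1}{N-1} \sum_{n=1}^{N} (x_n - \mu)^2$$
    $$\mu_{Yes} = 21.64,\ \sigma_{Yes} = 2.35; \qquad \mu_{No} = 23.88,\ \sigma_{No} = 7.09$$
  - Learning phase: output two Gaussian models for P(temp|C):
    $$\hat{P}(x \mid Yes) = \frac{1}{2.35\sqrt{2\pi}} \exp\left(-\frac{(x - 21.64)^2}{2 \cdot 2.35^2}\right)$$
    $$\hat{P}(x \mid No) = \frac{1}{7.09\sqrt{2\pi}} \exp\left(-\frac{(x - 23.88)^2}{2 \cdot 7.09^2}\right)$$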
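The per-class Gaussian parameters can be recomputed from the listed temperatures. In this sketch the sample standard deviation (dividing by N − 1) is used, since that is what reproduces the slide's values of 2.35 and 7.09; the test reading of 20.0 degrees and the helper name `gaussian_pdf` are assumptions for illustration.

```python
from math import exp, pi, sqrt
from statistics import mean, stdev

temps = {
    "Yes": [25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8],
    "No": [27.3, 30.1, 17.4, 29.5, 15.1],
}

# Learning phase: one (mean, standard deviation) pair per class.
# statistics.stdev divides by N - 1 (sample standard deviation).
params = {c: (mean(v), stdev(v)) for c, v in temps.items()}


def gaussian_pdf(x, mu, sigma):
    """Normal density used as the estimate of P(temp = x | class)."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))


for c, (mu, sigma) in params.items():
    print(c, round(mu, 2), round(sigma, 2))

# Test phase for a hypothetical reading of 20.0 degrees (not from the slides).
x_new = 20.0
densities = {c: gaussian_pdf(x_new, *params[c]) for c in params}
print(densities)
```

In a full classifier each density would be multiplied by the class prior and by the densities or table entries of the other features before applying the MAP rule.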
- 16. Uses of Naive Bayes classification
  - Text classification
  - Spam filtering
  - Hybrid recommender systems: recommender systems apply machine learning and data mining techniques to filter unseen information and can predict whether a user would like a given resource
  - Online applications: simple emotion modeling
- 17. Why text classification?
  - Learning which articles are of interest
  - Classifying web pages by topic
  - Information extraction
  - Internet filters
- 18. Examples of Text Classification
  - Classes = binary: “spam” / “not spam”
  - Classes = topics: “finance” / “sports” / “politics”
  - Classes = opinion: “like” / “hate” / “neutral”
  - Classes = topics: “AI” / “Theory” / “Graphics”
  - Classes = author: “Shakespeare” / “Marlowe” / “Ben Jonson”
- 19. Naive Bayes Approach
  - Build the vocabulary as the list of all distinct words that appear in the documents of the training set.
  - Remove stop words and markings.
  - The words in the vocabulary become the attributes, assuming that classification is independent of the positions of the words.
  - Each document in the training set becomes a record with a frequency for each word in the vocabulary.
  - Train the classifier on the training set by computing the prior probability of each class and the conditional probability of each attribute given the class.
  - Evaluate the results on test data.
- 20. Text Classification Algorithm: Naive Bayes
  - T_ct: number of occurrences of a particular word in a particular class
  - T_ct′: total number of words in a particular class
  - B′: number of distinct words across all classes
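The transcript does not preserve the formula that uses these counts. A common choice, assumed here, is the add-one smoothed estimate P(t|c) = (T_ct + 1) / (total words in c + B′). The sketch below implements the vocabulary-building and training steps under that assumption; the function names and the three-document corpus are hypothetical.

```python
from collections import Counter
from math import log


def train_text_nb(docs):
    """docs: list of (tokens, label). Returns vocabulary, log priors, log likelihoods."""
    vocab = {t for tokens, _ in docs for t in tokens}      # B' = len(vocab)
    class_docs = Counter(label for _, label in docs)
    word_counts = {c: Counter() for c in class_docs}       # T_ct per class
    for tokens, label in docs:
        word_counts[label].update(tokens)
    log_prior = {c: log(n / len(docs)) for c, n in class_docs.items()}
    log_cond = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())                       # total words in class c
        # Assumed add-one estimate: (T_ct + 1) / (total + B')
        log_cond[c] = {t: log((counts[t] + 1) / (total + len(vocab))) for t in vocab}
    return vocab, log_prior, log_cond


def classify(tokens, vocab, log_prior, log_cond):
    """Pick the class with the highest log P(c) + sum_t log P(t|c); skip unknown words."""
    def score(c):
        return log_prior[c] + sum(log_cond[c][t] for t in tokens if t in vocab)
    return max(log_prior, key=score)


# Hypothetical toy corpus, invented for illustration only.
docs = [
    ("win money now".split(), "spam"),
    ("win prize money".split(), "spam"),
    ("meeting agenda today".split(), "ham"),
]
vocab, log_prior, log_cond = train_text_nb(docs)
predicted = classify("win money today".split(), vocab, log_prior, log_cond)
print(predicted)
```

Working with log probabilities turns the product over words into a sum, which avoids numerical underflow on long documents.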
- 21. Relevant Issues
  - Violation of the independence assumption
  - Zero conditional probability problem
- 22. Violation of Independence Assumption
  Naive Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class conditional independence. It is made to simplify the computations involved and, in this sense, is considered “naive.”
- 23. Improvement
  Bayesian belief networks are graphical models which, unlike naive Bayesian classifiers, allow the representation of dependencies among subsets of attributes. Bayesian belief networks can also be used for classification.
- 24. Zero conditional probability Problem
  If a given class and feature value never occur together in the training set, then the frequency-based probability estimate will be zero. This is problematic because it wipes out all the information in the other probabilities when they are multiplied. It is therefore often desirable to incorporate a small-sample correction in all probability estimates so that no probability is ever exactly zero.
- 25. Naive Bayes Laplace Correction
  To eliminate zeros, we use add-one or Laplace smoothing, which simply adds one to each count.
- 26. Example
  Suppose that for the class buys_computer = yes in some training database D containing 1000 tuples, we have 0 tuples with income = low, 990 tuples with income = medium, and 10 tuples with income = high. The probabilities of these events, without the Laplacian correction, are 0, 0.990 (from 990/1000), and 0.010 (from 10/1000), respectively. Using the Laplacian correction for the three quantities, we pretend that we have 1 more tuple for each income value. In this way, we instead obtain the following probabilities (rounded to three decimal places): 1/1003 = 0.001, 991/1003 = 0.988, and 11/1003 = 0.011, respectively. The “corrected” probability estimates are close to their “uncorrected” counterparts, yet the zero probability value is avoided.
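The correction in this example amounts to adding one to each of the three counts, so the denominator becomes 1003. A minimal sketch, with the helper name `laplace_estimates` chosen here for illustration:

```python
def laplace_estimates(counts, k=1):
    """Add k to every count before normalizing (k=1 is add-one smoothing)."""
    total = sum(counts.values()) + k * len(counts)
    return {value: (n + k) / total for value, n in counts.items()}


# Counts for income among the 1000 tuples of the class in the example.
income_counts = {"low": 0, "medium": 990, "high": 10}
n_tuples = sum(income_counts.values())
raw = {v: n / n_tuples for v, n in income_counts.items()}
smoothed = laplace_estimates(income_counts)

for v in income_counts:
    print(v, round(raw[v], 3), round(smoothed[v], 3))
```

The smoothed estimates still sum to one, and the value that never occurred gets a small nonzero probability instead of zeroing out the whole product.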
- 27. Advantages
  - Easy to implement
  - Requires only a small amount of training data to estimate the parameters
  - Good results obtained in most cases
- 28. Disadvantages
  - The class conditional independence assumption can cost accuracy, because in practice dependencies exist among variables.
  - E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.).
  - Dependencies among these cannot be modelled by a naive Bayesian classifier.
- 29. Some NBC Applications
  - Credit scoring
  - Marketing applications
  - Employee selection
  - Image processing
  - Speech recognition
  - Search engines…
- 30. Conclusions
  Naive Bayes is:
  - Really easy to implement and often works well
  - Often a good first thing to try
  - Commonly used as a “punching bag” for smarter algorithms
- 31. References
  - http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/mlbook/ch6.pdf
  - Han, Kamber & Pei, Data Mining: Concepts and Techniques, 3rd Edition. ISBN: 9780123814791
  - http://en.wikipedia.org/wiki/Naive_Bayes_classifier
  - http://www.slideshare.net/ashrafmath/naive-bayes-15644818
  - http://www.slideshare.net/gladysCJ/lesson-71-naive-bayes-classifier
- 32. Questions?
