# Lesson 7.1 Naive Bayes Classifier

The objective of this lesson is to introduce the popular Naive Bayes classification model. The list of references used in the preparation of these slides can be found at the course website page: https://sites.google.com/site/gladyscjaprendizagem/program/bayesian-network-classifiers

Published in: Education, Technology

### Lesson 7.1 Naive Bayes Classifier

1. **Naive Bayes** — Gladys Castillo, University of Aveiro. Machine Learning course slides.
2. **The Supervised Classification Problem.** A classifier is a function f: X → C that assigns a class label c ∈ C = {c1, …, cm} to objects described by a set of attributes X = {X1, X2, …, Xn}. Given a dataset D with N labeled examples <x(i), c(i)> of <X, C>, build a classifier, i.e. a hypothesis hC: X → C, that can correctly predict the class labels of new objects. Learning phase: a supervised learning algorithm takes D as input and outputs hC. Classification phase: the input is the attribute values of a new example x(N+1), and the class attached to it is c(N+1) = hC(x(N+1)) ∈ C. (The slide illustrates this with a training-data table and a two-attribute scatter plot for a "give credit / don't give credit" problem.)
3. **Statistical Classifiers.** Treat the attributes X = {X1, X2, …, Xn} and the class C as random variables. A random variable is defined by its probability density function. Statistical classifiers give the probability P(cj | x) that x belongs to a particular class rather than a simple classification: instead of having the map X → C, we have X → P(C | X). The class c* attached to an example is the class with the highest P(cj | x). (The slide shows the probability density function f(x) of a random variable together with a few observations.)
4. **Bayesian Classifiers.** "Bayesian" because the class c* attached to an example x is determined by Bayes' theorem, the main tool in Bayesian inference. Since P(X, C) = P(X | C)·P(C) and P(X, C) = P(C | X)·P(X), it follows that

   P(C | X) = P(C) · P(X | C) / P(X)

   We can combine the prior distribution and the likelihood of the observed data in order to derive the posterior distribution.
5. **Bayes' Theorem Example.** Given: a doctor knows that meningitis causes a stiff neck 50% of the time; the prior probability of any patient having meningitis is 1/50,000; the prior probability of any patient having a stiff neck is 1/20. If a patient has a stiff neck, what is the probability that he/she has meningitis?

   P(M | S) = P(S | M) · P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002

   posterior ∝ prior × likelihood. Before observing the data, our prior beliefs can be expressed in a prior probability distribution that represents the knowledge we have about the unknown features. After observing the data, our revised beliefs are captured by a posterior distribution over the unknown features.
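The slide's arithmetic can be reproduced directly (a minimal sketch; the variable names are mine):

```python
# Bayes' theorem for the meningitis example:
# P(M | S) = P(S | M) * P(M) / P(S)
p_s_given_m = 0.5        # meningitis causes stiff neck 50% of the time
p_m = 1 / 50_000         # prior probability of meningitis
p_s = 1 / 20             # prior probability of stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(round(p_m_given_s, 6))  # 0.0002
```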
6. **Bayesian Classifier.** How do we determine P(cj | x) for each class cj? By Bayes' theorem,

   P(cj | x) = P(cj) · P(x | cj) / P(x)

   P(x) can be ignored because it is the same for all the classes (a normalization constant). Bayesian classification rule (maximum a posteriori classification): classify x into the class with the highest posterior probability,

   h_Bayes(x) = argmax_{j=1..m} P(cj) · P(x | cj)
7. **Naïve Bayes (NB) Classifier** (Duda and Hart, 1973; Langley, 1992). "Bayes" because the class c* attached to an example x is determined by Bayes' theorem; "naïve" because of its very naïve independence assumption: all the attributes are conditionally independent given the class. When the attribute space is high-dimensional, direct estimation of P(x | cj) is hard unless we introduce some assumptions; under conditional independence, P(x | cj) can be decomposed into a product of n terms, one term for each attribute. NB classification rule:

   h_NB(x) = argmax_{j=1..m} P(cj) · ∏_{i=1..n} P(Xi = xi | cj)
8. **Naïve Bayes (NB) Learning Phase (Statistical Parameter Estimation).** Given a training dataset D of N labeled examples (assuming complete data):
   1. Estimate P(cj) for each class cj: P̂(cj) = Nj / N
   2. Estimate P(Xi = xk | cj) for each value xk of the attribute Xi and for each class cj:
      - Xi discrete: P̂(Xi = xk | cj) = Nijk / Nj
      - Xi continuous, two options: (a) the attribute is discretized and then treated as a discrete attribute, or (b) a Normal distribution is usually assumed, P(Xi = xk | cj) = g(xk; μij, σij), with g(x; μ, σ) = 1/(σ√(2π)) · exp(−(x − μ)² / (2σ²))

   where Nj is the number of examples of the class cj and Nijk is the number of examples of the class cj having the value xk for the attribute Xi. The mean μij and the standard deviation σij are estimated from D.
9. **Continuous Attributes: Normal (Gaussian) Distribution.** To estimate P(Xi = xk | cj) for a value of the attribute Xi and for each class cj, a Normal distribution is usually assumed: Xi | cj ~ N(μij, σ²ij), with the mean μij and the standard deviation σij estimated from D, so that P(Xi = xk | cj) = g(xk; μij, σij). The probability density function is symmetrical around its mean. Example: for a variable X ~ N(74, 36), the probability density at the value 66 is given by f(66) = g(66; 74, 6) = 0.0273. (The slide shows the N(0, 2) density curve.)
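The slide's density formula and its numeric example can be checked with a few lines (a minimal sketch; the function name is mine):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density g(x; mu, sigma) used by NB for continuous attributes."""
    coef = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coef * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Slide example: X ~ N(74, 36), i.e. sigma = 6
print(round(gaussian_pdf(66, 74, 6), 4))  # 0.0273
```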
10. **Naïve Bayes Probability Estimates: Binary Classification Problem** (example from John & Langley, 1995). Two classes: + (positive) and − (negative). Two attributes: X1, discrete, which takes values a and b; X2, continuous.

    Training dataset D:

    | Class | X1 | X2  |
    |-------|----|-----|
    | +     | a  | 1.0 |
    | +     | b  | 1.2 |
    | +     | a  | 3.0 |
    | −     | b  | 4.4 |
    | −     | b  | 4.5 |

    1. Estimate P(cj) for each class cj: P̂(+) = 3/5, P̂(−) = 2/5
    2. Estimate P(Xi = xk | cj) for each value of X1 and each class cj: P̂(X1 = a | +) = 2/3, P̂(X1 = b | +) = 1/3, P̂(X1 = a | −) = 0/2, P̂(X1 = b | −) = 2/2. For the continuous X2, estimate the Gaussian parameters per class: μ2+ = 1.73, σ2+ = 1.10 and μ2− = 4.45, σ2− = 0.07, so that P(X2 = x2 | +) = g(x2; 1.73, 1.10) and P(X2 = x2 | −) = g(x2; 4.45, 0.07).
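These estimates can be reproduced from the five training examples (a minimal sketch; the data layout and names are mine, the numbers are the slide's):

```python
import statistics

# Training data from the slide: (class, X1, X2)
data = [("+", "a", 1.0), ("+", "b", 1.2), ("+", "a", 3.0),
        ("-", "b", 4.4), ("-", "b", 4.5)]

classes = {c for c, _, _ in data}
prior = {c: sum(1 for cl, _, _ in data if cl == c) / len(data) for c in classes}

# Gaussian parameters of X2 per class (sample mean and standard deviation)
params = {}
for c in classes:
    xs = [x2 for cl, _, x2 in data if cl == c]
    params[c] = (statistics.mean(xs), statistics.stdev(xs))

print(prior["+"])                                           # 0.6
print(round(params["+"][0], 2), round(params["+"][1], 2))   # 1.73 1.1
print(round(params["-"][0], 2), round(params["-"][1], 2))   # 4.45 0.07
```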
11. **Probability Estimates: Discrete Attributes** (adapted from © Tan, Steinbach, Kumar, Introduction to Data Mining). Using P̂(cj) = Nj / N and P̂(Xi = xk | cj) = Nijk / Nj; to compute Nijk we need to count the number of examples of the class cj having the value xk for the attribute Xi.

    | Tid | Refund | Marital Status | Taxable Income | Evade |
    |-----|--------|----------------|----------------|-------|
    | 1   | Yes    | Single         | 125K           | No    |
    | 2   | No     | Married        | 100K           | No    |
    | 3   | No     | Single         | 70K            | No    |
    | 4   | Yes    | Married        | 120K           | No    |
    | 5   | No     | Divorced       | 95K            | Yes   |
    | 6   | No     | Married        | 60K            | No    |
    | 7   | Yes    | Divorced       | 220K           | No    |
    | 8   | No     | Single         | 85K            | Yes   |
    | 9   | No     | Married        | 75K            | No    |
    | 10  | No     | Single         | 90K            | Yes   |

    For each class: P(No) = 7/10, P(Yes) = 3/10. For each attribute value and class, for example: P(Status = Married | No) = 4/7, P(Refund = Yes | Yes) = 0.
12. **Probability Estimates: Continuous Attributes** (same dataset, adapted from © Tan, Steinbach, Kumar, Introduction to Data Mining). For each attribute–class pair (Xi, cj), estimate P(Xi = xk | cj) = g(xk; μij, σij). Example, for (Income, Evade = No): if Evade = No, the sample mean is 110 and the sample standard deviation is 54.54, so

    P(Income = 120 | No) = 1/(√(2π) · 54.54) · exp(−(120 − 110)² / (2 · 2975)) = 0.0072
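The same number falls out of the seven Evade = No income values in the table above (a minimal sketch; variable names are mine):

```python
import math
import statistics

# Taxable Income values (in K) for the Evade = No examples
income_no = [125, 100, 70, 120, 60, 220, 75]

mu = statistics.mean(income_no)      # 110
sigma = statistics.stdev(income_no)  # ~54.54 (variance ~2975)

density = (1.0 / (sigma * math.sqrt(2.0 * math.pi))
           * math.exp(-((120 - mu) ** 2) / (2.0 * sigma ** 2)))
print(round(density, 4))  # 0.0072
```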
13. **The Balance-scale Problem** (balance-scale dataset from the UCI repository). The dataset was generated to model psychological experimental results. Each example has 4 numerical attributes: the left weight (Left_W), the left distance (Left_D), the right weight (Right_W), and the right distance (Right_D). Each example is classified into one of 3 classes describing the balance scale: tip to the right (Right), tip to the left (Left), or balanced (Balanced). The 3 target rules are:
    - If LD × LW > RD × RW → tip to the left
    - If LD × LW < RD × RW → tip to the right
    - If LD × LW = RD × RW → it is balanced
14. **The Balance-scale Problem: Dataset** (adapted from © João Gama's slides "Aprendizagem Bayesiana"). Discretization is applied: each attribute is mapped to 5 intervals. Sample of the Balance-Scale dataset:

    | Left_W | Left_D | Right_W | Right_D | Class    |
    |--------|--------|---------|---------|----------|
    | 1      | 5      | 4       | 2       | Right    |
    | 2      | 5      | 3       | 2       | Left     |
    | …      | …      | …       | …       | …        |
    | 3      | 4      | 6       | 2       | Balanced |
15. **The Balance-scale Problem: Learning Phase — Build Contingency Tables.** 565 training examples; class counters: Balanced 45, Left 260, Right 260 (total 565). Contingency tables, with each attribute discretized into intervals I1–I5:

    Attribute Left_W:

    | Class    | I1 | I2 | I3 | I4 | I5 |
    |----------|----|----|----|----|----|
    | Left     | 14 | 42 | 61 | 71 | 72 |
    | Balanced | 10 | 8  | 8  | 10 | 9  |
    | Right    | 86 | 66 | 49 | 34 | 25 |

    Attribute Left_D:

    | Class    | I1 | I2 | I3 | I4 | I5 |
    |----------|----|----|----|----|----|
    | Left     | 16 | 38 | 59 | 70 | 77 |
    | Balanced | 8  | 10 | 9  | 10 | 8  |
    | Right    | 90 | 57 | 49 | 37 | 27 |

    Attribute Right_W:

    | Class    | I1 | I2 | I3 | I4 | I5 |
    |----------|----|----|----|----|----|
    | Left     | 87 | 63 | 49 | 33 | 28 |
    | Balanced | 8  | 10 | 10 | 9  | 8  |
    | Right    | 16 | 37 | 58 | 70 | 79 |

    Attribute Right_D:

    | Class    | I1 | I2 | I3 | I4 | I5 |
    |----------|----|----|----|----|----|
    | Left     | 91 | 65 | 44 | 35 | 25 |
    | Balanced | 8  | 10 | 8  | 10 | 9  |
    | Right    | 17 | 37 | 57 | 67 | 82 |

    Assuming complete data, the computation of all the required estimates requires a simple scan through the data, an operation of time complexity O(N × n), where N is the number of training examples and n is the number of attributes.
16. **The Balance-scale Problem: Classification Phase.** How does NB classify the example (Left_W = 1, Left_D = 5, Right_W = 4, Right_D = 2)? We need to estimate the posterior probability P(cj | x) for each class; the class counters and contingency tables are used to compute them:

    P(cj | x) ∝ P(cj) × P(Left_W = 1 | cj) × P(Left_D = 5 | cj) × P(Right_W = 4 | cj) × P(Right_D = 2 | cj), cj ∈ {Left, Balanced, Right}

    The result is P(Left | x) = 0.277796, P(Balanced | x) = 0.135227, P(Right | x) = 0.586978. The class for this example is the class with the highest posterior probability: Class = Right.
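The scoring step can be sketched from the counters on slide 15 (the counts below are transcribed from the contingency tables; the slide's exact posterior values may have been computed with smoothing, so only the winning class is asserted here):

```python
# Class counters and the relevant contingency-table counts for the example
# x = (Left_W = I1, Left_D = I5, Right_W = I4, Right_D = I2)
N = 565
class_counts = {"Left": 260, "Balanced": 45, "Right": 260}

# counts[class] = (N for Left_W=I1, Left_D=I5, Right_W=I4, Right_D=I2)
counts = {
    "Left":     (14, 77, 33, 65),
    "Balanced": (10,  8,  9, 10),
    "Right":    (86, 27, 70, 37),
}

scores = {}
for c, nj in class_counts.items():
    score = nj / N                 # prior P(cj)
    for nijk in counts[c]:
        score *= nijk / nj         # conditional P(Xi = xk | cj)
    scores[c] = score

total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}
print(max(posteriors, key=posteriors.get))  # Right
```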
17. **Iris Dataset** (from the UCI repository). Introduced by the statistician R. A. Fisher. Three flower types (classes): Setosa, Virginica, Versicolour. Four continuous attributes: sepal width and length, petal width and length. (Virginica photo: Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.)
18. **Iris Dataset** (figure-only slide).
19. **Iris Dataset** (figure-only slide).
20. **Scatter Plot Array of Iris Attributes.** The attributes petal width and petal length provide a moderate separation of the iris species.
21. **Naïve Bayes, Iris Dataset — Continuous Attributes: Normal Probability Density Functions.** Attribute: PetalWidth. The slide plots P(PetalWidth | Setosa), P(PetalWidth | Versicolor), and P(PetalWidth | Virginica), with: class Iris-setosa, mean 0.244, standard deviation 0.107; class Iris-versicolor, mean 1.326, standard deviation 0.198; class Iris-virginica, mean 2.026, standard deviation 0.275.
22. **Naïve Bayes, Iris Dataset — Continuous Attributes: NB Classifier Model.**

    Class Iris-setosa (0.327): sepallength mean 5.006, sd 0.352; sepalwidth mean 3.418, sd 0.381; petallength mean 1.464, sd 0.174; petalwidth mean 0.244, sd 0.107.

    Class Iris-versicolor (0.327): sepallength mean 5.936, sd 0.516; sepalwidth mean 2.770, sd 0.314; petallength mean 4.260, sd 0.470; petalwidth mean 1.326, sd 0.198.

    Class Iris-virginica (0.327): sepallength mean 6.588, sd 0.636; sepalwidth mean 2.974, sd 0.322; petallength mean 5.552, sd 0.552; petalwidth mean 2.026, sd 0.275.
23. **Naïve Bayes, Iris Dataset — Continuous Attributes: Classification Phase.** (The slide shows the posterior probabilities computed per example, with the maximum values highlighted.)
24. **Naïve Bayes, Iris Dataset — Continuous Attributes.** How does NB classify the example (sepalLength = 5, sepalWidth = 3, petalLength = 2, petalWidth = 2)? Estimate P(cj | x) for each class:

    P(cj | x) ∝ P(cj) × P(sepalLength = 5 | cj) × P(sepalWidth = 3 | cj) × P(petalLength = 2 | cj) × P(petalWidth = 2 | cj), cj ∈ {setosa, versicolor, virginica}

    Result: P(setosa | x) = 0, P(versicolor | x) = 0.995, P(virginica | x) = 0.005. The class for this example is the class with the highest posterior probability: Class = versicolor.
25. **Naïve Bayes, Iris Dataset — Continuous Attributes.** Estimating P(cj | x) for the versicolor class, using its model parameters (class Iris-versicolor, prior 0.327: sepallength mean 5.936, sd 0.516; sepalwidth mean 2.770, sd 0.314; petallength mean 4.260, sd 0.470; petalwidth mean 1.326, sd 0.198):

    P(versicolor | x) = P(versicolor) × P(sepalLength = 5 | versicolor) × P(sepalWidth = 3 | versicolor) × P(petalLength = 2 | versicolor) × P(petalWidth = 2 | versicolor)
    = 0.327 × g(5; 5.936, 0.516) × g(3; 2.770, 0.314) × g(2; 4.260, 0.470) × g(2; 1.326, 0.198)
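Carrying out this computation for all three classes with the model parameters from slide 22 reproduces the normalized posteriors on slide 24 (a minimal sketch; data layout and names are mine):

```python
import math

def g(x, mu, sigma):
    """Normal density g(x; mu, sigma)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# NB model from slide 22: prior and (mean, sd) for
# sepallength, sepalwidth, petallength, petalwidth, per class.
model = {
    "setosa":     (0.327, [(5.006, 0.352), (3.418, 0.381), (1.464, 0.174), (0.244, 0.107)]),
    "versicolor": (0.327, [(5.936, 0.516), (2.770, 0.314), (4.260, 0.470), (1.326, 0.198)]),
    "virginica":  (0.327, [(6.588, 0.636), (2.974, 0.322), (5.552, 0.552), (2.026, 0.275)]),
}

x = [5, 3, 2, 2]  # sepalLength, sepalWidth, petalLength, petalWidth

scores = {}
for cls, (prior, params) in model.items():
    score = prior
    for xi, (mu, sd) in zip(x, params):
        score *= g(xi, mu, sd)
    scores[cls] = score

total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}
print({c: round(p, 3) for c, p in posteriors.items()})
# {'setosa': 0.0, 'versicolor': 0.995, 'virginica': 0.005}
```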
26. **Naïve Bayes, Continuous Attributes — Discretization.** For continuous attributes, an alternative to assuming a Normal distribution is discretization. (The slide shows the iris attributes after BinDiscretization with bins = 3.)
27. **Naïve Bayes, Continuous Attributes — Discretization: NB Classifier Model.**

    Class Iris-setosa (0.327): sepallength range1 0.940, range2 0.060, range3 0.000; sepalwidth range1 0.020, range2 0.720, range3 0.260; petallength range1 1.000, range2 0.000, range3 0.000; petalwidth range1 1.000, range2 0.000, range3 0.000.

    Class Iris-versicolor (0.327): sepallength range1 0.220, range2 0.720, range3 0.060; sepalwidth range1 0.460, range2 0.540, range3 0.000; petallength range1 0.000, range2 0.960, range3 0.040; petalwidth range1 0.000, range2 0.980, range3 0.020.

    Class Iris-virginica (0.327): sepallength range1 0.020, range2 0.640, range3 0.340; sepalwidth range1 0.380, range2 0.580, range3 0.040; petallength range1 0.000, range2 0.120, range3 0.880; petalwidth range1 0.000, range2 0.100, range3 0.900.
28. **Naïve Bayes, Continuous Attributes — Discretization.** We can build a conditional probability table (CPT) for each attribute. Class probabilities: Setosa 0.327, Versicolor 0.327, Virginica 0.327.

    Attribute Sepallength:

    | Class      | Range1 | Range2 | Range3 |
    |------------|--------|--------|--------|
    | Setosa     | 0.940  | 0.060  | 0.000  |
    | Versicolor | 0.220  | 0.720  | 0.060  |
    | Virginica  | 0.020  | 0.640  | 0.340  |

    Attribute Sepalwidth:

    | Class      | Range1 | Range2 | Range3 |
    |------------|--------|--------|--------|
    | Setosa     | 0.020  | 0.720  | 0.260  |
    | Versicolor | 0.460  | 0.540  | 0.000  |
    | Virginica  | 0.380  | 0.580  | 0.040  |
29. **Naïve Bayes, Continuous Attributes — Discretization** (CPTs, continued).

    Attribute Petallength:

    | Class      | Range1 | Range2 | Range3 |
    |------------|--------|--------|--------|
    | Setosa     | 1.000  | 0.000  | 0.000  |
    | Versicolor | 0.000  | 0.960  | 0.040  |
    | Virginica  | 0.000  | 0.100  | 0.900  |

    Attribute Petalwidth:

    | Class      | Range1 | Range2 | Range3 |
    |------------|--------|--------|--------|
    | Setosa     | 1.000  | 0.000  | 0.000  |
    | Versicolor | 0.000  | 0.980  | 0.020  |
    | Virginica  | 0.000  | 0.100  | 0.900  |
30. **Naïve Bayes, Iris Dataset — Discretized Attributes: Classification Phase.** (The slide shows the discretized examples and the computed posterior probabilities, with the maximum values highlighted.)
31. **Naïve Bayes Classification Phase.** How do we classify a new example? The example (sepallength = 5, sepalwidth = 3, petallength = 2, petalwidth = 2) is discretized to (sepallength = r1, sepalwidth = r2, petallength = r1, petalwidth = r3). Compute the conditional posterior probabilities for each class:

    P(setosa | x) = P(setosa) × P(sepalLength = r1 | setosa) × P(sepalWidth = r2 | setosa) × P(petalLength = r1 | setosa) × P(petalWidth = r3 | setosa)

    P(versicolor | x) = P(versicolor) × P(sepalLength = r1 | versicolor) × P(sepalWidth = r2 | versicolor) × P(petalLength = r1 | versicolor) × P(petalWidth = r3 | versicolor)

    P(virginica | x) = P(virginica) × P(sepalLength = r1 | virginica) × P(sepalWidth = r2 | virginica) × P(petalLength = r1 | virginica) × P(petalWidth = r3 | virginica)
32. **Naïve Bayes, Continuous Attributes — Discretization.** Using the CPTs from slide 28, for the discretized example (sepallength = r1, sepalwidth = r2, petallength = r1, petalwidth = r3):

    P(setosa | x) = P(setosa) × P(sepalLength = r1 | setosa) × P(sepalWidth = r2 | setosa) × P(petalLength = r1 | setosa) × P(petalWidth = r3 | setosa)
    = 0.327 × 0.940 × 0.720 × 1.000 × 0.000 = 0
33. **Naïve Bayes, Continuous Attributes — Discretization.** For the same discretized example:

    P(versicolor | x) = P(versicolor) × P(sepalLength = r1 | versicolor) × P(sepalWidth = r2 | versicolor) × P(petalLength = r1 | versicolor) × P(petalWidth = r3 | versicolor)
    = 0.327 × 0.220 × 0.540 × 0.000 × 0.020 = 0
34. **Naïve Bayes, Continuous Attributes — Discretization.** For the same discretized example:

    P(virginica | x) = P(virginica) × P(sepalLength = r1 | virginica) × P(sepalWidth = r2 | virginica) × P(petalLength = r1 | virginica) × P(petalWidth = r3 | virginica)
    = 0.327 × 0.020 × 0.580 × 0.000 × 0.900 = 0

    If all the class probabilities are zero, we cannot determine the class for this example.
35. **Naïve Bayes: Laplace Correction.** A single zero counter (such as petallength range1 for versicolor and virginica) collapses the whole product to zero. To avoid zero probabilities due to zero counters, instead of the estimate P̂(Xi = xk | cj) = Nijk / Nj we can use the Laplace correction:

    P̂(Xi = xk | cj) = (Nijk + 1) / (Nj + k)

    where Nijk is the number of examples in D such that Xi = xk and C = cj, Nj is the number of examples in D of class cj, and k is the number of possible values of Xi.
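The correction is one line of code (a minimal sketch; the function name and the sample counts are mine):

```python
# Laplace-corrected conditional probability estimate:
# P(Xi = xk | cj) = (Nijk + 1) / (Nj + k)
def laplace_estimate(n_ijk, n_j, k):
    """n_ijk: examples of class cj with Xi = xk; n_j: examples of class cj;
    k: number of possible values of Xi."""
    return (n_ijk + 1) / (n_j + k)

# A zero counter no longer yields a zero probability:
print(laplace_estimate(0, 50, 3))   # 1/53 ~ 0.0189
print(laplace_estimate(47, 50, 3))  # 48/53 ~ 0.9057
```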
36. **Bayesian SPAM Filtering** (Spambase dataset from the UCI repository). Binary classification problem: is an email SPAM? Class 1 = SPAM, class 0 = not SPAM. 57 real-valued attributes: word frequencies, character frequencies, and average frequencies of capital letters.

    | Spam | Not Spam | Total |
    |------|----------|-------|
    | 1813 | 2788     | 4601  |
37. **Bayesian SPAM Filtering: An Implementation using RapidMiner Studio 6.0.007.** Greedy feature selection using "forward selection" and CFS(1) as the performance measure. (1) Hall, M. A. (1998). Correlation-based Feature Subset Selection for Machine Learning. Thesis submitted in partial fulfillment of the requirements of the degree of Doctor of Philosophy at the University of Waikato.
38. **Bayesian SPAM Filtering: k-fold Cross-Validation Results (k = 10).** The slide compares the results without feature subset selection against the results with feature subset selection; with only 11 attributes selected, the latter performs better.
39. **Bayesian SPAM Filters: Confusion Matrix.** Concept learning problem: is an email SPAM?

    | Predicted \ Actual | Yes (+) | No (−) |
    |--------------------|---------|--------|
    | Yes (+)            | TP      | FP     |
    | No (−)             | FN      | TN     |

    - True Positive (TP): number of e-mails classified as SPAM which truly are SPAM
    - False Positive (FP): number of e-mails classified as SPAM which are valid
    - False Negative (FN): number of e-mails classified as "Not SPAM" which are SPAM
    - True Negative (TN): number of e-mails classified as "Not SPAM" which truly are not

    Metrics:

    - Spam precision = TP / (TP + FP): percentage of e-mails classified as SPAM which truly are
    - Spam recall = TPR = TP / (TP + FN): percentage of e-mails classified as SPAM over the total of SPAMs
    - Not Spam precision = TN / (FN + TN): percentage of e-mails classified as "Not SPAM" which truly are
    - Not Spam recall = TN / (FP + TN): percentage of e-mails classified as "Not SPAM" over the total of valid e-mails

    Results (accuracy 91.74%):

    | Predicted \ Actual | Spam (1) | Not Spam (0) | Precision |
    |--------------------|----------|--------------|-----------|
    | Spam (1)           | 1558     | 125          | 92.57%    |
    | Not Spam (0)       | 255      | 2633         | 91.25%    |
    | Recall             | 85.93%   | 95.52%       |           |

    A good spam filter must achieve high precision and recall, thus reducing the number of false positives.
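The spam-side metrics can be recomputed directly from the counts in the matrix above (a minimal sketch; variable names are mine, and the slide's aggregate figures come from 10-fold cross-validation, so per-class values may round slightly differently):

```python
# Counts read from the slide's confusion matrix (predicted vs actual)
tp = 1558  # classified SPAM, actually SPAM
fp = 125   # classified SPAM, actually valid
fn = 255   # classified "Not SPAM", actually SPAM
tn = 2633  # classified "Not SPAM", actually valid

spam_precision = tp / (tp + fp)
spam_recall = tp / (tp + fn)

print(round(100 * spam_precision, 2))  # 92.57
print(round(100 * spam_recall, 2))     # 85.93
```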
40. **Naïve Bayes Performance.** NB is one of the simplest and most effective classifiers, but it rests on a very strong, unrealistic independence assumption: all the attributes are conditionally independent given the value of the class. In practice the independence assumption is violated — HIGH BIAS — and this can lead to poor classification. However, NB is efficient due to its high variance management: fewer parameters imply LOW VARIANCE. (The slide shows a bias–variance decomposition of the test error of naive Bayes on the Nursery dataset, for training sets growing from 500 to 12,500 examples.)
41. **References**
    1. Castillo, G. (2006). Adaptive Learning Algorithms for Bayesian Network Classifiers. PhD Thesis. Chapter 3.3: Naïve Bayes Classifier.
    2. Domingos, P. and Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3):103–130.
    3. Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. John Wiley & Sons, Inc.
    4. John, G. H. and Langley, P. (1995). Estimating Continuous Distributions in Bayesian Classifiers. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338-345. Morgan Kaufmann, San Mateo.
    5. Mitchell, T. (1997). Machine Learning. Chapter 6: Bayesian Learning. McGraw-Hill.
    6. Tan, P., Steinbach, M., Kumar, V. (2006). Introduction to Data Mining. Chapter 5: Classification — Alternative Techniques, Section 5.1 (pp. 166-180). Addison Wesley Higher Education.