DBM630 Lecture 07

  1. DBM630: Data Mining and Data Warehousing, MS.IT., Rangsit University, Semester 2/2011. Lecture 7: Classification and Prediction (Naïve Bayes, Regression and SVM), by Kritsada Sriphaew (sriphaew.k AT gmail.com)
  2. Topics: Statistical Modeling: Naïve Bayes Classification (the sparseness problem, missing values, numeric attributes); Regression (Linear Regression, Regression Tree); Support Vector Machine
  3. Statistical Modeling: the "opposite" of 1R: use all the attributes. Two assumptions: attributes are equally important and statistically independent (given the class value). This means that knowledge about the value of a particular attribute doesn't tell us anything about the value of another attribute (if the class is known). Although based on assumptions that are almost never correct, this scheme works well in practice!
  4. An Example: Evaluating the Weather Attributes (Revised)
     Weather data (Outlook, Temp., Humidity, Windy, Play):
       sunny, hot, high, false, no
       sunny, hot, high, true, no
       overcast, hot, high, false, yes
       rainy, mild, high, false, yes
       rainy, cool, normal, false, yes
       rainy, cool, normal, true, no
       overcast, cool, normal, true, yes
       sunny, mild, high, false, no
       sunny, cool, normal, false, yes
       rainy, mild, normal, false, yes
       sunny, mild, normal, true, yes
       overcast, mild, high, true, yes
       overcast, hot, normal, false, yes
       rainy, mild, high, true, no
     1R rules (Attribute: Rule (Errors); Total Errors):
       Outlook: sunny → no (2/5), overcast → yes (0/4), rainy → yes (2/5); total 4/14
       Temp.: hot → no* (2/4), mild → yes (2/6), cool → yes (1/4); total 5/14
       Humidity: high → no (3/7), normal → yes (1/7); total 4/14
       Windy: false → yes (2/8), true → no* (3/6); total 5/14
     1R chooses the attribute that produces rules with the smallest number of errors, i.e., rule 1 or 3.
  5. Probabilities for the Weather Data: the probabilistic model (counts and conditional probabilities for each attribute value given each class).
  6. Bayes's Rule: the probability of event H given evidence E is $p(H|E) = \frac{p(E|H)\,p(H)}{p(E)}$. A priori probability of H, p(H): the probability of the event before the evidence has been seen. A posteriori probability of H, p(H|E): the probability of the event after the evidence has been seen.
  7. Naïve Bayes for Classification: Classification learning asks: what is the probability of the class given an instance? Evidence E = the instance; event H = the class value for the instance. Naïve Bayes assumption: the "independent feature model", i.e., the presence (or absence) of a particular attribute (or feature) of a class is unrelated to the presence (or absence) of any other attribute. Therefore $p(H|E) = \frac{p(E|H)\,p(H)}{p(E)}$ becomes $p(H|E_1, E_2, \ldots, E_n) = \frac{p(E_1|H)\,p(E_2|H) \cdots p(E_n|H)\,p(H)}{p(E)}$.
  8. Naïve Bayes for Classification (example): $p(\text{play}=y \mid \text{outlook}=s, \text{temp}=c, \text{humid}=h, \text{windy}=t) = \frac{p(out=s|pl=y)\,p(te=c|pl=y)\,p(hu=h|pl=y)\,p(wi=t|pl=y)\,p(pl=y)}{p(out=s,\, te=c,\, hu=h,\, wi=t)} = \frac{\frac{2}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{9}{14}}{p(out=s,\, te=c,\, hu=h,\, wi=t)}$
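A minimal Python sketch of this calculation (not part of the original slides): it derives the count-based estimates from the 14-instance weather data of slide 4 and normalizes the class scores. All names are illustrative.

```python
from collections import Counter

# The 14-instance weather data from slide 4: (outlook, temp, humidity, windy, play).
data = [
    ("sunny", "hot", "high", False, "no"),      ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),  ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),  ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),  ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),   ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),("rainy", "mild", "high", True, "no"),
]

class_counts = Counter(row[-1] for row in data)                        # play = yes / no
cond_counts = Counter((j, row[j], row[-1]) for row in data for j in range(4))

def score(instance, cls):
    """Unnormalized posterior: p(cls) * prod_j p(x_j | cls), as on the slide."""
    p = class_counts[cls] / len(data)
    for j, value in enumerate(instance):
        p *= cond_counts[(j, value, cls)] / class_counts[cls]
    return p

new = ("sunny", "cool", "high", True)                                  # the slide's new instance
scores = {cls: score(new, cls) for cls in class_counts}
total = sum(scores.values())
for cls, s in scores.items():
    print(cls, round(s / total, 3))    # roughly no 0.795, yes 0.205
```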
  9. The Sparseness Problem (the "zero-frequency problem"): What if an attribute value doesn't occur with every class value (e.g., "Outlook = overcast" for class "no")? The probability will be zero, P(outlook=overcast|play=no) = 0, and the a posteriori probability will also be zero, no matter how likely the other values are: P(play=no|outlook=overcast, temp=cool, humidity=high, windy=true) = 0. Remedy: add 1 to the count for every attribute value-class combination (the Laplace estimator). Result: probabilities will never be zero (this also stabilizes the probability estimates).
  10. Modified Probability Estimates: In some cases adding a constant different from 1 might be more appropriate. Example: attribute outlook for class yes. We can apply an equal weight, or the weights don't need to be equal (as long as they sum to 1, i.e., p1 + p2 + p3 = 1).
      Equal weight: sunny → (2 + m/3)/(9 + m), overcast → (4 + m/3)/(9 + m), rainy → (3 + m/3)/(9 + m)
      Normalized weights (p1 + p2 + p3 = 1): sunny → (2 + m·p1)/(9 + m), overcast → (4 + m·p2)/(9 + m), rainy → (3 + m·p3)/(9 + m)
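A small illustrative helper for these estimates (the function name and defaults are assumptions; only the counts 2, 4 and 3 out of 9 for outlook given play = yes come from the slides):

```python
def m_estimate(count, class_total, m=3.0, prior=None, n_values=3):
    """m-estimate of p(value | class): (count + m * prior) / (class_total + m).
    With prior = 1/n_values and m = n_values this is the Laplace estimator."""
    if prior is None:
        prior = 1.0 / n_values            # equal weight per attribute value
    return (count + m * prior) / (class_total + m)

# Outlook given play = yes (counts 2, 4, 3 out of 9), with equal weights and m = 3:
for value, count in [("sunny", 2), ("overcast", 4), ("rainy", 3)]:
    print(value, round(m_estimate(count, 9, m=3.0), 3))   # 0.25, 0.417, 0.333
```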
  11. Missing Value Problem: In training, the instance is simply not included in the frequency count for that attribute value-class combination; in classification, the missing attribute is omitted from the calculation.
  12. Dealing with Numeric Attributes: Common assumption: attributes have a normal or Gaussian probability distribution (given the class). The probability density function for the normal distribution is defined by the sample mean $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$, the standard deviation $\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \mu)^2}$, and the density function $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$.
  13. An Example: Evaluating the Weather Attributes (Numeric)
      Outlook, Temp., Humidity, Windy, Play:
        sunny, 85, 85, false, no
        sunny, 80, 90, true, no
        overcast, 83, 86, false, yes
        rainy, 70, 96, false, yes
        rainy, 68, 80, false, yes
        rainy, 65, 70, true, no
        overcast, 64, 65, true, yes
        sunny, 72, 95, false, no
        sunny, 69, 70, false, yes
        rainy, 75, 80, false, yes
        sunny, 75, 70, true, yes
        overcast, 72, 90, true, yes
        overcast, 81, 75, false, yes
        rainy, 71, 91, true, no
  14. Statistics for the Weather Data: example density values, using the per-class means and standard deviations computed from the table above: $f(\text{temperature}=66 \mid \text{yes}) = \frac{1}{\sqrt{2\pi}\cdot 6.2}\, e^{-\frac{(66-73)^2}{2 \cdot 6.2^2}} = 0.0340$ and $f(\text{humidity}=90 \mid \text{no}) = \frac{1}{\sqrt{2\pi}\cdot 9.7}\, e^{-\frac{(90-86.2)^2}{2 \cdot 9.7^2}} = 0.0380$.
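These density values are easy to check numerically; a minimal sketch (the helper function is illustrative, not from the slides):

```python
import math

def gaussian_density(x, mean, std):
    """Normal density f(x) used by naive Bayes for numeric attributes."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# Temperature given yes: mean 73, std 6.2; humidity given no: mean 86.2, std 9.7.
print(round(gaussian_density(66, 73.0, 6.2), 4))    # ~0.034
print(round(gaussian_density(90, 86.2, 9.7), 4))    # ~0.038
```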
  15. Classify a New Case: classify a new case; if any values are missing (in training or in the case being classified), they are simply omitted. The case we would like to predict is then assigned the class with the highest resulting probability.
  16. Probability Densities: Relationship between probability and density: $\Pr[c - \tfrac{\varepsilon}{2} < x < c + \tfrac{\varepsilon}{2}] \approx \varepsilon \cdot f(c)$. But this doesn't change the calculation of a posteriori probabilities because $\varepsilon$ cancels out. Exact relationship: $\Pr[a \le x \le b] = \int_a^b f(t)\,dt$.
  17. Discussion of Naïve Bayes: Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated). Why? Because classification doesn't require accurate probability estimates as long as the maximum probability is assigned to the correct class. However, adding too many redundant attributes will cause problems (e.g., identical attributes). Note also that many numeric attributes are not normally distributed (→ kernel density estimators).
  18. General Bayesian Classification: Probabilistic learning: calculate explicit probabilities for hypotheses; this is among the most practical approaches to certain types of learning problems. Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct, and prior knowledge can be combined with observed data. Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities.
  19. Bayesian Theorem: Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem: $P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$. The MAP (maximum a posteriori) hypothesis is $h_{MAP} = \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} P(D|h)\,P(h)$. Difficulty: this needs initial knowledge of many probabilities and has a significant computational cost. If we assume $P(h_i) = P(h_j)$ for all i, j, the method simplifies further and we choose the maximum likelihood (ML) hypothesis $h_{ML} = \arg\max_{h_i \in H} P(D|h_i)$.
  20. Naïve Bayes Classifiers: Assumption: attributes are conditionally independent: $c_{MAP} = \arg\max_{c_i \in C} P(c_i \mid \{v_1, v_2, \ldots, v_J\}) = \arg\max_{c_i \in C} P(c_i) \prod_{j=1}^{J} P(v_j \mid c_i)$. This greatly reduces the computation cost: only the class distribution needs to be counted. However, the assumption is seldom satisfied in practice, as attributes (variables) are often correlated. Attempts to overcome this limitation: Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes; decision trees, which reason on one attribute at a time, considering the most important attributes first; association rules, which reason about a class using several attributes.
  21. Bayesian Belief Network (An Example): A network over the variables Storm, BusTourGroup, Lightning, Campfire, Thunder, ForestFire. The conditional probability table (CPT) for the variable Campfire, given (Storm, BusTourGroup):
        (S, B): C 0.4, ~C 0.6
        (S, ~B): C 0.1, ~C 0.9
        (~S, B): C 0.8, ~C 0.2
        (~S, ~B): C 0.2, ~C 0.8
      The network represents a set of conditional independence assertions and is a directed acyclic graph (also called a Bayes net). Attributes (variables) are often correlated; each variable is conditionally independent given its predecessors.
  22. Bayesian Belief Network (Dependence and Independence): The network represents the joint probability distribution over all variables, e.g., P(Storm, BusTourGroup, ..., ForestFire). In general, $P(y_1, y_2, \ldots, y_n) = \prod_{i=1}^{n} P(y_i \mid Parents(Y_i))$, where Parents(Y_i) denotes the immediate predecessors of Y_i in the graph. So the joint distribution is fully defined by the graph plus the tables $p(y_i \mid Parents(Y_i))$.
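A minimal sketch of this factorization in Python. Only the Campfire CPT comes from the slide; treating the 0.7 and 0.85 shown in the figure as the priors of Storm and BusTourGroup is my assumption, and the function is illustrative:

```python
# Factorization P(y1, ..., yn) = prod_i P(yi | Parents(Yi)) for part of the slide's network.
p_storm = 0.7          # assumed prior P(Storm = True)
p_bus = 0.85           # assumed prior P(BusTourGroup = True)

# CPT from the slide: P(Campfire = True | Storm, BusTourGroup)
p_campfire = {(True, True): 0.4, (True, False): 0.1,
              (False, True): 0.8, (False, False): 0.2}

def joint(storm, bus, campfire):
    """P(Storm = storm, BusTourGroup = bus, Campfire = campfire) for this sub-network."""
    p = (p_storm if storm else 1 - p_storm) * (p_bus if bus else 1 - p_bus)
    pc = p_campfire[(storm, bus)]
    return p * (pc if campfire else 1 - pc)

print(joint(True, True, True))      # 0.7 * 0.85 * 0.4 = 0.238
```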
  23. Bayesian Belief Network (Inference in Bayes Nets): Infer the values of one or more network variables, given observed values of others. The Bayes net contains all the information needed for this inference. If only one variable has an unknown value, it is easy to infer it; in the general case the problem is NP-hard. There are three types of inference: top-down inference, e.g., p(Campfire|Storm); bottom-up inference, e.g., p(Storm|Campfire); and hybrid inference, e.g., p(BusTourGroup|Storm, Campfire).
  24. Bayesian Belief Network (Training Bayesian Belief Networks): There are several variants of this learning task: the network structure might be known or unknown, and the training examples might provide values of all network variables or just some. If the structure is known and all variables are observed, training is as easy as training a Naïve Bayes classifier. If the structure is known but only some variables are observed (e.g., we observe ForestFire, Storm, BusTourGroup and Thunder but not Lightning and Campfire), use gradient ascent to converge to the network h that maximizes P(D|h).
  25. Numerical Modeling: Regression: A numerical model is used for prediction. Counterparts exist for all the schemes we previously discussed (decision trees, statistical models, etc.). All classification schemes can be applied to regression problems using discretization; the prediction is then the weighted average of the intervals' midpoints (weighted according to the class probabilities). Regression is more difficult than classification (i.e., evaluated by mean squared error rather than percent correct).
  26. Linear Regression: Works most naturally with numeric attributes; it is the standard technique for numeric prediction. The outcome is a linear combination of the attributes: $Y = \sum_{j=0}^{k} w_j x_j = w_0 x_0 + w_1 x_1 + w_2 x_2 + \ldots + w_k x_k$. The weights are calculated from the training data. Predicted value for the first training instance $X^{(1)}$: $Y^{(1)} = \sum_{j=0}^{k} w_j x_j^{(1)} = w_0 x_0^{(1)} + w_1 x_1^{(1)} + \ldots + w_k x_k^{(1)}$.
  27. Minimize the Squared Error (I): The k+1 coefficients are chosen so that the squared error on the training data is minimized. Squared error: $\sum_{i=1}^{n} \left( y^{(i)} - \sum_{j=0}^{k} w_j x_j^{(i)} \right)^2$. The coefficients can be derived using standard matrix operations; this can be done if there are more instances than attributes (roughly speaking). If there are fewer instances, there are many solutions. Minimization of the absolute error is more difficult!
  28. Minimize the Squared Error (II): In matrix form, minimize $\|Y - Xw\|^2 = \min_w \sum_{i=1}^{n} \left( y^{(i)} - \sum_{j=0}^{k} w_j x_j^{(i)} \right)^2$, where Y is the n×1 vector of target values, X is the n×(k+1) matrix whose i-th row is $(x_0^{(i)}, x_1^{(i)}, \ldots, x_k^{(i)})$, and w is the (k+1)×1 weight vector.
  29. Example: Find the linear regression of the salary data. X = {x1} = years of experience, Y = salary (in $1000s). Training data (x1, Y): (3, 30), (8, 57), (9, 64), (13, 72), (3, 36), (6, 43), (11, 59), (21, 90), (1, 20), (16, 83); s = number of training instances = 10, $\bar{x} = 9.1$, $\bar{y} = 55.4$. For simplicity x0 = 1, so the model is $Y = w_0 + w_1 x_1$. With the method of least squared error, $w_1 = \frac{\sum_{i=1}^{s} (x_1^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sum_{i=1}^{s} (x_1^{(i)} - \bar{x})^2} = 3.5$ and $w_0 = \bar{y} - w_1 \bar{x} = 23.55$. The predicted line is estimated by Y = 23.55 + 3.5 x1; the prediction for X = 10 is Y = 23.55 + 3.5(10) = 58.55.
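A short numpy sketch (not from the slides) that reproduces this least-squares fit on the salary data:

```python
import numpy as np

# Salary data from the slide: x = years of experience, y = salary in $1000s.
x = np.array([3, 8, 9, 13, 3, 6, 11, 21, 1, 16], dtype=float)
y = np.array([30, 57, 64, 72, 36, 43, 59, 90, 20, 83], dtype=float)

# Least-squares formulas from the slide.
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()
print(round(w1, 2), round(w0, 2))   # ~3.54 and ~23.21 (the slide rounds w1 to 3.5, giving w0 = 23.55)

# Prediction for 10 years of experience (slide: 23.55 + 3.5 * 10 = 58.55).
print(round(w0 + w1 * 10, 2))
```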
  30. Classification using Linear Regression (One against the others): Any regression technique can be used for classification. Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class and 0 for those that do not. Prediction: predict the class corresponding to the model with the largest output value (membership value). For linear regression this is known as multi-response linear regression. For example, if the data has three classes {A, B, C}: Model 1 predicts 1 for class A and 0 for not A; Model 2 predicts 1 for class B and 0 for not B; Model 3 predicts 1 for class C and 0 for not C.
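A minimal sketch of multi-response linear regression (one least-squares model per class on 0/1 targets, prediction by the largest output); the tiny three-class data set below is made up:

```python
import numpy as np

def fit_multi_response(X, labels, classes):
    """One least-squares linear model per class, trained on 0/1 membership targets."""
    Xb = np.hstack([np.ones((len(X), 1)), X])            # add the x0 = 1 column
    W = []
    for c in classes:
        t = (labels == c).astype(float)                  # 1 for the class, 0 otherwise
        w, *_ = np.linalg.lstsq(Xb, t, rcond=None)
        W.append(w)
    return np.array(W)

def predict(W, X, classes):
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return [classes[i] for i in np.argmax(Xb @ W.T, axis=1)]   # largest membership value wins

# Tiny illustrative data with three classes A, B, C.
X = np.array([[0.1, 0.2], [0.2, 0.1], [1.0, 1.1], [1.1, 0.9], [2.0, 2.2], [2.1, 1.9]])
labels = np.array(["A", "A", "B", "B", "C", "C"])
W = fit_multi_response(X, labels, ["A", "B", "C"])
print(predict(W, np.array([[0.0, 0.0], [2.0, 2.0]]), ["A", "B", "C"]))   # expect ['A', 'C']
```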
  31. Classification using Linear Regression (Pairwise Regression): Another way of using regression for classification: build a regression function for every pair of classes, using only the instances from those two classes; an output of +1 is assigned to one member of the pair and an output of -1 to the other. Prediction is done by voting: the class that receives the most votes is predicted (alternative: answer "don't know" if there is no agreement). This is more likely to be accurate but more expensive. For example, if the data has three classes {A, B, C}: Model 1 predicts +1 for class A and -1 for class B; Model 2 predicts +1 for class A and -1 for class C; Model 3 predicts +1 for class B and -1 for class C.
  32. Regression Tree and Model Tree: A regression tree is a decision tree with averaged numeric values at the leaves. A model tree is a tree whose leaves contain linear regressions. The slide illustrates both on the CPU performance data (cycle time MYCT, main memory MMIN/MMAX, cache CACH in Kb, channels CHMIN/CHMAX, performance PRP), splitting first on CHMIN <= 7.5 and then on attributes such as CACH and MMAX. A single linear regression over the whole data gives PRP = -55.9 + 0.0489 MYCT + 0.153 MMIN + 0.0056 MMAX + 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX; the regression tree holds averaged PRP values at its leaves, while the model tree's leaves are the linear models:
        LM1: PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN
        LM2: PRP = 20.3 + 0.004 MMIN - 3.99 CHMIN + 0.946 CHMAX
        LM3: PRP = 38.1 + 0.12 MMIN
        LM4: PRP = 19.5 + 0.02 MMAX + 0.698 CACH + 0.969 CHMAX
        LM5: PRP = 285 + 1.46 MYCT + 1.02 CACH - 9.39 CHMIN
        LM6: PRP = -65.8 + 0.03 MMIN - 2.94 CHMIN + 4.98 CHMAX
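For completeness, a hedged sketch of fitting a regression tree with an off-the-shelf library (scikit-learn's DecisionTreeRegressor); the data here is synthetic, not the CPU performance data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Regression tree on synthetic data: each leaf predicts the average target of its instances.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.5 * X[:, 0] + 23.55 + rng.normal(0, 2.0, size=200)   # noisy linear target

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(tree.predict([[2.0], [8.0]]))   # piecewise-constant predictions (leaf averages)
```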
  33. Support Vector Machine (SVM): SVM is related to statistical learning theory. SVM was first introduced in 1992 [1] by Vladimir Vapnik, a researcher from the Soviet Union. SVM became popular because of its success in handwritten digit recognition: a 1.1% test error rate, the same as the error rate of a carefully constructed neural network, LeNet 4. SVM is now regarded as an important example of "kernel methods", one of the key areas in machine learning, and is popularly used for classification tasks.
  34. What is a Good Decision Boundary? Consider a two-class, linearly separable classification problem. There are many possible decision boundaries! The perceptron algorithm can be used to find such a boundary, and different algorithms have been proposed. Are all decision boundaries equally good?
  35. Examples of Bad Decision Boundaries (figure: for the same Class 1 / Class 2 data, boundaries that pass very close to one of the classes, contrasted with the best boundary).
  36. Large-margin Decision Boundary: The decision boundary should be as far away from the data of both classes as possible, i.e., we should maximize the margin m. The distance between the origin and the line $w^T x = k$ is $k/\|w\|$, and the margin between the hyperplanes $w^T x + b = 1$ and $w^T x + b = -1$ (with the decision boundary $w^T x + b = 0$ between them) is $m = \frac{2}{\|w\|}$.
  37. Example: Writing the support-vector equations $[w_1\ w_2]\,x + b = +1$ for the support vectors of Class 2 and $[w_1\ w_2]\,x + b = -1$ for the support vectors of Class 1 in the figure and solving gives $w = (2/3,\ 2/3)$ and $b = -5$. The distance between the two hyperplanes is then $m = \frac{2}{\|w\|} = \frac{3\sqrt{2}}{2} \approx 2.12$.
  38. Example (continued): The best boundary maximizes $m = \frac{2}{\|w\|}$, i.e., solve: maximize m, or equivalently minimize $\|w\|$. As we also want to prevent data points from falling into the margin, we add the following constraint for each point $x_i$: $w^T x_i + b \ge +1$ if $x_i$ is in the first class and $w^T x_i + b \le -1$ if $x_i$ is in the second class. For n points this can be rewritten as $y_i (w^T x_i + b) \ge 1$ for $1 \le i \le n$.
  39. Previously it was difficult to solve the primal form because it depends on $\|w\|$, the norm of w, which involves a square root. We alter the objective by substituting $\|w\|$ with $\frac{1}{2}\|w\|^2$ (the factor 1/2 being used for mathematical convenience). This is called a quadratic programming (QP) optimization problem: minimize in (w, b) $\frac{1}{2}\|w\|^2$ subject to $y_i (w^T x_i + b) \ge 1$ for any i = 1, ..., n. How to solve this optimization, and more information on SVM (e.g., the dual form and kernels), can be found in ref [1]. [1] Support Vector Machines and other kernel-based learning methods, John Shawe-Taylor and Nello Cristianini, Cambridge University Press, 2000. http://www.support-vector.net
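A brief sketch of solving this QP with an off-the-shelf implementation (scikit-learn's SVC with a linear kernel); the toy points below are my own and are only chosen to be linearly separable:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable points (assumed, not taken from the slide's figure):
X = np.array([[2, 4], [4, 2], [1, 2], [2, 1],      # class -1 (lower left)
              [3, 6], [6, 3], [6, 6], [5, 5]])     # class +1 (upper right)
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

# A large C approximates the hard-margin QP: minimize 0.5*||w||^2 s.t. y_i (w.x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print(np.round(w, 3), round(b, 3))                 # roughly [0.667 0.667] and -5.0
print(round(2 / np.linalg.norm(w), 3))             # margin 2/||w||, about 2.121
print(clf.support_vectors_)                        # the points that define the boundary
```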
  40. Extension to a Non-linear Decision Boundary: So far we have only considered a large-margin classifier with a linear decision boundary. How do we generalize it to become nonlinear? Key idea: transform x_i to a higher-dimensional space to "make life easier". Input space: the space where the points x_i are located. Feature space: the space of f(x_i) after the transformation. Why transform? A linear operation in the feature space is equivalent to a non-linear operation in the input space, and classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature x1·x2 makes the problem linearly separable.
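The XOR remark can be checked directly; a small sketch (the data and the added x1·x2 feature are the only ingredients from the slide, everything else is illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# XOR with +/-1 inputs: not linearly separable in the original (x1, x2) input space.
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([-1, -1, 1, 1])

lin = SVC(kernel="linear", C=10.0).fit(X, y)
print(lin.score(X, y))                 # at most 0.75: no line separates XOR

# Feature space (x1, x2, x1*x2): a hyperplane now separates the two classes.
X_feat = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])
lin2 = SVC(kernel="linear", C=10.0).fit(X_feat, y)
print(lin2.score(X_feat, y))           # 1.0
```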
  41. Transforming the Data (figure: training points mapped by f(·) from the input space to the feature space): Note that in practice the feature space is of higher dimension than the input space. Computation in the feature space can be costly because it is high dimensional; the feature space is typically infinite-dimensional! The kernel trick can help (more information in ref [1]). [1] Support Vector Machines and other kernel-based learning methods, John Shawe-Taylor and Nello Cristianini, Cambridge University Press, 2000. http://www.support-vector.net
  42. Why Does SVM Work? The feature space is often very high dimensional. Why don't we suffer from the curse of dimensionality? A classifier in a high-dimensional space has many parameters and is hard to estimate. Vapnik argues that the fundamental problem is not the number of parameters to be estimated; rather, the problem is the flexibility of the classifier. Typically a classifier with many parameters is very flexible, but there are also exceptions: let $x_i = 10^{-i}$ where i ranges from 1 to n; a single-parameter classifier such as sign(sin(θx)) can classify all x_i correctly for every possible combination of class labels on the x_i, so this 1-parameter classifier is very flexible.
  43. Why Does SVM Work? (continued): Vapnik argues that the flexibility of a classifier should not be characterized by its number of parameters but by its flexibility (capacity); this is formalized by the "VC-dimension" of the classifier. Consider a linear classifier in a two-dimensional space: if we have three training data points, no matter how those points are labeled, we can classify them perfectly.
  44. VC-dimension: However, if we have four points, we can find a labeling such that a linear classifier fails to be perfect. We can see that 3 is the critical number: the VC-dimension of a linear classifier in a 2D space is 3 because, with 3 points in the training set, perfect classification is always possible irrespective of the labeling, whereas for 4 points perfect classification can be impossible.
  45. Other Aspects of SVM: How to use SVM for multi-class classification? The original SVM is for binary classification. One can change the QP formulation to become multi-class, but more often multiple binary classifiers are combined: one can train multiple one-versus-the-rest classifiers, or combine multiple pairwise classifiers "intelligently". How to interpret the SVM discriminant function value as a probability? By performing logistic regression on the SVM output of a set of data (a validation set) that is not used for training. Some SVM software (like libsvm) has these features built in. A list of SVM implementations can be found at http://www.kernel-machines.org/software.html; some implementations (such as LIBSVM) can handle multi-class classification, SVMlight is among the earliest implementations of SVM, and several Matlab toolboxes for SVM are also available.
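A short hedged sketch of both points using scikit-learn (the data is made up; OneVsRestClassifier combines one-versus-the-rest binary SVMs, and probability=True fits a logistic model to the SVM outputs, in the spirit described above):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

# Tiny made-up three-class problem.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0, 2], [1, 2],        # class A
              [5, 5], [5, 6], [6, 5], [6, 6], [5, 7], [6, 7],        # class B
              [10, 0], [10, 1], [11, 0], [11, 1], [10, 2], [11, 2]], # class C
             dtype=float)
y = np.array(["A"] * 6 + ["B"] * 6 + ["C"] * 6)

# Multi-class by combining one-versus-the-rest binary SVMs.
ovr = OneVsRestClassifier(SVC(kernel="linear", C=10.0)).fit(X, y)
print(ovr.predict([[4.5, 5.0], [9.0, 0.5]]))       # expect ['B', 'C']

# Probability estimates: a logistic (sigmoid) model is fitted to the SVM outputs internally.
prob_svm = SVC(kernel="linear", C=10.0, probability=True).fit(X, y)
print(np.round(prob_svm.predict_proba([[4.5, 5.0]]), 2))
```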
  46. Strengths and Weaknesses of SVM: Strengths: training is relatively easy (there is no local optimum, unlike in neural networks); it scales relatively well to high-dimensional data; the tradeoff between classifier complexity and error can be controlled explicitly; and non-traditional data such as strings and trees can be used as input to SVM instead of feature vectors. Weakness: the need to choose a "good" kernel function.
  47. Example: Predicting a class label using naïve Bayesian classification. Training data (RID: age, income, student, credit_rating, class buys_computer):
        1: <=30, high, no, fair, no
        2: <=30, high, no, excellent, no
        3: 31…40, high, no, fair, yes
        4: >40, medium, no, fair, yes
        5: >40, low, yes, fair, yes
        6: >40, low, yes, excellent, no
        7: 31…40, low, yes, excellent, yes
        8: <=30, medium, no, fair, no
        9: <=30, low, yes, fair, yes
        10: >40, medium, yes, fair, yes
        11: <=30, medium, yes, excellent, yes
        12: 31…40, medium, no, excellent, yes
        13: 31…40, high, yes, fair, yes
        14: >40, medium, no, excellent, no
      Unknown sample: 15: <=30, medium, yes, fair, buys_computer = ?
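A small sketch (not part of the slides) that carries out this naïve Bayes calculation for the unknown sample, RID 15, using the counts from the table above:

```python
from collections import Counter

# (age, income, student, credit_rating, buys_computer) for RIDs 1-14 above.
data = [
    ("<=30", "high", "no", "fair", "no"),         ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),         (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),        (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]

class_counts = Counter(row[-1] for row in data)
cond = Counter((j, row[j], row[-1]) for row in data for j in range(4))

unknown = ("<=30", "medium", "yes", "fair")          # RID 15
for cls in class_counts:
    p = class_counts[cls] / len(data)                # P(class)
    for j, v in enumerate(unknown):
        p *= cond[(j, v, cls)] / class_counts[cls]   # P(attribute value | class)
    print(cls, round(p, 4))      # yes ~0.0282, no ~0.0069 -> predict buys_computer = yes
```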
  48. Exercise: Use the naïve Bayesian classifier to predict the unknown data samples.
      Training data (Outlook, Temperature, Humidity, Windy, Play):
        Sunny, Hot, High, False, N
        Sunny, Hot, High, True, N
        Overcast, Hot, High, False, Y
        Rainy, Mild, High, False, Y
        Rainy, Cool, Normal, False, Y
        Rainy, Cool, Normal, True, N
        Overcast, Cool, Normal, True, Y
        Sunny, Mild, High, False, N
        Sunny, Cool, Normal, False, Y
        Rainy, Mild, Normal, False, Y
        Sunny, Mild, Normal, True, Y
        Overcast, Hot, Normal, False, Y
        Overcast, Mild, High, True, Y
        Rainy, Mild, High, True, N
      Unknown data samples:
        Sunny, Cool, Normal, False, ?
        Rainy, Mild, High, False, ?
