1.
DBM630: Data Mining and Data Warehousing, MS.IT., Rangsit University, Semester 2/2011. Lecture 7: Classification and Prediction (Naïve Bayes, Regression and SVM) by Kritsada Sriphaew (sriphaew.k AT gmail.com)
2.
Topics
Statistical Modeling: Naïve Bayes Classification (sparseness problem, missing values, numeric attributes)
Regression: Linear Regression, Regression Tree
Support Vector Machine
Data Warehousing and Data Mining by Kritsada Sriphaew
3.
Statistical Modeling
"Opposite" of 1R: use all the attributes. Two assumptions: attributes are (1) equally important and (2) statistically independent (given the class value). This means that knowledge about the value of a particular attribute doesn't tell us anything about the value of another attribute (if the class is known). Although based on assumptions that are almost never correct, this scheme works well in practice!
Classification – Naïve Bayes
4.
An Example: Evaluating the Weather Attributes (Revised)

Outlook   Temp.  Humidity  Windy  Play
sunny     hot    high      false  no
sunny     hot    high      true   no
overcast  hot    high      false  yes
rainy     mild   high      false  yes
rainy     cool   normal    false  yes
rainy     cool   normal    true   no
overcast  cool   normal    true   yes
sunny     mild   high      false  no
sunny     cool   normal    false  yes
rainy     mild   normal    false  yes
sunny     mild   normal    true   yes
overcast  mild   high      true   yes
overcast  hot    normal    false  yes
rainy     mild   high      true   no

Attribute  Rule             Error  Total Error
Outlook    sunny -> no      2/5    4/14
           overcast -> yes  0/4
           rainy -> yes     2/5
Temp.      hot -> no*       2/4    5/14
           mild -> yes      2/6
           cool -> yes      1/4
Humidity   high -> no       3/7    4/14
           normal -> yes    1/7
Windy      false -> yes     2/8    5/14
           true -> no*      3/6

1R chooses the attribute that produces rules with the smallest number of errors, i.e., rule 1 or 3.
5.
Probabilities for the Weather Data (probabilistic model)
6.
Bayes's Rule
Probability of event H given evidence E:
P(H|E) = P(E|H) P(H) / P(E)
A priori probability of H: P(H) is the probability of the event before evidence has been seen.
A posteriori probability of H: P(H|E) is the probability of the event after evidence has been seen.
7.
Naïve Bayes for Classification
Classification learning: what's the probability of the class given an instance? Evidence E = instance; event H = class value for the instance.
Naïve Bayes assumption: the "independent feature model", i.e., the presence (or absence) of a particular attribute (or feature) of a class is unrelated to the presence (or absence) of any other attribute, therefore:
P(H|E) = P(E|H) P(H) / P(E)
P(H|E1, E2, ..., En) = P(E1|H) P(E2|H) ... P(En|H) P(H) / P(E)
8.
Naïve Bayes for Classification
P(play=y | outlook=s, temp=c, humid=h, windy=t)
= P(out=s|pl=y) P(te=c|pl=y) P(hu=h|pl=y) P(wi=t|pl=y) P(pl=y) / P(out=s, te=c, hu=h, wi=t)
= (2/9)(3/9)(3/9)(3/9)(9/14) / P(out=s, te=c, hu=h, wi=t)
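The calculation above can be carried out directly; a minimal sketch, using the counts from the weather data (class yes: 9/14, sunny|yes 2/9, cool|yes 3/9, high|yes 3/9, true|yes 3/9; class no: 5/14, sunny|no 3/5, cool|no 1/5, high|no 4/5, true|no 3/5):

```python
from fractions import Fraction as F

# unnormalized scores for both classes; P(E) cancels in the normalization
score_yes = F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9) * F(9, 14)
score_no  = F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5) * F(5, 14)

p_yes = score_yes / (score_yes + score_no)
print(float(p_yes))  # ~0.205, so the prediction is play = no
```

Normalizing the two scores gives roughly 20.5% for yes and 79.5% for no, so naïve Bayes predicts "no" for this instance.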
9.
The Sparseness Problem (the "zero-frequency problem")
What if an attribute value doesn't occur with every class value (e.g., "Outlook = overcast" for class "no")? The probability will be zero: P(outlook=overcast|play=no) = 0. The a posteriori probability will then also be zero, no matter how likely the other values are: P(play=no|outlook=overcast, temp=cool, humidity=high, windy=true) = 0.
Remedy: add 1 to the count for every attribute value-class combination (Laplace estimator). Result: probabilities will never be zero! (Also: stabilizes probability estimates.)
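A minimal sketch of the Laplace estimator, applied to the overcast/no case from the slide (the helper name is mine, not from the slides):

```python
from collections import Counter

def laplace_prob(value, values_for_class, n_distinct):
    """P(attr=value | class) with add-one (Laplace) smoothing:
    add 1 to every count, so no estimate is ever zero."""
    counts = Counter(values_for_class)
    return (counts[value] + 1) / (len(values_for_class) + n_distinct)

# Outlook values among the 5 "no" instances of the weather data
outlook_no = ["sunny", "sunny", "rainy", "sunny", "rainy"]
p = laplace_prob("overcast", outlook_no, n_distinct=3)
print(p)  # (0 + 1) / (5 + 3) = 0.125 instead of 0
```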
10.
Modified Probability Estimates
In some cases adding a constant different from 1 might be more appropriate. Example: attribute outlook for class yes. We can apply an equal weight, or the weights don't need to be equal (as long as they sum to 1, that is, p1 + p2 + p3 = 1).

           Equal weight       Normalized weight (p1 + p2 + p3 = 1)
sunny      (2 + m/3)/(9 + m)  (2 + m p1)/(9 + m)
overcast   (4 + m/3)/(9 + m)  (4 + m p2)/(9 + m)
rainy      (3 + m/3)/(9 + m)  (3 + m p3)/(9 + m)
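This generalization is often called an m-estimate; a small sketch (the function name is mine), reproducing the equal-weight column for sunny with m = 3:

```python
def m_estimate(count, class_total, m, prior):
    """P(attr=value | class) with an m-estimate: m virtual samples
    distributed according to a prior (prior = 1/3 gives equal weights)."""
    return (count + m * prior) / (class_total + m)

# outlook counts for class yes: sunny 2, overcast 4, rainy 3 (total 9)
p_sunny = m_estimate(2, 9, m=3, prior=1 / 3)
print(p_sunny)  # (2 + 1) / (9 + 3) = 0.25
```

With m = 3 and equal priors this reduces exactly to the Laplace estimator.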
11.
Missing Value Problem
Training: the instance is not included in the frequency count for the attribute value-class combination.
Classification: the attribute is omitted from the calculation.
12.
Dealing with Numeric Attributes
Common assumption: attributes have a normal or Gaussian probability distribution (given the class). The probability density function for the normal distribution is defined by:
The sample mean: mu = (1/n) * sum_i x_i
The standard deviation: sigma = sqrt( (1/(n-1)) * sum_i (x_i - mu)^2 )
The density function: f(x) = (1 / (sqrt(2*pi) * sigma)) * e^( -(x - mu)^2 / (2*sigma^2) )
14.
Statistics for the Weather Data
Example for density value:
f(temperature=66|yes) = (1 / (sqrt(2*pi) * 6.2)) * e^( -(66-73)^2 / (2*6.2^2) ) = 0.0340
f(humidity=99|no) = (1 / (sqrt(2*pi) * 9.7)) * e^( -(99-86.2)^2 / (2*9.7^2) ) = 0.0380
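The temperature density can be checked directly; a short sketch (the function name is mine), using the class-yes parameters mean 73 and standard deviation 6.2:

```python
import math

def gaussian_density(x, mu, sigma):
    """Normal density used by naive Bayes for numeric attributes."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# temperature for class yes: mean 73, std dev 6.2 (from the weather data)
d = gaussian_density(66, mu=73, sigma=6.2)
print(round(d, 4))  # 0.034, matching the slide
```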
15.
Classify a New Case
Classify a new case (if there are any missing values, in both training and classification, omit them). The case we would like to predict:
16.
Probability Densities
Relationship between probability and density: Pr[c - eps/2 < x < c + eps/2] ~ eps * f(c). But this doesn't change the calculation of a posteriori probabilities because eps is cancelled out. Exact relationship: Pr[a <= x <= b] = integral from a to b of f(t) dt.
17.
Discussion of Naïve Bayes
Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated). Why? Because classification doesn't require accurate probability estimates as long as the maximum probability is assigned to the correct class. However: adding too many redundant attributes will cause problems (e.g., identical attributes). Note also: many numeric attributes are not normally distributed (-> kernel density estimators).
18.
General Bayesian Classification
Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.
Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities.
19.
Bayesian Theorem
Given training data D, the posteriori probability of a hypothesis h, P(h|D), follows the Bayes theorem:
P(h|D) = P(D|h) P(h) / P(D)
MAP (maximum a posteriori) hypothesis:
h_MAP = argmax_{h in H} P(h|D) = argmax_{h in H} P(D|h) P(h)
Difficulty: needs initial knowledge of many probabilities, and has significant computational cost. If we assume P(h_i) = P(h_j), the method can be simplified further by choosing the Maximum Likelihood (ML) hypothesis:
h_ML = argmax_{h_i in H} P(D|h_i)
20.
Naïve Bayes Classifiers
Assumption: attributes are conditionally independent:
c_MAP = argmax_{c_i in C} P(c_i | {v_1, v_2, ..., v_J}) = argmax_{c_i in C} P(c_i) * product_{j=1..J} P(v_j | c_i)
This greatly reduces the computation cost: only count the class distribution. However, the assumption is seldom satisfied in practice, as attributes (variables) are often correlated. Attempts to overcome this limitation include:
Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
Decision trees, which reason on one attribute at a time, considering the most important attributes first
Association rules, which reason about a class by several attributes
21.
Bayesian Belief Network (An Example)
Network: a directed acyclic graph over the variables Storm, BusTourGroup, Lightning, Campfire, Thunder and ForestFire. The conditional probability table (CPT) for the variable Campfire:

      (S,B)  (S,~B)  (~S,B)  (~S,~B)
C     0.4    0.1     0.8     0.2
~C    0.6    0.9     0.2     0.8

The network represents a set of conditional independence assertions. Also called Bayes Nets. Attributes (variables) are often correlated. Each variable is conditionally independent given its predecessors.
22.
Bayesian Belief Network (Dependence and Independence)
Priors: P(Storm) = 0.7, P(BusTourGroup) = 0.85. The CPT for the variable Campfire:

      (S,B)  (S,~B)  (~S,B)  (~S,~B)
C     0.4    0.1     0.8     0.2
~C    0.6    0.9     0.2     0.8

The network represents the joint probability distribution over all variables, e.g., P(Storm, BusTourGroup, ..., ForestFire). In general:
P(y_1, y_2, ..., y_n) = product_{i=1..n} P(y_i | Parents(Y_i))
where Parents(Y_i) denotes the immediate predecessors of Y_i in the graph. So the joint distribution is fully defined by the graph plus the tables P(y_i | Parents(Y_i)).
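The factored joint can be sketched for the Campfire fragment; this assumes the priors P(Storm) = 0.7 and P(BusTourGroup) = 0.85 shown on the slide, and uses only the Campfire CPT:

```python
p_storm = 0.7
p_bus = 0.85
cpt_campfire = {  # P(Campfire=True | Storm, BusTourGroup)
    (True, True): 0.4, (True, False): 0.1,
    (False, True): 0.8, (False, False): 0.2,
}

def joint(storm, bus, campfire):
    """P(Storm, BusTourGroup, Campfire)
    = P(storm) * P(bus) * P(campfire | storm, bus),
    i.e. the product over P(y_i | Parents(Y_i))."""
    p = (p_storm if storm else 1 - p_storm) * (p_bus if bus else 1 - p_bus)
    pc = cpt_campfire[(storm, bus)]
    return p * (pc if campfire else 1 - pc)

print(joint(True, True, True))  # 0.7 * 0.85 * 0.4 = 0.238
```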
23.
Bayesian Belief Network (Inference in Bayes Nets)
Infer the values of one or more network variables, given observed values of others. A Bayes net contains all the information needed for this inference. If only one variable has an unknown value, it is easy to infer; in the general case, the problem is NP-hard. There are three types of inference:
Top-down inference: P(Campfire|Storm)
Bottom-up inference: P(Storm|Campfire)
Hybrid inference: P(BusTourGroup|Storm, Campfire)
24.
Bayesian Belief Network (Training Bayesian Belief Networks)
Several variants of this learning task: the network structure might be known or unknown, and the training examples might provide values of all network variables, or just some.
If the structure is known and all variables are observed, then it is as easy as training a Naïve Bayes classifier.
If the structure is known but only some variables are observed (e.g., we observe ForestFire, Storm, BusTourGroup and Thunder, but not Lightning and Campfire), use gradient ascent: converge to the network h that maximizes P(D|h).
25.
Numerical Modeling: Regression
A numerical model is used for prediction. Counterparts exist for all the schemes we previously discussed: decision trees, statistical models, etc. All classification schemes can be applied to regression problems using discretization. Prediction: weighted average of the intervals' midpoints (weighted according to class probabilities). Regression is more difficult than classification (i.e., percent correct vs. mean squared error).
Prediction – Regression
26.
Linear Regression
Works most naturally with numeric attributes; the standard technique for numeric prediction. The outcome is a linear combination of the attributes:
Y = sum_{j=0..k} w_j x_j = w_0 x_0 + w_1 x_1 + w_2 x_2 + ... + w_k x_k
The weights are calculated from the training data. Predicted value for the first instance X^(1):
Y^(1) = sum_{j=0..k} w_j x_j^(1) = w_0 x_0^(1) + w_1 x_1^(1) + ... + w_k x_k^(1)
27.
Minimize the Squared Error (I)
The k+1 coefficients are chosen so that the squared error on the training data is minimized. Squared error:
sum_{i=1..n} ( y^(i) - sum_{j=0..k} w_j x_j^(i) )^2
The coefficients can be derived using standard matrix operations. This can be done if there are more instances than attributes (roughly speaking); if there are fewer instances, there are many solutions. Minimization of the absolute error is more difficult!
28.
Minimize the Squared Error (II)
In matrix form, with Y (n x 1), X (n x (k+1)) and w ((k+1) x 1), where row i of X is (x_0^(i), x_1^(i), ..., x_k^(i)):
min_w sum_{i=1..n} ( y^(i) - sum_{j=0..k} w_j x_j^(i) )^2 = min_w || Y - X w ||^2
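The matrix formulation can be sketched with a least-squares solver; the data here are invented for illustration, with x0 = 1 supplying the intercept:

```python
import numpy as np

# Solve min_w ||Y - Xw||^2; rows of X are (x0, x1) with x0 = 1.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
Y = np.array([3.0, 5.0, 7.0, 9.0])  # exactly Y = 1 + 2*x1

w, residuals, rank, sv = np.linalg.lstsq(X, Y, rcond=None)
print(w)  # recovers the weights [1, 2]
```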
29.
Example: Find the Linear Regression of Salary Data
X = {x1} = years of experience, Y = salary (in $1000s). For simplicity, x0 = 1; therefore Y = w0 + w1 x1.

Years experience  Salary (in $1000s)
3                 30
8                 57
9                 64
13                72
3                 36
6                 43
11                59
21                90
1                 20
16                83

With the method of least squared error (s = number of training instances = 10, x-bar = 9.1, y-bar = 55.4):
w1 = sum_{i=1..s} (x1^(i) - x-bar)(y^(i) - y-bar) / sum_{i=1..s} (x1^(i) - x-bar)^2 = 3.5
w0 = y-bar - w1 * x-bar = 23.55
The predicted line is estimated by Y = 23.55 + 3.5 x1.
Prediction for X = 10: Y = 23.55 + 3.5(10) = 58.55
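The closed-form fit above can be reproduced directly (note the slide rounds the slope to 3.5 before computing the intercept, which this sketch mirrors):

```python
xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

xbar = sum(xs) / len(xs)                 # 9.1
ybar = sum(ys) / len(ys)                 # 55.4
# slope from the least-squares formula (~3.54, rounded to 3.5 on the slide)
w1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
     / sum((x - xbar) ** 2 for x in xs)
w0 = ybar - round(w1, 1) * xbar          # 55.4 - 3.5*9.1 = 23.55
print(round(w1, 1), round(w0, 2))        # 3.5 23.55
print(round(w0 + round(w1, 1) * 10, 2))  # prediction for x1 = 10: 58.55
```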
30.
Classification using Linear Regression (One with the Others)
Any regression technique can be used for classification.
Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class, and 0 for those that do not.
Prediction: predict the class corresponding to the model with the largest output value (membership value). For linear regression, this is known as multi-response linear regression.
For example, if the data has three classes {A, B, C}:
Linear Regression Model 1: predict 1 for class A and 0 for not A
Linear Regression Model 2: predict 1 for class B and 0 for not B
Linear Regression Model 3: predict 1 for class C and 0 for not C
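A minimal sketch of multi-response linear regression on a toy one-dimensional, three-class problem (the data are invented for illustration):

```python
import numpy as np

# nine points on a line; classes A, B, C occupy three separated regions
X = np.array([[1.0, x] for x in [1, 2, 3, 6, 7, 8, 11, 12, 13]])
labels = ["A"] * 3 + ["B"] * 3 + ["C"] * 3

models = {}
for cls in ["A", "B", "C"]:
    # 0/1 membership target: 1 for "this class", 0 for the others
    y = np.array([1.0 if l == cls else 0.0 for l in labels])
    models[cls], *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(x):
    """Choose the class whose regression model outputs the largest value."""
    return max(models, key=lambda cls: models[cls] @ np.array([1.0, x]))

print(predict(1.5), predict(12.5))
```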
31.
Classification using Linear Regression (Pairwise Regression)
Another way of using regression for classification: build a regression function for every pair of classes, using only instances from these two classes. An output of +1 is assigned to one member of the pair, an output of -1 to the other. Prediction is done by voting: the class that receives the most votes is predicted (alternative: "don't know" if there is no agreement). This is more likely to be accurate, but more expensive.
For example, if the data has three classes {A, B, C}:
Linear Regression Model 1: predict +1 for class A and -1 for class B
Linear Regression Model 2: predict +1 for class A and -1 for class C
Linear Regression Model 3: predict +1 for class B and -1 for class C
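The voting step can be sketched as follows; the three pairwise outputs here are hypothetical stand-ins for trained models:

```python
def vote(pairwise_outputs):
    """pairwise_outputs maps (pos_class, neg_class) -> +1 or -1;
    each pairwise model votes for one class of its pair."""
    tally = {}
    for (pos, neg), out in pairwise_outputs.items():
        winner = pos if out > 0 else neg
        tally[winner] = tally.get(winner, 0) + 1
    return max(tally, key=tally.get)

# Suppose the A-vs-B model says A, A-vs-C says A, and B-vs-C says C:
outputs = {("A", "B"): +1, ("A", "C"): +1, ("B", "C"): -1}
print(vote(outputs))  # A, with 2 of the 3 votes
```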
33.
Support Vector Machine (SVM)
SVM is related to statistical learning theory. SVM was first introduced in 1992 [1] by Vladimir Vapnik, a researcher from the former Soviet Union. SVM became popular because of its success in handwritten digit recognition: a 1.1% test error rate, the same as the error rate of a carefully constructed neural network, LeNet 4. SVM is now regarded as an important example of "kernel methods", one of the key areas in machine learning, and is popularly used in classification tasks.
Support Vector Machines
34.
What is a Good Decision Boundary?
Consider a two-class, linearly separable classification problem. There are many possible decision boundaries! The Perceptron algorithm can be used to find such a boundary, and different algorithms have been proposed. Are all decision boundaries equally good?
35.
Examples of Bad Decision Boundaries
(Figure: two candidate boundaries, each lying too close to the points of one class; the best boundary stays far from both Class 1 and Class 2.)
36.
Large-margin Decision Boundary
The decision boundary should be as far away from the data of both classes as possible: we should maximize the margin m. The distance between the origin and the line w^T x = k is k/||w||, so the margin between the hyperplanes w^T x + b = 1 and w^T x + b = -1 (with the boundary w^T x + b = 0 between them) is:
m = 2 / ||w||
37.
Example
Solving w^T x + b = +1 on the Class 2 support vectors and w^T x + b = -1 on the Class 1 support vectors gives:
w1 = 2/3, w2 = 2/3, b = -5/3
Distance between the two hyperplanes: m = 2/||w|| = 3*sqrt(2)/2, approximately 2.12.
38.
Example
Best boundary: maximize m = 2/||w||, or equivalently minimize ||w||. As we also want to prevent data points from falling into the margin, we add the following constraints for each point i:
w^T x_i + b >= +1 for x_i in the first class, and
w^T x_i + b <= -1 for x_i in the second class.
For n points, this can be rewritten as:
y_i (w^T x_i + b) >= 1, for 1 <= i <= n
39.
Previously, it was difficult to solve the primal form because it depends on ||w||, the norm of w, which involves a square root. We alter the equation by substituting ||w|| with (1/2)||w||^2 (the factor of 1/2 being used for mathematical convenience). This is called a "quadratic programming (QP) optimization" problem:
Minimize in (w, b): (1/2) ||w||^2
subject to (for any i = 1, ..., n): y_i (w^T x_i + b) >= 1
How to solve this optimization, and more information on SVM (e.g., dual form, kernels), can be found in ref [1].
[1] Support Vector Machines and other kernel-based learning methods, John Shawe-Taylor & Nello Cristianini, Cambridge University Press, 2000. http://www.support-vector.net
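The margin formula and constraints can be checked numerically; this sketch uses the weight vector and bias reconstructed from the slide's example (w = (2/3, 2/3), b = -5/3), and two hypothetical points lying exactly on the margin hyperplanes:

```python
import math

w = (2 / 3, 2 / 3)
b = -5 / 3

def decision(x):
    """w^T x + b"""
    return w[0] * x[0] + w[1] * x[1] + b

margin = 2 / math.hypot(*w)          # m = 2 / ||w||
print(round(margin, 2))              # 3*sqrt(2)/2 ~ 2.12

# points on the two margin hyperplanes w^T x + b = +1 and -1
print(round(decision((2, 2)), 6))    # +1  (a point with x1 + x2 = 4)
print(round(decision((1, 0)), 6))    # -1  (a point with x1 + x2 = 1)
```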
40.
Extension to Non-linear Decision Boundary
So far, we have only considered large-margin classifiers with a linear decision boundary. How do we generalize to a nonlinear one? Key idea: transform x_i to a higher-dimensional space to "make life easier".
Input space: the space where the points x_i are located.
Feature space: the space of f(x_i) after transformation.
Why transform? A linear operation in the feature space is equivalent to a non-linear operation in the input space, and classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature x1*x2 makes the problem linearly separable.
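The XOR remark can be made concrete: in the input space (x1, x2), XOR is not linearly separable, but after adding the feature x1*x2 a linear function separates it. The separating weights below are one valid choice, not from the slides:

```python
# XOR truth table: label 1 iff exactly one input is 1
points = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def transform(x1, x2):
    """Map the input space into a 3-D feature space."""
    return (x1, x2, x1 * x2)

def classify(x1, x2):
    """A linear function in feature space: f = x1 + x2 - 2*(x1*x2)."""
    f1, f2, f3 = transform(x1, x2)
    return 1 if (f1 + f2 - 2 * f3) > 0.5 else 0

print(all(classify(*p) == label for p, label in points.items()))  # True
```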
41.
Transforming the Data
(Figure: points in the input space are mapped by f(.) into the feature space.)
Note: in practice the feature space is of higher dimension than the input space. Computation in the feature space can be costly because it is high-dimensional; the feature space is typically infinite-dimensional! The kernel trick can help (more info in ref [1]).
[1] Support Vector Machines and other kernel-based learning methods, John Shawe-Taylor & Nello Cristianini, Cambridge University Press, 2000. http://www.support-vector.net
42.
Why does SVM Work?
The feature space is often very high dimensional. Why don't we have the curse of dimensionality? A classifier in a high-dimensional space has many parameters and is hard to estimate. Vapnik argues that the fundamental problem is not the number of parameters to be estimated; rather, the problem is the flexibility of the classifier. Typically, a classifier with many parameters is very flexible, but there are also exceptions: let x_i = 10^i, where i ranges from 1 to n; the classifier sign(sin(a*x)), with the single parameter a, can classify all x_i correctly for all possible combinations of class labels on x_i. This one-parameter classifier is very flexible.
43.
Why does SVM Work? (cont.)
Vapnik argues that the flexibility of a classifier should not be characterized by its number of parameters, but by its flexibility (capacity). This is formalized by the "VC-dimension" of a classifier. Consider a linear classifier in a two-dimensional space: if we have three training data points, no matter how those points are labeled, we can classify them perfectly.
44.
VC-dimension
However, if we have four points, we can find a labeling such that the linear classifier fails to be perfect. We can see that 3 is the critical number: the VC-dimension of a linear classifier in a 2-D space is 3 because, if we have 3 points in the training set, perfect classification is always possible irrespective of the labeling, whereas for 4 points, perfect classification can be impossible.
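The claim that a linear classifier shatters three points in 2-D can be verified by brute force: for every one of the 8 labelings of three non-collinear points, a perceptron finds a perfect separator. The three points below are hypothetical, chosen in general position:

```python
from itertools import product

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

def perceptron(labels, epochs=1000):
    """Return True if a linear separator (w, b) is found for this labeling."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:  # misclassified
                w[0] += y * x1; w[1] += y * x2; b += y
                mistakes += 1
        if mistakes == 0:
            return True  # perfect separation achieved
    return False

shattered = all(perceptron(lab) for lab in product([-1, 1], repeat=3))
print(shattered)  # True: all 8 labelings are linearly separable
```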
45.
Other Aspects of SVM
How to use SVM for multi-class classification? The original SVM is for binary classification. One can change the QP formulation to become multi-class, but more often multiple binary classifiers are combined: one can train multiple one-versus-the-rest classifiers, or combine multiple pairwise classifiers "intelligently".
How to interpret the SVM discriminant function value as a probability? By performing logistic regression on the SVM outputs for a set of data (a validation set) that is not used for training. Some SVM software (like LIBSVM) has these features built in.
A list of SVM implementations can be found at http://www.kernel-machines.org/software.html. Some implementations (such as LIBSVM) can handle multi-class classification. SVMlight is among the earliest implementations of SVM. Several MATLAB toolboxes for SVM are also available.
46.
Strengths and Weaknesses of SVM
Strengths:
Training is relatively easy; there are no local optima, unlike in neural networks.
It scales relatively well to high-dimensional data.
The tradeoff between classifier complexity and error can be controlled explicitly.
Non-traditional data like strings and trees can be used as input to SVM, instead of feature vectors.
Weaknesses:
Need to choose a "good" kernel function.
47.
Example: Predicting a Class Label using Naïve Bayesian Classification

RID  age     income  student  credit_rating  Class: buys_computer
1    <=30    high    no       fair           no
2    <=30    high    no       excellent      no
3    31…40   high    no       fair           yes
4    >40     medium  no       fair           yes
5    >40     low     yes      fair           yes
6    >40     low     yes      excellent      no
7    31…40   low     yes      excellent      yes
8    <=30    medium  no       fair           no
9    <=30    low     yes      fair           yes
10   >40     medium  yes      fair           yes
11   <=30    medium  yes      excellent      yes
12   31…40   medium  no       excellent      yes
13   31…40   high    yes      fair           yes
14   >40     medium  no       excellent      no

Unknown sample:
15   <=30    medium  yes      fair           ?
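The prediction for the unknown sample X = (age<=30, income=medium, student=yes, credit_rating=fair) can be computed straight from the table:

```python
rows = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
sample = ("<=30", "medium", "yes", "fair")

def score(cls):
    """P(class) * product over attributes of P(attr=value | class)."""
    subset = [r for r in rows if r[-1] == cls]
    p = len(subset) / len(rows)      # class prior: 9/14 yes, 5/14 no
    for j, v in enumerate(sample):   # conditional independence assumption
        p *= sum(1 for r in subset if r[j] == v) / len(subset)
    return p

pred = max(("yes", "no"), key=score)
print(pred, round(score("yes"), 3), round(score("no"), 3))  # yes 0.028 0.007
```

Since 0.028 > 0.007, naïve Bayes predicts buys_computer = yes for the unknown sample.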
48.
Exercise: Using the Naïve Bayesian classifier, predict the class of the unknown data samples.

Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  N
Sunny     Hot          High      True   N
Overcast  Hot          High      False  Y
Rainy     Mild         High      False  Y
Rainy     Cool         Normal    False  Y
Rainy     Cool         Normal    True   N
Overcast  Cool         Normal    True   Y
Sunny     Mild         High      False  N
Sunny     Cool         Normal    False  Y
Rainy     Mild         Normal    False  Y
Sunny     Mild         Normal    True   Y
Overcast  Hot          Normal    False  Y
Overcast  Mild         High      True   Y
Rainy     Mild         High      True   N

Unknown data samples:
Sunny     Cool         Normal    False  ?
Rainy     Mild         High      False  ?