1. DBM630: Data Mining and Data Warehousing
MS.IT. Rangsit University
Semester 2/2011
Lecture 7: Classification and Prediction
Naïve Bayes, Regression and SVM
by Kritsada Sriphaew (sriphaew.k AT gmail.com)
2. Topics
Statistical Modeling: Naïve Bayes Classification
sparseness problem
missing value
numeric attributes
Regression
Linear Regression
Regression Tree
Support Vector Machine
3. Statistical Modeling
“Opposite” of 1R: use all the attributes
Two assumptions: attributes are (1) equally important and (2) statistically independent (given the class value)
This means that knowledge about the value of a particular attribute doesn't tell us anything about the value of another attribute (if the class is known)
Although based on assumptions that are almost never
correct, this scheme works well in practice!
4. An Example: Evaluating the Weather Attributes (Revised)

Weather data:

Outlook   Temp.  Humidity  Windy  Play
sunny     hot    high      false  no
sunny     hot    high      true   no
overcast  hot    high      false  yes
rainy     mild   high      false  yes
rainy     cool   normal    false  yes
rainy     cool   normal    true   no
overcast  cool   normal    true   yes
sunny     mild   high      false  no
sunny     cool   normal    false  yes
rainy     mild   normal    false  yes
sunny     mild   normal    true   yes
overcast  mild   high      true   yes
overcast  hot    normal    false  yes
rainy     mild   high      true   no

1R rules and error rates:

Attribute  Rule              Error  Total Error
Outlook    sunny -> no       2/5    4/14
           overcast -> yes   0/4
           rainy -> yes      2/5
Temp.      hot -> no*        2/4    5/14
           mild -> yes       2/6
           cool -> yes       1/4
Humidity   high -> no        3/7    4/14
           normal -> yes     1/7
Windy      false -> yes      2/8    5/14
           true -> no*       3/6

1R chooses the attribute that produces rules with the smallest number of errors, i.e., rule 1 (Outlook) or rule 3 (Humidity).
5. Probabilities for the Weather Data
Probabilistic model: the counts in the weather data are converted into per-class relative frequencies, e.g., P(outlook = sunny | play = yes) = 2/9 and the prior P(play = yes) = 9/14 (the full table was shown as a figure on the original slide).
6. Bayes’s Rule
Probability of event H given evidence E:
$p(H|E) = \dfrac{p(E|H)\,p(H)}{p(E)}$
A priori probability of H: p(H)
Probability of event before evidence has
been seen
A posteriori probability of H: p(H|E)
Probability of event after evidence has been
seen
7. Naïve Bayes for Classification
Classification learning: what’s the probability of the class given
an instance?
Evidence E = instance
Event H = class value for instance
Naïve Bayes assumption: "independent feature model", i.e.,
the presence (or absence) of a particular attribute (or
feature) of a class is unrelated to the presence (or absence)
of any other attribute, therefore:
$p(H|E) = \dfrac{p(E|H)\,p(H)}{p(E)}$

$p(H|E_1, E_2, \ldots, E_n) = \dfrac{p(E_1|H)\,p(E_2|H)\cdots p(E_n|H)\,p(H)}{p(E)}$
8. Naïve Bayes for Classification
$p(\text{play}=y \mid \text{outlook}=s,\ \text{temp}=c,\ \text{humid}=h,\ \text{windy}=t)$
$= \dfrac{p(\text{out}=s \mid \text{pl}=y)\,p(\text{te}=c \mid \text{pl}=y)\,p(\text{hu}=h \mid \text{pl}=y)\,p(\text{wi}=t \mid \text{pl}=y)\,p(\text{pl}=y)}{p(\text{out}=s,\ \text{te}=c,\ \text{hu}=h,\ \text{wi}=t)}$
$= \dfrac{\frac{2}{9}\cdot\frac{3}{9}\cdot\frac{3}{9}\cdot\frac{3}{9}\cdot\frac{9}{14}}{p(\text{out}=s,\ \text{te}=c,\ \text{hu}=h,\ \text{wi}=t)}$
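A minimal sketch of the calculation above (in Python; not part of the original slides). The counts come from the weather table on slide 4, and the evidence term p(E) is obtained by normalizing over both classes; the function name posterior is mine.

```python
from collections import Counter

# Weather data from slide 4: (outlook, temp, humidity, windy, play)
data = [
    ("sunny","hot","high","false","no"),     ("sunny","hot","high","true","no"),
    ("overcast","hot","high","false","yes"), ("rainy","mild","high","false","yes"),
    ("rainy","cool","normal","false","yes"), ("rainy","cool","normal","true","no"),
    ("overcast","cool","normal","true","yes"),("sunny","mild","high","false","no"),
    ("sunny","cool","normal","false","yes"), ("rainy","mild","normal","false","yes"),
    ("sunny","mild","normal","true","yes"),  ("overcast","mild","high","true","yes"),
    ("overcast","hot","normal","false","yes"),("rainy","mild","high","true","no"),
]

def posterior(instance):
    """Naive Bayes: p(class) * prod_j p(value_j | class), normalized over classes."""
    class_counts = Counter(row[-1] for row in data)
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(data)                   # prior p(class)
        for j, value in enumerate(instance):      # likelihoods p(E_j | class)
            n_match = sum(1 for row in data if row[-1] == c and row[j] == value)
            score *= n_match / n_c
        scores[c] = score
    total = sum(scores.values())                  # p(E), by normalization
    return {c: s / total for c, s in scores.items()}

print(posterior(("sunny", "cool", "high", "true")))
# no ≈ 0.795, yes ≈ 0.205 -> predicted class: no
```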
9. The Sparseness Problem
(The “zero-frequency problem”)
What if an attribute value doesn't occur with every class value (e.g.
"Outlook = overcast" for class "no")?
Probability will be zero! P(outlook=overcast|play=no) = 0
A posteriori probability will also be zero! (No matter how likely the
other values are!)
P(play=no|outlook=overcast, temp=cool, humidity=high, windy=true) = 0
Remedy: add 1 to the count for every attribute value-class
combination (Laplace estimator)
Result: probabilities will never be zero! (also: stabilizes probability
estimates)
10. Modified Probability Estimates
In some cases adding a constant different from 1 might
be more appropriate
Example: attribute outlook for class yes
We can apply equal weights, or the weights need not be equal (as long as they sum to 1, i.e., p1 + p2 + p3 = 1)

Equal weights:
$p(\text{sunny}\mid\text{yes}) = \dfrac{2 + m/3}{9 + m}, \quad p(\text{overcast}\mid\text{yes}) = \dfrac{4 + m/3}{9 + m}, \quad p(\text{rainy}\mid\text{yes}) = \dfrac{3 + m/3}{9 + m}$

Normalized weights ($p_1 + p_2 + p_3 = 1$):
$p(\text{sunny}\mid\text{yes}) = \dfrac{2 + m\,p_1}{9 + m}, \quad p(\text{overcast}\mid\text{yes}) = \dfrac{4 + m\,p_2}{9 + m}, \quad p(\text{rainy}\mid\text{yes}) = \dfrac{3 + m\,p_3}{9 + m}$
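A small sketch of these smoothed estimates (in Python; not from the slides). The function name smoothed_prob and its parameters are mine; with m equal to the number of attribute values and equal weights, it reduces to the Laplace estimator of the previous slide.

```python
def smoothed_prob(count, class_count, m=3.0, p=None, k=3):
    """m-estimate of p(value | class).

    count       : how often the value occurs with this class (e.g. 2 for sunny/yes)
    class_count : how often the class occurs                 (e.g. 9 for yes)
    m           : weight given to the prior (m = k with equal weights = Laplace)
    p           : prior weight for this value; defaults to the equal weight 1/k
    k           : number of distinct values of the attribute (3 for outlook)
    """
    if p is None:
        p = 1.0 / k
    return (count + m * p) / (class_count + m)

# Equal weights, m = 3: Laplace-style estimates for outlook given play = yes
print(smoothed_prob(2, 9))   # sunny    -> (2 + 1) / (9 + 3) = 0.25
print(smoothed_prob(4, 9))   # overcast -> (4 + 1) / (9 + 3) ≈ 0.417
print(smoothed_prob(3, 9))   # rainy    -> (3 + 1) / (9 + 3) ≈ 0.333
```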
11. Missing Value Problem
Training: the instance is not included in the frequency count for that attribute value-class combination
Classification: the attribute is omitted from the calculation
12. Dealing with Numeric Attributes
Common assumption: attributes have a normal or
Gaussian probability distribution (given the class)
The probability density function for the normal
distribution is defined by:
The sample mean: $\mu = \dfrac{1}{n}\sum_{i=1}^{n} x_i$
The standard deviation: $\sigma = \sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n} (x_i - \mu)^2}$
The density function: $f(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
13. An Example: Evaluating the Weather Attributes (Numeric)
Outlook Temp. Humidity Windy Play
sunny 85 85 false no
sunny 80 90 true no
overcast 83 86 false yes
rainy 70 96 false yes
rainy 68 80 false yes
rainy 65 70 true no
overcast 64 65 true yes
sunny 72 95 false no
sunny 69 70 false yes
rainy 75 80 false yes
sunny 75 70 true yes
overcast 72 90 true yes
overcast 81 75 false yes
rainy 71 91 true no
14. Statistics for the Weather Data
(Table of per-class means and standard deviations for each numeric attribute, e.g., temperature given yes: mean 73, std. dev. 6.2; humidity given no: mean 86.2, std. dev. 9.7)

Example density values:

$f(\text{temperature}=66 \mid \text{yes}) = \dfrac{1}{\sqrt{2\pi}\cdot 6.2}\, e^{-\frac{(66-73)^2}{2\cdot 6.2^2}} = 0.0340$

$f(\text{humidity}=90 \mid \text{no}) = \dfrac{1}{\sqrt{2\pi}\cdot 9.7}\, e^{-\frac{(90-86.2)^2}{2\cdot 9.7^2}} = 0.0380$
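A quick check of these density values (in Python; not from the slides), using the per-class means and standard deviations quoted above:

```python
import math

def gaussian_density(x, mean, std):
    """Normal density f(x) used by Naive Bayes for numeric attributes."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

print(gaussian_density(66, 73.0, 6.2))   # f(temperature=66 | yes) ≈ 0.0340
print(gaussian_density(90, 86.2, 9.7))   # f(humidity=90 | no)    ≈ 0.0380
```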
15. Classify a New Case
Classify a new case (if there are missing values, in either the training data or the case being classified, omit them)
The case we would
like to predict
16. Probability Densities
Relationship between probability and density: $\Pr[x - \varepsilon/2 \le X \le x + \varepsilon/2] \approx \varepsilon \cdot f(x)$ for a small interval width $\varepsilon$
But: this doesn't change the calculation of a posteriori probabilities, because the factor $\varepsilon$ is cancelled out
Exact relationship: $\Pr[a \le X \le b] = \int_a^b f(t)\,dt$
17. Discussion of Naïve Bayes
Naïve Bayes works surprisingly well
(even if independence assumption is clearly violated)
Why? Because classification doesn’t require accurate
probability estimates as long as maximum probability
is assigned to correct class
However: adding too many redundant attributes will cause problems (e.g., identical attributes)
Note also: many numeric attributes are not normally distributed (use kernel density estimators instead)
18. General Bayesian Classification
Probabilistic learning: Calculate explicit probabilities
for hypothesis, among the most practical approaches
to certain types of learning problems
Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis
is correct. Prior knowledge can be combined with
observed data.
Probabilistic prediction: Predict multiple hypotheses,
weighted by their probabilities
19. Bayesian Theorem
Given training data D, posteriori probability of a hypothesis h,
P(h|D) follows the Bayes theorem
$P(h|D) = \dfrac{P(D|h)\,P(h)}{P(D)}$

MAP (maximum a posteriori) hypothesis:
$h_{MAP} \equiv \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} P(D|h)\,P(h)$

Difficulty: need initial knowledge of many probabilities, significant computational cost
If we assume $P(h_i) = P(h_j)$, the method can be further simplified by choosing the Maximum Likelihood (ML) hypothesis:
$h_{ML} \equiv \arg\max_{h_i \in H} P(h_i|D) = \arg\max_{h_i \in H} P(D|h_i)$
20. Naïve Bayes Classifiers
Assumption: attributes are conditionally independent:
$c_{MAP} = \arg\max_{c_i \in C} P(c_i \mid \{v_1, v_2, \ldots, v_J\}) = \arg\max_{c_i \in C} P(c_i)\prod_{j=1}^{J} P(v_j \mid c_i)$
Greatly reduces the computation cost, only count the class distribution.
However, it is seldom satisfied in practice, as attributes (variables) are often
correlated.
Attempts to overcome this limitation:
Bayesian networks, that combine Bayesian reasoning with causal relationships
between attributes
Decision trees, which reason on one attribute at a time, considering the most important attributes first
Association rules, which predict a class from several attributes together
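In practice the product of many small probabilities underflows, so implementations usually compare log-scores instead. A minimal sketch (in Python; not from the slides), assuming per-class priors and conditional probabilities have already been estimated; the data structures and the function name classify are hypothetical, chosen only to illustrate the argmax:

```python
import math

def classify(instance, priors, cond_probs):
    """Return argmax_c of log P(c) + sum_j log P(v_j | c).

    priors     : dict mapping class -> P(c)
    cond_probs : dict mapping class -> list (one dict per attribute) of value -> P(value | c)
    (illustrative structures, not a library API)
    """
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for j, value in enumerate(instance):
            score += math.log(cond_probs[c][j][value])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```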
21. Bayesian Belief Network
(An Example)
(Network figure, nodes: Storm, BusTourGroup, Lightening, Campfire, Thunder, ForestFire)

The conditional probability table (CPT) for the variable Campfire:

        (S,B)  (S,~B)  (~S,B)  (~S,~B)
 C      0.4    0.1     0.8     0.2
~C      0.6    0.9     0.2     0.8

• Network represents a set of conditional independence assertions.
• Directed acyclic graph, also called Bayes Nets
Attributes (variables) are often correlated.
Each variable is conditionally independent given its predecessors
22. Bayesian Belief Network
(Dependence and Independence)
(Figure: the same network and Campfire CPT as on the previous slide, with the probabilities 0.7 and 0.85 also shown in the figure)

Represents the joint probability distribution over all variables, e.g., P(Storm, BusTourGroup, ..., ForestFire)

In general, $P(y_1, y_2, \ldots, y_n) = \prod_{i=1}^{n} P(y_i \mid Parents(Y_i))$

where Parents(Yi) denotes the immediate predecessors of Yi in the graph
So, the joint distribution is fully defined by the graph plus the tables $P(y_i \mid Parents(Y_i))$
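A minimal sketch of this factorization (in Python; not from the slides) for the Storm / BusTourGroup / Campfire fragment of the network. The Campfire CPT is the one given on the slide; the priors for Storm and BusTourGroup use the 0.7 and 0.85 that appear in the figure, although their exact placement there is an assumption.

```python
# Assumed priors for the parent nodes (illustrative; placement of 0.7/0.85 is a guess)
p_storm = {True: 0.7, False: 0.3}
p_bus   = {True: 0.85, False: 0.15}

# P(Campfire = True | Storm, BusTourGroup) from the slide's CPT
p_campfire_given = {
    (True, True): 0.4, (True, False): 0.1,
    (False, True): 0.8, (False, False): 0.2,
}

def joint(storm, bus, campfire):
    """P(Storm, BusTourGroup, Campfire) = P(S) * P(B) * P(C | S, B)."""
    p_c = p_campfire_given[(storm, bus)]
    if not campfire:
        p_c = 1.0 - p_c
    return p_storm[storm] * p_bus[bus] * p_c

print(joint(True, True, True))   # 0.7 * 0.85 * 0.4 = 0.238
```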
23. Bayesian Belief Network
(Inference in Bayes Nets)
Infer the values of one or more network variables, given
observed values of others
Bayes net contains all information needed for this inference
If only one variable with unknown value, it is easy to infer it.
In the general case, the problem is NP-hard.
Nevertheless, there are three types of inference:
Top-down inference: p(Campfire|Storm)
Bottom-up inference: p(Storm|Campfire)
Hybrid inference: p(BusTourGroup|Storm,Campfire)
24. Bayesian Belief Network
(Training Bayesian Belief Networks)
Several variants of this learning task
Network structure might be known or unknown
Training examples might provide values of all network variables, or
just some
If the structure is known and all variables are observed:
then it is as easy as training a Naïve Bayes classifier
If the structure is known but only some variables are observed, e.g., we observe ForestFire, Storm, BusTourGroup, Thunder but not Lightening, Campfire:
use gradient ascent, converging to the network h that maximizes P(D|h)
25. Numerical Modeling: Regression
Numerical model is used for prediction
Counterparts exist for all schemes that we previously
discussed
Decision trees, statistical models, etc.
All classification schemes can be applied to
regression problems using discretization
Prediction: weighted average of intervals’ midpoints
(weighted according to class probabilities)
Regression is more difficult than classification (i.e., the evaluation measure is mean squared error rather than percent correct)
26. Linear Regression
Works most naturally with numeric attributes
Standard technique for numeric prediction
Outcome is a linear combination of attributes:
$Y = \sum_{j=0}^{k} w_j x_j = w_0 x_0 + w_1 x_1 + w_2 x_2 + \ldots + w_k x_k$
Weights are calculated from the training data
Predicted value for the first instance $X^{(1)}$:
$Y^{(1)} = \sum_{j=0}^{k} w_j x_j^{(1)} = w_0 x_0^{(1)} + w_1 x_1^{(1)} + \ldots + w_k x_k^{(1)}$
27. Minimize the Squared Error (I)
k+1 coefficients are chosen so that the squared error
on the training data is minimized
Squared error:
$\sum_{i=1}^{n} \left( y^{(i)} - \sum_{j=0}^{k} w_j x_j^{(i)} \right)^2$
Coefficients can be derived using standard matrix operations
Can be done if there are more instances than attributes (roughly speaking)
If there are fewer instances, there are many possible solutions
Minimization of absolute error is more difficult!
28. Minimize the Squared Error (II)
$Y = X\,w$

$\min_{w} \sum_{i=1}^{n} \left( y^{(i)} - \sum_{j=0}^{k} w_j x_j^{(i)} \right)^2
= \min_{w} \left\| \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{pmatrix}
- \begin{pmatrix} x_0^{(1)} & x_1^{(1)} & \cdots & x_k^{(1)} \\ \vdots & & & \vdots \\ x_0^{(n)} & x_1^{(n)} & \cdots & x_k^{(n)} \end{pmatrix}
\begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_k \end{pmatrix} \right\|^2$

$Y_{n\times 1} = X_{n\times (k+1)}\, w_{(k+1)\times 1}$
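A small sketch (in Python with numpy; not from the slides) of solving this least-squares problem directly. The x0 column of ones makes w[0] the intercept; the numbers are toy values taken from the first three rows of the salary example on the next slide.

```python
import numpy as np

# n x (k+1) design matrix: constant column x0 = 1, then the attribute values
X = np.array([[1.0, 3.0],
              [1.0, 8.0],
              [1.0, 9.0]])
y = np.array([30.0, 57.0, 64.0])     # targets

# Least-squares weights w minimizing ||y - Xw||^2
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(w)                              # [w0, w1]
```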
29. Example: Find the linear regression of salary data

$Y = \sum_{j=0}^{k} w_j x_j = w_0 x_0 + w_1 x_1 + w_2 x_2 + \ldots + w_k x_k$
For simplicity, $x_0 = 1$; with a single attribute X = {x1}, therefore $Y = w_0 + w_1 x_1$

Salary data:

Years of experience (x1)   Salary in $1000s (Y)
 3                          30
 8                          57
 9                          64
13                          72
 3                          36
 6                          43
11                          59
21                          90
 1                          20
16                          83

With the method of least squared error:
$w_1 = \dfrac{\sum_{i=1}^{s} (x_1^i - \bar{x})(y^i - \bar{y})}{\sum_{i=1}^{s} (x_1^i - \bar{x})^2} = 3.5$
$w_0 = \bar{y} - w_1 \bar{x} = 23.55$

where $\bar{x} = 9.1$, $\bar{y} = 55.4$, and s = # training instances = 10.

The predicted line is estimated by Y = 23.55 + 3.5 x1
Prediction for X = 10: Y = 23.55 + 3.5(10) = 58.55
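A quick check of the formulas above (in Python; not from the slides). The exact least-squares slope is about 3.54; the slide rounds it to 3.5, which gives the intercept 23.55.

```python
xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

x_bar = sum(xs) / len(xs)                                  # 9.1
y_bar = sum(ys) / len(ys)                                  # 55.4
w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)                   # ≈ 3.54 (slide rounds to 3.5)
w0 = y_bar - w1 * x_bar                                    # ≈ 23.2 (23.55 with the rounded slope)

print(w0 + w1 * 10)                                        # prediction for 10 years ≈ 58.6
```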
30. Classification using Linear Regression (One with the Others)
Any regression technique can be used for
classification
Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class, and 0 for those that do not
Prediction: predict the class corresponding to the model with the largest output value (membership value)
For linear regression, this is known as multi-response linear regression
For example, if the data has three classes {A, B, C}:
Linear Regression Model 1: predict 1 for class A and 0 for not A
Linear Regression Model 2: predict 1 for class B and 0 for not B
Linear Regression Model 3: predict 1 for class C and 0 for not C
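A minimal sketch of multi-response linear regression (in Python with numpy; not from the slides): one least-squares model per class trained on a 0/1 target, with prediction by the largest output. The toy data and the function names are mine.

```python
import numpy as np

# Toy 2-feature data with classes A, B, C (illustrative values)
X = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0], [1.0, 4.0], [0.8, 4.2]])
y = np.array(["A", "A", "B", "B", "C", "C"])
Xb = np.hstack([np.ones((len(X), 1)), X])        # add the constant x0 = 1 column

classes = ["A", "B", "C"]
weights = {c: np.linalg.lstsq(Xb, (y == c).astype(float), rcond=None)[0]
           for c in classes}                     # one regression per class

def predict(x):
    xb = np.concatenate([[1.0], x])
    outputs = {c: xb @ w for c, w in weights.items()}
    return max(outputs, key=outputs.get)         # class with largest membership value

print(predict(np.array([4.0, 4.0])))             # expected: 'B'
```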
31. Classification using Linear Regression (Pairwise Regression)
Another way of using regression for classification:
A regression function for every pair of classes, using only instances from these two classes
An output of +1 is assigned to one member of the pair, an output of -1 to the other
Prediction is done by voting
Class that receives the most votes is predicted
Alternative: "don't know" if there is no agreement
More likely to be accurate, but more expensive
For example, if the data has three classes {A, B, C}:
Linear Regression Model 1: predict +1 for class A and -1 for class B
Linear Regression Model 2: predict +1 for class A and -1 for class C
Linear Regression Model 3: predict +1 for class B and -1 for class C
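A brief sketch of pairwise classification by voting (in Python with numpy; not from the slides), reusing the least-squares idea: one +1/-1 regression per pair of classes. The function names are mine; Xb is a design matrix with a leading column of ones as in the previous sketch.

```python
from itertools import combinations
import numpy as np

def train_pairwise(Xb, y, classes):
    """One +1/-1 least-squares model per pair of classes."""
    models = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)
        target = np.where(y[mask] == a, 1.0, -1.0)
        models[(a, b)] = np.linalg.lstsq(Xb[mask], target, rcond=None)[0]
    return models

def predict_pairwise(models, xb):
    """Each pairwise model votes for one class; the most-voted class wins."""
    votes = {}
    for (a, b), w in models.items():
        winner = a if xb @ w > 0 else b     # +1 side votes for a, -1 side for b
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)
```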
33. Support Vector Machine (SVM)
SVM is related to statistical learning theory
SVM was first introduced in 1992 [1] by Vladimir Vapnik, a Soviet
Union researcher
SVM becomes popular because of its success in handwritten digit
recognition
1.1% test error rate for SVM. This is the same as the error rates of a
carefully constructed neural network, LeNet 4.
SVM is now regarded as an important example of "kernel methods", one of the key areas in machine learning.
SVM is popularly used in classification tasks
34. What is a good Decision Boundary?
A two-class, linearly separable classification problem (figure: points of Class 1 and Class 2 in the plane)
Many decision boundaries are possible!
The Perceptron algorithm can be used to find such a boundary
Different algorithms have been proposed
Are all decision boundaries equally good?
35. Examples of Bad Decision Boundaries
(Figure: two decision boundaries that pass very close to the training points of Class 1 or Class 2, contrasted with the best boundary)
36. Large-margin Decision Boundary
The decision boundary should be as far away from the
data of both classes as possible
We should maximize the margin, m
Distance between the origin and the line $w^T x = k$ is $k / \|w\|$
The margin is $m = \dfrac{2}{\|w\|}$
(Figure: the two classes, the separating hyperplane $w^T x + b = 0$, and the two margin hyperplanes $w^T x + b = 1$ and $w^T x + b = -1$, a distance m apart)
37. Example

Setting $w^T x + b = -1$ on the Class 1 support vector and $w^T x + b = +1$ on the Class 2 support vectors gives three equations in $w_1$, $w_2$, $b$; solving them gives
$w = \begin{pmatrix} 2/3 \\ 2/3 \end{pmatrix}, \quad b = -5$

Distance between the 2 hyperplanes:
$m = \dfrac{2}{\|w\|} = \dfrac{3}{\sqrt{2}} \approx 2.12$

(Figure: the two classes plotted on a 0-7 grid, the support vectors, the weight vector w, and the hyperplanes $w^T x + b = 1$, $w^T x + b = 0$, $w^T x + b = -1$)
38. Best boundary

Solve: maximize $m = \dfrac{2}{\|w\|}$, or equivalently minimize $\|w\|$
As we also want to prevent data points from falling into the margin, we add the following constraints for each point i:
$w^T x_i + b \ge 1$ for $x_i$ of the first class, and
$w^T x_i + b \le -1$ for $x_i$ of the second class.
For n points, this can be rewritten as:
$y_i (w^T x_i + b) \ge 1 \quad \text{for all } 1 \le i \le n$

(Figure: the same two classes, margin m, weight vector w, and hyperplanes $w^T x + b = 1$, $w^T x + b = 0$, $w^T x + b = -1$)
39. Primal form

Previously, the problem was difficult to solve because it depends on ||w||, the norm of w, which involves a square root.
We alter the equation by substituting $\|w\|$ with $\frac{1}{2}\|w\|^2$ (the factor of 1/2 being used for mathematical convenience).
This is called a "Quadratic programming (QP) optimization" problem:

Minimize in (w, b): $\frac{1}{2}\|w\|^2$
subject to (for any i = 1, ..., n): $y_i (w^T x_i + b) \ge 1$

How to solve this optimization, and more information on SVM (e.g., the dual form and kernels), can be found in ref [1].

[1] Support Vector Machines and other kernel-based learning methods, John Shawe-Taylor & Nello Cristianini, Cambridge University Press, 2000. http://www.support-vector.net
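A small sketch (in Python with scikit-learn; not from the slides) of solving this QP with an off-the-shelf linear SVM. The toy points are illustrative, chosen so that the solution comes out close to the w = (2/3, 2/3), b = -5 of the earlier example; a very large C approximates the hard-margin formulation above.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative, linearly separable toy points (not the slide's original data)
X = np.array([[1.0, 1.0], [3.0, 3.0],               # Class 1 (label -1)
              [3.0, 6.0], [6.0, 3.0], [7.0, 7.0]])  # Class 2 (label +1)
y = np.array([-1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)          # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print(w, b)                     # ≈ [0.667, 0.667], ≈ -5
print(2 / np.linalg.norm(w))    # margin m = 2 / ||w|| ≈ 2.12
```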
40. Extension to Non-linear Decision Boundary
So far, we have only considered large-margin classifier with a linear
decision boundary
How to generalize it to become nonlinear?
Key idea: transform xi to a higher dimensional space to “make life
easier”
Input space: the space where the points xi are located
Feature space: the space of f(xi) after transformation
Why transform?
Linear operation in the feature space is equivalent to a non-linear operation in the input space
Classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature x1x2 makes the problem linearly separable (see the sketch after this slide)
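A small illustration of the XOR remark above (in Python; not from the slides): in the original 2-D input space XOR is not linearly separable, but after adding the feature x1*x2 a single linear threshold separates the classes. The particular weights are one choice that works, not anything prescribed by the slides.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])                       # XOR labels

# Feature map f(x) = (x1, x2, x1*x2)
phi = np.column_stack([X, X[:, 0] * X[:, 1]])

# In the feature space, the hyperplane x1 + x2 - 2*x1*x2 = 0.5 separates the classes
scores = phi @ np.array([1.0, 1.0, -2.0])
print((scores > 0.5).astype(int))                # [0 1 1 0] -> matches y
```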
41. Transforming the Data
(Figure: points in the input space are mapped by f(.) to points f(x) in the feature space)
Note: the feature space is of higher dimension than the input space in practice
Computation in the feature space can be costly because it is high dimensional
The feature space is typically infinite-dimensional!
The kernel trick can help (more info. in ref [1])

[1] Support Vector Machines and other kernel-based learning methods, John Shawe-Taylor & Nello Cristianini, Cambridge University Press, 2000. http://www.support-vector.net
42. Why SVM Work?
The feature space is often very high dimensional. Why don’t we have
the curse of dimensionality?
A classifier in a high-dimensional space has many parameters and is
hard to estimate
Vapnik argues that the fundamental problem is not the number of
parameters to be estimated. Rather, the problem is about the flexibility
of a classifier
Typically, a classifier with many parameters is very flexible, but there
are also exceptions
Example: let $x_i = 10^{-i}$, where i ranges from 1 to n. A classifier of the form $\mathrm{sign}(\sin(a x))$ can classify all $x_i$ correctly for every possible combination of class labels on the $x_i$
This 1-parameter classifier is very flexible
43. Why SVM works?
Vapnik argues that the flexibility of a classifier should not be
characterized by the number of parameters, but by the flexibility
(capacity) of a classifier
This is formalized by the “VC-dimension” of a classifier
Consider a linear classifier in two-dimensional space
If we have three training data points, no matter how those points
are labeled, we can classify them perfectly
44. VC-dimension
However, if we have four points, we can find a labeling such that
the linear classifier fails to be perfect
We can see that 3 is the critical number
The VC-dimension of a linear classifier in a 2D space is 3
because, if we have 3 points in the training set, perfect
classification is always possible irrespective of the labeling,
whereas for 4 points, perfect classification can be impossible
45. Other Aspects of SVM
How to use SVM for multi-class classification?
Original SVM is for binary classification
One can change the QP formulation to become multi-class
More often, multiple binary classifiers are combined
One can train multiple one-versus-the-rest classifiers, or combine multiple pairwise
classifiers “intelligently”
How to interpret the SVM discriminant function value as probability?
By performing logistic regression on the SVM output of a set of data (validation
set) that is not used for training
Some SVM software (like libsvm) has these features built in
A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
Some implementations (such as LIBSVM) can handle multi-class classification
SVMlight is one of the earliest implementations of SVM
Several Matlab toolboxes for SVM are also available
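A short sketch of both points above (in Python with scikit-learn; not from the slides): scikit-learn's SVC combines binary SVMs for multi-class problems (one-versus-one by default) and, with probability=True, calibrates probabilities by fitting a logistic model to the SVM outputs (Platt scaling), much as described above. The toy data is illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative three-class toy data
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1],                 # class A
              [5, 5], [6, 5], [5, 6], [6, 6],                 # class B
              [0, 5], [1, 5], [0, 6], [1, 6]], dtype=float)   # class C
y = np.array(["A"] * 4 + ["B"] * 4 + ["C"] * 4)

clf = SVC(kernel="linear", probability=True, random_state=0).fit(X, y)
print(clf.predict([[5.5, 5.5]]))          # expected: ['B']
print(clf.predict_proba([[5.5, 5.5]]))    # calibrated probabilities for A, B, C
```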
46. Strengths and Weaknesses of SVM
Strengths
Training is relatively easy
No local optima, unlike in neural networks
It scales relatively well to high dimensional data
Tradeoff between classifier complexity and error can be
controlled explicitly
Non-traditional data like strings and trees can be used as
input to SVM, instead of feature vectors
Weaknesses
Need to choose a “good” kernel function.
47. Example: Predicting a class label using naïve Bayesian classification

RID  age    income  student  credit_rating  Class: buys_computer
 1   <=30   high    no       fair           no
 2   <=30   high    no       excellent      no
 3   31…40  high    no       fair           yes
 4   >40    medium  no       fair           yes
 5   >40    low     yes      fair           yes
 6   >40    low     yes      excellent      no
 7   31…40  low     yes      excellent      yes
 8   <=30   medium  no       fair           no
 9   <=30   low     yes      fair           yes
10   >40    medium  yes      fair           yes
11   <=30   medium  yes      excellent      yes
12   31…40  medium  no       excellent      yes
13   31…40  high    yes      fair           yes
14   >40    medium  no       excellent      no

Unknown sample:
15   <=30   medium  yes      fair           ?
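A worked check of this example (in Python; not from the slides), predicting buys_computer for the unknown sample RID 15 with the counting-based naïve Bayes estimates:

```python
from collections import Counter

rows = [  # (age, income, student, credit_rating, buys_computer) from the table above
    ("<=30","high","no","fair","no"),        ("<=30","high","no","excellent","no"),
    ("31..40","high","no","fair","yes"),     (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"),        (">40","low","yes","excellent","no"),
    ("31..40","low","yes","excellent","yes"),("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"),       (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"),("31..40","medium","no","excellent","yes"),
    ("31..40","high","yes","fair","yes"),    (">40","medium","no","excellent","no"),
]
unknown = ("<=30", "medium", "yes", "fair")   # RID 15

class_counts = Counter(r[-1] for r in rows)   # yes: 9, no: 5
scores = {}
for c, n_c in class_counts.items():
    score = n_c / len(rows)                   # prior P(c)
    for j, v in enumerate(unknown):           # P(v_j | c) by counting
        score *= sum(1 for r in rows if r[-1] == c and r[j] == v) / n_c
    scores[c] = score

print(scores)   # no ≈ 0.007, yes ≈ 0.028 -> predict buys_computer = yes
```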
48. Exercise

Using a naïve Bayesian classifier, predict the class (Play) for the unknown data samples below.

Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  N
Sunny     Hot          High      True   N
Overcast  Hot          High      False  Y
Rainy     Mild         High      False  Y
Rainy     Cool         Normal    False  Y
Rainy     Cool         Normal    True   N
Overcast  Cool         Normal    True   Y
Sunny     Mild         High      False  N
Sunny     Cool         Normal    False  Y
Rainy     Mild         Normal    False  Y
Sunny     Mild         Normal    True   Y
Overcast  Hot          Normal    False  Y
Overcast  Mild         High      True   Y
Rainy     Mild         High      True   N

Unknown data samples:
Sunny     Cool         Normal    False  ?
Rainy     Mild         High      False  ?