1. DBM630: Data Mining and Data Warehousing
MS.IT. Rangsit University
Semester 2/2011
Lecture 7: Classification and Prediction
Naïve Bayes, Regression and SVM
by Kritsada Sriphaew (sriphaew.k AT gmail.com)
2. Topics
Statistical Modeling: Naïve Bayes Classification
sparseness problem
missing value
numeric attributes
Regression
Linear Regression
Regression Tree
Support Vector Machine
3. Statistical Modeling
“Opposite” of 1R: use all the attributes
Two assumptions: attributes are (1) equally important and (2) statistically independent (given the class value)
This means that knowledge about the value of a particular attribute doesn't tell us anything about the value of another attribute (if the class is known)
Although based on assumptions that are almost never
correct, this scheme works well in practice!
4. An Example: Evaluating the Weather Attributes (Revised)

Weather data:

Outlook   Temp.  Humidity  Windy  Play
sunny     hot    high      false  no
sunny     hot    high      true   no
overcast  hot    high      false  yes
rainy     mild   high      false  yes
rainy     cool   normal    false  yes
rainy     cool   normal    true   no
overcast  cool   normal    true   yes
sunny     mild   high      false  no
sunny     cool   normal    false  yes
rainy     mild   normal    false  yes
sunny     mild   normal    true   yes
overcast  mild   high      true   yes
overcast  hot    normal    false  yes
rainy     mild   high      true   no

1R rules and error rates:

Attribute  Rule              Error  Total Error
Outlook    sunny -> no       2/5    4/14
           overcast -> yes   0/4
           rainy -> yes      2/5
Temp.      hot -> no*        2/4    5/14
           mild -> yes       2/6
           cool -> yes       1/4
Humidity   high -> no        3/7    4/14
           normal -> yes     1/7
Windy      false -> yes      2/8    5/14
           true -> no*       3/6

1R chooses the attribute that produces rules with the smallest number of errors, i.e., rule 1 (Outlook) or rule 3 (Humidity).
5. Probabilities for the Weather Data
Probabilistic model: the counts in the weather data are converted into per-class relative frequencies, e.g., P(outlook = sunny | play = yes) = 2/9 and the prior P(play = yes) = 9/14 (the full table was shown as a figure on the original slide).
6. Bayes’s Rule
Probability of event H given evidence E:
$p(H|E) = \dfrac{p(E|H)\,p(H)}{p(E)}$
A priori probability of H: p(H)
Probability of event before evidence has
been seen
A posteriori probability of H: p(H|E)
Probability of event after evidence has been
seen
7. Naïve Bayes for Classification
Classification learning: what’s the probability of the class given
an instance?
Evidence E = instance
Event H = class value for instance
Naïve Bayes assumption: "independent feature model", i.e.,
the presence (or absence) of a particular attribute (or
feature) of a class is unrelated to the presence (or absence)
of any other attribute, therefore:
$p(H|E) = \dfrac{p(E|H)\,p(H)}{p(E)}$

$p(H|E_1, E_2, \ldots, E_n) = \dfrac{p(E_1|H)\,p(E_2|H)\cdots p(E_n|H)\,p(H)}{p(E)}$
8. Naïve Bayes for Classification
$p(\text{play}=y \mid \text{outlook}=s,\ \text{temp}=c,\ \text{humid}=h,\ \text{windy}=t)$
$= \dfrac{p(\text{out}=s \mid \text{pl}=y)\,p(\text{te}=c \mid \text{pl}=y)\,p(\text{hu}=h \mid \text{pl}=y)\,p(\text{wi}=t \mid \text{pl}=y)\,p(\text{pl}=y)}{p(\text{out}=s,\ \text{te}=c,\ \text{hu}=h,\ \text{wi}=t)}$
$= \dfrac{\frac{2}{9}\cdot\frac{3}{9}\cdot\frac{3}{9}\cdot\frac{3}{9}\cdot\frac{9}{14}}{p(\text{out}=s,\ \text{te}=c,\ \text{hu}=h,\ \text{wi}=t)}$
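A minimal sketch of the calculation above (in Python; not part of the original slides). The counts come from the weather table on slide 4, and the evidence term p(E) is obtained by normalizing over both classes; the function name posterior is mine.

```python
from collections import Counter

# Weather data from slide 4: (outlook, temp, humidity, windy, play)
data = [
    ("sunny","hot","high","false","no"),     ("sunny","hot","high","true","no"),
    ("overcast","hot","high","false","yes"), ("rainy","mild","high","false","yes"),
    ("rainy","cool","normal","false","yes"), ("rainy","cool","normal","true","no"),
    ("overcast","cool","normal","true","yes"),("sunny","mild","high","false","no"),
    ("sunny","cool","normal","false","yes"), ("rainy","mild","normal","false","yes"),
    ("sunny","mild","normal","true","yes"),  ("overcast","mild","high","true","yes"),
    ("overcast","hot","normal","false","yes"),("rainy","mild","high","true","no"),
]

def posterior(instance):
    """Naive Bayes: p(class) * prod_j p(value_j | class), normalized over classes."""
    class_counts = Counter(row[-1] for row in data)
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(data)                   # prior p(class)
        for j, value in enumerate(instance):      # likelihoods p(E_j | class)
            n_match = sum(1 for row in data if row[-1] == c and row[j] == value)
            score *= n_match / n_c
        scores[c] = score
    total = sum(scores.values())                  # p(E), by normalization
    return {c: s / total for c, s in scores.items()}

print(posterior(("sunny", "cool", "high", "true")))
# no ≈ 0.795, yes ≈ 0.205 -> predicted class: no
```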
9. The Sparseness Problem
(The “zero-frequency problem”)
What if an attribute value doesn't occur with every class value (e.g.
"Outlook = overcast" for class "no")?
Probability will be zero! P(outlook=overcast|play=no) = 0
A posteriori probability will also be zero! (No matter how likely the
other values are!)
P(play=no|outlook=overcast, temp=cool, humidity=high, windy=true) = 0
Remedy: add 1 to the count for every attribute value-class
combination (Laplace estimator)
Result: probabilities will never be zero! (also: stabilizes probability
estimates)
10. Modified Probability Estimates
In some cases adding a constant different from 1 might
be more appropriate
Example: attribute outlook for class yes
We can apply equal weights, or the weights need not be equal (as long as they sum to 1, i.e., p1 + p2 + p3 = 1)

Equal weights:
$p(\text{sunny}\mid\text{yes}) = \dfrac{2 + m/3}{9 + m}, \quad p(\text{overcast}\mid\text{yes}) = \dfrac{4 + m/3}{9 + m}, \quad p(\text{rainy}\mid\text{yes}) = \dfrac{3 + m/3}{9 + m}$

Normalized weights ($p_1 + p_2 + p_3 = 1$):
$p(\text{sunny}\mid\text{yes}) = \dfrac{2 + m\,p_1}{9 + m}, \quad p(\text{overcast}\mid\text{yes}) = \dfrac{4 + m\,p_2}{9 + m}, \quad p(\text{rainy}\mid\text{yes}) = \dfrac{3 + m\,p_3}{9 + m}$
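A small sketch of these smoothed estimates (in Python; not from the slides). The function name smoothed_prob and its parameters are mine; with m equal to the number of attribute values and equal weights, it reduces to the Laplace estimator of the previous slide.

```python
def smoothed_prob(count, class_count, m=3.0, p=None, k=3):
    """m-estimate of p(value | class).

    count       : how often the value occurs with this class (e.g. 2 for sunny/yes)
    class_count : how often the class occurs                 (e.g. 9 for yes)
    m           : weight given to the prior (m = k with equal weights = Laplace)
    p           : prior weight for this value; defaults to the equal weight 1/k
    k           : number of distinct values of the attribute (3 for outlook)
    """
    if p is None:
        p = 1.0 / k
    return (count + m * p) / (class_count + m)

# Equal weights, m = 3: Laplace-style estimates for outlook given play = yes
print(smoothed_prob(2, 9))   # sunny    -> (2 + 1) / (9 + 3) = 0.25
print(smoothed_prob(4, 9))   # overcast -> (4 + 1) / (9 + 3) ≈ 0.417
print(smoothed_prob(3, 9))   # rainy    -> (3 + 1) / (9 + 3) ≈ 0.333
```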
11. Missing Value Problem
Training: the instance is not included in the frequency count for that attribute value-class combination
Classification: the attribute is omitted from the calculation
12. Dealing with Numeric Attributes
Common assumption: attributes have a normal or
Gaussian probability distribution (given the class)
The probability density function for the normal
distribution is defined by:
The sample mean: $\mu = \dfrac{1}{n}\sum_{i=1}^{n} x_i$
The standard deviation: $\sigma = \sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n} (x_i - \mu)^2}$
The density function: $f(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
13. An Example: Evaluating the Weather Attributes (Numeric)
Outlook Temp. Humidity Windy Play
sunny 85 85 false no
sunny 80 90 true no
overcast 83 86 false yes
rainy 70 96 false yes
rainy 68 80 false yes
rainy 65 70 true no
overcast 64 65 true yes
sunny 72 95 false no
sunny 69 70 false yes
rainy 75 80 false yes
sunny 75 70 true yes
overcast 72 90 true yes
overcast 81 75 false yes
rainy 71 91 true no
14. Statistics for the Weather Data
(Table of per-class means and standard deviations for each numeric attribute, e.g., temperature given yes: mean 73, std. dev. 6.2; humidity given no: mean 86.2, std. dev. 9.7)

Example density values:

$f(\text{temperature}=66 \mid \text{yes}) = \dfrac{1}{\sqrt{2\pi}\cdot 6.2}\, e^{-\frac{(66-73)^2}{2\cdot 6.2^2}} = 0.0340$

$f(\text{humidity}=90 \mid \text{no}) = \dfrac{1}{\sqrt{2\pi}\cdot 9.7}\, e^{-\frac{(90-86.2)^2}{2\cdot 9.7^2}} = 0.0380$
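A quick check of these density values (in Python; not from the slides), using the per-class means and standard deviations quoted above:

```python
import math

def gaussian_density(x, mean, std):
    """Normal density f(x) used by Naive Bayes for numeric attributes."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

print(gaussian_density(66, 73.0, 6.2))   # f(temperature=66 | yes) ≈ 0.0340
print(gaussian_density(90, 86.2, 9.7))   # f(humidity=90 | no)    ≈ 0.0380
```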
15. Classify a New Case
Classify a new case (if there are missing values, in either the training data or the case being classified, omit them)
The case we would
like to predict
16. Probability Densities
Relationship between probability and density: $\Pr[x - \varepsilon/2 \le X \le x + \varepsilon/2] \approx \varepsilon \cdot f(x)$ for a small interval width $\varepsilon$
But: this doesn't change the calculation of a posteriori probabilities, because the factor $\varepsilon$ is cancelled out
Exact relationship: $\Pr[a \le X \le b] = \int_a^b f(t)\,dt$
17. Discussion of Naïve Bayes
Naïve Bayes works surprisingly well
(even if independence assumption is clearly violated)
Why? Because classification doesn’t require accurate
probability estimates as long as maximum probability
is assigned to correct class
However: adding too many redundant attributes will cause problems (e.g., identical attributes)
Note also: many numeric attributes are not normally distributed (use kernel density estimators instead)
18. General Bayesian Classification
Probabilistic learning: Calculate explicit probabilities
for hypothesis, among the most practical approaches
to certain types of learning problems
Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis
is correct. Prior knowledge can be combined with
observed data.
Probabilistic prediction: Predict multiple hypotheses,
weighted by their probabilities
19. Bayesian Theorem
Given training data D, posteriori probability of a hypothesis h,
P(h|D) follows the Bayes theorem
$P(h|D) = \dfrac{P(D|h)\,P(h)}{P(D)}$

MAP (maximum a posteriori) hypothesis:
$h_{MAP} \equiv \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} P(D|h)\,P(h)$

Difficulty: need initial knowledge of many probabilities, significant computational cost
If we assume $P(h_i) = P(h_j)$, the method can be further simplified by choosing the Maximum Likelihood (ML) hypothesis:
$h_{ML} \equiv \arg\max_{h_i \in H} P(h_i|D) = \arg\max_{h_i \in H} P(D|h_i)$
20. Naïve Bayes Classifiers
Assumption: attributes are conditionally independent:
$c_{MAP} = \arg\max_{c_i \in C} P(c_i \mid \{v_1, v_2, \ldots, v_J\}) = \arg\max_{c_i \in C} P(c_i)\prod_{j=1}^{J} P(v_j \mid c_i)$
Greatly reduces the computation cost, only count the class distribution.
However, it is seldom satisfied in practice, as attributes (variables) are often
correlated.
Attempts to overcome this limitation:
Bayesian networks, that combine Bayesian reasoning with causal relationships
between attributes
Decision trees, which reason on one attribute at a time, considering the most important attributes first
Association rules, which predict a class from several attributes together
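In practice the product of many small probabilities underflows, so implementations usually compare log-scores instead. A minimal sketch (in Python; not from the slides), assuming per-class priors and conditional probabilities have already been estimated; the data structures and the function name classify are hypothetical, chosen only to illustrate the argmax:

```python
import math

def classify(instance, priors, cond_probs):
    """Return argmax_c of log P(c) + sum_j log P(v_j | c).

    priors     : dict mapping class -> P(c)
    cond_probs : dict mapping class -> list (one dict per attribute) of value -> P(value | c)
    (illustrative structures, not a library API)
    """
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for j, value in enumerate(instance):
            score += math.log(cond_probs[c][j][value])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```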
21. Bayesian Belief Network
(An Example)
(Network figure, nodes: Storm, BusTourGroup, Lightening, Campfire, Thunder, ForestFire)

The conditional probability table (CPT) for the variable Campfire:

        (S,B)  (S,~B)  (~S,B)  (~S,~B)
 C      0.4    0.1     0.8     0.2
~C      0.6    0.9     0.2     0.8

• Network represents a set of conditional independence assertions.
• Directed acyclic graph, also called Bayes Nets
Attributes (variables) are often correlated.
Each variable is conditionally independent given its predecessors
22. Bayesian Belief Network
(Dependence and Independence)
(Figure: the same network and Campfire CPT as on the previous slide, with the probabilities 0.7 and 0.85 also shown in the figure)

Represents the joint probability distribution over all variables, e.g., P(Storm, BusTourGroup, ..., ForestFire)

In general, $P(y_1, y_2, \ldots, y_n) = \prod_{i=1}^{n} P(y_i \mid Parents(Y_i))$

where Parents(Yi) denotes the immediate predecessors of Yi in the graph
So, the joint distribution is fully defined by the graph plus the tables $P(y_i \mid Parents(Y_i))$
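A minimal sketch of this factorization (in Python; not from the slides) for the Storm / BusTourGroup / Campfire fragment of the network. The Campfire CPT is the one given on the slide; the priors for Storm and BusTourGroup use the 0.7 and 0.85 that appear in the figure, although their exact placement there is an assumption.

```python
# Assumed priors for the parent nodes (illustrative; placement of 0.7/0.85 is a guess)
p_storm = {True: 0.7, False: 0.3}
p_bus   = {True: 0.85, False: 0.15}

# P(Campfire = True | Storm, BusTourGroup) from the slide's CPT
p_campfire_given = {
    (True, True): 0.4, (True, False): 0.1,
    (False, True): 0.8, (False, False): 0.2,
}

def joint(storm, bus, campfire):
    """P(Storm, BusTourGroup, Campfire) = P(S) * P(B) * P(C | S, B)."""
    p_c = p_campfire_given[(storm, bus)]
    if not campfire:
        p_c = 1.0 - p_c
    return p_storm[storm] * p_bus[bus] * p_c

print(joint(True, True, True))   # 0.7 * 0.85 * 0.4 = 0.238
```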
23. Bayesian Belief Network
(Inference in Bayes Nets)
Infer the values of one or more network variables, given
observed values of others
Bayes net contains all information needed for this inference
If only one variable with unknown value, it is easy to infer it.
In the general case, the problem is NP-hard.
Nevertheless, there are three types of inference:
Top-down inference: p(Campfire|Storm)
Bottom-up inference: p(Storm|Campfire)
Hybrid inference: p(BusTourGroup|Storm,Campfire)
24. Bayesian Belief Network
(Training Bayesian Belief Networks)
Several variants of this learning task
Network structure might be known or unknown
Training examples might provide values of all network variables, or
just some
If the structure is known and all variables are observed:
then it is as easy as training a Naïve Bayes classifier
If the structure is known but only some variables are observed, e.g., we observe ForestFire, Storm, BusTourGroup, Thunder but not Lightening, Campfire:
use gradient ascent, converging to the network h that maximizes P(D|h)
25. Numerical Modeling: Regression
Numerical model is used for prediction
Counterparts exist for all schemes that we previously
discussed
Decision trees, statistical models, etc.
All classification schemes can be applied to
regression problems using discretization
Prediction: weighted average of intervals’ midpoints
(weighted according to class probabilities)
Regression is more difficult than classification (i.e., the evaluation measure is mean squared error rather than percent correct)
26. Linear Regression
Works most naturally with numeric attributes
Standard technique for numeric prediction
Outcome is a linear combination of attributes:
$Y = \sum_{j=0}^{k} w_j x_j = w_0 x_0 + w_1 x_1 + w_2 x_2 + \ldots + w_k x_k$
Weights are calculated from the training data
Predicted value for the first instance $X^{(1)}$:
$Y^{(1)} = \sum_{j=0}^{k} w_j x_j^{(1)} = w_0 x_0^{(1)} + w_1 x_1^{(1)} + \ldots + w_k x_k^{(1)}$
27. Minimize the Squared Error (I)
k+1 coefficients are chosen so that the squared error
on the training data is minimized
Squared error:
$\sum_{i=1}^{n} \left( y^{(i)} - \sum_{j=0}^{k} w_j x_j^{(i)} \right)^2$
Coefficients can be derived using standard matrix operations
Can be done if there are more instances than attributes (roughly speaking)
If there are fewer instances, there are many possible solutions
Minimization of absolute error is more difficult!
28. Minimize the Squared Error (II)
$Y = X\,w$

$\min_{w} \sum_{i=1}^{n} \left( y^{(i)} - \sum_{j=0}^{k} w_j x_j^{(i)} \right)^2
= \min_{w} \left\| \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{pmatrix}
- \begin{pmatrix} x_0^{(1)} & x_1^{(1)} & \cdots & x_k^{(1)} \\ \vdots & & & \vdots \\ x_0^{(n)} & x_1^{(n)} & \cdots & x_k^{(n)} \end{pmatrix}
\begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_k \end{pmatrix} \right\|^2$

$Y_{n\times 1} = X_{n\times (k+1)}\, w_{(k+1)\times 1}$
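A small sketch (in Python with numpy; not from the slides) of solving this least-squares problem directly. The x0 column of ones makes w[0] the intercept; the numbers are toy values taken from the first three rows of the salary example on the next slide.

```python
import numpy as np

# n x (k+1) design matrix: constant column x0 = 1, then the attribute values
X = np.array([[1.0, 3.0],
              [1.0, 8.0],
              [1.0, 9.0]])
y = np.array([30.0, 57.0, 64.0])     # targets

# Least-squares weights w minimizing ||y - Xw||^2
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(w)                              # [w0, w1]
```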
29. Example: Find the linear regression of salary data

$Y = \sum_{j=0}^{k} w_j x_j = w_0 x_0 + w_1 x_1 + w_2 x_2 + \ldots + w_k x_k$
For simplicity, $x_0 = 1$; with a single attribute X = {x1}, therefore $Y = w_0 + w_1 x_1$

Salary data:

Years of experience (x1)   Salary in $1000s (Y)
 3                          30
 8                          57
 9                          64
13                          72
 3                          36
 6                          43
11                          59
21                          90
 1                          20
16                          83

With the method of least squared error:
$w_1 = \dfrac{\sum_{i=1}^{s} (x_1^i - \bar{x})(y^i - \bar{y})}{\sum_{i=1}^{s} (x_1^i - \bar{x})^2} = 3.5$
$w_0 = \bar{y} - w_1 \bar{x} = 23.55$

where $\bar{x} = 9.1$, $\bar{y} = 55.4$, and s = # training instances = 10.

The predicted line is estimated by Y = 23.55 + 3.5 x1
Prediction for X = 10: Y = 23.55 + 3.5(10) = 58.55
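A quick check of the formulas above (in Python; not from the slides). The exact least-squares slope is about 3.54; the slide rounds it to 3.5, which gives the intercept 23.55.

```python
xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

x_bar = sum(xs) / len(xs)                                  # 9.1
y_bar = sum(ys) / len(ys)                                  # 55.4
w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)                   # ≈ 3.54 (slide rounds to 3.5)
w0 = y_bar - w1 * x_bar                                    # ≈ 23.2 (23.55 with the rounded slope)

print(w0 + w1 * 10)                                        # prediction for 10 years ≈ 58.6
```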
30. Classification using Linear Regression (One with the Others)
Any regression technique can be used for
classification
Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class, and 0 for those that do not
Prediction: predict the class corresponding to the model with the largest output value (membership value)
For linear regression, this is known as multi-response linear regression
For example, if the data has three classes {A, B, C}:
Linear Regression Model 1: predict 1 for class A and 0 for not A
Linear Regression Model 2: predict 1 for class B and 0 for not B
Linear Regression Model 3: predict 1 for class C and 0 for not C
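A minimal sketch of multi-response linear regression (in Python with numpy; not from the slides): one least-squares model per class trained on a 0/1 target, with prediction by the largest output. The toy data and the function names are mine.

```python
import numpy as np

# Toy 2-feature data with classes A, B, C (illustrative values)
X = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0], [1.0, 4.0], [0.8, 4.2]])
y = np.array(["A", "A", "B", "B", "C", "C"])
Xb = np.hstack([np.ones((len(X), 1)), X])        # add the constant x0 = 1 column

classes = ["A", "B", "C"]
weights = {c: np.linalg.lstsq(Xb, (y == c).astype(float), rcond=None)[0]
           for c in classes}                     # one regression per class

def predict(x):
    xb = np.concatenate([[1.0], x])
    outputs = {c: xb @ w for c, w in weights.items()}
    return max(outputs, key=outputs.get)         # class with largest membership value

print(predict(np.array([4.0, 4.0])))             # expected: 'B'
```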
31. Classification using Linear Regression (Pairwise Regression)
Another way of using regression for classification:
A regression function for every pair of classes, using only instances from these two classes
An output of +1 is assigned to one member of the pair, an output of -1 to the other
Prediction is done by voting
Class that receives the most votes is predicted
Alternative: "don't know" if there is no agreement
More likely to be accurate, but more expensive
For example, if the data has three classes {A, B, C}:
Linear Regression Model 1: predict +1 for class A and -1 for class B
Linear Regression Model 2: predict +1 for class A and -1 for class C
Linear Regression Model 3: predict +1 for class B and -1 for class C
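A brief sketch of pairwise classification by voting (in Python with numpy; not from the slides), reusing the least-squares idea: one +1/-1 regression per pair of classes. The function names are mine; Xb is a design matrix with a leading column of ones as in the previous sketch.

```python
from itertools import combinations
import numpy as np

def train_pairwise(Xb, y, classes):
    """One +1/-1 least-squares model per pair of classes."""
    models = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)
        target = np.where(y[mask] == a, 1.0, -1.0)
        models[(a, b)] = np.linalg.lstsq(Xb[mask], target, rcond=None)[0]
    return models

def predict_pairwise(models, xb):
    """Each pairwise model votes for one class; the most-voted class wins."""
    votes = {}
    for (a, b), w in models.items():
        winner = a if xb @ w > 0 else b     # +1 side votes for a, -1 side for b
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)
```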
33. Support Vector Machine (SVM)
SVM is related to statistical learning theory
SVM was first introduced in 1992 [1] by Vladimir Vapnik, a Soviet
Union researcher
SVM becomes popular because of its success in handwritten digit
recognition
1.1% test error rate for SVM. This is the same as the error rates of a
carefully constructed neural network, LeNet 4.
SVM is now regarded as an important example of "kernel methods", one of the key areas in machine learning.
SVM is popularly used in classification tasks
34. What is a good Decision Boundary?
A two-class, linearly separable classification problem (figure: points of Class 1 and Class 2 in the plane)
Many decision boundaries are possible!
The Perceptron algorithm can be used to find such a boundary
Different algorithms have been proposed
Are all decision boundaries equally good?
35. Examples of Bad Decision Boundaries
(Figure: two decision boundaries that pass very close to the training points of Class 1 or Class 2, contrasted with the best boundary)
36. Large-margin Decision Boundary
The decision boundary should be as far away from the
data of both classes as possible
We should maximize the margin, m
Distance between the origin and the line $w^T x = k$ is $k / \|w\|$
The margin is $m = \dfrac{2}{\|w\|}$
(Figure: the two classes, the separating hyperplane $w^T x + b = 0$, and the two margin hyperplanes $w^T x + b = 1$ and $w^T x + b = -1$, a distance m apart)
37. Example

Setting $w^T x + b = -1$ on the Class 1 support vector and $w^T x + b = +1$ on the Class 2 support vectors gives three equations in $w_1$, $w_2$, $b$; solving them gives
$w = \begin{pmatrix} 2/3 \\ 2/3 \end{pmatrix}, \quad b = -5$

Distance between the 2 hyperplanes:
$m = \dfrac{2}{\|w\|} = \dfrac{3}{\sqrt{2}} \approx 2.12$

(Figure: the two classes plotted on a 0-7 grid, the support vectors, the weight vector w, and the hyperplanes $w^T x + b = 1$, $w^T x + b = 0$, $w^T x + b = -1$)
38. Best boundary

Solve: maximize $m = \dfrac{2}{\|w\|}$, or equivalently minimize $\|w\|$
As we also want to prevent data points from falling into the margin, we add the following constraints for each point i:
$w^T x_i + b \ge 1$ for $x_i$ of the first class, and
$w^T x_i + b \le -1$ for $x_i$ of the second class.
For n points, this can be rewritten as:
$y_i (w^T x_i + b) \ge 1 \quad \text{for all } 1 \le i \le n$

(Figure: the same two classes, margin m, weight vector w, and hyperplanes $w^T x + b = 1$, $w^T x + b = 0$, $w^T x + b = -1$)
39. Primal form

Previously, the problem was difficult to solve because it depends on ||w||, the norm of w, which involves a square root.
We alter the equation by substituting $\|w\|$ with $\frac{1}{2}\|w\|^2$ (the factor of 1/2 being used for mathematical convenience).
This is called a "Quadratic programming (QP) optimization" problem:

Minimize in (w, b): $\frac{1}{2}\|w\|^2$
subject to (for any i = 1, ..., n): $y_i (w^T x_i + b) \ge 1$

How to solve this optimization, and more information on SVM (e.g., the dual form and kernels), can be found in ref [1].

[1] Support Vector Machines and other kernel-based learning methods, John Shawe-Taylor & Nello Cristianini, Cambridge University Press, 2000. http://www.support-vector.net
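A small sketch (in Python with scikit-learn; not from the slides) of solving this QP with an off-the-shelf linear SVM. The toy points are illustrative, chosen so that the solution comes out close to the w = (2/3, 2/3), b = -5 of the earlier example; a very large C approximates the hard-margin formulation above.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative, linearly separable toy points (not the slide's original data)
X = np.array([[1.0, 1.0], [3.0, 3.0],               # Class 1 (label -1)
              [3.0, 6.0], [6.0, 3.0], [7.0, 7.0]])  # Class 2 (label +1)
y = np.array([-1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)          # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print(w, b)                     # ≈ [0.667, 0.667], ≈ -5
print(2 / np.linalg.norm(w))    # margin m = 2 / ||w|| ≈ 2.12
```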
40. Extension to Non-linear Decision Boundary
So far, we have only considered large-margin classifier with a linear
decision boundary
How to generalize it to become nonlinear?
Key idea: transform xi to a higher dimensional space to “make life
easier”
Input space: the space where the points xi are located
Feature space: the space of f(xi) after transformation
Why transform?
Linear operation in the feature space is equivalent to a non-linear operation in the input space
Classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature x1x2 makes the problem linearly separable (see the sketch after this slide)
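A small illustration of the XOR remark above (in Python; not from the slides): in the original 2-D input space XOR is not linearly separable, but after adding the feature x1*x2 a single linear threshold separates the classes. The particular weights are one choice that works, not anything prescribed by the slides.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])                       # XOR labels

# Feature map f(x) = (x1, x2, x1*x2)
phi = np.column_stack([X, X[:, 0] * X[:, 1]])

# In the feature space, the hyperplane x1 + x2 - 2*x1*x2 = 0.5 separates the classes
scores = phi @ np.array([1.0, 1.0, -2.0])
print((scores > 0.5).astype(int))                # [0 1 1 0] -> matches y
```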
41. Transforming the Data
(Figure: points in the input space are mapped by f(.) to points f(x) in the feature space)
Note: the feature space is of higher dimension than the input space in practice
Computation in the feature space can be costly because it is high dimensional
The feature space is typically infinite-dimensional!
The kernel trick can help (more info. in ref [1])

[1] Support Vector Machines and other kernel-based learning methods, John Shawe-Taylor & Nello Cristianini, Cambridge University Press, 2000. http://www.support-vector.net
42. Why SVM Work?
The feature space is often very high dimensional. Why don’t we have
the curse of dimensionality?
A classifier in a high-dimensional space has many parameters and is
hard to estimate
Vapnik argues that the fundamental problem is not the number of
parameters to be estimated. Rather, the problem is about the flexibility
of a classifier
Typically, a classifier with many parameters is very flexible, but there
are also exceptions
Example: let $x_i = 10^{-i}$, where i ranges from 1 to n. A classifier of the form $\mathrm{sign}(\sin(a x))$ can classify all $x_i$ correctly for every possible combination of class labels on the $x_i$
This 1-parameter classifier is very flexible
43. Why SVM works?
Vapnik argues that the flexibility of a classifier should not be
characterized by the number of parameters, but by the flexibility
(capacity) of a classifier
This is formalized by the “VC-dimension” of a classifier
Consider a linear classifier in two-dimensional space
If we have three training data points, no matter how those points
are labeled, we can classify them perfectly
44. VC-dimension
However, if we have four points, we can find a labeling such that
the linear classifier fails to be perfect
We can see that 3 is the critical number
The VC-dimension of a linear classifier in a 2D space is 3
because, if we have 3 points in the training set, perfect
classification is always possible irrespective of the labeling,
whereas for 4 points, perfect classification can be impossible
45. Other Aspects of SVM
How to use SVM for multi-class classification?
Original SVM is for binary classification
One can change the QP formulation to become multi-class
More often, multiple binary classifiers are combined
One can train multiple one-versus-the-rest classifiers, or combine multiple pairwise
classifiers “intelligently”
How to interpret the SVM discriminant function value as probability?
By performing logistic regression on the SVM output of a set of data (validation
set) that is not used for training
Some SVM software (like libsvm) has these features built in
A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
Some implementations (such as LIBSVM) can handle multi-class classification
SVMlight is one of the earliest implementations of SVM
Several Matlab toolboxes for SVM are also available
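A short sketch of both points above (in Python with scikit-learn; not from the slides): scikit-learn's SVC combines binary SVMs for multi-class problems (one-versus-one by default) and, with probability=True, calibrates probabilities by fitting a logistic model to the SVM outputs (Platt scaling), much as described above. The toy data is illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative three-class toy data
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1],                 # class A
              [5, 5], [6, 5], [5, 6], [6, 6],                 # class B
              [0, 5], [1, 5], [0, 6], [1, 6]], dtype=float)   # class C
y = np.array(["A"] * 4 + ["B"] * 4 + ["C"] * 4)

clf = SVC(kernel="linear", probability=True, random_state=0).fit(X, y)
print(clf.predict([[5.5, 5.5]]))          # expected: ['B']
print(clf.predict_proba([[5.5, 5.5]]))    # calibrated probabilities for A, B, C
```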
46. Strengths and Weaknesses of SVM
Strengths
Training is relatively easy
No local optima, unlike in neural networks
It scales relatively well to high dimensional data
Tradeoff between classifier complexity and error can be
controlled explicitly
Non-traditional data like strings and trees can be used as
input to SVM, instead of feature vectors
Weaknesses
Need to choose a “good” kernel function.
47. Example: Predicting a class label using naïve Bayesian classification

RID  age    income  student  credit_rating  Class: buys_computer
 1   <=30   high    no       fair           no
 2   <=30   high    no       excellent      no
 3   31…40  high    no       fair           yes
 4   >40    medium  no       fair           yes
 5   >40    low     yes      fair           yes
 6   >40    low     yes      excellent      no
 7   31…40  low     yes      excellent      yes
 8   <=30   medium  no       fair           no
 9   <=30   low     yes      fair           yes
10   >40    medium  yes      fair           yes
11   <=30   medium  yes      excellent      yes
12   31…40  medium  no       excellent      yes
13   31…40  high    yes      fair           yes
14   >40    medium  no       excellent      no

Unknown sample:
15   <=30   medium  yes      fair           ?
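A worked check of this example (in Python; not from the slides), predicting buys_computer for the unknown sample RID 15 with the counting-based naïve Bayes estimates:

```python
from collections import Counter

rows = [  # (age, income, student, credit_rating, buys_computer) from the table above
    ("<=30","high","no","fair","no"),        ("<=30","high","no","excellent","no"),
    ("31..40","high","no","fair","yes"),     (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"),        (">40","low","yes","excellent","no"),
    ("31..40","low","yes","excellent","yes"),("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"),       (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"),("31..40","medium","no","excellent","yes"),
    ("31..40","high","yes","fair","yes"),    (">40","medium","no","excellent","no"),
]
unknown = ("<=30", "medium", "yes", "fair")   # RID 15

class_counts = Counter(r[-1] for r in rows)   # yes: 9, no: 5
scores = {}
for c, n_c in class_counts.items():
    score = n_c / len(rows)                   # prior P(c)
    for j, v in enumerate(unknown):           # P(v_j | c) by counting
        score *= sum(1 for r in rows if r[-1] == c and r[j] == v) / n_c
    scores[c] = score

print(scores)   # no ≈ 0.007, yes ≈ 0.028 -> predict buys_computer = yes
```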
48. Exercise

Using a naïve Bayesian classifier, predict the class (Play) for the unknown data samples below.

Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  N
Sunny     Hot          High      True   N
Overcast  Hot          High      False  Y
Rainy     Mild         High      False  Y
Rainy     Cool         Normal    False  Y
Rainy     Cool         Normal    True   N
Overcast  Cool         Normal    True   Y
Sunny     Mild         High      False  N
Sunny     Cool         Normal    False  Y
Rainy     Mild         Normal    False  Y
Sunny     Mild         Normal    True   Y
Overcast  Hot          Normal    False  Y
Overcast  Mild         High      True   Y
Rainy     Mild         High      True   N

Unknown data samples:
Sunny     Cool         Normal    False  ?
Rainy     Mild         High      False  ?