Chapter 6
UNIVERSITY OF INFORMATION TECHNOLOGY
Faculty of Information Systems
CLASSIFICATION
Cao Thi Nhan
1. Introduction
2. Decision Tree
3. Bayes Classification Methods
4. Neural network
5. K - Nearest Neighbor Classifier
6. Support Vector Machine
CONTENT
INTRODUCTION
Introduction
Supervised vs. Unsupervised Learning
Supervised Learning (classification)
Supervision: The training data (observations,
measurements,…) are accompanied by labels
indicating the class of the observations
New data is classified based on the training set
Unsupervised Learning (Clustering)
The class labels of training data are unknown
Given a set of measurements, observations, etc.
with the aim of establishing the existence of classes
or clusters in the data
Introduction
Classification
predicts categorical class labels
classifies data (constructs a model) based on the training set and
the values (class labels) of a classifying attribute, and uses the
model to classify new data
Model:
Training: use the training data to identify the classifier F(X);
each training example is a pair (xi, yi), where xi is the i-th object and yi is its class label
Testing: a new object x is classified (its class label is predicted) by the learned classifier
Introduction
Training
Introduction
Testing
Binary Classification
Introduction
Multiclass classification
Introduction
K-fold cross-validation: Evaluating Classifier Accuracy
Randomly partition the data into k mutually exclusive subsets
D1, …, Dk, each of approximately equal size
At the i-th iteration, use Di as the test set and the remaining
subsets as the training set
Typically k = 10
Leave-one-out: k folds where k = # of tuples, for
small-sized data
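For illustration (not part of the original slides), a minimal scikit-learn sketch of 10-fold cross-validation; the iris data and the decision-tree classifier are just stand-ins for any data set and classifier:

```python
# Hypothetical sketch: 10-fold cross-validation of a classifier's accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)             # stand-in feature matrix and labels
clf = DecisionTreeClassifier(random_state=0)  # stand-in classifier

# Randomly partition the data into k = 10 mutually exclusive folds; each fold
# is used once as the test set while the remaining folds form the training set.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print("fold accuracies:", scores.round(3))
print("mean accuracy:  ", round(scores.mean(), 3))
```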
Introduction
Confusion matrix: given m classes, the entry CMi,j of the
confusion matrix indicates the number of tuples of class i that
were labeled by the classifier as class j
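A small illustrative sketch (the labels and predictions below are made up) showing how such a matrix is produced and read with scikit-learn:

```python
# Hypothetical sketch: building and reading a confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = ["Yes", "Yes", "No", "No", "Yes", "No", "Yes", "No"]   # actual classes
y_pred = ["Yes", "No",  "No", "Yes", "Yes", "No", "Yes", "No"]  # classifier output

# Entry CM[i, j] counts tuples of true class i that were labeled as class j.
cm = confusion_matrix(y_true, y_pred, labels=["Yes", "No"])
print(cm)
# [[3 1]   3 "Yes" tuples labeled Yes, 1 "Yes" tuple labeled No
#  [1 3]]  1 "No" tuple labeled Yes, 3 "No" tuples labeled No
```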
Introduction
Classification:
1. Decision tree
2. Bayes classification methods
3. Neural network
4. Rough set
5. Regression
6. K- nearest neighbor (k-nn)
7. Support vector machine (SVM)
8. Fuzzy
…
DECISION TREE
Decision Tree
Introduction
Construct tree
Measure
Conclusion
❑ Duration: 1 min.
❑ Question:
Why did you decide to study this course “Data mining”?
Question
Should we play baseball today?
fwind : {weak, strong}
ftemperature : {hot, mild, cool}
fhumidity : {high, normal}
foutlook : {sunny, overcast, rainy}
Instance to classify: {sunny, mild, normal, strong}
[Figure: decision tree with Outlook at the root; Sunny → Humidity (Normal: Yes, High: No); Overcast → Yes; Rainy → Wind (Weak: Yes, Strong: No)]
Playball = {Yes, No}
{foutlook, ftemperature, fhumidity, fwind} → flearning
Should we play baseball today?
Conditions: {Outlook = Sunny, Temperature = Hot,
Humidity = Normal, Wind = Strong}
[Figure: the same decision tree; following Sunny → Humidity = Normal leads to the leaf Yes]
The answer: Yes, today we should play baseball.
Description: a decision tree is a tree consisting of a root node,
branch nodes (each representing a choice among alternatives),
and leaf nodes (each representing a decision).
[Figure: the same decision tree, annotated: Outlook is the root node, Humidity and Wind are branch nodes, the Yes/No outcomes are leaf nodes, and the labeled edges (Sunny, Overcast, Rainy, Weak, Strong, Normal, High) are branches]
Decision tree
Algorithm for Decision Tree
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-
conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are
discretized in advance)
Examples are partitioned recursively based on selected
attributes
Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain, Gini index…)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
There are no samples left
Algorithm for Decision Tree
Generate rules based on decision tree
IF (condition1) [AND (condition2) AND …] THEN Conclusion
IF outlook = sunny AND humidity = high THEN playball = no
IF outlook = overcast THEN playball = yes
IF outlook = rainy AND wind = weak THEN playball = yes
[Figure: the decision tree from the previous slides]
❑ Members: 3-5 students; Duration: 10 mins.
❑ Question: Is the data below ready for the decision tree
algorithm to be applied? Why? Propose your solution.
Group Discussion
Entropy
Entropy
A measure of uncertainty associated with a random variable
Entropy is used to build the tree
Calculation: entropy of a set S:
E(S) = − Σj=1..N FS(Aj) × log2 FS(Aj)
S: sample set
N: number of different class values among the samples in S
Aj: number of samples corresponding to class value j
FS(Aj): ratio of Aj to |S|
Example: S is a 14-sample set with 9 samples in class Yes and 5
samples in class No: E(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) ≈ 0.940
Entropy
Example:
A class has 35 students: 25 students do their homework and 10 students do not.
E = −(25/35) log2(25/35) − (10/35) log2(10/35) ≈ 0.863
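A minimal Python sketch (not from the slides) of the entropy calculation, applied to both examples above:

```python
# Hypothetical sketch: entropy of a labelled sample set from per-class counts.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

print(round(entropy([9, 5]), 3))    # play-ball set (9 Yes, 5 No)        -> ~0.94
print(round(entropy([25, 10]), 3))  # homework example (25 do, 10 don't) -> ~0.863
```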
Information Gain
Information gain of a sample set S based on attribute A:
G(S, A) = E(S) − Σi=1..m FS(Ai) × E(SAi)
G(S,A): information gain of set S based on attribute A
E(S): entropy of S
m: number of different values of attribute A
Ai: number of samples corresponding to value i of attribute A
FS(Ai): ratio of Ai to S
SAi: subset of S including all samples having value Ai
Information Gain
Day Outlook Temperature Humidity Wind Play ball
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rainy Mild High Weak Yes
D5 Rainy Cool Normal Weak Yes
D6 Rainy Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rainy Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rainy Mild High Strong No
Information Gain
G(S,Wind) = ?
S has 14 samples and 2 classes: 9 Yes, 5 No → E(S) ≈ 0.940
Wind has 2 different values: Weak, Strong
Wind = Weak (8 samples: 6 Yes, 2 No); Wind = Strong (6 samples: 3 Yes, 3 No)
With E(SWeak) ≈ 0.811 and E(SStrong) = 1.0:
G(S, Wind) = 0.940 − (8/14) × 0.811 − (6/14) × 1.0 ≈ 0.048
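A short self-contained sketch (not from the slides) that reproduces G(S, Wind); the attributes in the group discussion below can be computed the same way:

```python
# Hypothetical sketch: information gain of Wind on the play-ball data.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

E_S      = entropy([9, 5])   # whole set S: 9 Yes, 5 No
E_weak   = entropy([6, 2])   # Wind = Weak  : 6 Yes, 2 No
E_strong = entropy([3, 3])   # Wind = Strong: 3 Yes, 3 No

gain_wind = E_S - (8 / 14) * E_weak - (6 / 14) * E_strong
print(round(gain_wind, 3))   # ~0.048
```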
❑ Members: 3-5 students; Duration: 10 mins.
❑ Compute:
1. G(S, Outlook)
2. G(S, Temperature)
3. G(S, Humidity)
Group Discussion
Decision Tree
Outlook is the root (has maximal Information Gain)
Decision Tree
Outlook has 3 different values: sunny, overcast, and
rainy → The root has 3 branches
Which attribute should be chosen at Sunny branch?
(Outlook, Humidity, Temperature, Wind)
➢ Ssunny = {D1, D2, D8, D9, D11}, i.e., the 5 samples with
Outlook = Sunny
➢ Gain(Ssunny, Humidity) = 0.970
➢ Gain(Ssunny, Temperature) = 0.570
➢ Gain(Ssunny, Wind) = 0.019
➢ Select Humidity
Keep doing until all samples are classified or there are
no remaining attributes for further partitioning
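As a hedged, illustrative sketch (not part of the slides), the code below assumes pandas and scikit-learn, fits a tree to the play-ball table above, and prints the induced rules; scikit-learn uses binary tests on one-hot encoded attributes, so the printed rules are equivalent to, but not literally identical with, the multiway tree shown earlier:

```python
# Hypothetical sketch: fit a decision tree to the play-ball data and print its rules.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Outlook":     ["Sunny","Sunny","Overcast","Rainy","Rainy","Rainy","Overcast",
                    "Sunny","Sunny","Rainy","Sunny","Overcast","Overcast","Rainy"],
    "Temperature": ["Hot","Hot","Hot","Mild","Cool","Cool","Cool",
                    "Mild","Cool","Mild","Mild","Mild","Hot","Mild"],
    "Humidity":    ["High","High","High","High","Normal","Normal","Normal",
                    "High","Normal","Normal","Normal","High","Normal","High"],
    "Wind":        ["Weak","Strong","Weak","Weak","Weak","Strong","Strong",
                    "Weak","Weak","Weak","Strong","Strong","Weak","Strong"],
    "Play":        ["No","No","Yes","Yes","Yes","No","Yes",
                    "No","Yes","Yes","Yes","Yes","Yes","No"],
})

X = pd.get_dummies(data.drop(columns="Play"))   # one-hot encode the categorical attributes
y = data["Play"]

# criterion="entropy" corresponds to the information-gain measure above.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```

Using criterion="gini" instead would correspond to the Gini-index measure introduced below.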
Information Gain
Day Outlook Temperature Humidity Wind Play ball
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rainy Mild High Weak Yes
D5 Rainy Cool Normal Weak Yes
D6 Rainy Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rainy Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rainy Mild High Strong No
Gini index
Gini index of data D:
Gini(D) = 1 − Σj pj(D)²
With pj(D): the relative frequency of class j in D
Example: with the data set above:
14 samples: 9 Yes, 5 No
Gini(D) = 1 − (9/14)² − (5/14)² = 0.459
Gini index
If a data set D is split on A into k subsets D1, D2, …, Dk,
the Gini index GiniA(D) is defined as:
GiniA(D) = Σi=1..k (ni / N) × Gini(Di)
With:
➢ ni: #samples in subset Di (node i)
➢ N: #samples in D (the node being split on A)
Select the attribute with the minimal Gini index for
partitioning
Gini index
Day Outlook Temperature Humidity Wind Play ball
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rainy Mild High Weak Yes
D5 Rainy Cool Normal Weak Yes
D6 Rainy Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rainy Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rainy Mild High Strong No
Gini index
Gini(D) = 1 − (9/14)² − (5/14)² = 0.459 (9 Yes, 5 No)
1. GiniOutlook(D)
= 5/14 × Gini(Ssunny) + 4/14 × Gini(Sovercast) + 5/14 × Gini(Srainy)
= 0.343
With:
Gini(Ssunny) = 0.48 // 2 Yes, 3 No
Gini(Sovercast) = 0 // 4 Yes, 0 No
Gini(Srainy) = 0.48 // 3 Yes, 2 No
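A corresponding sketch (not from the slides) for the Gini computation above; the attributes in the group discussion below follow the same pattern:

```python
# Hypothetical sketch: Gini index of D and of the split on Outlook.
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

gini_D = gini([9, 5])                       # whole set: 9 Yes, 5 No
gini_outlook = (5 / 14) * gini([2, 3]) \
             + (4 / 14) * gini([4, 0]) \
             + (5 / 14) * gini([3, 2])      # Sunny, Overcast, Rainy subsets
print(round(gini_D, 3), round(gini_outlook, 3))   # 0.459 0.343
```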
❑ Members: 3-5 students; Duration: 5 mins.
❑ Compute:
1. Gini Temperature (D)
2. Gini Humidity (D)
3. Gini Wind (D)
Group Discussion
Gini index
Gini(D) = 1 − (9/14)² − (5/14)² = 0.459
1. GiniOutlook(D)= 0.343
2. GiniTemperature(D) = 0.440
3. GiniHumidity(D) = 0.367
4. GiniWind(D) = 0.428
→ Outlook is selected as the root (Gini index is the
minimal value)
Decision Tree
Conclusion:
Easy to understand and interpret
Requires data preprocessing (e.g., discretizing continuous attributes)
Scalability to very large (big) data sets can be an issue
1. Introduction
2. Decision Tree
3. Bayes Classification Method
4. Neural network
CONTENT
BAYES CLASSIFICATION
METHOD
Bayes classification
1. Introduction
2. Bayes classification
3. Comments
Bayes classification
Introduction
A statistical classifier: performs probabilistic
prediction, i.e., predicts class membership
probabilities
Foundation: Based on Bayes’ Theorem (1763)
Incremental: Each training example can
incrementally increase/decrease the probability that
a hypothesis is correct — prior knowledge can be
combined with observed data
Bayes’ Theorem: Basics
Let X be a data sample (“evidence”): class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), (i.e., posteriori
probability): the probability that the hypothesis holds given
the observed data sample X
P(H) (prior probability): the initial probability of H,
e.g., that X will play baseball, regardless of humidity, wind, outlook, …
P(X) (prior probability): the probability that the sample data X is
observed
P(H|X) = P(X|H) × P(H) / P(X)
Bayes’ Theorem: Basics
P(X|H) (likelihood): the probability of observing the sample X,
given that the hypothesis holds
Informally, this can be viewed as:
posterior = likelihood × prior / evidence
Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes
P(H|X) = P(X|H) × P(H) / P(X)
Bayes’ Theorem: Basics
Naïve Bayes Classifier: attributes are conditionally
independent (i.e., no dependence relation among attributes)
P(X|H): X=(x1, x2,…, xk)
P(x1,…,xk|H) = P(x1|H)·…·P(xk|H)
P(H|X) = P(X|H) × P(H) / P(X)
Bayes’ Classifier – Example
Outlook Temperature Humidity Wind Play ball
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rainy Mild High Weak Yes
Rainy Cool Normal Weak Yes
Rainy Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rainy Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rainy Mild High Strong No
Bayes’ Classifier – Example
Let X = (Outlook = Rainy, Temp = Cool, Humidity =
Normal, Wind = Weak) → X belongs to class Yes or No?
Compute → Predict
1. P(Play=Yes) *P(X|Play=Yes) = P(Play=Yes) *
P(Outlook=Rainy|Play=Yes)*P(Temp=Cool|Play=Yes)*
P(Humidity=Normal|Play=Yes)* P(Wind=Weak|Play=Yes)
2. P(Play=No) *P(X|Play=No) = P(Play=No) *
P(Outlook=Rainy|Play=No)*P(Temp=Cool|Play=No)*
P(Humidity=Normal|Play=No)* P(Wind=Weak|Play=No)
P(H|X) = P(X|H) × P(H) / P(X)
Bayes’ Classifier – Example
Let X = (Outlook = Rainy, Temp = Cool, Humidity =
Normal, Wind = Weak) → X belongs to class Yes or No?
Compute:
✓ P (Play=Yes) = 9/14; P(Play=No) = 5/14
✓ P(Outlook=Rainy|Play=Yes) = 3/9;
✓ P(Outlook=Rainy|Play=No) = 2/5;
❑ Members: 3-5 students; Duration: 5 mins.
❑ Compute:
1. P(Temp=Cool|Play=Yes) =
2. P(Temp=Cool|Play=No) =
3. P(Humidity=Normal|Play=Yes) =
4. P(Humidity=Normal|Play=No) =
5. P(Wind=Weak|Play=Yes) =
6. P(Wind=Weak|Play=No) =
Group Discussion
Bayes’ Classifier – Example
1. P(Temp=Cool|Play=Yes) = 3/9
2. P(Temp=Cool|Play=No) = 1/5
3. P(Humidity=Normal|Play=Yes) = 6/9
4. P(Humidity=Normal|Play=No) = 1/5
5. P(Wind=Weak|Play=Yes) = 6/9
6. P(Wind=Weak|Play=No) = 2/5
Bayes’ Classifier – Example
1. P(Play=Yes) *P(X|Play=Yes) = (9/14) *
(3/9) * (3/9) * (6/9) * (6/9) = 0.032
2. P(Play=No) *P(X|Play=No) = (5/14) *
(2/5) * (1/5) * (1/5) * (2/5) = 0.002
Conclusion: X = (Rainy, Cool, Normal, Weak) belongs to
class Play = Yes
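A brief sketch (not from the slides) that reproduces the two unnormalized scores by multiplying the probabilities read off the play-ball table:

```python
# Hypothetical sketch: naive Bayes scores for X = (Rainy, Cool, Normal, Weak).
# Each factor is a relative frequency taken from the play-ball table above.
p_yes = (9/14) * (3/9) * (3/9) * (6/9) * (6/9)   # prior * Outlook * Temp * Humidity * Wind
p_no  = (5/14) * (2/5) * (1/5) * (1/5) * (2/5)

print(round(p_yes, 3), round(p_no, 3))           # 0.032 0.002
print("predicted class:", "Yes" if p_yes > p_no else "No")
```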
❑ Members: 3-5 students;
❑ Duration: 5 mins.
❑ Let X = (Outlook = Sunny, Temp = Hot, Humidity = High,
Wind = Weak), predict X.
Group Discussion
❑ Members: 3-5 students;
❑ Duration: 5 mins.
❑ Let X = (Outlook = Overcast, Temp = Hot, Humidity =
High, Wind = Weak), predict X.
Group Discussion
Naïve Bayesian prediction requires each conditional
probability to be non-zero.
Bayes’ Classifier
Need to avoid the Zero-Probability Problem
Use the Laplacian correction (or Laplace estimator)
P(Ci) = (|Ci,D| + 1) / (|D| + m)
P(xk|Ci) = (#tuples of Ci,D with value xk + 1) / (|Ci,D| + r)
With:
- m: #classes
- r: #different values of the attribute
Bayes’ Classifier
Let X = (Outlook = Overcast, Temp = Hot, Humidity =
High, Wind = Weak) using Laplacian correction.
Compute:
✓ P(Play=Yes) = (9+1)/(14+2) = 10/16
✓ P(Play=No) = (5+1)/(14+2) = 6/16
✓ P(Outlook=Overcast|Play=Yes)=(4+1)/(9+3)=5/12
✓ P(Outlook=Overcast|Play=No) = 1/8
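A small sketch (not from the slides) of these Laplacian estimates; the counts come straight from the play-ball table:

```python
# Hypothetical sketch: Laplacian (add-one) estimates for Outlook = Overcast.
def laplace(count, class_size, r):
    """(count + 1) / (class_size + r); r = number of distinct attribute values."""
    return (count + 1) / (class_size + r)

# 14 tuples, m = 2 classes; Outlook has r = 3 values (Sunny, Overcast, Rainy).
p_yes = (9 + 1) / (14 + 2)               # 10/16
p_no  = (5 + 1) / (14 + 2)               # 6/16
p_overcast_given_yes = laplace(4, 9, 3)  # (4+1)/(9+3) = 5/12
p_overcast_given_no  = laplace(0, 5, 3)  # (0+1)/(5+3) = 1/8
print(p_overcast_given_yes, p_overcast_given_no)
```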
Comments
Advantages
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence, therefore loss of
accuracy
Practically, dependencies exist among variables
E.g., hospital patients: profile (age, family history, etc.),
symptoms (fever, cough, etc.), disease (lung cancer,
diabetes, etc.)
Dependencies among these cannot be modeled by Naïve
Bayes Classifier
1. Introduction
2. Decision Tree
3. Bayes Classification Method
4. Neural network
CONTENT
NEURAL NETWORK
Neural network
1. Introduction
2. Neural network
3. Comments
Neural network
Nervous system
Neural network
Neuron: Soma, Dendrite, Axon
https://science.howstuffworks.com/life/inside-the-mind/human-
Neural network
Artificial Neuron Model
McCulloch-Pitts neuron (1943)
Weights: wij
Net input: neti = Σj wij × xj
Activation function f
Threshold (θ)
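A tiny sketch (not from the slides) of a single McCulloch-Pitts-style neuron with a step (threshold) activation; the weights and threshold are made-up values:

```python
# Hypothetical sketch: a single neuron with net input sum_j(w_j * x_j)
# and a step (threshold) activation function.
def neuron(x, w, theta):
    net = sum(wj * xj for wj, xj in zip(w, x))   # net input
    return 1 if net >= theta else 0              # fire if the net input reaches the threshold

# Made-up weights/threshold: the neuron fires only when both inputs are 1 (logical AND).
print(neuron([1, 1], w=[0.6, 0.6], theta=1.0))   # 1
print(neuron([1, 0], w=[0.6, 0.6], theta=1.0))   # 0
```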
Neural network
Artificial Neuron Model
Activation functions
Neural network
Artificial Neuron Model
Activation functions
Neural network
Artificial Neuron Model
Comments
Advantages
High tolerance to noisy data
Ability to classify untrained patterns
Well-suited for continuous-valued inputs and
outputs
Successful on an array of real-world data, e.g.,
hand-written letters, …
Techniques have recently been developed for very
complicated topics
Comments
Disadvantages
Long training time
Require a number of parameters typically best
determined empirically, e.g., the network topology
or “structure.”
Poor interpretability: Difficult to interpret the
symbolic meaning behind the learned weights and
of “hidden units” in the network
1. Introduction
2. Decision Tree
3. Bayes Classification Methods
4. Neural network
5. K - Nearest Neighbor Classifier
6. Support Vector Machine
CONTENT
K - NEAREST NEIGHBOR CLASSIFIER
K - Nearest Neighbor Classifier
1. Introduction
2. K - Nearest Neighbor Classifier
3. Comments
Introduction
The k-nearest-neighbor method was first described in
the early 1950s
The idea is to search for the closest match(es) of the
test data in the feature space.
All instances correspond to points in the n-D space
The nearest neighbors are defined by a distance function
dist(X1, X2) (e.g., Euclidean distance)
K - Nearest Neighbor Classifier
The training tuples are described by n attributes
Each tuple represents a point in an n-dimensional
space → all the training tuples are stored in an n-
dimensional pattern space
When given an unknown tuple: a k-nearest-neighbor
classifier searches the pattern space for the k training
tuples that are closest to the unknown tuple
These k training tuples are the k “nearest neighbors” of
the unknown tuple.
K - Nearest Neighbor Classifier
“Closeness” is defined in terms of a distance metric
(such as Euclidean distance)
For X1 = (x11, x12, …, x1n) and X2 = (x21, x22, …, x2n): dist(X1, X2) = sqrt(Σi (x1i − x2i)²)
For a discrete-valued target, k-NN returns the most common
value (majority class) among the k training examples nearest to the query xq
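A minimal sketch (not from the slides) of k-NN with Euclidean distance and a majority vote; the training points and the query are made up:

```python
# Hypothetical sketch: k-NN classification with Euclidean distance and majority vote.
from collections import Counter
from math import dist   # Euclidean distance (Python 3.8+)

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]   # made-up labelled points

def knn_predict(xq, train, k=3):
    # Take the k training points closest to the query xq ...
    neighbors = sorted(train, key=lambda t: dist(xq, t[0]))[:k]
    # ... and return the most common class label among them.
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

print(knn_predict((1.1, 0.9), train, k=3))   # -> A
```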
K - Nearest Neighbor Classifier
K = 1, 3, 4, or 7?
K should be an odd number (to avoid ties)
Should all neighbours have equal importance?
Weighted kNN: each neighbour is weighted according to
its distance to the new-comer:
https://docs.opencv.org/3.4/d5/d26/tutorial_py_knn_understanding.html
w = 1 / d(xq, xi)²
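A hedged variant of the previous sketch with distance weighting, w = 1/d(xq, xi)²; again the points are made up:

```python
# Hypothetical sketch: distance-weighted k-NN vote with weight w = 1 / d(xq, xi)^2.
from collections import defaultdict
from math import dist

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]   # made-up labelled points

def weighted_knn_predict(xq, train, k=3, eps=1e-12):
    neighbors = sorted(train, key=lambda t: dist(xq, t[0]))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        votes[label] += 1.0 / (dist(xq, point) ** 2 + eps)   # closer neighbours count more
    return max(votes, key=votes.get)

print(weighted_knn_predict((2.5, 2.5), train, k=3))
```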
Comments
How to choose k?
Extremely slow when classifying test tuples:
“learning” involves only memorizing (storing) the data;
all computation is deferred to testing and classification (lazy learning)
Which distance metric to use?
Robust to noisy data
1. Introduction
2. Decision Tree
3. Bayes Classification Methods
4. Neural network
5. K - Nearest Neighbor Classifier
6. Support Vector Machine
CONTENT
Support Vector Machine
1. Introduction
2. Support Vector Machine
3. Comments
Introduction
A relatively new classification method for both linear
and nonlinear data
It uses a nonlinear mapping to transform the original
training data into a higher dimension
With the new dimension, it searches for the linear
optimal separating hyperplane (i.e., “decision
boundary”)
With an appropriate nonlinear mapping to a
sufficiently high dimension, data from two classes can
always be separated by a hyperplane
SVM finds this hyperplane using support vectors
(“essential” training tuples) and margins (defined by
the support vectors)
SVM—History and Applications
Vapnik and colleagues (1992)—groundwork from Vapnik
& Chervonenkis’ statistical learning theory in 1960s
Features: training can be slow but accuracy is high owing
to their ability to model complex nonlinear decision
boundaries (margin maximization)
Used for: classification and numeric prediction
Applications:
handwritten digit recognition, object recognition,
speaker identification, benchmarking time-series
prediction tests
SVM—General Philosophy
Support Vectors
Small Margin Large Margin
SVM—Margins and Support Vectors
SVM—When Data Is Linearly Separable
Let the data D be (X1, y1), …, (X|D|, y|D|), where each Xi is a training tuple
and yi is its class label
There are infinitely many lines (hyperplanes) separating the two classes, but we want to
find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., the maximum
marginal hyperplane (MMH)
SVM—Linearly Separable
◼ A separating hyperplane can be written as
W ● X + b = 0
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
◼ For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0
◼ The hyperplanes defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
◼ Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
◼ This becomes a constrained (convex) quadratic optimization
problem: Quadratic objective function and linear constraints →
Quadratic Programming (QP) → Lagrangian multipliers
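An illustrative scikit-learn sketch (not from the slides): fit a linear SVM on a toy linearly separable set and read off the hyperplane W, b and the support vectors; a large C approximates the hard-margin case:

```python
# Hypothetical sketch: a maximum-margin linear SVM on toy 2-D data.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2],     # class -1
              [4, 4], [5, 4], [4, 5]])    # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ (almost) hard margin
print("W:", clf.coef_[0])                     # weight vector of W*X + b = 0
print("b:", clf.intercept_[0])                # bias
print("support vectors:\n", clf.support_vectors_)
```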
Why Is SVM Effective on High Dimensional Data?
◼ The complexity of the trained classifier is characterized by the # of
support vectors rather than the dimensionality of the data
◼ The support vectors are the essential or critical training examples —
they lie closest to the decision boundary (MMH)
◼ If all other training examples are removed and the training is
repeated, the same separating hyperplane would be found
◼ The number of support vectors found can be used to compute an
(upper) bound on the expected error rate of the SVM classifier, which
is independent of the data dimensionality
◼ Thus, an SVM with a small number of support vectors can have good
generalization, even when the dimensionality of the data is high
SVM: Different Kernel Functions
◼ Instead of computing the dot product on the transformed
data, it is mathematically equivalent to apply a kernel function
K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi)·Φ(Xj)
◼ Typical kernel functions: polynomial, Gaussian radial basis function (RBF), sigmoid
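A hedged sketch of the kernel trick: requesting kernel='rbf' makes scikit-learn apply the Gaussian kernel K(Xi, Xj) = exp(−gamma·||Xi − Xj||²) internally; the XOR-like toy data is made up:

```python
# Hypothetical sketch: a nonlinear SVM via the Gaussian (RBF) kernel,
# K(Xi, Xj) = exp(-gamma * ||Xi - Xj||^2), applied implicitly by scikit-learn.
import numpy as np
from sklearn.svm import SVC

# XOR-like toy data: not linearly separable in the original 2-D space.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="rbf", gamma=2.0, C=10.0).fit(X, y)
print(clf.predict([[0.1, 0.1], [0.9, 0.1]]))   # expected: [0 1]
```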
SVM Related Links
SVM Website: http://www.kernel-machines.org/
SVM practical guide: library for SVM
Representative implementations
LIBSVM: an efficient implementation of SVM, multi-class
classifications, nu-SVM, one-class SVM, including also
various interfaces with java, python, etc.
SVM-light: simpler but performance is not better than
LIBSVM, support only binary classification and only in C
SVM-torch: another recent implementation also written
in C
1. Introduction
2. Decision Tree
3. Bayes Classification Methods
4. Neural network
5. K - Nearest Neighbor Classifier
6. Support Vector Machine
CONTENT