Attribute Type | Description | Examples | Operations
Nominal | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test
Ordinal | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests
Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests
Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation
Attribute Level | Transformation | Comments
Nominal | Any permutation of values | If all employee ID numbers were reassigned, would it make any difference?
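The permissible operations in the table above can be illustrated with Python's standard library (a minimal sketch; the sample values for each attribute type are hypothetical):

```python
from statistics import mode, median, mean

# Hypothetical samples for each attribute type (illustrative only).
eye_color = ["brown", "blue", "brown", "green"]   # nominal
hardness  = [1, 3, 3, 7, 9]                       # ordinal (rank values)
temps_c   = [20.0, 22.5, 19.5, 21.0]              # interval
masses_kg = [2.0, 4.0, 8.0]                       # ratio

print(mode(eye_color))   # nominal: only the mode is meaningful
print(median(hardness))  # ordinal: median and percentiles are meaningful
print(mean(temps_c))     # interval: mean and differences are meaningful

# ratio: ratios are meaningful, so the geometric mean is too
geometric_mean = (2.0 * 4.0 * 8.0) ** (1 / 3)
print(geometric_mean)
```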
16. ● Examples: Generic graph and HTML links
● Data objects are nodes, links are properties
[Figure: a small generic graph; nodes connected by edges labeled 1, 2, and 5]
<a href="papers/papers.html#bbbb">Data Mining</a>
<li><a href="papers/papers.html#aaaa">Graph Partitioning</a>
<li><a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations</a>
<li><a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers
22. Missing Values
● Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
● Handling missing values
– Eliminate data objects with missing values (unless too many objects would be lost)
– Estimate missing values (e.g., the average or the most common value)
– Ignore the missing value during analysis
– Replace with all possible values (weighted by their probabilities)
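Two of the strategies above, estimating a missing numeric value with the average and a missing categorical value with the most common value, can be sketched in plain Python (a minimal illustration; the records are hypothetical):

```python
from statistics import mean, mode

# Hypothetical attribute columns with missing values marked as None.
ages   = [25, None, 40, 35, None]
colors = ["brown", "blue", None, "brown"]

# Numeric attribute: estimate missing values with the average of known values.
known_ages = [a for a in ages if a is not None]
age_fill = mean(known_ages)
ages_imputed = [a if a is not None else age_fill for a in ages]

# Categorical attribute: estimate with the most common known value.
known_colors = [c for c in colors if c is not None]
colors_imputed = [c if c is not None else mode(known_colors) for c in colors]

print(ages_imputed)    # missing ages replaced by the mean of known ages
print(colors_imputed)  # missing color replaced by the mode
```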
53. Classification
Spring 2019
Header – dark yellow 24 points Arial Bold
Body text – white 20 points Arial Bold, dark yellow highlights
Bullets – dark yellow
Copyright – white 12 points Arial
Size:
Height: 7.52"
Width: 10.02"
Scale: 70%
Position on slide:
Horizontal - 0"
Vertical - 0"
Machine Learning Methods - Classification
CS 4319
Given a collection of records (training set)
- Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
54. A test set is used to estimate the accuracy of the model.
Goal: previously unseen records (test set) should be assigned a
class as accurately as possible.
Machine Learning – Classification Example
CS 4319
[Figure: an example data set with two categorical attributes, one continuous attribute, and a class column, divided into a training set and a test set]
58. Another Example of Decision Tree
CS 4319
There could be more than one tree that fits the same data!
[Figure: a second decision tree over the same categorical and continuous attributes, applied to the test data; start from the root of the tree at the Refund node and follow the branches (e.g., Taxable Income > 80K)]
67. General Structure of Hunt's Algorithm
[Figure: the learning algorithm builds a model from the training set (induction); the model is then applied to unseen records (deduction)]
Let Dt be the set of training records that reach a node t
General procedure:
– If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
– If Dt is an empty set, then t is a leaf node labeled by the default class, yd
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
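The three cases of the procedure above can be sketched as a recursive function (a minimal illustration, not the course's implementation; the attribute-selection step is simplified to taking the next available attribute, and the toy records are hypothetical):

```python
from collections import Counter

def hunt(records, labels, attributes, default):
    """Sketch of Hunt's algorithm. records: list of attribute dicts,
    labels: class of each record, attributes: names still available
    for testing, default: fallback class for empty subsets."""
    if not records:                      # empty D_t: leaf with default class
        return default
    classes = Counter(labels)
    if len(classes) == 1:                # all records in one class y_t
        return labels[0]
    if not attributes:                   # no test left: majority class
        return classes.most_common(1)[0][0]
    attr = attributes[0]                 # naive choice; real induction picks
                                         # the attribute with the best split
    majority = classes.most_common(1)[0][0]
    node = {"attr": attr, "children": {}}
    for value in set(r[attr] for r in records):
        subset = [(r, y) for r, y in zip(records, labels) if r[attr] == value]
        sub_r = [r for r, _ in subset]
        sub_y = [y for _, y in subset]
        node["children"][value] = hunt(sub_r, sub_y, attributes[1:], majority)
    return node

# Toy data: one binary attribute, two classes.
records = [{"Refund": "Yes"}, {"Refund": "No"}, {"Refund": "No"}]
labels = ["No", "No", "Yes"]
tree = hunt(records, labels, ["Refund"], "No")
print(tree)
```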
71. CS 4319
British Petroleum designed a decision tree for gas-oil separation
for offshore oil platforms that replaced an earlier rule-based
expert system.
We will do a similar (but simpler) decision tree example
towards the end of the semester.
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
Tree Induction
CS 4319
72. How to determine the Best Split
CS 4319
Before Splitting: 10 records of class 0,
10 records of class 1
Which test condition is the best?
How to determine the Best Split
CS 4319
Greedy approach:
Nodes with homogeneous class distribution are preferred
Need a measure of node impurity:
Non-homogeneous,
High degree of impurity
Homogeneous,
73. Low degree of impurity
Measures of Node Impurity
CS 4319
Gini Index
Entropy
Misclassification error
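The three impurity measures listed above can be sketched in a few lines of Python (an illustration, not the course's code; `counts` is a list of per-class record counts at a node):

```python
from math import log2

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def misclass_error(counts):
    n = sum(counts)
    return 1.0 - max(counts) / n

# Records equally distributed between two classes: high impurity.
print(gini([10, 10]), entropy([10, 10]), misclass_error([10, 10]))
# Homogeneous node: zero impurity by all three measures.
print(gini([20, 0]), entropy([20, 0]), misclass_error([20, 0]))
```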
How to Find the Best Split
CS 4319
[Figure: two candidate splits of the parent node (impurity M0): attribute test A? yields nodes N1 and N2 with combined impurity M12; attribute test B? yields nodes N3 and N4 with combined impurity M34]
Gain = M0 – M12 vs. M0 – M34: choose the split with the higher gain.
Measure of Impurity: GINI
CS 4319
Gini index for a given node t:

GINI(t) = 1 − Σ_j [p(j | t)]²

(NOTE: p(j | t) is the relative frequency of class j at node t.)
Maximum (0.5 in the two-class case) when records are equally distributed among all classes, implying the least interesting information
Minimum (0.0) when all records belong to one class, implying the most interesting information
79. Example (child nodes with class counts C1 = 1, C2 = 4 and C1 = 5, C2 = 2):
Gini(N1) = 1 − (1/5)² − (4/5)² = 0.320
Gini(N2) = 1 − (5/7)² − (2/7)² = 0.408
Gini(Children) = 5/12 × 0.320 + 7/12 × 0.408 = 0.371
Classification error at a node t :
Measures misclassification error made by a node.
Maximum (0.5) when records are equally distributed among all
classes, implying least interesting information
Minimum (0) when all records belong to one class, implying
most interesting information
Splitting Criteria based on Classification Error
CS 4319
80. Splitting Criteria based on Classification Error
CS 4319
P(C1) = 0/6 = 0 P(C2) = 6/6 = 1
Error = 1 – max (0, 1) = 1 – 1 = 0
P(C1) = 1/6 P(C2) = 5/6
Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6
P(C1) = 2/6 P(C2) = 4/6
Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
81. Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting (Next class!)
ANY IDEAS??
Tree Induction
CS 4319
Classification Methods
CS 4319
Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
100. [Figure: the decision tree over Refund, Marital Status, and Taxable Income applied to a test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?]
123. Parent: C1 = 6, C2 = 6; Gini = 0.500
After split:
N1: C1 = 1, C2 = 4; Gini = ?
N2: C1 = 5, C2 = 2; Gini = ?
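The "Gini = ?" entries can be checked with a short computation (a sketch using the class counts on this slide: parent 6/6, children 1/4 and 5/2):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent = [6, 6]
n1, n2 = [1, 4], [5, 2]          # (C1, C2) counts at each child node

g1, g2 = gini(n1), gini(n2)
# Children Gini is the size-weighted average over the two nodes.
weighted = (sum(n1) * g1 + sum(n2) * g2) / (sum(n1) + sum(n2))
print(round(gini(parent), 3))    # 0.5
print(round(g1, 3), round(g2, 3))
print(round(weighted, 3))
```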
124. Classification error at a node t:

Error(t) = 1 − max_i P(i | t)

[Exercise: compute the error for nodes with class counts (C1 = 1, C2 = 5), (C1 = 0, C2 = 6), and (C1 = 2, C2 = 4)]
Dr. Oner Celepcikay
CS 4319
Machine Learning
Week 6
Data Science Tool I – Classification Part II
Tree Induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues: determine how to split the records (how to specify the attribute test condition? how to determine the best split?) and determine when to stop splitting.

Stopping Criteria for Tree Induction
– Stop expanding a node when all the records belong to the same class
– Stop expanding a node when all the records have similar attribute values
– Early termination (to be discussed later)
126. Practical Issues of Classification
– Underfitting and Overfitting
– Missing Values
– Costs of Classification

Underfitting and Overfitting
Underfitting: when the model is too simple, both training and test errors are large

Overfitting due to Noise
The decision boundary is distorted by the noise points.
* Bats and whales are misclassified: non-mammals instead of mammals.
Both humans and dolphins were misclassified as non-mammals because their Body_Temp, Gives_Birth, and Four-legged values are identical to the mislabeled records in the training set.
Spiny anteaters represent an exceptional case (every warm-blooded animal that does not give birth is a non-mammal in the training set).
The decision tree perfectly fits the training data (training error = 0), but its error rate on the test data is 30%.
Estimating Generalization Errors
Re-substitution errors: error on the training set, e(t)
Methods for estimating generalization errors:
– Optimistic approach: e’(t) = e(t)
– Pessimistic approach: for each leaf node, add a 0.5 penalty to the error count, giving e’(T) = e(T) + N × 0.5 (N: number of leaf nodes). For a tree with 30 leaf nodes and 10 errors on training (out of 1000 instances): training error = 10/1000 = 1%; generalization error = (10 + 30 × 0.5)/1000 = 2.5%
– Reduced error pruning (REP): uses a validation data set to estimate the generalization error
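The pessimistic estimate in the 30-leaf example can be checked with a one-line function (a sketch of the arithmetic, with 0.5 as the per-leaf penalty used on these slides):

```python
def pessimistic_error(train_errors, n_leaves, n_instances, penalty=0.5):
    # e'(T) = (e(T) + N * penalty) / n, adding `penalty` per leaf node
    return (train_errors + n_leaves * penalty) / n_instances

# 10 training errors, 30 leaves, 1000 instances -> 2.5%
print(pessimistic_error(10, 30, 1000))  # (10 + 15) / 1000 = 0.025
```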
Occam’s Razor
Given two models with similar generalization errors, one should prefer the simpler model over the more complex model.
For complex models, there is a greater chance that the model was fitted accidentally by errors in the data.
Therefore, one should include model complexity when evaluating a model.
How to Address Overfitting
Pre-Pruning (Early Stopping Rule)
– Stop the algorithm before it becomes a fully-grown tree
– Typical stopping conditions for a node: stop if all instances belong to the same class; stop if all the attribute values are the same
– More restrictive conditions: stop if the number of instances is less than some user-specified threshold; stop if the class distribution of instances is independent of the available features; stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)
How to Address Overfitting…
Post-pruning
– Grow the decision tree to its entirety
– Trim the nodes of the decision tree in a bottom-up fashion
– If the generalization error improves after trimming, replace the sub-tree by a leaf node; the class label of the leaf node is determined from the majority class of instances in the sub-tree
Example of Post-Pruning
Root (before splitting): Class = Yes: 20, Class = No: 10; Error = ?
Training error (before splitting) = 10/30
Pessimistic error = (10 + 0.5)/30 = 10.5/30
Training error (after splitting) = 7/30
Pessimistic error (after splitting) = ?
Children (Class = Yes, Class = No): (8, 4), (2, 5), (6, 1), (4, 0)
PRUNE OR DO NOT PRUNE?
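Using the counts on this slide, the prune decision can be checked with a short computation (a sketch; the 0.5 per-leaf penalty follows the pessimistic estimate used on these slides, and each child contributes its minority-class count as training errors):

```python
def pessimistic(errors, n_leaves, n, penalty=0.5):
    return (errors + n_leaves * penalty) / n

n = 30
before = pessimistic(10, 1, n)                 # unsplit node = one leaf
# Four children with (Yes, No) counts; errors = minority count per child.
children = [(8, 4), (2, 5), (6, 1), (4, 0)]
train_after = sum(min(c) for c in children)    # 4 + 2 + 1 + 0 = 7 errors
after = pessimistic(train_after, len(children), n)

print(before, after)                           # 10.5/30 vs 9/30
print("PRUNE" if after >= before else "DO NOT PRUNE")
```

Since the pessimistic error drops from 10.5/30 to 9/30 after splitting, the sub-tree is kept.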
Handling Missing Attribute Values
Missing values affect decision tree construction in three different ways:
– Affects how impurity measures are computed
– Affects how to distribute an instance with a missing value to child nodes
– Affects how a test instance with a missing value is classified
Model Evaluation
– Metrics for Performance Evaluation: How to evaluate the performance of a model?
– Methods for Performance Evaluation: How to obtain reliable estimates?
– Methods for Model Comparison: How to compare the relative performance among competing models?
Metrics for Performance Evaluation
Focus on the predictive capability of a model, rather than how long it takes to classify or build models, scalability, etc.
Confusion matrix:

                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes  a (TP)      b (FN)
              Class=No   c (FP)      d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)

Metrics for Performance Evaluation…
Most widely-used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
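The accuracy metric follows directly from the four confusion-matrix cells (a minimal sketch; the counts below are hypothetical):

```python
def accuracy(tp, fn, fp, tn):
    # Accuracy = (a + d) / (a + b + c + d) in the slide's notation
    return (tp + tn) / (tp + fn + fp + tn)

# Hypothetical confusion-matrix counts: a, b, c, d.
a, b, c, d = 40, 10, 5, 45
print(accuracy(a, b, c, d))  # 85 / 100 = 0.85
```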
Limitation of Accuracy
Consider a 2-class problem:
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10
If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
Accuracy is misleading because the model does not detect any class 1 example
Cost Matrix
C(i|j): cost of misclassifying a class j example as class i

                         PREDICTED CLASS
       C(i|j)            Class=Yes    Class=No
ACTUAL CLASS  Class=Yes  C(Yes|Yes)   C(No|Yes)
              Class=No   C(Yes|No)    C(No|No)
Computing Cost of Classification
133. Cost matrix: C(+|+) = -1, C(-|+) = 100, C(+|-) = 1, C(-|-) = 0

Model M1 (rows = actual, columns = predicted):
+: 150, 40
-: 60, 250
Accuracy = 80%, Cost = 150×(-1) + 40×100 + 60×1 + 250×0 = 3910

Model M2 (rows = actual, columns = predicted):
+: 250, 45
-: 5, 200
Accuracy = 90%, Cost = 4255
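The cost and accuracy figures for M1 and M2 can be reproduced with a short computation (a sketch of the arithmetic, using the matrices on this slide):

```python
def total_cost(confusion, cost_matrix):
    """confusion[i][j]: number of actual-class-i records predicted as j;
    cost_matrix[i][j]: cost of predicting class j for actual class i."""
    return sum(confusion[i][j] * cost_matrix[i][j]
               for i in range(2) for j in range(2))

cost = [[-1, 100],   # actual +: C(+|+) = -1, C(-|+) = 100
        [1,   0]]    # actual -: C(+|-) = 1,  C(-|-) = 0

m1 = [[150, 40], [60, 250]]
m2 = [[250, 45], [5, 200]]

for m in (m1, m2):
    acc = (m[0][0] + m[1][1]) / sum(sum(row) for row in m)
    print(acc, total_cost(m, cost))
```

M2 is more accurate (90% vs. 80%) yet more costly (4255 vs. 3910), which is the point of the comparison.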
Cost vs Accuracy
Count:
                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes  a           b
              Class=No   c           d
Cost:
                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes  p           q
              Class=No   q           p
134. Methods for Performance Evaluation
How to obtain a reliable estimate of performance?
Performance of a model may depend on other factors besides the learning algorithm:
– Class distribution
– Cost of misclassification
– Size of training and test sets

Methods of Estimation
– Holdout: reserve 2/3 for training and 1/3 for testing
– Random subsampling: repeated holdout
– Cross validation: partition data into k disjoint subsets; k-fold: train on k-1 partitions, test on the remaining one; leave-one-out: k = n
– Stratified sampling: oversampling vs. undersampling
– Bootstrap: sampling with replacement
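The k-fold partitioning step can be sketched as follows (a minimal illustration; strided slicing after a shuffle is one simple way to form disjoint folds, and the fixed seed is only for reproducibility):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Partition indices 0..n-1 into k disjoint folds for cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(12, 3)
for i, test_fold in enumerate(folds):
    # Train on the other k-1 folds, test on this one (model fitting omitted).
    train = [j for f in folds if f is not test_fold for j in f]
    print(i, sorted(test_fold))
```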
[Figure: a multiway split A? with branches A1, A2, A3, A4]

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Dr. Oner Celepcikay
ITS 632
Data Mining
Algorithms: Clustering
Part I
Finding groups of objects such that the objects in a group will
be similar (or related) to one another and different from (or
unrelated to) the objects in other groups
Clustering Analysis
ITS 632
138. Inter-cluster distances are maximized
Intra-cluster distances are minimized
Supervised Learning
Unsupervised Learning
ITS 632
Clustering Analysis
143. Six Clusters
Partitional Clustering
A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
Hierarchical Clustering
A set of nested clusters organized as a hierarchical tree
ITS 632
Types of Clustering
146. Original Points
A Partitional Clustering
ITS 632
Hierarchical Clustering
Clusters Defined by an Objective Function
Finds clusters that minimize or maximize an objective function.
Enumerate all possible ways of dividing the points into clusters and evaluate the 'goodness' of each potential set of clusters using the given objective function.
Parameters for the model are determined from the data.
ITS 632
Types of Clustering: Objective Function
Map the clustering problem to a different domain
Proximity matrix defines a weighted graph, where the nodes are
the points being clustered, and the weighted edges represent the
proximities between points
Clustering is equivalent to breaking the graph into connected
components, one for each cluster.
Want to minimize the edge weight between clusters and
maximize the edge weight within clusters
ITS 632
Types of Clustering: Objective Function
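The graph view above, drop edges below a threshold and take connected components as clusters, can be sketched in plain Python (an illustration under assumed data; the 5-point similarity matrix is hypothetical):

```python
# Hypothetical 5-point similarity matrix; edges below the threshold are
# dropped, and each connected component of the rest becomes a cluster.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.7, 0.0, 0.1],
    [0.8, 0.7, 1.0, 0.2, 0.1],
    [0.1, 0.0, 0.2, 1.0, 0.9],
    [0.0, 0.1, 0.1, 0.9, 1.0],
]

def graph_clusters(sim, threshold):
    n = len(sim)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:                      # depth-first search over the
            u = stack.pop()               # thresholded similarity graph
            if u in seen:
                continue
            seen.add(u)
            comp.append(u)
            stack.extend(v for v in range(n)
                         if v != u and sim[u][v] >= threshold)
        clusters.append(sorted(comp))
    return clusters

print(graph_clusters(sim, 0.5))  # two components: {0,1,2} and {3,4}
```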
K-means and its variants
Hierarchical clustering
Density-based clustering
ITS 632
Clustering Algorithms
Partitional clustering approach
Each cluster is associated with a centroid (center point)
Each point is assigned to the cluster with the closest centroid
The number of clusters, K, must be specified
The basic algorithm is very simple
K-means Clustering
ITS 632
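The basic algorithm can be sketched in plain Python (an illustrative sketch, not the lecture's code; the 2-D points and the fixed seed are made up for the demo):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Basic K-means on tuples: assign each point to the nearest
    centroid, then recompute each centroid as the mean of its points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign p to the centroid with the smallest squared distance.
            i = min(range(k), key=lambda i: sum((a - b) ** 2
                    for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        # Recompute centroids; keep the old one if a cluster is empty.
        centroids = [tuple(sum(c) / len(pts) for c in zip(*pts))
                     if pts else centroids[i]
                     for i, pts in enumerate(clusters)]
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids, clusters = kmeans(points, 2)
print(sorted(centroids))
```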
Initial centroids are often chosen randomly.
Clusters produced vary from one run to another.
The centroid is the mean of the points in the cluster.
‘Closeness’ is measured by Euclidean distance, cosine
similarity, correlation, etc.
K-means will converge for common similarity measures
mentioned above.
Most of the convergence happens in first few iterations.
K-means Clustering
ITS 632
ITS 632
K-means Clustering in Action
ITS 632
K-means Clustering in Action
K-Means Animation
http://tech.nitoyon.com/en/blog/2013/11/07/k-means/
155. ITS 632
Importance of Choosing Initial Centroids
Multiple runs: helps, but probability is not on your side
Sample & use hierarchical clustering to find K centroids
Select more than K initial centroids and then select among these initial centroids (select the most widely separated)
Postprocessing