Attribute Type | Description | Examples | Operations
Nominal | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test
Ordinal | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests
Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests
Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation
Attribute Level | Transformation | Comments
Nominal | Any permutation of values | If all employee ID numbers were reassigned, would it make any difference?
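The permissible operations in the table above can be illustrated with Python's standard library (a minimal sketch; the sample values for each attribute type are hypothetical):

```python
from statistics import mode, median, mean

# Hypothetical samples for each attribute type (illustrative only).
eye_color = ["brown", "blue", "brown", "green"]   # nominal
hardness  = [1, 3, 3, 7, 9]                       # ordinal (rank values)
temps_c   = [20.0, 22.5, 19.5, 21.0]              # interval
masses_kg = [2.0, 4.0, 8.0]                       # ratio

print(mode(eye_color))   # nominal: only the mode is meaningful
print(median(hardness))  # ordinal: median and percentiles are meaningful
print(mean(temps_c))     # interval: mean and differences are meaningful

# ratio: ratios are meaningful, so the geometric mean is too
geometric_mean = (2.0 * 4.0 * 8.0) ** (1 / 3)
print(geometric_mean)
```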
16. ● Examples: Generic graph and HTML links
● Data objects are nodes, links are properties
[Figure: a small generic graph; nodes connected by edges labeled 1, 2, and 5]
<a href="papers/papers.html#bbbb">Data Mining</a>
<li><a href="papers/papers.html#aaaa">Graph Partitioning</a>
<li><a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations</a>
<li><a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers
22. Missing Values
● Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
● Handling missing values
– Eliminate data objects with missing values (unless too many objects would be lost)
– Estimate missing values (e.g., the average or the most common value)
– Ignore the missing value during analysis
– Replace with all possible values (weighted by their probabilities)
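Two of the strategies above, estimating a missing numeric value with the average and a missing categorical value with the most common value, can be sketched in plain Python (a minimal illustration; the records are hypothetical):

```python
from statistics import mean, mode

# Hypothetical attribute columns with missing values marked as None.
ages   = [25, None, 40, 35, None]
colors = ["brown", "blue", None, "brown"]

# Numeric attribute: estimate missing values with the average of known values.
known_ages = [a for a in ages if a is not None]
age_fill = mean(known_ages)
ages_imputed = [a if a is not None else age_fill for a in ages]

# Categorical attribute: estimate with the most common known value.
known_colors = [c for c in colors if c is not None]
colors_imputed = [c if c is not None else mode(known_colors) for c in colors]

print(ages_imputed)    # missing ages replaced by the mean of known ages
print(colors_imputed)  # missing color replaced by the mode
```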
53. Classification
Spring 2019
Header – dark yellow 24 points Arial Bold
Body text – white 20 points Arial Bold, dark yellow highlights
Bullets – dark yellow
Copyright – white 12 points Arial
Size:
Height: 7.52"
Width: 10.02"
Scale: 70%
Position on slide:
Horizontal - 0"
Vertical - 0"
Machine Learning Methods - Classification
CS 4319
Given a collection of records (training set)
- Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
54. A test set is used to estimate the accuracy of the model.
Goal: previously unseen records (test set) should be assigned a
class as accurately as possible.
Machine Learning – Classification Example
CS 4319
[Figure: an example data set with two categorical attributes, one continuous attribute, and a class column, divided into a training set and a test set]
58. Another Example of Decision Tree
CS 4319
There could be more than one tree that fits the same data!
[Figure: a second decision tree over the same categorical and continuous attributes, applied to the test data; start from the root of the tree at the Refund node and follow the branches (e.g., Taxable Income > 80K)]
67. General Structure of Hunt's Algorithm
[Figure: the learning algorithm builds a model from the training set (induction); the model is then applied to unseen records (deduction)]
Let Dt be the set of training records that reach a node t
General procedure:
– If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
– If Dt is an empty set, then t is a leaf node labeled by the default class, yd
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
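The three cases of the procedure above can be sketched as a recursive function (a minimal illustration, not the course's implementation; the attribute-selection step is simplified to taking the next available attribute, and the toy records are hypothetical):

```python
from collections import Counter

def hunt(records, labels, attributes, default):
    """Sketch of Hunt's algorithm. records: list of attribute dicts,
    labels: class of each record, attributes: names still available
    for testing, default: fallback class for empty subsets."""
    if not records:                      # empty D_t: leaf with default class
        return default
    classes = Counter(labels)
    if len(classes) == 1:                # all records in one class y_t
        return labels[0]
    if not attributes:                   # no test left: majority class
        return classes.most_common(1)[0][0]
    attr = attributes[0]                 # naive choice; real induction picks
                                         # the attribute with the best split
    majority = classes.most_common(1)[0][0]
    node = {"attr": attr, "children": {}}
    for value in set(r[attr] for r in records):
        subset = [(r, y) for r, y in zip(records, labels) if r[attr] == value]
        sub_r = [r for r, _ in subset]
        sub_y = [y for _, y in subset]
        node["children"][value] = hunt(sub_r, sub_y, attributes[1:], majority)
    return node

# Toy data: one binary attribute, two classes.
records = [{"Refund": "Yes"}, {"Refund": "No"}, {"Refund": "No"}]
labels = ["No", "No", "Yes"]
tree = hunt(records, labels, ["Refund"], "No")
print(tree)
```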
71. CS 4319
British Petroleum designed a decision tree for gas-oil separation
for offshore oil platforms that replaced an earlier rule-based
expert system.
We will do a similar (but simpler) decision tree example
towards the end of the semester.
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
Tree Induction
CS 4319
72. How to determine the Best Split
CS 4319
Before Splitting: 10 records of class 0,
10 records of class 1
Which test condition is the best?
How to determine the Best Split
CS 4319
Greedy approach:
Nodes with homogeneous class distribution are preferred
Need a measure of node impurity:
Non-homogeneous,
High degree of impurity
Homogeneous,
73. Low degree of impurity
Measures of Node Impurity
CS 4319
Gini Index
Entropy
Misclassification error
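The three impurity measures listed above can be sketched in a few lines of Python (an illustration, not the course's code; `counts` is a list of per-class record counts at a node):

```python
from math import log2

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def misclass_error(counts):
    n = sum(counts)
    return 1.0 - max(counts) / n

# Records equally distributed between two classes: high impurity.
print(gini([10, 10]), entropy([10, 10]), misclass_error([10, 10]))
# Homogeneous node: zero impurity by all three measures.
print(gini([20, 0]), entropy([20, 0]), misclass_error([20, 0]))
```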
How to Find the Best Split
CS 4319
[Figure: two candidate splits of the parent node (impurity M0): attribute test A? yields nodes N1 and N2 with combined impurity M12; attribute test B? yields nodes N3 and N4 with combined impurity M34]
Gain = M0 – M12 vs. M0 – M34: choose the split with the higher gain.
Measure of Impurity: GINI
CS 4319
Gini index for a given node t:

GINI(t) = 1 − Σ_j [p(j | t)]²

(NOTE: p(j | t) is the relative frequency of class j at node t.)
Maximum (0.5 in the two-class case) when records are equally distributed among all classes, implying the least interesting information
Minimum (0.0) when all records belong to one class, implying the most interesting information
79. Example (child nodes with class counts C1 = 1, C2 = 4 and C1 = 5, C2 = 2):
Gini(N1) = 1 − (1/5)² − (4/5)² = 0.320
Gini(N2) = 1 − (5/7)² − (2/7)² = 0.408
Gini(Children) = 5/12 × 0.320 + 7/12 × 0.408 = 0.371
Classification error at a node t :
Measures misclassification error made by a node.
Maximum (0.5) when records are equally distributed among all
classes, implying least interesting information
Minimum (0) when all records belong to one class, implying
most interesting information
Splitting Criteria based on Classification Error
CS 4319
80. Splitting Criteria based on Classification Error
CS 4319
P(C1) = 0/6 = 0 P(C2) = 6/6 = 1
Error = 1 – max (0, 1) = 1 – 1 = 0
P(C1) = 1/6 P(C2) = 5/6
Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6
P(C1) = 2/6 P(C2) = 4/6
Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
81. Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting (Next class!)
ANY IDEAS??
Tree Induction
CS 4319
Classification Methods
CS 4319
Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
100. [Figure: the decision tree over Refund, Marital Status, and Taxable Income applied to a test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?]
123. Parent: C1 = 6, C2 = 6; Gini = 0.500
After split:
N1: C1 = 1, C2 = 4; Gini = ?
N2: C1 = 5, C2 = 2; Gini = ?
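The "Gini = ?" entries can be checked with a short computation (a sketch using the class counts on this slide: parent 6/6, children 1/4 and 5/2):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent = [6, 6]
n1, n2 = [1, 4], [5, 2]          # (C1, C2) counts at each child node

g1, g2 = gini(n1), gini(n2)
# Children Gini is the size-weighted average over the two nodes.
weighted = (sum(n1) * g1 + sum(n2) * g2) / (sum(n1) + sum(n2))
print(round(gini(parent), 3))    # 0.5
print(round(g1, 3), round(g2, 3))
print(round(weighted, 3))
```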
124. Classification error at a node t:

Error(t) = 1 − max_i P(i | t)

[Exercise: compute the error for nodes with class counts (C1 = 1, C2 = 5), (C1 = 0, C2 = 6), and (C1 = 2, C2 = 4)]
Dr. Oner Celepcikay
CS 4319
Machine Learning
Week 6
Data Science Tool I – Classification Part II
Tree Induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues: determine how to split the records (how to specify the attribute test condition? how to determine the best split?) and determine when to stop splitting.

Stopping Criteria for Tree Induction
– Stop expanding a node when all the records belong to the same class
– Stop expanding a node when all the records have similar attribute values
– Early termination (to be discussed later)
126. Practical Issues of Classification
– Underfitting and Overfitting
– Missing Values
– Costs of Classification

Underfitting and Overfitting
Underfitting: when the model is too simple, both training and test errors are large

Overfitting due to Noise
The decision boundary is distorted by the noise points.
* Bats and whales are misclassified: non-mammals instead of mammals.
Both humans and dolphins were misclassified as non-mammals because their Body_Temp, Gives_Birth, and Four-legged values are identical to the mislabeled records in the training set.
Spiny anteaters represent an exceptional case (every warm-blooded animal that does not give birth is a non-mammal in the training set).
The decision tree perfectly fits the training data (training error = 0), but its error rate on the test data is 30%.
Estimating Generalization Errors
Re-substitution errors: error on the training set, e(t)
Methods for estimating generalization errors:
– Optimistic approach: e’(t) = e(t)
– Pessimistic approach: for each leaf node, add a 0.5 penalty to the error count, giving e’(T) = e(T) + N × 0.5 (N: number of leaf nodes). For a tree with 30 leaf nodes and 10 errors on training (out of 1000 instances): training error = 10/1000 = 1%; generalization error = (10 + 30 × 0.5)/1000 = 2.5%
– Reduced error pruning (REP): uses a validation data set to estimate the generalization error
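The pessimistic estimate in the 30-leaf example can be checked with a one-line function (a sketch of the arithmetic, with 0.5 as the per-leaf penalty used on these slides):

```python
def pessimistic_error(train_errors, n_leaves, n_instances, penalty=0.5):
    # e'(T) = (e(T) + N * penalty) / n, adding `penalty` per leaf node
    return (train_errors + n_leaves * penalty) / n_instances

# 10 training errors, 30 leaves, 1000 instances -> 2.5%
print(pessimistic_error(10, 30, 1000))  # (10 + 15) / 1000 = 0.025
```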
Occam’s Razor
Given two models with similar generalization errors, one should prefer the simpler model over the more complex model.
For complex models, there is a greater chance that the model was fitted accidentally by errors in the data.
Therefore, one should include model complexity when evaluating a model.
How to Address Overfitting
Pre-Pruning (Early Stopping Rule)
– Stop the algorithm before it becomes a fully-grown tree
– Typical stopping conditions for a node: stop if all instances belong to the same class; stop if all the attribute values are the same
– More restrictive conditions: stop if the number of instances is less than some user-specified threshold; stop if the class distribution of instances is independent of the available features; stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)
How to Address Overfitting…
Post-pruning
– Grow the decision tree to its entirety
– Trim the nodes of the decision tree in a bottom-up fashion
– If the generalization error improves after trimming, replace the sub-tree by a leaf node; the class label of the leaf node is determined from the majority class of instances in the sub-tree
Example of Post-Pruning
Root (before splitting): Class = Yes: 20, Class = No: 10; Error = ?
Training error (before splitting) = 10/30
Pessimistic error = (10 + 0.5)/30 = 10.5/30
Training error (after splitting) = 7/30
Pessimistic error (after splitting) = ?
Children (Class = Yes, Class = No): (8, 4), (2, 5), (6, 1), (4, 0)
PRUNE OR DO NOT PRUNE?
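Using the counts on this slide, the prune decision can be checked with a short computation (a sketch; the 0.5 per-leaf penalty follows the pessimistic estimate used on these slides, and each child contributes its minority-class count as training errors):

```python
def pessimistic(errors, n_leaves, n, penalty=0.5):
    return (errors + n_leaves * penalty) / n

n = 30
before = pessimistic(10, 1, n)                 # unsplit node = one leaf
# Four children with (Yes, No) counts; errors = minority count per child.
children = [(8, 4), (2, 5), (6, 1), (4, 0)]
train_after = sum(min(c) for c in children)    # 4 + 2 + 1 + 0 = 7 errors
after = pessimistic(train_after, len(children), n)

print(before, after)                           # 10.5/30 vs 9/30
print("PRUNE" if after >= before else "DO NOT PRUNE")
```

Since the pessimistic error drops from 10.5/30 to 9/30 after splitting, the sub-tree is kept.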
Handling Missing Attribute Values
Missing values affect decision tree construction in three different ways:
– Affects how impurity measures are computed
– Affects how to distribute an instance with a missing value to child nodes
– Affects how a test instance with a missing value is classified
Model Evaluation
– Metrics for Performance Evaluation: How to evaluate the performance of a model?
– Methods for Performance Evaluation: How to obtain reliable estimates?
– Methods for Model Comparison: How to compare the relative performance among competing models?
Metrics for Performance Evaluation
Focus on the predictive capability of a model, rather than how long it takes to classify or build models, scalability, etc.
Confusion matrix:

                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes  a (TP)      b (FN)
              Class=No   c (FP)      d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)

Metrics for Performance Evaluation…
Most widely-used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
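The accuracy metric follows directly from the four confusion-matrix cells (a minimal sketch; the counts below are hypothetical):

```python
def accuracy(tp, fn, fp, tn):
    # Accuracy = (a + d) / (a + b + c + d) in the slide's notation
    return (tp + tn) / (tp + fn + fp + tn)

# Hypothetical confusion-matrix counts: a, b, c, d.
a, b, c, d = 40, 10, 5, 45
print(accuracy(a, b, c, d))  # 85 / 100 = 0.85
```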
Limitation of Accuracy
Consider a 2-class problem:
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10
If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
Accuracy is misleading because the model does not detect any class 1 example
Cost Matrix
C(i|j): cost of misclassifying a class j example as class i

                         PREDICTED CLASS
       C(i|j)            Class=Yes    Class=No
ACTUAL CLASS  Class=Yes  C(Yes|Yes)   C(No|Yes)
              Class=No   C(Yes|No)    C(No|No)
Computing Cost of Classification
133. Cost matrix: C(+|+) = -1, C(-|+) = 100, C(+|-) = 1, C(-|-) = 0

Model M1 (rows = actual, columns = predicted):
+: 150, 40
-: 60, 250
Accuracy = 80%, Cost = 150×(-1) + 40×100 + 60×1 + 250×0 = 3910

Model M2 (rows = actual, columns = predicted):
+: 250, 45
-: 5, 200
Accuracy = 90%, Cost = 4255
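The cost and accuracy figures for M1 and M2 can be reproduced with a short computation (a sketch of the arithmetic, using the matrices on this slide):

```python
def total_cost(confusion, cost_matrix):
    """confusion[i][j]: number of actual-class-i records predicted as j;
    cost_matrix[i][j]: cost of predicting class j for actual class i."""
    return sum(confusion[i][j] * cost_matrix[i][j]
               for i in range(2) for j in range(2))

cost = [[-1, 100],   # actual +: C(+|+) = -1, C(-|+) = 100
        [1,   0]]    # actual -: C(+|-) = 1,  C(-|-) = 0

m1 = [[150, 40], [60, 250]]
m2 = [[250, 45], [5, 200]]

for m in (m1, m2):
    acc = (m[0][0] + m[1][1]) / sum(sum(row) for row in m)
    print(acc, total_cost(m, cost))
```

M2 is more accurate (90% vs. 80%) yet more costly (4255 vs. 3910), which is the point of the comparison.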
Cost vs Accuracy
Count:
                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes  a           b
              Class=No   c           d
Cost:
                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes  p           q
              Class=No   q           p
134. Methods for Performance Evaluation
How to obtain a reliable estimate of performance?
Performance of a model may depend on other factors besides the learning algorithm:
– Class distribution
– Cost of misclassification
– Size of training and test sets

Methods of Estimation
– Holdout: reserve 2/3 for training and 1/3 for testing
– Random subsampling: repeated holdout
– Cross validation: partition data into k disjoint subsets; k-fold: train on k-1 partitions, test on the remaining one; leave-one-out: k = n
– Stratified sampling: oversampling vs. undersampling
– Bootstrap: sampling with replacement
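The k-fold partitioning step can be sketched as follows (a minimal illustration; strided slicing after a shuffle is one simple way to form disjoint folds, and the fixed seed is only for reproducibility):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Partition indices 0..n-1 into k disjoint folds for cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(12, 3)
for i, test_fold in enumerate(folds):
    # Train on the other k-1 folds, test on this one (model fitting omitted).
    train = [j for f in folds if f is not test_fold for j in f]
    print(i, sorted(test_fold))
```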
[Figure: a multiway split A? with branches A1, A2, A3, A4]

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Dr. Oner Celepcikay
ITS 632
Data Mining
Algorithms: Clustering
Part I
Finding groups of objects such that the objects in a group will
be similar (or related) to one another and different from (or
unrelated to) the objects in other groups
Clustering Analysis
ITS 632
138. Inter-cluster distances are maximized
Intra-cluster distances are minimized
Supervised Learning
Unsupervised Learning
ITS 632
Clustering Analysis
143. Six Clusters
Partitional Clustering
A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
Hierarchical Clustering
A set of nested clusters organized as a hierarchical tree
ITS 632
Types of Clustering
146. Original Points
A Partitional Clustering
ITS 632
Hierarchical Clustering
Clusters Defined by an Objective Function
Finds clusters that minimize or maximize an objective function.
Enumerate all possible ways of dividing the points into clusters and evaluate the 'goodness' of each potential set of clusters using the given objective function.
Parameters for the model are determined from the data.
ITS 632
Types of Clustering: Objective Function
Map the clustering problem to a different domain
Proximity matrix defines a weighted graph, where the nodes are
the points being clustered, and the weighted edges represent the
proximities between points
Clustering is equivalent to breaking the graph into connected
components, one for each cluster.
Want to minimize the edge weight between clusters and
maximize the edge weight within clusters
ITS 632
Types of Clustering: Objective Function
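The graph view above, drop edges below a threshold and take connected components as clusters, can be sketched in plain Python (an illustration under assumed data; the 5-point similarity matrix is hypothetical):

```python
# Hypothetical 5-point similarity matrix; edges below the threshold are
# dropped, and each connected component of the rest becomes a cluster.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.7, 0.0, 0.1],
    [0.8, 0.7, 1.0, 0.2, 0.1],
    [0.1, 0.0, 0.2, 1.0, 0.9],
    [0.0, 0.1, 0.1, 0.9, 1.0],
]

def graph_clusters(sim, threshold):
    n = len(sim)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:                      # depth-first search over the
            u = stack.pop()               # thresholded similarity graph
            if u in seen:
                continue
            seen.add(u)
            comp.append(u)
            stack.extend(v for v in range(n)
                         if v != u and sim[u][v] >= threshold)
        clusters.append(sorted(comp))
    return clusters

print(graph_clusters(sim, 0.5))  # two components: {0,1,2} and {3,4}
```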
K-means and its variants
Hierarchical clustering
Density-based clustering
ITS 632
Clustering Algorithms
Partitional clustering approach
Each cluster is associated with a centroid (center point)
Each point is assigned to the cluster with the closest centroid
The number of clusters, K, must be specified
The basic algorithm is very simple
K-means Clustering
ITS 632
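The basic algorithm can be sketched in plain Python (an illustrative sketch, not the lecture's code; the 2-D points and the fixed seed are made up for the demo):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Basic K-means on tuples: assign each point to the nearest
    centroid, then recompute each centroid as the mean of its points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign p to the centroid with the smallest squared distance.
            i = min(range(k), key=lambda i: sum((a - b) ** 2
                    for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        # Recompute centroids; keep the old one if a cluster is empty.
        centroids = [tuple(sum(c) / len(pts) for c in zip(*pts))
                     if pts else centroids[i]
                     for i, pts in enumerate(clusters)]
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids, clusters = kmeans(points, 2)
print(sorted(centroids))
```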
Initial centroids are often chosen randomly.
Clusters produced vary from one run to another.
The centroid is the mean of the points in the cluster.
‘Closeness’ is measured by Euclidean distance, cosine
similarity, correlation, etc.
K-means will converge for common similarity measures
mentioned above.
Most of the convergence happens in first few iterations.
K-means Clustering
ITS 632
ITS 632
K-means Clustering in Action
ITS 632
K-means Clustering in Action
K-Means Animation
http://tech.nitoyon.com/en/blog/2013/11/07/k-means/
155. ITS 632
Importance of Choosing Initial Centroids
Multiple runs: helps, but probability is not on your side
Sample & use hierarchical clustering to find K centroids
Select more than K initial centroids and then select among these initial centroids (select the most widely separated)
Postprocessing