3. Introduction
The increase in the use of data-mining techniques in
business has been caused largely by three events:
◦ The explosion in the amount of data being produced
and electronically tracked
◦ The ability to electronically warehouse these data
◦ The affordability of computer power to analyze the
data
4. Observation
An observation is a set of recorded values of variables associated with a single entity. It is a row of values in a spreadsheet or database, in which the columns correspond to the variables.
Example: a 35-year-old single male dentist who donated $1,000 in 2016 and $2,000 in 2017 is one observation; age, gender, occupation, marital status, and the yearly donations are its variables.
5. Supervised or Unsupervised Learning
Data-mining approaches can be separated into two categories:
Supervised learning—for prediction and classification
Unsupervised learning—to detect patterns and relationships in the data
◦ Can be thought of as high-dimensional descriptive analytics
◦ Designed to describe patterns and relationships in large data sets with many observations of many variables
◦ There is no outcome variable to predict
◦ No definitive measure of accuracy; assessment is qualitative
6. Cluster Analysis
Goal: to segment observations into similar groups based on observed variables
◦ Can be employed during the data-preparation step to identify variables or observations that can be aggregated or removed from consideration
◦ Commonly used in marketing to divide customers into different homogeneous groups, a practice known as market segmentation
◦ Also used to identify outliers
7. Cluster Methods
Bottom-up hierarchical clustering starts with each observation in its own cluster and then sequentially merges the most similar clusters to create a series of nested clusters.
k-means clustering assigns each observation to one of k clusters such that the observations assigned to the same cluster are as similar as possible.
Both methods depend on a notion of how similar two observations are, so we must define a measure of similarity between observations.
8. Three influential factors
Hierarchical versus nonhierarchical clustering
The measurement of the distance between observations
The measurement of the distance between clusters
9. Measurement of Distance Between Observations
Euclidean distance
Matching coefficients
Jaccard coefficients
10. Measuring Similarity Between Observations
Euclidean distance: the most common method to measure dissimilarity between observations when the observations include continuous variables.
Let observations $u = (u_1, u_2, \ldots, u_q)$ and $v = (v_1, v_2, \ldots, v_q)$ each comprise measurements of $q$ variables.
The Euclidean distance between observations $u$ and $v$ is:

$$d_{u,v} = \sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2 + \cdots + (u_q - v_q)^2}$$

NOTE: This measure of distance is highly influenced by the scale on which the variables are measured.
11. Calculate the Euclidean Distance
Euclidean distance becomes smaller as a pair of observations becomes more similar with respect to their variable values.

Example: Euclidean distance between (2 cars, 5 children) and (1 car, 3 children), i.e., between (2, 5) and (1, 3):

$$d = \sqrt{(2-1)^2 + (5-3)^2} = \sqrt{5} = 2.24$$
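As a quick check, a minimal NumPy sketch of this calculation:

```python
# Minimal sketch: Euclidean distance between two observations.
import numpy as np

u = np.array([2, 5])  # (2 cars, 5 children)
v = np.array([1, 3])  # (1 car, 3 children)

distance = np.sqrt(np.sum((u - v) ** 2))  # sqrt((2-1)^2 + (5-3)^2)
print(round(distance, 2))  # 2.24
```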
12. Euclidean Distance
Euclidean distance is highly influenced by the scale on which variables are measured.
◦ It is therefore common to standardize the units of each variable j of each observation u.
◦ Example: uj, the value of variable j in observation u, is replaced with its z-score, zj.
The conversion to z-scores also makes it easier to identify outlier measurements, which can distort the Euclidean distance between observations.
13. Example
Age Female Income Married Children CarLoan Mortgage
48 1 17546.00 0 1 0 0
40 0 30085.10 1 3 1 1
51 1 16575.40 1 0 1 0
23 1 20375.40 1 3 0 0
57 1 50576.30 1 0 0 0
57 1 37869.60 1 2 0 0
22 0 8877.07 0 0 0 0
58 0 24946.60 1 0 1 0
37 1 25304.30 1 2 1 0
54 0 24212.10 1 2 1 0
66 1 59803.90 1 0 0 0
52 1 26658.80 0 0 1 1
44 1 15735.80 1 1 0 1
66 1 55204.70 1 1 1 1
36 0 19474.60 1 0 0 1
38 1 22342.10 1 0 1 1
37 1 17729.80 1 2 0 1
46 1 41016.00 1 0 0 1
62 1 26909.20 1 0 0 0
31 0 22522.80 1 0 1 0
61 0 57880.70 1 2 0 0
50 0 16497.30 1 2 0 0
54 0 38446.60 1 0 0 0
27 1 15538.80 0 0 1 1
22 0 12640.30 0 2 1 0
56 0 41034.00 1 0 1 1
45 0 20809.70 1 0 0 1
39 1 20114.00 1 1 0 0
39 1 29359.10 0 3 1 1
61 0 24270.10 1 1 0 0
Example: A financial advising company that provides personalized financial advice to its clients would like to segment its customer pool into several groups (clusters) to better serve them.
Variables:
◦ Age
◦ Female (1 if female, 0 if male)
◦ Annual Income
◦ Married (1 if married, 0 if not)
◦ Number of children
◦ CarLoan (1 if the customer has a car loan, 0 if not)
◦ Mortgage (1 if the customer has a mortgage, 0 if not)
14. Example: Consider only Age and Income
Age Income
48 17546.00
40 30085.10
51 16575.40
23 20375.40
57 50576.30
57 37869.60
22 8877.07
58 24946.60
37 25304.30
54 24212.10
66 59803.90
52 26658.80
44 15735.80
66 55204.70
36 19474.60
38 22342.10
37 17729.80
46 41016.00
62 26909.20
31 22522.80
61 57880.70
50 16497.30
54 38446.60
27 15538.80
22 12640.30
56 41034.00
45 20809.70
39 20114.00
39 29359.10
The Euclidean distance between the first two observations, (48, 17546.00) and (40, 30085.10), is:

$$d = \sqrt{(48 - 40)^2 + (17546.00 - 30085.10)^2} = 12539.1$$

This dissimilarity measure is dominated by the large values of Income. It is better to use the z-score of each variable to remove the effect of the different units.
Age Income
average 45.97 28011.87
st.dev. 13.04 13703.28
Zage Zincome
0.16 -0.76
-0.46 0.15
0.39 -0.83
-1.76 -0.56
0.85 1.65
0.85 0.72
-1.84 -1.40
0.92 -0.22
-0.69 -0.20
0.62 -0.28
1.54 2.32
0.46 -0.10
-0.15 -0.90
1.54 1.98
-0.76 -0.62
-0.61 -0.41
-0.69 -0.75
0.00 0.95
1.23 -0.08
-1.15 -0.40
1.15 2.18
0.31 -0.84
0.62 0.76
-1.45 -0.91
-1.84 -1.12
0.77 0.95
-0.07 -0.53
-0.53 -0.58
-0.53 0.10
Standardizing each variable, e.g., the age of the first observation:

$$z_{age} = \frac{48 - 45.97}{13.04} = 0.16$$

The first two observations, (48, 17546.00) and (40, 30085.10), become (0.16, -0.76) and (-0.46, 0.15).

Standardized distance between the first two observations:

$$d = \sqrt{(0.16 - (-0.46))^2 + (-0.76 - 0.15)^2} = 1.101$$

Z-scores also help identify outliers.
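A small Python sketch of the standardization step, using the means and standard deviations reported above:

```python
# Sketch: standardize with the reported means and sample standard deviations,
# then compute the Euclidean distance between the first two observations.
import numpy as np

mean = np.array([45.97, 28011.87])   # (Age, Income) averages from the table
sd = np.array([13.04, 13703.28])     # sample standard deviations

obs1 = np.array([48, 17546.00])
obs2 = np.array([40, 30085.10])

z1 = (obs1 - mean) / sd  # approximately ( 0.16, -0.76)
z2 = (obs2 - mean) / sd  # approximately (-0.46,  0.15)

print(np.sqrt(np.sum((z1 - z2) ** 2)))  # approximately 1.10
```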
15. Matching Coefficients
For categorical variables encoded as 0–1, a better measure of similarity between two observations is obtained by counting the number of variables with matching values.
The simplest overlap measure is called the matching coefficient and is computed by:

Matching coefficient = (number of variables with matching values) / (total number of variables)

Example: matching coefficient for (1, 0, 1, 0, 1) and (0, 1, 1, 0, 1). The last three variables match, so the coefficient is 3/5 = 0.60.
16. Jaccard’s Coefficient
A weakness of the matching coefficient is that if two observations both have a 0 entry for a categorical variable, this is counted as a sign of similarity between the two observations.
To avoid overstating similarity due to the shared absence of a feature, a similarity measure called Jaccard’s coefficient does not count matching zero entries and is computed by:

Jaccard’s coefficient = (number of variables where both observations equal 1) / (total number of variables − number of variables where both equal 0)

Example: Jaccard’s coefficient for (1, 0, 1, 0, 1) and (0, 1, 1, 0, 1). Two variables have a 1 in both observations and one variable has a 0 in both, so the coefficient is 2/(5 − 1) = 0.50.
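Both coefficients are simple to compute; a short sketch:

```python
# Sketch: matching coefficient and Jaccard's coefficient for 0-1 vectors.
def matching_coefficient(u, v):
    return sum(a == b for a, b in zip(u, v)) / len(u)

def jaccard_coefficient(u, v):
    both_one = sum(a == 1 and b == 1 for a, b in zip(u, v))
    both_zero = sum(a == 0 and b == 0 for a, b in zip(u, v))
    return both_one / (len(u) - both_zero)

u, v = (1, 0, 1, 0, 1), (0, 1, 1, 0, 1)
print(matching_coefficient(u, v))  # 0.6
print(jaccard_coefficient(u, v))   # 0.5
```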
17. Example
Calculate the matching coefficient and Jaccard’s coefficient for the first five observations, considering only the categorical variables.
Obs Age Female Income Married Children CarLoan Mortgage
1 48 1 17546 0 1 0 0
2 40 0 30085 1 3 1 1
3 51 1 16575 1 0 1 0
4 23 1 20375 1 3 0 0
5 57 1 50576 1 0 0 0
Obs Female Married CarLoan Mortgage
1 1 0 0 0
2 0 1 1 1
3 1 1 1 0
4 1 1 0 0
5 1 1 0 0
21. Example
Jaccard’s Coefficient Similarity Matrix
Obs.  1           2           3           4         5
1     1
2     0           1
3     1/3 = 0.33  2/4 = 0.50  1
4     1/2 = 0.50  1/4 = 0.25  2/3 = 0.67  1
5     1/2 = 0.50  1/4 = 0.25  2/3 = 0.67  2/2 = 1   1
Obs Female Married CarLoan Mortgage
1 1 0 0 0
2 0 1 1 1
3 1 1 1 0
4 1 1 0 0
5 1 1 0 0
According to Jaccard, the pair of observations 1 and 4 is exactly as similar (0.50) as the pair 2 and 3 (0.50).
According to the matching coefficient, observations 1 and 4 are more similar (0.75) than observations 2 and 3 (0.50).
22. Hierarchical Clustering
◦Determines the similarity of two clusters by considering the similarity
between the observations composing either cluster
◦Starts with each observation in its own cluster and then iteratively
combines the two clusters that are the most similar into a single cluster
◦Given a way to measure similarity between observations, there are
several clustering method alternatives for comparing observations in two
clusters to obtain a cluster similarity measure
24. Similarity Measures Between Clusters
• Single linkage: defines the similarity between two clusters as the similarity of the pair of observations (one from each cluster) that are the most similar.
• Complete linkage: defines the similarity between two clusters as the similarity of the pair of observations (one from each cluster) that are the most different.
• Group average linkage: defines the similarity between two clusters as the average similarity computed over all pairs of observations between the two clusters.
• Median linkage: analogous to group average linkage except that it uses the median of the similarities computed between all pairs of observations between the two clusters.
• Centroid linkage: uses the averaging concept of cluster centroids to define between-cluster similarity.
25. Measuring Similarity Between Clusters
Single linkage will consider two clusters to be close if an observation in one of the clusters is close to at least one observation in the other cluster. The cluster formed by merging two clusters that are close with respect to single linkage may therefore contain pairs of observations that are very different.
Complete linkage will consider two clusters to be close if their most different pair of observations is close. This method produces clusters such that all member observations of a cluster are relatively close to each other, but it can be distorted by outlier observations.
Both the single and complete linkage methods base similarity on a single pair of observations.
26. Measuring Similarity Between Clusters
Group average linkage is based on all pairs of observations. If Cluster 1 has n1 observations and Cluster 2 has n2 observations, the similarity measure is the average over all n1 × n2 pairs. This method produces clusters that are less dominated by the similarity between single pairs of observations.
Centroid linkage is based on the average observation of each cluster, called the centroid. The similarity between two clusters is defined as the similarity of their centroids.
Median linkage is similar to group average linkage, except that it uses the median instead of the average.
27. Example
Consider the following matrix D of distances between pairs of 5 objects:

      1   2   3   4   5
1     0   9   3   6  11
2     9   0   7   5  10
3     3   7   0   9   2
4     6   5   9   0   8
5    11  10   2   8   0

Classify using hierarchical clustering with single linkage.
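One way to verify the hand classification, assuming SciPy is available (the slides work this example by hand):

```python
# Sketch: single-linkage hierarchical clustering on the distance matrix D.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

D = np.array([
    [ 0,  9,  3,  6, 11],
    [ 9,  0,  7,  5, 10],
    [ 3,  7,  0,  9,  2],
    [ 6,  5,  9,  0,  8],
    [11, 10,  2,  8,  0],
])

# linkage() takes the condensed (flattened upper-triangle) distance matrix
Z = linkage(squareform(D), method="single")
print(Z)  # each row: the two merged clusters, the merge distance, the new size

print(fcluster(Z, t=2, criterion="maxclust"))  # e.g., cut the tree into 2 clusters
# scipy.cluster.hierarchy.dendrogram(Z) would draw the nested-cluster tree
```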
38. Cluster Analysis - More Measures
Centroid linkage uses the averaging concept of cluster centroids to define between-cluster similarity.
Ward’s method merges the two clusters such that the dissimilarity of the observations within the resulting single cluster increases as little as possible.
When McQuitty’s method considers merging two clusters A and B, the dissimilarity of the resulting cluster AB to any other cluster C is calculated as ((dissimilarity between A and C) + (dissimilarity between B and C)) / 2.
A dendrogram is a chart that depicts the set of nested clusters resulting at each step of aggregation.
40. Inputs
Data
  Workbook: Data Chapter 4.xlsx
  Worksheet: KYC
  Range: $A$1:$G$31
  # Records in the input data: 30
Variables
  # Selected Variables: 4
  Selected Variables: Female Married CarLoan Mortgage
Hierarchical Clustering: Fitting Parameters
  Similarity Measure: MATCHING
  Clustering Method: GROUP AVERAGE
Hierarchical Clustering: Model Parameters
  Cluster Assignment: TRUE
  # Clusters: 29
Hierarchical Clustering: Reporting Parameters
  Normalized?: FALSE
  Draw Dendrogram?: TRUE
  Maximum Number of Leaves in Dendrogram: 10
  Data Type: Raw Data
41. Clustering Stages
Record ID Cluster Sub-Cluster
Record 1 1 1
Record 2 2 2
Record 3 3 3
Record 4 4 1
Record 5 4 1
Record 6 5 1
Record 7 6 4
Record 8 7 2
Record 9 8 3
Record 10 9 2
Record 11 10 1
Record 12 11 5
Record 13 12 6
Record 14 13 7
Record 15 14 8
Record 16 15 7
Record 17 16 6
Record 18 17 6
Record 19 18 1
Record 20 19 2
Record 21 20 9
Record 22 21 9
Record 23 22 9
Record 24 23 5
Record 25 24 10
Record 26 25 2
Record 27 26 8
Record 28 27 1
Record 29 28 5
Record 30 29 9
Stage Cluster 1 Cluster 2 Distance
Stage1 4 5 0
Stage2 4 6 0
Stage3 3 9 0
Stage4 8 10 0
Stage5 4 11 0
Stage6 14 16 0
Stage7 13 17 0
Stage8 13 18 0
Stage9 4 19 0
Stage10 8 20 0
Stage11 21 22 0
Stage12 21 23 0
Stage13 12 24 0
Stage14 2 26 0
Stage15 15 27 0
Stage16 4 28 0
Stage17 12 29 0
Stage18 21 30 0
Stage19 1 4 0.25
Stage20 2 8 0.25
Stage21 3 14 0.25
Stage22 13 15 0.25
Stage23 7 21 0.25
Stage24 1 7 0.3214
Stage25 2 25 0.35
Stage26 3 12 0.375
Stage27 1 13 0.4125
Stage28 2 3 0.5060
Stage29 1 2 0.5905
Sub-cluster membership: each of the 30 observations is assigned to one of the 10 sub-clusters.
43. Analytic Solver Basic limits the maximum number of leaves to 10; setting a value higher than 10 results in an error message.
If the number of leaves is less than the number of observations, some of the clusters on the horizontal axis will initially correspond to groups of observations combined in early steps of the agglomeration that are not represented in the dendrogram.
45. The vertical distance between the two horizontal lines is the cost of merging clusters in terms of decreased homogeneity within clusters: 0.42 − 0.251 = 0.169.
NOTE: Elongated portions of the dendrogram represent mergers of more dissimilar clusters.
46. Cluster Durability or Strength
The durability or strength of a cluster is measured by the difference between the distance value at which the cluster is originally formed and the distance value at which it is merged with another cluster.
Example: Cluster 8 is formed at distance 0 and merged with cluster 6 at distance 0.25, so the strength of cluster 8 = 0.25 − 0 = 0.25.
49. CLUSTER 1
Obs Female Married Car Loan Mortgage
3 1 1 1 0
9 1 1 1 0
12 1 0 1 1
14 1 1 1 1
16 1 1 1 1
24 1 0 1 1
29 1 0 1 1
Cluster 1
Row Labels Count of Female
0 0
1 7
Grand Total 7
Row Labels Count of Married
0 3
1 4
Grand Total 7
Row Labels Count of Mortgage
0 2
1 5
Grand Total 7
All have car loans
50. CLUSTER 2
Obs Female Married Car Loan Mortgage
2 0 1 1 1
8 0 1 1 0
10 0 1 1 0
20 0 1 1 0
25 0 0 1 0
26 0 1 1 1
Cluster 2
Row Labels Count of Female
0 6
1 0
Grand Total 6
Row Labels Count of Married
0 1
1 5
Grand Total 6
Row Labels Count of Mortgage
0 4
1 2
Grand Total 6
All have car loans
51. CLUSTER 3
Obs Female Married Car Loan Mortgage
1 1 0 0 0
4 1 1 0 0
5 1 1 0 0
6 1 1 0 0
7 0 0 0 0
11 1 1 0 0
13 1 1 0 1
15 0 1 0 1
17 1 1 0 1
18 1 1 0 1
19 1 1 0 0
21 0 1 0 0
22 0 1 0 0
23 0 1 0 0
27 0 1 0 1
28 1 1 0 0
30 0 1 0 0
Cluster 3
Row Labels Count of Female
0 7
1 10
Grand Total 17
Row Labels Count of Married
0 2
1 15
Grand Total 17
Row Labels Count of Mortgage
0 12
1 5
Grand Total 17
None have car loans
52. APPLICATIONS OF HIERARCHICAL CLUSTERING
Hierarchical clustering is often used on DNA microarrays.
Clustering of gene expression profiles is often used to try
to discover subclasses of disease.
Validation of these clusters is important for accurate
scientific interpretation of the results.
53. k-Means Clustering
◦ Given a value of k, the k-means algorithm randomly partitions the observations into k clusters.
◦ After all observations have been assigned to a cluster, the resulting cluster centroids are calculated.
◦ Using the updated cluster centroids, all observations are reassigned to the cluster with the closest centroid using the Euclidean distance.
◦ The last two steps repeat until no observation changes cluster (or an iteration limit is reached).
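A minimal sketch of the same procedure, assuming scikit-learn rather than the slides' Analytic Solver (the seed means something different here, so exact results will differ):

```python
# Sketch: k-means with k = 3 on standardized Age and Income, 10 random starts.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

age_income = np.array([
    [48, 17546.00], [40, 30085.10], [51, 16575.40], [23, 20375.40],
    [57, 50576.30], [57, 37869.60], [22, 8877.07],  [58, 24946.60],
    [37, 25304.30], [54, 24212.10], [66, 59803.90], [52, 26658.80],
    [44, 15735.80], [66, 55204.70], [36, 19474.60], [38, 22342.10],
    [37, 17729.80], [46, 41016.00], [62, 26909.20], [31, 22522.80],
    [61, 57880.70], [50, 16497.30], [54, 38446.60], [27, 15538.80],
    [22, 12640.30], [56, 41034.00], [45, 20809.70], [39, 20114.00],
    [39, 29359.10], [61, 24270.10],
])

# Note: StandardScaler divides by the population standard deviation, so the
# z-scores differ very slightly from the slides' sample-based z-scores.
X = StandardScaler().fit_transform(age_income)
km = KMeans(n_clusters=3, n_init=10, random_state=12345).fit(X)

print(km.labels_)           # cluster assignment for each observation
print(km.inertia_)          # total within-cluster sum of squared distances
print(km.cluster_centers_)  # standardized centroids
```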
55. Example
Clustering observations by Age and Income using k-means clustering with k = 3.
(In the original slide, a scatter chart shows the three clusters, with callouts marking the most heterogeneous cluster, the largest cluster, and the most homogeneous cluster.)
56. Data
Workbook DemoKTC.xlsx
Worksheet Data
Range $A$1:$G$31
# Records in the input data 30
# Selected Variables 2
Selected Variables Age Income
K-Means Clustering: Fitting Parameters
# Clusters 3
Start type Random Start
# Iterations 10
Random seed: initial centroids 12345
K-Means Clustering: Reporting Parameters
Show data summary TRUE
Show distance from each cluster TRUE
Normalized? TRUE
Cluster using Age and Income only, with the variables normalized.
57. Random Starts (Age and Income)
For each of the 10 random starts, the report lists the three initial cluster centroids (in standardized Age and Income units) and the resulting total within-cluster sum of squares (SS):
Start 1. Sum of Squares: 45.069043
Start 2. Sum of Squares: 24.830185
Best: Start 3. Sum of Squares: 22.397972
Start 4. Sum of Squares: 35.082695
Start 5. Sum of Squares: 25.533615
Start 6. Sum of Squares: 85.235001
Start 7. Sum of Squares: 23.072807
Start 8. Sum of Squares: 34.274493
Start 9. Sum of Squares: 35.570412
Start 10. Sum of Squares: 34.630941
With 10 random starts, the best start is the one with the smallest SS: Start 3.
59. Evaluate the Strength of the Clusters
Rule of thumb: the ratio of between-cluster distance to within-cluster distance should exceed 1.0 for useful clusters (between-cluster distance = distance between centroids; within-cluster distance = average distance between observations in the cluster).
Cluster 2 is the most heterogeneous. The distance between the Cluster 2 and Cluster 3 centroids is 1.9639, so the average observation in Cluster 2 is approximately 2.66 times closer to Cluster 2 than to Cluster 3 (1.964 / 0.739 = 2.66).
For Cluster 1: within-cluster distance = 0.6215 and the distance between the Cluster 1 and Cluster 2 centroids = 2.7836, so the ratio = 2.7836 / 0.6215 = 4.48. The average observation in Cluster 1 is 4.48 times closer to Cluster 1 than to Cluster 2, and 2.46 times closer than to Cluster 3 (1.529 / 0.6215 = 2.46).
60. Distances from Each Cluster
Record ID Cluster Dist.Cluster-1 Dist.Cluster-2 Dist.Cluster-3
Record 1 3 1.1998 2.3291 0.4459
Record 2 1 0.9092 1.8805 1.1484
Record 3 3 1.4389 2.3338 0.3715
Record 4 1 0.7348 3.3369 2.2631
Record 5 2 2.8924 0.2183 2.1558
Record 6 2 2.2665 0.7226 1.2493
Record 7 1 1.1667 3.9503 2.5112
Record 8 3 1.9773 1.6626 0.4942
Record 9 1 0.4946 2.2890 1.2218
Record 10 3 1.6659 1.7417 0.2342
Record 11 2 3.8534 1.0791 2.9865
Record 12 3 1.5580 1.6022 0.3845
Record 13 3 0.9382 2.5657 0.7724
Record 14 2 3.6096 0.8281 2.6742
Record 15 1 0.2699 2.6579 1.2730
Record 16 1 0.4397 2.3988 1.1138
Record 17 1 0.3894 2.7119 1.2185
Record 18 2 1.8247 1.0339 1.5146
Record 19 3 2.3055 1.5519 0.8314
Record 20 1 0.1989 2.7622 1.6505
Record 21 2 3.4990 0.7786 2.7397
Record 22 3 1.3649 2.3578 0.4069
Record 23 2 2.1066 0.7397 1.2481
Record 24 1 0.5543 3.3350 2.0017
Record 25 1 0.9880 3.7580 2.4246
Record 26 2 2.3450 0.5093 1.4566
Record 27 3 0.9526 2.1985 0.5768
Record 28 1 0.4923 2.4810 1.0394
Record 29 1 0.8204 1.9727 1.1863
Record 30 3 2.1974 1.7286 0.6842
Each record is classified into the cluster whose centroid is nearest; e.g., Record 1 is assigned to Cluster 3 because its distance to Cluster 3 (0.4459) is the smallest.
61. Distances with the Original Data
Record ID Cluster Dist.Cluster-1 Dist.Cluster-2 Dist.Cluster-3 Age Income
Record 1 3 1.1998 2.3291 0.4459 48.0000 17546.00
Record 2 1 0.9092 1.8805 1.1484 40.0000 30085.10
Record 3 3 1.4389 2.3338 0.3715 51.0000 16575.40
Record 4 1 0.7348 3.3369 2.2631 23.0000 20375.40
Record 5 2 2.8924 0.2183 2.1558 57.0000 50576.30
Record 6 2 2.2665 0.7226 1.2493 57.0000 37869.60
Record 7 1 1.1667 3.9503 2.5112 22.0000 8877.07
Record 8 3 1.9773 1.6626 0.4942 58.0000 24946.60
Record 9 1 0.4946 2.2890 1.2218 37.0000 25304.30
Record 10 3 1.6659 1.7417 0.2342 54.0000 24212.10
Record 11 2 3.8534 1.0791 2.9865 66.0000 59803.90
Record 12 3 1.5580 1.6022 0.3845 52.0000 26658.80
Record 13 3 0.9382 2.5657 0.7724 44.0000 15735.80
Record 14 2 3.6096 0.8281 2.6742 66.0000 55204.70
Record 15 1 0.2699 2.6579 1.2730 36.0000 19474.60
Record 16 1 0.4397 2.3988 1.1138 38.0000 22342.10
Record 17 1 0.3894 2.7119 1.2185 37.0000 17729.80
Record 18 2 1.8247 1.0339 1.5146 46.0000 41016.00
Record 19 3 2.3055 1.5519 0.8314 62.0000 26909.20
Record 20 1 0.1989 2.7622 1.6505 31.0000 22522.80
Record 21 2 3.4990 0.7786 2.7397 61.0000 57880.70
Record 22 3 1.3649 2.3578 0.4069 50.0000 16497.30
Record 23 2 2.1066 0.7397 1.2481 54.0000 38446.60
Record 24 1 0.5543 3.3350 2.0017 27.0000 15538.80
Record 25 1 0.9880 3.7580 2.4246 22.0000 12640.30
Record 26 2 2.3450 0.5093 1.4566 56.0000 41034.00
Record 27 3 0.9526 2.1985 0.5768 45.0000 20809.70
Record 28 1 0.4923 2.4810 1.0394 39.0000 20114.00
Record 29 1 0.8204 1.9727 1.1863 39.0000 29359.10
Record 30 3 2.1974 1.7286 0.6842 61.0000 24270.10
Insert the original data (Age and Income) next to the distances to analyze the clusters.
62. Records Sorted by Cluster
Record ID Cluster Dist.Cluster-1 Dist.Cluster-2 Dist.Cluster-3 Age Income
Record 2 1 0.9092 1.8805 1.1484 40.0000 30085.10
Record 4 1 0.7348 3.3369 2.2631 23.0000 20375.40
Record 7 1 1.1667 3.9503 2.5112 22.0000 8877.07
Record 9 1 0.4946 2.2890 1.2218 37.0000 25304.30
Record 15 1 0.2699 2.6579 1.2730 36.0000 19474.60
Record 16 1 0.4397 2.3988 1.1138 38.0000 22342.10
Record 17 1 0.3894 2.7119 1.2185 37.0000 17729.80
Record 20 1 0.1989 2.7622 1.6505 31.0000 22522.80
Record 24 1 0.5543 3.3350 2.0017 27.0000 15538.80
Record 25 1 0.9880 3.7580 2.4246 22.0000 12640.30
Record 28 1 0.4923 2.4810 1.0394 39.0000 20114.00
Record 29 1 0.8204 1.9727 1.1863 39.0000 29359.10
Record 5 2 2.8924 0.2183 2.1558 57.0000 50576.30
Record 6 2 2.2665 0.7226 1.2493 57.0000 37869.60
Record 11 2 3.8534 1.0791 2.9865 66.0000 59803.90
Record 14 2 3.6096 0.8281 2.6742 66.0000 55204.70
Record 18 2 1.8247 1.0339 1.5146 46.0000 41016.00
Record 21 2 3.4990 0.7786 2.7397 61.0000 57880.70
Record 23 2 2.1066 0.7397 1.2481 54.0000 38446.60
Record 26 2 2.3450 0.5093 1.4566 56.0000 41034.00
Record 1 3 1.1998 2.3291 0.4459 48.0000 17546.00
Record 3 3 1.4389 2.3338 0.3715 51.0000 16575.40
Record 8 3 1.9773 1.6626 0.4942 58.0000 24946.60
Record 10 3 1.6659 1.7417 0.2342 54.0000 24212.10
Record 12 3 1.5580 1.6022 0.3845 52.0000 26658.80
Record 13 3 0.9382 2.5657 0.7724 44.0000 15735.80
Record 19 3 2.3055 1.5519 0.8314 62.0000 26909.20
Record 22 3 1.3649 2.3578 0.4069 50.0000 16497.30
Record 27 3 0.9526 2.1985 0.5768 45.0000 20809.70
Record 30 3 2.1974 1.7286 0.6842 61.0000 24270.10
Sort on the Cluster column (expanding the selection to all the other columns) to group the records into Cluster 1, Cluster 2, and Cluster 3.
64. K-Means Clustering on Age, Income, and Children
Variables
  # Selected Variables: 3
  Selected Variables: Age Income Children
K-Means Clustering: Fitting Parameters
  # Clusters: 3
  Start type: Random Start
  # Iterations: 10
  Random seed: initial centroids: 12345
K-Means Clustering: Reporting Parameters
  Show data summary: TRUE
  Show distance from each cluster: TRUE
  Normalized?: TRUE
Note: No. of iterations = the number of times that cluster centroids are recalculated and observations are reassigned to clusters.
65. Random Starts Summary (Age, Income, and Children)
For each of the 10 random starts, the report lists the three initial cluster centroids (standardized Age, Income, and Children) and the resulting total within-cluster sum of squares (SS):
Start 1. Sum of Squares: 81.070037
Start 2. Sum of Squares: 51.665475
Best: Start 3. Sum of Squares: 49.715972
Start 4. Sum of Squares: 86.460648
Start 5. Sum of Squares: 54.790692
Start 6. Sum of Squares: 133.415590
Start 7. Sum of Squares: 60.531938
Start 8. Sum of Squares: 85.652446
Start 9. Sum of Squares: 86.948365
Start 10. Sum of Squares: 63.737762
The best start is Start 3, with the smallest SS.
66. Cluster Centers
Cluster Age Income Children
Cluster 1 -0.7182 -0.6041 0.4935
Cluster 2 1.1833 1.7700 0.0617
Cluster 3 0.4856 0.0211 -0.7711
Inter-Cluster Distances
Cluster Cluster 1 Cluster 2 Cluster 3
Cluster 1 0 3.0722 1.8545
Cluster 2 3.0722 0 2.0589
Cluster 3 1.8545 2.0589 0
Cluster Summary
Cluster Size Average Distance
Cluster 1 15 1.2413
Cluster 2 5 0.9982
Cluster 3 10 0.8147
Total 30 1.0586
Cluster 1 and Cluster 2 are the most distinct pair of clusters (largest inter-cluster distance). Observations within clusters are more similar than observations between clusters; Cluster 3 is the most homogeneous (smallest average distance).
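The rule-of-thumb ratios can be checked directly from the two tables above; a small sketch, with the values copied from the report:

```python
# Sketch: between-cluster distance divided by average within-cluster distance.
# Values come from the Inter-Cluster Distances and Cluster Summary tables.
within = {1: 1.2413, 2: 0.9982, 3: 0.8147}   # average distance within each cluster
between = {(1, 2): 3.0722, (1, 3): 1.8545, (2, 3): 2.0589}

for (a, b), dist in between.items():
    # A ratio above 1.0 suggests the pair of clusters is usefully separated
    print(f"Clusters {a} and {b}: "
          f"{dist / within[a]:.2f} vs cluster {a}, "
          f"{dist / within[b]:.2f} vs cluster {b}")
```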
69. Cluster 1: youngest customers, lowest
income and largest families
Cluster 2: oldest customers, highest
income and an average of one child
Cluster 3: older customers,
moderate income and few children
Distance between clusters: Cluster 1 and Cluster 2 are the most distinct pair of clusters.
Cluster Age Income Children Count
1 36.6 19734.16 1.47 15
2 61.4 52267.04 1 5
3 52.3 28300.85 0.1 10
Total 45.97 28011.87 0.93 30
70. Cluster Analysis
As an unsupervised learning technique, cluster analysis is not guided
by any explicit measure of accuracy, and thus the notion of a ‘good’
clustering is subjective and is dependent on what the analyst hopes
the cluster analysis will uncover.
We can measure the strength of a cluster by comparing the average
distance in a cluster to the distance between cluster centroids.
Rule of thumb: the ratio of between-cluster distance to within-
cluster distance should exceed 1.0 for useful clusters.
71. Hierarchical Clustering versus k-Means Clustering

Hierarchical clustering:
◦ Suitable when we have a small data set (e.g., fewer than 500 observations) and want to easily examine solutions with increasing numbers of clusters
◦ Convenient method if you want to observe how clusters are nested
◦ Very sensitive to outliers

k-means clustering:
◦ Suitable when you know how many clusters you want and you have a larger data set (e.g., larger than 500 observations)
◦ Partitions the observations, which is appropriate if trying to summarize the data with k “average” observations that describe the data with the minimum amount of error
◦ Generally not appropriate for binary or ordinal data because the average is not meaningful
72. Association Rules
EVALUATING ASSOCIATION RULES
Association rule mining is the data-mining process of finding rules that may govern associations between sets of items. In a given set of transactions with multiple items, it tries to find the rules that govern how or why such items are often bought together.
73. Association Rules
Association rules: if-then statements that convey the likelihood of certain items being purchased together.
Although association rules are an important tool in market basket analysis, they are applicable to other disciplines.
Antecedent: the collection of items (or item set) corresponding to the if portion of the rule.
Consequent: the item set corresponding to the then portion of the rule.
Support count of an item set: the number of transactions in the data that include that item set.
74. When we go grocery shopping, we often have a standard list of
things to buy. Each shopper has a distinctive list, depending on
one’s needs and preferences. A person might buy healthy
ingredients for a family dinner, while a bachelor might buy beer
and chips. Understanding these buying patterns can help to
increase sales in several ways. If there is a pair of items, X and
Y, that are frequently bought together:
•Both X and Y can be placed on the
same shelf, so that buyers of one
item would be prompted to buy the
other.
•Promotional discounts could be
applied to just one out of the two
items.
•Advertisements on X could be
targeted at buyers who purchase Y.
•X and Y could be combined into a
new product, such as having Y in
flavors of X.
While we may know that certain
items are frequently bought
together, the question is, how do
we uncover these associations?
Besides increasing sales profits,
association rules can also be used
in other fields. In medical diagnosis
for instance, understanding which
symptoms tend to appear together
can help to improve patient care
and medicine prescription.
Definition
Association rules analysis is a technique to uncover how items are associated with each other. There are three common ways to measure association: the support count, the confidence, and the lift ratio.
75. https://medium.com/analytics-vidhya/association-rule-mining-7f06401f0601
EXAMPLES
Before a hurricane strikes, people tend to stock up on Strawberry Pop-Tarts just as much as batteries and other essentials. In 2004, Walmart mined trillions of bytes of data to discover that Strawberry Pop-Tarts were the most purchased item pre-hurricane. This was attributed to the no-cook, long-lasting qualities of the tarts that made them disaster favourites, and it proved true in later years when stores that stocked up on Pop-Tarts pre-hurricane sold out of them.
Fast-food chains learned very early in the game that people who buy fast food tend to feel thirsty due to the high salt content and end up buying Coke.
76. Association rule mining is a good tool for businesses.
1. It helps businesses build sales strategies by identifying products that sell better together.
2. It helps businesses build marketing strategies. The knowledge that some ornaments do not sell as well as others during Christmas may help the manager offer a sale on the less frequently purchased ornaments.
3. It helps shelf-life planning. If olives don’t sell very often, the manager will not stock up on them, but he still wants to ensure that the existing stock sells before the expiration date. With the knowledge that people who buy pizza dough tend to buy olives, the olives can be offered at a lower price in combination with the pizza dough.
4. It helps in-store organization. Products that are known to drive the sales of other products can be moved closer together in the store. For instance, if the sale of butter is driven by the sale of bread, they can be moved to the same aisle.
77. Walmart analyzed 1.2 million baskets of a store and found
a very interesting association. They found that on Fridays,
between 5 pm and 7 pm, diapers and beer were frequently
bought together.
To test this out, they moved the two closer in the store and
found a significant impact on the sales of these products.
Further analysis of this led them to the following
conclusion: “On Friday evenings, men would head home
from work and grab some beer while also picking up
diapers for their infants.”
78. Example: Shopping-Cart Transactions
“If bread and jelly, then peanut butter”
◦ Antecedent: bread and jelly; consequent: peanut butter.
◦ We consider only association rules with a single consequent.
◦ The support count of an item set = the number of transactions that include it.
◦ Support count of {bread, jelly} = 4; support count of {peanut butter} = 4.
Rule of thumb: only consider association rules with a support count of at least 20% of the transactions.
80. Association Rules
Confidence: helps identify reliable association rules; the confidence measure helps identify which product drives the sale of which other product.
Lift ratio: a measure used to evaluate the efficiency of a rule.
a) Find the confidence of the rule “If Bread and Fruit, then Jelly”
b) Find the lift ratio of the rule “If Bread and Fruit, then Jelly”
c) Find the confidence of the rule “If Milk, then Peanut Butter”
d) Find the lift ratio of the rule “If Milk, then Peanut Butter”
81. a) Confidence of the rule “If Bread and Fruit, then Jelly”:

$$\text{Confidence} = \frac{\text{Support}\{\text{Bread, Fruit, Jelly}\}}{\text{Support}\{\text{Bread, Fruit}\}} = \frac{4}{4} = 1.0$$
82. b) Lift ratio of the rule “If Bread and Fruit, then Jelly”:

$$\text{Lift ratio} = \frac{\text{Confidence}}{\text{Support}\{\text{Jelly}\}/n} = \frac{1.0}{5/10} = 2.00$$

Interpretation: identifying a customer who purchased both bread and fruit as one who also purchased jelly is two times better than just guessing that a random customer purchased jelly.
83. c) Confidence of the rule “If Milk, then Peanut Butter”:

$$\text{Confidence} = \frac{\text{Support}\{\text{Milk, Peanut Butter}\}}{\text{Support}\{\text{Milk}\}} = \frac{4}{6} = 0.66$$

d) Lift ratio of the rule “If Milk, then Peanut Butter”:

$$\text{Lift ratio} = \frac{0.66}{4/10} = 1.66$$

Interpretation: identifying a customer who purchased milk as one who also purchased peanut butter is 66% better than just guessing that a random customer purchased peanut butter.
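These measures are simple to compute directly. The sketch below defines them in Python over a hypothetical list of shopping carts (the carts are invented for illustration and are not the slides' data):

```python
# Sketch: support count, confidence, and lift ratio for association rules.
def support(transactions, items):
    """Support count: number of transactions containing every item in `items`."""
    return sum(items <= cart for cart in transactions)

def confidence(transactions, antecedent, consequent):
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)

def lift_ratio(transactions, antecedent, consequent):
    return confidence(transactions, antecedent, consequent) / (
        support(transactions, consequent) / len(transactions)
    )

carts = [  # hypothetical transactions
    {"bread", "fruit", "jelly"},
    {"bread", "fruit", "jelly", "milk"},
    {"milk", "peanut butter"},
    {"bread", "milk"},
]
print(confidence(carts, {"bread", "fruit"}, {"jelly"}))  # 2/2 = 1.0
print(lift_ratio(carts, {"bread", "fruit"}, {"jelly"}))  # 1.0 / (2/4) = 2.0
```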
85. Association Rules
The utility of a rule depends on both its support and its lift ratio. Although a high lift ratio suggests that the rule is very efficient at identifying when the consequent occurs, a rule with very low support may not be as useful as a rule with a lower lift ratio that affects a large number of transactions (as demonstrated by a high support).
86. Analytic Solver (Data Mining - Associate - Association Rules)
  Data Format: Binary
  Method: Apriori
  Min support: 4
  Min confidence: 50
89. Evaluating Association Rules
An association rule is judged on how actionable it is and how well it explains the relationship between item sets.
An association rule is useful if it is well supported and explains an important, previously unknown relationship.
90. Example
Consider 100 transactions and the rule “If A then B”, where the antecedent A is very popular: Support(A) = 50, Support(B) = 2, and Support(A and B) = 2.

Confidence (If A then B) = Support(A and B) / Support(A) = 2/50 = 0.04

Lift ratio (If A then B) = Confidence (If A then B) / (Support(B)/100) = 0.04 / (2/100) = 2

The rule has a high lift ratio but very low confidence and support.
91. Text Mining
Text mining is the process of transforming unstructured text into
structured data for easy analysis. Text mining uses natural language
processing (NLP), allowing machines to understand the human language
and process it automatically.
Text data is often referred to as unstructured data because in its raw form, it
cannot be stored in a traditional structured database (rows and columns).
Audio and video data are also examples of unstructured data.
Data mining with text data is more challenging than data mining with traditional
numerical data, because it requires more preprocessing to convert the text to a
format amenable for analysis.
92. Basic Methods
1) Word frequency
Word frequency can be used to identify the most recurrent
terms or concepts in a set of data.
2) Collocation
Collocation refers to a sequence of words that commonly appear
near each other.
3) Concordance
Concordance is used to recognize the particular context or
instance in which a word or set of words appears.
Other common methods include topic analysis and sentiment analysis.
93. Example: Triad Airline
◦ Triad solicits feedback from its customers through a follow-up e-mail the
day after the customer has completed a flight.
◦ Survey asks the customer to rate various aspects of the flight and asks the
respondent to type comments into a dialog box in the e-mail; includes:
◦ Quantitative feedback from the ratings.
◦ Comments entered by the respondents which need to be analyzed.
◦ A collection of text documents to be analyzed is called a corpus.
94. Example: Triad Airline (10 respondents)
Concerns
The wi-fi service was horrible. It was slow and cut off several times.
My seat was uncomfortable.
My flight was delayed 2 hours for no apparent reason.
My seat would not recline.
The man at the ticket counter was rude. Service was horrible.
The flight attendant was rude. Service was bad.
My flight was delayed with no explanation.
My drink spilled when the guy in front of me reclined his seat.
My flight was canceled.
The arm rest of my seat was nasty.
To be analyzed, text data needs to be converted to structured data
95. Example: Converting text data
To be analyzed, text data needs to be converted to structured data (rows and
columns of numerical data) so that the tools of descriptive statistics, data
visualization and data mining can be applied.
We must convert a group of documents into a matrix of rows and columns
where the rows correspond to a document and the columns correspond to a
particular word.
A presence/absence or binary term-document matrix is a matrix with the rows
representing documents and the columns representing words.
Entries in the columns indicate either the presence or the absence of a
particular word in a particular document.
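One possible way to build such a matrix, assuming scikit-learn is available (the slides use different software), for the Triad comments and management's term list:

```python
# Sketch: binary (presence/absence) term-document matrix for the Triad comments.
from sklearn.feature_extraction.text import CountVectorizer

comments = [
    "The wi-fi service was horrible. It was slow and cut off several times.",
    "My seat was uncomfortable.",
    "My flight was delayed 2 hours for no apparent reason.",
    # ... the remaining survey comments would be listed here as well
]

terms = ["delayed", "flight", "horrible", "recline", "rude", "seat", "service"]

# binary=True records presence/absence rather than counts; note that without
# stemming, variants such as "reclined" would not match the term "recline".
vectorizer = CountVectorizer(vocabulary=terms, binary=True, lowercase=True)
matrix = vectorizer.fit_transform(comments)
print(matrix.toarray())  # rows = documents, columns = terms
```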
96. Example: Converting text data
◦ Creating the list of terms to use in the presence/absence matrix can be a
complicated matter:
◦ Too many terms results in a matrix with many columns, which may be
difficult to manage and could yield meaningless results.
◦ Too few terms may miss important relationships.
◦ Term frequency along with the problem context are often used as a guide.
◦ In Triad’s case, management used word frequency and the context of
having a goal of satisfied customers to come up with the following list of
terms they feel are relevant for categorizing the respondent’s comments.
97. Example: Converting text data
Concerns
The wi-fi service was horrible. It was slow and cut off several times.
My seat was uncomfortable.
My flight was delayed 2 hours for no apparent reason.
My seat would not recline.
The man at the ticket counter was rude. Service was horrible.
The flight attendant was rude. Service was bad.
My flight was delayed with no explanation.
My drink spilled when the guy in front of me reclined his seat.
My flight was canceled.
The arm rest of my seat was nasty.
Selected terms: delayed, flight, horrible, recline, rude, seat, and service.
99. Preprocessing Text Data for Analysis
◦ The text-mining process converts unstructured text into numerical data and applies quantitative techniques.
◦ Which terms become the headers of the columns of the term-document matrix can greatly impact the analysis.
Tokenization: the process of dividing text into separate terms, referred to as tokens. Symbols and punctuation must be removed from the document, and all letters should be converted to lowercase. Different forms of the same word, such as “stacking,” “stacked,” and “stack,” probably should not be considered distinct terms.
Stemming: the process of converting a word to its stem or root word.
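As one possible implementation (the slides do not name a tool), NLTK's Porter stemmer produces root forms like those that appear in the term list later (e.g., "horribl", "reclin"):

```python
# Sketch: stemming word variants to a common root with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["stacking", "stacked", "stack", "horrible", "reclined", "service"]:
    print(word, "->", stemmer.stem(word))
# stacking/stacked/stack -> stack; horrible -> horribl;
# reclined -> reclin; service -> servic
```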
100. Recommendations
◦ The goal of preprocessing is to generate a list of most-relevant terms that is
sufficiently small so as to lend itself to analysis:
◦ Frequency can be used to eliminate words from consideration as tokens.
◦ Low-frequency words probably will not be very useful as tokens.
◦ Consolidating words that are synonyms can reduce the set of tokens.
◦ Most text-mining software gives the user the ability to manually specify terms to
include or exclude as tokens.
The use of slang, humor, and sarcasm can cause interpretation problems and might
require more sophisticated data cleansing and subjective intervention on the part of the
analyst to avoid misinterpretation.
Data preprocessing parses the original text data down to the set of tokens deemed
relevant for the topic being studied.
101. Frequency Term-Document Matrix
◦ When the documents in a corpus contain many words and when the frequency
of word occurrence is important to the context of the business problem,
preprocessing can be used to develop a frequency term-document matrix.
◦ A frequency term-document matrix is a matrix whose rows represent
documents and columns represent tokens, and the entries in the matrix are the
frequency of occurrence of each token in each document.
105. Use Hierarchical Cluster analysis to group comments
On the Presence / Absence Term-Document Matrix
# Selected Variables 7
Selected Variables delay flight horribl reclin rude seat servic
Hierarchical Clustering: Fitting Parameters
Similarity Measure JACCARD
Clustering Method COMPLETE LINKAGE
Hierarchical Clustering: Model Parameters
Cluster Assignment TRUE
# Clusters 3
Hierarchical Clustering: Reporting Parameters
Normalized? FALSE
Draw Dendrogram? TRUE
Maximum Number of Leaves in Dendrogram 10
Data Type Raw Data
106. Cluster Assignments
Record ID Cluster Sub-Cluster
Record 1 1 1
Record 2 2 2
Record 3 3 3
Record 4 2 4
Record 5 1 5
Record 6 1 6
Record 7 3 7
Record 8 2 8
Record 9 3 9
Record 10 2 10
Record ID Cluster Sub-Cluster Comments
Record 1 1 1 The wi-fi service was horrible. It was slow and cut off several times.
Record 2 2 2 My seat was uncomfortable.
Record 3 3 3 My flight was delayed 2 hours for no apparent reason.
Record 4 2 4 My seat would not recline.
Record 5 1 5 The man at the ticket counter was rude. Service was horrible.
Record 6 1 6 The flight attendant was rude. Service was bad.
Record 7 3 7 My flight was delayed with no explanation.
Record 8 2 8 My drink spilled when the guy in front of me reclined his seat.
Record 9 3 9 My flight was canceled.
Record 10 2 10 The arm rest of my seat was nasty.
107. Comments Grouped by Cluster
Record ID Cluster Sub-Cluster Comments
Record 1 1 1 The wi-fi service was horrible. It was slow and cut off several times.
Record 5 1 5 The man at the ticket counter was rude. Service was horrible.
Record 6 1 6 The flight attendant was rude. Service was bad.
Record 2 2 2 My seat was uncomfortable.
Record 4 2 4 My seat would not recline.
Record 8 2 8 My drink spilled when the guy in front of me reclined his seat.
Record 10 2 10 The arm rest of my seat was nasty.
Record 3 3 3 My flight was delayed 2 hours for no apparent reason.
Record 7 3 7 My flight was delayed with no explanation.
Record 9 3 9 My flight was canceled.
Cluster 1 {1, 5, 6}: documents about service issues
Cluster 2 {2, 4, 8, 10}: documents about seat issues
Cluster 3 {3, 7, 9}: documents about scheduling issues
108. Example: Movie Review Frequency Term-Document Matrix
Terms
Document Great Terrible
1 5 0
2 5 1
3 5 1
4 3 3
5 5 1
6 0 5
7 4 1
8 5 3
9 1 3
10 1 2
We have 10 reviews from
movie critics. After using
preprocessing techniques,
including text reduction
by synonyms, the number
of tokens is down to two:
Great and Terrible.
109. Frequency Term-document matrix
Terms
Document Great Terrible
1 5 0
2 5 1
3 5 1
4 3 3
5 5 1
6 0 5
7 4 1
8 5 3
9 1 3
10 1 2
Apply k-means clustering to the frequency term-document matrix with k = 2.
The process of clustering /
categorizing comments or
reviews as positive, negative
or neutral is known as
sentiment analysis
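A minimal sketch of this clustering, assuming scikit-learn rather than Analytic Solver (cluster numbering may differ from the slides' 1/2 labels):

```python
# Sketch: k-means with k = 2 on the movie-review frequency term-document matrix.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[5, 0], [5, 1], [5, 1], [3, 3], [5, 1],
              [0, 5], [4, 1], [5, 3], [1, 3], [1, 2]])  # columns: Great, Terrible

# Raw counts are used (the slides set Normalized? to FALSE for this run)
km = KMeans(n_clusters=2, n_init=10, random_state=12345).fit(X)
print(km.labels_)           # 0/1 cluster per review
print(km.cluster_centers_)  # e.g., one "positive" and one "negative" centroid
```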
110. Variables
# Selected Variables 2
Selected Variables Great Terrible
K-Means Clustering: Fitting Parameters
# Clusters 2
Start type Random Start
# Iterations 10
Random seed: initial centroids 12345
K-Means Clustering: Reporting Parameters
Show data summary TRUE
Show distance from each cluster TRUE
Normalized? FALSE
Cluster Great Terrible
Cluster 1 4.5714286 1.428571429
Cluster 2 0.6666667 3.333333333
Cluster Size Average Distance
Cluster 1 7 1.1250
Cluster 2 3 1.2136
Total 10 1.1516
One random start and 10 iterations
111. Distances from Each Cluster
Record ID Cluster Dist.Cluster-1 Dist.Cluster-2
Record 1 1 1.491 5.467
Record 2 1 0.606 4.922
Record 3 1 0.606 4.922
Record 4 1 2.222 2.357
Record 5 1 0.606 4.922
Record 6 2 5.801 1.795
Record 7 1 0.714 4.069
Record 8 1 1.629 4.346
Record 9 2 3.902 0.471
Record 10 2 3.617 1.374
112. Cluster Assignments with Term Frequencies
Record ID Cluster Great Terrible
Record 1 1 5 0
Record 2 1 5 1
Record 3 1 5 1
Record 4 1 3 3
Record 5 1 5 1
Record 6 2 0 5
Record 7 1 4 1
Record 8 1 5 3
Record 9 2 1 3
Record 10 2 1 2
113. Records Sorted by Cluster
Record ID Cluster Great Terrible
Record 1 1 5 0
Record 2 1 5 1
Record 3 1 5 1
Record 4 1 3 3
Record 5 1 5 1
Record 7 1 4 1
Record 8 1 5 3
Record 6 2 0 5
Record 9 2 1 3
Record 10 2 1 2
Cluster 1 reviews tend to be positive; Cluster 2 reviews tend to be negative. Note that (3, 3) corresponds to a balanced review, yet it is classified in Cluster 1.
114. Sentiment Analysis
The process of clustering / categorizing
comments or reviews as positive, negative
or neutral is known as sentiment analysis
115. Text mining is
• how a business analyst turns 50,000 hotel guest reviews
into specific recommendations;
• how a workforce analyst improves productivity and
reduces employee turnover;
• how healthcare providers and biopharma
researchers understand patient experiences;
• and much, much more.
SAS Text Miner, Python, R, SPSS …