MBA 643
Dr Danielle Morin
Fall 2022
CHAPTER 4
DESCRIPTIVE DATA MINING
•Hierarchical Clustering
•K-Means Clustering
•Association Rules
•Text Mining
Introduction
The increase in the use of data-mining techniques in
business has been caused largely by three events:
◦ The explosion in the amount of data being produced
and electronically tracked
◦ The ability to electronically warehouse these data
◦ The affordability of computer power to analyze the
data
Observation
A set of recorded values of variables associated
with a single entity. It is a row of values in a
spreadsheet or database, in which the columns
correspond to the variables
Example: a 35-year-old single male dentist; Donation 2016 = $1000; Donation 2017 = $2000
Supervised or Unsupervised Learning
Data-mining approaches can be separated into two categories:
◦ Supervised learning: for prediction and classification
◦ Unsupervised learning: to detect patterns and relationships in the data
◦ Unsupervised learning can be thought of as high-dimensional descriptive analytics, designed to describe patterns and relationships in large data sets with many observations of many variables
◦ There is no outcome variable to predict and no definitive measure of accuracy; assessment is qualitative
Cluster Analysis
Goal: to segment observations into similar groups based on observed
variables
Can be employed during the data-preparation step to identify variables
or observations that can be aggregated or removed from
consideration
Commonly used in marketing to divide customers into different
homogenous groups; known as market segmentation
Used to identify outliers
Cluster Methods
Bottom-up hierarchical clustering starts with each observation
belonging to its own cluster and then sequentially merges the most
similar clusters to create a series of nested clusters
k-means clustering assigns each observation to one of k clusters in
a manner such that the observations assigned to the same cluster
are as similar as possible
Both methods depend on how the similarity of two observations is defined;
hence, we must measure similarity between observations
Three influential factors
Hierarchical versus nonhierarchical clustering
The measurement of the distance between observations
The measurement of the distance between clusters
Measurement of Distances between observations
Euclidean distance
Matching coefficients
Jaccard coefficients
Measuring Similarity Between Observations
Euclidean distance: Most common method to measure dissimilarity
between observations, when observations include continuous variables
Let observations u = (u1, u2, . . . , uq) and v = (v1, v2, . . . , vq) each comprise
measurements of q variables
The Euclidean distance between observations u and v is:
d(u,v) = √[(u₁ − v₁)² + (u₂ − v₂)² + · · · + (u_q − v_q)²]
NOTE: This measure of distance is highly influenced by the scale on
which the variables are measured
Calculate the Euclidean Distance
Euclidean distance becomes smaller as a pair of observations become more
similar with respect to their variable values
Example
Euclidean distance between
(2 cars, 5 children) and (1 car, 3 children)
(2, 5) and (1, 3)
Distance = √[(2 − 1)² + (5 − 3)²] = √5 = 2.24
Euclidean Distance
Euclidean distance is highly influenced by the scale on which
variables are measured
◦ Common to standardize the units of each variable j of
each observation u
◦ Example: uj, the value of variable j in observation u, is
replaced with its z-score, zj
The conversion to z-scores also makes it easier to identify
outlier measurements, which can distort the Euclidean
distance between observations
Age Female Income Married Children CarLoan Mortgage
48 1 17546.00 0 1 0 0
40 0 30085.10 1 3 1 1
51 1 16575.40 1 0 1 0
23 1 20375.40 1 3 0 0
57 1 50576.30 1 0 0 0
57 1 37869.60 1 2 0 0
22 0 8877.07 0 0 0 0
58 0 24946.60 1 0 1 0
37 1 25304.30 1 2 1 0
54 0 24212.10 1 2 1 0
66 1 59803.90 1 0 0 0
52 1 26658.80 0 0 1 1
44 1 15735.80 1 1 0 1
66 1 55204.70 1 1 1 1
36 0 19474.60 1 0 0 1
38 1 22342.10 1 0 1 1
37 1 17729.80 1 2 0 1
46 1 41016.00 1 0 0 1
62 1 26909.20 1 0 0 0
31 0 22522.80 1 0 1 0
61 0 57880.70 1 2 0 0
50 0 16497.30 1 2 0 0
54 0 38446.60 1 0 0 0
27 1 15538.80 0 0 1 1
22 0 12640.30 0 2 1 0
56 0 41034.00 1 0 1 1
45 0 20809.70 1 0 0 1
39 1 20114.00 1 1 0 0
39 1 29359.10 0 3 1 1
61 0 24270.10 1 1 0 0
Example
A financial advising company that provides personalized financial advice to its clients would like to segment its customer pool into several groups (clusters) to serve them better.
Variables: Age,
Gender (1 if female, 0 if male),
Annual Income,
Married (1) or not married (0),
Number of children,
Car loan (1 if yes, 0 if not),
Mortgage (1 if yes, 0 if not)
Example:
Consider only Age and
Income
Age Income
48 17546.00
40 30085.10
51 16575.40
23 20375.40
57 50576.30
57 37869.60
22 8877.07
58 24946.60
37 25304.30
54 24212.10
66 59803.90
52 26658.80
44 15735.80
66 55204.70
36 19474.60
38 22342.10
37 17729.80
46 41016.00
62 26909.20
31 22522.80
61 57880.70
50 16497.30
54 38446.60
27 15538.80
22 12640.30
56 41034.00
45 20809.70
39 20114.00
39 29359.10
The Euclidean distance between the first two observations, (48, 17546.00) and (40, 30085.10):
D = √[(48 − 40)² + (17546.00 − 30085.10)²] ≈ 12539.1
This dissimilarity measure is dominated by the large values of Income.
It would be better to use the z-score of each variable to remove the effect of the different units.
Age Income
average 45.97 28011.87
st.dev. 13.04 13703.28
Zage Zincome
0.16 -0.76
-0.46 0.15
0.39 -0.83
-1.76 -0.56
0.85 1.65
0.85 0.72
-1.84 -1.40
0.92 -0.22
-0.69 -0.20
0.62 -0.28
1.54 2.32
0.46 -0.10
-0.15 -0.90
1.54 1.98
-0.76 -0.62
-0.61 -0.41
-0.69 -0.75
0.00 0.95
1.23 -0.08
-1.15 -0.40
1.15 2.18
0.31 -0.84
0.62 0.76
-1.45 -0.91
-1.84 -1.12
0.77 0.95
-0.07 -0.53
-0.53 -0.58
-0.53 0.10
Standardized distance between the first two observations:
(48, 17546.00) & (40, 30085.10) → z-scores (0.16, −0.76) & (−0.46, 0.15)
z_Age = (48 − 45.97) / 13.04 = 0.16
St. Distance = √[(0.16 − (−0.46))² + (−0.76 − 0.15)²] = 1.101
The z-score conversion also helps identify outliers.
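A minimal NumPy sketch of these two calculations (the function and variable names are illustrative; the means and standard deviations are the values reported above):

```python
import numpy as np

def euclidean(u, v):
    """Euclidean distance between two observations (vectors of equal length)."""
    return float(np.sqrt(np.sum((np.asarray(u) - np.asarray(v)) ** 2)))

obs1 = np.array([48.0, 17546.00])   # (Age, Income) of observation 1
obs2 = np.array([40.0, 30085.10])   # (Age, Income) of observation 2
print(euclidean(obs1, obs2))        # ~12539.1, dominated by Income

# Standardize with the sample means and standard deviations from the slides
mean = np.array([45.97, 28011.87])
std = np.array([13.04, 13703.28])
z1, z2 = (obs1 - mean) / std, (obs2 - mean) / std
print(z1, z2)                       # ~(0.16, -0.76) and (-0.46, 0.15)
print(euclidean(z1, z2))            # ~1.10, Income no longer dominates
```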
Matching Coefficients
For categorical variables encoded as 0–1, a better measure
of similarity between two observations can be achieved by
counting the number of variables with matching values
The simplest overlap measure is called the matching coefficient: the number of variables with matching values divided by the total number of variables.
Example
Matching Coefficient for (1, 0, 1, 0, 1) and (0, 1, 1, 0, 1)
(1, 0, 1, 0, 1) and (0, 1, 1, 0, 1) = 3/5 = 0.60
Jaccard’s Coefficient
A weakness of the matching coefficient is that if two observations both
have a 0 entry for a categorical variable, this is counted as a sign of
similarity between the two observations
To avoid misstating similarity due to the absence of a feature, a similarity
measure called Jaccard's coefficient does not count matching zero entries: it is
the number of variables where both observations equal 1, divided by the total
number of variables minus the number of variables where both equal 0.
Example
Jaccard’s Coefficient for (1, 0, 1, 0, 1) and (0, 1, 1, 0, 1)
(1, 0, 1, 0, 1) and (0, 1, 1, 0, 1)
(1, 0, 1, 0, 1) and (0, 1, 1, 0, 1) = 2/(5-1) = 0.50
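A short Python sketch of both coefficients (the function names are mine; the logic follows the definitions above):

```python
def matching_coefficient(u, v):
    """Share of binary variables on which the two observations agree (0-0 or 1-1)."""
    return sum(a == b for a, b in zip(u, v)) / len(u)

def jaccard_coefficient(u, v):
    """Matching 1-1 entries divided by the number of variables, excluding 0-0 matches."""
    both_one = sum(a == 1 and b == 1 for a, b in zip(u, v))
    both_zero = sum(a == 0 and b == 0 for a, b in zip(u, v))
    return both_one / (len(u) - both_zero)

u, v = (1, 0, 1, 0, 1), (0, 1, 1, 0, 1)
print(matching_coefficient(u, v))   # 3/5 = 0.60
print(jaccard_coefficient(u, v))    # 2/(5-1) = 0.50
```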
Example
Calculate the Matching Coefficient and the Jaccard’s Coefficient for
the first five observations. Consider only categorical variables
Age Female Income Married Children CarLoan Mortgage
1 48 1 17546 0 1 0 0
2 40 0 30085 1 3 1 1
3 51 1 16575 1 0 1 0
4 23 1 20375 1 3 0 0
5 57 1 50576 1 0 0 0
Female Married CarLoan Mortgage
1 1 0 0 0
2 0 1 1 1
3 1 1 1 0
4 1 1 0 0
5 1 1 0 0
Example
Matching Coefficient
Similarity matrix
Obs. 1 2 3 4 5
1 1
2 0 1
3 2/4=0.5 2/4=0.5 1
4 3/4=0.75 1/4=0.25 3/4=0.75 1
5 3/4=0.75 1/4 =0.25 3/4=0.75 4/4=1 1
Female Married CarLoan Mortgage
1 1 0 0 0
2 0 1 1 1
3 1 1 1 0
4 1 1 0 0
5 1 1 0 0
Matching coefficients between
observations 1 and 3:
(1 0 0 0) (1 1 1 0)
Matching Coefficient
= 2/4 = 0.5
Example
Matching Coefficient
Similarity matrix
Obs. 1 2 3 4 5
1 1
2 0 1
3 2/4=0.50 2/4=0.50 1
4 3/4=0.75 1/4=0.25 3/4=0.75 1
5 3/4=0.75 1/4 =0.25 3/4=0.75 4/4=1 1
Female Married CarLoan Mortgage
1 1 0 0 0
2 0 1 1 1
3 1 1 1 0
4 1 1 0 0
5 1 1 0 0
Matching coefficient between
Observations 1 and 4 = ¾ = 0.75,
and therefore more similar
than observations 2 and 3 (0.5)
Example
Jaccard’s Coefficient
Similarity Matrix
Obs. 1 2 3 4 5
1 1
2 0 1
3 1/3=0.33 2/4=0.50 1
4 ½=0.5 ¼=0.25 2/3=0.67 1
5 ½=0.5 ¼=0.25 2/3=0.67 2/2=1 1
Female Married CarLoan Mortgage
1 1 0 0 0
2 0 1 1 1
3 1 1 1 0
4 1 1 0 0
5 1 1 0 0
Jaccard coefficient between
observations 1 and 3:
(1 0 0 0) (1 1 1 0) — the positions where both observations are 0 are not counted
Jaccard coefficient =
1/(4-1) = 1/3
Example
Jaccard’s Coefficient
Similarity Matrix
Obs. 1 2 3 4 5
1 1
2 0 1
3 1/3=0.33 2/4=0.50 1
4 ½=0.50 ¼=0.25 2/3=0.67 1
5 ½=0.50 ¼=0.25 2/3=0.67 2/2=1 1
Female Married CarLoan Mortgage
1 1 0 0 0
2 0 1 1 1
3 1 1 1 0
4 1 1 0 0
5 1 1 0 0
According to Jaccard:
Observations 1 and 4 are
equally similar (0.5)
to observations 2 and 3 (0.5)
According to Matching:
Observations 1 and 4 are more
similar (0.75) than observations
2 and 3 (0.5)
Hierarchical Clustering
◦Determines the similarity of two clusters by considering the similarity
between the observations composing either cluster
◦Starts with each observation in its own cluster and then iteratively
combines the two clusters that are the most similar into a single cluster
◦Given a way to measure similarity between observations, there are
several clustering method alternatives for comparing observations in two
clusters to obtain a cluster similarity measure
Cluster similarity measures
◦Single linkage (closest neighbor)
◦Complete linkage (furthest neighbor)
◦Group average linkage
◦Median linkage
◦Centroid linkage
◦Ward’s method
◦McQuitty’s method
Similarity Measures between clusters
• Single linkage: the similarity between two clusters is defined by the similarity of the pair of observations (one from each cluster) that are the most similar
• Complete linkage: defines the similarity between two clusters as the similarity of the pair of observations (one from each cluster) that are the most different
• Group average linkage: defines the similarity between two clusters as the average similarity computed over all pairs of observations between the two clusters
• Median linkage: analogous to group average linkage except that it uses the median of the similarities computed between all pairs of observations between the two clusters
• Centroid linkage: uses the averaging concept of cluster centroids to define between-cluster similarity
Measuring Similarity Between Clusters
Single linkage will consider
two clusters to be close if an
observation in one of the
clusters is close to at least
one observation in the
other cluster. The cluster
formed by merging two
clusters that are close with
respect to single linkage
may also consist of pairs of
observations that are very
different.
Complete linkage will
consider two clusters to be
close if their most different
pair of observations are
close. This method produces
clusters such that all
member observations of a
cluster are relatively close
to each other. This
clustering can be distorted
by outlier observations.
The single and complete linkage
methods base the similarity between
two clusters on a single pair of
observations
Measuring Similarity Between Clusters
Group average linkage is based on
all the pairs of observations
between the two clusters. If Cluster 1
has n1 observations and Cluster 2
has n2 observations, then the
similarity measure is the average
over all n1 × n2 pairs. This method
produces clusters that are less
dominated by the similarity
between single pairs of observations.
Centroid Linkage is based on
the average calculated in
each cluster which is called
the centroid. The similarity
between the clusters is
defined as the similarity of
the centroids.
The Median Linkage is similar
to the Group Average
Linkage, except that it uses
the Median instead of the
average
Example
Consider the following matrix of distances between pairs of 5 objects:
1 2 3 4 5
1 0 9 3 6 11
2 9 0 7 5 10
3 3 7 0 9 2
4 6 5 9 0 8
5 11 10 2 8 0
D =
Classify using Hierarchical Clustering with Single linkage
1 2 3 4 5
1 0 9 3 6 11
2 9 0 7 5 10
3 3 7 0 9 2
4 6 5 9 0 8
5 11 10 2 8 0
D =
Classify using Hierarchical Clustering with Single linkage
1 2 4 (3,5)
1 0
2 9 0
4 6 5 0
(3,5) 3 7 8 0
D[1,(3,5)]=min[D(1,3),D(1,5)]=min[3,11]=3
D[2,(3,5)]=min[D(2,3),D(2,5)]=min[7,10]=7
D[4,(3,5)]=min[D(4,3),D(4,5)]=min[9,8]=8
2 4 (1,3,5)
2 0
4 5 0
(1,3,5) 7 6 0
D[2,(1,3,5)]=min[D(2,1),D(2,3),D(2,5)]=min[9,7,10]=7
D[4,(1,3,5)]=min[D(4,1),D(4,3),D(4,5)]=min[6,9,8]=6
(2,4) (1,3,5)
(2,4) 0
(1,3,5) 6 0
D[(2,4),(1,3,5)]=min[D[2,(1,3,5)],D[4,(1,3,5)]]=min[7,6]=6
1 2 3 4 5
1 0 9 3 6 11
2 9 0 7 5 10
3 3 7 0 9 2
4 6 5 9 0 8
5 11 10 2 8 0
Classify using Hierarchical Clustering with Single linkage
1 2 4 (3,5)
1 0
2 9 0
4 6 5 0
(3,5) 3 7 8 0
2 4 (1,3,5)
2 0
4 5 0
(1,3,5) 7 6 0
(2,4) (1,3,5)
(2,4) 0
(1,3,5) 6 0
Step Distance Merger Clusters # clusters
1 2 3 with 5 1, 2, 4, (3,5) 4
2 3 1 with (3,5) 2, 4, (1,3,5) 3
3 5 2 with 4 (2,4), (1,3,5) 2
4 6 (2,4) with (1,3,5) (1,2,3,4,5) 1
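The same single-linkage agglomeration can be reproduced with SciPy (a sketch; this uses the open-source scipy library rather than the Frontline Solver add-in shown below):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Distance matrix D between the 5 objects
D = np.array([[ 0,  9,  3,  6, 11],
              [ 9,  0,  7,  5, 10],
              [ 3,  7,  0,  9,  2],
              [ 6,  5,  9,  0,  8],
              [11, 10,  2,  8,  0]], dtype=float)

# linkage() expects the condensed (upper-triangle) form of the distance matrix
Z = linkage(squareform(D), method="single")
print(Z)   # merge distances 2, 3, 5, 6 -- the same steps as the table above

# Cut the tree into 2 clusters: {2, 4} and {1, 3, 5}
print(fcluster(Z, t=2, criterion="maxclust"))
```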
DENDROGRAM
Frontline Solver – DataMining – Cluster - Hierarchical Clustering – Distance Matrix– Single Linkage – 4 Clusters
Hierarchical Clustering: Fitting Parameters
Data Type: Distance Matrix; Similarity Measure: EUCLIDEAN; Clustering Method: SINGLE LINKAGE
Hierarchical Clustering: Model Parameters
Cluster Assignment: TRUE; # Clusters: 4
Hierarchical Clustering: Reporting Parameters
Normalized? FALSE; Draw Dendrogram? TRUE; Maximum Number of Leaves in Dendrogram: 5
Clustering Stages
RESULTS
Cluster Labels
Stage Cluster 1 Cluster 2 Distance
Stage1 3 5 2
Stage2 1 3 3
Stage3 2 4 5
Stage4 1 2 6
Record ID Cluster Sub-Cluster
Record 1 1 1
Record 2 2 2
Record 3 3 3
Record 4 4 4
Record 5 3 5
[Dendrogram (single linkage): leaves 3, 5, 1, 2, 4 on the horizontal axis; Distance (0 to 7) on the vertical axis]
Example
Consider the following matrix of distances between pairs of 5 objects:
1 2 3 4 5
1 0 9 3 6 11
2 9 0 7 5 10
3 3 7 0 9 2
4 6 5 9 0 8
5 11 10 2 8 0
D =
Classify using Hierarchical Clustering with Complete linkage
Frontline Solver – DataMining – Cluster - Hierarchical Clustering – Distance Matrix– Complete Linkage – 4 Clusters
1 2 3 4 5
1 0 9 3 6 11
2 9 0 7 5 10
3 3 7 0 9 2
4 6 5 9 0 8
5 11 10 2 8 0
D =
Classify using Hierarchical Clustering with Complete linkage
1 2 4 (3,5)
1 0
2 9 0
4 6 5 0
(3,5) 11 10 9 0
D[1,(3,5)]=max[D(1,3),D(1,5)]=max[3,11]=11
D[2,(3,5)]=max[D(2,3),D(2,5)]=max [7,10]=10
D[4,(3,5)]=max[D(4,3),D(4,5)]=max[9,8]=9
1 (2,4) (3,5)
1 0
(2,4) 9 0
(3,5) 11 10 0
D[1,(2,4)]=max[D(1,2),D(1,4)]=max[9,6]=9
D[1,(3,5)]=max[D(1,3),D(1,5)]=max[3,11]=11
D[(2,4),(3,5)]=max[D[2,(3,5)],D[4,(3,5)]]=max[10,9]=10
(1,2,4) (3,5)
(1,2,4) 0
(3,5) 11 0
D[(1,2,4),(3,5)]=max[D[1,(3,5)],D[(2,4),(3,5)]]=max[11,10]=11
Clustering Stages
RESULTS
Cluster Labels
Stage Cluster 1 Cluster 2 Distance
Stage1 3 5 2
Stage2 2 4 5
Stage3 1 2 9
Stage4 1 3 11
Record ID Cluster Sub-Cluster
Record 1 1 1
Record 2 2 2
Record 3 3 3
Record 4 4 4
Record 5 3 5
Hierarchical Cluster Complete Linkage
[Dendrogram (complete linkage): leaves 3, 5, 1, 2, 4 on the horizontal axis; Distance (0 to 12) on the vertical axis]
Cluster Analysis - More measures
Centroid linkage uses the averaging concept of cluster centroids to
define between-cluster similarity
Ward’s method merges two clusters such that the dissimilarity of the
observations with the resulting single cluster increases as little as
possible
When McQuitty's method considers merging two clusters A and B, the
dissimilarity of the resulting cluster AB to any other cluster C is calculated
as ((dissimilarity between A and C) + (dissimilarity between B and C)) / 2
A dendrogram is a chart that depicts the set of nested clusters resulting
at each step of aggregation
Example
Female Married CarLoan Mortgage
1 0 0 0
0 1 1 1
1 1 1 0
1 1 0 0
1 1 0 0
1 1 0 0
0 0 0 0
0 1 1 0
1 1 1 0
0 1 1 0
1 1 0 0
1 0 1 1
1 1 0 1
1 1 1 1
0 1 0 1
1 1 1 1
1 1 0 1
1 1 0 1
1 1 0 0
0 1 1 0
0 1 0 0
0 1 0 0
0 1 0 0
1 0 1 1
0 0 1 0
0 1 1 1
0 1 0 1
1 1 0 0
1 0 1 1
0 1 0 0
Analyse using
Hierarchical Clustering
Matching Coefficients
Group Average linkage
Inputs
Data: Workbook: Data Chapter 4.xlsx; Worksheet: KYC; Range: $A$1:$G$31; # Records in the input data: 30
Variables: Selected Variables: Female, Married, CarLoan, Mortgage; # Selected Variables: 4
Hierarchical Clustering: Fitting Parameters
Data Type: Raw Data; Similarity Measure: MATCHING; Clustering Method: GROUP AVERAGE
Hierarchical Clustering: Model Parameters
Cluster Assignment: TRUE; # Clusters: 29
Hierarchical Clustering: Reporting Parameters
Normalized? FALSE; Draw Dendrogram? TRUE; Maximum Number of Leaves in Dendrogram: 10
Clustering
stages
Record ID Cluster Sub-Cluster
Record 1 1 1
Record 2 2 2
Record 3 3 3
Record 4 4 1
Record 5 4 1
Record 6 5 1
Record 7 6 4
Record 8 7 2
Record 9 8 3
Record 10 9 2
Record 11 10 1
Record 12 11 5
Record 13 12 6
Record 14 13 7
Record 15 14 8
Record 16 15 7
Record 17 16 6
Record 18 17 6
Record 19 18 1
Record 20 19 2
Record 21 20 9
Record 22 21 9
Record 23 22 9
Record 24 23 5
Record 25 24 10
Record 26 25 2
Record 27 26 8
Record 28 27 1
Record 29 28 5
Record 30 29 9
Stage Cluster 1 Cluster 2 Distance
Stage1 4 5 0
Stage2 4 6 0
Stage3 3 9 0
Stage4 8 10 0
Stage5 4 11 0
Stage6 14 16 0
Stage7 13 17 0
Stage8 13 18 0
Stage9 4 19 0
Stage10 8 20 0
Stage11 21 22 0
Stage12 21 23 0
Stage13 12 24 0
Stage14 2 26 0
Stage15 15 27 0
Stage16 4 28 0
Stage17 12 29 0
Stage18 21 30 0
Stage19 1 4 0.25
Stage20 2 8 0.25
Stage21 3 14 0.25
Stage22 13 15 0.25
Stage23 7 21 0.25
Stage24 1 7 0.3214
Stage25 2 25 0.35
Stage26 3 12 0.375
Stage27 1 13 0.4125
Stage28 2 3 0.5060
Stage29 1 2 0.5905
Sub-Cluster
membership
Each of the 30
observations is
assigned to
one of the 10
sub-clusters
Cluster membership
Cluster Legend (numbers show the record sequence relative to the original data)
Sub-Cluster 1: 1, 4, 5, 6, 11, 19, 28
Sub-Cluster 2: 2, 8, 10, 20, 26
Sub-Cluster 3: 3, 9
Sub-Cluster 4: 7
Sub-Cluster 5: 12, 24, 29
Sub-Cluster 6: 13, 17, 18
Sub-Cluster 7: 14, 16
Sub-Cluster 8: 15, 27
Sub-Cluster 9: 21, 22, 23, 30
Sub-Cluster 10: 25
Analytic Solver Basic limits the Maximum Number of Leaves to 10; setting a value higher than 10 will result in an error message.
If the number of leaves is less than the number of observations, some of the leaves on the horizontal axis will correspond to groups of observations that were combined in early steps of the agglomeration and are not represented individually in the dendrogram.
Sub-cluster #3
At distance 0.251,
we have 7
clusters
At distance 0.42,
we have 3 clusters
The vertical distance
between the two horizontal
lines is the cost of merging
clusters in terms of
decreased homogeneity
within clusters =
0.42-0.251 = 0.169
NOTE: Elongated portions
of the dendrogram
represent mergers of more
dissimilar clusters
Cluster’s durability or strength
The durability or
strength of the cluster is
measured by the
difference between the
distance value at which
a cluster is originally
formed and the distance
value at which it is
merged with another
cluster.
Cluster 8 is formed at
distance 0 and merged
with cluster 6 at
distance 0.25.
Strength of cluster 8 =
0.25 – 0 = 0.25
At distance = 0.42,
we have 3 clusters
Cluster 1 Cluster 2 Cluster 3
Cluster 1: (sub-clusters 3, 7, 5)
{3, 9, 14, 16, 12, 24, 29}
Cluster 2: (sub-clusters 2, 10)
{2, 8, 10, 20, 26, 25}
Cluster 3: (sub-clusters 1, 4, 9, 6, 8)
{1, 4, 5, 6, 11, 19, 28, 7, 21, 22, 23,
30, 13, 17, 18, 15, 27}
Hierarchical Clustering
Matching Coefficients and Group Average linkage
Cluster 3: {4, 5, 6, 11, 19, 28, 1, 7, 21, 22, 23, 30, 13, 17, 18, 15, 27}
Cluster 2: {2, 26, 8, 10, 20, 25}
Cluster 1: {3, 9, 14, 16, 12, 24, 29}
Cluster 1
Cluster 2
Cluster 3
Observation Female Married CarLoan Mortgage
1 1 0 0 0
2 0 1 1 1
3 1 1 1 0
4 1 1 0 0
5 1 1 0 0
6 1 1 0 0
7 0 0 0 0
8 0 1 1 0
9 1 1 1 0
10 0 1 1 0
11 1 1 0 0
12 1 0 1 1
13 1 1 0 1
14 1 1 1 1
15 0 1 0 1
16 1 1 1 1
17 1 1 0 1
18 1 1 0 1
19 1 1 0 0
20 0 1 1 0
21 0 1 0 0
22 0 1 0 0
23 0 1 0 0
24 1 0 1 1
25 0 0 1 0
26 0 1 1 1
27 0 1 0 1
28 1 1 0 0
29 1 0 1 1
30 0 1 0 0
CLUSTER 1
Obs Female Married Car Loan Mortgage
3 1 1 1 0
9 1 1 1 0
12 1 0 1 1
14 1 1 1 1
16 1 1 1 1
24 1 0 1 1
29 1 0 1 1
Cluster 1
Row Labels Count of Female
0 0
1 7
Grand Total 7
Row Labels Count of Married
0 3
1 4
Grand Total 7
Row Labels Count of Mortgage
0 2
1 5
Grand Total 7
All have car loans
CLUSTER 2
Obs Female Married Car Loan Mortgage
2 0 1 1 1
8 0 1 1 0
10 0 1 1 0
20 0 1 1 0
25 0 0 1 0
26 0 1 1 1
Cluster 2
Row Labels Count of Female
0 6
1 0
Grand Total 6
Row Labels Count of Married
0 1
1 5
Grand Total 6
Row Labels Count of Mortgage
0 4
1 2
Grand Total 6
All have car loans
CLUSTER 3
Obs Female Married Car Loan Mortgage
1 1 0 0 0
4 1 1 0 0
5 1 1 0 0
6 1 1 0 0
7 0 0 0 0
11 1 1 0 0
13 1 1 0 1
15 0 1 0 1
17 1 1 0 1
18 1 1 0 1
19 1 1 0 0
21 0 1 0 0
22 0 1 0 0
23 0 1 0 0
27 0 1 0 1
28 1 1 0 0
30 0 1 0 0
Cluster 3
Row Labels Count of Female
0 7
1 10
Grand Total 17
Row Labels Count of Married
0 2
1 15
Grand Total 17
Row Labels Count of Mortgage
0 12
1 5
Grand Total 17
None have car loans
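This three-cluster solution can be approximated in Python with SciPy and pandas (a sketch only; the DataFrame df is assumed to hold the 30 customers from the workbook named in the parameter listing above, and SciPy's tie-breaking may not match Analytic Solver's exactly):

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Assumed: the workbook/worksheet names are those shown in the Inputs slide above
df = pd.read_excel("Data Chapter 4.xlsx", sheet_name="KYC")
X = df[["Female", "Married", "CarLoan", "Mortgage"]].to_numpy()

# 'hamming' distance = 1 - matching coefficient; 'average' = group average linkage
Z = linkage(pdist(X, metric="hamming"), method="average")
df["Cluster"] = fcluster(Z, t=3, criterion="maxclust")

# Per-cluster proportions, mirroring the pivot-table summaries above
print(df.groupby("Cluster")[["Female", "Married", "CarLoan", "Mortgage"]].mean())
```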
APPLICATIONS OF HIERARCHICAL CLUSTERING
Hierarchical clustering is often used on DNA microarrays.
Clustering of gene expression profiles is often used to try
to discover subclasses of disease.
Validation of these clusters is important for accurate
scientific interpretation of the results.
k-Means Clustering
◦ Given a value of k, the k-means algorithm randomly partitions the observations into k clusters
◦ After all observations have been assigned to a cluster, the resulting cluster centroids are calculated
◦ Using the updated cluster centroids, all observations are reassigned to the cluster with the closest centroid, using Euclidean distance
◦ The centroid updates and reassignments repeat until no observations change clusters
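A scikit-learn sketch of this procedure (illustrative only; the workbook and sheet names are those shown in the Analytic Solver data summary below, and scikit-learn's random starts will not match the solver's seed exactly):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_excel("DemoKTC.xlsx", sheet_name="Data")       # 30 customer records

X = StandardScaler().fit_transform(df[["Age", "Income"]])   # z-score both variables

# k = 3 with 10 random starts; the start with the lowest total
# within-cluster sum of squares is kept
km = KMeans(n_clusters=3, n_init=10, random_state=12345).fit(X)

df["Cluster"] = km.labels_
print(km.cluster_centers_)                                   # centroids (standardized)
print(df.groupby("Cluster")[["Age", "Income"]].mean())       # centroids in original units
```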
Example
Age, Income, Number of
Children
Age Income Children
48 17546 1
40 30085 3
51 16575 0
23 20375 3
57 50576 0
57 37870 2
22 8877 0
58 24947 0
37 25304 2
54 24212 2
66 59804 0
52 26659 0
44 15736 1
66 55205 1
36 19475 0
38 22342 0
37 17730 2
46 41016 0
62 26909 0
31 22523 0
61 57881 2
50 16497 2
54 38447 0
27 15539 0
22 12640 2
56 41034 0
45 20810 0
39 20114 1
39 29359 3
61 24270 1
Example
Clustering Observations by Age and Income
Using k-Means Clustering with k = 3
[Scatter chart of the three clusters: Cluster 2 is the most heterogeneous, Cluster 1 is the largest, and Cluster 3 is the most homogeneous]
Data
Workbook DemoKTC.xlsx
Worksheet Data
Range $A$1:$G$31
# Records in the input data 30
# Selected Variables 2
Selected Variables Age Income
K-Means Clustering: Fitting Parameters
# Clusters 3
Start type Random Start
# Iterations 10
Random seed: initial centroids 12345
K-Means Clustering: Reporting Parameters
Show data summary TRUE
Show distance from each cluster TRUE
Normalized? TRUE
Cluster using Age
and Income only
Normalize
Cluster Age Income
Cluster 1 -1.837607 -1.121744243
Cluster 2 -1.837607 -1.121744243
Cluster 3 0.7692901 0.950292945
Cluster Age Income
Cluster 1 -0.610832 -0.41375302
Cluster 2 -0.687505 -0.197585752
Cluster 3 0.8459635 0.719370083
Cluster Age Income
Cluster 1 -0.534158 -0.576349161
Cluster 2 1.5360244 1.984403237
Cluster 3 0.3859229 -0.83457936
Cluster Age Income
Cluster 1 0.8459635 1.646644617
Cluster 2 -1.147546 -0.400566393
Cluster 3 1.5360244 2.320030979
Cluster Age Income
Cluster 1 1.5360244 1.984403237
Cluster 2 -0.150791 -0.895849375
Cluster 3 -1.837607 -1.121744243
Start 1. Sum of Squares: 45.069043
Start 2. Sum of Squares: 24.830185
Best: Start 3. Sum of Squares: 22.397972
Start 4. Sum of Squares: 35.082695
Start 5. Sum of Squares: 25.533615
Cluster Age Income
Cluster 1 -1.837607 -1.396366871
Cluster 2 -1.837607 -1.396366871
Cluster 3 -0.687505 -0.750336738
Cluster Age Income
Cluster 1 0.8459635 0.719370083
Cluster 2 -1.147546 -0.400566393
Cluster 3 1.2293307 -0.080467783
Cluster Age Income
Cluster 1 -0.610832 -0.41375302
Cluster 2 1.2293307 -0.080467783
Cluster 3 -0.764179 -0.623009532
Cluster Age Income
Cluster 1 -0.074118 -0.525580284
Cluster 2 -0.610832 -0.41375302
Cluster 3 0.4625964 -0.098740784
Cluster Age Income
Cluster 1 1.2293307 -0.080467783
Cluster 2 -0.687505 -0.197585752
Cluster 3 0.6159432 -0.277289314
Start 7. Sum of Squares: 23.072807
Start 8. Sum of Squares: 34.274493
Start 9. Sum of Squares: 35.570412
Start 10. Sum of Squares: 34.630941
Start 6. Sum of Squares: 85.235001
10 iterations
Best Start
with
smallest SS
Cluster Age Income
Cluster 1 -1.0261 -0.5581
Cluster 2 0.9131 1.4389
Cluster 3 0.5009 -0.4813
Cluster Cluster 1 Cluster 2 Cluster 3
Cluster 1 0.0000 2.7836 1.5290
Cluster 2 2.7836 0.0000 1.9639
Cluster 3 1.5290 1.9639 0.0000
Cluster Size Average Distance
Cluster 1 12 0.6215
Cluster 2 8 0.7387
Cluster 3 10 0.5202
Total 30 0.6190
Cluster Centers
Inter-Cluster Distances
Cluster Summary
Normalized data
Evaluate the strength of the clusters
Cluster 2 is most heterogeneous
Distance between Cluster 2 and
Cluster 3 centroids =1.9639
The average observation in Cluster 2
is approximately 2.66 times closer to
Cluster 2 than to Cluster 3.
(1.964/0.739 = 2.66)
Rule of thumb : the ratio of between-cluster distance to
within-cluster distance should exceed 1.0 for useful clusters.
For Cluster 1: within-cluster distance = 0.6215 and the distance between the Cluster 1 and Cluster 2 centroids = 2.7836, so the ratio = 2.7836 / 0.6215 = 4.48. The average observation in Cluster 1 is 4.48 times closer to Cluster 1 than to Cluster 2, and 2.46 times closer than to Cluster 3 (ratio = 1.529 / 0.6215 = 2.46).
Average distance between
observations in the cluster
Distances between centroids
Record ID Cluster Dist.Cluster-1 Dist.Cluster-2 Dist.Cluster-3
Record 1 3 1.1998 2.3291 0.4459
Record 2 1 0.9092 1.8805 1.1484
Record 3 3 1.4389 2.3338 0.3715
Record 4 1 0.7348 3.3369 2.2631
Record 5 2 2.8924 0.2183 2.1558
Record 6 2 2.2665 0.7226 1.2493
Record 7 1 1.1667 3.9503 2.5112
Record 8 3 1.9773 1.6626 0.4942
Record 9 1 0.4946 2.2890 1.2218
Record 10 3 1.6659 1.7417 0.2342
Record 11 2 3.8534 1.0791 2.9865
Record 12 3 1.5580 1.6022 0.3845
Record 13 3 0.9382 2.5657 0.7724
Record 14 2 3.6096 0.8281 2.6742
Record 15 1 0.2699 2.6579 1.2730
Record 16 1 0.4397 2.3988 1.1138
Record 17 1 0.3894 2.7119 1.2185
Record 18 2 1.8247 1.0339 1.5146
Record 19 3 2.3055 1.5519 0.8314
Record 20 1 0.1989 2.7622 1.6505
Record 21 2 3.4990 0.7786 2.7397
Record 22 3 1.3649 2.3578 0.4069
Record 23 2 2.1066 0.7397 1.2481
Record 24 1 0.5543 3.3350 2.0017
Record 25 1 0.9880 3.7580 2.4246
Record 26 2 2.3450 0.5093 1.4566
Record 27 3 0.9526 2.1985 0.5768
Record 28 1 0.4923 2.4810 1.0394
Record 29 1 0.8204 1.9727 1.1863
Record 30 3 2.1974 1.7286 0.6842
Distances from
each cluster
Classify as
group 3
because the
distance is
the smallest
to Cluster 3
Record ID Cluster Dist.Cluster-1 Dist.Cluster-2 Dist.Cluster-3 Age Income
Record 1 3 1.1998 2.3291 0.4459 48.0000 17546.00
Record 2 1 0.9092 1.8805 1.1484 40.0000 30085.10
Record 3 3 1.4389 2.3338 0.3715 51.0000 16575.40
Record 4 1 0.7348 3.3369 2.2631 23.0000 20375.40
Record 5 2 2.8924 0.2183 2.1558 57.0000 50576.30
Record 6 2 2.2665 0.7226 1.2493 57.0000 37869.60
Record 7 1 1.1667 3.9503 2.5112 22.0000 8877.07
Record 8 3 1.9773 1.6626 0.4942 58.0000 24946.60
Record 9 1 0.4946 2.2890 1.2218 37.0000 25304.30
Record 10 3 1.6659 1.7417 0.2342 54.0000 24212.10
Record 11 2 3.8534 1.0791 2.9865 66.0000 59803.90
Record 12 3 1.5580 1.6022 0.3845 52.0000 26658.80
Record 13 3 0.9382 2.5657 0.7724 44.0000 15735.80
Record 14 2 3.6096 0.8281 2.6742 66.0000 55204.70
Record 15 1 0.2699 2.6579 1.2730 36.0000 19474.60
Record 16 1 0.4397 2.3988 1.1138 38.0000 22342.10
Record 17 1 0.3894 2.7119 1.2185 37.0000 17729.80
Record 18 2 1.8247 1.0339 1.5146 46.0000 41016.00
Record 19 3 2.3055 1.5519 0.8314 62.0000 26909.20
Record 20 1 0.1989 2.7622 1.6505 31.0000 22522.80
Record 21 2 3.4990 0.7786 2.7397 61.0000 57880.70
Record 22 3 1.3649 2.3578 0.4069 50.0000 16497.30
Record 23 2 2.1066 0.7397 1.2481 54.0000 38446.60
Record 24 1 0.5543 3.3350 2.0017 27.0000 15538.80
Record 25 1 0.9880 3.7580 2.4246 22.0000 12640.30
Record 26 2 2.3450 0.5093 1.4566 56.0000 41034.00
Record 27 3 0.9526 2.1985 0.5768 45.0000 20809.70
Record 28 1 0.4923 2.4810 1.0394 39.0000 20114.00
Record 29 1 0.8204 1.9727 1.1863 39.0000 29359.10
Record 30 3 2.1974 1.7286 0.6842 61.0000 24270.10
Insert the original
data to analyze the
clusters
Record ID Cluster Dist.Cluster-1 Dist.Cluster-2 Dist.Cluster-3 Age Income
Record 2 1 0.9092 1.8805 1.1484 40.0000 30085.10
Record 4 1 0.7348 3.3369 2.2631 23.0000 20375.40
Record 7 1 1.1667 3.9503 2.5112 22.0000 8877.07
Record 9 1 0.4946 2.2890 1.2218 37.0000 25304.30
Record 15 1 0.2699 2.6579 1.2730 36.0000 19474.60
Record 16 1 0.4397 2.3988 1.1138 38.0000 22342.10
Record 17 1 0.3894 2.7119 1.2185 37.0000 17729.80
Record 20 1 0.1989 2.7622 1.6505 31.0000 22522.80
Record 24 1 0.5543 3.3350 2.0017 27.0000 15538.80
Record 25 1 0.9880 3.7580 2.4246 22.0000 12640.30
Record 28 1 0.4923 2.4810 1.0394 39.0000 20114.00
Record 29 1 0.8204 1.9727 1.1863 39.0000 29359.10
Record 5 2 2.8924 0.2183 2.1558 57.0000 50576.30
Record 6 2 2.2665 0.7226 1.2493 57.0000 37869.60
Record 11 2 3.8534 1.0791 2.9865 66.0000 59803.90
Record 14 2 3.6096 0.8281 2.6742 66.0000 55204.70
Record 18 2 1.8247 1.0339 1.5146 46.0000 41016.00
Record 21 2 3.4990 0.7786 2.7397 61.0000 57880.70
Record 23 2 2.1066 0.7397 1.2481 54.0000 38446.60
Record 26 2 2.3450 0.5093 1.4566 56.0000 41034.00
Record 1 3 1.1998 2.3291 0.4459 48.0000 17546.00
Record 3 3 1.4389 2.3338 0.3715 51.0000 16575.40
Record 8 3 1.9773 1.6626 0.4942 58.0000 24946.60
Record 10 3 1.6659 1.7417 0.2342 54.0000 24212.10
Record 12 3 1.5580 1.6022 0.3845 52.0000 26658.80
Record 13 3 0.9382 2.5657 0.7724 44.0000 15735.80
Record 19 3 2.3055 1.5519 0.8314 62.0000 26909.20
Record 22 3 1.3649 2.3578 0.4069 50.0000 16497.30
Record 27 3 0.9526 2.1985 0.5768 45.0000 20809.70
Record 30 3 2.1974 1.7286 0.6842 61.0000 24270.10
Cluster 1
Cluster 2
Cluster 3
Order the column
Cluster and
expand the
selection to all the
other columns
Record ID Cluster ID Dist. Clust-1 Dist. Clust-2 Dist. Clust-3 Age Income
2 1 0,90921 1,880479 1,14838 40 30085,1
4 1 0,734788 3,336877 2,263141 23 20375,4
7 1 1,166663 3,950271 2,511188 22 8877,07
9 1 0,494644 2,289048 1,221841 37 25304,3
15 1 0,269881 2,657896 1,27302 36 19474,6
16 1 0,439695 2,398833 1,113817 38 22342,1
17 1 0,389384 2,711894 1,218504 37 17729,8
20 1 0,19891 2,762165 1,650456 31 22522,8
24 1 0,554286 3,335008 2,001662 27 15538,8
25 1 0,98799 3,758034 2,424644 22 12640,3
28 1 0,492325 2,481026 1,039444 39 20114
29 1 0,820351 1,972684 1,186339 39 29359,1
5 2 2,892376 0,218347 2,155763 57 50576,3
6 2 2,266453 0,722611 1,249289 57 37869,6
11 2 3,853381 1,079146 2,986474 66 59803,9
14 2 3,6096 0,828076 2,674181 66 55204,7
18 2 1,824724 1,033919 1,514648 46 41016
21 2 3,498976 0,778609 2,73966 61 57880,7
23 2 2,106615 0,739677 1,248115 54 38446,6
26 2 2,344982 0,50928 1,456556 56 41034
1 3 1,199799 2,329113 0,445879 48 17546
3 3 1,438875 2,333751 0,371502 51 16575,4
8 3 1,977273 1,662577 0,494178 58 24946,6
10 3 1,665932 1,741678 0,23422 54 24212,1
12 3 1,55801 1,602226 0,384503 52 26658,8
13 3 0,938242 2,565664 0,772381 44 15735,8
19 3 2,305502 1,551899 0,831416 62 26909,2
22 3 1,364876 2,357765 0,406925 50 16497,3
27 3 0,952585 2,19853 0,576751 45 20809,7
30 3 2,197374 1,728604 0,684194 61 24270,1
Distances from cluster centers are in
normalized coordinates
Cluster #Obs Avg. Dist
Cluster-1 12 0,622
Cluster-2 8 0,739
Cluster-3 10 0,52
Overall 30 0,619
Normalized coordinates
Average
Age
Average
Income
n
Cluster 1 32.58 21363.61 12
Cluster 2 57.88 47728.98 8
Cluster 3 52.50 21416.10 10
K-Means Clustering on Age, Income and Children
Variables: Selected Variables: Age, Income, Children; # Selected Variables: 3
K-Means Clustering: Fitting Parameters
# Clusters: 3; Start type: Random Start; # Iterations: 10; Random seed: initial centroids: 12345
K-Means Clustering: Reporting Parameters
Show data summary: TRUE; Show distance from each cluster: TRUE; Normalized? TRUE
No. of iterations = No. of times that cluster centroids are
recalculated and observations are reassigned to clusters
Cluster Age Income Children
Cluster 1 -1.837607 -1.121744243 0.9870553
Cluster 2 -1.837607 -1.121744243 0.9870553
Cluster 3 0.7692901 0.950292945 -0.863673
Cluster Age Income Children
Cluster 1 -0.610832 -0.41375302 -0.863673
Cluster 2 -0.687505 -0.197585752 0.9870553
Cluster 3 0.8459635 0.719370083 0.9870553
Cluster Age Income Children
Cluster 1 -0.534158 -0.576349161 0.061691
Cluster 2 1.5360244 1.984403237 0.061691
Cluster 3 0.3859229 -0.83457936 -0.863673
Cluster Age Income Children
Cluster 1 0.8459635 1.646644617 -0.863673
Cluster 2 -1.147546 -0.400566393 -0.863673
Cluster 3 1.5360244 2.320030979 -0.863673
Cluster Age Income Children
Cluster 1 1.5360244 1.984403237 0.061691
Cluster 2 -0.150791 -0.895849375 0.061691
Start 1. Sum of Squares: 81.070037
Start 2. Sum of Squares: 51.665475
Best: Start 3. Sum of Squares: 49.715972
Start 4. Sum of Squares: 86.460648
Start 5. Sum of Squares: 54.790692
Cluster Age Income Children
Cluster 1 -1.837607 -1.396366871 -0.863673
Cluster 2 -1.837607 -1.396366871 -0.863673
Cluster 3 -0.687505 -0.750336738 0.9870553
Cluster Age Income Children
Cluster 1 0.8459635 0.719370083 0.9870553
Cluster 2 -1.147546 -0.400566393 -0.863673
Cluster 3 1.2293307 -0.080467783 -0.863673
Cluster Age Income Children
Cluster 1 -0.610832 -0.41375302 -0.863673
Cluster 2 1.2293307 -0.080467783 -0.863673
Cluster 3 -0.764179 -0.623009532 -0.863673
Cluster Age Income Children
Cluster 1 -0.074118 -0.525580284 -0.863673
Cluster 2 -0.610832 -0.41375302 -0.863673
Cluster 3 0.4625964 -0.098740784 -0.863673
Cluster Age Income Children
Cluster 1 1.2293307 -0.080467783 -0.863673
Cluster 2 -0.687505 -0.197585752 0.9870553
Cluster 3 0.6159432 -0.277289314 0.9870553
Start 7. Sum of Squares: 60.531938
Start 8. Sum of Squares: 85.652446
Start 9. Sum of Squares: 86.948365
Start 10. Sum of Squares: 63.737762
Start 6. Sum of Squares: 133.415590
Random Starts Summary
Cluster Centers
Cluster Age Income Children
Cluster 1 -0.7182 -0.6041 0.4935
Cluster 2 1.1833 1.7700 0.0617
Cluster 3 0.4856 0.0211 -0.7711
Inter-Cluster Distances
Cluster Cluster 1 Cluster 2 Cluster 3
Cluster 1 0 3.0722 1.8545
Cluster 2 3.0722 0 2.05893
Cluster 3 1.8545 2.0589 0
Cluster Summary
Cluster Size Average Distance
Cluster 1 15 1.2413
Cluster 2 5 0.9982
Cluster 3 10 0.8147
Total 30 1.0586
Cluster 1 and Cluster 2 are the
most distinct pairs of clusters
Observations within clusters are
more similar than observations
between clusters
Cluster 3
more
homogenous
K-Means Clustering –
Predicted Clusters
Record ID Cluster ID Dist. Clust-1 Dist. Clust-2 Dist. Clust-3 Age Income Children
1 1 0.987923 2.734159 1.190912 48 17546 1
2 1 1.62843 2.955969 2.847426 40 30085.1 3
3 3 1.764699 2.876826 0.866409 51 16575.4 0
4 1 1.761474 4.184518 3.547236 23 20375.4 3
5 2 3.058468 0.992641 1.667591 57 50576.3 0
6 2 2.107507 1.440136 1.925799 57 37869.6 2
7 1 1.929472 4.473073 2.723054 22 8877.07 0
8 3 2.163087 2.213405 0.509393 58 24946.6 0
9 1 0.640108 2.868416 2.124907 37 25304.3 2
10 1 1.459529 2.317267 1.788088 54 24212.1 2
11 2 3.93367 1.132784 2.529248 66 59803.9 0
12 3 1.868574 2.206364 0.153137 52 26658.8 0
13 1 0.770418 2.981068 1.392612 44 15735.8 1
14 2 3.459491 0.412738 2.37731 66 55204.7 1
15 1 1.358113 3.221133 1.409031 36 19474.6 0
16 3 1.374677 2.97392 1.183135 38 22342.1 0
17 1 0.51566 3.272391 2.250002 37 17729.8 2
18 3 2.184812 1.710157 1.050179 46 41016 0
19 3 2.430829 2.06948 0.756316 62 26909.2 0
20 1 1.437973 3.316736 1.689235 31 22522.8 0
21 2 3.390112 1.012452 2.862822 61 57880.7 2
22 1 1.16403 2.904136 1.96578 50 16497.3 2
23 3 2.342344 1.481687 0.757448 54 38446.6 0
24 1 1.574014 3.872571 2.153806 27 15538.8 0
25 1 1.328415 4.283069 3.129631 22 12640.3 2
26 3 2.543734 1.303721 0.975943 56 41034 0
27 3 1.504315 2.776198 0.78784 45 20809.7 0
28 1 0.470227 2.907789 1.445835 39 20114 1
29 1 1.593881 3.02813 2.871819 39 29359.1 3
30 3 1.948349 2.043314 1.106838 61 24270.1 1
Cluster Age Income Children Count
1 36.6 19734.16 1.47 15
2 61.4 52267.04 1 5
3 52.3 28300.85 0.1 10
Total 45.97 28011.87 0.93 30
Average in original units
K-Means Clustering –
Predicted Clusters
Cluster Age Income Children Count
1 36.6 19734.16 1.47 15
2 61.4 52267.04 1 5
3 52.3 28300.85 0.1 10
Total 45.97 28011.87 0.93 30
Averages
Record ID Cluster ID Dist. Clust-1 Dist. Clust-2 Dist. Clust-3 Age Income Children
1 1 0.98792 2.73416 1.19091 48 17546 1
2 1 1.62843 2.95597 2.84743 40 30085.1 3
4 1 1.76147 4.18452 3.54724 23 20375.4 3
7 1 1.92947 4.47307 2.72305 22 8877.07 0
9 1 0.64011 2.86842 2.12491 37 25304.3 2
10 1 1.45953 2.31727 1.78809 54 24212.1 2
13 1 0.77042 2.98107 1.39261 44 15735.8 1
15 1 1.35811 3.22113 1.40903 36 19474.6 0
17 1 0.51566 3.27239 2.25 37 17729.8 2
20 1 1.43797 3.31674 1.68924 31 22522.8 0
22 1 1.16403 2.90414 1.96578 50 16497.3 2
24 1 1.57401 3.87257 2.15381 27 15538.8 0
25 1 1.32842 4.28307 3.12963 22 12640.3 2
28 1 0.47023 2.90779 1.44584 39 20114 1
29 1 1.59388 3.02813 2.87182 39 29359.1 3
5 2 3.05847 0.99264 1.66759 57 50576.3 0
6 2 2.10751 1.44014 1.9258 57 37869.6 2
11 2 3.93367 1.13278 2.52925 66 59803.9 0
14 2 3.45949 0.41274 2.37731 66 55204.7 1
21 2 3.39011 1.01245 2.86282 61 57880.7 2
3 3 1.7647 2.87683 0.86641 51 16575.4 0
8 3 2.16309 2.21341 0.50939 58 24946.6 0
12 3 1.86857 2.20636 0.15314 52 26658.8 0
16 3 1.37468 2.97392 1.18314 38 22342.1 0
18 3 2.18481 1.71016 1.05018 46 41016 0
19 3 2.43083 2.06948 0.75632 62 26909.2 0
23 3 2.34234 1.48169 0.75745 54 38446.6 0
26 3 2.54373 1.30372 0.97594 56 41034 0
27 3 1.50432 2.7762 0.78784 45 20809.7 0
30 3 1.94835 2.04331 1.10684 61 24270.1 1
Order Cluster ID and
expand the selection
Cluster 1: youngest customers, lowest
income and largest families
Cluster 2: oldest customers, highest
income and an average of one child
Cluster 3: older customers,
moderate income and few children
Distance between clusters
Cluster 1 and Cluster 2 are the most distinct pairs of clusters
Cluster Age Income Children Count
1 36.6 19734.16 1.47 15
2 61.4 52267.04 1 5
3 52.3 28300.85 0.1 10
Total 45.97 28011.87 0.93 30
Cluster Analysis
As an unsupervised learning technique, cluster analysis is not guided
by any explicit measure of accuracy, and thus the notion of a ‘good’
clustering is subjective and is dependent on what the analyst hopes
the cluster analysis will uncover.
We can measure the strength of a cluster by comparing the average
distance in a cluster to the distance between cluster centroids.
Rule of thumb: the ratio of between-cluster distance to within-
cluster distance should exceed 1.0 for useful clusters.
Hierarchical Clustering versus k-Means Clustering
Hierarchical Clustering:
◦ Suitable when we have a small data set (e.g., fewer than 500 observations) and want to easily examine solutions with increasing numbers of clusters
◦ Convenient method if you want to observe how clusters are nested
◦ Very sensitive to outliers
k-Means Clustering:
◦ Suitable when you know how many clusters you want and you have a larger data set (e.g., larger than 500 observations)
◦ Partitions the observations, which is appropriate if trying to summarize the data with k "average" observations that describe the data with the minimum amount of error
◦ Generally not appropriate for binary or ordinal data because the average is not meaningful
Association Rules
EVALUATING ASSOCIATION RULES
Association rule mining is the data-mining process of finding rules that may
govern associations between sets of items. Given transactions with multiple
items, it tries to find the rules that govern how or why such items are often
bought together.
Association Rules
Association rules: if-then statements which convey the likelihood of certain items being
purchased together
Although association rules are an important tool in market basket analysis, they are
applicable to other disciplines.
Antecedent: the collection of items (or item set) corresponding to the if portion of the rule
Consequent: the item set corresponding to the then portion of the rule
Support count of an item set: the number of transactions in the data that include that item set
When we go grocery shopping, we often have a standard list of
things to buy. Each shopper has a distinctive list, depending on
one’s needs and preferences. A person might buy healthy
ingredients for a family dinner, while a bachelor might buy beer
and chips. Understanding these buying patterns can help to
increase sales in several ways. If there is a pair of items, X and
Y, that are frequently bought together:
•Both X and Y can be placed on the
same shelf, so that buyers of one
item would be prompted to buy the
other.
•Promotional discounts could be
applied to just one out of the two
items.
•Advertisements on X could be
targeted at buyers who purchase Y.
•X and Y could be combined into a
new product, such as having Y in
flavors of X.
While we may know that certain
items are frequently bought
together, the question is, how do
we uncover these associations?
Besides increasing sales profits,
association rules can also be used
in other fields. In medical diagnosis
for instance, understanding which
symptoms tend to appear together
can help to improve patient care
and medicine prescription.
Definition
Association rules analysis is a technique to
uncover how items are associated to each other.
There are three common ways to measure
association:
The Support Count, the Confidence and the Lift-Ratio
https://medium.com/analytics-vidhya/association-rule-mining-7f06401f0601
In 2004, Walmart mined trillions of bytes of data to discover that
Strawberry Pop-Tarts were most purchased, pre-hurricane. It
was later attributed to the no-cook, long-lasting capabilities of the
tarts that made them disaster favourites. This was later proved to be
true when they stocked up on Pop-Tarts pre-hurricane in the future
years only to have them sold-out.
EXAMPLES
Before a hurricane strikes, people tend to stock up on Strawberry Pop-Tarts
just as much as batteries and other essentials.
Fast-food chains learned very early in the game that people who buy fast
food tend to feel thirsty due to the high salt content and end up buying Coke.
Association Rule Mining is a good tool for businesses.
1. It helps businesses build sales strategies
identifying products that sell better together
2. It helps businesses build marketing strategies
The knowledge that some ornaments do not sell as well others during Christmas may help the
manager offer a sale on the non-frequent ornaments
3. It helps shelf-life planning
If olives don’t sell very often, the manager will not stock up on it. But he still wants to ensure that the
existing stock sells before the expiration date. With the knowledge that people who buy pizza dough
tend to buy olives, the olives can be offered at a lower price in combination with the pizza dough.
4. It helps the in-store organization.
Products which are known to drive the sales of other products can be moved closer together in the store.
For instance, if the sale of butter is driven by the sale of bread, they can be moved to the same aisle
in the store.
Walmart analyzed 1.2 million baskets of a store and found
a very interesting association. They found that on Fridays,
between 5 pm and 7 pm, diapers and beer were frequently
bought together.
To test this out, they moved the two closer in the store and
found a significant impact on the sales of these products.
Further analysis of this led them to the following
conclusion: “On Friday evenings, men would head home
from work and grab some beer while also picking up
diapers for their infants.”
Example
Shopping-Cart Transactions
If bread and jelly, then peanut butter
(antecedent: bread and jelly; consequent: peanut butter)
We consider only association rules with a single consequent
The Support Count = the number of transactions that include the item set
Support count of (bread, jelly) = 4
Rule of thumb: only consider association rules with a support count of at least 20% of transactions
Support count of peanut butter = 4
Association Rules
Confidence: Helps identify reliable association rules
Lift ratio: Measure to evaluate the efficiency of a rule
a) Find the Confidence of the rule “If Bread and Fruit, then Jelly”
c) Find the Confidence of the rule “If Milk then Peanut Butter,”
b) Find the Lift Ratio of the rule “If Bread and Fruit, then Jelly”
d) Find the Lift Ratio of the rule “If Milk then Peanut Butter”
The confidence measure helps
identify which product drives the
sale of which other product.
a) Confidence of the rule "If Bread and Fruit, then Jelly"
= Support{Bread, Fruit, Jelly} / Support{Bread, Fruit} = 4/4 = 1.0
b) Lift Ratio of the rule "If Bread and Fruit, then Jelly"
= Confidence / (Support{Jelly} / 10) = 1.0 / (5/10) = 2.00
Identifying a customer who purchased both Bread and Fruit as one who also purchased Jelly is two times better than just guessing that a random customer purchased Jelly.
c) Confidence of the rule "If Milk, then Peanut Butter"
= Support{Milk, Peanut Butter} / Support{Milk} = 4/6 ≈ 0.67
d) Lift Ratio of the rule "If Milk, then Peanut Butter"
= 0.67 / (4/10) ≈ 1.67
Identifying a customer who purchased Milk as one who also purchased Peanut Butter is 67% better than just guessing that a random customer purchased Peanut Butter.
Association Rules
Bread PeanutButter Milk Fruit Jelly Soda PotatoChips Vegetables WhippedCream ChocolateSauce Beer Steak Cheese Yogurt
1 1 1 1 1 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 0 0 0 0 0 0
0 0 0 1 0 0 0 0 1 1 1 0 0 0
1 0 0 1 1 1 1 0 0 0 0 1 0 0
0 1 1 1 1 1 0 0 0 0 0 0 0 0
1 0 1 1 1 1 1 0 0 0 0 0 0 0
0 0 1 1 0 1 1 0 0 0 0 0 0 0
0 1 1 1 0 1 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0 0 1 1
0 0 0 0 0 0 0 1 0 0 1 0 0 1
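A small pandas sketch that computes these measures directly from the binary matrix (assuming the table above has been loaded into a DataFrame named trans with those column names; the file name is hypothetical):

```python
import pandas as pd

# trans is assumed to be the 10 x 14 binary transaction matrix shown above
trans = pd.read_excel("transactions.xlsx")   # hypothetical file name

def support(items):
    """Number of transactions that contain every item in the list."""
    return int(trans[items].all(axis=1).sum())

def confidence(antecedent, consequent):
    return support(antecedent + consequent) / support(antecedent)

def lift(antecedent, consequent):
    return confidence(antecedent, consequent) / (support(consequent) / len(trans))

print(confidence(["Bread", "Fruit"], ["Jelly"]))   # 4/4 = 1.0
print(lift(["Bread", "Fruit"], ["Jelly"]))         # 1.0 / (5/10) = 2.0
print(confidence(["Milk"], ["PeanutButter"]))      # 4/6, about 0.67
print(lift(["Milk"], ["PeanutButter"]))            # 0.67 / (4/10), about 1.67
```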
Association Rules
The utility of a rule depends on both its support and its lift ratio.
Although a high lift ratio suggests that the rule is very efficient at identifying
when the consequent occurs, if it has very low support the rule may not be
as useful as another rule that has a lower lift ratio but affects a larger
number of transactions (as demonstrated by a higher support).
Data Mining - Associate - Association Rules
Association Rules: Fitting Parameters
Data Format: Binary; Method: Apriori; Min support: 4; Min confidence: 50
Association Rules: Reporting Parameters
Rule ID A-Support C-Support Support Confidence Lift-Ratio Antecedent Consequent
Rule 2 4 5 4 100 2 [Bread] [Jelly]
Rule 22 4 5 4 100 2 [Bread] [Fruit,Jelly]
Rule 24 4 5 4 100 2 [Bread,Fruit] [Jelly]
Rule 3 5 4 4 80 2 [Jelly] [Bread]
Rule 23 5 4 4 80 2 [Jelly] [Bread,Fruit]
Rule 26 5 4 4 80 2 [Fruit,Jelly] [Bread]
Rule 4 4 6 4 100 1.6666667[PeanutButter] [Milk]
Rule 21 4 6 4 100 1.6666667 [PotatoChips] [Soda]
Rule 27 4 6 4 100 1.6666667[PeanutButter] [Milk,Fruit]
Rule 30 4 6 4 100 1.6666667
[PeanutButter,Fruit] [Milk]
Rule 49 4 6 4 100 1.6666667 [PotatoChips] [Fruit,Soda]
Rule 51 4 6 4 100 1.6666667
[Fruit,PotatoChips] [Soda]
Rule 5 6 4 4 66.66666667 1.6666667 [Milk][PeanutButter]
Rule 20 6 4 4 66.66666667 1.6666667 [Soda] [PotatoChips]
Rule 28 6 4 4 66.66666667 1.6666667 [Milk]
[PeanutButter,Fruit]
Rule 31 6 4 4 66.66666667 1.6666667 [Milk,Fruit] [PeanutButter]
Rule 48 6 4 4 66.66666667 1.6666667 [Soda]
[Fruit,PotatoChips]
Rule 50 6 4 4 66.66666667 1.6666667 [Fruit,Soda] [PotatoChips]
Rule 11 6 6 5 83.33333333 1.3888889 [Milk] [Soda]
Rule 12 6 6 5 83.33333333 1.3888889 [Soda] [Milk]
Rule 37 6 6 5 83.33333333 1.3888889 [Milk] [Fruit,Soda]
Rule 39 6 6 5 83.33333333 1.3888889 [Soda] [Milk,Fruit]
Rule 40 6 6 5 83.33333333 1.3888889 [Milk,Fruit] [Soda]
Rule 42 6 6 5 83.33333333 1.3888889 [Fruit,Soda] [Milk]
Rule 10 5 6 4 80 1.3333333 [Jelly] [Milk]
Rule 18 5 6 4 80 1.3333333 [Jelly] [Soda]
Rules
presented in
decreasing
order of Lift
Ratio
Rule ID A-Support C-Support Support Confidence Lift-Ratio Antecedent Consequent
Rule 33 5 6 4 80 1.3333333 [Jelly] [Milk,Fruit]
Rule 36 5 6 4 80 1.3333333 [Fruit,Jelly] [Milk]
Rule 43 5 6 4 80 1.3333333 [Jelly] [Fruit,Soda]
Rule 45 5 6 4 80 1.3333333 [Fruit,Jelly] [Soda]
Rule 9 6 5 4 66.66666667 1.3333333 [Milk] [Jelly]
Rule 19 6 5 4 66.66666667 1.3333333 [Soda] [Jelly]
Rule 32 6 5 4 66.66666667 1.3333333 [Milk] [Fruit,Jelly]
Rule 34 6 5 4 66.66666667 1.3333333 [Milk,Fruit] [Jelly]
Rule 44 6 5 4 66.66666667 1.3333333 [Soda] [Fruit,Jelly]
Rule 46 6 5 4 66.66666667 1.3333333 [Fruit,Soda] [Jelly]
Rule 1 4 9 4 100 1.1111111 [Bread] [Fruit]
Rule 6 4 9 4 100 1.1111111[PeanutButter] [Fruit]
Rule 7 6 9 6 100 1.1111111 [Milk] [Fruit]
Rule 14 5 9 5 100 1.1111111 [Jelly] [Fruit]
Rule 16 6 9 6 100 1.1111111 [Soda] [Fruit]
Rule 17 4 9 4 100 1.1111111 [PotatoChips] [Fruit]
Rule 25 4 9 4 100 1.1111111 [Bread,Jelly] [Fruit]
Rule 29 4 9 4 100 1.1111111
[PeanutButter,Milk] [Fruit]
Rule 35 4 9 4 100 1.1111111 [Milk,Jelly] [Fruit]
Rule 41 5 9 5 100 1.1111111 [Milk,Soda] [Fruit]
Rule 47 4 9 4 100 1.1111111 [Jelly,Soda] [Fruit]
Rule 52 4 9 4 100 1.1111111
[Soda,PotatoChips] [Fruit]
Rule 8 9 6 6 66.66666667 1.1111111 [Fruit] [Milk]
Rule 15 9 6 6 66.66666667 1.1111111 [Fruit] [Soda]
Rule 13 9 5 5 55.55555556 1.1111111 [Fruit] [Jelly]
Rule 38 9 5 5 55.55555556 1.1111111 [Fruit] [Milk,Soda]
Evaluating Association Rules
An association rule is judged on how actionable it is and how
well it explains the relationship between item sets.
An association rule is useful if it is well supported and explains
an important, previously unknown relationship.
Example
Consider the rule "If A then B" over 100 transactions.
Suppose the support for consequent B = 2, Support (A and B) = 2, and antecedent A is very popular: support of A = 50.
Confidence (If A then B) = Support (A and B) / Support (A) = 2/50 = 0.04
Lift ratio (If A then B) = Confidence (If A then B) / (Support (B)/100) = 0.04 / (2/100) = 2
Despite the lift ratio of 2, the rule has very low confidence and support, so it may not be very useful.
Text Mining
Text mining is the process of transforming unstructured text into
structured data for easy analysis. Text mining uses natural language
processing (NLP), allowing machines to understand the human language
and process it automatically.
Text data is often referred to as unstructured data because in its raw form, it
cannot be stored in a traditional structured database (rows and columns).
Audio and video data are also examples of unstructured data.
Data mining with text data is more challenging than data mining with traditional
numerical data, because it requires more preprocessing to convert the text to a
format amenable for analysis.
Basic Methods
1) Word frequency
Word frequency can be used to identify the most recurrent
terms or concepts in a set of data.
2) Collocation
Collocation refers to a sequence of words that commonly appear
near each other.
3) Concordance
Concordance is used to recognize the particular context or
instance in which a word or set of words appears.
Topic Analysis, Sentiment Analysis
Example: Triad Airline
◦ Triad solicits feedback from its customers through a follow-up e-mail the
day after the customer has completed a flight.
◦ Survey asks the customer to rate various aspects of the flight and asks the
respondent to type comments into a dialog box in the e-mail; includes:
◦ Quantitative feedback from the ratings.
◦ Comments entered by the respondents which need to be analyzed.
◦ A collection of text documents to be analyzed is called a corpus.
Example: Triad Airline 10 respondents
Concerns
The wi-fi service was horrible. It was slow and cut off several times.
My seat was uncomfortable.
My flight was delayed 2 hours for no apparent reason.
My seat would not recline.
The man at the ticket counter was rude. Service was horrible.
The flight attendant was rude. Service was bad.
My flight was delayed with no explanation.
My drink spilled when the guy in front of me reclined his seat.
My flight was canceled.
The arm rest of my seat was nasty.
To be analyzed, text data needs to be converted to structured data
Example: Converting text data
To be analyzed, text data needs to be converted to structured data (rows and
columns of numerical data) so that the tools of descriptive statistics, data
visualization and data mining can be applied.
We must convert a group of documents into a matrix of rows and columns
where the rows correspond to a document and the columns correspond to a
particular word.
A presence/absence or binary term-document matrix is a matrix with the rows
representing documents and the columns representing words.
Entries in the columns indicate either the presence or the absence of a
particular word in a particular document.
Example: Converting text data
◦ Creating the list of terms to use in the presence/absence matrix can be a
complicated matter:
◦ Too many terms results in a matrix with many columns, which may be
difficult to manage and could yield meaningless results.
◦ Too few terms may miss important relationships.
◦ Term frequency along with the problem context are often used as a guide.
◦ In Triad’s case, management used word frequency and the context of
having a goal of satisfied customers to come up with the following list of
terms they feel are relevant for categorizing the respondent’s comments.
Example: Converting text data
Concerns
The wi-fi service was horrible. It was slow and cut off several times.
My seat was uncomfortable.
My flight was delayed 2 hours for no apparent reason.
My seat would not recline.
The man at the ticket counter was rude. Service was horrible.
The flight attendant was rude. Service was bad.
My flight was delayed with no explanation.
My drink spilled when the guy in front of me reclined his seat.
My flight was canceled.
The arm rest of my seat was nasty.
delayed, flight, horrible, recline, rude, seat, and service
Presence/Absence Term-Document Matrix
Term
Document Delayed Flight Horrible Recline Rude Seat Service
1 0 0 1 0 0 0 1
2 0 0 0 0 0 1 0
3 1 1 0 0 0 0 0
4 0 0 0 1 0 1 0
5 0 0 1 0 1 0 1
6 0 1 0 0 1 0 1
7 1 1 0 0 0 0 0
8 0 0 0 1 0 1 0
9 0 1 0 0 0 0 0
10 0 0 0 0 0 1 0
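The presence/absence matrix above can be built with scikit-learn's CountVectorizer (a sketch; note that without stemming, a variant such as "reclined" in comment 8 is not counted under "recline", so one cell differs from the slide's matrix):

```python
from sklearn.feature_extraction.text import CountVectorizer

comments = [
    "The wi-fi service was horrible. It was slow and cut off several times.",
    "My seat was uncomfortable.",
    "My flight was delayed 2 hours for no apparent reason.",
    "My seat would not recline.",
    "The man at the ticket counter was rude. Service was horrible.",
    "The flight attendant was rude. Service was bad.",
    "My flight was delayed with no explanation.",
    "My drink spilled when the guy in front of me reclined his seat.",
    "My flight was canceled.",
    "The arm rest of my seat was nasty.",
]

terms = ["delayed", "flight", "horrible", "recline", "rude", "seat", "service"]

# binary=True records presence/absence instead of counts; restricting the
# vocabulary keeps only the terms management selected. Stemming (discussed
# under Preprocessing below) would be needed to catch word variants.
vectorizer = CountVectorizer(vocabulary=terms, binary=True, lowercase=True)
tdm = vectorizer.fit_transform(comments)
print(tdm.toarray())   # one row per comment, one column per term
```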
The text-mining process converts unstructured text into numerical data
◦ The text-mining process converts unstructured text into numerical data and applies
quantitative techniques.
◦ Which terms become the headers of the columns of the term-document matrix can
greatly impact the analysis.
Preprocessing Text Data for Analysis
Tokenization is the process of dividing text into separate terms, referred to as tokens:
◦ Symbols and punctuation must be removed from the document, and all letters should be converted to lowercase.
◦ Different forms of the same word, such as "stacking", "stacked," and "stack", probably should not be considered as distinct terms.
Stemming is the process of converting a word to its stem or root word.
Recommendations
◦ The goal of preprocessing is to generate a list of most-relevant terms that is
sufficiently small so as to lend itself to analysis:
◦ Frequency can be used to eliminate words from consideration as tokens.
◦ Low-frequency words probably will not be very useful as tokens.
◦ Consolidating words that are synonyms can reduce the set of tokens.
◦ Most text-mining software gives the user the ability to manually specify terms to
include or exclude as tokens.
The use of slang, humor, and sarcasm can cause interpretation problems and might
require more sophisticated data cleansing and subjective intervention on the part of the
analyst to avoid misinterpretation.
Data preprocessing parses the original text data down to the set of tokens deemed
relevant for the topic being studied.
Frequency Term-Document Matrix
◦ When the documents in a corpus contain many words and when the frequency
of word occurrence is important to the context of the business problem,
preprocessing can be used to develop a frequency term-document matrix.
◦ A frequency term-document matrix is a matrix whose rows represent
documents and columns represent tokens, and the entries in the matrix are the
frequency of occurrence of each token in each document.
DataMining - Text frequency-inverse document frequency,
Term Count Info
Text Var Original (Total) Final (Total) Reduction, % Vocabulary
Comments 84 19 22,61904762 7
Document Info
Document ID # Characters # Terms
Comments_Doc1 70 14
Comments_Doc2 26 4
Comments_Doc3 53 10
Comments_Doc4 26 5
Comments_Doc5 61 11
Comments_Doc6 47 8
Comments_Doc7 42 7
Comments_Doc8 63 13
Comments_Doc9 23 4
Comments_Doc10 34 8
Comments
Top Terms Info
Term Collection Frequency Document Frequency
flight 4 4
seat 4 4
servic 3 3
delay 2 2
horribl 2 2
reclin 2 2
rude 2 2
Comments
Term-Document Matrix
Doc ID delay flight horribl reclin rude seat servic
Comments_Doc1 0 0 1 0 0 0 1
Comments_Doc2 0 0 0 0 0 1 0
Comments_Doc3 1 1 0 0 0 0 0
Comments_Doc4 0 0 0 1 0 1 0
Comments_Doc5 0 0 1 0 1 0 1
Comments_Doc6 0 1 0 0 1 0 1
Comments_Doc7 1 1 0 0 0 0 0
Comments_Doc8 0 0 0 1 0 1 0
Comments_Doc9 0 1 0 0 0 0 0
Comments_Doc10 0 0 0 0 0 1 0
Use Hierarchical Cluster analysis to group comments
On the Presence / Absence Term-Document Matrix
# Selected Variables 7
Selected Variables delay flight horribl reclin rude seat servic
Hierarchical Clustering: Fitting Parameters
Similarity Measure JACCARD
Clustering Method COMPLETE LINKAGE
Hierarchical Clustering: Model Parameters
Cluster Assignment TRUE
# Clusters 3
Hierarchical Clustering: Reporting Parameters
Normalized? FALSE
Draw Dendrogram? TRUE
Maximum Number of Leaves in Dendrogram 10
Data Type Raw Data
Record ID Cluster Sub-Cluster
Record 1 1 1
Record 2 2 2
Record 3 3 3
Record 4 2 4
Record 5 1 5
Record 6 1 6
Record 7 3 7
Record 8 2 8
Record 9 3 9
Record 10 2 10
Record ID Cluster Sub-Cluster Comments
Record 1 1 1 The wi-fi service was horrible. It was slow and cut off several times.
Record 2 2 2 My seat was uncomfortable.
Record 3 3 3 My flight was delayed 2 hours for no apparent reason.
Record 4 2 4 My seat would not recline.
Record 5 1 5 The man at the ticket counter was rude. Service was horrible.
Record 6 1 6 The flight attendant was rude. Service was bad.
Record 7 3 7 My flight was delayed with no explanation.
Record 8 2 8 My drink spilled when the guy in front of me reclined his seat.
Record 9 3 9 My flight was canceled.
Record 10 2 10 The arm rest of my seat was nasty.
Record ID Cluster Sub-Cluster Comments
Record 1 1 1 The wi-fi service was horrible. It was slow and cut off several times.
Record 5 1 5 The man at the ticket counter was rude. Service was horrible.
Record 6 1 6 The flight attendant was rude. Service was bad.
Record 2 2 2 My seat was uncomfortable.
Record 4 2 4 My seat would not recline.
Record 8 2 8 My drink spilled when the guy in front of me reclined his seat.
Record 10 2 10 The arm rest of my seat was nasty.
Record 3 3 3 My flight was delayed 2 hours for no apparent reason.
Record 7 3 7 My flight was delayed with no explanation.
Record 9 3 9 My flight was canceled.
Cluster 1: {1, 5, 6} documents about service issues
Cluster 2: {2, 4, 8, 10} documents about seat issues
Cluster 3: {3, 7, 9} documents about scheduling issues
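The same grouping can be reproduced with a short SciPy sketch, assuming Python with NumPy and SciPy installed: Jaccard dissimilarity on the presence/absence matrix, complete linkage, and a cut at three clusters. Ties in the merge order may be broken differently than in the spreadsheet output, but the service, seat, and scheduling themes emerge either way.

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Rows = documents 1-10; columns = delay, flight, horribl, reclin, rude, seat, servic
X = np.array([
    [0, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0],
], dtype=bool)

d = pdist(X, metric="jaccard")             # pairwise Jaccard dissimilarities
Z = linkage(d, method="complete")          # complete (furthest-neighbor) linkage
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)                              # cluster label for each of the 10 comments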
Example Movie Review Frequency Term-document matrix
Terms
Document Great Terrible
1 5 0
2 5 1
3 5 1
4 3 3
5 5 1
6 0 5
7 4 1
8 5 3
9 1 3
10 1 2
We have 10 reviews from movie critics. After applying preprocessing techniques, including consolidating synonyms, the number of tokens is down to two: Great and Terrible.
Apply k-means clustering to the frequency term-document matrix with k = 2 (a code sketch follows the cluster listings below).
The process of clustering /
categorizing comments or
reviews as positive, negative
or neutral is known as
sentiment analysis
Variables
# Selected Variables 2
Selected Variables Great Terrible
K-Means Clustering: Fitting Parameters
# Clusters 2
Start type Random Start
# Iterations 10
Random seed: initial centroids 12345
K-Means Clustering: Reporting Parameters
Show data summary TRUE
Show distance from each cluster TRUE
Normalized? FALSE
Cluster Great Terrible
Cluster 1 4.5714286 1.428571429
Cluster 2 0.6666667 3.333333333
Cluster Size Average Distance
Cluster 1 7 1.1250
Cluster 2 3 1.2136
Total 10 1.1516
One random start and 10 iterations
Record ID Cluster Dist.Cluster-1 Dist.Cluster-2
Record 1 1 1.491 5.467
Record 2 1 0.606 4.922
Record 3 1 0.606 4.922
Record 4 1 2.222 2.357
Record 5 1 0.606 4.922
Record 6 2 5.801 1.795
Record 7 1 0.714 4.069
Record 8 1 1.629 4.346
Record 9 2 3.902 0.471
Record 10 2 3.617 1.374
Record ID Cluster Great Terrible
Record 1 1 5 0
Record 2 1 5 1
Record 3 1 5 1
Record 4 1 3 3
Record 5 1 5 1
Record 6 2 0 5
Record 7 1 4 1
Record 8 1 5 3
Record 9 2 1 3
Record 10 2 1 2
Record ID Cluster Great Terrible
Record 1 1 5 0
Record 2 1 5 1
Record 3 1 5 1
Record 4 1 3 3
Record 5 1 5 1
Record 7 1 4 1
Record 8 1 5 3
Record 6 2 0 5
Record 9 2 1 3
Record 10 2 1 2
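As a check on the output above (the data were not normalized, so raw counts are used), the Cluster 1 centroid is simply the average of the seven reviews assigned to it, and each reported distance is the Euclidean distance from a review to a centroid. For example,

$\bar{c}_1 = \left( \dfrac{5+5+5+3+5+4+5}{7},\ \dfrac{0+1+1+3+1+1+3}{7} \right) = \left( \dfrac{32}{7},\ \dfrac{10}{7} \right) \approx (4.571,\ 1.429)$

$d(\text{Record 1}, \bar{c}_1) = \sqrt{(5 - 4.571)^2 + (0 - 1.429)^2} \approx 1.491$

which match the Cluster 1 centroid and the Dist.Cluster-1 value reported for Record 1.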
Cluster 1: reviews tend to be positive.
Cluster 2: reviews tend to be negative.
The review at (3, 3) is balanced between Great and Terrible; it is nearly equidistant from the two centroids and is classified in Cluster 1 here.
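A minimal scikit-learn sketch of the same analysis, assuming Python with NumPy and scikit-learn installed. Because the balanced review at (3, 3) is nearly equidistant from the two centroids (2.222 vs. 2.357 in the distance table above), different random starts or implementations may place it in either cluster; the broad positive/negative split is the same.

import numpy as np
from sklearn.cluster import KMeans  # assumes scikit-learn is installed

# Counts of "Great" and "Terrible" in each of the 10 movie reviews
X = np.array([
    [5, 0], [5, 1], [5, 1], [3, 3], [5, 1],
    [0, 5], [4, 1], [5, 3], [1, 3], [1, 2],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # 10 random starts
print(km.labels_)            # cluster label for each review
print(km.cluster_centers_)   # centroids on the raw (un-normalized) counts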
Sentiment Analysis
The process of clustering / categorizing
comments or reviews as positive, negative
or neutral is known as sentiment analysis
Text mining is
• how a business analyst turns 50,000 hotel guest reviews
into specific recommendations;
• how a workforce analyst improves productivity and
reduces employee turnover;
• how healthcare providers and biopharma
researchers understand patient experiences;
• and much, much more.
SAS Text Miner, Python, R, SPSS …
  • 1. MBA 643 Dr Danielle Morin Fall 2022 CHAPTER 4 DESCRIPTIVE DATA MINING
  • 3. Introduction The increase in the use of data-mining techniques in business has been caused largely by three events: ◦ The explosion in the amount of data being produced and electronically tracked ◦ The ability to electronically warehouse these data ◦ The affordability of computer power to analyze the data 3
  • 4. Observation A set of recorded values of variables associated with a single entity. It is a row of values in a spreadsheet or database, in which the columns correspond to the variables 4 35 years old Male dentist single Donation 2016 $1000 Donation 2017 $2000
  • 5. Supervised or Unsupervised Learning Data-mining approaches can be separated into two categories: Supervised learning—For prediction and classification Unsupervised learning—To detect patterns and relationships in the data Thought of as high-dimensional descriptive analytics Designed to describe patterns and relationships in large data sets with many observations of many variables There is no outcome variable to predict No definitive measure of accuracy, but qualitative assessment
  • 6. Cluster Analysis Goal: to segment observations into similar groups based on observed variables Can be employed during the data-preparation step to identify variables or observations that can be aggregated or removed from consideration Commonly used in marketing to divide customers into different homogenous groups; known as market segmentation Used to identify outliers
  • 7. Cluster Methods Bottom-up hierarchical clustering starts with each observation belonging to its own cluster and then sequentially merges the most similar clusters to create a series of nested clusters k-means clustering assigns each observation to one of k clusters in a manner such that the observations assigned to the same cluster are as similar as possible Both methods depend on how two observations are similar, hence, we have to measure similarity between observations
  • 8. Three influential factors Hierarchical versus nonhierarchical clustering The measurement of the distance between observations The measurement of the distance between clusters
  • 9. Measurement of Distances between observations Euclidean distance Matching coefficients Jaccard coefficients
  • 10. Measuring Similarity Between Observations Euclidean distance: Most common method to measure dissimilarity between observations, when observations include continuous variables Let observations u = (u1, u2, . . . , uq) and v = (v1, v2, . . . , vq) each comprise measurements of q variables The Euclidean distance between observations u and v is: 𝒅𝒖,𝒗 = 𝒖𝟏 − 𝒗𝟏 𝟐 + 𝒖𝟐 − 𝒗𝟐 𝟐 + ∙ ∙ ∙ + 𝒖𝒒 − 𝒗𝒒 𝟐 NOTE: This measure of distance is highly influenced by the scale on which the variables are measured
  • 11. Calculate the Euclidean Distance Euclidean distance becomes smaller as a pair of observations become more similar with respect to their variable values 11 Example Euclidean distance between (2 cars, 5 children) and (1 car, 3 children) (2, 5) and (1, 3) Distance = (2 − 1)2+(5 − 3)2= 5 = 2.24
  • 12. Euclidian Distance Euclidean distance is highly influenced by the scale on which variables are measured ◦ Common to standardize the units of each variable j of each observation u ◦ Example: uj, the value of variable j in observation u, is replaced with its z-score, zj The conversion to z-scores also makes it easier to identify outlier measurements, which can distort the Euclidean distance between observations 12
  • 13. Age Female Income Married Children CarLoan Mortgage 48 1 17546.00 0 1 0 0 40 0 30085.10 1 3 1 1 51 1 16575.40 1 0 1 0 23 1 20375.40 1 3 0 0 57 1 50576.30 1 0 0 0 57 1 37869.60 1 2 0 0 22 0 8877.07 0 0 0 0 58 0 24946.60 1 0 1 0 37 1 25304.30 1 2 1 0 54 0 24212.10 1 2 1 0 66 1 59803.90 1 0 0 0 52 1 26658.80 0 0 1 1 44 1 15735.80 1 1 0 1 66 1 55204.70 1 1 1 1 36 0 19474.60 1 0 0 1 38 1 22342.10 1 0 1 1 37 1 17729.80 1 2 0 1 46 1 41016.00 1 0 0 1 62 1 26909.20 1 0 0 0 31 0 22522.80 1 0 1 0 61 0 57880.70 1 2 0 0 50 0 16497.30 1 2 0 0 54 0 38446.60 1 0 0 0 27 1 15538.80 0 0 1 1 22 0 12640.30 0 2 1 0 56 0 41034.00 1 0 1 1 45 0 20809.70 1 0 0 1 39 1 20114.00 1 1 0 0 39 1 29359.10 0 3 1 1 61 0 24270.10 1 1 0 0 Example A Financial advising company that provides personalized financial advice to its clients would like to segment its customers pool into several groups (clusters) to better serve them. Variables: Age, Gender (1 if Female and 0 if male), Annual Income, whether Married (1) and not married (0), Number of children, If a Car loan = 1 and 0 if not, Mortgage = 1 if a mortgage and 0 if not
  • 14. Example: Consider only Age and Income Age Income 48 17546.00 40 30085.10 51 16575.40 23 20375.40 57 50576.30 57 37869.60 22 8877.07 58 24946.60 37 25304.30 54 24212.10 66 59803.90 52 26658.80 44 15735.80 66 55204.70 36 19474.60 38 22342.10 37 17729.80 46 41016.00 62 26909.20 31 22522.80 61 57880.70 50 16497.30 54 38446.60 27 15538.80 22 12640.30 56 41034.00 45 20809.70 39 20114.00 39 29359.10 𝐷 = (48 − 40)2+(17546.00 − 30085.10)2= D = 12539.1 The Euclidean distance between the first two observations: This dissimilarity measure is influenced by the large values of Income It would be better to use the Z-Score for each variables to remove different units Age Income average 45.97 28011.87 st.dev. 13.04 13703.28 Zage Zincome 0.16 -0.76 -0.46 0.15 0.39 -0.83 -1.76 -0.56 0.85 1.65 0.85 0.72 -1.84 -1.40 0.92 -0.22 -0.69 -0.20 0.62 -0.28 1.54 2.32 0.46 -0.10 -0.15 -0.90 1.54 1.98 -0.76 -0.62 -0.61 -0.41 -0.69 -0.75 0.00 0.95 1.23 -0.08 -1.15 -0.40 1.15 2.18 0.31 -0.84 0.62 0.76 -1.45 -0.91 -1.84 -1.12 0.77 0.95 -0.07 -0.53 -0.53 -0.58 -0.53 0.10 Standardized Distance between first two observations = 𝑆𝑡. 𝐷𝑖𝑠𝑡𝑎𝑛𝑐𝑒 = (0.16 − −0.46)2+(−0.76 − 0.15)2 =1.101 Z-score also helps identifying outliers (48, 17546.00) & (40, 30085.10) 𝑍𝑎𝑔𝑒 = (48 − 45.97) 13.04 = 0.16 (0.16, -0.76) & (-0.46, 0.15)
  • 15. Matching Coefficients For categorical variables encoded as 0–1, a better measure of similarity between two observations can be achieved by counting the number of variables with matching values The simplest overlap measure is called the matching coefficient and is computed by: Example Matching Coefficient for (1, 0, 1, 0, 1) and (0, 1, 1, 0, 1) (1, 0, 1, 0, 1) and (0, 1, 1, 0, 1) = 3/5 = 0.60
  • 16. Jaccard’s Coefficient A weakness of the matching coefficient is that if two observations both have a 0 entry for a categorical variable, this is counted as a sign of similarity between the two observations To avoid misstating similarity due to the absence of a feature, a similarity measure called Jaccard’s coefficient does not count matching zero entries and is computed by: Example Jaccard’s Coefficient for (1, 0, 1, 0, 1) and (0, 1, 1, 0, 1) (1, 0, 1, 0, 1) and (0, 1, 1, 0, 1) (1, 0, 1, 0, 1) and (0, 1, 1, 0, 1) = 2/(5-1) = 0.50
  • 17. Example Calculate the Matching Coefficient and the Jaccard’s Coefficient for the first five observations. Consider only categorical variables Age FemaleIncome Married Children CarLoan Mortgage 1 48 1 17546 0 1 0 0 2 40 0 30085 1 3 1 1 3 51 1 16575 1 0 1 0 4 23 1 20375 1 3 0 0 5 57 1 50576 1 0 0 0 Female Married CarLoan Mortgage 1 1 0 0 0 2 0 1 1 1 3 1 1 1 0 4 1 1 0 0 5 1 1 0 0
  • 18. Example Matching Coefficient Similarity matrix Obs. 1 2 3 4 5 1 1 2 0 1 3 2/4=0.5 2/4=0.5 1 4 3/4=0.75 1/4=0.25 3/4=0.75 1 5 3/4=0.75 1/4 =0.25 3/4=0.75 4/4=1 1 Female Married CarLoan Mortgage 1 1 0 0 0 2 0 1 1 1 3 1 1 1 0 4 1 1 0 0 5 1 1 0 0 Matching coefficients between observations 1 and 3: (1 0 0 0) (1 1 1 0) Matching Coefficient = 2/4 = 0.5
  • 19. Example Matching Coefficient Similarity matrix Obs. 1 2 3 4 5 1 1 2 0 1 3 2/4=0.50 2/4=0.50 1 4 3/4=0.75 1/4=0.25 3/4=0.75 1 5 3/4=0.75 1/4 =0.25 3/4=0.75 4/4=1 1 Female Married CarLoan Mortgage 1 1 0 0 0 2 0 1 1 1 3 1 1 1 0 4 1 1 0 0 5 1 1 0 0 Matching coefficient between Observations 1 and 4 = ¾ = 0.75, and therefore more similar than observations 2 and 3 (0.5)
  • 20. Example Jaccard’s Coefficient Similarity Matrix Obs. 1 2 3 4 5 1 1 2 0 1 3 1/3=0.33 2/4=0.50 1 4 ½=0.5 ¼=0.25 2/3=0.67 1 5 ½=0.5 ¼=0.25 2/3=0.67 2/2=1 1 Female Married CarLoan Mortgage 1 1 0 0 0 2 0 1 1 1 3 1 1 1 0 4 1 1 0 0 5 1 1 0 0 Jaccard coefficient between observations 1 and 3: (1 0 0 0) (1 1 1 0) Jaccard coefficient = 1/(4-1) = 1/3 X X
  • 21. Example Jaccard’s Coefficient Similarity Matrix Obs. 1 2 3 4 5 1 1 2 0 1 3 1/3=0.33 2/4=0.50 1 4 ½=0.50 ¼-0.25 2/3=0.67 1 5 ½=0.50 ¼=0.25 2/3=0.67 2/2=1 1 Female Married CarLoan Mortgage 1 1 0 0 0 2 0 1 1 1 3 1 1 1 0 4 1 1 0 0 5 1 1 0 0 According to Jaccard: Observations 1 and 4 are equally similar (0.5) to observations 2 and 3 (0.5) According to Matching: Observations 1 and 4 are more similar (0.75) than observations 2 and 3 (0.5)
  • 22. Hierarchical Clustering ◦Determines the similarity of two clusters by considering the similarity between the observations composing either cluster ◦Starts with each observation in its own cluster and then iteratively combines the two clusters that are the most similar into a single cluster ◦Given a way to measure similarity between observations, there are several clustering method alternatives for comparing observations in two clusters to obtain a cluster similarity measure
  • 23. Cluster similarity measures ◦Single linkage (closest neighbor) ◦Complete linkage (furthest neighbor) ◦Group average linkage ◦Median linkage ◦Centroid linkage ◦Ward’s method ◦McQuitty’s method
  • 24. Similarity Measures between clusters 24 • The similarity between two clusters is defined by the similarity of the pair of observations (one from each cluster) that are the most similar Single linkage • This clustering method defines the similarity between two clusters as the similarity of the pair of observations (one from each cluster) that are the most different Complete linkage • Defines the similarity between two clusters to be the average similarity computed over all pairs of observations between the two clusters Group Average linkage • Analogous to group average linkage except that it uses the median of the similarities computer between all pairs of observations between the two clusters • Uses the averaging concept of cluster centroids to define between-cluster similarity Median linkage Centroid Linkage
  • 25. Measuring Similarity Between Clusters Single linkage will consider two clusters to be close if an observation in one of the clusters is close to at least one observation in the other cluster. The cluster formed by merging two clusters that are close with respect to single linkage may also consist of pairs of observations that are very different. Complete linkage will consider two clusters to be close if their most different pair of observations are close. This method produces clusters such that all member observations of a cluster are relatively close to each other. This clustering can be distorted by outlier observations. The single and Complete linkage methods are based on a single pair of observation to determine similarity
  • 26. Measuring Similarity Between Clusters Group Average Linkage on all the pairs of observations. If Cluster 1 has n1 observations and Cluster 2 has n2 observations, then the similarity measure is the average of all n1 x n2 pairs. This method produces clusters that are less dominated by similarity measures between single pairs of obseravtions. Centroid Linkage is based on the average calculated in each cluster which is called the centroid. The similarity between the clusters is defined as the similarity of the centroids. The Median Linkage is similar to the Group Average Linkage, except that it uses the Median instead of the average
  • 27. Example Consider the following matrix of distances between pairs of 5 objects: 1 2 3 4 5 1 0 9 3 6 11 2 9 0 7 5 10 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0 D = Classify using Hierarchical Clustering with Single linkage
  • 28. 1 2 3 4 5 1 0 9 3 6 11 2 9 0 7 5 10 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0 D = Classify using Hierarchical Clustering with Single linkage 1 2 4 (3,5) 1 0 2 9 0 4 6 5 0 (3,5) 3 7 8 0 D[1,(3,5)]=min[D(1,3),D(1,5)]=min[3,11]=3 D[2,(3,5)]=min[D(2,3),D(2,5)]=min[7,10]=7 D[4,(3,5)]=min[D(4,3),D(4,5)]=min[9,8]=8 2 4 (1,3,5) 2 0 4 5 0 (1,3,5) 7 6 0 D[2,(1,3,5)]=min[D(2,1),D(2,3),D(2,5)]=min[9,7,10)]=7 D[4,(1,3,5)]=min[D(4,1),D(4,3),D(4,5)]=min[6,9,8)]=6 (2,4) (1,3,5) (2,4) 0 (1,3,5) 6 0 D[(2,4),(1,3,5)]=min[D[2,(1,3,5),D(4,(1,3,5)]=min[7,6]=6
  • 29. 1 2 3 4 5 1 0 9 3 6 11 2 9 0 7 5 10 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0 Classify using Hierarchical Clustering with Single linkage 1 2 4 (3,5) 1 0 2 9 0 4 6 5 0 (3,5) 3 7 8 0 2 4 (1,3,5) 2 0 4 5 0 (1,3,5) 7 6 0 (2,4) (1,3,5) (2,4) 0 (1,3,5) 6 0 Step Distance Merger Clusters # clusters 1 2 3 with 5 1, 2, 4, (3,5) 4 2 3 1 with (3,5) 2, 4, (1,3,5) 3 3 5 2 with 4 (2,4), (1,3,5) 2 4 6 (2,4) with (1,3,5) (1,2,3,4,5) 1
  • 30. Step Distance Merger Clusters # clusters 1 2 3 with 5 1, 2, 4, (3,5) 4 2 3 1 with (3,5) 2, 4, (1,3,5) 3 3 5 2 with 4 (2,4), (1,3,5) 2 4 6 (2,4) with (1,3,5) (1,2,3,4,5) 1 DENDROGRAM
  • 31. Frontline Solver – DataMining – Cluster - Hierarchical Clustering – Distance Matrix– Single Linkage – 4 Clusters Hierarchical Clustering: Reporting Parameters Normalized? Draw Dendrogram? Maximum Number of Leaves in Dendrogram Data Type FALSE TRUE 5 Distance Matrix Clustering Method EUCLIDEAN SINGLE LINKAGE Hierarchical Clustering: Model Parameters Cluster Assignment # Clusters TRUE 4 Hierarchical Clustering: Fitting Parameters Similarity Measure
  • 32. Clustering Stages RESULTS Cluster Labels Stage Cluster 1 Cluster 2 Distance Stage1 3 5 2 Stage2 1 3 3 Stage3 2 4 5 Stage4 1 2 6 Record ID Cluster Sub-Cluster Record 1 1 1 Record 2 2 2 Record 3 3 3 Record 4 4 4 Record 5 3 5
  • 33. 3 5 1 2 4 0 4000 0 1 2 3 4 5 6 7 Distance Dendrogram
  • 34. Example Consider the following matrix of distances between pairs of 5 objects: 1 2 3 4 5 1 0 9 3 6 11 2 9 0 7 5 10 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0 D = Classify using Hierarchical Clustering with Complete linkage Frontline Solver – DataMining – Cluster - Hierarchical Clustering – Distance Matrix– Complete Linkage – 4 Clusters
  • 35. 1 2 3 4 5 1 0 9 3 6 11 2 9 0 7 5 10 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0 D = Classify using Hierarchical Clustering with Complete linkage 1 2 4 (3,5) 1 0 2 9 0 4 6 5 0 (3,5) 11 10 9 0 D[1,(3,5)]=max[D(1,3),D(1,5)]=max[3,11]=11 D[2,(3,5)]=max[D(2,3),D(2,5)]=max [7,10]=10 D[4,(3,5)]=max[D(4,3),D(4,5)]=max[9,8]=9 1 (2,4) (3,5) 1 0 (2,4) 9 0 (3,5) 11 11 0 D[1,(2,4)]=max[D(1,2),D(1,4)]=max[9,6] =9 D[1,(3,5)]=max[D(1,3),D(1,5)]=max[3,11]=11 (1,2,4) (3,5) (1,2,4) 0 (3,5) 11 0 D[(1,2,4),(3,5)]=max[D[1,(3,5)],D((2,4),(3,5)]=max[11,11]=11 D[(2,4),(3,5)]=max[D[2,(3,5),D(4,(3,5)]=max[11,9]=11
  • 36. Clustering Stages RESULTS Cluster Labels Stage Cluster 1 Cluster 2 Distance Stage1 3 5 2 Stage2 2 4 5 Stage3 1 2 9 Stage4 1 3 11 Record ID Cluster Sub-Cluster Record 1 1 1 Record 2 2 2 Record 3 3 3 Record 4 4 4 Record 5 3 5
  • 37. Hierarchical Cluster Complete Linkage 3 5 1 2 4 0 4000 0 2 4 6 8 10 12 Distance Dendrogram
  • 38. Cluster Analysis - More measures Centroid linkage uses the averaging concept of cluster centroids to define between-cluster similarity Ward’s method merges two clusters such that the dissimilarity of the observations with the resulting single cluster increases as little as possible When McQuitty’s method considers merging two clusters A and B, the dissimilarity of the resulting cluster AB to any other cluster C is calculate as: ((dissimilarity between A and C) + (dissimilarity between B and C))/2) A dendrogram is a chart that depicts the set of nested clusters resulting at each step of aggregation
  • 39. Example Female Married CarLoan Mortgage 1 0 0 0 0 1 1 1 1 1 1 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 0 0 0 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 1 1 0 0 1 0 0 1 1 1 0 1 0 1 1 1 0 0 1 0 1 1 0 1 0 0 Analyse using Hierarchical Clustering Matching Coefficients Group Average linkage
  • 40. Inputs Female Married CarLoan Mortgage Hierarchical Clustering: Reporting Parameters Normalized? Draw Dendrogram? Maximum Number of Leaves in Dendrogram Data Type FALSE TRUE 10 Raw Data Clustering Method MATCHING GROUP AVERAGE Hierarchical Clustering: Model Parameters Cluster Assignment # Clusters TRUE 29 Variables # Selected Variables Selected Variables 4 Hierarchical Clustering: Fitting Parameters Similarity Measure Data Workbook Worksheet Range # Records in the input data Data Chapter 4.xlsx KYC $A$1:$G$31 30
  • 41. Clustering stages Record ID Cluster Sub-Cluster Record 1 1 1 Record 2 2 2 Record 3 3 3 Record 4 4 1 Record 5 4 1 Record 6 5 1 Record 7 6 4 Record 8 7 2 Record 9 8 3 Record 10 9 2 Record 11 10 1 Record 12 11 5 Record 13 12 6 Record 14 13 7 Record 15 14 8 Record 16 15 7 Record 17 16 6 Record 18 17 6 Record 19 18 1 Record 20 19 2 Record 21 20 9 Record 22 21 9 Record 23 22 9 Record 24 23 5 Record 25 24 10 Record 26 25 2 Record 27 26 8 Record 28 27 1 Record 29 28 5 Record 30 29 9 Stage Cluster 1 Cluster 2 Distance Stage1 4 5 0 Stage2 4 6 0 Stage3 3 9 0 Stage4 8 10 0 Stage5 4 11 0 Stage6 14 16 0 Stage7 13 17 0 Stage8 13 18 0 Stage9 4 19 0 Stage10 8 20 0 Stage11 21 22 0 Stage12 21 23 0 Stage13 12 24 0 Stage14 2 26 0 Stage15 15 27 0 Stage16 4 28 0 Stage17 12 29 0 Stage18 21 30 0 Stage19 1 4 0.25 Stage20 2 8 0.25 Stage21 3 14 0.25 Stage22 13 15 0.25 Stage23 7 21 0.25 Stage24 1 7 0.3214 Stage25 2 25 0.35 Stage26 3 12 0.375 Stage27 1 13 0.4125 Stage28 2 3 0.5060 Stage29 1 2 0.5905 Sub-Cluster membership Each of the 30 observations is assigned to one of the 10 sub-clusters
  • 42. Cluster membership Cluster Legend (Numbers show the record sequence relative to the original data) Sub- Cluster 1 Sub- Cluster 2 Sub- Cluster 3 Sub- Cluster 4 Sub- Cluster 5 Sub- Cluster 6 Sub- Cluster 7 Sub- Cluster 8 Sub- Cluster 9 Sub- Cluster 10 1 2 3 7 12 13 14 15 21 25 4 8 9 24 17 16 27 22 5 10 29 18 23 6 20 30 11 26 19 28
  • 43. Analytic Solver Basic limits the Maximum Number of Leaves to 10. Setting a value higher than 10 will result in an error message. If the number of Leaves is less than the number of observations, some of the clusters on the horizontal axis will initially correspond to groups of observations combined in early steps of the agglomeration not represented in the dendrogram Sub-cluster #3
  • 44. At distance 0.251, we have 7 clusters At distance 0.42, we have 3 clusters
  • 45. The vertical distance between the two horizontal lines is the cost of merging clusters in terms of decreased homogeneity within clusters = 0.42-0.251 = 0.169 NOTE: Elongating portions of the dendrogram represents mergers of more dissimilar clusters
  • 46. Cluster’s durability or strength The durability or strength of the cluster is measured by the difference between the distance value at which a cluster is originally formed and the distance value at which it is merged with another cluster. Cluster 8 is formed at distance 0 and merged with cluster 6 at distance 0.25. Strength of cluster 8 = 0.25 – 0 = 0.25
  • 47. At distance = 0.42, we have 3 clusters Cluster 1 Cluster 2 Cluster 3 Cluster 1: (sub-clusters 3, 7, 5) {3, 9, 14, 16, 12, 24, 29} Cluster 2: (sub-clusters 2, 10) {2, 8, 10, 20, 26, 25} Cluster 3: (sub-clusters 1, 4, 9, 6, 8) {1, 4, 5, 6, 11, 19, 28, 7, 21, 22, 23, 30, 13, 17, 18, 15, 27}
  • 48. Hierarchical Clustering Matching Coefficients and Group Average linkage Cluster 3: {4, 5, 6, 11, 19, 28, 1, 7, 21, 22, 23, 30, 13, 17, 18, 15, 27} Cluster 2: {2, 26, 8, 10, 20, 25} Cluster 1: {3, 9, 14, 16, 12, 24, 29} Cluster 1 Cluster 2 Cluster 3 Observation Female Married CarLoan Mortgage 1 1 0 0 0 2 0 1 1 1 3 1 1 1 0 4 1 1 0 0 5 1 1 0 0 6 1 1 0 0 7 0 0 0 0 8 0 1 1 0 9 1 1 1 0 10 0 1 1 0 11 1 1 0 0 12 1 0 1 1 13 1 1 0 1 14 1 1 1 1 15 0 1 0 1 16 1 1 1 1 17 1 1 0 1 18 1 1 0 1 19 1 1 0 0 20 0 1 1 0 21 0 1 0 0 22 0 1 0 0 23 0 1 0 0 24 1 0 1 1 25 0 0 1 0 26 0 1 1 1 27 0 1 0 1 28 1 1 0 0 29 1 0 1 1 30 0 1 0 0
  • 49. CLUSTER 1 Obs Female Married Car Loan Mortgage 3 1 1 1 0 9 1 1 1 0 12 1 0 1 1 14 1 1 1 1 16 1 1 1 1 24 1 0 1 1 29 1 0 1 1 Cluster 1 Row Labels Count of Female 0 0 1 7 Grand Total 7 Row Labels Count of Married 0 3 1 4 Grand Total 7 Row Labels Count of Mortgage 0 2 1 5 Grand Total 7 All have car loans
  • 50. CLUSTER 2 Obs Female Married Car Loan Mortgage 2 0 1 1 1 8 0 1 1 0 10 0 1 1 0 20 0 1 1 0 25 0 0 1 0 26 0 1 1 1 Cluster 2 Row Labels Count of Female 0 6 1 0 Grand Total 6 Row Labels Count of Married 0 1 1 5 Grand Total 6 Row Labels Count of Mortgage 0 4 1 2 Grand Total 6 All have car loans
  • 51. CLUSTER 3 Obs Female Married Car Loan Mortgage 1 1 0 0 0 4 1 1 0 0 5 1 1 0 0 6 1 1 0 0 7 0 0 0 0 11 1 1 0 0 13 1 1 0 1 15 0 1 0 1 17 1 1 0 1 18 1 1 0 1 19 1 1 0 0 21 0 1 0 0 22 0 1 0 0 23 0 1 0 0 27 0 1 0 1 28 1 1 0 0 30 0 1 0 0 Cluster 3 Row Labels Count of Female 0 7 1 10 Grand Total 17 Row Labels Count of Married 0 2 1 15 Grand Total 17 Row Labels Count of Mortgage 0 12 1 5 Grand Total 17 None have car loans
  • 52. APPLICATIONS TO HIERARCHICAL CLUSTERING Herarchical clustering is often used on DNA microarrays. Clustering of gene expression profiles is often used to try to discover subclasses of disease. Validation of these clusters is important for accurate scientific interpretation of the results.
  • 53. k-Means Clustering ◦ Given a value of k, the k-means algorithm randomly partitions the observations into k clusters ◦ After all observations have been assigned to a cluster, the resulting cluster centroids are calculated ◦ Using the updated cluster centroids, all observations are reassigned to the cluster with the closest centroid using the Euclidean distance 53
  • 54. Example Age, Income, Number of Children Age Income Children 48 17546 1 40 30085 3 51 16575 0 23 20375 3 57 50576 0 57 37870 2 22 8877 0 58 24947 0 37 25304 2 54 24212 2 66 59804 0 52 26659 0 44 15736 1 66 55205 1 36 19475 0 38 22342 0 37 17730 2 46 41016 0 62 26909 0 31 22523 0 61 57881 2 50 16497 2 54 38447 0 27 15539 0 22 12640 2 56 41034 0 45 20810 0 39 20114 1 39 29359 3 61 24270 1
  • 55. Example Clustering Observations by Age and Income Using k-Means Clustering with k = 3 55 Most heterogeneous Largest cluster Most homogeneous
  • 56. Data Workbook DemoKTC.xlsx Worksheet Data Range $A$1:$G$31 # Records in the input data 30 # Selected Variables 2 Selected Variables Age Income K-Means Clustering: Fitting Parameters # Clusters 3 Start type Random Start # Iterations 10 Random seed: initial centroids 12345 K-Means Clustering: Reporting Parameters Show data summary TRUE Show distance from each cluster TRUE Normalized? TRUE Cluster using Age and Income only Normalize
  • 57. Cluster Age Income Cluster 1 -1.837607 -1.121744243 Cluster 2 -1.837607 -1.121744243 Cluster 3 0.7692901 0.950292945 Cluster Age Income Cluster 1 -0.610832 -0.41375302 Cluster 2 -0.687505 -0.197585752 Cluster 3 0.8459635 0.719370083 Cluster Age Income Cluster 1 -0.534158 -0.576349161 Cluster 2 1.5360244 1.984403237 Cluster 3 0.3859229 -0.83457936 Cluster Age Income Cluster 1 0.8459635 1.646644617 Cluster 2 -1.147546 -0.400566393 Cluster 3 1.5360244 2.320030979 Cluster Age Income Cluster 1 1.5360244 1.984403237 Cluster 2 -0.150791 -0.895849375 Cluster 3 -1.837607 -1.121744243 Start 1. Sum of Squares: 45.069043 Start 2. Sum of Squares: 24.830185 Best: Start 3. Sum of Squares: 22.397972 Start 4. Sum of Squares: 35.082695 Start 5. Sum of Squares: 25.533615 Cluster Age Income Cluster 1 -1.837607 -1.396366871 Cluster 2 -1.837607 -1.396366871 Cluster 3 -0.687505 -0.750336738 Cluster Age Income Cluster 1 0.8459635 0.719370083 Cluster 2 -1.147546 -0.400566393 Cluster 3 1.2293307 -0.080467783 Cluster Age Income Cluster 1 -0.610832 -0.41375302 Cluster 2 1.2293307 -0.080467783 Cluster 3 -0.764179 -0.623009532 Cluster Age Income Cluster 1 -0.074118 -0.525580284 Cluster 2 -0.610832 -0.41375302 Cluster 3 0.4625964 -0.098740784 Cluster Age Income Cluster 1 1.2293307 -0.080467783 Cluster 2 -0.687505 -0.197585752 Cluster 3 0.6159432 -0.277289314 Start 7. Sum of Squares: 23.072807 Start 8. Sum of Squares: 34.274493 Start 9. Sum of Squares: 35.570412 Start 10. Sum of Squares: 34.630941 Start 6. Sum of Squares: 85.235001 10 iterations Best Start with smallest SS
  • 58. Cluster Age Income Cluster 1 -1.0261 -0.5581 Cluster 2 0.9131 1.4389 Cluster 3 0.5009 -0.4813 Cluster Cluster 1 Cluster 2 Cluster 3 Cluster 1 0.0000 2.7836 1.5290 Cluster 2 2.7836 0.0000 1.9639 Cluster 3 1.5290 1.9639 0.0000 Cluster Size Average Distance Cluster 1 12 0.6215 Cluster 2 8 0.7387 Cluster 3 10 0.5202 Total 30 0.6190 Cluster Centers Inter-Cluster Distances Cluster Summary Normalized data
  • 59. Evaluate the strength of the clusters 59 Cluster 2 is most heterogeneous Distance between Cluster 2 and Cluster 3 centroids =1.9639 The average observation in Cluster 2 is approximately 2.66 times closer to Cluster 2 than to Cluster 3. (1.964/0.739 = 2.66) Rule of thumb : the ratio of between-cluster distance to within-cluster distance should exceed 1.0 for useful clusters. For Cluster 1: within-distance = 0.6215, distance between Cluster 1 and Cluster 2 = 2.7836, the ratio = 2.7836 / 0.6215 = 4.48. The average observation is Cluster 1 is 4.48 times closer to Cluster 1 than to Cluster 2 and also 2.46 time closer than to Cluster 3. (the ratio = 1.529 / 0.6215 = 2.46) Average distance between observations in the cluster Distances between centroids
  • 60. Record ID Cluster Dist.Cluster-1 Dist.Cluster-2 Dist.Cluster-3 Record 1 3 1.1998 2.3291 0.4459 Record 2 1 0.9092 1.8805 1.1484 Record 3 3 1.4389 2.3338 0.3715 Record 4 1 0.7348 3.3369 2.2631 Record 5 2 2.8924 0.2183 2.1558 Record 6 2 2.2665 0.7226 1.2493 Record 7 1 1.1667 3.9503 2.5112 Record 8 3 1.9773 1.6626 0.4942 Record 9 1 0.4946 2.2890 1.2218 Record 10 3 1.6659 1.7417 0.2342 Record 11 2 3.8534 1.0791 2.9865 Record 12 3 1.5580 1.6022 0.3845 Record 13 3 0.9382 2.5657 0.7724 Record 14 2 3.6096 0.8281 2.6742 Record 15 1 0.2699 2.6579 1.2730 Record 16 1 0.4397 2.3988 1.1138 Record 17 1 0.3894 2.7119 1.2185 Record 18 2 1.8247 1.0339 1.5146 Record 19 3 2.3055 1.5519 0.8314 Record 20 1 0.1989 2.7622 1.6505 Record 21 2 3.4990 0.7786 2.7397 Record 22 3 1.3649 2.3578 0.4069 Record 23 2 2.1066 0.7397 1.2481 Record 24 1 0.5543 3.3350 2.0017 Record 25 1 0.9880 3.7580 2.4246 Record 26 2 2.3450 0.5093 1.4566 Record 27 3 0.9526 2.1985 0.5768 Record 28 1 0.4923 2.4810 1.0394 Record 29 1 0.8204 1.9727 1.1863 Record 30 3 2.1974 1.7286 0.6842 Distances from each cluster Classify as group 3 because the distance is the smallest to Cluster 3
  • 61. Record ID Cluster Dist.Cluster-1 Dist.Cluster-2 Dist.Cluster-3 Age Income Record 1 3 1.1998 2.3291 0.4459 48.0000 17546.00 Record 2 1 0.9092 1.8805 1.1484 40.0000 30085.10 Record 3 3 1.4389 2.3338 0.3715 51.0000 16575.40 Record 4 1 0.7348 3.3369 2.2631 23.0000 20375.40 Record 5 2 2.8924 0.2183 2.1558 57.0000 50576.30 Record 6 2 2.2665 0.7226 1.2493 57.0000 37869.60 Record 7 1 1.1667 3.9503 2.5112 22.0000 8877.07 Record 8 3 1.9773 1.6626 0.4942 58.0000 24946.60 Record 9 1 0.4946 2.2890 1.2218 37.0000 25304.30 Record 10 3 1.6659 1.7417 0.2342 54.0000 24212.10 Record 11 2 3.8534 1.0791 2.9865 66.0000 59803.90 Record 12 3 1.5580 1.6022 0.3845 52.0000 26658.80 Record 13 3 0.9382 2.5657 0.7724 44.0000 15735.80 Record 14 2 3.6096 0.8281 2.6742 66.0000 55204.70 Record 15 1 0.2699 2.6579 1.2730 36.0000 19474.60 Record 16 1 0.4397 2.3988 1.1138 38.0000 22342.10 Record 17 1 0.3894 2.7119 1.2185 37.0000 17729.80 Record 18 2 1.8247 1.0339 1.5146 46.0000 41016.00 Record 19 3 2.3055 1.5519 0.8314 62.0000 26909.20 Record 20 1 0.1989 2.7622 1.6505 31.0000 22522.80 Record 21 2 3.4990 0.7786 2.7397 61.0000 57880.70 Record 22 3 1.3649 2.3578 0.4069 50.0000 16497.30 Record 23 2 2.1066 0.7397 1.2481 54.0000 38446.60 Record 24 1 0.5543 3.3350 2.0017 27.0000 15538.80 Record 25 1 0.9880 3.7580 2.4246 22.0000 12640.30 Record 26 2 2.3450 0.5093 1.4566 56.0000 41034.00 Record 27 3 0.9526 2.1985 0.5768 45.0000 20809.70 Record 28 1 0.4923 2.4810 1.0394 39.0000 20114.00 Record 29 1 0.8204 1.9727 1.1863 39.0000 29359.10 Record 30 3 2.1974 1.7286 0.6842 61.0000 24270.10 Insert the original data to analyze the clusters
  • 62. Record ID Cluster Dist.Cluster-1 Dist.Cluster-2 Dist.Cluster-3 Age Income Record 2 1 0.9092 1.8805 1.1484 40.0000 30085.10 Record 4 1 0.7348 3.3369 2.2631 23.0000 20375.40 Record 7 1 1.1667 3.9503 2.5112 22.0000 8877.07 Record 9 1 0.4946 2.2890 1.2218 37.0000 25304.30 Record 15 1 0.2699 2.6579 1.2730 36.0000 19474.60 Record 16 1 0.4397 2.3988 1.1138 38.0000 22342.10 Record 17 1 0.3894 2.7119 1.2185 37.0000 17729.80 Record 20 1 0.1989 2.7622 1.6505 31.0000 22522.80 Record 24 1 0.5543 3.3350 2.0017 27.0000 15538.80 Record 25 1 0.9880 3.7580 2.4246 22.0000 12640.30 Record 28 1 0.4923 2.4810 1.0394 39.0000 20114.00 Record 29 1 0.8204 1.9727 1.1863 39.0000 29359.10 Record 5 2 2.8924 0.2183 2.1558 57.0000 50576.30 Record 6 2 2.2665 0.7226 1.2493 57.0000 37869.60 Record 11 2 3.8534 1.0791 2.9865 66.0000 59803.90 Record 14 2 3.6096 0.8281 2.6742 66.0000 55204.70 Record 18 2 1.8247 1.0339 1.5146 46.0000 41016.00 Record 21 2 3.4990 0.7786 2.7397 61.0000 57880.70 Record 23 2 2.1066 0.7397 1.2481 54.0000 38446.60 Record 26 2 2.3450 0.5093 1.4566 56.0000 41034.00 Record 1 3 1.1998 2.3291 0.4459 48.0000 17546.00 Record 3 3 1.4389 2.3338 0.3715 51.0000 16575.40 Record 8 3 1.9773 1.6626 0.4942 58.0000 24946.60 Record 10 3 1.6659 1.7417 0.2342 54.0000 24212.10 Record 12 3 1.5580 1.6022 0.3845 52.0000 26658.80 Record 13 3 0.9382 2.5657 0.7724 44.0000 15735.80 Record 19 3 2.3055 1.5519 0.8314 62.0000 26909.20 Record 22 3 1.3649 2.3578 0.4069 50.0000 16497.30 Record 27 3 0.9526 2.1985 0.5768 45.0000 20809.70 Record 30 3 2.1974 1.7286 0.6842 61.0000 24270.10 Cluster 1 Cluster 2 Cluster 3 Order the column Cluster and expand the selection to all the other columns
  • 63. Record ID Cluster ID Dist. Clust-1 Dist. Clust-2 Dist. Clust-3 Age Income 2 1 0,90921 1,880479 1,14838 40 30085,1 4 1 0,734788 3,336877 2,263141 23 20375,4 7 1 1,166663 3,950271 2,511188 22 8877,07 9 1 0,494644 2,289048 1,221841 37 25304,3 15 1 0,269881 2,657896 1,27302 36 19474,6 16 1 0,439695 2,398833 1,113817 38 22342,1 17 1 0,389384 2,711894 1,218504 37 17729,8 20 1 0,19891 2,762165 1,650456 31 22522,8 24 1 0,554286 3,335008 2,001662 27 15538,8 25 1 0,98799 3,758034 2,424644 22 12640,3 28 1 0,492325 2,481026 1,039444 39 20114 29 1 0,820351 1,972684 1,186339 39 29359,1 5 2 2,892376 0,218347 2,155763 57 50576,3 6 2 2,266453 0,722611 1,249289 57 37869,6 11 2 3,853381 1,079146 2,986474 66 59803,9 14 2 3,6096 0,828076 2,674181 66 55204,7 18 2 1,824724 1,033919 1,514648 46 41016 21 2 3,498976 0,778609 2,73966 61 57880,7 23 2 2,106615 0,739677 1,248115 54 38446,6 26 2 2,344982 0,50928 1,456556 56 41034 1 3 1,199799 2,329113 0,445879 48 17546 3 3 1,438875 2,333751 0,371502 51 16575,4 8 3 1,977273 1,662577 0,494178 58 24946,6 10 3 1,665932 1,741678 0,23422 54 24212,1 12 3 1,55801 1,602226 0,384503 52 26658,8 13 3 0,938242 2,565664 0,772381 44 15735,8 19 3 2,305502 1,551899 0,831416 62 26909,2 22 3 1,364876 2,357765 0,406925 50 16497,3 27 3 0,952585 2,19853 0,576751 45 20809,7 30 3 2,197374 1,728604 0,684194 61 24270,1 Distances from cluster centers are in normalized coordinates Cluster #Obs Avg. Dist Cluster-1 12 0,622 Cluster-2 8 0,739 Cluster-3 10 0,52 Overall 30 0,619 Normalized coordinates Average Age Average Income n Cluster 1 32.58 21363.61 12 Cluster 2 57.88 47728.98 8 Cluster 3 52.50 21416.10 10
  • 64. K-Means Clustering on Age, Income and Children Age Income Children K-Means Clustering: Reporting Parameters Show data summary Show distance from each cluster Normalized? TRUE TRUE TRUE Start type # Iterations Random seed: initial centroids 3 Random Start 10 12345 Variables # Selected Variables Selected Variables 3 K-Means Clustering: Fitting Parameters # Clusters No. of iterations = No. of times that cluster centroids are recalculated and observations are reassigned to clusters
  • 65. Cluster Age Income Children Cluster 1 -1.837607 -1.121744243 0.9870553 Cluster 2 -1.837607 -1.121744243 0.9870553 Cluster 3 0.7692901 0.950292945 -0.863673 Cluster Age Income Children Cluster 1 -0.610832 -0.41375302 -0.863673 Cluster 2 -0.687505 -0.197585752 0.9870553 Cluster 3 0.8459635 0.719370083 0.9870553 Cluster Age Income Children Cluster 1 -0.534158 -0.576349161 0.061691 Cluster 2 1.5360244 1.984403237 0.061691 Cluster 3 0.3859229 -0.83457936 -0.863673 Cluster Age Income Children Cluster 1 0.8459635 1.646644617 -0.863673 Cluster 2 -1.147546 -0.400566393 -0.863673 Cluster 3 1.5360244 2.320030979 -0.863673 Cluster Age Income Children Cluster 1 1.5360244 1.984403237 0.061691 Cluster 2 -0.150791 -0.895849375 0.061691 Start 1. Sum of Squares: 81.070037 Start 2. Sum of Squares: 51.665475 Best: Start 3. Sum of Squares: 49.715972 Start 4. Sum of Squares: 86.460648 Start 5. Sum of Squares: 54.790692 Cluster Age Income Children Cluster 1 -1.837607 -1.396366871 -0.863673 Cluster 2 -1.837607 -1.396366871 -0.863673 Cluster 3 -0.687505 -0.750336738 0.9870553 Cluster Age Income Children Cluster 1 0.8459635 0.719370083 0.9870553 Cluster 2 -1.147546 -0.400566393 -0.863673 Cluster 3 1.2293307 -0.080467783 -0.863673 Cluster Age Income Children Cluster 1 -0.610832 -0.41375302 -0.863673 Cluster 2 1.2293307 -0.080467783 -0.863673 Cluster 3 -0.764179 -0.623009532 -0.863673 Cluster Age Income Children Cluster 1 -0.074118 -0.525580284 -0.863673 Cluster 2 -0.610832 -0.41375302 -0.863673 Cluster 3 0.4625964 -0.098740784 -0.863673 Cluster Age Income Children Cluster 1 1.2293307 -0.080467783 -0.863673 Cluster 2 -0.687505 -0.197585752 0.9870553 Cluster 3 0.6159432 -0.277289314 0.9870553 Start 7. Sum of Squares: 60.531938 Start 8. Sum of Squares: 85.652446 Start 9. Sum of Squares: 86.948365 Start 10. Sum of Squares: 63.737762 Start 6. Sum of Squares: 133.415590 Random Starts Summary
  • 66. Cluster Centers Cluster Age Income Children Cluster 1 -0.7182 -0.6041 0.4935 Cluster 2 1.1833 1.7700 0.0617 Cluster 3 0.4856 0.0211 -0.7711 Inter-Cluster Distances Cluster Cluster 1 Cluster 2 Cluster 3 Cluster 1 0 3.0722 1.8545 Cluster 2 3.0722 0 2.0589 Cluster 3 1.8545 2.0589 0 Cluster Summary Cluster Size Average Distance Cluster 1 15 1.2413 Cluster 2 5 0.9982 Cluster 3 10 0.8147 Total 30 1.0586 Cluster 1 and Cluster 2 are the most distinct pair of clusters. Observations within clusters are more similar than observations between clusters. Cluster 3 is the most homogeneous (smallest average distance).
  • 67. K-Means Clustering – Predicted Clusters Record ID Cluster ID Dist. Clust-1 Dist. Clust-2 Dist. Clust-3 Age Income Children 1 1 0.987923 2.734159 1.190912 48 17546 1 2 1 1.62843 2.955969 2.847426 40 30085.1 3 3 3 1.764699 2.876826 0.866409 51 16575.4 0 4 1 1.761474 4.184518 3.547236 23 20375.4 3 5 2 3.058468 0.992641 1.667591 57 50576.3 0 6 2 2.107507 1.440136 1.925799 57 37869.6 2 7 1 1.929472 4.473073 2.723054 22 8877.07 0 8 3 2.163087 2.213405 0.509393 58 24946.6 0 9 1 0.640108 2.868416 2.124907 37 25304.3 2 10 1 1.459529 2.317267 1.788088 54 24212.1 2 11 2 3.93367 1.132784 2.529248 66 59803.9 0 12 3 1.868574 2.206364 0.153137 52 26658.8 0 13 1 0.770418 2.981068 1.392612 44 15735.8 1 14 2 3.459491 0.412738 2.37731 66 55204.7 1 15 1 1.358113 3.221133 1.409031 36 19474.6 0 16 3 1.374677 2.97392 1.183135 38 22342.1 0 17 1 0.51566 3.272391 2.250002 37 17729.8 2 18 3 2.184812 1.710157 1.050179 46 41016 0 19 3 2.430829 2.06948 0.756316 62 26909.2 0 20 1 1.437973 3.316736 1.689235 31 22522.8 0 21 2 3.390112 1.012452 2.862822 61 57880.7 2 22 1 1.16403 2.904136 1.96578 50 16497.3 2 23 3 2.342344 1.481687 0.757448 54 38446.6 0 24 1 1.574014 3.872571 2.153806 27 15538.8 0 25 1 1.328415 4.283069 3.129631 22 12640.3 2 26 3 2.543734 1.303721 0.975943 56 41034 0 27 3 1.504315 2.776198 0.78784 45 20809.7 0 28 1 0.470227 2.907789 1.445835 39 20114 1 29 1 1.593881 3.02813 2.871819 39 29359.1 3 30 3 1.948349 2.043314 1.106838 61 24270.1 1 Cluster Age Income Children Count 1 36.6 19734.16 1.47 15 2 61.4 52267.04 1 5 3 52.3 28300.85 0.1 10 Total 45.97 28011.87 0.93 30 Average in original units
  • 68. K-Means Clustering – Predicted Clusters Cluster Age Income Children Count 1 36.6 19734.16 1.47 15 2 61.4 52267.04 1 5 3 52.3 28300.85 0.1 10 Total 45.97 28011.87 0.93 30 Averages Record ID Cluster ID Dist. Clust-1 Dist. Clust-2 Dist. Clust-3 Age Income Children 1 1 0.98792 2.73416 1.19091 48 17546 1 2 1 1.62843 2.95597 2.84743 40 30085.1 3 4 1 1.76147 4.18452 3.54724 23 20375.4 3 7 1 1.92947 4.47307 2.72305 22 8877.07 0 9 1 0.64011 2.86842 2.12491 37 25304.3 2 10 1 1.45953 2.31727 1.78809 54 24212.1 2 13 1 0.77042 2.98107 1.39261 44 15735.8 1 15 1 1.35811 3.22113 1.40903 36 19474.6 0 17 1 0.51566 3.27239 2.25 37 17729.8 2 20 1 1.43797 3.31674 1.68924 31 22522.8 0 22 1 1.16403 2.90414 1.96578 50 16497.3 2 24 1 1.57401 3.87257 2.15381 27 15538.8 0 25 1 1.32842 4.28307 3.12963 22 12640.3 2 28 1 0.47023 2.90779 1.44584 39 20114 1 29 1 1.59388 3.02813 2.87182 39 29359.1 3 5 2 3.05847 0.99264 1.66759 57 50576.3 0 6 2 2.10751 1.44014 1.9258 57 37869.6 2 11 2 3.93367 1.13278 2.52925 66 59803.9 0 14 2 3.45949 0.41274 2.37731 66 55204.7 1 21 2 3.39011 1.01245 2.86282 61 57880.7 2 3 3 1.7647 2.87683 0.86641 51 16575.4 0 8 3 2.16309 2.21341 0.50939 58 24946.6 0 12 3 1.86857 2.20636 0.15314 52 26658.8 0 16 3 1.37468 2.97392 1.18314 38 22342.1 0 18 3 2.18481 1.71016 1.05018 46 41016 0 19 3 2.43083 2.06948 0.75632 62 26909.2 0 23 3 2.34234 1.48169 0.75745 54 38446.6 0 26 3 2.54373 1.30372 0.97594 56 41034 0 27 3 1.50432 2.7762 0.78784 45 20809.7 0 30 3 1.94835 2.04331 1.10684 61 24270.1 1 Order Cluster ID and expand the selection
  • 69. Cluster 1: youngest customers, lowest income and largest families Cluster 2: oldest customers, highest income and an average of one child Cluster 3: older customers, moderate income and few children Distance between clusters Cluster 1 and Cluster 2 are the most distinct pairs of clusters Cluster Age Income Children Count 1 36.6 19734.16 1.47 15 2 61.4 52267.04 1 5 3 52.3 28300.85 0.1 10 Total 45.97 28011.87 0.93 30
  • 70. Cluster Analysis As an unsupervised learning technique, cluster analysis is not guided by any explicit measure of accuracy; the notion of a ‘good’ clustering is therefore subjective and depends on what the analyst hopes the cluster analysis will uncover. We can measure the strength of a cluster by comparing the average distance within a cluster to the distance between cluster centroids. Rule of thumb: the ratio of between-cluster distance to within-cluster distance should exceed 1.0 for useful clusters.
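This rule of thumb can be checked directly from a fitted model. A small sketch, assuming the `km` model and standardized matrix `X` from the earlier k-means sketch:

```python
# Sketch: compare between-cluster separation to within-cluster spread.
import numpy as np
from scipy.spatial.distance import cdist

centers = km.cluster_centers_
labels = km.labels_

# Average distance of each observation to its assigned centroid (within-cluster spread)
within = np.mean([np.linalg.norm(x - centers[c]) for x, c in zip(X, labels)])

# Pairwise distances between cluster centroids (between-cluster separation)
between = cdist(centers, centers)
print(np.round(between, 4))

# Rule of thumb: smallest centroid-to-centroid distance vs. average within-cluster distance
ratio = between[np.triu_indices_from(between, k=1)].min() / within
print("between/within ratio:", round(ratio, 2))
```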
  • 71. Hierarchical Clustering versus K-Means Clustering. Hierarchical clustering: suitable when we have a small data set (e.g., fewer than 500 observations) and want to easily examine solutions with increasing numbers of clusters; a convenient method if you want to observe how clusters are nested; very sensitive to outliers. k-Means clustering: suitable when you know how many clusters you want and you have a larger data set (e.g., larger than 500 observations); partitions the observations, which is appropriate if trying to summarize the data with k “average” observations that describe the data with the minimum amount of error; generally not appropriate for binary or ordinal data because the average is not meaningful.
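To make the comparison concrete, both methods can be run on the same standardized data and their partitions cross-tabulated. A sketch, assuming the matrix `X` from the earlier example:

```python
# Sketch: hierarchical (complete linkage) vs. k-means partitions of the same data.
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans

hier = AgglomerativeClustering(n_clusters=3, linkage="complete").fit(X)
kmns = KMeans(n_clusters=3, n_init=10, random_state=12345).fit(X)

# Cross-tabulate the two label vectors to see where the partitions agree and differ
print(pd.crosstab(hier.labels_, kmns.labels_,
                  rownames=["hierarchical"], colnames=["k-means"]))
```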
  • 72. Association Rules / Evaluating Association Rules. Association rule mining is the data-mining process of finding rules that may govern associations and causal relationships between sets of items. In a given set of transactions, each with multiple items, it tries to find the rules that govern how or why such items are often bought together.
  • 73. Association Rules. Association rules: if-then statements that convey the likelihood of certain items being purchased together. Although association rules are an important tool in market basket analysis, they are applicable to other disciplines. Antecedent: the collection of items (or item set) corresponding to the if portion of the rule. Consequent: the item set corresponding to the then portion of the rule. Support count of an item set: the number of transactions in the data that include that item set.
  • 74. When we go grocery shopping, we often have a standard list of things to buy. Each shopper has a distinctive list, depending on one’s needs and preferences. A person might buy healthy ingredients for a family dinner, while a bachelor might buy beer and chips. Understanding these buying patterns can help to increase sales in several ways. If there is a pair of items, X and Y, that are frequently bought together: •Both X and Y can be placed on the same shelf, so that buyers of one item would be prompted to buy the other. •Promotional discounts could be applied to just one of the two items. •Advertisements on X could be targeted at buyers who purchase Y. •X and Y could be combined into a new product, such as having Y in flavors of X. While we may know that certain items are frequently bought together, the question is, how do we uncover these associations? Besides increasing sales profits, association rules can also be used in other fields. In medical diagnosis, for instance, understanding which symptoms tend to appear together can help to improve patient care and medicine prescription. Definition: Association rules analysis is a technique to uncover how items are associated with each other. There are three common ways to measure association: the support count, the confidence, and the lift ratio.
  • 75. EXAMPLES https://medium.com/analytics-vidhya/association-rule-mining-7f06401f0601 In 2004, Walmart mined trillions of bytes of data and discovered that Strawberry Pop-Tarts were among the most purchased items pre-hurricane. This was attributed to the no-cook, long-lasting nature of the tarts, which made them a disaster favourite, and it was borne out in later years when Walmart stocked up on Pop-Tarts before hurricanes only to have them sell out. Before a hurricane strikes, people tend to stock up on Strawberry Pop-Tarts just as much as batteries and other essentials. Fast-food chains learned very early on that people who buy fast food tend to feel thirsty because of the high salt content and end up buying Coke.
  • 76. Association rule mining is a useful tool for businesses. 1. It helps businesses build sales strategies by identifying products that sell better together. 2. It helps businesses build marketing strategies: the knowledge that some ornaments do not sell as well as others during Christmas may help the manager offer a sale on the slow-moving ornaments. 3. It helps with shelf-life planning: if olives don’t sell very often, the manager will not stock up on them, but still wants the existing stock to sell before its expiration date. Knowing that people who buy pizza dough tend to buy olives, the olives can be offered at a lower price in combination with the pizza dough. 4. It helps with in-store organization: products known to drive the sales of other products can be moved closer together in the store. For instance, if the sale of butter is driven by the sale of bread, they can be placed in the same aisle.
  • 77. Walmart analyzed 1.2 million baskets of a store and found a very interesting association. They found that on Fridays, between 5 pm and 7 pm, diapers and beer were frequently bought together. To test this out, they moved the two closer in the store and found a significant impact on the sales of these products. Further analysis of this led them to the following conclusion: “On Friday evenings, men would head home from work and grab some beer while also picking up diapers for their infants.”
  • 78. Example: Shopping-Cart Transactions. If bread and jelly, then peanut butter (antecedent: bread and jelly; consequent: peanut butter). We consider only association rules with a single consequent. The support count = the number of transactions that include the item set. Support count of {bread, jelly} = 4. Rule of thumb: only consider association rules with a support count of at least 20% of transactions.
  • 79. Example: Shopping-Cart Transactions. If bread and jelly, then peanut butter. We consider only association rules with a single consequent. The support count = the number of transactions that include the item set. Support count of {bread, jelly} = 4; support count of peanut butter = 4. Rule of thumb: only consider association rules with a support count of at least 20% of transactions.
  • 80. Association Rules. Confidence: helps identify reliable association rules. Lift ratio: a measure to evaluate the efficiency of a rule. a) Find the confidence of the rule “If Bread and Fruit, then Jelly.” b) Find the lift ratio of the rule “If Bread and Fruit, then Jelly.” c) Find the confidence of the rule “If Milk, then Peanut Butter.” d) Find the lift ratio of the rule “If Milk, then Peanut Butter.” The confidence measure helps identify which product drives the sale of which other product.
  • 81. a) Confidence of the rule “If Bread and Fruit, then Jelly” = Support{Bread, Fruit, Jelly} / Support{Bread, Fruit} = 4 / 4 = 1.0
  • 82. a) Confidence of the rule “If Bread and Fruit, then Jelly” = 1.0. b) Lift ratio of the rule “If Bread and Fruit, then Jelly” = Confidence / (Support{Jelly} / n) = 1.0 / (5/10) = 2.00. Identifying a customer who purchased both Bread and Fruit as one who also purchased Jelly is two times better than just guessing that a random customer purchased Jelly.
  • 83. c) Confidence of the rule “If Milk, then Peanut Butter” = Support{Milk, Peanut Butter} / Support{Milk} = 4 / 6 = 0.667. d) Lift ratio of the rule “If Milk, then Peanut Butter” = Confidence / (Support{Peanut Butter} / n) = 0.667 / (4/10) = 1.667. Identifying a customer who purchased Milk as one who also purchased Peanut Butter is about 67% better than just guessing that a random customer purchased Peanut Butter.
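These support, confidence, and lift calculations are simple counting operations. A sketch that computes them directly from a list of transactions; the carts shown are illustrative stand-ins for the ten shopping-cart transactions on the slides:

```python
# Sketch: support count, confidence, and lift computed directly from transactions.
transactions = [
    {"bread", "peanut butter", "milk", "fruit", "jelly"},
    {"bread", "jelly", "fruit", "soda", "potato chips"},
    # ... the remaining shopping carts would be listed here
]

def support_count(item_set, carts):
    """Number of transactions that contain every item in item_set."""
    return sum(item_set <= cart for cart in carts)

n = len(transactions)
antecedent, consequent = {"bread", "fruit"}, {"jelly"}

confidence = (support_count(antecedent | consequent, transactions)
              / support_count(antecedent, transactions))
lift = confidence / (support_count(consequent, transactions) / n)
print(round(confidence, 3), round(lift, 3))
```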
  • 84. Association Rules Bread PeanutButter Milk Fruit Jelly Soda PotatoChips Vegetables WhippedCream ChocolateSauce Beer Steak Cheese Yogurt 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 1 0 0 1 1 1 1 0 0 0 0 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 1 0 0 1
  • 85. Association Rules. The utility of a rule depends on both its support and its lift ratio. Although a high lift ratio suggests that the rule is very efficient at identifying when the consequent occurs, if it has very low support the rule may not be as useful as another rule with a lower lift ratio that affects a large number of transactions (as demonstrated by a high support).
  • 86. Association Rules: Fitting Parameters: Data Format: Binary; Method: Apriori; Min support: 4; Min confidence: 50. Association Rules: Reporting Parameters. Menu path: Data Mining - Associate - Association Rule.
  • 87. Rule ID A-Support C-Support Support Confidence Lift-Ratio Antecedent Consequent Rule 2 4 5 4 100 2 [Bread] [Jelly] Rule 22 4 5 4 100 2 [Bread] [Fruit,Jelly] Rule 24 4 5 4 100 2 [Bread,Fruit] [Jelly] Rule 3 5 4 4 80 2 [Jelly] [Bread] Rule 23 5 4 4 80 2 [Jelly] [Bread,Fruit] Rule 26 5 4 4 80 2 [Fruit,Jelly] [Bread] Rule 4 4 6 4 100 1.6666667[PeanutButter] [Milk] Rule 21 4 6 4 100 1.6666667 [PotatoChips] [Soda] Rule 27 4 6 4 100 1.6666667[PeanutButter] [Milk,Fruit] Rule 30 4 6 4 100 1.6666667 [PeanutButter,Fruit] [Milk] Rule 49 4 6 4 100 1.6666667 [PotatoChips] [Fruit,Soda] Rule 51 4 6 4 100 1.6666667 [Fruit,PotatoChips] [Soda] Rule 5 6 4 4 66.66666667 1.6666667 [Milk][PeanutButter] Rule 20 6 4 4 66.66666667 1.6666667 [Soda] [PotatoChips] Rule 28 6 4 4 66.66666667 1.6666667 [Milk] [PeanutButter,Fruit] Rule 31 6 4 4 66.66666667 1.6666667 [Milk,Fruit] [PeanutButter] Rule 48 6 4 4 66.66666667 1.6666667 [Soda] [Fruit,PotatoChips] Rule 50 6 4 4 66.66666667 1.6666667 [Fruit,Soda] [PotatoChips] Rule 11 6 6 5 83.33333333 1.3888889 [Milk] [Soda] Rule 12 6 6 5 83.33333333 1.3888889 [Soda] [Milk] Rule 37 6 6 5 83.33333333 1.3888889 [Milk] [Fruit,Soda] Rule 39 6 6 5 83.33333333 1.3888889 [Soda] [Milk,Fruit] Rule 40 6 6 5 83.33333333 1.3888889 [Milk,Fruit] [Soda] Rule 42 6 6 5 83.33333333 1.3888889 [Fruit,Soda] [Milk] Rule 10 5 6 4 80 1.3333333 [Jelly] [Milk] Rule 18 5 6 4 80 1.3333333 [Jelly] [Soda] Rules presented in decreasing order of Lift Ratio
  • 88. Rule ID A-Support C-Support Support Confidence Lift-Ratio Antecedent Consequent Rule 33 5 6 4 80 1.3333333 [Jelly] [Milk,Fruit] Rule 36 5 6 4 80 1.3333333 [Fruit,Jelly] [Milk] Rule 43 5 6 4 80 1.3333333 [Jelly] [Fruit,Soda] Rule 45 5 6 4 80 1.3333333 [Fruit,Jelly] [Soda] Rule 9 6 5 4 66.66666667 1.3333333 [Milk] [Jelly] Rule 19 6 5 4 66.66666667 1.3333333 [Soda] [Jelly] Rule 32 6 5 4 66.66666667 1.3333333 [Milk] [Fruit,Jelly] Rule 34 6 5 4 66.66666667 1.3333333 [Milk,Fruit] [Jelly] Rule 44 6 5 4 66.66666667 1.3333333 [Soda] [Fruit,Jelly] Rule 46 6 5 4 66.66666667 1.3333333 [Fruit,Soda] [Jelly] Rule 1 4 9 4 100 1.1111111 [Bread] [Fruit] Rule 6 4 9 4 100 1.1111111[PeanutButter] [Fruit] Rule 7 6 9 6 100 1.1111111 [Milk] [Fruit] Rule 14 5 9 5 100 1.1111111 [Jelly] [Fruit] Rule 16 6 9 6 100 1.1111111 [Soda] [Fruit] Rule 17 4 9 4 100 1.1111111 [PotatoChips] [Fruit] Rule 25 4 9 4 100 1.1111111 [Bread,Jelly] [Fruit] Rule 29 4 9 4 100 1.1111111 [PeanutButter,Milk] [Fruit] Rule 35 4 9 4 100 1.1111111 [Milk,Jelly] [Fruit] Rule 41 5 9 5 100 1.1111111 [Milk,Soda] [Fruit] Rule 47 4 9 4 100 1.1111111 [Jelly,Soda] [Fruit] Rule 52 4 9 4 100 1.1111111 [Soda,PotatoChips] [Fruit] Rule 8 9 6 6 66.66666667 1.1111111 [Fruit] [Milk] Rule 15 9 6 6 66.66666667 1.1111111 [Fruit] [Soda] Rule 13 9 5 5 55.55555556 1.1111111 [Fruit] [Jelly] Rule 38 9 5 5 55.55555556 1.1111111 [Fruit] [Milk,Soda]
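The rules tables above come from a spreadsheet data-mining add-in. A roughly equivalent sketch in Python uses the open-source mlxtend library (an assumption on my part; any Apriori implementation would do), applied to a small illustrative one-hot basket matrix rather than the full slide data:

```python
# Sketch: Apriori rule mining with mlxtend on a one-hot (True/False) basket matrix.
# API as in common mlxtend releases; argument names may differ across versions.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

basket = pd.DataFrame({            # illustrative values, not the exact slide data
    "Bread":        [1, 1, 0, 1, 1, 0, 0, 0, 0, 0],
    "Jelly":        [1, 1, 0, 1, 1, 0, 0, 0, 0, 0],
    "Milk":         [1, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "PeanutButter": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
}).astype(bool)

# Item sets appearing in at least 40% of transactions (support count of 4 out of 10)
frequent = apriori(basket, min_support=0.4, use_colnames=True)

# Keep rules with confidence of at least 50%, then rank by lift ratio as on the slide
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
top = rules.sort_values("lift", ascending=False)
print(top[["antecedents", "consequents", "support", "confidence", "lift"]])
```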
  • 89. Evaluating Association Rules. An association rule is judged on how actionable it is and how well it explains the relationship between item sets. An association rule is useful if it is well supported and explains an important, previously unknown relationship.
  • 90. Example: the rule “If A then B” in 100 transactions. Suppose the support for consequent B = 2, Support(A and B) = 2, and antecedent A is very popular: Support(A) = 50. Confidence(If A then B) = Support(A and B) / Support(A) = 2/50 = 0.04. Lift ratio(If A then B) = Confidence(If A then B) / (Support(B)/100) = 0.04 / (2/100) = 2. A high lift ratio can thus accompany very low support and confidence.
  • 91. Text Mining Text mining is the process of transforming unstructured text into structured data for easy analysis. Text mining uses natural language processing (NLP), allowing machines to understand the human language and process it automatically. Text data is often referred to as unstructured data because in its raw form, it cannot be stored in a traditional structured database (rows and columns). Audio and video data are also examples of unstructured data. Data mining with text data is more challenging than data mining with traditional numerical data, because it requires more preprocessing to convert the text to a format amenable for analysis.
  • 92. Basic Methods 1) Word frequency Word frequency can be used to identify the most recurrent terms or concepts in a set of data. 2) Collocation Collocation refers to a sequence of words that commonly appear near each other. 3) Concordance Concordance is used to recognize the particular context or instance in which a word or set of words appears. Topic Analysis, Sentiment Analysis
  • 93. Example: Triad Airline ◦ Triad solicits feedback from its customers through a follow-up e-mail the day after the customer has completed a flight. ◦ Survey asks the customer to rate various aspects of the flight and asks the respondent to type comments into a dialog box in the e-mail; includes: ◦ Quantitative feedback from the ratings. ◦ Comments entered by the respondents which need to be analyzed. ◦ A collection of text documents to be analyzed is called a corpus.
  • 94. Example: Triad Airline 10 respondents Concerns The wi-fi service was horrible. It was slow and cut off several times. My seat was uncomfortable. My flight was delayed 2 hours for no apparent reason. My seat would not recline. The man at the ticket counter was rude. Service was horrible. The flight attendant was rude. Service was bad. My flight was delayed with no explanation. My drink spilled when the guy in front of me reclined his seat. My flight was canceled. The arm rest of my seat was nasty. To be analyzed, text data needs to be converted to structured data
  • 95. Example: Converting text data To be analyzed, text data needs to be converted to structured data (rows and columns of numerical data) so that the tools of descriptive statistics, data visualization and data mining can be applied. We must convert a group of documents into a matrix of rows and columns where the rows correspond to a document and the columns correspond to a particular word. A presence/absence or binary term-document matrix is a matrix with the rows representing documents and the columns representing words. Entries in the columns indicate either the presence or the absence of a particular word in a particular document.
  • 96. Example: Converting text data ◦ Creating the list of terms to use in the presence/absence matrix can be a complicated matter: ◦ Too many terms results in a matrix with many columns, which may be difficult to manage and could yield meaningless results. ◦ Too few terms may miss important relationships. ◦ Term frequency along with the problem context are often used as a guide. ◦ In Triad’s case, management used word frequency and the context of having a goal of satisfied customers to come up with the following list of terms they feel are relevant for categorizing the respondent’s comments.
  • 97. Example: Converting text data Concerns The wi-fi service was horrible. It was slow and cut off several times. My seat was uncomfortable. My flight was delayed 2 hours for no apparent reason. My seat would not recline. The man at the ticket counter was rude. Service was horrible. The flight attendant was rude. Service was bad. My flight was delayed with no explanation. My drink spilled when the guy in front of me reclined his seat. My flight was canceled. The arm rest of my seat was nasty. delayed, flight, horrible, recline, rude, seat, and service
  • 98. Presence/Absence Term-Document Matrix Term Document Delayed Flight Horrible Recline Rude Seat Service 1 0 0 1 0 0 0 1 2 0 0 0 0 0 1 0 3 1 1 0 0 0 0 0 4 0 0 0 1 0 1 0 5 0 0 1 0 1 0 1 6 0 1 0 0 1 0 1 7 1 1 0 0 0 0 0 8 0 0 0 1 0 1 0 9 0 1 0 0 0 0 0 10 0 0 0 0 0 1 0 The text-mining process converts unstructured text into numerical data
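A presence/absence matrix like this can also be produced programmatically. Below is a sketch using scikit-learn's CountVectorizer restricted to the seven chosen terms; note that word variants such as "reclined" would only map to "recline" after the stemming step described on the next slide, so this sketch matches exact terms only.

```python
# Sketch: presence/absence (binary) term-document matrix with a fixed vocabulary.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

comments = [
    "The wi-fi service was horrible. It was slow and cut off several times.",
    "My seat was uncomfortable.",
    "My flight was delayed 2 hours for no apparent reason.",
    # ... the remaining seven comments
]

terms = ["delayed", "flight", "horrible", "recline", "rude", "seat", "service"]
vectorizer = CountVectorizer(vocabulary=terms, binary=True, lowercase=True)
matrix = pd.DataFrame(vectorizer.fit_transform(comments).toarray(), columns=terms)
print(matrix)  # 1 = term present in the comment, 0 = absent
```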
  • 99. Preprocessing Text Data for Analysis ◦ The text-mining process converts unstructured text into numerical data and applies quantitative techniques. ◦ Which terms become the headers of the columns of the term-document matrix can greatly impact the analysis. Tokenization: the process of dividing text into separate terms, referred to as tokens. ◦ Symbols and punctuation must be removed from the document, and all letters should be converted to lowercase. ◦ Different forms of the same word, such as “stacking,” “stacked,” and “stack,” probably should not be considered as distinct terms. Stemming: the process of converting a word to its stem or root word.
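Tokenization and stemming are typically handled by an NLP library. A minimal sketch using NLTK's Porter stemmer (assuming the nltk package is available):

```python
# Sketch: tokenize a comment and stem each token to its root form.
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
text = "My drink spilled when the guy in front of me reclined his seat."

# Tokenization: lowercase, drop punctuation, split into word tokens
tokens = re.findall(r"[a-z]+", text.lower())

# Stemming: "reclined" -> "reclin", "spilled" -> "spill", etc.
stems = [stemmer.stem(token) for token in tokens]
print(stems)
```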
  • 100. Recommendations ◦ The goal of preprocessing is to generate a list of most-relevant terms that is sufficiently small so as to lend itself to analysis: ◦ Frequency can be used to eliminate words from consideration as tokens. ◦ Low-frequency words probably will not be very useful as tokens. ◦ Consolidating words that are synonyms can reduce the set of tokens. ◦ Most text-mining software gives the user the ability to manually specify terms to include or exclude as tokens. The use of slang, humor, and sarcasm can cause interpretation problems and might require more sophisticated data cleansing and subjective intervention on the part of the analyst to avoid misinterpretation. Data preprocessing parses the original text data down to the set of tokens deemed relevant for the topic being studied.
  • 101. Frequency Term-Document Matrix ◦ When the documents in a corpus contain many words and when the frequency of word occurrence is important to the context of the business problem, preprocessing can be used to develop a frequency term-document matrix. ◦ A frequency term-document matrix is a matrix whose rows represent documents and columns represent tokens, and the entries in the matrix are the frequency of occurrence of each token in each document.
  • 102. Menu path: Data Mining - Text. Term frequency-inverse document frequency (TF-IDF).
  • 103. Term Count Info Text Var Original (Total) Final (Total) Reduction, % Vocabulary Comments 84 19 22,61904762 7 Document Info Document ID # Characters # Terms Comments_Doc1 70 14 Comments_Doc2 26 4 Comments_Doc3 53 10 Comments_Doc4 26 5 Comments_Doc5 61 11 Comments_Doc6 47 8 Comments_Doc7 42 7 Comments_Doc8 63 13 Comments_Doc9 23 4 Comments_Doc10 34 8 Comments Top Terms Info Term Collection Frequency Document Frequency flight 4 4 seat 4 4 servic 3 3 delay 2 2 horribl 2 2 reclin 2 2 rude 2 2 Comments
  • 104. Term-Document Matrix Doc ID delay flight horribl reclin rude seat servic Comments_Doc1 0 0 1 0 0 0 1 Comments_Doc2 0 0 0 0 0 1 0 Comments_Doc3 1 1 0 0 0 0 0 Comments_Doc4 0 0 0 1 0 1 0 Comments_Doc5 0 0 1 0 1 0 1 Comments_Doc6 0 1 0 0 1 0 1 Comments_Doc7 1 1 0 0 0 0 0 Comments_Doc8 0 0 0 1 0 1 0 Comments_Doc9 0 1 0 0 0 0 0 Comments_Doc10 0 0 0 0 0 1 0
  • 105. Use Hierarchical Cluster analysis to group comments On the Presence / Absence Term-Document Matrix # Selected Variables 7 Selected Variables delay flight horribl reclin rude seat servic Hierarchical Clustering: Fitting Parameters Similarity Measure JACCARD Clustering Method COMPLETE LINKAGE Hierarchical Clustering: Model Parameters Cluster Assignment TRUE # Clusters 3 Hierarchical Clustering: Reporting Parameters Normalized? FALSE Draw Dendrogram? TRUE Maximum Number of Leaves in Dendrogram 10 Data Type Raw Data
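The same Jaccard / complete-linkage clustering can be reproduced with SciPy. A sketch, assuming the binary term-document matrix `matrix` from the earlier CountVectorizer sketch:

```python
# Sketch: hierarchical clustering of the binary term-document matrix,
# Jaccard distance + complete linkage, cut into 3 clusters as on the slide.
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

binary_matrix = matrix.values.astype(bool)          # presence/absence matrix from earlier

distances = pdist(binary_matrix, metric="jaccard")  # pairwise Jaccard distances between comments
Z = linkage(distances, method="complete")           # complete-linkage agglomeration
clusters = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
print(clusters)                                     # cluster label for each of the 10 comments
# scipy.cluster.hierarchy.dendrogram(Z) would draw the nested cluster structure
```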
  • 106. Record ID Cluster Sub-Cluster Record 1 1 1 Record 2 2 2 Record 3 3 3 Record 4 2 4 Record 5 1 5 Record 6 1 6 Record 7 3 7 Record 8 2 8 Record 9 3 9 Record 10 2 10 Record ID Cluster Sub-Cluster Comments Record 1 1 1 The wi-fi service was horrible. It was slow and cut off several times. Record 2 2 2 My seat was uncomfortable. Record 3 3 3 My flight was delayed 2 hours for no apparent reason. Record 4 2 4 My seat would not recline. Record 5 1 5 The man at the ticket counter was rude. Service was horrible. Record 6 1 6 The flight attendant was rude. Service was bad. Record 7 3 7 My flight was delayed with no explanation. Record 8 2 8 My drink spilled when the guy in front of me reclined his seat. Record 9 3 9 My flight was canceled. Record 10 2 10 The arm rest of my seat was nasty.
  • 107. Record ID Cluster Sub-Cluster Comments Record 1 1 1 The wi-fi service was horrible. It was slow and cut off several times. Record 5 1 5 The man at the ticket counter was rude. Service was horrible. Record 6 1 6 The flight attendant was rude. Service was bad. Record 2 2 2 My seat was uncomfortable. Record 4 2 4 My seat would not recline. Record 8 2 8 My drink spilled when the guy in front of me reclined his seat. Record 10 2 10 The arm rest of my seat was nasty. Record 3 3 3 My flight was delayed 2 hours for no apparent reason. Record 7 3 7 My flight was delayed with no explanation. Record 9 3 9 My flight was canceled. Cluster 1: {1, 5, 6} documents about service issues Cluster 2: {2, 4, 8, 10} documents about seat issues Cluster 3: {3, 7, 9} Documents about scheduling issues
  • 108. Example Movie Review Frequency Term-document matrix Terms Document Great Terrible 1 5 0 2 5 1 3 5 1 4 3 3 5 5 1 6 0 5 7 4 1 8 5 3 9 1 3 10 1 2 We have 10 reviews from movie critics. After using preprocessing techniques, including text reduction by synonyms, the number of tokens is down to two: Great and Terrible.
  • 109. Frequency Term-document matrix Terms Document Great Terrible 1 5 0 2 5 1 3 5 1 4 3 3 5 5 1 6 0 5 7 4 1 8 5 3 9 1 3 10 1 2 Apply K-Means Clustering to the Frequency- terms matrix With k = 2 The process of clustering / categorizing comments or reviews as positive, negative or neutral is known as sentiment analysis
  • 110. Variables # Selected Variables 2 Selected Variables Great Terrible K-Means Clustering: Fitting Parameters # Clusters 2 Start type Random Start # Iterations 10 Random seed: initial centroids 12345 K-Means Clustering: Reporting Parameters Show data summary TRUE Show distance from each cluster TRUE Normalized? FALSE Cluster Great Terrible Cluster 1 4.5714286 1.428571429 Cluster 2 0.6666667 3.333333333 Cluster Size Average Distance Cluster 1 7 1.1250 Cluster 2 3 1.2136 Total 10 1.1516 One random start and 10 iterations
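A sketch of the same k = 2 clustering with scikit-learn, using the Great/Terrible frequency counts from the slide (unnormalized, one random start). Note that scikit-learn labels clusters 0 and 1, so the numbering need not match the slide.

```python
# Sketch: k = 2 clustering of the movie-review frequency term-document matrix.
import pandas as pd
from sklearn.cluster import KMeans

reviews = pd.DataFrame({
    "Great":    [5, 5, 5, 3, 5, 0, 4, 5, 1, 1],
    "Terrible": [0, 1, 1, 3, 1, 5, 1, 3, 3, 2],
})

# One random start with a fixed seed, mirroring the slide's settings
km2 = KMeans(n_clusters=2, n_init=1, random_state=12345)
reviews["Cluster"] = km2.fit_predict(reviews[["Great", "Terrible"]])

print(reviews)                # cluster membership for each review
print(km2.cluster_centers_)   # one center high on "Great", the other high on "Terrible"
```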
  • 111. Record ID Cluster Dist.Cluster-1 Dist.Cluster-2 Record 1 1 1.491 5.467 Record 2 1 0.606 4.922 Record 3 1 0.606 4.922 Record 4 1 2.222 2.357 Record 5 1 0.606 4.922 Record 6 2 5.801 1.795 Record 7 1 0.714 4.069 Record 8 1 1.629 4.346 Record 9 2 3.902 0.471 Record 10 2 3.617 1.374
  • 112. Record ID Cluster Great Terrible Record 1 1 5 0 Record 2 1 5 1 Record 3 1 5 1 Record 4 1 3 3 Record 5 1 5 1 Record 6 2 0 5 Record 7 1 4 1 Record 8 1 5 3 Record 9 2 1 3 Record 10 2 1 2
  • 113. Record ID Cluster Great Terrible Record 1 1 5 0 Record 2 1 5 1 Record 3 1 5 1 Record 4 1 3 3 Record 5 1 5 1 Record 7 1 4 1 Record 8 1 5 3 Record 6 2 0 5 Record 9 2 1 3 Record 10 2 1 2 Reviews tend to be positive Reviews tend to be negative (3, 3) corresponds to a balanced review and is classified in Cluster 1
  • 114. Sentiment Analysis The process of clustering / categorizing comments or reviews as positive, negative or neutral is known as sentiment analysis
  • 115. Text mining is • how a business analyst turns 50,000 hotel guest reviews into specific recommendations; • how a workforce analyst improves productivity and reduces employee turnover; • how healthcare providers and biopharma researchers understand patient experiences; • and much, much more. SAS Text Miner, Python, R, SPSS …
