3. Introduction
The increase in the use of data-mining techniques in
business has been caused largely by three events:
◦ The explosion in the amount of data being produced
and electronically tracked
◦ The ability to electronically warehouse these data
◦ The affordability of computer power to analyze the
data
4. Observation
An observation is a set of recorded values of variables associated with a single entity. It is a row of values in a spreadsheet or database, in which the columns correspond to the variables.
Example: a 35-year-old single male dentist who donated $1,000 in 2016 and $2,000 in 2017 is one observation; age, gender, occupation, marital status, and the yearly donations are its variables.
5. Supervised or Unsupervised Learning
Data-mining approaches can be separated into two categories:
Supervised learning—for prediction and classification
Unsupervised learning—to detect patterns and relationships in the data
◦ Can be thought of as high-dimensional descriptive analytics
◦ Designed to describe patterns and relationships in large data sets with many observations of many variables
◦ There is no outcome variable to predict
◦ No definitive measure of accuracy; assessment is qualitative
6. Cluster Analysis
Goal: to segment observations into similar groups based on observed variables
◦ Can be employed during the data-preparation step to identify variables or observations that can be aggregated or removed from consideration
◦ Commonly used in marketing to divide customers into different homogeneous groups, a practice known as market segmentation
◦ Also used to identify outliers
7. Cluster Methods
Bottom-up hierarchical clustering starts with each observation in its own cluster and then sequentially merges the most similar clusters to create a series of nested clusters.
k-means clustering assigns each observation to one of k clusters such that the observations assigned to the same cluster are as similar as possible.
Both methods depend on a notion of how similar two observations are, so we must define a measure of similarity between observations.
8. Three influential factors
Hierarchical versus nonhierarchical clustering
The measurement of the distance between observations
The measurement of the distance between clusters
9. Measurement of Distance Between Observations
Euclidean distance
Matching coefficients
Jaccard coefficients
10. Measuring Similarity Between Observations
Euclidean distance: the most common method to measure dissimilarity between observations when the observations include continuous variables.
Let observations $u = (u_1, u_2, \ldots, u_q)$ and $v = (v_1, v_2, \ldots, v_q)$ each comprise measurements of $q$ variables.
The Euclidean distance between observations $u$ and $v$ is:

$$d_{u,v} = \sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2 + \cdots + (u_q - v_q)^2}$$

NOTE: This measure of distance is highly influenced by the scale on which the variables are measured.
11. Calculate the Euclidean Distance
Euclidean distance becomes smaller as a pair of observations becomes more similar with respect to their variable values.

Example: Euclidean distance between (2 cars, 5 children) and (1 car, 3 children), i.e., between (2, 5) and (1, 3):

$$d = \sqrt{(2-1)^2 + (5-3)^2} = \sqrt{5} = 2.24$$
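As a quick check, a minimal NumPy sketch of this calculation:

```python
# Minimal sketch: Euclidean distance between two observations.
import numpy as np

u = np.array([2, 5])  # (2 cars, 5 children)
v = np.array([1, 3])  # (1 car, 3 children)

distance = np.sqrt(np.sum((u - v) ** 2))  # sqrt((2-1)^2 + (5-3)^2)
print(round(distance, 2))  # 2.24
```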
12. Euclidean Distance
Euclidean distance is highly influenced by the scale on which variables are measured.
◦ It is therefore common to standardize the units of each variable j of each observation u.
◦ Example: uj, the value of variable j in observation u, is replaced with its z-score, zj.
The conversion to z-scores also makes it easier to identify outlier measurements, which can distort the Euclidean distance between observations.
13. Example
Age Female Income Married Children CarLoan Mortgage
48 1 17546.00 0 1 0 0
40 0 30085.10 1 3 1 1
51 1 16575.40 1 0 1 0
23 1 20375.40 1 3 0 0
57 1 50576.30 1 0 0 0
57 1 37869.60 1 2 0 0
22 0 8877.07 0 0 0 0
58 0 24946.60 1 0 1 0
37 1 25304.30 1 2 1 0
54 0 24212.10 1 2 1 0
66 1 59803.90 1 0 0 0
52 1 26658.80 0 0 1 1
44 1 15735.80 1 1 0 1
66 1 55204.70 1 1 1 1
36 0 19474.60 1 0 0 1
38 1 22342.10 1 0 1 1
37 1 17729.80 1 2 0 1
46 1 41016.00 1 0 0 1
62 1 26909.20 1 0 0 0
31 0 22522.80 1 0 1 0
61 0 57880.70 1 2 0 0
50 0 16497.30 1 2 0 0
54 0 38446.60 1 0 0 0
27 1 15538.80 0 0 1 1
22 0 12640.30 0 2 1 0
56 0 41034.00 1 0 1 1
45 0 20809.70 1 0 0 1
39 1 20114.00 1 1 0 0
39 1 29359.10 0 3 1 1
61 0 24270.10 1 1 0 0
Example: A financial advising company that provides personalized financial advice to its clients would like to segment its customer pool into several groups (clusters) to better serve them.
Variables:
◦ Age
◦ Female (1 if female, 0 if male)
◦ Annual Income
◦ Married (1 if married, 0 if not)
◦ Number of children
◦ CarLoan (1 if the customer has a car loan, 0 if not)
◦ Mortgage (1 if the customer has a mortgage, 0 if not)
14. Example: Consider only Age and Income
Age Income
48 17546.00
40 30085.10
51 16575.40
23 20375.40
57 50576.30
57 37869.60
22 8877.07
58 24946.60
37 25304.30
54 24212.10
66 59803.90
52 26658.80
44 15735.80
66 55204.70
36 19474.60
38 22342.10
37 17729.80
46 41016.00
62 26909.20
31 22522.80
61 57880.70
50 16497.30
54 38446.60
27 15538.80
22 12640.30
56 41034.00
45 20809.70
39 20114.00
39 29359.10
The Euclidean distance between the first two observations, (48, 17546.00) and (40, 30085.10), is:

$$d = \sqrt{(48 - 40)^2 + (17546.00 - 30085.10)^2} = 12539.1$$

This dissimilarity measure is dominated by the large values of Income. It is better to use the z-score of each variable to remove the effect of the different units.
Age Income
average 45.97 28011.87
st.dev. 13.04 13703.28
Zage Zincome
0.16 -0.76
-0.46 0.15
0.39 -0.83
-1.76 -0.56
0.85 1.65
0.85 0.72
-1.84 -1.40
0.92 -0.22
-0.69 -0.20
0.62 -0.28
1.54 2.32
0.46 -0.10
-0.15 -0.90
1.54 1.98
-0.76 -0.62
-0.61 -0.41
-0.69 -0.75
0.00 0.95
1.23 -0.08
-1.15 -0.40
1.15 2.18
0.31 -0.84
0.62 0.76
-1.45 -0.91
-1.84 -1.12
0.77 0.95
-0.07 -0.53
-0.53 -0.58
-0.53 0.10
Standardizing each variable, e.g., the age of the first observation:

$$z_{age} = \frac{48 - 45.97}{13.04} = 0.16$$

The first two observations, (48, 17546.00) and (40, 30085.10), become (0.16, -0.76) and (-0.46, 0.15).

Standardized distance between the first two observations:

$$d = \sqrt{(0.16 - (-0.46))^2 + (-0.76 - 0.15)^2} = 1.101$$

Z-scores also help identify outliers.
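A small Python sketch of the standardization step, using the means and standard deviations reported above:

```python
# Sketch: standardize with the reported means and sample standard deviations,
# then compute the Euclidean distance between the first two observations.
import numpy as np

mean = np.array([45.97, 28011.87])   # (Age, Income) averages from the table
sd = np.array([13.04, 13703.28])     # sample standard deviations

obs1 = np.array([48, 17546.00])
obs2 = np.array([40, 30085.10])

z1 = (obs1 - mean) / sd  # approximately ( 0.16, -0.76)
z2 = (obs2 - mean) / sd  # approximately (-0.46,  0.15)

print(np.sqrt(np.sum((z1 - z2) ** 2)))  # approximately 1.10
```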
15. Matching Coefficients
For categorical variables encoded as 0–1, a better measure of similarity between two observations is obtained by counting the number of variables with matching values.
The simplest overlap measure is called the matching coefficient and is computed by:

Matching coefficient = (number of variables with matching values) / (total number of variables)

Example: matching coefficient for (1, 0, 1, 0, 1) and (0, 1, 1, 0, 1). The last three variables match, so the coefficient is 3/5 = 0.60.
16. Jaccard’s Coefficient
A weakness of the matching coefficient is that if two observations both have a 0 entry for a categorical variable, this is counted as a sign of similarity between the two observations.
To avoid overstating similarity due to the shared absence of a feature, a similarity measure called Jaccard’s coefficient does not count matching zero entries and is computed by:

Jaccard’s coefficient = (number of variables where both observations equal 1) / (total number of variables − number of variables where both equal 0)

Example: Jaccard’s coefficient for (1, 0, 1, 0, 1) and (0, 1, 1, 0, 1). Two variables have a 1 in both observations and one variable has a 0 in both, so the coefficient is 2/(5 − 1) = 0.50.
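Both coefficients are simple to compute; a short sketch:

```python
# Sketch: matching coefficient and Jaccard's coefficient for 0-1 vectors.
def matching_coefficient(u, v):
    return sum(a == b for a, b in zip(u, v)) / len(u)

def jaccard_coefficient(u, v):
    both_one = sum(a == 1 and b == 1 for a, b in zip(u, v))
    both_zero = sum(a == 0 and b == 0 for a, b in zip(u, v))
    return both_one / (len(u) - both_zero)

u, v = (1, 0, 1, 0, 1), (0, 1, 1, 0, 1)
print(matching_coefficient(u, v))  # 0.6
print(jaccard_coefficient(u, v))   # 0.5
```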
17. Example
Calculate the matching coefficient and Jaccard’s coefficient for the first five observations, considering only the categorical variables.
Obs Age Female Income Married Children CarLoan Mortgage
1 48 1 17546 0 1 0 0
2 40 0 30085 1 3 1 1
3 51 1 16575 1 0 1 0
4 23 1 20375 1 3 0 0
5 57 1 50576 1 0 0 0
Obs Female Married CarLoan Mortgage
1 1 0 0 0
2 0 1 1 1
3 1 1 1 0
4 1 1 0 0
5 1 1 0 0
21. Example
Jaccard’s Coefficient Similarity Matrix
Obs.  1           2           3           4         5
1     1
2     0           1
3     1/3 = 0.33  2/4 = 0.50  1
4     1/2 = 0.50  1/4 = 0.25  2/3 = 0.67  1
5     1/2 = 0.50  1/4 = 0.25  2/3 = 0.67  2/2 = 1   1
Obs Female Married CarLoan Mortgage
1 1 0 0 0
2 0 1 1 1
3 1 1 1 0
4 1 1 0 0
5 1 1 0 0
According to Jaccard, the pair of observations 1 and 4 is exactly as similar (0.50) as the pair 2 and 3 (0.50).
According to the matching coefficient, observations 1 and 4 are more similar (0.75) than observations 2 and 3 (0.50).
22. Hierarchical Clustering
◦Determines the similarity of two clusters by considering the similarity
between the observations composing either cluster
◦Starts with each observation in its own cluster and then iteratively
combines the two clusters that are the most similar into a single cluster
◦Given a way to measure similarity between observations, there are
several clustering method alternatives for comparing observations in two
clusters to obtain a cluster similarity measure
24. Similarity Measures Between Clusters
• Single linkage: defines the similarity between two clusters as the similarity of the pair of observations (one from each cluster) that are the most similar.
• Complete linkage: defines the similarity between two clusters as the similarity of the pair of observations (one from each cluster) that are the most different.
• Group average linkage: defines the similarity between two clusters as the average similarity computed over all pairs of observations between the two clusters.
• Median linkage: analogous to group average linkage except that it uses the median of the similarities computed between all pairs of observations between the two clusters.
• Centroid linkage: uses the averaging concept of cluster centroids to define between-cluster similarity.
25. Measuring Similarity Between Clusters
Single linkage will consider two clusters to be close if an observation in one of the clusters is close to at least one observation in the other cluster. The cluster formed by merging two clusters that are close with respect to single linkage may therefore contain pairs of observations that are very different.
Complete linkage will consider two clusters to be close if their most different pair of observations is close. This method produces clusters such that all member observations of a cluster are relatively close to each other, but it can be distorted by outlier observations.
Both the single and complete linkage methods base similarity on a single pair of observations.
26. Measuring Similarity Between Clusters
Group average linkage is based on all pairs of observations. If Cluster 1 has n1 observations and Cluster 2 has n2 observations, the similarity measure is the average over all n1 × n2 pairs. This method produces clusters that are less dominated by the similarity between single pairs of observations.
Centroid linkage is based on the average observation of each cluster, called the centroid. The similarity between two clusters is defined as the similarity of their centroids.
Median linkage is similar to group average linkage, except that it uses the median instead of the average.
27. Example
Consider the following matrix D of distances between pairs of 5 objects:

      1   2   3   4   5
1     0   9   3   6  11
2     9   0   7   5  10
3     3   7   0   9   2
4     6   5   9   0   8
5    11  10   2   8   0

Classify using hierarchical clustering with single linkage.
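One way to verify the hand classification, assuming SciPy is available (the slides work this example by hand):

```python
# Sketch: single-linkage hierarchical clustering on the distance matrix D.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

D = np.array([
    [ 0,  9,  3,  6, 11],
    [ 9,  0,  7,  5, 10],
    [ 3,  7,  0,  9,  2],
    [ 6,  5,  9,  0,  8],
    [11, 10,  2,  8,  0],
])

# linkage() takes the condensed (flattened upper-triangle) distance matrix
Z = linkage(squareform(D), method="single")
print(Z)  # each row: the two merged clusters, the merge distance, the new size

print(fcluster(Z, t=2, criterion="maxclust"))  # e.g., cut the tree into 2 clusters
# scipy.cluster.hierarchy.dendrogram(Z) would draw the nested-cluster tree
```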
38. Cluster Analysis - More Measures
Centroid linkage uses the averaging concept of cluster centroids to define between-cluster similarity.
Ward’s method merges the two clusters such that the dissimilarity of the observations within the resulting single cluster increases as little as possible.
When McQuitty’s method considers merging two clusters A and B, the dissimilarity of the resulting cluster AB to any other cluster C is calculated as ((dissimilarity between A and C) + (dissimilarity between B and C)) / 2.
A dendrogram is a chart that depicts the set of nested clusters resulting at each step of aggregation.
40. Inputs
Data
  Workbook: Data Chapter 4.xlsx
  Worksheet: KYC
  Range: $A$1:$G$31
  # Records in the input data: 30
Variables
  # Selected Variables: 4
  Selected Variables: Female Married CarLoan Mortgage
Hierarchical Clustering: Fitting Parameters
  Similarity Measure: MATCHING
  Clustering Method: GROUP AVERAGE
Hierarchical Clustering: Model Parameters
  Cluster Assignment: TRUE
  # Clusters: 29
Hierarchical Clustering: Reporting Parameters
  Normalized?: FALSE
  Draw Dendrogram?: TRUE
  Maximum Number of Leaves in Dendrogram: 10
  Data Type: Raw Data
41. Clustering Stages
Record ID Cluster Sub-Cluster
Record 1 1 1
Record 2 2 2
Record 3 3 3
Record 4 4 1
Record 5 4 1
Record 6 5 1
Record 7 6 4
Record 8 7 2
Record 9 8 3
Record 10 9 2
Record 11 10 1
Record 12 11 5
Record 13 12 6
Record 14 13 7
Record 15 14 8
Record 16 15 7
Record 17 16 6
Record 18 17 6
Record 19 18 1
Record 20 19 2
Record 21 20 9
Record 22 21 9
Record 23 22 9
Record 24 23 5
Record 25 24 10
Record 26 25 2
Record 27 26 8
Record 28 27 1
Record 29 28 5
Record 30 29 9
Stage Cluster 1 Cluster 2 Distance
Stage1 4 5 0
Stage2 4 6 0
Stage3 3 9 0
Stage4 8 10 0
Stage5 4 11 0
Stage6 14 16 0
Stage7 13 17 0
Stage8 13 18 0
Stage9 4 19 0
Stage10 8 20 0
Stage11 21 22 0
Stage12 21 23 0
Stage13 12 24 0
Stage14 2 26 0
Stage15 15 27 0
Stage16 4 28 0
Stage17 12 29 0
Stage18 21 30 0
Stage19 1 4 0.25
Stage20 2 8 0.25
Stage21 3 14 0.25
Stage22 13 15 0.25
Stage23 7 21 0.25
Stage24 1 7 0.3214
Stage25 2 25 0.35
Stage26 3 12 0.375
Stage27 1 13 0.4125
Stage28 2 3 0.5060
Stage29 1 2 0.5905
Sub-cluster membership: each of the 30 observations is assigned to one of the 10 sub-clusters.
43. Analytic Solver Basic limits the maximum number of leaves to 10; setting a value higher than 10 results in an error message.
If the number of leaves is less than the number of observations, some of the clusters on the horizontal axis will initially correspond to groups of observations combined in early steps of the agglomeration that are not represented in the dendrogram.
45. The vertical distance between the two horizontal lines is the cost of merging clusters in terms of decreased homogeneity within clusters: 0.42 − 0.251 = 0.169.
NOTE: Elongated portions of the dendrogram represent mergers of more dissimilar clusters.
46. Cluster Durability or Strength
The durability or strength of a cluster is measured by the difference between the distance value at which the cluster is originally formed and the distance value at which it is merged with another cluster.
Example: Cluster 8 is formed at distance 0 and merged with cluster 6 at distance 0.25, so the strength of cluster 8 = 0.25 − 0 = 0.25.
49. CLUSTER 1
Obs Female Married Car Loan Mortgage
3 1 1 1 0
9 1 1 1 0
12 1 0 1 1
14 1 1 1 1
16 1 1 1 1
24 1 0 1 1
29 1 0 1 1
Cluster 1
Row Labels Count of Female
0 0
1 7
Grand Total 7
Row Labels Count of Married
0 3
1 4
Grand Total 7
Row Labels Count of Mortgage
0 2
1 5
Grand Total 7
All have car loans
50. CLUSTER 2
Obs Female Married Car Loan Mortgage
2 0 1 1 1
8 0 1 1 0
10 0 1 1 0
20 0 1 1 0
25 0 0 1 0
26 0 1 1 1
Cluster 2
Row Labels Count of Female
0 6
1 0
Grand Total 6
Row Labels Count of Married
0 1
1 5
Grand Total 6
Row Labels Count of Mortgage
0 4
1 2
Grand Total 6
All have car loans
51. CLUSTER 3
Obs Female Married Car Loan Mortgage
1 1 0 0 0
4 1 1 0 0
5 1 1 0 0
6 1 1 0 0
7 0 0 0 0
11 1 1 0 0
13 1 1 0 1
15 0 1 0 1
17 1 1 0 1
18 1 1 0 1
19 1 1 0 0
21 0 1 0 0
22 0 1 0 0
23 0 1 0 0
27 0 1 0 1
28 1 1 0 0
30 0 1 0 0
Cluster 3
Row Labels Count of Female
0 7
1 10
Grand Total 17
Row Labels Count of Married
0 2
1 15
Grand Total 17
Row Labels Count of Mortgage
0 12
1 5
Grand Total 17
None have car loans
52. APPLICATIONS OF HIERARCHICAL CLUSTERING
Hierarchical clustering is often used on DNA microarrays.
Clustering of gene expression profiles is often used to try
to discover subclasses of disease.
Validation of these clusters is important for accurate
scientific interpretation of the results.
53. k-Means Clustering
◦ Given a value of k, the k-means algorithm randomly partitions the observations into k clusters.
◦ After all observations have been assigned to a cluster, the resulting cluster centroids are calculated.
◦ Using the updated cluster centroids, all observations are reassigned to the cluster with the closest centroid using the Euclidean distance.
◦ The last two steps repeat until no observation changes cluster (or an iteration limit is reached).
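A minimal sketch of the same procedure, assuming scikit-learn rather than the slides' Analytic Solver (the seed means something different here, so exact results will differ):

```python
# Sketch: k-means with k = 3 on standardized Age and Income, 10 random starts.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

age_income = np.array([
    [48, 17546.00], [40, 30085.10], [51, 16575.40], [23, 20375.40],
    [57, 50576.30], [57, 37869.60], [22, 8877.07],  [58, 24946.60],
    [37, 25304.30], [54, 24212.10], [66, 59803.90], [52, 26658.80],
    [44, 15735.80], [66, 55204.70], [36, 19474.60], [38, 22342.10],
    [37, 17729.80], [46, 41016.00], [62, 26909.20], [31, 22522.80],
    [61, 57880.70], [50, 16497.30], [54, 38446.60], [27, 15538.80],
    [22, 12640.30], [56, 41034.00], [45, 20809.70], [39, 20114.00],
    [39, 29359.10], [61, 24270.10],
])

# Note: StandardScaler divides by the population standard deviation, so the
# z-scores differ very slightly from the slides' sample-based z-scores.
X = StandardScaler().fit_transform(age_income)
km = KMeans(n_clusters=3, n_init=10, random_state=12345).fit(X)

print(km.labels_)           # cluster assignment for each observation
print(km.inertia_)          # total within-cluster sum of squared distances
print(km.cluster_centers_)  # standardized centroids
```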
55. Example
Clustering observations by Age and Income using k-means clustering with k = 3.
(In the original slide, a scatter chart shows the three clusters, with callouts marking the most heterogeneous cluster, the largest cluster, and the most homogeneous cluster.)
56. Data
Workbook DemoKTC.xlsx
Worksheet Data
Range $A$1:$G$31
# Records in the input data 30
# Selected Variables 2
Selected Variables Age Income
K-Means Clustering: Fitting Parameters
# Clusters 3
Start type Random Start
# Iterations 10
Random seed: initial centroids 12345
K-Means Clustering: Reporting Parameters
Show data summary TRUE
Show distance from each cluster TRUE
Normalized? TRUE
Cluster using Age and Income only, with the variables normalized.
57. Random Starts (Age and Income)
For each of the 10 random starts, the report lists the three initial cluster centroids (in standardized Age and Income units) and the resulting total within-cluster sum of squares (SS):
Start 1. Sum of Squares: 45.069043
Start 2. Sum of Squares: 24.830185
Best: Start 3. Sum of Squares: 22.397972
Start 4. Sum of Squares: 35.082695
Start 5. Sum of Squares: 25.533615
Start 6. Sum of Squares: 85.235001
Start 7. Sum of Squares: 23.072807
Start 8. Sum of Squares: 34.274493
Start 9. Sum of Squares: 35.570412
Start 10. Sum of Squares: 34.630941
With 10 random starts, the best start is the one with the smallest SS: Start 3.
59. Evaluate the Strength of the Clusters
Rule of thumb: the ratio of between-cluster distance to within-cluster distance should exceed 1.0 for useful clusters (between-cluster distance = distance between centroids; within-cluster distance = average distance between observations in the cluster).
Cluster 2 is the most heterogeneous. The distance between the Cluster 2 and Cluster 3 centroids is 1.9639, so the average observation in Cluster 2 is approximately 2.66 times closer to Cluster 2 than to Cluster 3 (1.964 / 0.739 = 2.66).
For Cluster 1: within-cluster distance = 0.6215 and the distance between the Cluster 1 and Cluster 2 centroids = 2.7836, so the ratio = 2.7836 / 0.6215 = 4.48. The average observation in Cluster 1 is 4.48 times closer to Cluster 1 than to Cluster 2, and 2.46 times closer than to Cluster 3 (1.529 / 0.6215 = 2.46).
60. Distances from Each Cluster
Record ID Cluster Dist.Cluster-1 Dist.Cluster-2 Dist.Cluster-3
Record 1 3 1.1998 2.3291 0.4459
Record 2 1 0.9092 1.8805 1.1484
Record 3 3 1.4389 2.3338 0.3715
Record 4 1 0.7348 3.3369 2.2631
Record 5 2 2.8924 0.2183 2.1558
Record 6 2 2.2665 0.7226 1.2493
Record 7 1 1.1667 3.9503 2.5112
Record 8 3 1.9773 1.6626 0.4942
Record 9 1 0.4946 2.2890 1.2218
Record 10 3 1.6659 1.7417 0.2342
Record 11 2 3.8534 1.0791 2.9865
Record 12 3 1.5580 1.6022 0.3845
Record 13 3 0.9382 2.5657 0.7724
Record 14 2 3.6096 0.8281 2.6742
Record 15 1 0.2699 2.6579 1.2730
Record 16 1 0.4397 2.3988 1.1138
Record 17 1 0.3894 2.7119 1.2185
Record 18 2 1.8247 1.0339 1.5146
Record 19 3 2.3055 1.5519 0.8314
Record 20 1 0.1989 2.7622 1.6505
Record 21 2 3.4990 0.7786 2.7397
Record 22 3 1.3649 2.3578 0.4069
Record 23 2 2.1066 0.7397 1.2481
Record 24 1 0.5543 3.3350 2.0017
Record 25 1 0.9880 3.7580 2.4246
Record 26 2 2.3450 0.5093 1.4566
Record 27 3 0.9526 2.1985 0.5768
Record 28 1 0.4923 2.4810 1.0394
Record 29 1 0.8204 1.9727 1.1863
Record 30 3 2.1974 1.7286 0.6842
Each record is classified into the cluster whose centroid is nearest; e.g., Record 1 is assigned to Cluster 3 because its distance to Cluster 3 (0.4459) is the smallest.
61. Distances with the Original Data
Record ID Cluster Dist.Cluster-1 Dist.Cluster-2 Dist.Cluster-3 Age Income
Record 1 3 1.1998 2.3291 0.4459 48.0000 17546.00
Record 2 1 0.9092 1.8805 1.1484 40.0000 30085.10
Record 3 3 1.4389 2.3338 0.3715 51.0000 16575.40
Record 4 1 0.7348 3.3369 2.2631 23.0000 20375.40
Record 5 2 2.8924 0.2183 2.1558 57.0000 50576.30
Record 6 2 2.2665 0.7226 1.2493 57.0000 37869.60
Record 7 1 1.1667 3.9503 2.5112 22.0000 8877.07
Record 8 3 1.9773 1.6626 0.4942 58.0000 24946.60
Record 9 1 0.4946 2.2890 1.2218 37.0000 25304.30
Record 10 3 1.6659 1.7417 0.2342 54.0000 24212.10
Record 11 2 3.8534 1.0791 2.9865 66.0000 59803.90
Record 12 3 1.5580 1.6022 0.3845 52.0000 26658.80
Record 13 3 0.9382 2.5657 0.7724 44.0000 15735.80
Record 14 2 3.6096 0.8281 2.6742 66.0000 55204.70
Record 15 1 0.2699 2.6579 1.2730 36.0000 19474.60
Record 16 1 0.4397 2.3988 1.1138 38.0000 22342.10
Record 17 1 0.3894 2.7119 1.2185 37.0000 17729.80
Record 18 2 1.8247 1.0339 1.5146 46.0000 41016.00
Record 19 3 2.3055 1.5519 0.8314 62.0000 26909.20
Record 20 1 0.1989 2.7622 1.6505 31.0000 22522.80
Record 21 2 3.4990 0.7786 2.7397 61.0000 57880.70
Record 22 3 1.3649 2.3578 0.4069 50.0000 16497.30
Record 23 2 2.1066 0.7397 1.2481 54.0000 38446.60
Record 24 1 0.5543 3.3350 2.0017 27.0000 15538.80
Record 25 1 0.9880 3.7580 2.4246 22.0000 12640.30
Record 26 2 2.3450 0.5093 1.4566 56.0000 41034.00
Record 27 3 0.9526 2.1985 0.5768 45.0000 20809.70
Record 28 1 0.4923 2.4810 1.0394 39.0000 20114.00
Record 29 1 0.8204 1.9727 1.1863 39.0000 29359.10
Record 30 3 2.1974 1.7286 0.6842 61.0000 24270.10
Insert the original data (Age and Income) next to the distances to analyze the clusters.
62. Records Sorted by Cluster
Record ID Cluster Dist.Cluster-1 Dist.Cluster-2 Dist.Cluster-3 Age Income
Record 2 1 0.9092 1.8805 1.1484 40.0000 30085.10
Record 4 1 0.7348 3.3369 2.2631 23.0000 20375.40
Record 7 1 1.1667 3.9503 2.5112 22.0000 8877.07
Record 9 1 0.4946 2.2890 1.2218 37.0000 25304.30
Record 15 1 0.2699 2.6579 1.2730 36.0000 19474.60
Record 16 1 0.4397 2.3988 1.1138 38.0000 22342.10
Record 17 1 0.3894 2.7119 1.2185 37.0000 17729.80
Record 20 1 0.1989 2.7622 1.6505 31.0000 22522.80
Record 24 1 0.5543 3.3350 2.0017 27.0000 15538.80
Record 25 1 0.9880 3.7580 2.4246 22.0000 12640.30
Record 28 1 0.4923 2.4810 1.0394 39.0000 20114.00
Record 29 1 0.8204 1.9727 1.1863 39.0000 29359.10
Record 5 2 2.8924 0.2183 2.1558 57.0000 50576.30
Record 6 2 2.2665 0.7226 1.2493 57.0000 37869.60
Record 11 2 3.8534 1.0791 2.9865 66.0000 59803.90
Record 14 2 3.6096 0.8281 2.6742 66.0000 55204.70
Record 18 2 1.8247 1.0339 1.5146 46.0000 41016.00
Record 21 2 3.4990 0.7786 2.7397 61.0000 57880.70
Record 23 2 2.1066 0.7397 1.2481 54.0000 38446.60
Record 26 2 2.3450 0.5093 1.4566 56.0000 41034.00
Record 1 3 1.1998 2.3291 0.4459 48.0000 17546.00
Record 3 3 1.4389 2.3338 0.3715 51.0000 16575.40
Record 8 3 1.9773 1.6626 0.4942 58.0000 24946.60
Record 10 3 1.6659 1.7417 0.2342 54.0000 24212.10
Record 12 3 1.5580 1.6022 0.3845 52.0000 26658.80
Record 13 3 0.9382 2.5657 0.7724 44.0000 15735.80
Record 19 3 2.3055 1.5519 0.8314 62.0000 26909.20
Record 22 3 1.3649 2.3578 0.4069 50.0000 16497.30
Record 27 3 0.9526 2.1985 0.5768 45.0000 20809.70
Record 30 3 2.1974 1.7286 0.6842 61.0000 24270.10
Sort on the Cluster column (expanding the selection to all the other columns) to group the records into Cluster 1, Cluster 2, and Cluster 3.
64. K-Means Clustering on Age, Income, and Children
Variables
  # Selected Variables: 3
  Selected Variables: Age Income Children
K-Means Clustering: Fitting Parameters
  # Clusters: 3
  Start type: Random Start
  # Iterations: 10
  Random seed: initial centroids: 12345
K-Means Clustering: Reporting Parameters
  Show data summary: TRUE
  Show distance from each cluster: TRUE
  Normalized?: TRUE
Note: No. of iterations = the number of times that cluster centroids are recalculated and observations are reassigned to clusters.
65. Random Starts Summary (Age, Income, and Children)
For each of the 10 random starts, the report lists the three initial cluster centroids (standardized Age, Income, and Children) and the resulting total within-cluster sum of squares (SS):
Start 1. Sum of Squares: 81.070037
Start 2. Sum of Squares: 51.665475
Best: Start 3. Sum of Squares: 49.715972
Start 4. Sum of Squares: 86.460648
Start 5. Sum of Squares: 54.790692
Start 6. Sum of Squares: 133.415590
Start 7. Sum of Squares: 60.531938
Start 8. Sum of Squares: 85.652446
Start 9. Sum of Squares: 86.948365
Start 10. Sum of Squares: 63.737762
The best start is Start 3, with the smallest SS.
66. Cluster Centers
Cluster Age Income Children
Cluster 1 -0.7182 -0.6041 0.4935
Cluster 2 1.1833 1.7700 0.0617
Cluster 3 0.4856 0.0211 -0.7711
Inter-Cluster Distances
Cluster Cluster 1 Cluster 2 Cluster 3
Cluster 1 0 3.0722 1.8545
Cluster 2 3.0722 0 2.0589
Cluster 3 1.8545 2.0589 0
Cluster Summary
Cluster Size Average Distance
Cluster 1 15 1.2413
Cluster 2 5 0.9982
Cluster 3 10 0.8147
Total 30 1.0586
Cluster 1 and Cluster 2 are the most distinct pair of clusters (largest inter-cluster distance). Observations within clusters are more similar than observations between clusters; Cluster 3 is the most homogeneous (smallest average distance).
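The rule-of-thumb ratios can be checked directly from the two tables above; a small sketch, with the values copied from the report:

```python
# Sketch: between-cluster distance divided by average within-cluster distance.
# Values come from the Inter-Cluster Distances and Cluster Summary tables.
within = {1: 1.2413, 2: 0.9982, 3: 0.8147}   # average distance within each cluster
between = {(1, 2): 3.0722, (1, 3): 1.8545, (2, 3): 2.0589}

for (a, b), dist in between.items():
    # A ratio above 1.0 suggests the pair of clusters is usefully separated
    print(f"Clusters {a} and {b}: "
          f"{dist / within[a]:.2f} vs cluster {a}, "
          f"{dist / within[b]:.2f} vs cluster {b}")
```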
69. Cluster 1: youngest customers, lowest
income and largest families
Cluster 2: oldest customers, highest
income and an average of one child
Cluster 3: older customers,
moderate income and few children
Distance between clusters: Cluster 1 and Cluster 2 are the most distinct pair of clusters.
Cluster Age Income Children Count
1 36.6 19734.16 1.47 15
2 61.4 52267.04 1 5
3 52.3 28300.85 0.1 10
Total 45.97 28011.87 0.93 30
70. Cluster Analysis
As an unsupervised learning technique, cluster analysis is not guided
by any explicit measure of accuracy, and thus the notion of a ‘good’
clustering is subjective and is dependent on what the analyst hopes
the cluster analysis will uncover.
We can measure the strength of a cluster by comparing the average
distance in a cluster to the distance between cluster centroids.
Rule of thumb: the ratio of between-cluster distance to within-
cluster distance should exceed 1.0 for useful clusters.
71. Hierarchical Clustering versus k-Means Clustering

Hierarchical clustering:
◦ Suitable when we have a small data set (e.g., fewer than 500 observations) and want to easily examine solutions with increasing numbers of clusters
◦ Convenient method if you want to observe how clusters are nested
◦ Very sensitive to outliers

k-means clustering:
◦ Suitable when you know how many clusters you want and you have a larger data set (e.g., larger than 500 observations)
◦ Partitions the observations, which is appropriate if trying to summarize the data with k “average” observations that describe the data with the minimum amount of error
◦ Generally not appropriate for binary or ordinal data because the average is not meaningful
72. Association Rules
EVALUATING ASSOCIATION RULES
Association rule mining is the data-mining process of finding rules that may govern associations between sets of items. In a given set of transactions with multiple items, it tries to find the rules that govern how or why such items are often bought together.
73. Association Rules
Association rules: if-then statements that convey the likelihood of certain items being purchased together.
Although association rules are an important tool in market basket analysis, they are applicable to other disciplines.
Antecedent: the collection of items (or item set) corresponding to the if portion of the rule.
Consequent: the item set corresponding to the then portion of the rule.
Support count of an item set: the number of transactions in the data that include that item set.
74. When we go grocery shopping, we often have a standard list of
things to buy. Each shopper has a distinctive list, depending on
one’s needs and preferences. A person might buy healthy
ingredients for a family dinner, while a bachelor might buy beer
and chips. Understanding these buying patterns can help to
increase sales in several ways. If there is a pair of items, X and
Y, that are frequently bought together:
•Both X and Y can be placed on the
same shelf, so that buyers of one
item would be prompted to buy the
other.
•Promotional discounts could be
applied to just one out of the two
items.
•Advertisements on X could be
targeted at buyers who purchase Y.
•X and Y could be combined into a
new product, such as having Y in
flavors of X.
While we may know that certain
items are frequently bought
together, the question is, how do
we uncover these associations?
Besides increasing sales profits,
association rules can also be used
in other fields. In medical diagnosis
for instance, understanding which
symptoms tend to appear together
can help to improve patient care
and medicine prescription.
Definition
Association rules analysis is a technique to uncover how items are associated with each other. There are three common ways to measure association: the support count, the confidence, and the lift ratio.
75. https://medium.com/analytics-vidhya/association-rule-mining-7f06401f0601
EXAMPLES
Before a hurricane strikes, people tend to stock up on Strawberry Pop-Tarts just as much as batteries and other essentials. In 2004, Walmart mined trillions of bytes of data to discover that Strawberry Pop-Tarts were the most purchased item pre-hurricane. This was attributed to the no-cook, long-lasting qualities of the tarts that made them disaster favourites, and it proved true in later years when stores that stocked up on Pop-Tarts pre-hurricane sold out of them.
Fast-food chains learned very early in the game that people who buy fast food tend to feel thirsty due to the high salt content and end up buying Coke.
76. Association rule mining is a good tool for businesses.
1. It helps businesses build sales strategies by identifying products that sell better together.
2. It helps businesses build marketing strategies. The knowledge that some ornaments do not sell as well as others during Christmas may help the manager offer a sale on the less frequently purchased ornaments.
3. It helps shelf-life planning. If olives don’t sell very often, the manager will not stock up on them, but he still wants to ensure that the existing stock sells before the expiration date. With the knowledge that people who buy pizza dough tend to buy olives, the olives can be offered at a lower price in combination with the pizza dough.
4. It helps in-store organization. Products that are known to drive the sales of other products can be moved closer together in the store. For instance, if the sale of butter is driven by the sale of bread, they can be moved to the same aisle.
77. Walmart analyzed 1.2 million baskets of a store and found
a very interesting association. They found that on Fridays,
between 5 pm and 7 pm, diapers and beer were frequently
bought together.
To test this out, they moved the two closer in the store and
found a significant impact on the sales of these products.
Further analysis of this led them to the following
conclusion: “On Friday evenings, men would head home
from work and grab some beer while also picking up
diapers for their infants.”
78. Example: Shopping-Cart Transactions
“If bread and jelly, then peanut butter”
◦ Antecedent: bread and jelly; consequent: peanut butter.
◦ We consider only association rules with a single consequent.
◦ The support count of an item set = the number of transactions that include it.
◦ Support count of {bread, jelly} = 4; support count of {peanut butter} = 4.
Rule of thumb: only consider association rules with a support count of at least 20% of the transactions.
80. Association Rules
Confidence: helps identify reliable association rules; the confidence measure helps identify which product drives the sale of which other product.
Lift ratio: a measure used to evaluate the efficiency of a rule.
a) Find the confidence of the rule “If Bread and Fruit, then Jelly”
b) Find the lift ratio of the rule “If Bread and Fruit, then Jelly”
c) Find the confidence of the rule “If Milk, then Peanut Butter”
d) Find the lift ratio of the rule “If Milk, then Peanut Butter”
81. a) Confidence of the rule “If Bread and Fruit, then Jelly”:

$$\text{Confidence} = \frac{\text{Support}\{\text{Bread, Fruit, Jelly}\}}{\text{Support}\{\text{Bread, Fruit}\}} = \frac{4}{4} = 1.0$$
82. b) Lift ratio of the rule “If Bread and Fruit, then Jelly”:

$$\text{Lift ratio} = \frac{\text{Confidence}}{\text{Support}\{\text{Jelly}\}/n} = \frac{1.0}{5/10} = 2.00$$

Interpretation: identifying a customer who purchased both bread and fruit as one who also purchased jelly is two times better than just guessing that a random customer purchased jelly.
83. c) Confidence of the rule “If Milk, then Peanut Butter”:

$$\text{Confidence} = \frac{\text{Support}\{\text{Milk, Peanut Butter}\}}{\text{Support}\{\text{Milk}\}} = \frac{4}{6} = 0.66$$

d) Lift ratio of the rule “If Milk, then Peanut Butter”:

$$\text{Lift ratio} = \frac{0.66}{4/10} = 1.66$$

Interpretation: identifying a customer who purchased milk as one who also purchased peanut butter is 66% better than just guessing that a random customer purchased peanut butter.
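These measures are simple to compute directly. The sketch below defines them in Python over a hypothetical list of shopping carts (the carts are invented for illustration and are not the slides' data):

```python
# Sketch: support count, confidence, and lift ratio for association rules.
def support(transactions, items):
    """Support count: number of transactions containing every item in `items`."""
    return sum(items <= cart for cart in transactions)

def confidence(transactions, antecedent, consequent):
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)

def lift_ratio(transactions, antecedent, consequent):
    return confidence(transactions, antecedent, consequent) / (
        support(transactions, consequent) / len(transactions)
    )

carts = [  # hypothetical transactions
    {"bread", "fruit", "jelly"},
    {"bread", "fruit", "jelly", "milk"},
    {"milk", "peanut butter"},
    {"bread", "milk"},
]
print(confidence(carts, {"bread", "fruit"}, {"jelly"}))  # 2/2 = 1.0
print(lift_ratio(carts, {"bread", "fruit"}, {"jelly"}))  # 1.0 / (2/4) = 2.0
```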
85. Association Rules
The utility of a rule depends on both its support and its lift ratio. Although a high lift ratio suggests that the rule is very efficient at identifying when the consequent occurs, a rule with very low support may not be as useful as a rule with a lower lift ratio that affects a large number of transactions (as demonstrated by a high support).
86. Analytic Solver (Data Mining - Associate - Association Rules)
  Data Format: Binary
  Method: Apriori
  Min support: 4
  Min confidence: 50
89. Evaluating Association Rules
An association rule is judged on how actionable it is and how well it explains the relationship between item sets.
An association rule is useful if it is well supported and explains an important, previously unknown relationship.
90. Example
Consider 100 transactions and the rule “If A then B”, where the antecedent A is very popular: Support(A) = 50, Support(B) = 2, and Support(A and B) = 2.

Confidence (If A then B) = Support(A and B) / Support(A) = 2/50 = 0.04

Lift ratio (If A then B) = Confidence (If A then B) / (Support(B)/100) = 0.04 / (2/100) = 2

The rule has a high lift ratio but very low confidence and support.
91. Text Mining
Text mining is the process of transforming unstructured text into
structured data for easy analysis. Text mining uses natural language
processing (NLP), allowing machines to understand the human language
and process it automatically.
Text data is often referred to as unstructured data because in its raw form, it
cannot be stored in a traditional structured database (rows and columns).
Audio and video data are also examples of unstructured data.
Data mining with text data is more challenging than data mining with traditional
numerical data, because it requires more preprocessing to convert the text to a
format amenable for analysis.
92. Basic Methods
1) Word frequency
Word frequency can be used to identify the most recurrent
terms or concepts in a set of data.
2) Collocation
Collocation refers to a sequence of words that commonly appear
near each other.
3) Concordance
Concordance is used to recognize the particular context or
instance in which a word or set of words appears.
Other common methods include topic analysis and sentiment analysis.
93. Example: Triad Airline
◦ Triad solicits feedback from its customers through a follow-up e-mail the
day after the customer has completed a flight.
◦ Survey asks the customer to rate various aspects of the flight and asks the
respondent to type comments into a dialog box in the e-mail; includes:
◦ Quantitative feedback from the ratings.
◦ Comments entered by the respondents which need to be analyzed.
◦ A collection of text documents to be analyzed is called a corpus.
94. Example: Triad Airline (10 respondents)
Concerns
The wi-fi service was horrible. It was slow and cut off several times.
My seat was uncomfortable.
My flight was delayed 2 hours for no apparent reason.
My seat would not recline.
The man at the ticket counter was rude. Service was horrible.
The flight attendant was rude. Service was bad.
My flight was delayed with no explanation.
My drink spilled when the guy in front of me reclined his seat.
My flight was canceled.
The arm rest of my seat was nasty.
To be analyzed, text data needs to be converted to structured data
95. Example: Converting text data
To be analyzed, text data needs to be converted to structured data (rows and
columns of numerical data) so that the tools of descriptive statistics, data
visualization and data mining can be applied.
We must convert a group of documents into a matrix of rows and columns
where the rows correspond to a document and the columns correspond to a
particular word.
A presence/absence or binary term-document matrix is a matrix with the rows
representing documents and the columns representing words.
Entries in the columns indicate either the presence or the absence of a
particular word in a particular document.
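One possible way to build such a matrix, assuming scikit-learn is available (the slides use different software), for the Triad comments and management's term list:

```python
# Sketch: binary (presence/absence) term-document matrix for the Triad comments.
from sklearn.feature_extraction.text import CountVectorizer

comments = [
    "The wi-fi service was horrible. It was slow and cut off several times.",
    "My seat was uncomfortable.",
    "My flight was delayed 2 hours for no apparent reason.",
    # ... the remaining survey comments would be listed here as well
]

terms = ["delayed", "flight", "horrible", "recline", "rude", "seat", "service"]

# binary=True records presence/absence rather than counts; note that without
# stemming, variants such as "reclined" would not match the term "recline".
vectorizer = CountVectorizer(vocabulary=terms, binary=True, lowercase=True)
matrix = vectorizer.fit_transform(comments)
print(matrix.toarray())  # rows = documents, columns = terms
```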
96. Example: Converting text data
◦ Creating the list of terms to use in the presence/absence matrix can be a
complicated matter:
◦ Too many terms results in a matrix with many columns, which may be
difficult to manage and could yield meaningless results.
◦ Too few terms may miss important relationships.
◦ Term frequency along with the problem context are often used as a guide.
◦ In Triad’s case, management used word frequency and the context of
having a goal of satisfied customers to come up with the following list of
terms they feel are relevant for categorizing the respondent’s comments.
97. Example: Converting text data
Concerns
The wi-fi service was horrible. It was slow and cut off several times.
My seat was uncomfortable.
My flight was delayed 2 hours for no apparent reason.
My seat would not recline.
The man at the ticket counter was rude. Service was horrible.
The flight attendant was rude. Service was bad.
My flight was delayed with no explanation.
My drink spilled when the guy in front of me reclined his seat.
My flight was canceled.
The arm rest of my seat was nasty.
Selected terms: delayed, flight, horrible, recline, rude, seat, and service.
99. Preprocessing Text Data for Analysis
◦ The text-mining process converts unstructured text into numerical data and applies quantitative techniques.
◦ Which terms become the headers of the columns of the term-document matrix can greatly impact the analysis.
Tokenization: the process of dividing text into separate terms, referred to as tokens. Symbols and punctuation must be removed from the document, and all letters should be converted to lowercase. Different forms of the same word, such as “stacking,” “stacked,” and “stack,” probably should not be considered distinct terms.
Stemming: the process of converting a word to its stem or root word.
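As one possible implementation (the slides do not name a tool), NLTK's Porter stemmer produces root forms like those that appear in the term list later (e.g., "horribl", "reclin"):

```python
# Sketch: stemming word variants to a common root with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["stacking", "stacked", "stack", "horrible", "reclined", "service"]:
    print(word, "->", stemmer.stem(word))
# stacking/stacked/stack -> stack; horrible -> horribl;
# reclined -> reclin; service -> servic
```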
100. Recommendations
◦ The goal of preprocessing is to generate a list of most-relevant terms that is
sufficiently small so as to lend itself to analysis:
◦ Frequency can be used to eliminate words from consideration as tokens.
◦ Low-frequency words probably will not be very useful as tokens.
◦ Consolidating words that are synonyms can reduce the set of tokens.
◦ Most text-mining software gives the user the ability to manually specify terms to
include or exclude as tokens.
The use of slang, humor, and sarcasm can cause interpretation problems and might
require more sophisticated data cleansing and subjective intervention on the part of the
analyst to avoid misinterpretation.
Data preprocessing parses the original text data down to the set of tokens deemed
relevant for the topic being studied.
101. Frequency Term-Document Matrix
◦ When the documents in a corpus contain many words and when the frequency
of word occurrence is important to the context of the business problem,
preprocessing can be used to develop a frequency term-document matrix.
◦ A frequency term-document matrix is a matrix whose rows represent
documents and columns represent tokens, and the entries in the matrix are the
frequency of occurrence of each token in each document.
105. Use Hierarchical Cluster analysis to group comments
On the Presence / Absence Term-Document Matrix
# Selected Variables 7
Selected Variables delay flight horribl reclin rude seat servic
Hierarchical Clustering: Fitting Parameters
Similarity Measure JACCARD
Clustering Method COMPLETE LINKAGE
Hierarchical Clustering: Model Parameters
Cluster Assignment TRUE
# Clusters 3
Hierarchical Clustering: Reporting Parameters
Normalized? FALSE
Draw Dendrogram? TRUE
Maximum Number of Leaves in Dendrogram 10
Data Type Raw Data
106. Cluster Assignments
Record ID Cluster Sub-Cluster
Record 1 1 1
Record 2 2 2
Record 3 3 3
Record 4 2 4
Record 5 1 5
Record 6 1 6
Record 7 3 7
Record 8 2 8
Record 9 3 9
Record 10 2 10
Record ID Cluster Sub-Cluster Comments
Record 1 1 1 The wi-fi service was horrible. It was slow and cut off several times.
Record 2 2 2 My seat was uncomfortable.
Record 3 3 3 My flight was delayed 2 hours for no apparent reason.
Record 4 2 4 My seat would not recline.
Record 5 1 5 The man at the ticket counter was rude. Service was horrible.
Record 6 1 6 The flight attendant was rude. Service was bad.
Record 7 3 7 My flight was delayed with no explanation.
Record 8 2 8 My drink spilled when the guy in front of me reclined his seat.
Record 9 3 9 My flight was canceled.
Record 10 2 10 The arm rest of my seat was nasty.
107. Comments Grouped by Cluster
Record ID Cluster Sub-Cluster Comments
Record 1 1 1 The wi-fi service was horrible. It was slow and cut off several times.
Record 5 1 5 The man at the ticket counter was rude. Service was horrible.
Record 6 1 6 The flight attendant was rude. Service was bad.
Record 2 2 2 My seat was uncomfortable.
Record 4 2 4 My seat would not recline.
Record 8 2 8 My drink spilled when the guy in front of me reclined his seat.
Record 10 2 10 The arm rest of my seat was nasty.
Record 3 3 3 My flight was delayed 2 hours for no apparent reason.
Record 7 3 7 My flight was delayed with no explanation.
Record 9 3 9 My flight was canceled.
Cluster 1 {1, 5, 6}: documents about service issues
Cluster 2 {2, 4, 8, 10}: documents about seat issues
Cluster 3 {3, 7, 9}: documents about scheduling issues
108. Example: Movie Review Frequency Term-Document Matrix
Terms
Document Great Terrible
1 5 0
2 5 1
3 5 1
4 3 3
5 5 1
6 0 5
7 4 1
8 5 3
9 1 3
10 1 2
We have 10 reviews from
movie critics. After using
preprocessing techniques,
including text reduction
by synonyms, the number
of tokens is down to two:
Great and Terrible.
109. Frequency Term-document matrix
Terms
Document Great Terrible
1 5 0
2 5 1
3 5 1
4 3 3
5 5 1
6 0 5
7 4 1
8 5 3
9 1 3
10 1 2
Apply k-means clustering to the frequency term-document matrix with k = 2.
The process of clustering /
categorizing comments or
reviews as positive, negative
or neutral is known as
sentiment analysis
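A minimal sketch of this clustering, assuming scikit-learn rather than Analytic Solver (cluster numbering may differ from the slides' 1/2 labels):

```python
# Sketch: k-means with k = 2 on the movie-review frequency term-document matrix.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[5, 0], [5, 1], [5, 1], [3, 3], [5, 1],
              [0, 5], [4, 1], [5, 3], [1, 3], [1, 2]])  # columns: Great, Terrible

# Raw counts are used (the slides set Normalized? to FALSE for this run)
km = KMeans(n_clusters=2, n_init=10, random_state=12345).fit(X)
print(km.labels_)           # 0/1 cluster per review
print(km.cluster_centers_)  # e.g., one "positive" and one "negative" centroid
```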
110. Variables
# Selected Variables 2
Selected Variables Great Terrible
K-Means Clustering: Fitting Parameters
# Clusters 2
Start type Random Start
# Iterations 10
Random seed: initial centroids 12345
K-Means Clustering: Reporting Parameters
Show data summary TRUE
Show distance from each cluster TRUE
Normalized? FALSE
Cluster Great Terrible
Cluster 1 4.5714286 1.428571429
Cluster 2 0.6666667 3.333333333
Cluster Size Average Distance
Cluster 1 7 1.1250
Cluster 2 3 1.2136
Total 10 1.1516
One random start and 10 iterations
111. Distances from Each Cluster
Record ID Cluster Dist.Cluster-1 Dist.Cluster-2
Record 1 1 1.491 5.467
Record 2 1 0.606 4.922
Record 3 1 0.606 4.922
Record 4 1 2.222 2.357
Record 5 1 0.606 4.922
Record 6 2 5.801 1.795
Record 7 1 0.714 4.069
Record 8 1 1.629 4.346
Record 9 2 3.902 0.471
Record 10 2 3.617 1.374
112. Cluster Assignments with Term Frequencies
Record ID Cluster Great Terrible
Record 1 1 5 0
Record 2 1 5 1
Record 3 1 5 1
Record 4 1 3 3
Record 5 1 5 1
Record 6 2 0 5
Record 7 1 4 1
Record 8 1 5 3
Record 9 2 1 3
Record 10 2 1 2
113. Records Sorted by Cluster
Record ID Cluster Great Terrible
Record 1 1 5 0
Record 2 1 5 1
Record 3 1 5 1
Record 4 1 3 3
Record 5 1 5 1
Record 7 1 4 1
Record 8 1 5 3
Record 6 2 0 5
Record 9 2 1 3
Record 10 2 1 2
Cluster 1 reviews tend to be positive; Cluster 2 reviews tend to be negative. Note that (3, 3) corresponds to a balanced review, yet it is classified in Cluster 1.
114. Sentiment Analysis
The process of clustering / categorizing
comments or reviews as positive, negative
or neutral is known as sentiment analysis
115. Text mining is
• how a business analyst turns 50,000 hotel guest reviews
into specific recommendations;
• how a workforce analyst improves productivity and
reduces employee turnover;
• how healthcare providers and biopharma
researchers understand patient experiences;
• and much, much more.
SAS Text Miner, Python, R, SPSS …