Business Analytics – The Science of Data Driven Decision Making
CLUSTERING
U Dinesh Kumar
INTRODUCTION TO CLUSTERING
Clustering is usually one of the first tasks performed in most analytics projects. Grouping similar observations lets data scientists analyze each cluster separately.
Non-overlapping clusters
Clusters in which each observation belongs to exactly one cluster. Non-overlapping clusters are the more frequently used form of clustering in practice.
Overlapping clusters
• An observation may belong to more than one cluster.
Probabilistic clusters
An observation may belong to a cluster according to a
probability distribution.
Hierarchical clustering
Hierarchical clustering arranges subsets of the data in a tree-like structure in which the root node corresponds to the complete set of data. Branches are created from the root node to split the data into subsets (clusters) that are heterogeneous with respect to one another.
Euclidean Distance
Euclidean distance is one of the frequently used distance measures when the data are on either an interval or a ratio scale.
The Euclidean distance between two n-dimensional observations X1 (x11, x12, …, x1n) and X2 (x21, x22, …, x2n) is given by
$$D(X_1, X_2) = \sqrt{(x_{11} - x_{21})^2 + (x_{12} - x_{22})^2 + \cdots + (x_{1n} - x_{2n})^2}$$
Example
The table below lists 20 wines sold in the market along with their alcohol content and alkalinity of ash.

Wine   Alcohol   Alkalinity of Ash     Wine   Alcohol   Alkalinity of Ash
1      14.8      28                    11     10.7      12.2
2      11.05     12                    12     14.3      27
3      12.2      21                    13     12.4      19.5
4      12        20                    14     14.85     29.2
5      14.5      29.5                  15     10.9      13.6
6      11.2      13                    16     13.9      29.7
7      11.5      12                    17     10.4      12.2
8      12.8      19                    18     10.8      13.6
9      14.75     28.8                  19     14        28.8
10     10.5      14                    20     12.47     22.8
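As a small illustration, the Euclidean distance formula can be applied to two rows of the wine table. The function name `euclidean_distance` is an assumption made for this sketch; the two observations are the (alcohol, alkalinity of ash) values of wines 1 and 2.

```python
import math


def euclidean_distance(x1, x2):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(x1) != len(x2):
        raise ValueError("observations must have the same number of attributes")
    # Square root of the sum of squared attribute-wise differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))


# (alcohol, alkalinity of ash) for wines 1 and 2 in the table.
wine_1 = (14.8, 28.0)
wine_2 = (11.05, 12.0)
print(round(euclidean_distance(wine_1, wine_2), 2))  # 16.43
```

Almost all of this distance comes from the alkalinity difference (16 units against 3.75 for alcohol), which is the scale effect that standardization is meant to address.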
Clusters of wine based on alcohol and ash
content.
Standardized Euclidean Distance
Let X1k and X2k be two attributes of the data (where k stands for the kth observation in the data set). The range of X1k may be much smaller than that of X2k, in which case X2k dominates the Euclidean distance and skews its value. A simple way of handling this potential bias is to standardize the data using the following equation:
$$\text{Standardized value of the attribute} = \frac{X_{ik} - \bar{X}_i}{\sigma_{X_i}}$$
where $\bar{X}_i$ and $\sigma_{X_i}$ are, respectively, the mean and standard deviation of the ith attribute.
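A minimal sketch of the standardization step follows. The slide does not say whether the sample or the population standard deviation is intended; the sketch assumes the sample version (`statistics.stdev`), and the function name `standardize` is hypothetical. The input values are the alcohol and alkalinity readings of wines 1 to 5 from the wine example.

```python
import statistics


def standardize(values):
    """Return (value - mean) / standard deviation for every value."""
    mean = statistics.mean(values)
    # Assumption: sample standard deviation (divides by n - 1).
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("attribute is constant and cannot be standardized")
    return [(v - mean) / sd for v in values]


alcohol = [14.8, 11.05, 12.2, 12.0, 14.5]
alkalinity = [28.0, 12.0, 21.0, 20.0, 29.5]

z_alcohol = standardize(alcohol)
z_alkalinity = standardize(alkalinity)
print([round(z, 2) for z in z_alcohol])
print([round(z, 2) for z in z_alkalinity])
```

After this step both attributes have mean 0 and standard deviation 1, so neither dominates a Euclidean distance computed on the standardized values.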
Manhattan Distance (City Block Distance)
Euclidean distance may not be appropriate while measuring distance between different locations (for example, distance between two shops in a city). In such cases, we use Manhattan distance, which is given by
$$D_M(X_1, X_2) = \sum_{i=1}^{n} \left| X_{1i} - X_{2i} \right|$$
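The Manhattan distance can be sketched in a few lines. The shop coordinates below are hypothetical block positions on a street grid, chosen only to illustrate the formula.

```python
def manhattan_distance(x1, x2):
    """Sum of absolute attribute-wise differences (city block distance)."""
    if len(x1) != len(x2):
        raise ValueError("observations must have the same number of attributes")
    return sum(abs(a - b) for a, b in zip(x1, x2))


# Hypothetical shop locations expressed as (east-west block, north-south block).
shop_a = (2, 3)
shop_b = (5, 7)
print(manhattan_distance(shop_a, shop_b))  # 7
```

The straight-line (Euclidean) distance between the same two points is 5, but a route that follows the streets covers 3 + 4 = 7 blocks.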
Minkowski Distance
Minkowski distance is the generalized distance measure between two cases in the dataset and is given by
$$D_{\text{Minkowski}}(X_1, X_2) = \left( \sum_{i=1}^{n} \left| X_{1i} - X_{2i} \right|^{p} \right)^{1/p}$$
When p = 1, Minkowski distance is the same as the Manhattan distance.
For p = 2, Minkowski distance is the same as the Euclidean distance.
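The two special cases of the Minkowski distance can be checked numerically. This is a sketch with a hypothetical function name and made-up three-dimensional observations.

```python
def minkowski_distance(x1, x2, p):
    """Minkowski distance of order p between two equal-length sequences."""
    if p < 1:
        raise ValueError("p must be at least 1")
    if len(x1) != len(x2):
        raise ValueError("observations must have the same number of attributes")
    return sum(abs(a - b) ** p for a, b in zip(x1, x2)) ** (1.0 / p)


a = (0.0, 0.0, 0.0)
b = (1.0, 2.0, 2.0)
print(minkowski_distance(a, b, 1))  # 5.0, the Manhattan distance
print(minkowski_distance(a, b, 2))  # 3.0, the Euclidean distance
```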
Jaccard Similarity Coefficient (Jaccard Index)
• Jaccard similarity coefficient (JSC) or Jaccard index (Real and Vargas, 1996) is a measure used when the data is qualitative, especially when attributes can be represented in binary form.
• JSC for two n-dimensional observations (n attributes), X1 and X2, is given by
$$\text{Jaccard}(X_1, X_2) = \frac{n(X_1 \cap X_2)}{n(X_1 \cup X_2)}$$
where n(X1 ∩ X2) is the number of attributes that belong to both X1 and X2, and n(X1 ∪ X2) is the number of attributes that belong to either X1 or X2.
Example
Consider movie DVD purchases made by two customers as given by the following sets:
Customer 1 = {Jungle Book (JB), Iron Man (IM), Kung Fu Panda (KFP), Before Sunrise (BS), Bridge of Spies (BoS), Forrest Gump (FG)}
Customer 2 = {Casablanca (C), Jungle Book (JB), Forrest Gump (FG), Iron Man (IM), Kung Fu Panda (KFP), Schindler’s List (SL), The Godfather (TGF)}
In this case, each movie is an attribute. The purchases made by the two customers are shown in the table below.

Movie Title   BS   BoS   C   FG   IM   JB   KFP   SL   TGF
Customer 1    1    1     0   1    1    1    1     0    0
Customer 2    0    0     1   1    1    1    1     1    1
• The JSC is given by
$$\text{JSC} = \frac{n(\text{customer 1} \cap \text{customer 2})}{n(\text{customer 1} \cup \text{customer 2})} = \frac{4}{9} = 0.44$$
The higher the Jaccard coefficient, the higher the similarity between the two observations being compared. The value of JSC lies between 0 and 1.
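The same calculation can be sketched with Python sets, using the movie abbreviations from the example as set members. The function name `jaccard` is an assumption made for this illustration.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        raise ValueError("Jaccard similarity is undefined for two empty sets")
    return len(a & b) / len(union)


customer_1 = {"JB", "IM", "KFP", "BS", "BoS", "FG"}
customer_2 = {"C", "JB", "FG", "IM", "KFP", "SL", "TGF"}

print(sorted(customer_1 & customer_2))            # ['FG', 'IM', 'JB', 'KFP']
print(round(jaccard(customer_1, customer_2), 2))  # 0.44
```

Four titles are common to both customers and nine distinct titles appear overall, which gives 4/9.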
Cosine Similarity
The cosine similarity between X1 and X2 is given by
$$\text{Similarity}(X_1, X_2) = \cos(\theta) = \frac{\sum_{i=1}^{n} X_{1i} X_{2i}}{\sqrt{\sum_{i=1}^{n} X_{1i}^{2}} \, \sqrt{\sum_{i=1}^{n} X_{2i}^{2}}}$$
In cosine similarity, X1 and X2 are two n-dimensional vectors, and the measure captures the angle θ between the two vectors (thus it is called the vector space model).
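A short sketch of the cosine similarity calculation follows. The function name and the example vectors are assumptions chosen to show the two extreme cases: perpendicular vectors and vectors pointing in the same direction.

```python
import math


def cosine_similarity(x1, x2):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(x1) != len(x2):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(x1, x2))
    norm_1 = math.sqrt(sum(a * a for a in x1))
    norm_2 = math.sqrt(sum(b * b for b in x2))
    if norm_1 == 0 or norm_2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_1 * norm_2)


print(cosine_similarity((1, 0), (0, 1)))            # 0.0 (angle of 90 degrees)
print(round(cosine_similarity((1, 2), (2, 4)), 6))  # 1.0 (same direction)
```

Because only the angle matters, (1, 2) and (2, 4) are treated as identical even though their lengths differ.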
Cosine similarity for different values of θ.
Gower’s Similarity Coefficient
Gower’s similarity coefficient (Gower, 1971) is used when the data set contains both quantitative and qualitative variables.
Gower’s coefficient between two n-dimensional observations i and j is given by
$$D_{ij} = \frac{\sum_{k=1}^{n} W_{ijk} D_{ijk}}{\sum_{k=1}^{n} W_{ijk}}$$
where Dijk is the distance between observations i and j for the kth variable and Wijk is a binary variable that captures whether the distance between the observations is valid for the kth variable.
Example
Table 14.5 shows 5 customers and their movie downloads from a portal. The data consists of the number of movies downloaded under each genre, the maximum rating given by the customer, and the marital status (code 1 implies married and 0 otherwise). For example, customer 1 downloaded 23 action, 5 romance, 15 comedy, and 0 Sci-fi movies and his maximum rating was 4.

           Number of Movies Downloaded Under Each Genre    Maximum    Marital Status
Customer   Action    Romance   Comedy    Sci-fi            Rating     Married
           (k = 1)   (k = 2)   (k = 3)   (k = 4)           (k = 5)    (k = 6)
1          23        5         15        0                 4          0
2          5         18        16        2                 5          1
3          25        0         0         15                5          0
4          2         30        15        0                 4          1
5          45        0         0         10                5          0
Solution
The Gower’s distance between customers 1 and 2 can be calculated as shown in the table below:

        k = 1    k = 2    k = 3    k = 4    k = 5    k = 6    Sum
Dijk    0.5814   0.5667   0.9375   0.8667   0.0000   0        2.952
Wijk    1        1        1        1        1        1        6

Using
$$D_{ij} = \frac{\sum_{k=1}^{n} W_{ijk} D_{ijk}}{\sum_{k=1}^{n} W_{ijk}}$$
the Gower’s distance between customers 1 and 2 is given by 2.952/6 = 0.492.
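The tabulated Dijk values are not derived on the slide. They can be reproduced under the following assumption, inferred from the numbers rather than stated in the source: each quantitative variable is scored as 1 − |xik − xjk| / Rk, where Rk is the range of variable k over the five customers, and the binary marital status variable is scored 1 for a match and 0 otherwise. All weights Wijk are taken as 1 because every comparison is valid here. The function name `gower` is hypothetical.

```python
# Rows of Table 14.5: (action, romance, comedy, sci-fi, max rating, married).
customers = {
    1: (23, 5, 15, 0, 4, 0),
    2: (5, 18, 16, 2, 5, 1),
    3: (25, 0, 0, 15, 5, 0),
    4: (2, 30, 15, 0, 4, 1),
    5: (45, 0, 0, 10, 5, 0),
}
BINARY_COLUMNS = {5}  # zero-based index of the marital status variable (k = 6)


def gower(i, j, data, binary_columns):
    """Weighted average of per-variable scores between observations i and j."""
    rows = list(data.values())
    scores, weights = [], []
    for k in range(len(rows[0])):
        xi, xj = data[i][k], data[j][k]
        if k in binary_columns:
            # Assumption: 1 when the categories match, 0 otherwise.
            score = 1.0 if xi == xj else 0.0
        else:
            # Assumption: range-normalized score, 1 - |difference| / range.
            column = [row[k] for row in rows]
            value_range = max(column) - min(column)
            score = 1.0 - abs(xi - xj) / value_range if value_range else 1.0
        scores.append(score)
        weights.append(1)  # every comparison is valid in this example
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


print(round(gower(1, 2, customers, BINARY_COLUMNS), 3))  # 0.492
```

Under this scoring, a larger per-variable value means the two customers are more alike on that variable; for example, the comedy counts 15 and 16 give 1 − 1/16 = 0.9375.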
Quality and Optimal Number of Clusters
Milligan and Cooper (1985) analysed over 30 procedures for determining the optimal number of clusters and recommended the index proposed by Calinski and Harabasz (1974), which is given by
$$CH(k) = \frac{B(k)/(k-1)}{W(k)/(n-k)}$$
where CH(k) is the Calinski and Harabasz index with k clusters (k > 1), B(k) and W(k) are the between-cluster and within-cluster sums of squared variations with k clusters, and n is the number of observations.
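A minimal sketch of the Calinski and Harabasz index is shown below, assuming squared Euclidean distance to cluster means as the measure of variation. The function name `ch_index` and the four one-dimensional points are made up for illustration.

```python
def ch_index(points, labels):
    """Calinski-Harabasz index for a given assignment of points to clusters."""
    n = len(points)
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("the index needs 1 < k < n")

    dims = len(points[0])

    def mean(pts):
        return [sum(p[d] for p in pts) / len(pts) for d in range(dims)]

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall_mean = mean(points)
    # B(k): cluster sizes times squared distance of cluster mean to overall mean.
    between = sum(len(pts) * squared_distance(mean(pts), overall_mean)
                  for pts in clusters.values())
    # W(k): squared distance of every point to its own cluster mean.
    # Note: raises ZeroDivisionError below if every cluster has zero spread.
    within = sum(squared_distance(p, mean(pts))
                 for pts in clusters.values() for p in pts)
    return (between / (k - 1)) / (within / (n - k))


points = [(0.0,), (2.0,), (10.0,), (12.0,)]
print(ch_index(points, [0, 0, 1, 1]))  # 50.0
print(ch_index(points, [0, 1, 0, 1]))  # 0.08
```

The grouping that keeps nearby points together scores far higher than the one that mixes them, which matches the usual reading that a larger index indicates compact, well-separated clusters.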
Clustering Algorithms
Clustering algorithms group data into a finite number of mutually exclusive subsets.
Steps followed in clustering algorithms:
• Variable selection.
• Deciding the distance/similarity measure for measuring the distance or dissimilarity between observations.
• Deciding the number of clusters.
• Validation of the clusters.
Variable Selection
Ketchen and Shook (1996) suggest inductive, deductive, and cognitive approaches for variable selection.
• The inductive approach is exploratory and starts with as many variables as possible.
• In deductive variable selection, the suitability of a variable and its theoretical basis influence the selection of variables.
• Under cognitive variable selection, expert opinion plays a major role in variable selection.
Deciding Distance/Similarity Measures
Choosing the right distance/similarity measure plays an
important role in developing clusters.
Number of Clusters
Several approaches are available for deciding the number of clusters, such as the CH index, the Hartigan statistic [Eq. (14.14)], the Silhouette statistic, and the elbow method, in which the ideal number of clusters is given by the position of the elbow in an L-shaped curve.
Cluster Validation
The clusters created should be validated for consistency
using different algorithms to ensure that the clusters
represent the structures that exist in the population.
Halkidi et al. (2001) suggest the following measures to
validate the clusters:
• Compactness: Closeness of the members of a cluster to one another, which can be measured through variance.
• Separation: Distance between different clusters.
K-Means Clustering
• K-means clustering is one of the frequently used
clustering algorithms.
• It is a non-hierarchical clustering method in which the
number of clusters (K) is decided a priori.
K-Means Clustering - Steps
1) Choose K observations from the data that are likely to be in different clusters. There are many ways of choosing these initial K observations; the easiest approach is to choose observations that are farthest apart (in one of the parameters of the data).
2) The K observations chosen in step 1 are the initial centroids of the K clusters.
3) For each remaining observation (say observation j), find the closest centroid, based on an appropriate distance measure, and add the observation to that cluster. Adjust the centroid of the cluster after adding the new observation.
4) Repeat step 3 till all observations are assigned to a cluster.
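The steps above can be sketched as a single pass over the data. This is an illustration of the procedure as listed, not a reference implementation: Euclidean distance is assumed as the distance measure, the seed observations are supplied by the caller, and the names `sequential_k_means` and `seed_indices` are hypothetical. The commonly used batch variant of K-means goes further and repeats the assignment and centroid update until the assignments stop changing.

```python
import math


def _distance(a, b):
    """Euclidean distance, assumed here as the distance measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sequential_k_means(points, seed_indices):
    """Assign each non-seed point to its nearest centroid, updating as we go."""
    # Steps 1 and 2: the chosen observations become the initial centroids.
    centroids = [list(points[i]) for i in seed_indices]
    members = [[i] for i in seed_indices]
    seeds = set(seed_indices)

    # Steps 3 and 4: visit every remaining observation once.
    for j, point in enumerate(points):
        if j in seeds:
            continue
        nearest = min(range(len(centroids)),
                      key=lambda c: _distance(point, centroids[c]))
        members[nearest].append(j)
        size = len(members[nearest])
        # Running-mean update of the centroid after adding observation j.
        centroids[nearest] = [c + (x - c) / size
                              for c, x in zip(centroids[nearest], point)]
    return members, centroids


points = [(0.0, 0.0), (10.0, 10.0), (1.0, 0.0), (9.0, 10.0), (0.0, 1.0)]
members, centroids = sequential_k_means(points, seed_indices=[0, 1])
print(members)  # [[0, 2, 4], [1, 3]]
```

Because centroids move as observations are added, the result of a single pass can depend on the order of the data and on the choice of seeds.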
Hierarchical Clustering
Hierarchical clustering is a clustering algorithm that uses the following steps to develop clusters:
1) Start with each data point in its own cluster.
2) Find the two data points (or clusters) with the shortest distance between them (using an appropriate distance measure) and merge them to form a cluster.
3) Repeat step 2 until all data points are merged to form a single cluster.
The above procedure is called agglomerative hierarchical clustering.
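The agglomerative procedure can be sketched as follows. The slide does not say how the distance between two clusters is measured once they contain more than one point; this sketch assumes single linkage (the shortest distance between any pair of members) with Euclidean distance, and the function name `agglomerative` is hypothetical.

```python
import math


def agglomerative(points):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    # Step 1: every data point starts in its own cluster (stored by index).
    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Assumption: single linkage, the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    # Step 3: keep merging until one cluster remains.
    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges


points = [(0.0,), (1.0,), (5.0,), (7.0,)]
for left, right, distance in agglomerative(points):
    print(left, right, distance)
# [0] [1] 1.0
# [2] [3] 2.0
# [0, 1] [2, 3] 4.0
```

The merge distances in this history are the heights at which branches join in a dendrogram.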
Dendrogram for movie clustering
Summary
• Clustering is an unsupervised learning algorithm that divides the data set into mutually exclusive and exhaustive subsets (in non-overlapping clusters) that are homogeneous within the group and heterogeneous between the groups.
• Clustering is one of the frequently used techniques; practitioners first cluster the data and then develop predictive models for each cluster for better management.
• Several distance measures, such as Euclidean distance and Gower distance, are used in clustering algorithms. Similarity coefficients such as the Jaccard coefficient and cosine similarity are used depending on the data type.
• K-means clustering and hierarchical clustering are two popular techniques used for clustering.
• One of the decisions to be taken during clustering is the number of clusters. Usually this is decided using an elbow curve: the number of clusters at which the elbow (bend) occurs is taken as the optimal number of clusters.
