Master the Art of Analytics
A Simplistic Explainer Series For Citizen Data Scientists
Journey Towards Augmented Analytics
K-Means Clustering
Parameter Tuning & Use cases
What Is Covered
Terminologies
Introduction & Example
Standard input/tuning parameters & Sample UI
Sample output UI
Interpretation of Output
Limitations
Business use cases
Introduction With Example
What is it used for?
K-means clustering classifies objects into a number of groups so that objects are as dissimilar as possible from one group to another and as similar as possible within each group.
In short, it is simply a grouping of similar things (data points).
For example, objects within group 1 (cluster 1) shown in the image should be as similar to each other as possible,
while an object in group 1 should differ clearly from an object in group 2.
The attributes of the objects decide which objects should be grouped together.
In this way a natural grouping of the data points can be achieved.
Some Examples
Let's take a few examples for more clarity:
Loan applicants in a bank can be grouped into low, medium and high risk applicants based on their age, annual income, employment tenure, loan amount, times delinquent etc. using the K-means clustering algorithm.
Users of a movie ticket booking website can be grouped into movie freaks / moderate watchers / rare watchers based on their past ticket purchase behavior, such as days since the last movie seen, average number of tickets booked each time, frequency of ticket booking per month, etc.
Retail customers can be grouped into loyal / infrequent / rare customers based on their retail outlet or website visits per month, purchase amount per month, purchase frequency per month etc.
It is used to find groups which have not been explicitly labeled in the data. This can be used to confirm business assumptions about what types of groups exist or to identify unknown groups in complex data sets.
Once the algorithm has been run and the groups are defined, any new data point can be assigned to the group whose center is nearest to it.
How it works
Step 1: Decide on the value of k (the number of clusters, or groups) and on the input variables. The silhouette score can be used to determine k.
Step 2: Scale the data using (x - min(x)) / (max(x) - min(x)) and initialize the cluster centers: randomly select k observations from the scaled data and treat them as the initial cluster centers.
Step 3: Calculate the Euclidean distance between an observation and each initial cluster center.
• Each observation is assigned to the cluster whose center is at the minimum Euclidean distance.
Step 4: Move on to the next observation, calculate its Euclidean distance from each cluster center, assign it to the cluster at minimum distance as in step 3, and update that cluster's center.
Step 5: Repeat step 4 until all observations are assigned a cluster membership.
Step 6: Check the cluster plot and the silhouette score to measure the goodness of the clusters generated.
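The scaling formula in step 2 can be sketched in a few lines of Python. This is only an illustration: the function name min_max_scale is made up for this sketch, and the heights are taken from the data sample used in the worked example.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Heights from the data sample of the worked example
heights = [185, 170, 168, 179, 182, 188, 180, 180, 183, 180, 180, 177]
scaled_heights = min_max_scale(heights)
print(scaled_heights[:3])
```

After scaling, the smallest height maps to 0 and the largest to 1, so no single variable dominates the distance only because of its units.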
How it works – Steps
Data sample:
Height  Weight
185     72
170     56
168     60
179     68
182     72
188     77
180     71
180     70
183     84
180     88
180     67
177     76
Step 1: Input
• Scaled variables and number of clusters (k)
• In this example, only two variables, height and weight, are considered for clustering (the distances in the worked example are computed on the raw values shown above)
• Let's consider number of clusters = 2
Step 2: Initialize cluster centers
• Let's initialize the cluster centers with the first two observations
Initial cluster centers:
Cluster  Height  Weight
K1       185     72
K2       170     56
Step 3: Calculate Euclidean distance
• The Euclidean distance between an observation and initial cluster centers 1 and 2 is calculated.
• Each observation is assigned to the cluster at minimum Euclidean distance.
Example
Step 3: Continue…
First two observations:
Height  Weight
185     72
170     56
The Euclidean distance from each of the clusters is calculated:
Observation (185, 72):
  Euclidean distance from Cluster 1: SQRT[(185-185)^2 + (72-72)^2] = 0
  Euclidean distance from Cluster 2: SQRT[(185-170)^2 + (72-56)^2] = 21.93
  Cluster assignment: 1
Observation (170, 56):
  Euclidean distance from Cluster 1: SQRT[(170-185)^2 + (56-72)^2] = 21.93
  Euclidean distance from Cluster 2: SQRT[(170-170)^2 + (56-56)^2] = 0
  Cluster assignment: 2
Updated centers:
Cluster  Height  Weight
K1       185     72
K2       170     56
There is no change in the centers, as the same two observations were used as the initial centers.
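The distance calculation above can be reproduced with a short Python sketch. The helper name euclidean and the dictionary layout are choices made for this illustration only.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Initial cluster centers from the example (first two observations)
centers = {"K1": (185, 72), "K2": (170, 56)}
observation = (185, 72)

# Distance from the observation to every center, then pick the closest one
distances = {name: euclidean(observation, c) for name, c in centers.items()}
nearest = min(distances, key=distances.get)
print(distances, nearest)
```

The same two lines at the end apply to any observation: compute all distances, then assign to the center with the smallest one.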
Example
Step 4: Move on to the next observation, calculate the Euclidean distance, assign cluster membership and update the cluster centers
Next observation:
Height  Weight
168     60
  Euclidean distance from Cluster 1: SQRT[(168-185)^2 + (60-72)^2] = 20.808
  Euclidean distance from Cluster 2: SQRT[(168-170)^2 + (60-56)^2] = 4.472
  Cluster assignment: 2
• Since the distance is minimum from cluster 2, the observation is assigned to cluster 2.
• Now revise the cluster centers: each center is the mean of the height and weight of its member observations.
• The addition is only to cluster 2, so the centroid of cluster 2 is updated as follows:
Updated cluster centers:
Cluster  Height                 Weight
K1       185                    72
K2       (170 + 168)/2 = 169    (56 + 60)/2 = 58
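The centroid update can be written as a running mean, so the member list does not have to be stored. A minimal sketch, assuming the cluster size is tracked alongside its center (the function name add_to_cluster is hypothetical):

```python
def add_to_cluster(center, size, point):
    """Return the new mean and size of a cluster after one point is added."""
    new_size = size + 1
    # Move the mean towards the new point by 1/new_size of the difference
    new_center = tuple(c + (x - c) / new_size for c, x in zip(center, point))
    return new_center, new_size


# Cluster 2 holds one observation (170, 56); observation (168, 60) joins it
center, size = add_to_cluster((170, 56), 1, (168, 60))
print(center, size)
```

For a cluster with one member this is the same as averaging the two observations, which gives the center (169, 58) shown above.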
Example
Step 5: Repeat step 4: calculate the Euclidean distance for the next observation, assign it based on minimum distance and update the cluster centers, until all observations are assigned a cluster membership
Next observation:
Height  Weight
179     68
  Euclidean distance from Cluster 1: SQRT[(179-185)^2 + (68-72)^2] = 7.21
  Euclidean distance from Cluster 2: SQRT[(179-169)^2 + (68-58)^2] = 14.14
  Cluster assignment: 1
o Since the distance is minimum from cluster 1, the observation is assigned to cluster 1.
o Now revise the cluster centroid: the mean of the height and weight of its member observations.
o The addition is only to cluster 1, so the centroid of cluster 1 is updated as follows:
Updated cluster centers:
Cluster  Height                 Weight
K1       (185 + 179)/2 = 182    (72 + 68)/2 = 70
K2       169                    58
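Steps 2 to 5 can be put together in one small function. The sketch below follows the sequential procedure of the slides (seed with the first k observations, then assign and update one observation at a time); the function name sequential_kmeans is an assumption of this illustration, and it is run only on the four observations worked through so far.

```python
import math


def sequential_kmeans(points, k):
    """Assign points one at a time, updating the chosen center after each assignment."""
    # The first k observations seed the cluster centers, one member each
    centers = [tuple(float(v) for v in p) for p in points[:k]]
    sizes = [1] * k
    labels = list(range(k))
    for p in points[k:]:
        # Distance from this observation to every current center
        d = [math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c))) for c in centers]
        j = d.index(min(d))
        # Running-mean update of the center that received the observation
        sizes[j] += 1
        centers[j] = tuple(c + (x - c) / sizes[j] for c, x in zip(centers[j], p))
        labels.append(j)
    return centers, labels


observations = [(185, 72), (170, 56), (168, 60), (179, 68)]
centers, labels = sequential_kmeans(observations, k=2)
print(centers)  # index 0 is K1, index 1 is K2
print(labels)
```

On these four observations the result matches the tables above: K1 is at (182, 70) and K2 at (169, 58).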
Example
Step 6: Draw the cluster plot to see how the clusters are distributed. The less the overlap between clusters, the better the distribution and cluster assignments.
Final cluster centers:
Cluster  Height  Weight
K1       182.8   72
K2       169     58
[Figure: final assignments and cluster plot]
silhouette = 0.8, indicating very good quality clusters
The closer the silhouette score is to 1, the better the quality of the clusters.
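For readers who want to see what the silhouette score measures, here is a minimal sketch using the usual definition: for each point, a is the mean distance to the other points in its own cluster, b is the smallest mean distance to the points of any other cluster, and the point's score is (b - a) / max(a, b). The function name and the four-point data set are choices made for this illustration, so the value it prints is not the 0.8 quoted on the slide.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points; needs at least two clusters."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)

    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a single-member cluster contributes a score of 0
        # dist(p, p) is 0, so dividing by len(own) - 1 averages over the other members
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        total += (b - a) / max(a, b)
    return total / len(points)


points = [(185, 72), (170, 56), (168, 60), (179, 68)]
good = silhouette_score(points, [0, 1, 1, 0])   # assignment from the worked example
bad = silhouette_score(points, [0, 0, 1, 1])    # a deliberately poor assignment
print(round(good, 2), round(bad, 2))
```

A sensible assignment scores clearly higher than a poor one, which is why the score can be used to compare clusterings.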
Standard Tuning Parameters
Number of clusters (K):
• The desired number of clusters
• Suggested range: 3 to 5
• The actual number could be smaller in the output if there are no divisible clusters in the data
• This parameter input can be automated using the silhouette score (explained in later slides).
Max iterations:
• The maximum number of k-means iterations to split clusters
• By default this value should be set to 20
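To show where the two parameters enter, here is a sketch of the common batch form of k-means, which reassigns all points and recomputes all centers in each iteration. This differs from the one-observation-at-a-time walkthrough in the example slides. The function name, the choice of the first k points as starting centers and the toy data are all assumptions of this sketch.

```python
import math


def kmeans(points, k, max_iter=20):
    """Batch k-means: stop when assignments no longer change or after max_iter passes."""
    centers = [tuple(float(v) for v in p) for p in points[:k]]  # simple deterministic start
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center
        new_labels = [
            min(range(k), key=lambda j: math.sqrt(
                sum((a - b) ** 2 for a, b in zip(p, centers[j]))))
            for p in points
        ]
        if new_labels == labels:
            break  # converged before reaching max_iter
        labels = new_labels
        # Update step: each center becomes the mean of its members
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels


toy_points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]  # made-up data
toy_centers, toy_labels = kmeans(toy_points, k=2, max_iter=20)
print(toy_labels)
```

On well-separated data the loop usually stops long before the iteration limit; max_iter only caps the work when assignments keep changing.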
Sample UI For Input Variables & Parameters Selection & Output
Sample UI for selecting predictors and applying tuning parameters: For Two Predictors
Select the variables you would like to use as predictors to build clusters:
Height
Weight
BMI
Tuning parameters:
Number of clusters
Maximum iterations
• The silhouette score is another useful criterion for assessing the natural and optimal number of clusters as well as for checking the overall quality of the partition
• The largest silhouette score, over different K, indicates the best number of clusters
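Choosing K by the largest silhouette score can be sketched as a loop over candidate values. Everything below is illustrative: the data is made up (three well-separated groups), the helper names are hypothetical, and the starting centers are picked with a deterministic farthest-point rule rather than at random so that the result is repeatable.

```python
import math


def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption of this sketch, not a random start)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def kmeans_labels(points, k, max_iter=20):
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def silhouette(points, labels):
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # single-member clusters score 0
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        total += (b - a) / max(a, b)
    return total / len(points)


# Made-up data: three tight groups near (0, 0), (10, 0) and (0, 10)
data = [(0, 0), (1, 0), (0, 1), (10, 0), (11, 0), (10, 1), (0, 10), (1, 10), (0, 11)]
scores = {k: silhouette(data, kmeans_labels(data, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The candidate with the highest score is kept; here that is the K matching the three groups built into the data.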
Output UI: For Two Predictors
Each customer is assigned a cluster membership as shown in the table:
Height (cm)  Weight (kg)  BMI  Cluster number
158          60           23   1
160          65           25   2
170          70           26   2
149          50           21   1
180          80           27   3
165          80           28   3
200          90           23   1
[Figure: cluster plot with Height and Weight on the axes]
silhouette = 0.7, indicating good quality clusters
As clusters are built using only 2 predictors here, the scatter plot axes reflect the actual predictors, i.e. Height and Weight, instead of principal components.
Again, the less the overlap in cluster outlines, the better the cluster assignment.
Alternatively, the silhouette score can be checked to evaluate how the clusters are partitioned: the closer this value is to 1, the better the partition quality.
Sample UI for Selecting Predictors And Applying Tuning Parameters: For Four Predictors
Select the variables you would like to use as predictors to build clusters:
Purchase amount
Purchase frequency
Total purchase quantity
Annual income
Website visits
Tuning parameters:
Number of clusters
Maximum iterations
• The silhouette score is another useful criterion for assessing the natural and optimal number of clusters as well as for checking the overall quality of the partition
• The largest silhouette score, over different K, indicates the best number of clusters
Output UI: For Four Predictors
Each customer is assigned a cluster membership as shown in the table:
Customer ID  Purchase amount (per month)  Purchase frequency (per month)  Total purchase quantity  Annual income slab    Website visits (per month)  Cluster number
1            5k to 10k                    2                               6                        2 lac to 4 lac        3 to 6                      2
2            1k to 5k                     1                               4                        1 lac to 2 lac        <3                          1
3            10k to 15k                   3                               8                        4 lac to 6 lac        6 to 10                     3
4            5k to 10k                    2                               6                        2 lac to 4 lac        3 to 6                      2
5            10k to 15k                   3                               8                        4 lac to 6 lac        6 to 10                     3
6            1k to 5k                     1                               2                        1 lac to 2 lac        <3                          1
7            5k to 10k                    2                               6                        2 lac to 4 lac        3 to 6                      2
8            1k to 5k                     1                               2                        1 lac to 2 lac        <3                          1
9            1k to 5k                     1                               2                        1 lac to 2 lac        <3                          1
10           >20k                         4                               32                       10 lac to 15 lac      >10                         3
[Figure: 2D cluster plot with the first principal component and the second principal component on the axes]
The 2D cluster plot shows how the clusters are distributed.
In this case, the axes reflect the first two principal components (see the definition below) instead of the actual predictors, as the number of predictors is greater than 2. In a 3D plot, the first three principal components are shown. The less the overlap between clusters, the better the cluster assignment.
The silhouette score can also be checked to evaluate how the clusters are partitioned: the closer this value is to 1, the better the partition quality.
Whenever there are more than 2 predictors as input, the axes reflect principal components instead of the actual predictors.
Principal components are linear combinations of the original predictors, chosen so that they capture the maximum variance in the data set.
Typically most of the variance in the data is explained by the first three principal components, so the remaining components can be ignored.
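To make "a linear combination that captures the maximum variance" concrete, the sketch below finds the first principal component with power iteration on the covariance matrix. It is a minimal illustration in plain Python: the function names are hypothetical, the raw (unstandardized) Height, Weight and BMI values from the two-predictor output table are used, and a production tool would normally rely on a linear algebra library instead.

```python
import math


def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of the first principal component via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance matrix and renormalize to unit length
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Each score is the linear combination of the centered predictors
    scores = [sum(c * vj for c, vj in zip(r, v)) for r in centered]
    return v, scores


rows = [(158, 60, 23), (160, 65, 25), (170, 70, 26), (149, 50, 21),
        (180, 80, 27), (165, 80, 28), (200, 90, 23)]
direction, scores = first_principal_component(rows)
print(direction)
print(variance(scores), variance([r[0] for r in rows]))
```

The variance of the scores along this direction is at least as large as the variance of any single original predictor, which is what makes it a useful plotting axis.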
Other Sample Output Formats
[Figure: three sample cluster plots]
silhouette = -0.5, indicating poor quality clusters
silhouette = 0.3, average quality clusters
silhouette = 0.7, indicating good quality clusters
• Clusters with a silhouette score closer to 1 are more desirable, so this index can be used to measure the goodness of clusters.
Limitations
• The number of clusters, k, must be determined beforehand. Ideally the algorithm should suggest this number automatically for better user friendliness.
• It does not yield the same result with each run, since the resulting clusters depend on the initial random choice of cluster centers.
• If the data is input in a different order, it may produce different clusters when the number of data points is small, hence the number of data points must be large enough.
• It has been suggested that 2^m (where m = number of clustering variables) can be used as a rule of thumb for the minimum sample size.
• K-means is suitable only for numeric data.
• The scale of the variables influences the Euclidean distance, so variable standardization becomes necessary.
• Empty clusters can be obtained if no points are allocated to a cluster during the assignment step.
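The point about scale can be demonstrated with a tiny example. The four applicants below are made up for this sketch (annual income, times delinquent), and the helper names are hypothetical. Without scaling, income differences in the thousands swamp the delinquency counts; after min-max scaling, both variables contribute.

```python
import math


def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def min_max_columns(rows):
    """Min-max scale every column of a list of tuples to [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(r, lows, highs))
        for r in rows
    ]


def nearest_to_first(rows):
    """Index of the row closest to rows[0]."""
    return min(range(1, len(rows)), key=lambda i: dist(rows[0], rows[i]))


# Hypothetical applicants: (annual income, times delinquent)
raw = [(50000, 1), (52000, 9), (55000, 1), (90000, 2)]
print(nearest_to_first(raw))                   # decided almost entirely by income
print(nearest_to_first(min_max_columns(raw)))  # delinquency now matters too
```

On the raw values the first applicant looks closest to someone with eight more delinquencies, only because their incomes are 2,000 apart; after scaling, the applicant with the same delinquency count is the nearest.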
Applications & Business Use Cases
General applications of K-means Clustering
• Some examples of use cases are:
• Behavioral segmentation:
• Segment customers by purchase history
• Segment users by activities on application, website, or platform
• Define personas based on interests
• Create consumer profiles based on activity monitoring
• Some other general applications:
• Pattern recognition, market segmentation, classification analysis, artificial intelligence, image processing, astronomy, agriculture and many others.
Use case 1
• Business problem:
• Grouping loan applicants into high/medium/low risk applicants based on attributes such as loan amount, monthly installment, employment tenure, times delinquent, annual income, debt to income ratio etc.
• Business benefit:
• Once segments are identified, the bank will have a loan applicants' dataset with each applicant labeled as high/medium/low risk.
• Based on these labels, the bank can easily decide whether to give a loan to an applicant and, if so, what credit limit and interest rate the applicant is eligible for given the amount of risk involved.
Use case 1
Input dataset:
Customer ID  Loan amount  Monthly installment  Annual income  Debt to income ratio  Times delinquent  Employment tenure
1039153      21000        701.73               105000         9                     5                 4
1069697      15000        483.38               92000          11                    5                 2
1068120      25600        824.96               110000         10                    9                 2
563175       23000        534.94               80000          9                     2                 12
562842       19750        483.65               57228          11                    3                 21
562681       25000        571.78               113000         10                    0                 9
562404       21250        471.2                31008          12                    1                 12
700159       14400        448.99               82000          20                    6                 6
696484       10000        241.33               45000          18                    8                 2
702598       11700        381.61               45192          20                    7                 3
702470       10000        243.29               38000          17                    9                 7
702373       4800         144.77               54000          19                    8                 2
701975       12500        455.81               43560          15                    8                 4
Use case 1
Output: Each record will have the cluster (segment) assignment as shown below:
Customer ID  Loan amount  Monthly installment  Annual income  Debt to income ratio  Times delinquent  Employment tenure  Cluster
1039153      21000        701.73               105000         9                     5                 4                  Medium
1069697      15000        483.38               92000          11                    5                 2                  Medium
1068120      25600        824.96               110000         10                    9                 2                  Medium
563175       23000        534.94               80000          9                     2                 12                 Low
562842       19750        483.65               57228          11                    3                 21                 Low
562681       25000        571.78               113000         10                    0                 9                  Low
562404       21250        471.2                31008          12                    1                 12                 Low
700159       14400        448.99               82000          20                    6                 6                  High
696484       10000        241.33               45000          18                    8                 2                  High
702598       11700        381.61               45192          20                    7                 3                  High
702470       10000        243.29               38000          17                    9                 7                  High
702373       4800         144.77               54000          19                    8                 2                  High
701975       12500        455.81               43560          15                    8                 4                  High
Use case 1
Output: Cluster profiles:
As can be seen in the table below, the high, medium and low risk segments have distinctive characteristics.
The high risk segment has the highest delinquency, the highest debt to income ratio and the lowest employment tenure compared to the other two segments.
The low risk segment exhibits exactly the opposite pattern, i.e. the lowest debt to income ratio, the lowest delinquency and the highest employment tenure compared to the other two segments.
Hence delinquency, employment tenure and debt to income ratio are the determining factors when it comes to segmenting loan applicants.
Cluster  Loan amount  Monthly installment  Annual income  Debt to income ratio  Times delinquent  Employment tenure  Risk segment
1        10447.30     304.87               66467.74       9.58                  1.69              16.82              Low
2        21391.58     598.54               94912.59       12.37                 5.98              4.58               Medium
3        7521.32      227.43               60935.28       16.55                 6.91              4.01               High
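A cluster profile is simply the mean of each attribute within a segment. The sketch below computes it for two attributes using the 13 sample records from the output table; the function name is hypothetical, and because the profile table above is evidently based on a larger dataset, the averages from these few rows will not reproduce its figures.

```python
from collections import defaultdict

# (times delinquent, employment tenure, segment) copied from the sample output table
records = [
    (5, 4, "Medium"), (5, 2, "Medium"), (9, 2, "Medium"),
    (2, 12, "Low"), (3, 21, "Low"), (0, 9, "Low"), (1, 12, "Low"),
    (6, 6, "High"), (8, 2, "High"), (7, 3, "High"),
    (9, 7, "High"), (8, 2, "High"), (8, 4, "High"),
]


def cluster_profiles(rows):
    """Mean times delinquent and mean employment tenure per segment."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for delinquent, tenure, segment in rows:
        s = sums[segment]
        s[0] += delinquent
        s[1] += tenure
        s[2] += 1
    return {seg: (d / n, t / n) for seg, (d, t, n) in sums.items()}


profiles = cluster_profiles(records)
for segment, (mean_delinquent, mean_tenure) in profiles.items():
    print(segment, round(mean_delinquent, 2), round(mean_tenure, 2))
```

Even on this small sample the low risk segment shows fewer delinquencies and a longer employment tenure than the high risk segment.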
Use case 1
Output: Cluster distribution:
[Figure: cluster distribution plot]
silhouette = 0.6, indicating good quality clusters
In the cluster distribution plot there is negligible overlap in the cluster outlines, so we can say that the cluster assignments are good in our case.
Clusters with an average silhouette width closer to 1 are more desirable, so this index can be used to test the quality of the clusters' distribution.
Use case 2
Business problem:
• Organizing customers into groups/segments based on similar traits, product preferences and expectations
• Segments are constructed on the basis of the customers' demographic characteristics, psychographics, past behavior and product use behaviors
Business benefit:
• Once segments are identified, marketing messages and even products can be customized for each segment.
• The better the segment(s) chosen for targeting by a particular organization, the more successful it is assumed to be in the marketplace.
Use case 3
Business problem:
• Discount Analysis and Customer Retention: visualize 'segments of sales group based on discount behavior' and 'customer churn: segments of customers on the verge of leaving'
Business benefit:
• The business marketing team can focus efficiently on risky customer segments in order to keep them from churning/leaving
• Sales team segments that are facing challenges with the current discounting strategy can be identified, and the deal negotiation strategy can be improved/optimized for them.
Want to Learn More?
Get in touch with us at support@Smarten.com
And do check out the Learning section on Smarten.com
June 2018
More Related Content

What's hot

Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree LearningMilind Gokhale
 
Decision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learningDecision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learningAbhishek Vijayvargia
 
Support vector machine
Support vector machineSupport vector machine
Support vector machineRishabh Gupta
 
Data Clustering with R
Data Clustering with RData Clustering with R
Data Clustering with RYanchang Zhao
 
Decision Tree and Bayesian Classification
Decision Tree and Bayesian ClassificationDecision Tree and Bayesian Classification
Decision Tree and Bayesian ClassificationKomal Kotak
 
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?Smarten Augmented Analytics
 
Naive bayes
Naive bayesNaive bayes
Naive bayesumeskath
 
CART – Classification & Regression Trees
CART – Classification & Regression TreesCART – Classification & Regression Trees
CART – Classification & Regression TreesHemant Chetwani
 
1.9. minimization of dfa
1.9. minimization of dfa1.9. minimization of dfa
1.9. minimization of dfaSampath Kumar S
 
Multidimensional array in C
Multidimensional array in CMultidimensional array in C
Multidimensional array in CSmit Parikh
 
Classification Algorithm.
Classification Algorithm.Classification Algorithm.
Classification Algorithm.Megha Sharma
 
K means clustering
K means clusteringK means clustering
K means clusteringKuppusamy P
 
Transportation Problem
Transportation ProblemTransportation Problem
Transportation ProblemAlvin Niere
 

What's hot (20)

Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
 
Decision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learningDecision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learning
 
Support vector machine
Support vector machineSupport vector machine
Support vector machine
 
Data Clustering with R
Data Clustering with RData Clustering with R
Data Clustering with R
 
Decision Tree and Bayesian Classification
Decision Tree and Bayesian ClassificationDecision Tree and Bayesian Classification
Decision Tree and Bayesian Classification
 
K means report
K means reportK means report
K means report
 
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?
 
Classification Using Decision tree
Classification Using Decision treeClassification Using Decision tree
Classification Using Decision tree
 
3 Data Structure in R
3 Data Structure in R3 Data Structure in R
3 Data Structure in R
 
Naive bayes
Naive bayesNaive bayes
Naive bayes
 
CART – Classification & Regression Trees
CART – Classification & Regression TreesCART – Classification & Regression Trees
CART – Classification & Regression Trees
 
1.9. minimization of dfa
1.9. minimization of dfa1.9. minimization of dfa
1.9. minimization of dfa
 
K Nearest Neighbors
K Nearest NeighborsK Nearest Neighbors
K Nearest Neighbors
 
R data types
R   data typesR   data types
R data types
 
Multidimensional array in C
Multidimensional array in CMultidimensional array in C
Multidimensional array in C
 
Classification Algorithm.
Classification Algorithm.Classification Algorithm.
Classification Algorithm.
 
Data frame operations
Data frame operationsData frame operations
Data frame operations
 
K means clustering
K means clusteringK means clustering
K means clustering
 
Branch & bound
Branch & boundBranch & bound
Branch & bound
 
Transportation Problem
Transportation ProblemTransportation Problem
Transportation Problem
 

Similar to What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to Analyze Data?

What is Hierarchical Clustering and How Can an Organization Use it to Analyze...
What is Hierarchical Clustering and How Can an Organization Use it to Analyze...What is Hierarchical Clustering and How Can an Organization Use it to Analyze...
What is Hierarchical Clustering and How Can an Organization Use it to Analyze...Smarten Augmented Analytics
 
MODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptxMODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptxnikshaikh786
 
Clustering techniques
Clustering techniquesClustering techniques
Clustering techniquestalktoharry
 
Unsupervised Learning in Machine Learning
Unsupervised Learning in Machine LearningUnsupervised Learning in Machine Learning
Unsupervised Learning in Machine LearningPyingkodi Maran
 
Machine Learning Clustering
Machine Learning ClusteringMachine Learning Clustering
Machine Learning ClusteringRupak Roy
 
Lecture_3_k-mean-clustering.ppt
Lecture_3_k-mean-clustering.pptLecture_3_k-mean-clustering.ppt
Lecture_3_k-mean-clustering.pptSyedNahin1
 
CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxShwetapadmaBabu1
 
Clustering - Machine Learning Techniques
Clustering - Machine Learning TechniquesClustering - Machine Learning Techniques
Clustering - Machine Learning TechniquesKush Kulshrestha
 
K – means cluster analysis.pptx
K – means cluster analysis.pptxK – means cluster analysis.pptx
K – means cluster analysis.pptxagniva pradhan
 
K-Means Clustering Algorithm.pptx
K-Means Clustering Algorithm.pptxK-Means Clustering Algorithm.pptx
K-Means Clustering Algorithm.pptxJebaRaj26
 
ML basic &amp; clustering
ML basic &amp; clusteringML basic &amp; clustering
ML basic &amp; clusteringmonalisa Das
 
Cluster spss week7
Cluster spss week7Cluster spss week7
Cluster spss week7Birat Sharma
 
Clustering &amp; classification
Clustering &amp; classificationClustering &amp; classification
Clustering &amp; classificationJamshed Khan
 
Unsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and AssumptionsUnsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and Assumptionsrefedey275
 
Spss tutorial-cluster-analysis
Spss tutorial-cluster-analysisSpss tutorial-cluster-analysis
Spss tutorial-cluster-analysisAnimesh Kumar
 

Similar to What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to Analyze Data? (20)

What is Hierarchical Clustering and How Can an Organization Use it to Analyze...
What is Hierarchical Clustering and How Can an Organization Use it to Analyze...What is Hierarchical Clustering and How Can an Organization Use it to Analyze...
What is Hierarchical Clustering and How Can an Organization Use it to Analyze...
 
MODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptxMODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptx
 
Clustering techniques
Clustering techniquesClustering techniques
Clustering techniques
 
Unsupervised Learning in Machine Learning
Unsupervised Learning in Machine LearningUnsupervised Learning in Machine Learning
Unsupervised Learning in Machine Learning
 
Machine Learning Clustering
Machine Learning ClusteringMachine Learning Clustering
Machine Learning Clustering
 
Lecture_3_k-mean-clustering.ppt
Lecture_3_k-mean-clustering.pptLecture_3_k-mean-clustering.ppt
Lecture_3_k-mean-clustering.ppt
 
Lec13 Clustering.pptx
Lec13 Clustering.pptxLec13 Clustering.pptx
Lec13 Clustering.pptx
 
CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptx
 
Clustering.pptx
Clustering.pptxClustering.pptx
Clustering.pptx
 
Data Mining Lecture_8(a).pptx
Data Mining Lecture_8(a).pptxData Mining Lecture_8(a).pptx
Data Mining Lecture_8(a).pptx
 
Clustering - Machine Learning Techniques
Clustering - Machine Learning TechniquesClustering - Machine Learning Techniques
Clustering - Machine Learning Techniques
 
07 learning
07 learning07 learning
07 learning
 
K – means cluster analysis.pptx
K – means cluster analysis.pptxK – means cluster analysis.pptx
K – means cluster analysis.pptx
 
K-Means Clustering Algorithm.pptx
K-Means Clustering Algorithm.pptxK-Means Clustering Algorithm.pptx
K-Means Clustering Algorithm.pptx
 
ML basic &amp; clustering
ML basic &amp; clusteringML basic &amp; clustering
ML basic &amp; clustering
 
Cluster spss week7
Cluster spss week7Cluster spss week7
Cluster spss week7
 
kmean clustering
kmean clusteringkmean clustering
kmean clustering
 
Clustering &amp; classification
Clustering &amp; classificationClustering &amp; classification
Clustering &amp; classification
 
Unsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and AssumptionsUnsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and Assumptions
 
Spss tutorial-cluster-analysis
Spss tutorial-cluster-analysisSpss tutorial-cluster-analysis
Spss tutorial-cluster-analysis
 

More from Smarten Augmented Analytics

Crime Type Prediction - Augmented Analytics Use Case – Smarten
Crime Type Prediction - Augmented Analytics Use Case – SmartenCrime Type Prediction - Augmented Analytics Use Case – Smarten
Crime Type Prediction - Augmented Analytics Use Case – SmartenSmarten Augmented Analytics
 
What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...
What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...
What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...Smarten Augmented Analytics
 
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...What Is Generalized Linear Regression with Gaussian Distribution And How Can ...
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...Smarten Augmented Analytics
 
What Is Random Forest Classification And How Can It Help Your Business?
What Is Random Forest Classification And How Can It Help Your Business?What Is Random Forest Classification And How Can It Help Your Business?
What Is Random Forest Classification And How Can It Help Your Business?Smarten Augmented Analytics
 
What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?
What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?
What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?Smarten Augmented Analytics
 
Students' Academic Performance Predictive Analytics Use Case – Smarten
Students' Academic Performance Predictive Analytics Use Case – SmartenStudents' Academic Performance Predictive Analytics Use Case – Smarten
Students' Academic Performance Predictive Analytics Use Case – SmartenSmarten Augmented Analytics
 
Random Forest Regression Analysis Reveals Impact of Variables on Target Values
Random Forest Regression Analysis Reveals Impact of Variables on Target Values  Random Forest Regression Analysis Reveals Impact of Variables on Target Values
Random Forest Regression Analysis Reveals Impact of Variables on Target Values Smarten Augmented Analytics
 
Gradient Boosting Regression Analysis Reveals Dependent Variables and Interre...
Gradient Boosting Regression Analysis Reveals Dependent Variables and Interre...Gradient Boosting Regression Analysis Reveals Dependent Variables and Interre...
Gradient Boosting Regression Analysis Reveals Dependent Variables and Interre...Smarten Augmented Analytics
 
What is Simple Linear Regression and How Can an Enterprise Use this Technique...
What is Simple Linear Regression and How Can an Enterprise Use this Technique...What is Simple Linear Regression and How Can an Enterprise Use this Technique...
What is Simple Linear Regression and How Can an Enterprise Use this Technique...Smarten Augmented Analytics
 
What is Multiple Linear Regression and How Can it be Helpful for Business Ana...
What is Multiple Linear Regression and How Can it be Helpful for Business Ana...What is Multiple Linear Regression and How Can it be Helpful for Business Ana...
What is Multiple Linear Regression and How Can it be Helpful for Business Ana...Smarten Augmented Analytics
 
Fraud Mitigation Predictive Analytics Use Case – Smarten
Fraud Mitigation Predictive Analytics Use Case – SmartenFraud Mitigation Predictive Analytics Use Case – Smarten
Fraud Mitigation Predictive Analytics Use Case – SmartenSmarten Augmented Analytics
 
Quality Control Predictive Analytics Use Case - Smarten
Quality Control Predictive Analytics Use Case - SmartenQuality Control Predictive Analytics Use Case - Smarten
Quality Control Predictive Analytics Use Case - SmartenSmarten Augmented Analytics
 
Machine Maintenance Management Predictive Analytics Use Case - Smarten
Machine Maintenance Management Predictive Analytics Use Case - SmartenMachine Maintenance Management Predictive Analytics Use Case - Smarten
Machine Maintenance Management Predictive Analytics Use Case - SmartenSmarten Augmented Analytics
 
Predictive Analytics Using External Data Augmented Analytics Use Case - Smarten
Predictive Analytics Using External Data Augmented Analytics Use Case - SmartenPredictive Analytics Using External Data Augmented Analytics Use Case - Smarten
Predictive Analytics Using External Data Augmented Analytics Use Case - SmartenSmarten Augmented Analytics
 
Marketing Optimization Augmented Analytics Use Cases - Smarten
Marketing Optimization Augmented Analytics Use Cases - SmartenMarketing Optimization Augmented Analytics Use Cases - Smarten
Marketing Optimization Augmented Analytics Use Cases - SmartenSmarten Augmented Analytics
 
Human Resource Attrition Augmented Analytics Use Case - Smarten
Human Resource Attrition Augmented Analytics Use Case - SmartenHuman Resource Attrition Augmented Analytics Use Case - Smarten
Human Resource Attrition Augmented Analytics Use Case - SmartenSmarten Augmented Analytics
 
Customer Targeting Augmented Analytics Use Case - Smarten
Customer Targeting Augmented Analytics Use Case - SmartenCustomer Targeting Augmented Analytics Use Case - Smarten
Customer Targeting Augmented Analytics Use Case - SmartenSmarten Augmented Analytics
 
What is KNN Classification and How Can This Analysis Help an Enterprise?
What is KNN Classification and How Can This Analysis Help an Enterprise?What is KNN Classification and How Can This Analysis Help an Enterprise?
What is KNN Classification and How Can This Analysis Help an Enterprise?Smarten Augmented Analytics
 
What is Multiple Linear Regression and How Can it be Helpful for Business Ana...
What is Multiple Linear Regression and How Can it be Helpful for Business Ana...What is Multiple Linear Regression and How Can it be Helpful for Business Ana...
What is Multiple Linear Regression and How Can it be Helpful for Business Ana...Smarten Augmented Analytics
 
What is the Independent Samples T Test Method of Analysis and How Can it Bene...
What is the Independent Samples T Test Method of Analysis and How Can it Bene...What is the Independent Samples T Test Method of Analysis and How Can it Bene...
What is the Independent Samples T Test Method of Analysis and How Can it Bene...Smarten Augmented Analytics
 

What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to Analyze Data?

  • 1. Master the Art of Analytics: A Simplistic Explainer Series for Citizen Data Scientists. Journey Towards Augmented Analytics
  • 2. K-Means Clustering: Parameter Tuning & Use Cases
  • 3. What is covered: Terminologies; Introduction & example; Standard input/tuning parameters & sample UI; Sample output UI; Interpretation of output; Limitations; Business use cases
  • 5. What is it used for? K-means clustering is a process by which objects are classified into a number of groups so that they are as dissimilar as possible from one group to another and as similar as possible within each group. It is thus simply a grouping of similar things (data points). For example, the objects within group 1 (cluster 1) in the slide image should be as similar as possible, while an object in group 1 should differ markedly from an object in group 2. The attributes of the objects decide which objects are grouped together, so a natural grouping of data points can be achieved.
  • 6. Some examples. Let’s take a few examples for more clarity. Loan applicants in a bank can be grouped into low, medium and high risk applicants based on their age, annual income, employment tenure, loan amount, times delinquent, etc. using the K-means clustering algorithm. Users of a movie ticket booking website can be grouped into movie freaks, moderate watchers and rare watchers based on their past ticket purchase behavior, such as days since the last movie seen, average number of tickets booked each time, frequency of bookings per month, etc. Retail customers can be grouped into loyal, infrequent and rare customers based on their retail outlet/website visits per month, purchase amount per month, purchase frequency per month, etc. K-means is used to find groups which have not been explicitly labeled in the data. This can be used to confirm business assumptions about what types of groups exist or to identify unknown groups in complex data sets. Once the algorithm has been run and the groups are defined, any new data can be easily assigned to the correct group.
  • 8. How it works: Steps. Step 1: Decide on the value of k (the number of clusters, or groups) and on the input variables. The silhouette score can be used to determine k. Step 2: Scale the data using (x - min(x)) / (max(x) - min(x)) and initialize the cluster centers: randomly select k observations from the scaled data and treat them as the initial cluster centers. Step 3: Calculate the Euclidean distance between an observation and each initial cluster center, and assign the observation to the cluster at minimum distance. Step 4: Move on to the next observation, calculate the Euclidean distances, assign the observation to the cluster at minimum distance as in step 3, and update that cluster’s center. Step 5: Repeat step 4 until all observations are assigned a cluster membership. Step 6: Check the cluster plot and the silhouette score to measure the goodness of the clusters generated.
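The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Smarten implementation: the function names `min_max_scale` and `sequential_kmeans` are hypothetical, and the first k rows are used as the initial centers (instead of a random draw) so that the run is repeatable.

```python
import math

def min_max_scale(rows):
    """Scale every column to [0, 1] using (x - min) / (max - min)."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple((x - lo) / (hi - lo) if hi > lo else 0.0
              for x, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

def sequential_kmeans(points, k):
    """One pass over the data (steps 3 to 5): assign each point to the
    nearest center, then move that center to the mean of its members."""
    centers = [points[i] for i in range(k)]  # assumption: first k rows as seeds
    members = [[] for _ in range(k)]
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centers]
        j = distances.index(min(distances))
        members[j].append(p)
        centers[j] = tuple(sum(v) / len(members[j]) for v in zip(*members[j]))
        labels.append(j)
    return labels, centers

# Toy data, made up for illustration only.
scaled = min_max_scale([(0, 0), (10, 10), (1, 1), (9, 9)])
labels, centers = sequential_kmeans(scaled, k=2)
print(labels)
print(centers)
```

Because each center moves as soon as a point joins it, the result of this single-pass variant can depend on the order of the rows, which is one of the limitations listed later in the deck.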
  • 9. Example. Data sample (Height, Weight): (185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77), (180, 71), (180, 70), (183, 84), (180, 88), (180, 67), (177, 76). Step 1: Input. The inputs are the scaled variables and the number of clusters (k). In this example only two variables, height and weight, are considered for clustering, and the number of clusters is 2. Step 2: Initialize cluster centers. The cluster centers are initialized with the first two observations: K1 = (185, 72) and K2 = (170, 56). Step 3: Calculate Euclidean distance. The Euclidean distance between an observation and initial cluster centers 1 and 2 is calculated, and each observation is assigned to the cluster at minimum distance.
  • 10. Example. Step 3, continued: the Euclidean distance from each cluster center is calculated for the first two observations, (185, 72) and (170, 56). Observation (185, 72): distance from cluster 1 = SQRT[(185-185)^2 + (72-72)^2] = 0; distance from cluster 2 = SQRT[(185-170)^2 + (72-56)^2] = 21.93; assigned to cluster 1. Observation (170, 56): distance from cluster 1 = SQRT[(170-185)^2 + (56-72)^2] = 21.93; distance from cluster 2 = SQRT[(170-170)^2 + (56-56)^2] = 0; assigned to cluster 2. Updated centers: K1 = (185, 72), K2 = (170, 56). There is no change in the centers because the same two observations were used as the initial centers.
  • 11. Example. Step 4: Move on to the next observation, (168, 60), calculate the Euclidean distances, assign cluster membership and update the cluster centers. Distance from cluster 1 = SQRT[(168-185)^2 + (60-72)^2] = 20.809; distance from cluster 2 = SQRT[(168-170)^2 + (60-56)^2] = 4.472. Since the distance from cluster 2 is the minimum, the observation is assigned to cluster 2. The cluster centers are then revised to the mean Height and Weight of their observations. The addition is only to cluster 2, so only its centroid is updated: K1 = (185, 72), K2 = ((170+168)/2, (56+60)/2) = (169, 58).
  • 12. Example. Step 5: Repeat step 4: for each remaining observation, calculate the Euclidean distances, assign the observation to the cluster at minimum distance and update the cluster centers, until all observations are assigned a cluster membership. Next observation: (179, 68). Distance from cluster 1 = SQRT[(179-185)^2 + (68-72)^2] = 7.21; distance from cluster 2, using its updated center (169, 58), = SQRT[(179-169)^2 + (68-58)^2] = 14.14. Since the distance from cluster 1 is the minimum, the observation is assigned to cluster 1. The cluster centroids are then revised to the mean Height and Weight of their observations. The addition is only to cluster 1, so only its centroid is updated: K1 = ((185+179)/2, (72+68)/2) = (182, 70), K2 = (169, 58).
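The arithmetic in the worked example can be checked with a few lines of Python. This sketch replays only the first four observations from the slides; the variable names are illustrative, and the raw (unscaled) heights and weights are used because that is what the slides show.

```python
import math

# First four (height, weight) observations from the example slides.
observations = [(185, 72), (170, 56), (168, 60), (179, 68)]

centers = [observations[0], observations[1]]  # K1 and K2 start at the first two rows
members = [[], []]
log = []

for obs in observations:
    d = [math.dist(obs, c) for c in centers]
    nearest = d.index(min(d))
    members[nearest].append(obs)
    # Only the center that received the observation is recomputed.
    centers[nearest] = tuple(
        sum(v) / len(members[nearest]) for v in zip(*members[nearest])
    )
    log.append((obs, round(d[0], 2), round(d[1], 2), nearest + 1))

for row in log:
    print(row)
# ((185, 72), 0.0, 21.93, 1)
# ((170, 56), 21.93, 0.0, 2)
# ((168, 60), 20.81, 4.47, 2)
# ((179, 68), 7.21, 14.14, 1)
print(centers)  # [(182.0, 70.0), (169.0, 58.0)]
```

Note that the distance of 14.14 for the fourth observation is measured from the updated cluster 2 center (169, 58), not from the original seed (170, 56).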
  • 13. Example. Step 6: Draw the cluster plot to see how the clusters are distributed. The less the clusters overlap, the better the distribution and cluster assignments. Final cluster centers: K1 = (182.8, 72), K2 = (169, 58). Cluster plot: silhouette = 0.8, indicating very good quality clusters. The closer the silhouette score is to 1, the better the quality of the clusters.
  • 15. Standard tuning parameters. Number of clusters (K): the desired number of clusters. Suggested range: 3 to 5. The actual number in the output could be smaller if there are no divisible clusters in the data. This input can be automated using the silhouette score (explained in later slides). Max iterations: the maximum number of k-means iterations to split clusters. By default this value should be set to 20.
  • 16. Sample UI For Input Variables & Parameters Selection & Output
  • 17. Sample UI for selecting predictors and applying tuning parameters: for two predictors. Select the variables you would like to use as predictors to build clusters: Height, Weight, BMI. Tuning parameters: Number of clusters, Maximum iterations. • The silhouette score is another useful criterion for assessing the natural and optimal number of clusters as well as for checking the overall quality of the partition. • The largest silhouette score, over different K, indicates the best number of clusters.
  • 18. Output UI: for two predictors. Each customer is assigned a cluster membership, as shown in the table below.
      Height (cm)  Weight (kg)  BMI  Cluster number
      158          60           23   1
      160          65           25   2
      170          70           26   2
      149          50           21   1
      180          80           27   3
      165          80           28   3
      200          90           23   1
    Cluster plot (axes: Height, Weight): silhouette = 0.7, indicating good quality clusters. As the clusters are built using only 2 predictors here, the scatter plot axes reflect the actual predictors (Height and Weight) instead of principal components. Again, the less the cluster outlines overlap, the better the cluster assignment. Alternatively, the silhouette score can be checked to evaluate how the clusters are partitioned: the closer this value is to 1, the better the partition quality.
  • 19. Sample UI for selecting predictors and applying tuning parameters: for four predictors. Select the variables you would like to use as predictors to build clusters: Purchase amount, Purchase frequency, Total purchase quantity, Annual income, Website visits. Tuning parameters: Number of clusters, Maximum iterations. • The silhouette score is another useful criterion for assessing the natural and optimal number of clusters as well as for checking the overall quality of the partition. • The largest silhouette score, over different K, indicates the best number of clusters.
  • 20. Output UI: for four predictors. Each customer is assigned a cluster membership, as shown in the table below.
      Customer ID  Purchase amount (per month)  Purchase frequency (per month)  Total purchase quantity  Annual income slab  Website visits (per month)  Cluster number
      1            5k to 10k                    2                               6                        2 lac to 4 lac      3 to 6                      2
      2            1k to 5k                     1                               4                        1 lac to 2 lac      <3                          1
      3            10k to 15k                   3                               8                        4 lac to 6 lac      6 to 10                     3
      4            5k to 10k                    2                               6                        2 lac to 4 lac      3 to 6                      2
      5            10k to 15k                   3                               8                        4 lac to 6 lac      6 to 10                     3
      6            1k to 5k                     1                               2                        1 lac to 2 lac      <3                          1
      7            5k to 10k                    2                               6                        2 lac to 4 lac      3 to 6                      2
      8            1k to 5k                     1                               2                        1 lac to 2 lac      <3                          1
      9            1k to 5k                     1                               2                        1 lac to 2 lac      <3                          1
      10           >20k                         4                               32                       10 lac to 15 lac    >10                         3
    Cluster plot (axes: first principal component, second principal component). The 2D cluster plot shows how the clusters are distributed. Whenever there are more than 2 predictors as input, the axes reflect principal components instead of the actual predictors: the first two in a 2D plot, the first three in a 3D plot. The less the clusters overlap, the better the cluster assignment. The silhouette score can also be checked to evaluate how the clusters are partitioned: the closer this value is to 1, the better the partition quality. Principal components are linear combinations of the original predictors, chosen so that each captures the maximum remaining variance in the data set. The first few principal components often explain most of the variance in the data, in which case the remaining components can be ignored for plotting.
  • 22. Other sample output formats. silhouette = -0.5: poor quality clusters. silhouette = 0.3: average quality clusters. silhouette = 0.7: good quality clusters. • Clusters with a silhouette score closer to 1 are more desirable, so this index can be used to measure the goodness of clusters.
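The slides quote silhouette values without showing how they are computed. A minimal sketch of the standard definition follows, assuming Euclidean distance: for each point, a is the mean distance to the other points in its own cluster, b is the lowest mean distance to the points of any other cluster, and the point's score is (b - a) / max(a, b). The function name `silhouette_score` and the toy data are illustrative, not taken from the deck.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # The point's distance to itself is 0, so divide by len(own) - 1.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Two tight, well-separated pairs of points (made-up data).
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # natural grouping
bad = silhouette_score(points, [0, 1, 0, 1])   # groups cut across the gap
print(round(good, 2), round(bad, 2))
```

The natural grouping scores close to 1, while the labeling that splits each pair scores below 0, matching the "good" and "poor" readings on the slide.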
  • 24. Limitations. • The number of clusters, k, must be determined beforehand; ideally the algorithm should suggest this number automatically for better user friendliness. • It does not yield the same result with each run, since the resulting clusters depend on the initial random choice of cluster centers. • If the data is input in a different order, it may produce different clusters when the number of data points is small, hence the number of data points must be large enough. • It has been suggested that 2^m (where m = number of clustering variables) can be used as a rule of thumb for the sample size. • K-means is suitable only for numeric data. • The scale of the variables influences Euclidean distance, so variable standardization becomes necessary. • Empty clusters can be obtained if no points are allocated to a cluster during the assignment step.
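The point about scale can be made concrete with a small sketch. It takes the annual income and times-delinquent values of three applicants from the loan use-case table and asks which of two applicants is closer to the first, before and after min-max scaling. The helper `min_max_scale` is a hypothetical name, and scaling over only three rows is a simplification for illustration.

```python
import math

# (annual income, times delinquent) for three applicants from the use-case table.
a, b, c = (105000, 5), (113000, 0), (92000, 5)

def min_max_scale(rows):
    """Scale every column to [0, 1]; assumes no column is constant."""
    cols = list(zip(*rows))
    return [
        tuple((x - min(col)) / (max(col) - min(col)) for x, col in zip(row, cols))
        for row in rows
    ]

# Raw distances are dominated by income, which is measured in much larger units.
raw_nearest = "b" if math.dist(a, b) < math.dist(a, c) else "c"

# After scaling, both variables contribute on the same 0-to-1 range.
sa, sb, sc = min_max_scale([a, b, c])
scaled_nearest = "b" if math.dist(sa, sb) < math.dist(sa, sc) else "c"

print(raw_nearest, scaled_nearest)  # b c
```

On raw values the nearest neighbor is the applicant with the closest income, even though that applicant has a very different delinquency count; after scaling, the applicant with the same delinquency count is nearer.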
  • 26. General applications of K-means clustering. Behavioral segmentation: segment customers by purchase history; segment users by activities on an application, website or platform; define personas based on interests; create consumer profiles based on activity monitoring. Other general applications: pattern recognition, market segmentation, classification analysis, artificial intelligence, image processing, astronomy, agriculture and many others.
  • 27. Use case 1. Business problem: grouping loan applicants into high/medium/low risk applicants based on attributes such as loan amount, monthly installment, employment tenure, times delinquent, annual income, debt to income ratio, etc. Business benefit: once segments are identified, the bank will have a loan applicants’ dataset with each applicant labeled as high, medium or low risk. Based on these labels, the bank can easily decide whether to give a loan to an applicant and, if so, what credit limit and interest rate the applicant is eligible for given the amount of risk involved.
  • 28. Use case 1. Input dataset:
      Customer ID  Loan amount  Monthly installment  Annual income  Debt to income ratio  Times delinquent  Employment tenure
      1039153      21000        701.73               105000         9                     5                 4
      1069697      15000        483.38               92000          11                    5                 2
      1068120      25600        824.96               110000         10                    9                 2
      563175       23000        534.94               80000          9                     2                 12
      562842       19750        483.65               57228          11                    3                 21
      562681       25000        571.78               113000         10                    0                 9
      562404       21250        471.2                31008          12                    1                 12
      700159       14400        448.99               82000          20                    6                 6
      696484       10000        241.33               45000          18                    8                 2
      702598       11700        381.61               45192          20                    7                 3
      702470       10000        243.29               38000          17                    9                 7
      702373       4800         144.77               54000          19                    8                 2
      701975       12500        455.81               43560          15                    8                 4
  • 29. Use case 1. Output: each record has a cluster (segment) assignment, as shown below.
      Customer ID  Loan amount  Monthly installment  Annual income  Debt to income ratio  Times delinquent  Employment tenure  Cluster
      1039153      21000        701.73               105000         9                     5                 4                  Medium
      1069697      15000        483.38               92000          11                    5                 2                  Medium
      1068120      25600        824.96               110000         10                    9                 2                  Medium
      563175       23000        534.94               80000          9                     2                 12                 Low
      562842       19750        483.65               57228          11                    3                 21                 Low
      562681       25000        571.78               113000         10                    0                 9                  Low
      562404       21250        471.2                31008          12                    1                 12                 Low
      700159       14400        448.99               82000          20                    6                 6                  High
      696484       10000        241.33               45000          18                    8                 2                  High
      702598       11700        381.61               45192          20                    7                 3                  High
      702470       10000        243.29               38000          17                    9                 7                  High
      702373       4800         144.77               54000          19                    8                 2                  High
      701975       12500        455.81               43560          15                    8                 4                  High
  • 30. Use case 1. Output: cluster profiles.
      Cluster  Loan amount  Monthly installment  Annual income  Debt to income ratio  Times delinquent  Employment tenure  Risk segment
      1        10447.30     304.87               66467.74       9.58                  1.69              16.82              Low
      2        21391.58     598.54               94912.59       12.37                 5.98              4.58               Medium
      3        7521.32      227.43               60935.28       16.55                 6.91              4.01               High
    As the table shows, the high, medium and low risk segments have distinctive characteristics. The high risk segment has the highest likelihood of delinquency, the highest debt to income ratio and the lowest employment tenure of the three segments. The low risk segment exhibits exactly the opposite pattern: the lowest debt to income ratio, the lowest delinquency and the highest employment tenure. Hence delinquency, employment tenure and debt to income ratio are the determining factors when it comes to segmenting loan applicants.
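A cluster profile of this kind is simply the per-cluster mean of each variable. The sketch below computes it for three of the variables, using only the 13 sample rows shown in the output table; the profile figures on the slide appear to come from a larger dataset, so the numbers here will not match them. The names `ROWS`, `FEATURES` and `cluster_profiles` are illustrative.

```python
from collections import defaultdict

# Columns: debt to income ratio, times delinquent, employment tenure, cluster.
# Assumption: these are only the 13 sample rows printed on the output slide.
ROWS = [
    (9, 5, 4, "Medium"), (11, 5, 2, "Medium"), (10, 9, 2, "Medium"),
    (9, 2, 12, "Low"), (11, 3, 21, "Low"), (10, 0, 9, "Low"), (12, 1, 12, "Low"),
    (20, 6, 6, "High"), (18, 8, 2, "High"), (20, 7, 3, "High"),
    (17, 9, 7, "High"), (19, 8, 2, "High"), (15, 8, 4, "High"),
]
FEATURES = ("debt_to_income", "times_delinquent", "employment_tenure")

def cluster_profiles(rows):
    """Return {cluster label: {feature name: mean value}}."""
    grouped = defaultdict(list)
    for *values, label in rows:
        grouped[label].append(values)
    profiles = {}
    for label, members in grouped.items():
        means = [sum(col) / len(col) for col in zip(*members)]
        profiles[label] = dict(zip(FEATURES, means))
    return profiles

profiles = cluster_profiles(ROWS)
for label in ("Low", "Medium", "High"):
    print(label, {k: round(v, 2) for k, v in profiles[label].items()})
```

Even on this small sample the low risk group stands out with the fewest delinquencies and the longest employment tenure, which is the pattern the slide describes.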
  • 31. Use case 1. Output: cluster distribution. In the cluster distribution plot there is negligible overlap between the cluster outlines, so the cluster assignment is good in this case. silhouette = 0.6, indicating good quality clusters. Clusters with an average silhouette width closer to 1 are more desirable, so this index can be used to test the quality of the cluster distribution.
  • 32. Use case 2. Business problem: organizing customers into groups/segments based on similar traits, product preferences and expectations. Segments are constructed on the basis of the customers’ demographic characteristics, psychographics, past behavior and product use behavior. Business benefit: once segments are identified, marketing messages and even products can be customized for each segment. The better the segment(s) chosen for targeting by an organization, the more successful it is assumed to be in the marketplace.
  • 33. Use case 3. Business problem: discount analysis and customer retention. Visualize segments of sales groups based on discount behavior, and segments of customers on the verge of leaving (customer churn). Business benefit: the marketing team can focus efficiently on risky customer segments in order to prevent them from churning. Sales team segments that are facing challenges under the current discounting strategy can be identified, and the deal negotiation strategy can be improved or optimized for them.
  • 34. Want to learn more? Get in touch with us at support@Smarten.com, and do check out the Learning section on Smarten.com. June 2018