Representative Based Clustering Algorithm: Part 1,
K-Means
Ananda Swarup Das, Technical Staff Member,
IBM India Research Labs, New Delhi, anandaswarup@gmail.com.
December 18, 2016
Ananda Swarup Das, Technical Staff Member, IBM India Research Labs, New Delhi, anandaswarup@gmail.com.Representative Based Clustering Algorithm: Part 1, K-MeansDecember 18, 2016 1 / 15
Standing on the Shoulders of Giants
Please note that I use the excellent book "Python Machine Learning" [3] for most of the programming examples in this presentation.
The theoretical material is drawn from multiple sources, including [1], [2] and [4].
Thanks to all the authors for such great books.
Definition of Clustering
A Formal Definition: Clustering can be defined as partitioning a given data set $D = \{x_i\}_{i=1}^{n}$, where each $x_i \in \mathbb{R}^d$, into k sub-partitions (clusters) denoted by $C = \{C_1, \ldots, C_k\}$ such that $C_i \cap C_j = \emptyset$ for $1 \le i < j \le k$ and $\bigcup_{j=1}^{k} C_j = D$. Here k is a user-defined parameter.
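To make the partition requirement concrete (clusters that are pairwise disjoint and together cover D), here is a minimal sketch in Python. The toy data, the candidate clusterings, and the helper name `is_partition` are all assumptions made for illustration; they do not come from the cited books.

```python
# Hypothetical toy data: n = 6 points, identified by their indices 0..5.
D = set(range(6))

# A candidate clustering with k = 3 (cluster contents chosen arbitrarily).
C = [{0, 1}, {2, 3, 4}, {5}]


def is_partition(clusters, data):
    """Return True if the clusters are pairwise disjoint and their union is data."""
    union = set().union(*clusters)
    covers = union == data
    # Pairwise disjoint sets have sizes that add up to the size of their union.
    disjoint = sum(len(c) for c in clusters) == len(union)
    return covers and disjoint


print(is_partition(C, D))                 # a valid partition of D
print(is_partition([{0, 1}, {1, 2}], D))  # overlapping clusters that also miss points
```

Note that the check passes for any arbitrary grouping that satisfies the two conditions, which is exactly why the definition alone says nothing about cluster quality.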
Is the definition okay?
The definition is incomplete in the sense that it says nothing about the quality of each sub-partition.
Going simply by the previous definition, the points can be grouped arbitrarily into k sub-groups. (Will that help?)
Did we miss something? (Well, yes: we did not speak about how we represent each cluster.)
The Representative Based Clustering
For each cluster, we try to find a representative point that
summarizes the cluster.
Ideally, it is the mean of the cluster.
The K-means algorithm is an example of representative-based clustering.
K Means Clustering: Definition and the Objective Function
Given the task of clustering, the first important step is to find an appropriate scoring function that measures the quality of a clustering.
K-means clustering greedily finds k means $\mu_1, \ldots, \mu_k$ for the clusters $C_1, \ldots, C_k$.
The sum of squared errors for each cluster $C_i$ is given as $SSE(C_i) = \sum_{x_j \in C_i} \|x_j - \mu_i\|^2$.
The sum of squared errors for the clustering scheme $C$ is defined as $SSE(C) = \sum_{i=1}^{k} \sum_{x_j \in C_i} \|x_j - \mu_i\|^2$.
The objective is therefore to find the clustering scheme $C^* = \arg\min_{C} \{SSE(C)\}$.
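As a small illustration of this scoring function, the sketch below computes the sum of squared distances of every point to the mean of its own cluster, for two different groupings of the same four points. The function name `sse`, the toy points, and the two label vectors are assumptions chosen for the example.

```python
import numpy as np


def sse(X, labels, k):
    """Sum of squared errors of a hard clustering, using each cluster's mean."""
    total = 0.0
    for i in range(k):
        members = X[labels == i]
        if len(members) == 0:
            continue  # an empty cluster contributes nothing
        mu_i = members.mean(axis=0)             # representative of cluster i
        total += np.sum((members - mu_i) ** 2)  # squared distances to the mean
    return float(total)


# Hypothetical toy data: two tight pairs of points, far apart from each other.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
good = np.array([0, 0, 1, 1])  # groups nearby points together
bad = np.array([0, 1, 0, 1])   # groups distant points together

print(sse(X, good, 2))  # 4.0
print(sse(X, bad, 2))   # 100.0
```

Both label vectors are valid partitions, but the lower score of the first one is what the objective uses to prefer it.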
K Means Clustering: Algorithmic Steps and Hard Assignment
As stated in [4],
1 At the first step t = 0, randomly initialize k centroids denoted by $\mu_1^t, \ldots, \mu_k^t$.
2 repeat
    Increment the iteration index t by 1.
    Let $C_j = \emptyset$ for all j = 1, ..., k.
    for each $x_j$ in the data set D, do
        Find $j^* = \arg\min_i \{\|x_j - \mu_i^{t-1}\|^2\}$.
        $C_{j^*} = C_{j^*} \cup \{x_j\}$.
3 Update the centroid of each cluster as $\mu_i^t = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j$.
4 Stop if $\sum_{i=1}^{k} \|\mu_i^t - \mu_i^{t-1}\|^2 \le \epsilon$, where $\epsilon$ is a user-defined parameter; otherwise go back to step 2.
Notice that in each iteration of K-means, every point in D is greedily assigned to exactly one cluster. This is called hard assignment.
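The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation from [4] or from scikit-learn: the function name `kmeans`, the choice of sampling k data points as initial centroids, the `max_iter` safety cap, and the rule that an empty cluster keeps its previous centroid are all choices made for this example.

```python
import numpy as np


def kmeans(X, k, eps=1e-4, max_iter=100, seed=0):
    """Minimal k-means with hard assignment (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids as k distinct, randomly chosen data points
    # (an assumption; the slide only says "randomly initialize").
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid (hard assignment).
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # shape (n, k)
        labels = d2.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its members.
        new_mu = mu.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:  # assumption: empty clusters keep the old centroid
                new_mu[i] = members.mean(axis=0)
        # Step 4: stop when the total squared centroid shift is at most eps.
        shift = ((new_mu - mu) ** 2).sum()
        mu = new_mu
        if shift <= eps:
            break
    return labels, mu


# Hypothetical toy data: two well-separated groups of three points each.
X_demo = np.array([[0, 0], [0, 1], [1, 0],
                   [10, 10], [10, 11], [11, 10]], dtype=float)
labels, mu = kmeans(X_demo, k=2)
print(labels)
print(mu)
```

Each row of `labels` holds a single cluster index, which is the hard assignment in code form: a point never belongs to two clusters at once.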
A Question to Ponder
Can we do something so that instead of greedily assigning a point to one
cluster at most, we assign the point to multiple clusters ?
Yes, we can, but we will defer the answer for some time as we have some maths to brush up on first. Part 2 of this series will answer the question.
A Few Things to Learn
1 Clustering is an unsupervised technique.
    You are not provided with any training data or labeled data to train a system.
    You are trying to find some group/pattern in the data.
2 How to decide an ideal value for the parameter k.
    Use the elbow method.
    If the dimension is not too high, one can also use the Bayesian Information Criterion (BIC).
Visualizations
I am using make_blobs from sklearn.datasets, following examples from [3], to generate a 2-d sample data set with four centers. It is synthetic data (for demo purposes); in practice, one will rarely get such well-clustered data.
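A data set of this kind could be generated roughly as follows. The specific values for `n_samples`, `cluster_std`, and `random_state` are assumptions for this sketch, since the slide only specifies two dimensions and four centers.

```python
from sklearn.datasets import make_blobs

# Assumed settings: 200 points and a small spread so the four blobs stay compact.
X, y = make_blobs(n_samples=200, n_features=2, centers=4,
                  cluster_std=0.5, shuffle=True, random_state=0)

print(X.shape)                  # (200, 2): one row per point, two features
print(sorted(set(y.tolist())))  # [0, 1, 2, 3]: the blob each point was drawn from
```

The array `y` holds the generating blob of each point. It is not passed to the clustering algorithm, which only sees `X`.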
Introducing k-Means from sklearn.cluster
This is as simple as follows:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=4, init='random', n_init=10,
            max_iter=800, tol=1e-04, random_state=0)
The important terms:
n_clusters denotes the number of clusters you want. This is your value of k.
init='random' means k randomly chosen data points will be initially selected as the centroids/means.
n_init denotes the number of times the k-means algorithm will be run with different centroid seeds.
max_iter=800 denotes the maximum number of iterations of a single run of the KMeans algorithm. The scikit-learn default is 300.
tol=1e-04 is the tolerance used to declare convergence. Remember the stopping rule $\sum_{i=1}^{k} \|\mu_i^t - \mu_i^{t-1}\|^2 \le \epsilon$: tol plays the role of $\epsilon$ (scikit-learn applies it as a relative tolerance).
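To see the estimator in action, the sketch below fits it on a synthetic four-blob data set and inspects the result. The `make_blobs` settings are assumptions for this example; `fit_predict`, `cluster_centers_`, and `inertia_` are standard parts of the scikit-learn `KMeans` interface.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed synthetic data: 200 two-dimensional points around four centers.
X, _ = make_blobs(n_samples=200, n_features=2, centers=4,
                  cluster_std=0.5, random_state=0)

km = KMeans(n_clusters=4, init='random', n_init=10,
            max_iter=800, tol=1e-04, random_state=0)

# fit_predict runs the algorithm and returns one cluster index per point.
labels = km.fit_predict(X)

print(labels[:10])                 # hard assignments of the first ten points
print(km.cluster_centers_.shape)   # (4, 2): one mean per cluster
print(km.inertia_)                 # SSE of the best of the n_init runs
```

Because `n_init=10`, the algorithm is restarted ten times and the run with the lowest inertia is the one kept in `km`.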
Deciding the Number of Clusters k
1 Remember that the sum of squared errors for the clustering scheme C is defined as $SSE(C) = \sum_{i=1}^{k} \sum_{x_j \in C_i} \|x_j - \mu_i\|^2$. This quantity is also known as cluster distortion or cluster inertia.
2 The KMeans class of sklearn.cluster gives you that value as km.inertia_ after fitting.
3 Run your KMeans algorithm in a loop where, at each iteration, you choose a different value of k. Collect the cluster inertia for that value of k.
4 Make a plot of inertia against k and find the elbow.
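The loop described in these steps might look like the following sketch. The data settings and the range of k values (1 to 8) are assumptions for illustration; plotting is left out so the example runs without a display.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed synthetic data: 200 two-dimensional points around four centers.
X, _ = make_blobs(n_samples=200, n_features=2, centers=4,
                  cluster_std=0.5, random_state=0)

inertias = {}
for k in range(1, 9):  # candidate values of k (an arbitrary range for the demo)
    km = KMeans(n_clusters=k, init='random', n_init=10,
                max_iter=300, tol=1e-04, random_state=0)
    km.fit(X)
    inertias[k] = km.inertia_  # cluster distortion for this k

for k, value in inertias.items():
    print(k, round(value, 2))
```

Plotting `inertias` against k (for instance with matplotlib) gives the curve in which one looks for the elbow.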
The Elbow Method
Figure: Notice the sharp decline in distortion from k = 3 to k = 4. The bend at k = 4, beyond which adding clusters helps much less, is called the elbow. This suggests that k = 4 is probably a good choice.
In the Next Part
In the next part of this series (probably in about a week), we will introduce the Expectation Maximization algorithm with elaborate details and explanations.
Till then, happy data science with Python.
Citations
[1] G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer Texts in Statistics. Springer New York, 2014.
[2] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[3] S. Raschka. Python Machine Learning. Packt Publishing, 2015.
[4] M. J. Zaki and W. Meira. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press, 2014.
Ananda Swarup Das, Technical Staff Member, IBM India Research Labs, New Delhi, anandaswarup@gmail.com.Representative Based Clustering Algorithm: Part 1, K-MeansDecember 18, 2016 15 / 15

More Related Content

What's hot

Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16MLconf
 
5 parallel implementation 06299286
5 parallel implementation 062992865 parallel implementation 06299286
5 parallel implementation 06299286Ninad Samel
 
Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Turi, Inc.
 
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017MLconf
 
Distributed Deep Learning on AWS with Apache MXNet
Distributed Deep Learning on AWS with Apache MXNetDistributed Deep Learning on AWS with Apache MXNet
Distributed Deep Learning on AWS with Apache MXNetAmazon Web Services
 
Introduction To TensorFlow | Deep Learning Using TensorFlow | TensorFlow Tuto...
Introduction To TensorFlow | Deep Learning Using TensorFlow | TensorFlow Tuto...Introduction To TensorFlow | Deep Learning Using TensorFlow | TensorFlow Tuto...
Introduction To TensorFlow | Deep Learning Using TensorFlow | TensorFlow Tuto...Edureka!
 
Boosted tree
Boosted treeBoosted tree
Boosted treeZhuyi Xue
 
Mining Frequent Closed Graphs on Evolving Data Streams
Mining Frequent Closed Graphs on Evolving Data StreamsMining Frequent Closed Graphs on Evolving Data Streams
Mining Frequent Closed Graphs on Evolving Data StreamsAlbert Bifet
 
Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnDataRobot
 
Parallel Guided Local Search and Some Preliminary Experimental Results for Co...
Parallel Guided Local Search and Some Preliminary Experimental Results for Co...Parallel Guided Local Search and Some Preliminary Experimental Results for Co...
Parallel Guided Local Search and Some Preliminary Experimental Results for Co...csandit
 
強化学習の分散アーキテクチャ変遷
強化学習の分散アーキテクチャ変遷強化学習の分散アーキテクチャ変遷
強化学習の分散アーキテクチャ変遷Eiji Sekiya
 
Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial Alexandros Karatzoglou
 
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...Simplilearn
 
notes as .ppt
notes as .pptnotes as .ppt
notes as .pptbutest
 
EXPERIMENTS ON HYPOTHESIS "FUZZY K-MEANS IS BETTER THAN K-MEANS FOR CLUSTERING"
EXPERIMENTS ON HYPOTHESIS "FUZZY K-MEANS IS BETTER THAN K-MEANS FOR CLUSTERING"EXPERIMENTS ON HYPOTHESIS "FUZZY K-MEANS IS BETTER THAN K-MEANS FOR CLUSTERING"
EXPERIMENTS ON HYPOTHESIS "FUZZY K-MEANS IS BETTER THAN K-MEANS FOR CLUSTERING"IJDKP
 
GBM package in r
GBM package in rGBM package in r
GBM package in rmark_landry
 

What's hot (20)

Clustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn TutorialClustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn Tutorial
 
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
 
5 parallel implementation 06299286
5 parallel implementation 062992865 parallel implementation 06299286
5 parallel implementation 06299286
 
Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark
 
J0945761
J0945761J0945761
J0945761
 
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
 
Distributed Deep Learning on AWS with Apache MXNet
Distributed Deep Learning on AWS with Apache MXNetDistributed Deep Learning on AWS with Apache MXNet
Distributed Deep Learning on AWS with Apache MXNet
 
Introduction To TensorFlow | Deep Learning Using TensorFlow | TensorFlow Tuto...
Introduction To TensorFlow | Deep Learning Using TensorFlow | TensorFlow Tuto...Introduction To TensorFlow | Deep Learning Using TensorFlow | TensorFlow Tuto...
Introduction To TensorFlow | Deep Learning Using TensorFlow | TensorFlow Tuto...
 
Boosted tree
Boosted treeBoosted tree
Boosted tree
 
Mining Frequent Closed Graphs on Evolving Data Streams
Mining Frequent Closed Graphs on Evolving Data StreamsMining Frequent Closed Graphs on Evolving Data Streams
Mining Frequent Closed Graphs on Evolving Data Streams
 
Ppt shuai
Ppt shuaiPpt shuai
Ppt shuai
 
Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learn
 
Parallel Guided Local Search and Some Preliminary Experimental Results for Co...
Parallel Guided Local Search and Some Preliminary Experimental Results for Co...Parallel Guided Local Search and Some Preliminary Experimental Results for Co...
Parallel Guided Local Search and Some Preliminary Experimental Results for Co...
 
強化学習の分散アーキテクチャ変遷
強化学習の分散アーキテクチャ変遷強化学習の分散アーキテクチャ変遷
強化学習の分散アーキテクチャ変遷
 
Ihdels presentation
Ihdels presentationIhdels presentation
Ihdels presentation
 
Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial
 
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
 
notes as .ppt
notes as .pptnotes as .ppt
notes as .ppt
 
EXPERIMENTS ON HYPOTHESIS "FUZZY K-MEANS IS BETTER THAN K-MEANS FOR CLUSTERING"
EXPERIMENTS ON HYPOTHESIS "FUZZY K-MEANS IS BETTER THAN K-MEANS FOR CLUSTERING"EXPERIMENTS ON HYPOTHESIS "FUZZY K-MEANS IS BETTER THAN K-MEANS FOR CLUSTERING"
EXPERIMENTS ON HYPOTHESIS "FUZZY K-MEANS IS BETTER THAN K-MEANS FOR CLUSTERING"
 
GBM package in r
GBM package in rGBM package in r
GBM package in r
 

Similar to Representative basedclustering

Intro to Machine Learning for non-Data Scientists
Intro to Machine Learning for non-Data ScientistsIntro to Machine Learning for non-Data Scientists
Intro to Machine Learning for non-Data ScientistsParinaz Ameri
 
Data Structures problems 2002
Data Structures problems 2002Data Structures problems 2002
Data Structures problems 2002Sanjay Goel
 
IRJET- Unabridged Review of Supervised Machine Learning Regression and Classi...
IRJET- Unabridged Review of Supervised Machine Learning Regression and Classi...IRJET- Unabridged Review of Supervised Machine Learning Regression and Classi...
IRJET- Unabridged Review of Supervised Machine Learning Regression and Classi...IRJET Journal
 
About decision tree induction which helps in learning
About decision tree induction  which helps in learningAbout decision tree induction  which helps in learning
About decision tree induction which helps in learningGReshma10
 
Training deep auto encoders for collaborative filtering
Training deep auto encoders for collaborative filteringTraining deep auto encoders for collaborative filtering
Training deep auto encoders for collaborative filteringMarlesson Santana
 
Start machine learning in 5 simple steps
Start machine learning in 5 simple stepsStart machine learning in 5 simple steps
Start machine learning in 5 simple stepsRenjith M P
 
Lecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningLecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningmy6305874
 
Log Analytics in Datacenter with Apache Spark and Machine Learning
Log Analytics in Datacenter with Apache Spark and Machine LearningLog Analytics in Datacenter with Apache Spark and Machine Learning
Log Analytics in Datacenter with Apache Spark and Machine LearningPiotr Tylenda
 
Log Analytics in Datacenter with Apache Spark and Machine Learning
Log Analytics in Datacenter with Apache Spark and Machine LearningLog Analytics in Datacenter with Apache Spark and Machine Learning
Log Analytics in Datacenter with Apache Spark and Machine LearningAgnieszka Potulska
 
Memory efficient programming
Memory efficient programmingMemory efficient programming
Memory efficient programmingindikaMaligaspe
 
Scaling Deep Learning Algorithms on Extreme Scale Architectures
Scaling Deep Learning Algorithms on Extreme Scale ArchitecturesScaling Deep Learning Algorithms on Extreme Scale Architectures
Scaling Deep Learning Algorithms on Extreme Scale Architecturesinside-BigData.com
 
Tactical Data Science Tips: Python and Spark Together
Tactical Data Science Tips: Python and Spark TogetherTactical Data Science Tips: Python and Spark Together
Tactical Data Science Tips: Python and Spark TogetherDatabricks
 
Premeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringPremeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringIJCSIS Research Publications
 
Apache Sparkを用いたスケーラブルな時系列データの異常検知モデル学習ソフトウェアの開発
Apache Sparkを用いたスケーラブルな時系列データの異常検知モデル学習ソフトウェアの開発Apache Sparkを用いたスケーラブルな時系列データの異常検知モデル学習ソフトウェアの開発
Apache Sparkを用いたスケーラブルな時系列データの異常検知モデル学習ソフトウェアの開発Ryo 亮 Kawahara 河原
 
Silicon valleycodecamp2013
Silicon valleycodecamp2013Silicon valleycodecamp2013
Silicon valleycodecamp2013Sanjeev Mishra
 

Similar to Representative basedclustering (20)

slides.pptx
slides.pptxslides.pptx
slides.pptx
 
Intro to Machine Learning for non-Data Scientists
Intro to Machine Learning for non-Data ScientistsIntro to Machine Learning for non-Data Scientists
Intro to Machine Learning for non-Data Scientists
 
Machine Learning Basics
Machine Learning BasicsMachine Learning Basics
Machine Learning Basics
 
Data Structures problems 2002
Data Structures problems 2002Data Structures problems 2002
Data Structures problems 2002
 
IRJET- Unabridged Review of Supervised Machine Learning Regression and Classi...
IRJET- Unabridged Review of Supervised Machine Learning Regression and Classi...IRJET- Unabridged Review of Supervised Machine Learning Regression and Classi...
IRJET- Unabridged Review of Supervised Machine Learning Regression and Classi...
 
More on Pandas.pptx
More on Pandas.pptxMore on Pandas.pptx
More on Pandas.pptx
 
About decision tree induction which helps in learning
About decision tree induction  which helps in learningAbout decision tree induction  which helps in learning
About decision tree induction which helps in learning
 
DynaML: Splash 2016
DynaML: Splash 2016DynaML: Splash 2016
DynaML: Splash 2016
 
Training deep auto encoders for collaborative filtering
Training deep auto encoders for collaborative filteringTraining deep auto encoders for collaborative filtering
Training deep auto encoders for collaborative filtering
 
Start machine learning in 5 simple steps
Start machine learning in 5 simple stepsStart machine learning in 5 simple steps
Start machine learning in 5 simple steps
 
Lecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningLecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learning
 
Log Analytics in Datacenter with Apache Spark and Machine Learning
Log Analytics in Datacenter with Apache Spark and Machine LearningLog Analytics in Datacenter with Apache Spark and Machine Learning
Log Analytics in Datacenter with Apache Spark and Machine Learning
 
Log Analytics in Datacenter with Apache Spark and Machine Learning
Log Analytics in Datacenter with Apache Spark and Machine LearningLog Analytics in Datacenter with Apache Spark and Machine Learning
Log Analytics in Datacenter with Apache Spark and Machine Learning
 
Memory efficient programming
Memory efficient programmingMemory efficient programming
Memory efficient programming
 
Scaling Deep Learning Algorithms on Extreme Scale Architectures
Scaling Deep Learning Algorithms on Extreme Scale ArchitecturesScaling Deep Learning Algorithms on Extreme Scale Architectures
Scaling Deep Learning Algorithms on Extreme Scale Architectures
 
Tactical Data Science Tips: Python and Spark Together
Tactical Data Science Tips: Python and Spark TogetherTactical Data Science Tips: Python and Spark Together
Tactical Data Science Tips: Python and Spark Together
 
Premeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringPremeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means Clustering
 
Apache Sparkを用いたスケーラブルな時系列データの異常検知モデル学習ソフトウェアの開発
Apache Sparkを用いたスケーラブルな時系列データの異常検知モデル学習ソフトウェアの開発Apache Sparkを用いたスケーラブルな時系列データの異常検知モデル学習ソフトウェアの開発
Apache Sparkを用いたスケーラブルな時系列データの異常検知モデル学習ソフトウェアの開発
 
Python for data analysis
Python for data analysisPython for data analysis
Python for data analysis
 
Silicon valleycodecamp2013
Silicon valleycodecamp2013Silicon valleycodecamp2013
Silicon valleycodecamp2013
 

Recently uploaded

Planning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptxPlanning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptxLigayaBacuel1
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfphamnguyenenglishnb
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Quarter 4 Peace-education.pptx Catch Up Friday
Quarter 4 Peace-education.pptx Catch Up FridayQuarter 4 Peace-education.pptx Catch Up Friday
Quarter 4 Peace-education.pptx Catch Up FridayMakMakNepo
 
Atmosphere science 7 quarter 4 .........
Atmosphere science 7 quarter 4 .........Atmosphere science 7 quarter 4 .........
Atmosphere science 7 quarter 4 .........LeaCamillePacle
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceSamikshaHamane
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 

Recently uploaded (20)

Planning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptxPlanning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptx
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Quarter 4 Peace-education.pptx Catch Up Friday
Quarter 4 Peace-education.pptx Catch Up FridayQuarter 4 Peace-education.pptx Catch Up Friday
Quarter 4 Peace-education.pptx Catch Up Friday
 
Atmosphere science 7 quarter 4 .........
Atmosphere science 7 quarter 4 .........Atmosphere science 7 quarter 4 .........
Atmosphere science 7 quarter 4 .........
 
Rapple "Scholarly Communications and the Sustainable Development Goals"
Rapple "Scholarly Communications and the Sustainable Development Goals"Rapple "Scholarly Communications and the Sustainable Development Goals"
Rapple "Scholarly Communications and the Sustainable Development Goals"
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 

Representative basedclustering

  • 1. Representative Based Clustering Algorithm: Part 1, K-Means Ananda Swarup Das, Technical Staff Member, IBM India Research Labs, New Delhi, anandaswarup@gmail.com. December 18, 2016 Ananda Swarup Das, Technical Staff Member, IBM India Research Labs, New Delhi, anandaswarup@gmail.com.Representative Based Clustering Algorithm: Part 1, K-MeansDecember 18, 2016 1 / 15
  • 2. Standing on the Shoulders of the Giants. Please note that, I use the excellent book titled ”Python Machine Learning” [3] for most of the programming examples in this presentation. The Theoretical text is covered from multiple sources like [1], [2] and [4]. Thanks to all the authors for such great books. Ananda Swarup Das, Technical Staff Member, IBM India Research Labs, New Delhi, anandaswarup@gmail.com.Representative Based Clustering Algorithm: Part 1, K-MeansDecember 18, 2016 2 / 15
  • 3. Definition of Clustering. A formal definition: clustering can be defined as partitioning a given data set $D = \{x_i\}_{i=1}^{n}$, where each $x_i \in \mathbb{R}^d$, into $k$ sub-partitions denoted by $C = \{C_1, \ldots, C_k\}$ such that $C_i \cap C_j = \emptyset$ for $1 \le i < j \le k$ and $\bigcup_{j=1}^{k} C_j = D$. Here $k$ is a user-chosen parameter.
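One way to read the definition is that the clusters must not share points and must together cover the whole data set. A minimal sketch of that check follows; the helper name `is_partition` and the use of integer indices to stand for the points are assumptions made for illustration, not part of the slides.

```python
def is_partition(data_indices, clusters):
    """Return True if `clusters` (a list of sets) partitions `data_indices`."""
    seen = set()
    for cluster in clusters:
        if seen & cluster:
            # Some point already belongs to an earlier cluster: not disjoint.
            return False
        seen |= cluster
    # The union of all clusters must equal the full data set D.
    return seen == set(data_indices)


valid = is_partition(range(5), [{0, 1}, {2}, {3, 4}])
overlapping = is_partition(range(5), [{0, 1}, {1, 2}, {3, 4}])
incomplete = is_partition(range(5), [{0, 1}, {2}])
print(valid, overlapping, incomplete)
```

Note that any grouping that passes this check counts as a clustering under the definition, however poor it is.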
  • 4. Is the definition okay? The definition is incomplete in that it says nothing about the quality of each sub-partition. Going by the previous definition alone, the points can be grouped arbitrarily into k sub-groups. (Will that help?) Did we miss something? (Well, yes: we did not say how we represent each cluster.)
  • 5. The Representative Based Clustering. For each cluster, we try to find a representative point that summarizes the cluster. Ideally, it is the mean of the cluster. The K-Means algorithm is an example of representative based clustering.
  • 6. K Means Clustering: Definition and the Objective Function. Given the task of clustering, the first important step is to find an appropriate scoring function that measures the quality of the clustering. K-means clustering greedily finds the $k$ means $\mu_1, \ldots, \mu_k$ for the clusters $C_1, \ldots, C_k$. The sum of squared errors for each cluster $C_i$ is given as $SSE(C_i) = \sum_{x_j \in C_i} \|x_j - \mu_i\|^2$. The sum of squared errors for the clustering scheme $C$ is defined as $SSE(C) = \sum_{i=1}^{k} \sum_{x_j \in C_i} \|x_j - \mu_i\|^2$. The objective is therefore to find the clustering scheme $C^* = \arg\min_{C} \{SSE(C)\}$.
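As a small illustration, the SSE of a clustering scheme can be computed directly with NumPy. This is a sketch under stated assumptions: the function name `cluster_sse`, the integer `labels` array encoding cluster membership, and the toy data are all hypothetical.

```python
import numpy as np


def cluster_sse(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    # centroids[labels] picks, for every point, the mean of its own cluster.
    diffs = points - centroids[labels]
    return float(np.sum(diffs ** 2))


# Toy data: two clusters of two points each, every point at distance 1
# from its cluster mean.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

sse = cluster_sse(points, labels, centroids)
print(sse)  # four points, each contributing 1.0
```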
  • 8. K Means Clustering: Algorithmic Steps and Hard Assignment. As stated in [4]:
      1. At the first step, $t = 0$, randomly initialize $k$ centroids denoted by $\mu_1^t, \ldots, \mu_k^t$.
      2. Repeat: increment the iteration index $t$ by 1. Let $C_i = \emptyset$ for all $i = 1, \ldots, k$. For each $x_j$ in the data set $D$, find $i^* = \arg\min_i \{\|x_j - \mu_i^{t-1}\|^2\}$ and set $C_{i^*} = C_{i^*} \cup \{x_j\}$.
      3. Update the centroid of each cluster as $\mu_i^t = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j$.
      4. Stop if $\sum_{i=1}^{k} \|\mu_i^t - \mu_i^{t-1}\|^2 \le \epsilon$, where $\epsilon$ is a user-defined parameter.
    Notice that in each iteration of k-means, a point in $D$ is greedily assigned to exactly one cluster. This is called hard assignment.
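The steps above can be sketched in NumPy as follows. This is an illustrative sketch rather than the implementation from [4]: the function name `kmeans`, the choice to initialize the centroids from randomly picked data points, the `max_iter` safety cap, and the handling of empty clusters (the old centroid is kept) are all assumptions.

```python
import numpy as np


def kmeans(points, k, eps=1e-4, max_iter=300, seed=0):
    """Plain k-means with hard assignment; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: random initialization (assumption: pick k distinct data points).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (hard assignment).
        sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = centroids.copy()
        for i in range(k):
            members = points[labels == i]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[i] = members.mean(axis=0)
        # Step 4: stop once the total squared centroid shift is at most eps.
        shift = ((new_centroids - centroids) ** 2).sum()
        centroids = new_centroids
        if shift <= eps:
            break
    return centroids, labels


# Toy 1-d data with two obvious groups, stored as a (4, 1) array.
data = np.array([[0.0], [1.0], [10.0], [11.0]])
centroids, labels = kmeans(data, k=2)
print(sorted(centroids.ravel()), labels)
```

Because the result depends on the initial centroids, k-means can settle in a local minimum of the SSE on less friendly data, which is one reason library implementations rerun it from several seeds.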
  • 10. A Question to Ponder. Can we do something so that, instead of greedily assigning a point to exactly one cluster, we assign the point to multiple clusters? Yes, we can, but we will defer the answer for some time, as we have some maths to brush up on. Part 2 of this slide series will answer the question.
  • 11. A Few Things to Learn.
      1. Clustering is an unsupervised technique. You are not provided with any training data or any labeled data to train a system. You are trying to find some group or pattern in the data.
      2. How to decide an ideal value for the parameter k: use the elbow method. If the dimension is not too high, one can also use the Bayesian Information Criterion (BIC).
  • 12. Visualizations. I am using make_blobs from sklearn.datasets, following examples from [3], to generate a 2-d sample data set with four centers. The data is synthetic (for demo purposes); in practice, one will rarely get such well-clustered data.
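A sketch of how such a data set might be generated is shown below. Only the 2-d data and the four centers come from the slide; the sample count, the cluster spread (`cluster_std`) and the random seed are assumed values chosen for illustration.

```python
from sklearn.datasets import make_blobs

# 200 points in 2-d drawn around four centers; n_samples, cluster_std and
# random_state are illustrative assumptions, not values from the slides.
X, y = make_blobs(
    n_samples=200,
    n_features=2,
    centers=4,
    cluster_std=0.5,
    shuffle=True,
    random_state=0,
)
print(X.shape, sorted(set(y)))
```

The array `y` holds the index of the center each point was drawn from. A clustering algorithm never sees it, but it is handy for checking results on synthetic data.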
  • 13. Introducing k-Means from sklearn.cluster. This is as simple as follows:
        from sklearn.cluster import KMeans
        km = KMeans(n_clusters=4, init='random', n_init=10, max_iter=800, tol=1e-04, random_state=0)
    The important terms:
      n_clusters denotes the number of clusters you want. This is your value of k.
      init='random' means k random points will initially be selected as the centroids/means.
      n_init denotes the number of times the k-means algorithm will be run with different centroid seeds.
      max_iter=800 denotes the maximum number of iterations the KMeans algorithm will run. The scikit-learn default is 300.
      tol=1e-04 is the tolerance used to declare convergence. Remember the stopping rule $\sum_{i=1}^{k} \|\mu_i^t - \mu_i^{t-1}\|^2 \le \epsilon$; tol plays the role of $\epsilon$.
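To show the estimator in use, here is a minimal sketch that fits it on synthetic blobs and reads back the results. The data-generation settings (`n_samples`, `cluster_std`, seed) are assumptions; the KMeans arguments are the ones from the slide.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-d data with four centers (generation settings are assumptions).
X, _ = make_blobs(n_samples=200, n_features=2, centers=4,
                  cluster_std=0.5, random_state=0)

km = KMeans(n_clusters=4, init='random', n_init=10,
            max_iter=800, tol=1e-04, random_state=0)

# fit_predict runs k-means and returns the cluster index of every point.
cluster_labels = km.fit_predict(X)

print(km.cluster_centers_.shape)  # one 2-d centroid per cluster
print(cluster_labels[:10])
```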
  • 14. Deciding the Number of Clusters k.
      1. Remember that the sum of squared errors for the clustering scheme $C$ is defined as $SSE(C) = \sum_{i=1}^{k} \sum_{x_j \in C_i} \|x_j - \mu_i\|^2$. This quantity is also known as cluster distortion or cluster inertia.
      2. The KMeans class of sklearn.cluster gives you that value as km.inertia_ after fitting.
      3. Run your KMeans algorithm in a loop where, at each iteration, you choose a different value of k. Collect the cluster inertia for that value of k.
      4. Make a plot and find the elbow.
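The loop described above might look like the following sketch. The range of k values (1 to 10) and the data-generation settings are assumptions; plotting is left as a comment so the block runs without a plotting library.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-d data with four centers (generation settings are assumptions).
X, _ = make_blobs(n_samples=200, n_features=2, centers=4,
                  cluster_std=0.5, random_state=0)

k_values = list(range(1, 11))
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, init='random', n_init=10,
                max_iter=800, tol=1e-04, random_state=0)
    km.fit(X)
    # inertia_ is the SSE of the best of the n_init runs for this k.
    inertias.append(km.inertia_)

for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 2))

# To find the elbow, plot inertias against k_values, for example with
# matplotlib.pyplot.plot(k_values, inertias, marker='o').
```

The elbow is the value of k after which adding more clusters yields only small reductions in inertia.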
  • 15. The Elbow Method. Figure: Notice the sharp decline of distortion from 3 to 4. This is called the elbow. This gives us an idea that k = 4 is probably a good choice.
  • 16. In the Next Series. In the next part of this series (probably in about a week), we will introduce the Expectation Maximization algorithm with elaborate details and explanations. Till then, happy data science with Python.
  • 17. Citations.
      [1] G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer Texts in Statistics. Springer New York, 2014.
      [2] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
      [3] S. Raschka. Python Machine Learning. Packt Publishing, 2015.
      [4] M. J. Zaki and W. Meira. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press, 2014.