k-means Clustering
Papers We Love
Bucharest Chapter
February 22, 2016
I am Adrian Florea
Architect Lead @ IBM Bucharest Software Lab
Hello!
What is clustering?
◉branch of Machine Learning (unsupervised learning, i.e., finding
underlying patterns in the data)
◉method for identifying data items that closely resemble one
another, assembling them into clusters [ODS]
Clustering applications
◉Customer segmentation: partition the customer database into several groups, where customers in each group have similar purchasing history and demographics
◉Google News: group together news stories covering the same event
Definitions and notations
◉data to be clustered: N data points, $D = \{x_1, \dots, x_N\}$
◉iteratively refined clustering solution: cluster means $C = \{c_1, \dots, c_k\}$
◉cluster membership vector $m = (m_1, \dots, m_N)$, where $m_i$ is the cluster of $x_i$
◉closeness cost function: $KM(D, C) = \sum_{i=1}^{N} \| x_i - c_{m_i} \|^2$
◉Minkowski distance: $d_p(x, y) = \left( \sum_{j=1}^{d} |x_j - y_j|^p \right)^{1/p}$
p=1 (Manhattan), p=2 (Euclidean)
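As a quick illustration (ours, not code from the talk), a minimal Python sketch of the Minkowski distance; the function name minkowski and the sample points are illustrative:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p between vectors x and y."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(x, y, 1))  # 7.0 -> Manhattan distance
print(minkowski(x, y, 2))  # 5.0 -> Euclidean distance
```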
Voronoi diagrams
[Figure: Voronoi diagrams under the Euclidean and the Manhattan distance]
K-means algorithm
◉input: dataset D, number of clusters k
◉output: cluster solution C, cluster membership m
◉initialize C by randomly choosing k data points from D
◉repeat
reassign points in D to the closest cluster mean
update m
update each mean in C to the mean of the points in its cluster
◉until convergence of KM(D, C)
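A minimal NumPy sketch of the loop above (our illustration, not code from the talk); kmeans, max_iter, and seed are our names, and convergence is detected by the means no longer moving:

```python
import numpy as np

def kmeans(D, k, max_iter=100, seed=0):
    """Lloyd's algorithm: returns cluster means C and membership vector m."""
    rng = np.random.default_rng(seed)
    # initialize C by randomly choosing k distinct data points from D
    C = D[rng.choice(len(D), size=k, replace=False)]
    for _ in range(max_iter):
        # reassign each point in D to the closest cluster mean, updating m
        dists = np.linalg.norm(D[:, None, :] - C[None, :, :], axis=2)
        m = dists.argmin(axis=1)
        # update each mean in C to the mean of the points in its cluster
        # (an empty cluster keeps its old mean)
        C_new = np.array([D[m == j].mean(axis=0) if np.any(m == j) else C[j]
                          for j in range(k)])
        # stop once the means stop moving, i.e. KM(D, C) has converged
        if np.allclose(C_new, C):
            break
        C = C_new
    return C, m
```

For instance, kmeans(D, 2) on an (N, 2) array walks through exactly the N=12, k=2 example on the next slide.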
K-means iterations example
N=12, k=2
1. randomly choose 2 clusters
2. calculate cluster means
3. reassign points to the closest cluster mean
4. update cluster means
5. loop if needed
Poor initialization/Poor clustering
[Figure: three runs, each showing an initial partition and the final clustering it converges to: Initial (I)/Final (I), Initial (II)/Final (II), Initial (III)/Final (III)]
Pros
◉simple
◉common in practice
◉easy to adapt
◉relatively fast
Cons
◉sensitive to initial partitions
◉optimal k is difficult to choose
◉restricted to data that has the notion of a center
◉cannot handle non-convex clusters well
◉does not identify outliers (because the mean is not a “robust” statistic)
Tips & tricks
◉run the algorithm multiple times with different initial centroids
and select the best result (see the sketch after this list)
◉if possible, use knowledge about the dataset when choosing k
◉trying several values of k and picking one based on the closeness cost
function alone is naive
◉use a distance measure appropriate for the dataset
◉use k-means together with another algorithm
◉exploit the triangle inequality to avoid comparing each data
point with all centroids
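A hedged sketch of the first tip, reusing the kmeans function sketched earlier; kmeans_cost mirrors the KM(D, C) cost from the definitions slide, and best_of is our name:

```python
import numpy as np

def kmeans_cost(D, C, m):
    """Closeness cost KM(D, C): total squared distance to assigned means."""
    return float(np.sum((D - C[m]) ** 2))

def best_of(D, k, restarts=10):
    """Run k-means with several different initializations; keep the best."""
    best_cost, best_C, best_m = np.inf, None, None
    for seed in range(restarts):
        C, m = kmeans(D, k, seed=seed)  # the kmeans sketch shown earlier
        cost = kmeans_cost(D, C, m)
        if cost < best_cost:
            best_cost, best_C, best_m = cost, C, m
    return best_C, best_m
```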
Tips & tricks - continued
◉recursively split the least compact cluster into 2
clusters by using 2-means
◉remove outliers in preprocessing (anomaly
detection)
◉eliminate small clusters or merge close clusters
at postprocessing
◉in case of empty clusters, reinitialize their
centroids or steal points from the largest cluster (see the sketch below)
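A sketch of the empty-cluster trick from the last bullet, under the same conventions as the earlier kmeans sketch (C as an array of means, m as the membership vector); fix_empty_clusters is our name:

```python
import numpy as np

def fix_empty_clusters(D, C, m):
    """Refill empty clusters by stealing points from the largest cluster."""
    k = len(C)
    for j in range(k):
        if not np.any(m == j):  # cluster j ended up empty
            largest = np.bincount(m, minlength=k).argmax()
            idx = np.where(m == largest)[0]
            # steal the point of the largest cluster farthest from its mean
            far = idx[np.argmax(np.linalg.norm(D[idx] - C[largest], axis=1))]
            m[far] = j
            C[j] = D[far]
    return C, m
```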
Tools & frameworks
◉Frontline Systems XLMiner (Excel Add-in)
◉SciPy k-means clustering and vector
quantization modules (scipy.cluster.vq.kmeans; usage sketch after this list)
◉kmeans function in the R stats package
◉Azure Machine Learning
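For the SciPy module listed above, a minimal usage sketch on toy data (the dataset and k=2 are illustrative):

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2))        # toy dataset; replace with real data

obs = whiten(data)                      # scale each feature to unit variance
centroids, distortion = kmeans(obs, 2)  # k=2: returns means and a distortion score
labels, _ = vq(obs, centroids)          # assign each point to its closest mean
```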
Bibliography
◉ H.-H. Bock, "Origins and extensions of the k-means algorithm in cluster analysis", JEHPS, Vol. 4, No. 2, December 2008
◉ D. Chappell, "Understanding Machine Learning", PluralSight.com, 4 February 2016
◉ J. Ghosh, A. Liu, "K-Means", pp. 21-35 in X. Wu, V. Kumar (Eds.), "The Top Ten Algorithms in Data Mining", Chapman & Hall/CRC, 2009
◉ G.J. Hamerly, "Learning structure and concepts in data through data clustering", PhD Thesis, University of California, San Diego, 2003
◉ J. Han, M. Kamber, J. Pei, "Chapter 10. Cluster Analysis: Basic Concepts and Methods" in "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2011
◉ S.P. Lloyd, "Least Squares Quantization in PCM", IEEE Transactions on Information Theory, Vol. 28, No. 2, 1982
◉ J. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations", 5th Berkeley Symposium, 1967
◉ J. McCaffrey, "Machine Learning Using C# Succinctly", Syncfusion, 2014
◉ [ODS] G. Upton, I. Cook, "A Dictionary of Statistics", 3rd Ed., Oxford University Press, 2014
◉ "k-means clustering", Wikipedia
◉ "Voronoi diagram", Wikipedia
Any questions?
You can find me at
Thanks!

"k-means-clustering" presentation @ Papers We Love Bucharest

  • 1.
    k-means Clustering Papers WeLove Bucharest Chapter February 22, 2016
  • 2.
    I am AdrianFlorea Architect Lead @ IBM Bucharest Software Lab Hello!
  • 3.
    What is clustering? ◉branchof Machine Learning (unsupervised learning i.e. find underlying patterns in the data) ◉method for identifying data items that closely resemble one another, assembling them into clusters [ODS]
  • 4.
  • 5.
  • 6.
    Definitions and notations ◉datato be clustered (N data points) ◉iteratively refined clustering solution ◉cluster membership vector ◉closeness cost function ◉Minkowski distance p=1 (Manhattan), p=2 (Euclidian)
  • 7.
  • 8.
    K-means algorithm ◉input: datasetD, number of clusters k ◉output: cluster solution C, cluster membership m ◉initialize C by randomly choose k data points sets from D ◉repeat reasign points in D to closest cluster mean update m update C such that is mean of points in cluster, ◉until convergence of KM(D, C)
  • 9.
    K-means iterations example N=12,k=2 randomly choose 2 clusters calculate cluster means reassign points to closest cluster mean update cluster means loop if needed
  • 10.
    Poor initialization/Poor clustering Initial(I) Final (I) Initial (II) Final (II) Initial (III) Final (III)
  • 11.
    Pros ◉simple ◉common in practice ◉easyto adapt ◉relatively fast ◉sensitive to initial partitions ◉optimal k is difficult to choose ◉restricted to data which has the notion of a center ◉cannot handle well non- convex clusters ◉does not identify outliers (because mean is not a “robust” statistic function) Cons
  • 12.
    Tips & tricks ◉runthe algorithm multiple times with different initial centroids and select the best result ◉if possible, use the knowledge about the dataset in choosing k ◉trying several k and choosing the one based on closeness cost function is naive ◉use a more appropriate distance measure for the dataset ◉use k-means together with another algorithm ◉Exploit the triangular inequality to avoid to compare each data point with all centroids
  • 13.
    Tips & tricks- continued ◉recursively split the least compact cluster in 2 clusters by using 2-means ◉remove outliers in preprocessing (anomaly detection) ◉eliminate small clusters or merge close clusters at postprocessing ◉in case of empty clusters reinitialize their centroids or steal points from the largest cluster
  • 14.
    Tools & frameworks ◉FrontlineSystems XLMiner (Excel Add-in) ◉SciPy K-means Clustering and Vector Quantization Modules (scipy.cluster.vq.kmeans) ◉stats package in R ◉Azure Machine Learning
  • 15.
    Bibliography ◉ H.-H. Bock,"Origins and extensions of the k-means algorithm in cluster analysis", JEHPS, Vol. 4, No. 2, December 2008 ◉ D. Chappell, “Understanding Machine Learning”, PluralSight.com, 4 February 2016 ◉ J. Ghosh, A. Liu, "K-Means", pp. 21-35 in X. Wu, V. Kumar (Eds.), "The Top Ten Algorithms in Data Mining", Chapman & Hall/CRC, 2009 ◉ G.J. Hamerly, "Learning structure and concepts in data through data clustering", PhD Thesis, University of California, San Diego, 2003 ◉ J. Han, M. Kamber, J. Pei, "Chapter 10. Cluster Analysis: Basic Concepts and Methods" in "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2011 ◉ S.P. Lloyd, "Least Squares Quantization in PCM", IEEE Transactions on Information Theory, Vol. 28, No. 2, 1982 ◉ J. MacQueen, “Some Methods for Classification and Analysis of Multivariate Observations”, 5th Berkeley Symposium, 1967 ◉ J. McCaffrey, "Machine Learning Using C# Succinctly", Syncfusion, 2014 ◉ [ODS] G. Upton, I. Cook, “A Dictionary of Statistics”, 3rd Ed., Oxford University Press, 2014 ◉ ***, “k-means clustering”, Wikipedia ◉ ***, “Voronoi diagram”, Wikipedia
  • 16.
    Any questions ? Youcan find me at Thanks!

Editor's Notes

  • #5 - partition the customer database into several groups, where customers in each group have similar purchasing history and demographics