1. How to write a MapReduce Version of K-Means Clustering
2. Recall
iterate {
Compute the distance from each point to each of the k centers
Assign each point to its nearest center
Compute the average of the points assigned to each center
Replace the k centers with the new averages
}
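The loop above can be sketched as a plain sequential implementation. This is a minimal illustration, not the course's reference code; the `init` parameter (explicit starting centers) and the squared-distance convergence tolerance `tol` are assumptions added so the sketch is deterministic and self-contained.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    """Component-wise average of a non-empty list of points."""
    n = len(pts)
    return tuple(sum(c) / n for c in zip(*pts))

def kmeans(points, k, init=None, tol=1e-6, max_iter=100):
    """Sequential k-means matching the iterate { ... } loop above."""
    centers = init if init is not None else random.sample(points, k)
    for _ in range(max_iter):
        # Compute the distance from each point to each center
        # and assign each point to its nearest center.
        clusters = {i: [] for i in range(k)}
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # Replace each center with the average of its assigned points
        # (an empty cluster keeps its old center).
        new_centers = [mean(clusters[i]) if clusters[i] else centers[i]
                       for i in range(k)]
        moved = max(dist2(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if moved < tol:
            break
    return centers
```

With two well-separated blobs and explicit initial centers, the loop converges in two iterations to the blob means.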
3. Recall: Parallelizing k-means
• To parallelize k-means, we want a scheme in which each point in the data
set can be operated on independently.
• In the first step of each k-means iteration, we compute the distance from
each point to each of the k cluster centers and assign the point to the
cluster with the minimum distance.
• Thus there is a small amount of shared data – namely the cluster centers.
• However, this is small in comparison to the number of data points.
• So the parallelization scheme duplicates the cluster centers across
workers; once that is done, each data point can be operated on
independently of the others and we gain a nice speedup.
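The duplication scheme above can be sketched with a thread pool: the points are split into partitions, every worker receives its own copy of the (small) centers list, and each partition is assigned independently. The helper names (`assign_partition`, `parallel_assign`) and the round-robin chunking are illustrative assumptions, not part of the course material.

```python
from concurrent.futures import ThreadPoolExecutor

def assign_partition(partition, centers):
    """Assign every point in one partition to its nearest center,
    using this worker's private copy of the centers."""
    def nearest(pt):
        return min(range(len(centers)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(pt, centers[i])))
    return [(nearest(pt), pt) for pt in partition]

def parallel_assign(points, centers, n_workers=4):
    """Duplicate the centers to every worker and assign each
    partition of points independently."""
    chunks = [points[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(lambda ch: assign_partition(ch, centers), chunks)
    return [pair for chunk in results for pair in chunk]
```

Only the centers are copied per worker; the (much larger) point set is partitioned, which is why the duplication cost stays small.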
4. K-means using MapReduce
• It is necessary to maintain a small amount of shared data, the cluster
centers.
• Thus when we partition points among MapReduce nodes, we also distribute
a copy of the cluster centers.
• This results in a small amount of data duplication, but it is minimal.
• In this way each of the points can be operated on independently.
• Our map phase takes in points in the data set and outputs one (ClusterID,
Point) pair for each point, where the ClusterID is the integer ID of the cluster
which is closest to the point.
• During our reduce phase, the outputs of the map phase are grouped by
ClusterID, and for each ClusterID the centroid of the points associated with
that ClusterID is calculated.
• The output of our reduce phase is a set of (ClusterID, Centroid) pairs,
which represent the newly calculated cluster centers.
• Each iteration of the algorithm is structured as a single MapReduce job,
driven by our library.
• After each job, our library reads the output, determines whether
convergence has been reached by calculating how far the cluster centers
have moved, and then runs another MapReduce job if necessary.
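The map phase, reduce phase, and driver loop described above can be sketched in a few functions. This is a single-process illustration of the dataflow, not the library the slides refer to; the function names and the squared-distance convergence tolerance `tol` are assumptions.

```python
from collections import defaultdict

def map_phase(points, centers):
    """Map: emit one (ClusterID, Point) pair per point, where ClusterID
    is the index of the nearest center (the broadcast shared data)."""
    for pt in points:
        cid = min(range(len(centers)),
                  key=lambda i: sum((a - b) ** 2
                                    for a, b in zip(pt, centers[i])))
        yield cid, pt

def reduce_phase(pairs):
    """Reduce: group pairs by ClusterID and emit (ClusterID, Centroid)
    pairs, where the centroid is the mean of the grouped points."""
    groups = defaultdict(list)
    for cid, pt in pairs:
        groups[cid].append(pt)
    return {cid: tuple(sum(c) / len(pts) for c in zip(*pts))
            for cid, pts in groups.items()}

def driver(points, centers, tol=1e-6, max_iter=100):
    """Driver: run one MapReduce job per iteration, then check how far
    the centers moved to decide whether another job is needed."""
    for _ in range(max_iter):
        new = reduce_phase(map_phase(points, centers))
        # An empty cluster keeps its previous center.
        new_centers = [new.get(i, centers[i]) for i in range(len(centers))]
        moved = max(sum((a - b) ** 2 for a, b in zip(c, n))
                    for c, n in zip(centers, new_centers))
        centers = new_centers
        if moved < tol:
            break
    return centers
```

In a real MapReduce deployment, `map_phase` would run on each partition with its own copy of `centers`, and the framework's shuffle would perform the group-by-ClusterID step that `reduce_phase` does here with a dictionary.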
5. End of session
Day – 4: How to write a MapReduce Version of K-Means Clustering