Lecture - 26
Parallel K-means using MapReduce on Big Data Cluster Analysis
In this section we will see the parallel k-means algorithm: that is, we will use MapReduce to perform k-means clustering for big data cluster analysis.
Refer slide time :( 0:36)
We will apply MapReduce to the k-means algorithm. The important point here is that applying MapReduce to k-means makes it a data parallel algorithm. Once we have this data parallel k-means using MapReduce, we can use it for the analysis of big data: we run this 'parallel k-means' algorithm on a big data set and extract insights from it. So, let us see in more detail how MapReduce can be applied to k-means to achieve the parallel k-means that is used for big data analysis.
Refer slide time :( 02:19)
Now, consider the steps of one k-means iteration. Step one, which we saw in the previous slide, is to classify: we assign every data point in the data set to the closest centroid, that is, to the closest cluster centre. This step is represented by the equation

$$ z_i \leftarrow \arg\min_j \lVert \mu_j - x_i \rVert^2 $$

Let us understand this equation. Given a set of centroids, for a particular data item x_i we compute the distance to each centroid and identify the centroid at minimum distance. The argmin in the formula picks out the cluster whose centroid is closest to the data point, the most similar one, the one at least distance compared with all the other centroids. Hence j is the label identified for this particular x_i, and we get z_i as the label, or centroid index, for the data point x_i. The next step in the k-means algorithm, which we saw in the previous slides, is called 'Recenter': after classifying the data points into the different clusters, for every cluster we compute the mean of its points, and that mean becomes the new centroid of the cluster,

$$ \mu_j = \frac{\sum_{i : z_i = j} x_i}{\bigl|\{\, i : z_i = j \,\}\bigr|} $$

These two steps iterate, and together they form one iteration of the MapReduce k-means algorithm. Now, let us see how the map and reduce functions can be applied to k-means. The map function handles step one: for each data point x_i it identifies the closest centroid, say z_i, and it performs this as a data parallel operation. It then emits a key-value pair in which the key is the centroid (you can also call it the label) and the value is the data point, i.e. the data sample, from the data set. Step two of the iteration, computing the recenter, is nothing but finding the mean of each cluster's points; the mean of a cluster's points becomes its new centroid, which is why we call it recenter. The reduce function performs this operation: it takes the key-value pairs emitted by the map phase, groups all the data points by their centroid, and computes their average. So, for each cluster z_i, the reduce output is the average of all the data points assigned to it.
Refer slide time :( 07:24)
Now, let us see in more detail how these map and reduce functions are implemented. The first step is classification: classify the data points in the data set to the closest centroid, as a data parallel operation. In the equation above, given the set of centroids, a data point's distance to every centroid is calculated, and the centroid at minimum distance, the one closest to the data point compared with all the others, becomes the point's label. If we perform this activity for all data points as a data parallel operation, we can specify it with a map function. The map function takes as input the set of all centroids, (μ1, μ2, ..., μk), together with a data point x_i. It computes the similarity measure, that is, the distance, between x_i and every centroid; the minimum-distance centroid, say z, is assigned to z_i; and the map function emits the pair (z_i, x_i). So the key that the map function emits is the id of the centroid closest to the data point, and the value is the data point itself.
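As an illustration, here is a minimal Python sketch of this map step. The lecture itself gives no code, so the function and variable names (closest_centroid, kmeans_map, centroids) are hypothetical, and squared Euclidean distance is assumed as the distance measure.

```python
def closest_centroid(x, centroids):
    """Return the index j of the centroid nearest to point x,
    using squared Euclidean distance as the distance measure."""
    best_j, best_dist = 0, float("inf")
    for j, mu in enumerate(centroids):
        dist = sum((m - v) ** 2 for m, v in zip(mu, x))
        if dist < best_dist:
            best_j, best_dist = j, dist
    return best_j

def kmeans_map(x, centroids):
    """Map step (classify): emit (id of closest centroid, data point).
    Each call depends only on one point, hence data parallel."""
    return (closest_centroid(x, centroids), x)
```

For example, kmeans_map((1.0, 2.0), [(0.0, 0.0), (10.0, 10.0)]) emits (0, (1.0, 2.0)), since the first centroid is the closer one.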
Refer slide time :( 10:18)
Now, let us see the reduce function, which is nothing but computing the recenter: it identifies the mean of all the data points within a particular cluster, and after a summation and a normalization it obtains the new centroid. So, for every cluster id, the recenter is computed by summing all the data points of that cluster and normalizing, which is nothing but calculating the mean. Let us see how this is performed by the reduce function. As we have seen, the map phase emits key-value pairs; after grouping, the key is a centroid id, say j, and the value is the set of all data points assigned to centroid j. The reduce function computes the summation of all data points x within the cluster and then normalizes. For this there is a for loop: it iterates through all the data points in the cluster's set, accumulates their sum according to the equation on the slide, and also keeps a count, so that at the end the count value contains the total number of points that were summed up for normalization. The reducer then performs the normalized aggregation, the sum divided by the count, which is the mean of the cluster, and it emits a key-value pair in which the key is the centroid id and the value is the normalized sum, that is, the new centroid. After the new centroids are calculated, the algorithm goes to the next iteration, and the iterations repeat. So, we have seen how map and reduce implement this k-means algorithm.
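A matching Python sketch of this reduce step, with the same caveat that the names are hypothetical; it assumes the framework has already grouped the emitted pairs, so the reducer receives a cluster id and the list of points assigned to it.

```python
def kmeans_reduce(j, points):
    """Reduce step (recenter) for cluster j: sum the points assigned
    to j, count them, and emit (j, sum / count), the new centroid."""
    total = [0.0] * len(points[0])
    count = 0
    for x in points:              # the for loop over the cluster's points
        total = [t + v for t, v in zip(total, x)]
        count += 1                # number of points summed so far
    # Normalized aggregation: divide the summed vector by the count.
    return (j, tuple(t / count for t in total))
```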
Refer slide time :( 13:46)
Let us see the same thing in the form of a picture. The data set is divided into splits, and the map function is applied to each split; each map emits key-value pairs in which the key is a centroid id and the value is a data point. The reduce function then groups by key, collecting all the data points that belong to a particular cluster, and forms the normalized aggregation, that is, it calculates their mean. So, for each centroid j it computes the mean, which is nothing but the new centroid, and in this way the k-means algorithm iterates to the next step until it converges.
Refer slide time :( 15:06)
Once these new cluster centres are identified, the algorithm again classifies all the data points accordingly, and so on: the iterations keep repeating until a stopping criterion is met, at which point the iterations stop and the algorithm terminates.
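Putting the two functions together, here is a single-machine sketch of the iteration with a stopping criterion. The dictionary stands in for the framework's shuffle-and-group-by-key, and the tolerance threshold tol is an assumed convergence test (centroids moving less than tol), not something the lecture specifies.

```python
from collections import defaultdict

def kmeans(data, centroids, max_iter=100, tol=1e-6):
    """Alternate map (classify) and reduce (recenter) until the centroids
    move less than tol, or until max_iter iterations have run."""
    for _ in range(max_iter):
        # Map phase: classify every point. A real framework would run this
        # over the data splits in parallel and group the output by key.
        groups = defaultdict(list)
        for x in data:
            z, point = kmeans_map(x, centroids)
            groups[z].append(point)
        # Reduce phase: recenter every non-empty cluster.
        new_centroids = list(centroids)
        for j, points in groups.items():
            _, new_centroids[j] = kmeans_reduce(j, points)
        # Stopping criterion: the largest squared shift of any centroid.
        shift = max(sum((a - b) ** 2 for a, b in zip(old, new))
                    for old, new in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids
```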
Refer slide time :( 15:36)
In summary, we have seen the parallel k-means algorithm, which is nothing but an implementation of k-means using MapReduce, in two steps, one using map and one using reduce. The map function implements the classification step: a data parallel operation is performed over the data points to classify each of them and emit it as a key-value pair. The second operation, the reduce operation, recomputes the means: the new centroids are computed as the means of the data points in the clusters identified in step one. After that, the algorithm goes to the next iteration, and so on. So, we have seen how these two operations, classification and recomputation, are implemented using map and reduce.
Refer slide time :( 17:05)
Now, note that k-means is an iterative algorithm, and it therefore requires an iterative version of MapReduce. In the earlier generation of Hadoop and MapReduce version 1, there were several separate implementations of iterative MapReduce. With the newer stacks that support iterative MapReduce internally, such as Spark, this iteration is no longer problematic and becomes quite straightforward; we will see in further lectures, in more detail, how an iterative version of MapReduce can be used through such means as Spark. Another practical consideration concerns the mapper, which needs to get the data points and all the centres: in the older versions of MapReduce a lot of data had to be moved in every iteration, whereas newer implementations such as Spark use an in-memory configuration to store the data across the iterations, and all of this is implemented and handled for better optimization and a more efficient implementation.
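To illustrate this in-memory, iterative style, here is a hedged PySpark sketch of the same loop. It is a sketch of the idea only, not the lecture's code and not Spark MLlib's k-means; it assumes an existing SparkContext sc, the closest_centroid helper from the map sketch above, initial centroids and max_iter already defined, and a placeholder input path.

```python
# Assumed already defined: sc (SparkContext), centroids (list of tuples),
# max_iter, and the closest_centroid helper from the map sketch above.
points = sc.textFile("hdfs:///path/to/data.txt") \
           .map(lambda line: tuple(float(v) for v in line.split())) \
           .cache()                        # keep the points in memory

for _ in range(max_iter):
    # Map: emit (closest centroid id, (point, 1)); reduceByKey then sums
    # the points and the counts per cluster id in a single shuffle.
    sums = points.map(lambda x: (closest_centroid(x, centroids), (x, 1))) \
                 .reduceByKey(lambda a, b:
                     (tuple(p + q for p, q in zip(a[0], b[0])), a[1] + b[1]))
    # Recenter on the driver: mean = sum / count for every cluster
    # (this sketch assumes no cluster ends up empty).
    centroids = [tuple(s / n for s in total)
                 for _, (total, n) in sorted(sums.collect())]
```

The point of the sketch is the .cache() call: the points stay in memory across iterations, so only the small centroid list travels between the driver and the workers each round.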
Refer slide time :( 18:36)
Conclusion: in this lecture we have given an overview of the parallel k-means algorithm as an implementation of MapReduce, which is one of the most important unsupervised machine learning methods for big data analytics applications. Thank you.
