K-Means Clustering
Scale! 
• Scale to large datasets 
– Hadoop MapReduce implementations that scale linearly 
with data. 
– Fast sequential algorithms whose runtime doesn’t depend 
on the size of the data 
– Goal: To be as fast as possible for any algorithm 
• Scalable to support your business case 
– Apache License, Version 2.0 
• Scalable community 
– Vibrant, responsive and diverse 
– Come to the mailing list and find out more
History of Mahout 
• Summer 2007 
– Developers needed scalable ML 
– Mailing list formed 
• Community formed 
– Apache contributors 
– Academia & industry 
– Lots of initial interest 
• Project formed under Apache Lucene 
– January 25, 2008
Where is ML Used Today 
• Internet search clustering 
• Knowledge management systems 
• Social network mapping 
• Taxonomy transformations 
• Marketing analytics 
• Recommendation systems 
• Log analysis & event filtering 
• SPAM filtering, fraud detection
Clustering 
• Call it fuzzy grouping based on a notion of similarity
Mahout Clustering 
• Plenty of Algorithms: K-Means, 
Fuzzy K-Means, Mean Shift, 
Canopy, Dirichlet 
• Group similar looking objects 
• Notion of similarity: Distance measure: 
– Euclidean 
– Cosine 
– Tanimoto 
– Manhattan
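The four distance measures listed above can be sketched in plain Java on dense vectors. This is an illustration only: the class and method names (DistanceSketch, euclidean, and so on) are assumptions, not Mahout's DistanceMeasure classes, and all-zero vectors are not handled.

```java
/** Illustrative distance measures on dense vectors (hypothetical names, not Mahout's API). */
public class DistanceSketch {

    /** Dot product of two equal-length vectors. */
    public static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Straight-line distance: sqrt(sum of (a_i - b_i)^2). */
    public static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** City-block distance: sum of |a_i - b_i|. */
    public static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    /** 1 - cos(angle between a and b); depends on direction only, not on vector length. */
    public static double cosine(double[] a, double[] b) {
        return 1.0 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    /** Tanimoto (extended Jaccard) distance: 1 - a.b / (|a|^2 + |b|^2 - a.b). */
    public static double tanimoto(double[] a, double[] b) {
        double ab = dot(a, b);
        return 1.0 - ab / (dot(a, a) + dot(b, b) - ab);
    }

    public static void main(String[] args) {
        double[] p = {5.0, 3.0};
        double[] q = {1.0, 0.0};
        System.out.println("euclidean = " + euclidean(p, q)); // prints "euclidean = 5.0"
        System.out.println("manhattan = " + manhattan(p, q)); // prints "manhattan = 7.0"
        System.out.println("cosine    = " + cosine(p, q));
        System.out.println("tanimoto  = " + tanimoto(p, q));
    }
}
```

For text vectors, cosine distance is a common choice because it ignores document length and compares only the direction of the vectors.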
Classification 
• Predicting the type of a new object based on its features 
• The types are predetermined 
– Example types: Dog, Cat
Mahout Classification 
• Plenty of algorithms 
– Naïve Bayes 
– Complementary Naïve Bayes 
– Random Forests 
– Logistic Regression (SGD) 
– Support Vector Machines (patch ready) 
• Learn a model from manually classified data 
• Predict the class of a new object based on its 
features and the learned model
Understanding data - Vectors 
[Figure: the point (5, 3) plotted on X and Y axes; X = 5, Y = 3] 
• The vector denoted by point (5, 3) is simply 
Array([5, 3]) or HashMap([0 => 5], [1 => 3])
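The two representations can be sketched in Java as a dense array and a sparse map from dimension index to value. The names VectorSketch, get and toDense are assumptions made for this illustration; Mahout has its own vector classes, which are not shown here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Dense versus sparse storage of the same vector (hypothetical helper names). */
public class VectorSketch {

    /** Reads dimension i of a sparse vector; dimensions that are not stored count as 0. */
    public static double get(Map<Integer, Double> sparse, int i) {
        return sparse.getOrDefault(i, 0.0);
    }

    /** Expands a sparse vector into a dense array with the given number of dimensions. */
    public static double[] toDense(Map<Integer, Double> sparse, int dimensions) {
        double[] dense = new double[dimensions];
        for (Map.Entry<Integer, Double> entry : sparse.entrySet()) {
            dense[entry.getKey()] = entry.getValue();
        }
        return dense;
    }

    public static void main(String[] args) {
        double[] dense = {5.0, 3.0};                   // Array([5, 3])
        Map<Integer, Double> sparse = new HashMap<>(); // HashMap([0 => 5], [1 => 3])
        sparse.put(0, 5.0);
        sparse.put(1, 3.0);

        System.out.println(Arrays.toString(dense));              // prints "[5.0, 3.0]"
        System.out.println(Arrays.toString(toDense(sparse, 2))); // prints "[5.0, 3.0]"
        System.out.println(get(sparse, 7));                      // prints "0.0"
    }
}
```

The sparse form pays off when most dimensions are zero, as is the case for text documents with one dimension per word.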
Representing Vectors – The basics 
• Now think 3, 4, 5, ….. n-dimensional 
• Think of a document as a bag of words. 
“she sells sea shells on the sea shore” 
• Now map them to integers 
she => 0 
sells => 1 
sea => 2 
and so on 
• The resulting vector is [1.0, 1.0, 2.0, …]
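The word-to-integer mapping above can be sketched as a small dictionary-based vectorizer. This is a minimal illustration with assumed names (BagOfWordsSketch, buildDictionary, vectorize); it splits on whitespace only and does no stemming or stop-word removal.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Dictionary-based bag-of-words vectorizer (hypothetical names, whitespace tokenizer). */
public class BagOfWordsSketch {

    /** Gives each distinct word the next free integer, in order of first appearance. */
    public static Map<String, Integer> buildDictionary(String text) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (String word : text.trim().toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                dictionary.putIfAbsent(word, dictionary.size());
            }
        }
        return dictionary;
    }

    /** Counts how often each dictionary word occurs in the text (term frequency). */
    public static double[] vectorize(String text, Map<String, Integer> dictionary) {
        double[] vector = new double[dictionary.size()];
        for (String word : text.trim().toLowerCase().split("\\s+")) {
            Integer index = dictionary.get(word);
            if (index != null) {
                vector[index] += 1.0; // words outside the dictionary are ignored
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        String doc = "she sells sea shells on the sea shore";
        Map<String, Integer> dictionary = buildDictionary(doc);
        // prints "{she=0, sells=1, sea=2, shells=3, on=4, the=5, shore=6}"
        System.out.println(dictionary);
        // prints "[1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0]"
        System.out.println(Arrays.toString(vectorize(doc, dictionary)));
    }
}
```

The value 2.0 in dimension 2 reflects that "sea" occurs twice in the sentence.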
Vectors 
• Imagine one dimension for each word. 
• Each dimension is also called a feature 
• Two techniques 
– Dictionary Based 
– Randomizer Based
K-Means clustering 
• K-means clustering is a classical clustering algorithm that uses an 
expectation-maximization-like technique to partition a number of data points 
into k clusters. 
• K-means clustering is commonly used for a number of classification 
applications. 
• It is often run on extremely large data sets, on the order of hundreds of 
millions of points and tens of gigabytes of data. 
• Because k-means is run on such large data sets, and because of certain 
characteristics of the algorithm, it is a good candidate for parallelization.
K-Means clustering 
• At its simplest, the algorithm is given as inputs a set of n d-dimensional points 
and a number of desired clusters k. 
• For the purposes of this explanation, we will consider points in a Euclidean 
space. 
• However, the clustering algorithm will work in any space provided a distance 
metric is given as input as well. 
• Initially, k points are chosen as cluster centers. 
• There is no fixed way to determine these initial points; instead, a number of 
heuristics are used. Most commonly, k points are chosen at random from the 
sample of n points. 
• Once the k initial centers are chosen, the distance from every point in the set 
to each of the k centers is calculated, and each point is assigned to the cluster 
center nearest to it. 
• Using this assignment of points to cluster centers, each cluster center is 
recalculated as the centroid of its member points. 
• This process is then iterated until convergence is reached. 
• That is, points are reassigned to centers and centroids are recalculated until 
the k cluster centers shift by less than some delta value. 
• Because of the way k-means works, it has the weakness that it can converge to a 
local optimum rather than the global optimum. 
• This tends to happen when the initial k center points are chosen badly.
Pseudocode 
iterate until the k-centers stop moving { 
    Compute the distance from every point to every k-center 
    Assign each point to its nearest k-center 
    Compute the average of the points assigned to each k-center 
    Replace each k-center with its new average 
}
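The pseudocode above can be turned into a small sequential Java sketch. It is an illustration under stated assumptions, not Mahout's implementation: the names (KMeansSketch, cluster, nearest) are hypothetical, the distance is Euclidean, the initial centers are passed in explicitly rather than chosen at random so the run is reproducible, and a cluster that receives no points simply keeps its old center.

```java
import java.util.Arrays;

/** Sequential k-means sketch with Euclidean distance (hypothetical names). */
public class KMeansSketch {

    public static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Index of the center closest to the point. */
    public static int nearest(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(point, centers[c]) < squaredDistance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Iterates until no center shifts by delta or more, or maxIterations is reached. */
    public static double[][] cluster(double[][] points, double[][] initialCenters,
                                     double delta, int maxIterations) {
        int k = initialCenters.length;
        int d = points[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = initialCenters[c].clone();
        }
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            // Assignment step: add each point to the running sum of its nearest center.
            double[][] sums = new double[k][d];
            int[] counts = new int[k];
            for (double[] point : points) {
                int c = nearest(point, centers);
                counts[c]++;
                for (int j = 0; j < d; j++) {
                    sums[c][j] += point[j];
                }
            }
            // Update step: move each center to the centroid of its member points.
            double maxShift = 0.0;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // empty cluster: keep the old center
                }
                double[] updated = new double[d];
                for (int j = 0; j < d; j++) {
                    updated[j] = sums[c][j] / counts[c];
                }
                maxShift = Math.max(maxShift, Math.sqrt(squaredDistance(centers[c], updated)));
                centers[c] = updated;
            }
            if (maxShift < delta) {
                break; // converged
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1, 2}, {2, 1}, {8, 8}, {8, 9}, {9, 8}};
        double[][] initial = {points[0], points[3]};
        // Centers end up near (1.33, 1.33) and (8.33, 8.33).
        System.out.println(Arrays.deepToString(cluster(points, initial, 0.001, 10)));
    }
}
```

Starting both initial centers inside the same group of points is an easy way to observe how a poor initialization changes the result.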
K-Means clustering 
[Figures: successive k-means iterations with three cluster centers c1, c2, c3]
Parallelizing k-means 
• In order to parallelize k-means, we want to come up with a scheme where 
we can operate on each point in the data set independently. 
• In the first step of the iterative process of k-means, it is necessary to 
compute the distance from each point to each of the k cluster centers and 
assign that point to the cluster with the minimum distance. 
• Thus, there is a small amount of shared data – namely the cluster centers. 
• However, this is small in comparison to the number of data points. 
• So the parallelization scheme involves duplicating the cluster centers. 
Once they are duplicated, each data point can be operated on 
independently of the others, and we can gain a nice speedup.
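The scheme can be imitated on a single machine with Java parallel streams. This is only an analogy, assuming hypothetical names (ParallelKMeansStep, iterate) and threads in one JVM instead of Hadoop tasks. The "map" side assigns each point to its nearest center independently, partial sums are merged much as a combiner would merge them, and the "reduce" side averages per center. The centers array is the only shared data, and it is read-only.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

/** One k-means iteration in map/reduce style using parallel streams (hypothetical names). */
public class ParallelKMeansStep {

    /** Index of the center with the smallest squared Euclidean distance to the point. */
    public static int nearest(double[] point, double[][] centers) {
        int best = 0;
        double bestDistance = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double sum = 0.0;
            for (int j = 0; j < point.length; j++) {
                double diff = point[j] - centers[c][j];
                sum += diff * diff;
            }
            if (sum < bestDistance) {
                bestDistance = sum;
                best = c;
            }
        }
        return best;
    }

    /** Copies the point and appends a count of 1, giving a partial (sum, count) record. */
    static double[] withCount(double[] point) {
        double[] record = Arrays.copyOf(point, point.length + 1);
        record[point.length] = 1.0;
        return record;
    }

    /** Merges two partial (sum, count) records element by element. */
    static double[] add(double[] a, double[] b) {
        double[] merged = new double[a.length];
        for (int j = 0; j < a.length; j++) {
            merged[j] = a[j] + b[j];
        }
        return merged;
    }

    public static double[][] iterate(double[][] points, double[][] centers) {
        int d = points[0].length;
        // "Map" and "combine": key = nearest center index, value = partial sum and count.
        Map<Integer, double[]> partials = Arrays.stream(points).parallel()
                .collect(Collectors.toMap(
                        p -> nearest(p, centers),
                        ParallelKMeansStep::withCount,
                        ParallelKMeansStep::add));
        // "Reduce": new center = sum / count; a center without points is left unchanged.
        double[][] updated = new double[centers.length][];
        for (int c = 0; c < centers.length; c++) {
            double[] record = partials.get(c);
            if (record == null) {
                updated[c] = centers[c].clone();
                continue;
            }
            updated[c] = new double[d];
            for (int j = 0; j < d; j++) {
                updated[c][j] = record[j] / record[d];
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1, 2}, {2, 1}, {8, 8}, {8, 9}, {9, 8}};
        double[][] centers = {{1, 1}, {8, 8}};
        System.out.println(Arrays.deepToString(iterate(points, centers)));
    }
}
```

Because only a (sum, count) record per center crosses the map/reduce boundary, the data exchanged stays proportional to k rather than to the number of points.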
Step 1 – Convert dataset into a Hadoop Sequence File 
• SGML files 
– $ mkdir -p /reuters 
– $ cd reuters-sgm && tar xzf ../reuters21578.tar.gz && cd .. && cd .. 
• Extract content from SGML to text file 
– $ mahout org.apache.lucene.benchmark.utils.ExtractReuters 
reuters reuters/out
Step 1 – Convert dataset into a Hadoop Sequence File 
• Use seqdirectory tool to convert text file into a 
Hadoop Sequence File 
– $ mahout seqdirectory \
    -i reuters/out \
    -o reuters/seqdir \
    -c UTF-8 \
    -chunk 5
Hadoop Sequence File 
• Sequence of Records, where each record is a <Key, Value> pair 
– <Key1, Value1> 
– <Key2, Value2> 
– … 
– … 
– … 
– <Keyn, Valuen> 
• Key and Value need to be of class org.apache.hadoop.io.Text 
– Key = Record name or File name or unique identifier 
– Value = Content as UTF-8 encoded string 
• TIP: Dump data from your database directly into Hadoop Sequence Files
Writing to Sequence Files 
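import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.io.SequenceFile; 
import org.apache.hadoop.io.Text; 

// Writes one <Text, Text> record per document: key = document id, value = content. 
// Assumption: `documents` is a List of objects exposing getId() and getContent(). 
// Newer Hadoop releases deprecate this constructor in favor of SequenceFile.createWriter. 
Configuration conf = new Configuration(); 
FileSystem fs = FileSystem.get(conf); 
Path path = new Path("testdata/part-00000"); 
SequenceFile.Writer writer = new SequenceFile.Writer( 
    fs, conf, path, Text.class, Text.class); 
for (int i = 0; i < MAX_DOCS; i++) { 
    writer.append(new Text(documents.get(i).getId()), 
        new Text(documents.get(i).getContent())); 
} 
writer.close();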
Generate Vectors from Sequence Files 
• Steps 
1. Compute Dictionary 
2. Assign integers for words 
3. Compute feature weights 
4. Create vector for each document using word-integer mapping 
and feature-weight 
Or 
• Simply run $ mahout seq2sparse
Generate Vectors from Sequence Files 
• $ mahout seq2sparse \
    -i reuters/seqdir/ \
    -o reuters/sparse 
• Important options 
– Ngrams 
– Lucene Analyzer for tokenizing 
– Feature Pruning 
• Min support 
• Max Document Frequency 
• Min LLR (for ngrams) 
– Weighting Method 
• TF vs. TF-IDF 
• lp-Norm 
• Log normalize length
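To make the weighting options concrete, here is a sketch of textbook TF-IDF and L2 (p = 2) length normalization. It is an assumption-laden illustration: the names TfIdfSketch, tfIdf and l2Normalize are hypothetical, and the exact formula seq2sparse applies may differ from the textbook form tf * ln(N / df) used here.

```java
import java.util.Arrays;

/** Textbook TF-IDF weighting and L2 normalization (hypothetical names). */
public class TfIdfSketch {

    /**
     * Textbook TF-IDF: term frequency times ln(numDocs / docFreq).
     * A term that occurs in every document gets weight 0.
     */
    public static double tfIdf(int termFreq, int docFreq, int numDocs) {
        return termFreq * Math.log((double) numDocs / docFreq);
    }

    /** Scales a vector to unit Euclidean length so long and short documents compare fairly. */
    public static double[] l2Normalize(double[] vector) {
        double sumOfSquares = 0.0;
        for (double value : vector) {
            sumOfSquares += value * value;
        }
        double norm = Math.sqrt(sumOfSquares);
        if (norm == 0.0) {
            return vector.clone(); // nothing to scale in an all-zero vector
        }
        double[] normalized = new double[vector.length];
        for (int i = 0; i < vector.length; i++) {
            normalized[i] = vector[i] / norm;
        }
        return normalized;
    }

    public static void main(String[] args) {
        // A word found in all 10 documents carries no information: weight 0.
        System.out.println(tfIdf(2, 10, 10)); // prints "0.0"
        // A word found in 1 of 100 documents is weighted much higher.
        System.out.println(tfIdf(2, 1, 100));
        System.out.println(Arrays.toString(l2Normalize(new double[]{3.0, 4.0})));
    }
}
```

Under this weighting, very common words shrink toward zero on their own, which complements the max document frequency pruning option.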
Start K-Means clustering 
• $ mahout kmeans \
    -i reuters/sparse/tfidf/ \
    -c reuters-kmeans-clusters \
    -o reuters-kmeans \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 \
    -x 10 -k 20 -ow 
• Things to watch out for 
– Number of iterations 
– Convergence delta 
– Distance Measure 
– Creating assignments
Inspect clusters 
• $ bin/mahout clusterdump \
    -s reuters-kmeans/clusters-9 \
    -d reuters/sparse/dictionary.file-0 \
    -dt sequencefile -b 100 -n 20 
Typical output 
:VL-21438{n=518 c=[0.56:0.019, 00:0.154, 00.03:0.018, 00.18:0.018, … 
Top Terms: 
iran => 3.1861672217321213 
strike => 2.567886952727918 
iranian => 2.133417966282966 
union => 2.116033937940266 
said => 2.101773806290277 
workers => 2.066259451354332 
gulf => 1.9501374918521601 
had => 1.6077752463145605 
he => 1.5355078004962228
FAQs 
• How to get rid of useless words 
• How to see documents to cluster assignments 
• How to choose appropriate weighting 
• How to run this on a cluster 
• How to scale 
• How to choose k 
• How to improve similarity measurement
FAQs 
• How to get rid of useless words 
– Use a Lucene analyzer that filters stop words (for example, StopAnalyzer) 
• How to see documents to cluster assignments 
– Run the clustering process at the end of centroid generation using -cl 
• How to choose appropriate weighting 
– If it's long text, go with TF-IDF. Use normalization if documents differ in 
length 
• How to run this on a cluster 
– Set HADOOP_CONF_DIR to point to your Hadoop cluster's conf 
directory 
• How to scale 
– Use small value of k to partially cluster data and then do full clustering 
on each cluster.
FAQs 
• How to choose k 
– Figure out based on the data you have. Trial and error 
– Or use Canopy Clustering and distance threshold to figure it 
out 
– Or use Spectral clustering 
• How to improve Similarity Measurement 
– Not all features are equal 
– Small weight difference for certain types creates a large 
semantic difference 
– Use WeightedDistanceMeasure 
– Or write a custom DistanceMeasure
More clustering algorithms 
• Canopy 
• Fuzzy K-Means 
• Mean Shift 
• Dirichlet process clustering 
• Spectral clustering
End of session 
Day – 4: K-Means Clustering

K-Means Clustering Algorithms in Mahout

  • 2. Scale!
      • Scale to large datasets
          – Hadoop MapReduce implementations that scale linearly with data
          – Fast sequential algorithms whose runtime doesn’t depend on the size of the data
          – Goal: to be as fast as possible for any algorithm
      • Scalable to support your business case
          – Apache Software License 2
      • Scalable community
          – Vibrant, responsive and diverse
          – Come to the mailing list and find out more
  • 3. History of Mahout
      • Summer 2007
          – Developers needed scalable ML
          – Mailing list formed
      • Community formed
          – Apache contributors
          – Academia & industry
          – Lots of initial interest
      • Project formed under Apache Lucene
          – January 25, 2008
  • 4. Where is ML Used Today
      • Internet search clustering
      • Knowledge management systems
      • Social network mapping
      • Taxonomy transformations
      • Marketing analytics
      • Recommendation systems
      • Log analysis & event filtering
      • SPAM filtering, fraud detection
  • 5. Clustering • Call it fuzzy grouping based on a notion of similarity
  • 6. Mahout Clustering
      • Plenty of algorithms: K-Means, Fuzzy K-Means, Mean Shift, Canopy, Dirichlet
      • Group similar looking objects
      • Notion of similarity: distance measure
          – Euclidean
          – Cosine
          – Tanimoto
          – Manhattan
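To make the four distance measures named on this slide concrete, here is a minimal sketch on plain double arrays. It uses the textbook definitions and is not Mahout's own code; the class name DistanceSketch and its method names are made up for illustration, and Mahout's implementations work on its own Vector types.

```java
// Illustrative only: textbook versions of the four measures, on equal-length arrays.
final class DistanceSketch {
    // Straight-line distance: square root of the summed squared differences.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // City-block distance: sum of absolute differences.
    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Cosine distance: 1 - cos(angle between a and b). NaN if either vector is all zeros.
    static double cosine(double[] a, double[] b) {
        return 1.0 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    // Tanimoto distance: 1 - a.b / (|a|^2 + |b|^2 - a.b). NaN if both vectors are all zeros.
    static double tanimoto(double[] a, double[] b) {
        double ab = dot(a, b);
        return 1.0 - ab / (dot(a, a) + dot(b, b) - ab);
    }

    public static void main(String[] args) {
        double[] a = {5, 3};
        double[] b = {1, 0};
        System.out.println("euclidean = " + euclidean(a, b));
        System.out.println("manhattan = " + manhattan(a, b));
        System.out.println("cosine    = " + cosine(a, b));
        System.out.println("tanimoto  = " + tanimoto(a, b));
    }
}
```

Cosine and Tanimoto ignore or dampen overall vector length, which is one reason they are popular for text vectors of differing document length.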
  • 7. Classification
      • Predicting the type of a new object based on its features
      • The types are predetermined (figure labels: Dog, Cat)
  • 8. Mahout Classification
      • Plenty of algorithms
          – Naïve Bayes
          – Complementary Naïve Bayes
          – Random Forests
          – Logistic Regression (SGD)
          – Support Vector Machines (patch ready)
      • Learn a model from manually classified data
      • Predict the class of a new object based on its features and the learned model
  • 9. Understanding data - Vectors
      [Figure: the point (5, 3) plotted on X and Y axes, with X = 5, Y = 3]
      • The vector denoted by point (5, 3) is simply Array([5, 3]) or HashMap([0 => 5], [1 => 3])
  • 10. Representing Vectors – The basics
      • Now think 3, 4, 5, … n-dimensional
      • Think of a document as a bag of words: “she sells sea shells on the sea shore”
      • Now map them to integers
          she => 0
          sells => 1
          sea => 2
          and so on
      • The resulting vector: [1.0, 1.0, 2.0, … ]
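The word-to-integer mapping on this slide can be sketched in a few lines. This is a hypothetical illustration, not Mahout code: the class BagOfWordsSketch, the whitespace tokenizer, and the lower-casing are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: dictionary-based bag-of-words vectorization.
final class BagOfWordsSketch {
    // Give each distinct word the next free integer, in order of first appearance.
    static Map<String, Integer> buildDictionary(String text) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                dictionary.putIfAbsent(word, dictionary.size());
            }
        }
        return dictionary;
    }

    // Count occurrences: one dimension per dictionary word; unknown words are skipped.
    static double[] vectorize(String text, Map<String, Integer> dictionary) {
        double[] vector = new double[dictionary.size()];
        for (String word : text.toLowerCase().split("\\s+")) {
            Integer index = dictionary.get(word);
            if (index != null) {
                vector[index] += 1.0;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        String doc = "she sells sea shells on the sea shore";
        Map<String, Integer> dictionary = buildDictionary(doc);
        System.out.println(dictionary);
        // prints [1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0]
        System.out.println(Arrays.toString(vectorize(doc, dictionary)));
    }
}
```

The third entry is 2.0 because "sea" (index 2) occurs twice in the sentence, matching the vector shown on the slide.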
  • 11. Vectors
      • Imagine one dimension for each word.
      • Each dimension is also called a feature
      • Two techniques
          – Dictionary Based
          – Randomizer Based
  • 12. K-Means clustering
      • K-means clustering is a classical clustering algorithm that uses an expectation-maximization-like technique to partition a number of data points into k clusters.
      • K-means clustering is commonly used for a number of classification applications.
      • It is often run on extremely large data sets, on the order of hundreds of millions of points and tens of gigabytes of data.
      • Because k-means is run on such large data sets, and because of certain characteristics of the algorithm, it is a good candidate for parallelization.
  • 13. K-Means clustering
      • At its simplest, the algorithm takes as input a set of n d-dimensional points and a desired number of clusters k.
      • For the purposes of this explanation, we will consider points in a Euclidean space.
      • However, the clustering algorithm will work in any space, provided a distance metric is given as input as well.
      • Initially, k points are chosen as cluster centers.
      • There is no fixed way to determine these initial points. Instead, a number of heuristics are used; most commonly, k points are chosen at random from the sample of n points.
      • Once the k initial centers are chosen, the distance from every point in the set to each of the k centers is calculated, and each point is assigned to the cluster whose center is nearest.
      • Using this assignment of points to cluster centers, each cluster center is recalculated as the centroid of its member points.
      • This process is then iterated until convergence is reached.
      • That is, points are reassigned to centers and centroids are recalculated until the k cluster centers shift by less than some delta value.
      • Because of the way k-means works, it has the weakness that it can converge to a local optimum rather than the global optimum.
      • This can happen when the initial k center points are chosen badly.
  • 14. Pseudocode
      iterate {
          Compute distance from all points to all k-centers
          Assign each point to the nearest k-center
          Compute the average of all points assigned to all specific k-centers
          Replace the k-centers with the new averages
      }
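The pseudocode above can be turned into a small single-machine sketch. This is not Mahout's implementation: the class KMeansSketch, the squared Euclidean distance, the caller-supplied initial centers, and the choice to leave an empty cluster's center unchanged are all assumptions made for illustration.

```java
import java.util.Arrays;

// Illustrative only: sequential k-means on dense arrays.
final class KMeansSketch {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the center closest to the point.
    static int nearest(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(point, centers[c]) < squaredDistance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    // Iterate until no center moves by delta or more, or maxIterations is reached.
    static double[][] cluster(double[][] points, double[][] initialCenters,
                              int maxIterations, double delta) {
        int k = initialCenters.length;
        int d = points[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = initialCenters[c].clone();
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] sums = new double[k][d];
            int[] counts = new int[k];
            // Assign each point to its nearest center and accumulate per-cluster sums.
            for (double[] p : points) {
                int c = nearest(p, centers);
                counts[c]++;
                for (int j = 0; j < d; j++) {
                    sums[c][j] += p[j];
                }
            }
            // Replace each center with the average of its members.
            double maxShift = 0.0;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // assumption: an empty cluster keeps its old center
                }
                double[] updated = new double[d];
                for (int j = 0; j < d; j++) {
                    updated[j] = sums[c][j] / counts[c];
                }
                maxShift = Math.max(maxShift, Math.sqrt(squaredDistance(centers[c], updated)));
                centers[c] = updated;
            }
            if (maxShift < delta) {
                break;
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 0}, {10, 2}};
        double[][] start = {{0, 0}, {10, 0}};
        System.out.println(Arrays.deepToString(cluster(points, start, 10, 0.001)));
    }
}
```

The maxIterations and delta parameters play the same role as the iteration limit and convergence delta that the deck later passes to the mahout kmeans command.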
  • 15-18. K-Means clustering [figures showing cluster centers c1, c2, c3]
  • 19. Parallelizing k-means
      • In order to parallelize k-means, we want to come up with a scheme where we can operate on each point in the data set independently.
      • In the first step of the iterative process of k-means, it is necessary to compute the distance from each point to each of the k cluster centers and assign that point to the cluster with the minimum distance.
      • Thus, there is a small amount of shared data, namely the cluster centers.
      • However, this is small in comparison to the number of data points.
      • So the parallelization scheme involves duplicating the cluster centers. Once they are duplicated, each data point can be operated on independently of the others, and we can gain a nice speedup.
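The scheme described on this slide can be imitated on one machine with Java parallel streams, used here only as a stand-in for Hadoop MapReduce. In this hypothetical sketch (class ParallelAssignSketch is made up), the shared centers array is read by every worker, each point is mapped to its nearest center independently, and the grouped points are then reduced to a new centroid per cluster.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative only: one k-means iteration as a parallel "map" plus a per-cluster "reduce".
final class ParallelAssignSketch {
    // "Map" side: index of the nearest center (squared Euclidean distance).
    static int nearest(double[] point, double[][] centers) {
        int best = 0;
        double bestDistance = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double sum = 0.0;
            for (int j = 0; j < point.length; j++) {
                double d = point[j] - centers[c][j];
                sum += d * d;
            }
            if (sum < bestDistance) {
                bestDistance = sum;
                best = c;
            }
        }
        return best;
    }

    // "Reduce" side: average the points that were assigned to one cluster.
    static double[] centroid(List<double[]> members) {
        double[] mean = new double[members.get(0).length];
        for (double[] p : members) {
            for (int j = 0; j < mean.length; j++) {
                mean[j] += p[j];
            }
        }
        for (int j = 0; j < mean.length; j++) {
            mean[j] /= members.size();
        }
        return mean;
    }

    // Returns cluster index -> new centroid. Clusters that received no points are absent.
    static Map<Integer, double[]> iterate(double[][] points, double[][] centers) {
        return Arrays.stream(points)
                .parallel()
                .collect(Collectors.groupingBy(
                        p -> nearest(p, centers),
                        Collectors.collectingAndThen(Collectors.toList(),
                                ParallelAssignSketch::centroid)));
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 0}, {10, 2}};
        double[][] centers = {{0, 0}, {10, 0}};
        iterate(points, centers).forEach(
                (cluster, center) -> System.out.println(cluster + " -> " + Arrays.toString(center)));
    }
}
```

The only data shared between workers is the small centers array, which is the property the slide relies on. A real MapReduce job would also typically combine partial sums per worker rather than collect whole point lists.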
  • 20. Step 1 – Convert dataset into a Hadoop Sequence File
      • SGML files
          $ mkdir -p /reuters
          $ cd reuters-sgm && tar xzf ../reuters21578.tar.gz && cd .. && cd ..
      • Extract content from SGML to text file
          $ mahout org.apache.lucene.benchmark.utils.ExtractReuters reuters reuters/out
  • 21. Step 1 – Convert dataset into a Hadoop Sequence File
      • Use the seqdirectory tool to convert text files into a Hadoop Sequence File
          $ mahout seqdirectory -i reuters/out -o reuters/seqdir -c UTF-8 -chunk 5
  • 22. Hadoop Sequence File
      • Sequence of records, where each record is a <Key, Value> pair
          – <Key1, Value1>
          – <Key2, Value2>
          – …
          – <Keyn, Valuen>
      • Key and Value need to be of class org.apache.hadoop.io.Text
          – Key = Record name or File name or unique identifier
          – Value = Content as UTF-8 encoded string
      • TIP: Dump data from your database directly into Hadoop Sequence Files
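  • 23. Writing to Sequence Files
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.SequenceFile;
      import org.apache.hadoop.io.Text;

      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path path = new Path("testdata/part-00000");
      // Key and value are both written as Text
      SequenceFile.Writer writer = new SequenceFile.Writer(
          fs, conf, path, Text.class, Text.class);
      // MAX_DOCS and documents(i) stand for your own document source
      for (int i = 0; i < MAX_DOCS; i++) {
        writer.append(new Text(documents(i).Id()),
                      new Text(documents(i).Content()));
      }
      writer.close();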
  • 24. Generate Vectors from Sequence Files
      • Steps
          1. Compute dictionary
          2. Assign integers for words
          3. Compute feature weights
          4. Create vector for each document using word-integer mapping and feature-weight
      • Or simply run
          $ mahout seq2sparse
  • 25. Generate Vectors from Sequence Files
      • $ mahout seq2sparse -i reuters/seqdir/ -o reuters/sparse
      • Important options
          – Ngrams
          – Lucene Analyzer for tokenizing
          – Feature pruning
              • Min support
              • Max document frequency
              • Min LLR (for ngrams)
          – Weighting method
              • TF vs. TFIDF
              • lp-Norm
              • Log normalize length
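To show what the TF vs. TFIDF choice means, here is a sketch of the textbook TF-IDF weight, tf × ln(N / df). This is an assumption for illustration only: seq2sparse may compute its weights with a different variant of the formula, and the class TfIdfSketch is hypothetical.

```java
// Illustrative only: textbook TF-IDF weight, not necessarily the formula seq2sparse uses.
final class TfIdfSketch {
    // termFrequency: occurrences of the term in this document
    // documentFrequency: number of documents containing the term
    // totalDocuments: number of documents in the collection
    static double tfidf(int termFrequency, int documentFrequency, int totalDocuments) {
        if (termFrequency == 0 || documentFrequency == 0) {
            return 0.0;
        }
        return termFrequency * Math.log((double) totalDocuments / documentFrequency);
    }

    public static void main(String[] args) {
        // A term found in every document carries no weight, however often it occurs.
        System.out.println(tfidf(5, 100, 100));
        // A term found in few documents is weighted up.
        System.out.println(tfidf(5, 2, 100));
    }
}
```

Plain TF would use termFrequency alone, so very common words would dominate the vectors. The inverse-document-frequency factor is what pushes them toward zero.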
  • 26. Start K-Means clustering
      • $ mahout kmeans -i reuters/sparse/tfidf/ -c reuters-kmeans-clusters -o reuters-kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -x 10 -k 20 -ow
      • Things to watch out for
          – Number of iterations
          – Convergence delta
          – Distance measure
          – Creating assignments
  • 27. Inspect clusters
      • $ bin/mahout clusterdump -s reuters-kmeans/clusters-9 -d reuters-out-seqdir-sparse-kmeans/dictionary.file-0 -dt sequencefile -b 100 -n 20
      • Typical output:
          :VL-21438{n=518 c=[0.56:0.019, 00:0.154, 00.03:0.018, 00.18:0.018, …
          Top Terms:
              iran => 3.1861672217321213
              strike => 2.567886952727918
              iranian => 2.133417966282966
              union => 2.116033937940266
              said => 2.101773806290277
              workers => 2.066259451354332
              gulf => 1.9501374918521601
              had => 1.6077752463145605
              he => 1.5355078004962228
  • 28. FAQs
      • How to get rid of useless words
      • How to see documents to cluster assignments
      • How to choose appropriate weighting
      • How to run this on a cluster
      • How to scale
      • How to choose k
      • How to improve similarity measurement
  • 29. FAQs
      • How to get rid of useless words
          – Use StopwordsAnalyzer
      • How to see documents to cluster assignments
          – Run clustering process at the end of centroid generation using -cl
      • How to choose appropriate weighting
          – If it’s long text, go with tfidf. Use normalization if documents differ in length
      • How to run this on a cluster
          – Set HADOOP_CONF directory to point to your hadoop cluster conf directory
      • How to scale
          – Use a small value of k to partially cluster data and then do full clustering on each cluster.
  • 30. FAQs
      • How to choose k
          – Figure out based on the data you have. Trial and error
          – Or use Canopy Clustering and distance threshold to figure it out
          – Or use Spectral clustering
      • How to improve similarity measurement
          – Not all features are equal
          – Small weight difference for certain types creates a large semantic difference
          – Use WeightedDistanceMeasure
          – Or write a custom DistanceMeasure
  • 31. More clustering algorithms
      • Canopy
      • Fuzzy K-Means
      • Mean Shift
      • Dirichlet process clustering
      • Spectral clustering
  • 32. End of session Day – 4: K-Means Clustering

Editor's Notes

  1. 9.1 top left
  2. 9.1 top right
  3. 9.1 bottom left
  4. 9.1 bottom right