K-Means clustering is an algorithm that partitions data points into k clusters based on their distances from initial cluster center points. It is commonly used for classification applications on large datasets and can be parallelized by duplicating cluster centers and processing each data point independently. Mahout provides implementations of K-Means clustering and other algorithms that can operate on distributed datasets stored in Hadoop SequenceFiles.
2. Scale!
• Scale to large datasets
– Hadoop MapReduce implementations that scale linearly
with data
– Fast sequential algorithms whose runtime doesn’t depend
on the size of the data
– Goal: To be as fast as possible for any algorithm
• Scalable to support your business case
– Apache Software License 2
• Scalable community
– Vibrant, responsive and diverse
– Come to the mailing list and find out more
3. History of Mahout
• Summer 2007
– Developers needed scalable ML
– Mailing list formed
• Community formed
– Apache contributors
– Academia & industry
– Lots of initial interest
• Project formed under Apache Lucene
– January 25, 2008
4. Where is ML Used Today
• Internet search clustering
• Knowledge management systems
• Social network mapping
• Taxonomy transformations
• Marketing analytics
• Recommendation systems
• Log analysis & event filtering
• SPAM filtering, fraud detection
6. Mahout Clustering
• Plenty of Algorithms: K-Means,
Fuzzy K-Means, Mean Shift,
Canopy, Dirichlet
• Group similar looking objects
• Notion of similarity: Distance measure:
– Euclidean
– Cosine
– Tanimoto
– Manhattan
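The four distance measures listed above can be sketched as plain static methods over dense `double[]` vectors. Mahout ships these as `DistanceMeasure` implementations; this is only a minimal standalone version to show the formulas (cosine and Tanimoto are expressed as 1 minus the corresponding similarity):

```java
public class Distances {
    // Euclidean: sqrt of summed squared differences.
    public static double euclidean(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // Manhattan: sum of absolute differences.
    public static double manhattan(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += Math.abs(a[i] - b[i]);
        return s;
    }

    // Cosine distance = 1 - cosine similarity.
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Tanimoto distance = 1 - (a.b) / (|a|^2 + |b|^2 - a.b).
    public static double tanimoto(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return 1.0 - dot / (na + nb - dot);
    }
}
```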
7. Classification
• Predicting the type of a new object based on its features
• The types are predetermined
(figure: example objects labeled "Dog" and "Cat")
8. Mahout Classification
• Plenty of algorithms
– Naïve Bayes
– Complementary Naïve Bayes
– Random Forests
– Logistic Regression (SGD)
– Support Vector Machines (patch ready)
• Learn a model from manually classified data
• Predict the class of a new object based on its
features and the learned model
9. Understanding data - Vectors
(figure: X-Y plot showing the point X = 5, Y = 3)
• The vector denoted by point (5, 3) is simply
Array([5, 3]) or HashMap([0 => 5], [1 => 3])
10. Representing Vectors – The basics
• Now think 3, 4, 5, ….. n-dimensional
• Think of a document as a bag of words.
“she sells sea shells on the sea shore”
• Now map them to integers
she => 0
sells => 1
sea => 2
and so on
• The resulting vector is [1.0, 1.0, 2.0, …]
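The dictionary-based mapping above can be sketched in a few lines: each distinct word gets the next integer id in order of first appearance (she => 0, sells => 1, sea => 2, …), and the vector holds the term counts. `BagOfWords` is an illustrative name, not a Mahout class:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BagOfWords {
    // Builds the word -> integer dictionary and the count vector in one pass each.
    public static double[] vectorize(String text) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        String[] tokens = text.toLowerCase().split("\\s+");
        // Assign ids in order of first appearance: she=>0, sells=>1, sea=>2, ...
        for (String t : tokens) dict.putIfAbsent(t, dict.size());
        double[] vec = new double[dict.size()];
        for (String t : tokens) vec[dict.get(t)] += 1.0; // count occurrences
        return vec;
    }
}
```

Running it on "she sells sea shells on the sea shore" yields a 7-dimensional vector whose third entry is 2.0, since "sea" occurs twice.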
11. Vectors
• Imagine one dimension for each word.
• Each dimension is also called a feature
• Two techniques
– Dictionary Based
– Randomizer Based
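The randomizer-based technique can be sketched as feature hashing: instead of building a dictionary, each word is hashed into a fixed number of dimensions. This skips the dictionary pass over the data, at the cost of occasional hash collisions. `HashedVectorizer` is an illustrative name, not Mahout's actual encoder:

```java
public class HashedVectorizer {
    // Maps each token to an index by hashing; no dictionary needed.
    public static double[] vectorize(String text, int dimensions) {
        double[] vec = new double[dimensions];
        for (String t : text.toLowerCase().split("\\s+")) {
            // floorMod keeps the index non-negative even for negative hash codes.
            int idx = Math.floorMod(t.hashCode(), dimensions);
            vec[idx] += 1.0; // collisions fold distinct words into one dimension
        }
        return vec;
    }
}
```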
12. K-Means clustering
• K-means clustering is a classical clustering algorithm that uses an
expectation-maximization-like technique to partition a number of data points
into k clusters.
• K-means clustering is commonly used for a number of classification
applications.
• It is often run on extremely large data sets, on the order of hundreds of
millions of points and tens of gigabytes of data.
• Because k-means is run on such large data sets, and because of certain
characteristics of the algorithm, it is a good candidate for parallelization.
13. K-Means clustering
• At its simplest, the algorithm is given as inputs a set of n d-dimensional points
and a number of desired clusters k.
• For the purposes of this explanation, we will consider points in a Euclidean
space.
• However, the clustering algorithm will work in any space provided a distance
metric is given as input as well.
• Initially, k points are chosen as cluster centers.
• There is no fixed way to determine these initial points; instead, a number of
heuristics are used. Most commonly, k points are chosen at random from the
sample of n points.
• Once the k initial centers are chosen, the distance from every point in the
set to each of the k centers is calculated, and each point is assigned to the
cluster whose center is nearest.
• Using this assignment of points to cluster centers, each cluster center is
recalculated as the centroid of its member points.
• This process is then iterated until convergence is reached.
• That is, points are reassigned to centers, and centroids recalculated until the k
cluster centers shift by less than some delta value.
• Because of the way k-means works, it has the weakness that it can converge to
a local optimum rather than the global optimum.
• This happens when the initial k center points are chosen badly.
14. Pseudocode
iterate {
Compute the distance from every point to each of the k centers
Assign each point to its nearest center
Compute the average (centroid) of the points assigned to each center
Replace the k centers with the new averages
}
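The pseudocode above can be made runnable directly. This is a minimal standalone sketch (Euclidean distance, convergence when every center moves by less than `delta`), not Mahout's implementation:

```java
public class KMeans {
    // points: n rows of d-dimensional data; initialCenters: the k starting centers.
    public static double[][] cluster(double[][] points, double[][] initialCenters,
                                     double delta, int maxIterations) {
        double[][] centers = initialCenters;
        for (int iter = 0; iter < maxIterations; iter++) {
            int k = centers.length, dim = points[0].length;
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Assign each point to its nearest center, accumulating sums per cluster.
            for (double[] p : points) {
                int best = nearest(p, centers);
                counts[best]++;
                for (int d = 0; d < dim; d++) sums[best][d] += p[d];
            }
            // Replace the centers with the new averages; track the largest shift.
            double[][] next = new double[k][dim];
            double maxShift = 0;
            for (int c = 0; c < k; c++) {
                for (int d = 0; d < dim; d++)
                    next[c][d] = counts[c] > 0 ? sums[c][d] / counts[c] : centers[c][d];
                maxShift = Math.max(maxShift, distance(centers[c], next[c]));
            }
            centers = next;
            if (maxShift < delta) break; // converged
        }
        return centers;
    }

    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++)
            if (distance(p, centers[c]) < distance(p, centers[best])) best = c;
        return best;
    }

    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}
```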
19. Parallelizing k-means
• In order to parallelize k-means, we want to come up with a scheme where
we can operate on each point in the data set independently.
• In the first step of the iterative process of k-means, it is necessary to
compute the distance from each point to each of the k cluster centers and
assign that point to the cluster with the minimum distance.
• Thus, there is a small amount of shared data – namely the cluster centers.
• However, this is small in comparison to the number of data points.
• So the parallelization scheme involves duplicating the cluster centers;
once they are duplicated, each data point can be operated on
independently of the others, and we gain a nice speedup.
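The same scheme can be sketched outside MapReduce with Java parallel streams: the small centers array is the shared, read-only state, and each point's nearest-center assignment is computed independently. A toy stand-in for the Mahout mapper, not its actual code:

```java
import java.util.stream.IntStream;

public class ParallelAssign {
    // Returns, for each point, the index of its nearest center.
    public static int[] assign(double[][] points, double[][] centers) {
        return IntStream.range(0, points.length)
                .parallel()                         // each point handled independently
                .map(i -> nearest(points[i], centers))
                .toArray();
    }

    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double s = 0; // squared Euclidean distance is enough for comparison
            for (int d = 0; d < p.length; d++)
                s += (p[d] - centers[c][d]) * (p[d] - centers[c][d]);
            if (s < bestDist) { bestDist = s; best = c; }
        }
        return best;
    }
}
```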
20. Step 1 – Convert dataset into a Hadoop Sequence File
• SGML files
– $ mkdir -p reuters-sgm
– $ cd reuters-sgm && tar xzf ../reuters21578.tar.gz && cd ..
• Extract content from SGML to text file
– $ mahout org.apache.lucene.benchmark.utils.ExtractReuters
reuters-sgm reuters/out
21. Step 1 – Convert dataset into a Hadoop Sequence File
• Use seqdirectory tool to convert text file into a
Hadoop Sequence File
– $ mahout seqdirectory
-i reuters/out
-o reuters/seqdir
-c UTF-8
-chunk 5
22. Hadoop Sequence File
• Sequence of Records, where each record is a <Key, Value> pair
– <Key1, Value1>
– <Key2, Value2>
– …
– …
– …
– <KeyN, ValueN>
• Key and Value need to be of class org.apache.hadoop.io.Text
– Key = Record name or File name or unique identifier
– Value = Content as UTF-8 encoded string
• TIP: Dump data from your database directly into Hadoop Sequence Files
23. Writing to Sequence Files
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path("testdata/part-00000");
SequenceFile.Writer writer = new SequenceFile.Writer(
fs, conf, path, Text.class, Text.class);
// Key = document id, Value = content as UTF-8 text
for (int i = 0; i < MAX_DOCS; i++) {
writer.append(new Text(documents.get(i).getId()),
new Text(documents.get(i).getContent()));
}
writer.close();
24. Generate Vectors from Sequence Files
• Steps
1. Compute Dictionary
2. Assign integers for words
3. Compute feature weights
4. Create vector for each document using word-integer mapping
and feature-weight
Or
• Simply run $ mahout seq2sparse
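The four steps above can be sketched by hand as a toy stand-in for what seq2sparse automates. Here `idf = log(N / df)` is one common choice of feature weight, not necessarily Mahout's exact formula, and `TfIdf` is an illustrative class name:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TfIdf {
    // docs: each document as a token array. Returns one sparse vector per document.
    public static List<Map<Integer, Double>> vectorize(List<String[]> docs) {
        Map<String, Integer> dict = new LinkedHashMap<>();   // step 1+2: word -> integer
        Map<String, Integer> docFreq = new HashMap<>();      // step 3: document frequency
        for (String[] doc : docs) {
            for (String t : new HashSet<>(Arrays.asList(doc)))
                docFreq.merge(t, 1, Integer::sum);
            for (String t : doc) dict.putIfAbsent(t, dict.size());
        }
        List<Map<Integer, Double>> vectors = new ArrayList<>();
        for (String[] doc : docs) {                          // step 4: build each vector
            Map<String, Integer> tf = new HashMap<>();
            for (String t : doc) tf.merge(t, 1, Integer::sum);
            Map<Integer, Double> vec = new HashMap<>();      // sparse: index -> weight
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                vec.put(dict.get(e.getKey()), e.getValue() * idf);
            }
            vectors.add(vec);
        }
        return vectors;
    }
}
```

Note how a word that appears in every document gets idf = log(1) = 0, so it contributes nothing to the distance between vectors.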
25. Generate Vectors from Sequence Files
• $ mahout seq2sparse
-i reuters/seqdir/
-o reuters/sparse
• Important options
– Ngrams
– Lucene Analyzer for tokenizing
– Feature Pruning
• Min support
• Max Document Frequency
• Min LLR (for ngrams)
– Weighting Method
• TF vs. TF-IDF
• lp-Norm
• Log normalize length
26. Start K-Means clustering
• $ mahout kmeans
-i reuters/sparse/tfidf/
-c reuters-kmeans-clusters
-o reuters-kmeans
-dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1
-x 10 -k 20 -ow
• Things to watch out for
– Number of iterations
– Convergence delta
– Distance Measure
– Creating assignments
28. FAQs
• How to get rid of useless words
• How to see documents to cluster assignments
• How to choose appropriate weighting
• How to run this on a cluster
• How to scale
• How to choose k
• How to improve similarity measurement
29. FAQs
• How to get rid of useless words
– Use StopwordsAnalyzer
• How to see documents to cluster assignments
– Run the clustering process at the end of centroid generation using -cl
• How to choose appropriate weighting
– If it's long text, go with TF-IDF. Use normalization if documents differ in
length
• How to run this on a cluster
– Set HADOOP_CONF directory to point to your hadoop cluster conf
directory
• How to scale
– Use a small value of k to partially cluster the data, then do full clustering
on each cluster.
30. FAQs
• How to choose k
– Figure out based on the data you have. Trial and error
– Or use Canopy Clustering and distance threshold to figure it
out
– Or use Spectral clustering
• How to improve Similarity Measurement
– Not all features are equal
– Small weight difference for certain types creates a large
semantic difference
– Use WeightedDistanceMeasure
– Or write a custom DistanceMeasure
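The weighted-distance idea above amounts to giving each dimension its own importance in the distance computation, so a small difference on a heavily weighted feature can dominate. Mahout's WeightedDistanceMeasure does this behind the DistanceMeasure interface; this standalone sketch just shows the computation:

```java
public class WeightedEuclidean {
    // Per-dimension weights scale each squared difference before summing.
    public static double distance(double[] a, double[] b, double[] weights) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += weights[i] * d * d; // heavier weight => that feature matters more
        }
        return Math.sqrt(s);
    }
}
```

With all weights equal to 1 this reduces to plain Euclidean distance; raising one weight stretches differences along that feature.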
31. More clustering algorithms
• Canopy
• Fuzzy K-Means
• Mean Shift
• Dirichlet process clustering
• Spectral clustering