K-means under Mahout


Document clustering example using K-means under Mahout.

Transcript

  • 1. Document clustering using K-means under Mahout

    1. Edit system environment variables. In your home directory, edit .bash_profile (for Mac users) or .bashrc (for Linux users) and add the following lines.
       $vi .bash_profile
       export JAVA_HOME=$(/usr/libexec/java_home)
       export HADOOP_HOME=/Users/DoubleJ/Software/hadoop-1.0.3   (replace with your Hadoop directory)
       export HADOOP_CONF_DIR=/Users/DoubleJ/Software/hadoop-1.0.3/conf
       Then save and exit the file, and reload it.
       $source .bash_profile

    2. Download the data.
       $curl http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz -o ~/Software/mahout-work/reuters.tar.gz

    3. Uncompress the data.
       $mkdir ~/Software/mahout-work/sgm
       $tar xzf ~/Software/mahout-work/reuters.tar.gz -C ~/Software/mahout-work/sgm

    4. Extract the individual articles from the segment data (run from your Mahout distribution directory, e.g. ~/Software/mahout-distribution-0.7).
       $./bin/mahout org.apache.lucene.benchmark.utils.ExtractReuters ~/Software/mahout-work/sgm ~/Software/mahout-work/out

    5. Generate the sequence directory, which contains all of the texts.
       5.1 Put the extracted text data into the Hadoop file system.
       $cd $HADOOP_HOME
       $./bin/start-all.sh
       $./bin/hadoop fs -mkdir hadoop-mahout
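    As an optional sanity check before copying data into HDFS, you can confirm that the environment variables are set and that the Hadoop 1.x daemons are running; jps ships with the JDK, and the exact daemon list depends on your configuration.
       $echo $JAVA_HOME
       $echo $HADOOP_HOME
       $jps   (for a pseudo-distributed Hadoop 1.0.3 setup, expect NameNode, DataNode, JobTracker, TaskTracker and SecondaryNameNode)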
  • 2. (step 5.1 continued)
       $./bin/hadoop fs -put ~/Software/mahout-work/out hadoop-mahout   (takes a while)
       $./bin/hadoop fs -ls hadoop-mahout/out

       5.2 Create the sequence directory.
       $cd ~/Software/mahout-distribution-0.7
       $./bin/mahout seqdirectory -i hadoop-mahout/out -o hadoop-mahout/out-seqdir -c UTF-8 -chunk 5
       $./bin/hadoop fs -ls hadoop-mahout
       $./bin/hadoop fs -ls hadoop-mahout/out-seqdir
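    To verify that the sequence directory was created correctly, you can dump a few records with Mahout's seqdumper; the file name chunk-0 below assumes the default seqdirectory output naming and may differ in your run.
       $./bin/mahout seqdumper -i hadoop-mahout/out-seqdir/chunk-0 | head   (chunk-0 is an assumed default name)
       Each record should show a document name as the key and its raw text as the value.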
  • 3. 6. Create vector files to represent these texts.
       $./bin/mahout seq2sparse -i hadoop-mahout/out-seqdir/ -o hadoop-mahout/vectors --maxDFPercent 85 --namedVector   (takes a while)
       $cd $HADOOP_HOME
       $./bin/hadoop fs -ls hadoop-mahout
       $./bin/hadoop fs -ls hadoop-mahout/vectors
       In addition, when you check the completed jobs at http://localhost:50030, you will see the finished seq2sparse jobs listed.

    7. Run K-means (from the Mahout distribution directory).
       $cd ~/Software/mahout-distribution-0.7
       $./bin/mahout kmeans -i hadoop-mahout/vectors/tfidf-vectors -c hadoop-mahout/cluster-centroids -o hadoop-mahout/kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering -cl
       $cd $HADOOP_HOME
       $./bin/hadoop fs -ls hadoop-mahout/
       $./bin/hadoop fs -ls hadoop-mahout/cluster-centroids
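    Besides the per-document assignments dumped in step 8, the cluster centroids themselves can be inspected with Mahout's clusterdump utility. The sketch below assumes the default output layout, in which seq2sparse writes a dictionary.file-0 under hadoop-mahout/vectors and kmeans marks its last iteration as clusters-*-final.
       $./bin/mahout clusterdump -i hadoop-mahout/kmeans/clusters-*-final -d hadoop-mahout/vectors/dictionary.file-0 -dt sequencefile -n 10 -o ~/Software/mahout-work/cluster-terms.txt   (paths assume the default seq2sparse/kmeans layout)
       $head ~/Software/mahout-work/cluster-terms.txt
       This writes the top 10 terms of each of the 20 clusters to a local text file.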
  • 4. Again, when you check the completed jobs at http://localhost:50030, you will see the K-means jobs listed.

    8. View the results in a human-readable way. Dump the document-to-cluster mapping.
       $./bin/mahout seqdumper -i hadoop-mahout/kmeans/clusteredPoints/part-m-00000 > ~/Software/mahout-work/cluster-docs.txt
       $ls -l ~/Software/mahout-work
       Then download view.pl and run it using the command shown on the next slide.
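    If you open cluster-docs.txt, each line should be a seqdumper record of roughly the form "Key: <cluster id>: Value: ... <document file name> ..."; the exact value layout depends on the Mahout version, which is why view.pl is used to reduce it to a simple mapping. A quick preview:
       $head -5 ~/Software/mahout-work/cluster-docs.txt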
  • 5. $perl view.pl ~/Software/mahout-work/cluster-docs.txt ~/Software/mahout-work/cluster-results.txt
       cluster-results.txt is a text file in which each line has two tab-separated columns: the cluster ID and the corresponding document file name.
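    Because cluster-results.txt is a plain tab-separated file, ordinary shell tools are enough for a quick summary of the clustering, for example the number of documents assigned to each cluster:
       $head ~/Software/mahout-work/cluster-results.txt   (spot-check a few document-to-cluster assignments)
       $cut -f1 ~/Software/mahout-work/cluster-results.txt | sort | uniq -c | sort -rn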