Configuring Mahout Clustering Jobs - Frank Scholten

See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011

For more than a decade, internet search engines have helped users find the documents they are looking for. But what if users aren't looking for anything specific, and instead want a summary of a large document collection and want to be surprised? One solution to this problem is document clustering. Clustering algorithms group documents that have similar content. Real-life examples of clustering include the clustered search results of Google News and tag clouds, which group documents under a shared label. Apache Mahout is a framework for scalable machine learning on top of Apache Hadoop and can be used for large-scale document clustering. This talk introduces clustering in general and shows you step by step how to configure Mahout clustering jobs to create a tag cloud from a document collection. This talk is suitable for people who have some experience with Hadoop and perhaps Mahout. Knowledge of clustering is not required.

Published in: Technology, Education

Transcript

  • 1. Configuring Clustering Jobs. Frank Scholten, DutchWorks, frank.scholten@dutchworks.nl, 19 October 2011
  • 2. My Background: Frank Scholten, @Frank_Scholten. Software Developer at ... Blogger at ... user & contributor
  • 3. Agenda: What is clustering?; Intro to; Clustering
  • 4. Clustering introduction: So much data... How to get a nice overview?
  • 5. What is clustering? Grouping & summarizing data. Unsupervised machine learning. “...the assignment of a set of observations into subsets so that observations in the same clusters are similar in some sense...” (Source: Wikipedia)
  • 6. Applications: Market segmentation. Species identification. Machine vision. Information retrieval & search. ...and many more!
  • 7. Example - Google News
  • 8. 2-D clustering example: Intra-cluster distance. Inter-cluster distance. Legend: Point, Cluster, Cluster Center
  • 9. K-Means algorithm: Select K random vectors. Specify distance measure + threshold. Every iteration: ● Add vector closest to cluster ● Recompute center ● Converged if no vectors within threshold
  • 10. Intro to
  • 11. Classification (“Is this SPAM?”). Collaborative Filtering. Clustering. And much more!
  • 12. The Project: Apache project started in 2008. Scalable machine learning, often with Hadoop. Steadily growing community. Version 0.6 coming soon.
  • 13. bin/mahout
      frank@frankthetank:~$ mahout
      no HADOOP_HOME set, running locally
      An example program must be given as the first argument.
      Valid program names are:
        arff.vector: : Generate Vectors from an ARFF file or directory
        canopy: : Canopy clustering
        cat: : Print a file or resource as the logistic regression models would see it
        ...
  • 14. Help
      frank@frankthetank:~$ mahout kmeans --help
      usage: <command> [Generic Options] [Job-specific Options]
      Generic Options: ...
      Job-specific Options:
        --input (-i) input      Path to job input directory.
        --output (-o) output    The directory pathname for output.
        ...
  • 15. Java Drivers
      String[] args = new String[] {
        "--input", input,
        "--output", output,
        "--clusters", clusters,
        "--clustering",
        "--numClusters", "10"
      };
      ToolRunner.run(conf, new KMeansDriver(), args);
  • 16. Text clustering process (diagram): Text files or Lucene index → Sequence files → Vectors, e.g. [ 0.03, 0.95, 0.45, 0.34 ], [ 0.02, 0.98, 0.73, 0.55 ] → K-means → Clusters, e.g. (CL-1, [0.32, 0.6, .. ]), (CL-1, [0.76, 0.1, .. ]), (CL-2, [0.98, 0.2, .. ]). Other labels: Find n-grams; Dictionary; Cluster labels (Quick fox, Dog); Points (CL-1,23), (CL-1,37), (CL-2,45)
  • 17. Text clustering programs: $ mahout seqdirectory (Text files or Lucene index → Sequence files). $ mahout seq2sparse (Sequence files → Vectors, e.g. [ 0.03, 0.95, .. ], [ 0.29, 0.98, .. ]). $ mahout kmeans (Vectors → Clusters)
  • 18. Clustering
  • 19. Clustering: Publicly available monthly dumps. Posts: ~5.5 GB, ~1.4 M questions (April 2011). Let's use ... to extract a tag cloud!
  • 20. Clustering (pipeline diagram). Stages: Pre-process, Vectorize, Cluster, Index. Labels: XML & HTML; Regular expressions; Post ID & Title; Text; [ 0,1,0,1,1,1,0,0,1,0,1 ], [ 0,1,1,1,1,0,0,0,0,0,1 ]; Java, Git, Lucene, Version control; Join content & clusters
  • 21. Clustering: How to implement? Pre-process XML: XML & HTML parsing. Vectorize: Custom Analyzer. Cluster. Index.
  • 22. Vectorize ([ 0,1,0,1,1,1,0,0,1,0,1 ], [ 0,1,1,1,1,0,0,0,0,0,1 ]): Many options and flags
      $ mahout seq2sparse --input .. --output .. --analyzerClass .. --maxDFPercent .. --minDF ..
  • 23. Cluster: Run one of the clustering algorithms! K-means, Fuzzy K-means, Canopy, Mean-shift, Min-hash, LDA. All with different pros and cons.
  • 24. Index: Custom code to join data at index time. Index clusters (cluster_id, cluster_name, size). Index posts (post_id, post_cluster_id, title).
  • 25. Demo time!
  • 26. Conclusions: Clustering is fun! Vectorization & labeling improvements. Tools for cluster evaluation?
  • 27. References: Mahout in Action (just released!), Sean Owen, Ted Dunning, Robin Anil, Ellen Friedman. {user|dev}@mahout.apache.org. http://jira.apache.org/MAHOUT. http://www.searchworkings.org
  • 28. Q&A
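
The K-means steps listed on slide 9 can be illustrated with a small, self-contained sketch in plain Java. This is not Mahout's implementation and uses no Mahout classes. The class name `KMeansSketch` and its methods are hypothetical, Euclidean distance stands in for the configurable distance measure, and the initial centers are passed in explicitly (instead of being chosen at random) so that the run is repeatable. The convergence rule is also an assumption: the sketch stops once no center moves farther than the threshold in an iteration, which is one common reading of the slide's last bullet.

```java
import java.util.Arrays;

/** Illustrative in-memory k-means; a sketch, not Mahout's KMeansDriver. */
public class KMeansSketch {

    /** Euclidean distance between two vectors of equal length (assumed distance measure). */
    public static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Returns the index of the center closest to the given point. */
    public static int closest(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (distance(point, centers[c]) < distance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Runs k-means from the given initial centers and returns the final centers.
     * Stops when no center moves more than the threshold, or after maxIterations.
     */
    public static double[][] cluster(double[][] points, double[][] initialCenters,
                                     double threshold, int maxIterations) {
        int k = initialCenters.length;
        int dim = initialCenters[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = initialCenters[c].clone();
        }

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: add each vector to its closest cluster.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = closest(p, centers);
                counts[c]++;
                for (int i = 0; i < dim; i++) {
                    sums[c][i] += p[i];
                }
            }

            // Update step: recompute each center as the mean of its vectors.
            boolean converged = true;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                double[] updated = new double[dim];
                for (int i = 0; i < dim; i++) {
                    updated[i] = sums[c][i] / counts[c];
                }
                if (distance(updated, centers[c]) > threshold) {
                    converged = false;
                }
                centers[c] = updated;
            }
            if (converged) {
                break;
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        // Two well-separated groups of 2-D points.
        double[][] points = { {1, 1}, {1, 2}, {2, 1}, {8, 8}, {8, 9}, {9, 8} };
        double[][] initial = { points[0], points[3] };
        double[][] centers = cluster(points, initial, 0.001, 10);
        System.out.println(Arrays.deepToString(centers));
    }
}
```

The sketch holds all vectors in memory, whereas the Mahout job from the slides reads vectors from sequence files and can run on Hadoop, so it is meant only to make the iteration loop concrete.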
