Configuring Mahout Clustering Jobs - Frank Scholten

See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011

For more than a decade, internet search engines have helped users find the documents they are looking for. But what if users aren't looking for anything specific, and instead want a summary of a large document collection and want to be surprised? One solution to this problem is document clustering. Clustering algorithms group documents that have similar content. Real-life examples include the clustered search results of Google News and tag clouds, which group documents under a shared label. Apache Mahout is a framework for scalable machine learning on top of Apache Hadoop and can be used for large-scale document clustering. This talk introduces clustering in general and shows step by step how to configure Mahout clustering jobs to create a tag cloud from a document collection. It is suitable for people who have some experience with Hadoop and perhaps Mahout. Knowledge of clustering is not required.

Usage Rights

© All Rights Reserved

Presentation Transcript

  • Configuring Clustering Jobs
    Frank Scholten, DutchWorks
    frank.scholten@dutchworks.nl, 19 October 2011
  • My Background
    Frank Scholten, @Frank_Scholten
    Software Developer at [logo]
    Blogger at [logo]
    [logo] user & contributor
  • Agenda
    What is clustering?
    Intro to [logo]
    Clustering [logo]
  • Clustering introduction
    So much data... How to get a nice overview?
  • What is clustering?
    Grouping & summarizing data
    Unsupervised machine learning
    "...the assignment of a set of observations into subsets so that observations in the same clusters are similar in some sense..." (Source: Wikipedia)
  • Applications
    Market segmentation
    Species identification
    Machine vision
    Information retrieval & search
    ...and many more!
  • Example - Google News
  • 2-D clustering example
    [Figure: points grouped into clusters, annotated with intra-cluster distance and inter-cluster distance]
    Legend: Point, Cluster, Cluster Center
  • K-Means algorithm
    Select K random vectors as initial cluster centers
    Specify distance measure + threshold
    Every iteration
    ● Add each vector to its closest cluster
    ● Recompute each cluster center
    ● Converged if no cluster center moves more than the threshold
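The loop on the K-Means slide can be sketched in plain Java. This is a minimal illustration and not Mahout's implementation: the class and method names are hypothetical, the distance measure is assumed to be Euclidean, the initial centers are passed in instead of being picked at random (so the run is repeatable), and convergence is read as "no center moved more than the threshold".

```java
import java.util.Arrays;

// Hypothetical plain-Java sketch of the k-means loop; not Mahout code.
final class KMeansSketch {

    // Euclidean distance between two vectors of equal length (assumed measure).
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Index of the center closest to the given point.
    static int closest(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (distance(point, centers[c]) < distance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    // Runs k-means from the given initial centers and returns the final centers.
    static double[][] cluster(double[][] points, double[][] initial,
                              double threshold, int maxIterations) {
        int k = initial.length;
        int dim = initial[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = initial[c].clone();
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: add each vector to its closest cluster.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = closest(p, centers);
                counts[c]++;
                for (int i = 0; i < dim; i++) {
                    sums[c][i] += p[i];
                }
            }
            // Update step: recompute each center as the mean of its vectors.
            boolean converged = true;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                double[] updated = new double[dim];
                for (int i = 0; i < dim; i++) {
                    updated[i] = sums[c][i] / counts[c];
                }
                if (distance(updated, centers[c]) > threshold) {
                    converged = false;
                }
                centers[c] = updated;
            }
            if (converged) {
                break;
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = { {1, 1}, {1, 2}, {9, 9}, {10, 9} };
        double[][] initial = { {1, 1}, {10, 9} };
        System.out.println(Arrays.deepToString(cluster(points, initial, 0.001, 10)));
    }
}
```

With two well-separated groups of points, as in `main`, the centers settle on the mean of each group after two iterations.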
  • Intro to [logo]
  • Classification ("Is this SPAM?")
    Collaborative Filtering
    Clustering
    And much more!
  • The Project
    Apache project started in 2008
    Scalable machine learning, often with Hadoop
    Steadily growing community
    Version 0.6 coming soon
  • bin/mahout
    frank@frankthetank:~$ mahout
    no HADOOP_HOME set, running locally
    An example program must be given as the first argument.
    Valid program names are:
      arff.vector: : Generate Vectors from an ARFF file or directory
      canopy: : Canopy clustering
      cat: : Print a file or resource as the logistic regression models would see it
      ...
  • Help
    frank@frankthetank:~$ mahout kmeans --help
    usage: <command> [Generic Options] [Job-specific Options]
    Generic Options: ...
    Job-specific Options:
      --input (-i) input      Path to job input directory.
      --output (-o) output    The directory pathname for output.
      ...
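  • Java Drivers
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.clustering.kmeans.KMeansDriver;

    // Same options as on the command line, passed as an argument array.
    // input, output and clusters are path strings defined elsewhere.
    String[] args = new String[] {
      "--input", input,
      "--output", output,
      "--clusters", clusters,
      "--clustering",
      "--numClusters", "10"
    };
    ToolRunner.run(conf, new KMeansDriver(), args);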
  • Text clustering process
    [Diagram] Text files or Lucene index → Sequence files → Vectors → K-means → Clusters
    Vectors: [ 0.03, 0.95, 0.45, 0.34 ], [ 0.02, 0.98, 0.73, 0.55 ]
    Clusters: (CL-1, [0.32, 0.6, .. ]), (CL-1, [0.76, 0.1, .. ]), (CL-2, [0.98, 0.2, .. ])
    Points: (CL-1,23), (CL-1,37), (CL-2,45)
    Other diagram labels: Find n-grams, Dictionary, Cluster labels, Quick fox, Dog
  • Text clustering programs
    $ mahout seqdirectory: Text files or Lucene index → Sequence files
    $ mahout seq2sparse: Sequence files → Vectors ([ 0.03, 0.95, .. ], [ 0.29, 0.98, .. ])
    $ mahout kmeans: Vectors → Clusters
  • Clustering [logo]
  • Clustering [logo]
    Publicly available monthly dumps
    Posts: ~ 5.5 GB, ~ 1.4 M questions (April 2011)
    Let's use [logo] to extract a tag cloud!
  • Clustering [logo]
    [Diagram] Pre-process → Vectorize → Cluster → Index
    Pre-process: XML & HTML, regular expressions; output: text, post ID & title
    Vectorize: [ 0,1,0,1,1,1,0,0,1,0,1 ], [ 0,1,1,1,1,0,0,0,0,0,1 ]
    Index: join content & clusters
    Example labels: Java, Git, Lucene, Version control
  • Clustering [logo]
    How to implement?
    Pre-process XML: XML & HTML parsing
    Vectorize: Custom Analyzer
    Cluster
    Index
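The pre-processing step turns HTML post bodies into plain text before vectorization. The slides mention regular expressions, so a rough sketch of that approach might look like the following. The class name, the method name and the handful of entities handled are all hypothetical choices for illustration; the talk's actual pre-processing code is not shown in the transcript, and a real HTML parser copes with malformed markup far better than a regex does.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of regex-based HTML clean-up; not the code from the talk.
final class HtmlStripSketch {

    // Matches anything that looks like an opening or closing tag.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Replaces tags with spaces, unescapes a few common entities,
    // and collapses runs of whitespace into single spaces.
    static String toPlainText(String html) {
        String text = TAG.matcher(html).replaceAll(" ");
        text = text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&"); // last, so "&amp;lt;" becomes "&lt;"
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("<p>How do I <b>merge</b> in Git?</p>"));
    }
}
```

Tags are replaced by a space instead of being deleted outright so that words separated only by markup (for example across `</p><p>`) do not run together.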
  • Vectorize
    [ 0,1,0,1,1,1,0,0,1,0,1 ]
    [ 0,1,1,1,1,0,0,0,0,0,1 ]
    Many options and flags
    $ mahout seq2sparse --input .. --output .. --analyzerClass .. --maxDFPercent .. --minDF ..
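To make the document-frequency flags more concrete, here is a small plain-Java sketch of dictionary pruning followed by binary vectorization. It assumes that minDF is a lower bound on the number of documents a term must occur in and that maxDFPercent is an upper bound on the percentage of documents it may occur in; the class, method names and sample documents are hypothetical. It produces 0/1 vectors like those drawn on the slide, while the real job may weight terms differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of DF-based pruning; not Mahout's seq2sparse code.
final class DfPruningSketch {

    // Counts in how many documents each term occurs at least once.
    static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Keeps terms that are neither too rare nor too common.
    static SortedSet<String> dictionary(List<List<String>> docs,
                                        int minDf, int maxDfPercent) {
        SortedSet<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : documentFrequencies(docs).entrySet()) {
            int df = e.getValue();
            double percent = 100.0 * df / docs.size();
            if (df >= minDf && percent <= maxDfPercent) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    // Binary vector: 1 if the document contains the dictionary term, else 0.
    static int[] vectorize(List<String> doc, SortedSet<String> dictionary) {
        int[] vector = new int[dictionary.size()];
        int i = 0;
        for (String term : dictionary) {
            vector[i++] = doc.contains(term) ? 1 : 0;
        }
        return vector;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("java", "git", "the"),
            Arrays.asList("java", "lucene", "the"),
            Arrays.asList("git", "the"),
            Arrays.asList("java", "the", "mahout"));
        SortedSet<String> dict = dictionary(docs, 2, 80);
        System.out.println(dict);
        for (List<String> doc : docs) {
            System.out.println(Arrays.toString(vectorize(doc, dict)));
        }
    }
}
```

In the sample data "the" occurs in every document and is dropped by the upper bound, while "lucene" and "mahout" occur only once and are dropped by the lower bound, leaving "git" and "java" as the dictionary.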
  • Cluster
    Run one of the clustering algorithms!
    K-means, Fuzzy K-means, Canopy, Mean-shift, Min-hash, LDA
    All with different pros and cons
  • Index
    Custom code to join data at index time
    Index clusters (cluster_id, cluster_name, size)
    Index posts (post_id, post_cluster_id, title)
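The join described on the Index slide can be pictured with a small plain-Java sketch. It assumes the clustering output is available as a map from post ID to cluster ID and builds the two kinds of records named on the slide. The class and method names, the string record format, and the sample titles and cluster names are hypothetical; the talk's code writes to a Lucene index, which is left out here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of joining posts and clusters; not the code from the talk.
final class IndexJoinSketch {

    // Number of posts assigned to each cluster, keyed by cluster ID.
    static Map<Integer, Integer> clusterSizes(Map<Integer, Integer> postToCluster) {
        Map<Integer, Integer> sizes = new TreeMap<>();
        for (int clusterId : postToCluster.values()) {
            sizes.merge(clusterId, 1, Integer::sum);
        }
        return sizes;
    }

    // One record per cluster: (cluster_id, cluster_name, size).
    static List<String> clusterRecords(Map<Integer, String> clusterNames,
                                       Map<Integer, Integer> postToCluster) {
        List<String> records = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : clusterSizes(postToCluster).entrySet()) {
            records.add(String.format("cluster_id=%d, cluster_name=%s, size=%d",
                e.getKey(), clusterNames.get(e.getKey()), e.getValue()));
        }
        return records;
    }

    // One record per post: (post_id, post_cluster_id, title).
    static List<String> postRecords(Map<Integer, String> titles,
                                    Map<Integer, Integer> postToCluster) {
        List<String> records = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : new TreeMap<>(postToCluster).entrySet()) {
            records.add(String.format("post_id=%d, post_cluster_id=%d, title=%s",
                e.getKey(), e.getValue(), titles.get(e.getKey())));
        }
        return records;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> postToCluster = new HashMap<>();
        postToCluster.put(23, 1);
        postToCluster.put(37, 1);
        postToCluster.put(45, 2);
        Map<Integer, String> clusterNames = new HashMap<>();
        clusterNames.put(1, "git");
        clusterNames.put(2, "lucene");
        Map<Integer, String> titles = new HashMap<>();
        titles.put(23, "How do I undo a commit?");
        titles.put(37, "Merging two branches");
        titles.put(45, "Custom analyzer example");
        clusterRecords(clusterNames, postToCluster).forEach(System.out::println);
        postRecords(titles, postToCluster).forEach(System.out::println);
    }
}
```

Storing the cluster ID on each post record is what lets a search front end look up all posts behind a tag in the cloud, and the size field gives a natural weight for each tag.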
  • Demo time!
  • Conclusions
    Clustering is fun!
    Vectorization & labeling improvements
    Tools for cluster evaluation?
  • References
    Mahout in Action (just released!), Sean Owen, Ted Dunning, Robin Anil, Ellen Friedman
    {user|dev}@mahout.apache.org
    http://jira.apache.org/MAHOUT
    http://www.searchworkings.org
  • Q&A