Configuring Mahout Clustering Jobs - Frank Scholten

See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011

For more than a decade, internet search engines have helped users find the documents they are looking for. But what if users aren't looking for anything specific and instead want a summary of a large document collection, and want to be surprised? One solution to this problem is document clustering. Clustering algorithms group documents that have similar content. Real-life examples of clustering are the clustered search results of Google News, or tag clouds, which group documents under a shared label. Apache Mahout is a framework for scalable machine learning on top of Apache Hadoop and can be used for large-scale document clustering. This talk introduces clustering in general and shows you step by step how to configure Mahout clustering jobs to create a tag cloud from a document collection. This talk is suitable for people who have some experience with Hadoop and perhaps Mahout. Knowledge of clustering is not required.

    Presentation Transcript

    • Configuring Clustering Jobs
      Frank Scholten, DutchWorks
      frank.scholten@dutchworks.nl, 19 October 2011
    • Slide 2: My Background
      Frank Scholten, @Frank_Scholten
      Software Developer at
      Blogger at
      user & contributor
    • Slide 3: Agenda
      What is clustering?
      Intro to
      Clustering
    • Slide 4: Clustering introduction
      So much data...
      How to get a nice overview?
    • Slide 5: What is clustering?
      Grouping & summarizing data
      Unsupervised machine learning
      "...the assignment of a set of observations into subsets so that observations in the same clusters are similar in some sense..." Source: Wikipedia
    • Slide 6: Applications
      Market segmentation
      Species identifications
      Machine vision
      Information retrieval & search
      ...and many more!
    • Slide 7: Example - Google News
    • Slide 8: 2-D Clustering example
      Intra-cluster distance
      Inter-cluster distance
      Legend: Point, Cluster, Cluster Center
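The two distances named on this slide can be made concrete with a small sketch. The slide does not define them, so the definitions here are assumptions: intra-cluster distance is taken as the mean Euclidean distance from each point to its cluster center, and inter-cluster distance as the Euclidean distance between two cluster centers. The class name, method names, and sample points are hypothetical and are not part of Mahout.

```java
/** Illustrative sketch only: intra- and inter-cluster distances for small point sets. */
final class ClusterDistanceSketch {

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Component-wise mean of the points, used as the cluster center. */
    static double[] center(double[][] points) {
        double[] c = new double[points[0].length];
        for (double[] p : points) {
            for (int i = 0; i < c.length; i++) {
                c[i] += p[i];
            }
        }
        for (int i = 0; i < c.length; i++) {
            c[i] /= points.length;
        }
        return c;
    }

    /** Assumed definition: mean distance from each point to the cluster center. */
    static double intraClusterDistance(double[][] points) {
        double[] c = center(points);
        double total = 0.0;
        for (double[] p : points) {
            total += distance(p, c);
        }
        return total / points.length;
    }

    /** Assumed definition: distance between the centers of two clusters. */
    static double interClusterDistance(double[][] a, double[][] b) {
        return distance(center(a), center(b));
    }

    public static void main(String[] args) {
        // Hypothetical 2-D clusters: two squares of side 2, ten units apart.
        double[][] left = {{0, 0}, {0, 2}, {2, 0}, {2, 2}};
        double[][] right = {{10, 0}, {10, 2}, {12, 0}, {12, 2}};
        System.out.println("intra(left) = " + intraClusterDistance(left));
        System.out.println("inter       = " + interClusterDistance(left, right));
    }
}
```

A clustering is usually considered good when the intra-cluster distances are small relative to the inter-cluster distances, as in the sample data above.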
    • Slide 9: K-Means algorithm
      Select K random vectors
      Specify distance measure + threshold
      Every iteration
      ● Add vector closest to cluster
      ● Recompute center
      ● Converged if no vectors within threshold
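The steps on this slide can be sketched in plain Java. This is a minimal in-memory illustration, not Mahout's implementation. Several choices are assumptions: the distance measure is Euclidean, the "threshold" is read as a convergence threshold on how far a center moves between iterations, and the initial centers are passed in explicitly rather than picked at random so that the run is reproducible. All names are hypothetical.

```java
import java.util.Arrays;

/** Illustrative in-memory k-means sketch; not how Mahout implements it. */
final class KMeansSketch {

    /** Euclidean distance, standing in for the configurable distance measure. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Index of the center closest to the given point. */
    static int closest(double[] point, double[][] centers) {
        int best = 0;
        for (int k = 1; k < centers.length; k++) {
            if (distance(point, centers[k]) < distance(point, centers[best])) {
                best = k;
            }
        }
        return best;
    }

    /** Iterates until no center moves more than threshold, or maxIterations is reached. */
    static double[][] cluster(double[][] points, double[][] initialCenters,
                              double threshold, int maxIterations) {
        int dim = points[0].length;
        double[][] centers = new double[initialCenters.length][];
        for (int k = 0; k < centers.length; k++) {
            centers[k] = initialCenters[k].clone();
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] sums = new double[centers.length][dim];
            int[] counts = new int[centers.length];
            // Add every vector to its closest cluster.
            for (double[] p : points) {
                int k = closest(p, centers);
                counts[k]++;
                for (int i = 0; i < dim; i++) {
                    sums[k][i] += p[i];
                }
            }
            // Recompute each center as the mean of its assigned vectors.
            boolean converged = true;
            for (int k = 0; k < centers.length; k++) {
                if (counts[k] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                double[] updated = new double[dim];
                for (int i = 0; i < dim; i++) {
                    updated[i] = sums[k][i] / counts[k];
                }
                if (distance(updated, centers[k]) > threshold) {
                    converged = false;
                }
                centers[k] = updated;
            }
            if (converged) {
                break;
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        // Hypothetical data: two well-separated groups of three points each.
        double[][] points = {{1, 1}, {1, 2}, {2, 1}, {8, 8}, {8, 9}, {9, 8}};
        double[][] start = {{1, 1}, {9, 8}};
        for (double[] c : cluster(points, start, 0.001, 10)) {
            System.out.println(Arrays.toString(c));
        }
    }
}
```

With random initial vectors, as on the slide, different runs can converge to different clusterings, which is one reason the choice of K and of the starting centers matters.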
    • Slide 10: Intro to
    • Slide 11:
      Classification: Is this SPAM?
      Collaborative Filtering
      Clustering
      And much more!
    • Slide 12: The Project
      Apache project started in 2008
      Scalable machine learning, often with Hadoop
      Steadily growing community
      Version 0.6 coming soon
    • Slide 13: bin/mahout
      frank@frankthetank:~$ mahout
      no HADOOP_HOME set, running locally
      An example program must be given as the first argument.
      Valid program names are:
        arff.vector: : Generate Vectors from an ARFF file or directory
        canopy: : Canopy clustering
        cat: : Print a file or resource as the logistic regression models would see it
        ...
    • Slide 14: Help
      frank@frankthetank:~$ mahout kmeans --help
      usage: <command> [Generic Options] [Job-specific Options]
      Generic Options: ...
      Job-specific Options:
        --input (-i) input Path to job input directory.
        --output (-o) output The directory pathname for output.
        ...
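    • Slide 15: Java Drivers
      import org.apache.hadoop.util.ToolRunner;
      import org.apache.mahout.clustering.kmeans.KMeansDriver;

      // Same options as on the command line, passed as an argument array.
      String[] args = new String[] {
        "--input", input,
        "--output", output,
        "--clusters", clusters,
        "--clustering",
        "--numClusters", "10"
      };
      ToolRunner.run(conf, new KMeansDriver(), args);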
    • Slide 16: Text clustering process
      Text files or Lucene index -> Sequence files -> Vectors -> K-means -> Clusters
      Vectors: [ 0.03, 0.95, 0.45, 0.34 ] [ 0.02, 0.98, 0.73, 0.55 ]
      Find n-grams, Dictionary
      Clusters: (CL-1, [0.32, 0.6, .. ] (CL-1, [0.76, 0.1, .. ] (CL-2, [0.98, 0.2, .. ]
      Points: (CL-1,23) (CL-1,37) (CL-2,45)
      Cluster labels: Quick fox, Dog
    • Slide 17: Text clustering programs
      $ mahout seqdirectory: Text files or Lucene index -> Sequence files
      $ mahout seq2sparse: Sequence files -> Vectors [ 0.03, 0.95, .. ] [ 0.29, 0.98, .. ]
      $ mahout kmeans: Vectors -> Clusters
    • Slide 18: Clustering
    • Slide 19: Clustering
      Publicly available monthly dumps
      Posts ~ 5.5 GB
      ~ 1.4 M questions (April 2011)
      Let's use to extract a tag cloud!
    • Slide 20: Clustering
      Pre-process -> Vectorize -> Cluster -> Index
      Pre-process: XML & HTML, Regular expressions, Post ID & Title
      Vectorize: Text, [ 0,1,0,1,1,1,0,0,1,0,1 ] [ 0,1,1,1,1,0,0,0,0,0,1 ]
      Cluster: Java, Git, Lucene, Version control
      Index: Join content & clusters
    • Slide 21: Clustering
      How to implement?
      Pre-process XML: XML & HTML parsing
      Vectorize: Custom Analyzer
      Cluster
      Index
    • Slide 22: Vectorize
      [ 0,1,0,1,1,1,0,0,1,0,1 ] [ 0,1,1,1,1,0,0,0,0,0,1 ]
      Many options and flags
      $ mahout seq2sparse --input .. --output .. --analyzerClass .. --maxDFPercent .. --minDF ..
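To illustrate what document-frequency options such as minDF and maxDFPercent do during vectorization, here is a small in-memory sketch. It is not Mahout code, and it assumes minDF is the minimum number of documents a term must occur in and maxDFPercent the maximum percentage of documents it may occur in. The tokenizer is a crude stand-in for a real analyzer, and the class name, method names, and sample titles are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Illustrative sketch: prune a dictionary by document frequency, then build binary vectors. */
final class VectorizeSketch {

    /** Hypothetical post titles used as sample input. */
    static List<String> sampleDocs() {
        return Arrays.asList(
            "Merge a branch with Git",
            "Git rebase versus merge",
            "Lucene index for Java",
            "Java and Git");
    }

    /** Lower-cases and splits on non-letters; a crude stand-in for a real analyzer. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Sorted terms that occur in at least minDF documents and at most maxDFPercent percent of them. */
    static List<String> dictionary(List<String> docs, int minDF, int maxDFPercent) {
        Map<String, Integer> df = new TreeMap<String, Integer>();
        for (String doc : docs) {
            // Count each term once per document.
            Set<String> unique = new HashSet<String>(tokenize(doc));
            for (String term : unique) {
                Integer n = df.get(term);
                df.put(term, n == null ? 1 : n + 1);
            }
        }
        List<String> terms = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            int count = e.getValue();
            if (count >= minDF && count * 100 <= maxDFPercent * docs.size()) {
                terms.add(e.getKey());
            }
        }
        return terms;
    }

    /** Binary vector: 1 where the document contains the dictionary term at that index. */
    static int[] vectorize(String doc, List<String> dictionary) {
        Set<String> present = new HashSet<String>(tokenize(doc));
        int[] v = new int[dictionary.size()];
        for (int i = 0; i < v.length; i++) {
            v[i] = present.contains(dictionary.get(i)) ? 1 : 0;
        }
        return v;
    }

    public static void main(String[] args) {
        List<String> docs = sampleDocs();
        List<String> dict = dictionary(docs, 2, 50);
        System.out.println("dictionary = " + dict);
        for (String doc : docs) {
            System.out.println(Arrays.toString(vectorize(doc, dict)) + "  " + doc);
        }
    }
}
```

In the sample, terms that occur in only one title are dropped by the minimum, and "git", which occurs in three of the four titles, is dropped by the 50 percent maximum. Pruning very rare and very common terms in this way keeps the vectors small and removes terms that do little to separate clusters.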
    • Slide 23: Cluster
      Run one of the clustering algorithms!
      K-means, Fuzzy K-means, Canopy, Mean-shift, Min-hash, LDA
      All with different pros and cons
    • Slide 24: Index
      Custom code to join data at index time
      Index clusters (cluster_id, cluster_name, size)
      Index posts (post_id, post_cluster_id, title)
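The join described on this slide might look roughly like the following sketch, which builds one field map per cluster and one per post using the field names from the slide. The talk's actual indexing code is not shown in the transcript, so everything else here is an assumption: the class and method names, the sample posts, and the cluster names are hypothetical, and writing the resulting documents to a search index is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch: join clustered posts with cluster names before indexing. */
final class IndexJoinSketch {

    /** One clustered post, with fields named after the slide. */
    static final class Post {
        final int postId;
        final int postClusterId;
        final String title;

        Post(int postId, int postClusterId, String title) {
            this.postId = postId;
            this.postClusterId = postClusterId;
            this.title = title;
        }
    }

    /** Hypothetical clustering output: each post carries the id of its cluster. */
    static List<Post> samplePosts() {
        return Arrays.asList(
            new Post(1, 1, "Merge a branch with Git"),
            new Post(2, 1, "Git rebase versus merge"),
            new Post(3, 2, "Lucene index for Java"));
    }

    /** Hypothetical cluster labels keyed by cluster id. */
    static Map<Integer, String> sampleClusterNames() {
        Map<Integer, String> names = new LinkedHashMap<Integer, String>();
        names.put(1, "git");
        names.put(2, "lucene");
        return names;
    }

    /** One (cluster_id, cluster_name, size) document per cluster that has posts. */
    static List<Map<String, Object>> clusterDocuments(List<Post> posts,
                                                      Map<Integer, String> clusterNames) {
        // Cluster size is the number of posts assigned to the cluster.
        Map<Integer, Integer> sizes = new LinkedHashMap<Integer, Integer>();
        for (Post p : posts) {
            Integer n = sizes.get(p.postClusterId);
            sizes.put(p.postClusterId, n == null ? 1 : n + 1);
        }
        List<Map<String, Object>> docs = new ArrayList<Map<String, Object>>();
        for (Map.Entry<Integer, Integer> e : sizes.entrySet()) {
            Map<String, Object> doc = new LinkedHashMap<String, Object>();
            doc.put("cluster_id", e.getKey());
            doc.put("cluster_name", clusterNames.get(e.getKey()));
            doc.put("size", e.getValue());
            docs.add(doc);
        }
        return docs;
    }

    /** One (post_id, post_cluster_id, title) document per post. */
    static List<Map<String, Object>> postDocuments(List<Post> posts) {
        List<Map<String, Object>> docs = new ArrayList<Map<String, Object>>();
        for (Post p : posts) {
            Map<String, Object> doc = new LinkedHashMap<String, Object>();
            doc.put("post_id", p.postId);
            doc.put("post_cluster_id", p.postClusterId);
            doc.put("title", p.title);
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        List<Post> posts = samplePosts();
        System.out.println(clusterDocuments(posts, sampleClusterNames()));
        System.out.println(postDocuments(posts));
    }
}
```

Storing post_cluster_id on every post document lets a search front end look up the posts behind a tag-cloud entry by its cluster_id, while the size field can drive how large each tag is drawn.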
    • Slide 25: Demo time!
    • Slide 26: Conclusions
      Clustering is fun!
      Vectorization & labeling improvements
      Tools for cluster evaluation?
    • Slide 27: References
      Mahout in Action – Just released! Sean Owen, Ted Dunning, Robin Anil, Ellen Friedman
      {user|dev}@mahout.apache.org
      http://jira.apache.org/MAHOUT
      http://www.searchworkings.org
    • Slide 28: Q&A