Intro to Mahout -- DC Hadoop

Introduction to Apache Mahout -- talk given at DC Hadoop Meetup on April 28

Slide notes:
  • 3C: the three C's -- clustering, classification, and collaborative filtering. FPM: frequent pattern mining. O: other (math, collections, etc.)
  • Convergence just checks how far each centroid has moved from the previous centroid.
  • Transcript

    • 1. Intro to Apache Mahout
      Grant Ingersoll
      Lucid Imagination
      http://www.lucidimagination.com
    • 2. Anyone Here Use Machine Learning?
      Any users of:
      Google? Search? Priority Inbox?
      Facebook?
      Twitter?
      LinkedIn?
    • 3. Topics
      Background and use cases
      What can you do in Mahout?
      Where's the community at?
      Resources
      K-Means in Hadoop (time permitting)
    • 4. Definition
      “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” (Intro. to Machine Learning by E. Alpaydin)
      A subset of Artificial Intelligence
      Lots of related fields: Information Retrieval, Statistics, Biology, Linear Algebra, and many more
    • 5. Common Use Cases
      Recommend friends/dates/products
      Classify content into predefined groups
      Find similar content
      Find associations/patterns in actions/behaviors
      Identify key topics / summarize text (documents and corpora)
      Detect anomalies/fraud
      Rank search results
      Others?
    • 6. Apache Mahout
      An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
      http://mahout.apache.org
      Why Mahout? Many open source ML libraries either lack community, lack documentation and examples, lack scalability, lack the Apache License, or are research-oriented
      Definition of "mahout": http://dictionary.reference.com/browse/mahout
    • 7. What does scalable mean to us?
      Goal: be as fast and efficient as possible given the intrinsic design of the algorithm
      Some algorithms won't scale to massive machine clusters
      Others fit logically on a MapReduce framework like Apache Hadoop
      Still others will need different distributed programming models
      Others are already fast (SGD)
      Be pragmatic
    • 8. Sampling of Who Uses Mahout?
      https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
    • 9. What Can I Do with Mahout Right Now?
      3C + FPM + O = Mahout
    • 10. Collaborative Filtering
      Extensive framework for collaborative filtering (recommenders); see the code sketch below
      Recommenders: user based and item based
      Online and offline support; offline can utilize Hadoop
      Many different similarity measures: Cosine, LLR, Tanimoto, Pearson, others
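      As a hedged sketch of the recommender framework, here is one way to wire up a user-based recommender from Mahout's Taste classes; the ratings.csv path, user ID 42, and neighborhood size of 10 are made-up values for illustration:

        import java.io.File;
        import java.util.List;

        import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
        import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
        import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
        import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
        import org.apache.mahout.cf.taste.model.DataModel;
        import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
        import org.apache.mahout.cf.taste.recommender.RecommendedItem;
        import org.apache.mahout.cf.taste.recommender.Recommender;
        import org.apache.mahout.cf.taste.similarity.UserSimilarity;

        public class UserRecommenderSketch {
          public static void main(String[] args) throws Exception {
            // ratings.csv holds userID,itemID,preference rows (path is hypothetical)
            DataModel model = new FileDataModel(new File("ratings.csv"));
            // Pearson correlation is one of the pluggable similarity measures on the slide
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            // Neighborhood of the 10 most similar users
            UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
            // Top 5 recommendations for user 42
            List<RecommendedItem> items = recommender.recommend(42L, 5);
            for (RecommendedItem item : items) {
              System.out.println(item.getItemID() + " : " + item.getValue());
            }
          }
        }

      Swapping PearsonCorrelationSimilarity for another UserSimilarity implementation is how the "many different similarity measures" point plays out in code; the item-based variant uses an ItemSimilarity with GenericItemBasedRecommender instead.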
    • 11. Clustering
      Document level: group documents based on a notion of similarity
      K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift, EigenCuts (Spectral); all Map/Reduce
      Distance measures: Manhattan, Euclidean, others
      Topic modeling: cluster words across documents to identify topics
      Latent Dirichlet Allocation (M/R)
    • 12. Categorization
      Place new items into predefined categories: sports, politics, entertainment
      Recommenders
      Implementations: Naïve Bayes (M/R), Complementary Naïve Bayes (M/R), Decision Forests (M/R), Logistic Regression (SGD; sequential but fast!)
      See Chapter 17 of Mahout in Action for the Shop It To Me use case: http://awe.sm/5FyNe
    • 13. Freq. Pattern Mining
      Identify frequently co-occurring items
      Useful for: query recommendations (Apple -> iPhone, orange, OS X), related product placement, basket analysis
      Map/Reduce
      http://www.amazon.com
    • 14. Other
      Primitive Collections!
      Collocations (M/R)
      Math library: Vectors, Matrices, etc. (see the sketch below)
      Noise reduction via Singular Value Decomposition (M/R)
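      A small, hedged example of the math library mentioned above, using the Vector types in org.apache.mahout.math; the values are arbitrary:

        import org.apache.mahout.math.DenseVector;
        import org.apache.mahout.math.RandomAccessSparseVector;
        import org.apache.mahout.math.Vector;

        public class VectorSketch {
          public static void main(String[] args) {
            // A dense vector backed by a double[]
            Vector dense = new DenseVector(new double[] {1.0, 2.0, 3.0});

            // A sparse vector of cardinality 10 with two non-zero entries
            Vector sparse = new RandomAccessSparseVector(10);
            sparse.set(0, 4.0);
            sparse.set(7, 0.5);

            // Basic operations: dot product, element-wise addition, norms
            System.out.println("dot     = " + dense.dot(new DenseVector(new double[] {1.0, 0.0, 1.0})));
            System.out.println("plus    = " + dense.plus(new DenseVector(new double[] {0.1, 0.1, 0.1})));
            System.out.println("L2 norm = " + dense.norm(2));
          }
        }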
    • 15. Prepare Data from Raw Content
      Data sources:
      Lucene integration: bin/mahout lucene.vector ...
      Document vectorizer: bin/mahout seqdirectory ... followed by bin/mahout seq2sparse ...
      Programmatically: see the Utils module in Mahout and the Iterator<Vector> classes (and the sketch below)
      Database
      File system
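      For the "programmatically" route, one common pattern is to write Mahout Vectors into a Hadoop SequenceFile of VectorWritable values, which is the input form the clustering jobs consume; in this sketch the output path and feature values are assumptions:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;
        import org.apache.mahout.math.DenseVector;
        import org.apache.mahout.math.Vector;
        import org.apache.mahout.math.VectorWritable;

        public class WriteVectorsSketch {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path output = new Path("input/vectors/part-00000"); // hypothetical path

            // Each record: a document name (Text key) and its feature vector (VectorWritable value)
            SequenceFile.Writer writer =
                new SequenceFile.Writer(fs, conf, output, Text.class, VectorWritable.class);
            try {
              Vector v = new DenseVector(new double[] {1.0, 0.0, 2.5}); // made-up features
              writer.append(new Text("doc-1"), new VectorWritable(v));
            } finally {
              writer.close();
            }
          }
        }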
    • 16. How to: Command Line
      Most algorithms have a Driver program
      $MAHOUT_HOME/bin/mahout.sh helps with most tasks
      Prepare the data (different algorithms require different setup)
      Run the algorithm, on a single node or on Hadoop
      Print out the results or incorporate them into an application
      Several helper classes: LDAPrintTopics, ClusterDumper, etc.
    • 17. What's Happening Now?
      Unified framework for clustering and classification
      0.5 release on the horizon (May?)
      Working towards a 1.0 release by focusing on tests, examples, documentation, and API cleanup and consistency
      Gearing up for Google Summer of Code
      New M/R work for Hidden Markov Models
    • 18. Summary
      Machine learning is all over the web today
      Mahout is about scalable machine learning
      Mahout has functionality for many of today's common machine learning tasks
      Many Mahout implementations use Hadoop
    • 19. Resources
      http://mahout.apache.org
      http://cwiki.apache.org/MAHOUT
      {user|dev}@mahout.apache.org
      http://svn.apache.org/repos/asf/mahout/trunk
      http://hadoop.apache.org
    • 20. Resources
      “Mahout in Action” by Owen, Anil, Dunning, and Friedman: http://awe.sm/5FyNe
      “Introducing Apache Mahout”: http://www.ibm.com/developerworks/java/library/j-mahout/
      “Taming Text” by Ingersoll, Morton, and Farris
      “Programming Collective Intelligence” by Toby Segaran
      “Data Mining: Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank
      “Data-Intensive Text Processing with MapReduce” by Jimmy Lin and Chris Dyer
    • 21. K-Means
      Clustering algorithm
      Nicely parallelizable!
      http://en.wikipedia.org/wiki/K-means_clustering
    • 22. K-Means in Map-Reduce
      Input: Mahout Vectors representing the original content, plus either a predefined set of initial centroids (can come from Canopy) or --k, the number of clusters to produce
      Iterate: do the centroid calculation (more in a moment), then an optional clustering step
      Output: centroids (as Mahout Vectors), and the points for each centroid if the clustering step was taken
    • 23. Map-Reduce Iteration
      Each iteration calculates the centroids using KMeansMapper, KMeansCombiner, and KMeansReducer
      Clustering step: calculate the points for each centroid using KMeansClusterMapper
    • 24. KMeansMapper
      During setup: load the initial centroids (or the centroids from the last iteration)
      Map phase: for each input vector, calculate its distance from each centroid and output the closest one (sketched below)
      Distance measures are pluggable: Manhattan, Euclidean, Squared Euclidean, Cosine, others
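      The assignment step described above can be sketched in plain Java against Mahout's Vector and DistanceMeasure types; this is not the actual KMeansMapper source, and the helper method, sample centroids, and sample point are illustrative:

        import java.util.Arrays;
        import java.util.List;

        import org.apache.mahout.common.distance.DistanceMeasure;
        import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
        import org.apache.mahout.math.DenseVector;
        import org.apache.mahout.math.Vector;

        public class NearestCentroidSketch {

          // Return the index of the centroid closest to the given point.
          static int nearestCentroid(Vector point, List<Vector> centroids, DistanceMeasure measure) {
            int best = -1;
            double bestDistance = Double.MAX_VALUE;
            for (int i = 0; i < centroids.size(); i++) {
              double d = measure.distance(centroids.get(i), point);
              if (d < bestDistance) {
                bestDistance = d;
                best = i;
              }
            }
            return best;
          }

          public static void main(String[] args) {
            DistanceMeasure measure = new EuclideanDistanceMeasure(); // pluggable, as on the slide
            List<Vector> centroids = Arrays.<Vector>asList(
                new DenseVector(new double[] {0.0, 0.0}),
                new DenseVector(new double[] {5.0, 5.0}));
            Vector point = new DenseVector(new double[] {4.0, 6.0});
            // In the real mapper, the chosen cluster and the point would be emitted to the
            // combiner/reducer; here we just print the index.
            System.out.println("closest centroid = " + nearestCentroid(point, centroids, measure));
          }
        }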
    • 25. KMeansReducer
      Setup: load up the clusters, convergence information, and partial sums from KMeansCombiner (more in a moment)
      Reduce phase: sum all the vectors in the cluster to produce a new centroid, check for convergence, and output the cluster (sketched below)
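      The reduce-side update and the convergence check (per the slide note, convergence just measures how far a centroid moved) can be sketched the same way; the convergenceDelta value and sample points below are illustrative, not Mahout defaults:

        import java.util.Arrays;

        import org.apache.mahout.common.distance.DistanceMeasure;
        import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
        import org.apache.mahout.math.DenseVector;
        import org.apache.mahout.math.Vector;

        public class CentroidUpdateSketch {

          // New centroid = mean of the vectors (or combiner partial sums) assigned to the cluster.
          static Vector recompute(Iterable<Vector> assigned, int dimension) {
            Vector sum = new DenseVector(dimension);
            int count = 0;
            for (Vector v : assigned) {
              sum = sum.plus(v);
              count++;
            }
            return sum.divide(count);
          }

          // Converged when the centroid moved no more than convergenceDelta since the last iteration.
          static boolean converged(Vector oldCentroid, Vector newCentroid,
                                   DistanceMeasure measure, double convergenceDelta) {
            return measure.distance(oldCentroid, newCentroid) <= convergenceDelta;
          }

          public static void main(String[] args) {
            Vector oldCentroid = new DenseVector(new double[] {1.0, 1.0});
            Vector newCentroid = recompute(Arrays.<Vector>asList(
                new DenseVector(new double[] {1.0, 2.0}),
                new DenseVector(new double[] {1.2, 0.2})), 2);
            System.out.println("converged = "
                + converged(oldCentroid, newCentroid, new EuclideanDistanceMeasure(), 0.5));
          }
        }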
    • 26. KMeansCombiner
      Just like KMeansReducer, but only produces a partial sum of the cluster based on the data local to the Mapper
    • 27. KMeansClusterMapper
      Some applications only care about what the centroids are, so this step is optional
      Setup: load up the clusters and the DistanceMeasure used
      Map phase: calculate which cluster the point belongs to and output <ClusterId, Vector>
