
Intro to Mahout -- DC Hadoop

Introduction to Apache Mahout -- talk given at DC Hadoop Meetup on April 28


  1. Intro to Apache Mahout
     Grant Ingersoll
     Lucid Imagination
     http://www.lucidimagination.com
  2. Anyone Here Use Machine Learning?
     Any users of:
     - Google? (Search, Priority Inbox)
     - Facebook?
     - Twitter?
     - LinkedIn?
  3. Topics
     - Background and Use Cases
     - What can you do in Mahout?
     - Where’s the community at?
     - Resources
     - K-Means in Hadoop (time permitting)
  4. Definition
     “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
     -- Intro. to Machine Learning by E. Alpaydin
     A subset of Artificial Intelligence, with lots of related fields:
     - Information Retrieval
     - Stats
     - Biology
     - Linear algebra
     - Many more
  5. Common Use Cases
     - Recommend friends/dates/products
     - Classify content into predefined groups
     - Find similar content
     - Find associations/patterns in actions/behaviors
     - Identify key topics/summarize text (documents and corpora)
     - Detect anomalies/fraud
     - Ranking search results
     - Others?
  6. Apache Mahout
     An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
     http://mahout.apache.org
     Why Mahout? Many open source ML libraries either:
     - Lack community
     - Lack documentation and examples
     - Lack scalability
     - Lack the Apache License
     - Or are research-oriented
     Definition: http://dictionary.reference.com/browse/mahout
  7. What does scalable mean to us?
     Goal: be as fast and efficient as possible given the intrinsic design of the algorithm
     - Some algorithms won’t scale to massive machine clusters
     - Others fit logically on a MapReduce framework like Apache Hadoop
     - Still others will need different distributed programming models
     - Others are already fast (SGD)
     Be pragmatic
  8. Sampling of Who Uses Mahout
     https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
  9. What Can I Do with Mahout Right Now?
     3C + FPM + O = Mahout
     (Collaborative Filtering, Clustering, Categorization + Frequent Pattern Mining + Other)
 10. Collaborative Filtering
     Extensive framework for collaborative filtering (recommenders)
     - User-based and item-based recommenders
     - Online and offline support; offline can utilize Hadoop
     - Many different similarity measures: Cosine, LLR, Tanimoto, Pearson, others
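
A minimal sketch of a user-based recommender built from the pieces listed above, using class names from Mahout's Taste API (org.apache.mahout.cf.taste). The prefs.csv file (one userID,itemID,rating triple per line), the neighborhood size, and the user ID are made-up example inputs:

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
  public static void main(String[] args) throws Exception {
    // Preferences as userID,itemID,rating -- one triple per line (hypothetical file)
    DataModel model = new FileDataModel(new File("prefs.csv"));

    // Pearson correlation between users; Tanimoto, LLR, cosine, etc. are also available
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

    // Neighborhood of the 10 most similar users
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 5 recommendations for user 1
    List<RecommendedItem> items = recommender.recommend(1L, 5);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}
```

Swapping in an item-based recommender or a different similarity measure is mostly a matter of constructing different implementation classes.
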
 11. Clustering
     Document level:
     - Group documents based on a notion of similarity
     - K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift, EigenCuts (Spectral)
     - All Map/Reduce
     - Distance Measures: Manhattan, Euclidean, others
     Topic Modeling:
     - Cluster words across documents to identify topics
     - Latent Dirichlet Allocation (M/R)
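
Since the distance measures above are pluggable, here is a toy sketch of comparing two "document" vectors with two of them; the vector values are invented and the class names come from Mahout's math and common.distance packages:

```java
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.common.distance.ManhattanDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class DistanceSketch {
  public static void main(String[] args) {
    // Two toy document vectors (e.g., term weights for three terms)
    Vector doc1 = new DenseVector(new double[] {1.0, 0.0, 2.0});
    Vector doc2 = new DenseVector(new double[] {0.0, 1.0, 2.0});

    DistanceMeasure euclidean = new EuclideanDistanceMeasure();
    DistanceMeasure manhattan = new ManhattanDistanceMeasure();

    System.out.println("Euclidean: " + euclidean.distance(doc1, doc2)); // sqrt(2)
    System.out.println("Manhattan: " + manhattan.distance(doc1, doc2)); // 2.0
  }
}
```

The clustering jobs take the measure as a parameter, which is what keeps the measures interchangeable.
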
 12. Categorization
     Place new items into predefined categories:
     - Sports, politics, entertainment
     - Recommenders
     Implementations:
     - Naïve Bayes (M/R)
     - Complementary Naïve Bayes (M/R)
     - Decision Forests (M/R)
     - Logistic Regression via SGD (sequential, but fast!)
     See Chapter 17 of Mahout in Action for the Shop It To Me use case: http://awe.sm/5FyNe
 13. Freq. Pattern Mining
     Identify frequently co-occurring items
     Useful for:
     - Query recommendations (Apple -> iPhone, orange, OS X)
     - Related product placement
     - Basket analysis
     Map/Reduce
     http://www.amazon.com
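
Purely to illustrate the idea of "frequently co-occurring items" (this is not Mahout's parallel FP-Growth job, just a plain-Java toy with invented baskets):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CooccurrenceSketch {
  public static void main(String[] args) {
    // Each inner list is one shopping basket / query session (made-up data)
    List<List<String>> baskets = Arrays.asList(
        Arrays.asList("iphone", "case", "charger"),
        Arrays.asList("iphone", "charger"),
        Arrays.asList("case", "charger"));

    // Count how often each unordered pair of items appears in the same basket
    Map<String, Integer> pairCounts = new HashMap<String, Integer>();
    for (List<String> basket : baskets) {
      for (int i = 0; i < basket.size(); i++) {
        for (int j = i + 1; j < basket.size(); j++) {
          String pair = basket.get(i) + "+" + basket.get(j);
          Integer count = pairCounts.get(pair);
          pairCounts.put(pair, count == null ? 1 : count + 1);
        }
      }
    }

    // High-count pairs are candidates for query suggestions or product placement
    System.out.println(pairCounts);
  }
}
```
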
 14. Other
     - Primitive Collections!
     - Collocations (M/R)
     - Math library: Vectors, Matrices, etc.
     - Noise Reduction via Singular Value Decomposition (M/R)
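
A small taste of the math library mentioned above; the matrix and vector values are arbitrary:

```java
import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;

public class MathSketch {
  public static void main(String[] args) {
    Matrix a = new DenseMatrix(new double[][] {
        {1.0, 2.0},
        {3.0, 4.0}});
    Vector x = new DenseVector(new double[] {1.0, 1.0});

    Vector y = a.times(x);        // matrix-vector product: [3.0, 7.0]
    System.out.println(y);
    System.out.println(x.dot(x)); // dot product: 2.0
  }
}
```
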
 15. Prepare Data from Raw Content
     Data Sources:
     - Lucene integration: bin/mahout lucene.vector…
     - Document Vectorizer: bin/mahout seqdirectory … then bin/mahout seq2sparse …
     - Programmatically: see the Utils module in Mahout and the Iterator<Vector> classes
     - Database
     - File system
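
For the programmatic route, a hedged sketch of writing vectors into the <Text, VectorWritable> SequenceFile format that the clustering jobs read; the output path, cardinality, and feature values are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class VectorWriterSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("input/vectors/part-00000");   // hypothetical location

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, Text.class, VectorWritable.class);
    try {
      // One sparse vector per document: index = term id, value = weight
      Vector doc = new RandomAccessSparseVector(10000);
      doc.set(7, 2.0);
      doc.set(42, 1.0);
      writer.append(new Text("doc-1"), new VectorWritable(doc));
    } finally {
      writer.close();
    }
  }
}
```
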
 16. How to: Command Line
     Most algorithms have a Driver program
     - $MAHOUT_HOME/bin/mahout.sh helps with most tasks
     Prepare the data (different algorithms require different setup)
     Run the algorithm
     - Single node or Hadoop
     Print out the results or incorporate them into an application
     - Several helper classes: LDAPrintTopics, ClusterDumper, etc.
 17. What’s Happening Now?
     - Unified framework for Clustering and Classification
     - 0.5 release on the horizon (May?)
     - Working towards a 1.0 release by focusing on: tests, examples, documentation; API cleanup and consistency
     - Gearing up for Google Summer of Code
     - New M/R work for Hidden Markov Models
 18. Summary
     - Machine learning is all over the web today
     - Mahout is about scalable machine learning
     - Mahout has functionality for many of today’s common machine learning tasks
     - Many Mahout implementations use Hadoop
 19. Resources
     - http://mahout.apache.org
     - http://cwiki.apache.org/MAHOUT
     - {user|dev}@mahout.apache.org
     - http://svn.apache.org/repos/asf/mahout/trunk
     - http://hadoop.apache.org
 20. Resources
     - “Mahout in Action” by Owen, Anil, Dunning, and Friedman -- http://awe.sm/5FyNe
     - “Introducing Apache Mahout” -- http://www.ibm.com/developerworks/java/library/j-mahout/
     - “Taming Text” by Ingersoll, Morton, and Farris
     - “Programming Collective Intelligence” by Toby Segaran
     - “Data Mining: Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank
     - “Data-Intensive Text Processing with MapReduce” by Jimmy Lin and Chris Dyer
 21. K-Means
     - Clustering algorithm
     - Nicely parallelizable!
     - http://en.wikipedia.org/wiki/K-means_clustering
 22. K-Means in Map-Reduce
     Input:
     - Mahout Vectors representing the original content
     - Either a predefined set of initial centroids (can come from Canopy), or --k, the number of clusters to produce
     Iterate:
     - Do the centroid calculation (more in a moment)
     - Clustering step (optional)
     Output:
     - Centroids (as Mahout Vectors)
     - Points for each centroid (if the clustering step was taken)
 23. Map-Reduce Iteration
     Each iteration calculates the centroids using:
     - KMeansMapper
     - KMeansCombiner
     - KMeansReducer
     Clustering step: calculate the points for each centroid using:
     - KMeansClusterMapper
 24. KMeansMapper
     During setup:
     - Load the initial Centroids (or the Centroids from the last iteration)
     Map phase:
     - For each input vector, calculate its distance from each Centroid and output the closest one
     - Distance Measures are pluggable: Manhattan, Euclidean, Squared Euclidean, Cosine, others
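
A simplified, non-authoritative sketch of the map-side logic described above (the real KMeansMapper handles more bookkeeping); the cluster index returned here would be the output key and the point the output value:

```java
import java.util.List;

import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.math.Vector;

public class NearestCentroid {
  /** Returns the index of the centroid closest to the given point. */
  static int closest(Vector point, List<Vector> centroids, DistanceMeasure measure) {
    int best = -1;
    double bestDistance = Double.MAX_VALUE;
    for (int i = 0; i < centroids.size(); i++) {
      double d = measure.distance(centroids.get(i), point);
      if (d < bestDistance) {
        bestDistance = d;
        best = i;
      }
    }
    return best;
  }
}
```
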
 25. KMeansReducer
     Setup:
     - Load up clusters
     - Convergence information
     - Partial sums from KMeansCombiner (more in a moment)
     Reduce phase:
     - Sum all the vectors in the cluster to produce a new Centroid
     - Check for convergence
     - Output the cluster
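
Likewise, a simplified sketch of what the reduce step computes, not the actual KMeansReducer source: the new centroid is the mean of the vectors assigned to the cluster, and the cluster is considered converged when the centroid moves less than some threshold:

```java
import java.util.List;

import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.math.Vector;

public class CentroidUpdate {
  /** Recomputes a centroid as the mean of its assigned points. */
  static Vector newCentroid(List<Vector> assignedPoints) {
    Vector sum = assignedPoints.get(0).like();   // empty vector with the same cardinality
    for (Vector point : assignedPoints) {
      sum = sum.plus(point);
    }
    return sum.divide(assignedPoints.size());
  }

  /** True when the centroid moved less than the given convergence threshold. */
  static boolean converged(Vector oldCentroid, Vector updated,
                           DistanceMeasure measure, double threshold) {
    return measure.distance(oldCentroid, updated) < threshold;
  }
}
```
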
 26. KMeansCombiner
     Just like KMeansReducer, but only produces a partial sum of the cluster based on the data local to the Mapper
 27. KMeansClusterMapper
     Some applications only care about what the Centroids are, so this step is optional
     Setup:
     - Load up the clusters and the DistanceMeasure used
     Map phase:
     - Calculate which Cluster the point belongs to
     - Output <ClusterId, Vector>
