Apache Mahout: Driving the Yellow Elephant


  • A few things come to mind
  • Convergence just checks to see how far the centroid has moved from the previous centroid

    1. 1. Apache Mahout – Driving the Yellow Elephant<br />Grant Ingersoll<br />TriHUG http://www.trihug.org<br />
    2. 2. Anyone Here Use Machine Learning?<br />Any users of:<br />Google?<br />Search?<br />Priority Inbox?<br />Facebook?<br />Twitter?<br />LinkedIn?<br />
    3. 3. Topics<br />What is Machine Learning?<br />ML Use Cases<br />What is Mahout?<br />A Word on Scaling<br />What can I do with it right now?<br />Mahout and Hadoop: An Example<br />
    4. 4. Amazon.com<br />What is Machine Learning?<br />Google News<br />
    5. 5. Really it’s…<br />“Machine Learning is programming computers to optimize a performance criterion using example data or past experience”<br />Intro. To Machine Learning by E. Alpaydin<br />Subset of Artificial Intelligence<br />Lots of related fields:<br />Information Retrieval<br />Stats<br />Biology<br />Linear algebra<br />Many more<br />
    6. 6. Common Use Cases<br />Recommend friends/dates/products<br />Classify content into predefined groups<br />Find similar content based on object properties<br />Find associations/patterns in actions/behaviors<br />Identify key topics in large collections of text<br />Detect anomalies in machine output<br />Ranking search results<br />Others?<br />
    7. 7. Apache Mahout<br />http://dictionary.reference.com/browse/mahout<br />An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License<br />http://mahout.apache.org<br />Why Mahout?<br />Many Open Source ML libraries either:<br />Lack Community<br />Lack Documentation and Examples<br />Lack Scalability<br />Lack the Apache License ;-)<br />Or are research-oriented<br />
    8. 8. Who uses Mahout?<br />https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout<br />
    9. 9. What does scalable mean?<br />Ted Dunning (Mahout committer):<br />As data grows linearly, either scale linearly in time or in machines<br />2X data requires 2X time or 2X machines (or less!)<br />Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm<br />Some algorithms won’t scale to massive machine clusters<br />Others fit logically on a Map Reduce framework like Apache Hadoop<br />Still others will need different distributed programming models<br />Be pragmatic<br />
    10. 10. What Can I do with Mahout Right Now?<br />
    11. 11. Recommendations<br />Extensive framework for collaborative filtering<br />Recommenders<br />User based<br />Item based<br />Online and Offline support<br />Offline can utilize Hadoop<br />Many different Similarity measures<br />Cosine, LLR, Tanimoto, Pearson, others<br />It’s Valentine’s Day soon!<br />
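To make the similarity measures listed on this slide more concrete, here is a minimal Python sketch of two of them, cosine and Tanimoto. It is illustrative only and is not Mahout's Java implementation: the function names, the dict-based rating vectors, and the sample users `alice` and `bob` are all assumptions made for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse rating vectors (dicts of item -> rating)."""
    shared = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_a * norm_b)

def tanimoto_similarity(items_a, items_b):
    """Tanimoto coefficient: size of the intersection divided by size of the union."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Hypothetical users and ratings, invented for this example.
alice = {"book": 5.0, "film": 3.0}
bob = {"book": 4.0, "game": 2.0}

print(cosine_similarity(alice, bob))
print(tanimoto_similarity(alice, bob))
```

Cosine uses the rating values, while Tanimoto looks only at which items each user touched, which is why it suits data without explicit ratings.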
    12. 12. Clustering<br />Document level<br />Group documents based on a notion of similarity<br />K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift<br />Distance Measures<br />Manhattan, Euclidean, other<br />Topic Modeling <br />Cluster words across documents to identify topics<br />Latent Dirichlet Allocation<br />
    13. 13. Categorization<br />Place new items into predefined categories:<br />Sports, politics, entertainment<br />Recommenders<br />Implementations<br />Naïve Bayes<br />Complementary Naïve Bayes<br />Decision Forests<br />Logistic Regression<br />See Chapter 17 of Mahout in Action for the Shop It To Me use case: http://awe.sm/5FyNe<br />
    14. 14. Frequent Pattern Mining<br />Identify frequently co-occurring items<br />Useful for:<br />Query Recommendations<br />Apple -> iPhone, orange, OS X<br />Related product placement<br />Basket Analysis<br />http://www.amazon.com<br />
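The idea of finding frequently co-occurring items can be sketched with plain pair counting. This is a deliberately simplified stand-in written in Python, not the algorithm Mahout uses; the function name `frequent_pairs`, the `min_support` parameter, and the sample baskets are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Count how often each pair of items shares a basket; keep pairs seen at least min_support times."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so that each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical shopping baskets, invented for this example.
baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["diapers", "wipes"],
    ["beer", "chips"],
]

pairs = frequent_pairs(baskets, min_support=2)
print(pairs)  # {('beer', 'chips'): 2, ('beer', 'diapers'): 2}
```

Counting every pair grows quadratically with basket size, which is one reason large-scale pattern mining relies on more careful algorithms.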
    15. 15. Evolutionary<br />Map-Reduce ready fitness functions for genetic programming<br />Integration with Watchmaker<br />http://watchmaker.uncommons.org/index.php<br />Problems solved:<br />Traveling salesman<br />Class discovery<br />Many others<br />Caveat: Hasn’t received as much attention as others<br />
    16. 16. Other<br />Primitive Collections!<br />Math library<br />Vectors, Matrices, etc.<br />Noise Reduction via Singular Value Decomposition<br />Export from Lucene/Solr and other formats<br />
    17. 17. Mahout and Hadoop<br />Most Mahout implementations are built on Map-Reduce<br />Many also have sequential implementations<br />Logistic Regression is blazingly fast without needing M/R<br />Let’s look at how K-Means is implemented in Mahout<br />
    18. 18. K-Means<br />Clustering Algorithm<br />Nicely parallelizable!<br />http://en.wikipedia.org/wiki/K-means_clustering<br />
    19. 19. K-Means in Map-Reduce<br />Input:<br />Mahout Vectors representing the original content<br />Either:<br />A predefined set of initial centroids (Can be from Canopy)<br />--k – The number of clusters to produce<br />Iterate<br />Do the centroid calculation (more in a moment)<br />Clustering Step (optional)<br />Output<br />Centroids (as Mahout Vectors)<br />Points for each Centroid (if Clustering Step was taken)<br />
    20. 20. Map-Reduce Iteration<br />Each Iteration calculates the Centroids using:<br />KMeansMapper<br />KMeansCombiner<br />KMeansReducer<br />Clustering Step<br />Calculate the points for each Centroid using:<br />KMeansClusterMapper<br />
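The overall flow (iterate the centroid calculation until convergence, then optionally assign points to clusters) can be sketched sequentially in Python. This is a single-machine illustration under stated assumptions, not Mahout's Map-Reduce driver: the names `run_kmeans` and `assign_points`, the default `convergence_delta` and `max_iterations` values, and the choice to leave empty clusters unchanged are all invented for this example.

```python
import math

def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest(point, centroids):
    """Index of the centroid nearest to the point."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))

def run_kmeans(points, centroids, convergence_delta=0.001, max_iterations=10):
    """Repeat the centroid calculation until no centroid moves more than convergence_delta."""
    for _ in range(max_iterations):
        groups = {}
        for p in points:
            groups.setdefault(closest(p, centroids), []).append(p)
        updated = []
        for i, old in enumerate(centroids):
            members = groups.get(i)
            if not members:
                updated.append(old)  # assumption: an empty cluster keeps its centroid
                continue
            updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
        converged = all(distance(o, n) <= convergence_delta
                        for o, n in zip(centroids, updated))
        centroids = updated
        if converged:
            break
    return centroids

def assign_points(points, centroids):
    """Optional clustering step: pair every point with its cluster id."""
    return [(closest(p, centroids), p) for p in points]

points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 8.0)]
final = run_kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(final)  # [(1.0, 1.5), (9.5, 8.5)]
print(assign_points(points, final))
```

Skipping `assign_points` mirrors the case where an application only needs the centroids.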
    21. 21. KMeansMapper<br />During Setup:<br />Load the initial Centroids (or the Centroids from the last iteration)<br />Map Phase<br />For each input<br />Calculate its distance from each Centroid and output the closest one<br />Distance Measures are pluggable<br />Manhattan, Euclidean, Squared Euclidean, Cosine, others<br />
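The map phase described on this slide amounts to a nearest-centroid lookup with a swappable distance function. The Python sketch below shows that idea only; it is not the KMeansMapper source, and the function names, the dict of centroids keyed by cluster id, and the sample values are assumptions.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def kmeans_map(point, centroids, distance=euclidean):
    """Emit (cluster_id, point) for the centroid closest to the point.

    `centroids` maps cluster id -> centroid vector and stands in for the
    centroids loaded during setup; `distance` is the pluggable measure.
    """
    cluster_id = min(centroids, key=lambda cid: distance(point, centroids[cid]))
    return cluster_id, point

# Hypothetical centroids from a previous iteration.
centroids = {0: (0.0, 0.0), 1: (10.0, 10.0)}

print(kmeans_map((1.0, 2.0), centroids))             # (0, (1.0, 2.0))
print(kmeans_map((9.0, 8.0), centroids, manhattan))  # (1, (9.0, 8.0))
```

Passing the distance function as a parameter is what makes the measure pluggable without touching the assignment logic.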
    22. 22. KMeansReducer<br />Setup:<br />Load up clusters<br />Convergence information<br />Partial sums from KMeansCombiner (more in a moment)<br />Reduce Phase<br />Sum all the vectors in the cluster and divide by the number of points to produce a new Centroid<br />Check for Convergence<br />Output cluster<br />
    23. 23. KMeansCombiner<br />A Combiner is like a Map-side Reducer that helps save on IO<br />Just like KMeansReducer, but produces only a partial sum of the cluster based on the data local to the Mapper<br />
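How the combiner's partial sums feed the reducer can be sketched as follows. This is a Python illustration of the idea, not the KMeansCombiner or KMeansReducer code: the `(vector sum, count)` pair format, the function names, and the `convergence_delta` threshold are assumptions. The convergence test follows the slide note that it checks how far the centroid has moved from the previous one.

```python
import math

def combine(points):
    """Map-side partial result for one cluster: (vector sum, number of points)."""
    total = [0.0] * len(points[0])
    for p in points:
        for i, x in enumerate(p):
            total[i] += x
    return total, len(points)

def reduce_cluster(partials, old_centroid, convergence_delta):
    """Merge partial sums into a new centroid and report whether it has converged."""
    dims = len(old_centroid)
    total = [0.0] * dims
    count = 0
    for partial_sum, n in partials:
        for i in range(dims):
            total[i] += partial_sum[i]
        count += n
    new_centroid = [x / count for x in total]
    # Converged when the centroid moved no further than the threshold.
    moved = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_centroid, old_centroid)))
    return new_centroid, moved <= convergence_delta

# Two hypothetical mappers saw points belonging to the same cluster.
partials = [combine([(1.0, 1.0), (3.0, 3.0)]), combine([(2.0, 5.0)])]
centroid, converged = reduce_cluster(partials, old_centroid=[2.0, 2.0], convergence_delta=0.5)
print(centroid, converged)  # [2.0, 3.0] False
```

Because each mapper ships one sum and one count per cluster rather than every point, far less data crosses the network before the reduce phase.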
    24. 24. KMeansClusterMapper<br />Some applications only care about what the Centroids are, so this step is optional<br />Setup:<br />Load up the clusters and the DistanceMeasure used<br />Map Phase<br />Calculate which Cluster the point belongs to<br />Output <ClusterId, Vector><br />
    25. 25. Summary<br />Machine learning is all over the web today<br />Mahout is about scalable machine learning<br />Mahout has functionality for many of today’s common machine learning tasks<br />Many Mahout implementations use Hadoop<br />KMeans clustering is an example of a machine learning algorithm in Mahout that is implemented using Map Reduce<br />
    26. 26. Resources<br />http://mahout.apache.org<br />http://cwiki.apache.org/MAHOUT<br />{user|dev}@mahout.apache.org<br />http://svn.apache.org/repos/asf/mahout/trunk<br />http://hadoop.apache.org<br />
    27. 27. Resources<br />“Mahout in Action” by Owen, Anil, Dunning and Friedman<br />http://awe.sm/5FyNe<br />“Introducing Apache Mahout” <br />http://www.ibm.com/developerworks/java/library/j-mahout/<br />“Programming Collective Intelligence” by Toby Segaran<br />“Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank<br />
    28. 28. References<br />HAL: http://en.wikipedia.org/wiki/File:Hal-9000.jpg<br />Terminator: http://en.wikipedia.org/wiki/File:Terminator1984movieposter.jpg<br />Matrix: http://en.wikipedia.org/wiki/File:The_Matrix_Poster.jpg<br />Google News: http://news.google.com<br />Amazon.com: http://www.amazon.com<br />Facebook: http://www.facebook.com<br />Couple: http://www.vlemx.com/<br />Beer and Diapers: http://www.flickr.com/photos/baubcat/2484459070/<br />http://www.theregister.co.uk/2006/08/15/beer_diapers/<br />DMOZ: http://www.dmoz.org<br />Shopping Cart: http://themeanestmom.blogspot.com/2010/09/shopping-carts.html<br />