Apache Mahout: Driving the Yellow Elephant
  • A few things come to mind
  • Convergence just checks to see how far the centroid has moved from the previous centroid
  • Transcript

    • 1. Apache Mahout – Driving the Yellow Elephant
      Grant Ingersoll
      TriHUG http://www.trihug.org
    • 2. Anyone Here Use Machine Learning?
      Any users of:
      Google?
      Search?
      Priority Inbox?
      Facebook?
      Twitter?
      LinkedIn?
    • 3. Topics
      What is Machine Learning?
      ML Use Cases
      What is Mahout?
      A Word on Scaling
      What can I do with it right now?
      Mahout and Hadoop: An Example
    • 4. What is Machine Learning?
      Amazon.com
      Google News
    • 5. Really it’s…
      “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
      Introduction to Machine Learning by E. Alpaydin
      Subset of Artificial Intelligence
      Lots of related fields:
      Information Retrieval
      Stats
      Biology
      Linear algebra
      Many more
    • 6. Common Use Cases
      Recommend friends/dates/products
      Classify content into predefined groups
      Find similar content based on object properties
      Find associations/patterns in actions/behaviors
      Identify key topics in large collections of text
      Detect anomalies in machine output
      Ranking search results
      Others?
    • 7. Apache Mahout
      http://dictionary.reference.com/browse/mahout
      An Apache Software Foundation project to create scalable machine learning libraries under the Apache License
      http://mahout.apache.org
      Why Mahout?
      Many Open Source ML libraries either:
      Lack Community
      Lack Documentation and Examples
      Lack Scalability
      Lack the Apache License ;-)
      Or are research-oriented
    • 8. Who uses Mahout?
      https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
    • 9. What does scalable mean?
      Ted Dunning (Mahout committer):
      As data grows linearly, either scale linearly in time or in machines
      2X data requires 2X time or 2X machines (or less!)
      Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm
      Some algorithms won’t scale to massive machine clusters
      Others fit logically on a Map Reduce framework like Apache Hadoop
      Still others will need different distributed programming models
      Be pragmatic
    • 10. What Can I do with Mahout Right Now?
    • 11. Recommendations
      Extensive framework for collaborative filtering
      Recommenders
      User based
      Item based
      Online and Offline support
      Offline can utilize Hadoop
      Many different Similarity measures
      Cosine, LLR, Tanimoto, Pearson, others
      It’s Valentine’s Day soon!
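
To make the similarity measures named on this slide concrete, here is a minimal Python sketch of three of them (cosine, Tanimoto, Pearson). It is an illustration only, not Mahout's API: the function names and the plain-list inputs are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)


def tanimoto_similarity(items_a, items_b):
    """Tanimoto (Jaccard) coefficient: shared items over all items."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def pearson_correlation(a, b):
    """Pearson correlation between two equal-length rating vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # undefined for constant vectors; 0.0 is an assumption
    return cov / math.sqrt(var_a * var_b)
```

Cosine and Pearson work on rating values, while Tanimoto only looks at which items two users have in common, so it suits data with no ratings at all (clicks, purchases).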
    • 12. Clustering
      Document level
      Group documents based on a notion of similarity
      K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift
      Distance Measures
      Manhattan, Euclidean, other
      Topic Modeling
      Cluster words across documents to identify topics
      Latent Dirichlet Allocation
    • 13. Categorization
      Place new items into predefined categories:
      Sports, politics, entertainment
      Recommenders
      Implementations
      Naïve Bayes
      Compl. Naïve Bayes
      Decision Forests
      Linear Regression
      See Chapter 17 of Mahout in Action for Shop It To Me use case:
      http://awe.sm/5FyNe
    • 14. Freq. Pattern Mining
      Identify frequently co-occurrent items
      Useful for:
      Query Recommendations
      Apple -> iPhone, orange, OS X
      Related product placement
      Basket Analysis
      http://www.amazon.com
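
The idea of "frequently co-occurrent items" can be sketched as naive pair counting over shopping baskets. This is not how Mahout does it (Mahout ships a parallel FP-Growth implementation); the function names, the `min_support` parameter, and the sample baskets are all made up for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


def frequent_pairs(baskets, min_support):
    """Keep only the pairs that appear together in at least min_support baskets."""
    return {
        pair: n
        for pair, n in cooccurrence_counts(baskets).items()
        if n >= min_support
    }


# Hypothetical sample data
baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["diapers", "wipes"],
    ["beer", "chips"],
]
result = frequent_pairs(baskets, min_support=2)
```

Counting every pair is quadratic in basket size, which is one reason a tree-based approach such as FP-Growth is preferred at scale.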
    • 15. Evolutionary
      Map-Reduce ready fitness functions for genetic programming
      Integration with Watchmaker
      http://watchmaker.uncommons.org/index.php
      Problems solved:
      Traveling salesman
      Class discovery
      Many others
      Caveat: Hasn’t received as much attention as others
    • 16. Other
      Primitive Collections!
      Math library
      Vectors, Matrices, etc.
      Noise Reduction via Singular Value Decomposition
      Export from Lucene/Solr and other formats
    • 17. Mahout and Hadoop
      Most Mahout implementations are built on Map-Reduce
      Many also have sequential implementations
      Linear Regression is blazingly fast without needing M/R
      Let’s look at how K-Means is implemented in Mahout
    • 18. K-Means
      Clustering Algorithm
      Nicely parallelizable!
      http://en.wikipedia.org/wiki/K-means_clustering
    • 19. K-Means in Map-Reduce
      Input:
      Mahout Vectors representing the original content
      Either:
      A predefined set of initial centroids (Can be from Canopy)
      --k – The number of clusters to produce
      Iterate
      Do the centroid calculation (more in a moment)
      Clustering Step (optional)
      Output
      Centroids (as Mahout Vectors)
      Points for each Centroid (if Clustering Step was taken)
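
The overall flow on this slide (initial centroids in, iterate the centroid calculation, optional clustering step, centroids and points out) can be sketched on a single machine as below. This is a plain Python illustration, not Mahout code: the names `kmeans`, `max_iterations`, and `convergence_delta` are assumptions, and Euclidean distance is hard-coded for brevity.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: euclidean(point, centroids[i]))


def kmeans(points, centroids, max_iterations=10, convergence_delta=0.001):
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iterations):
        # Centroid calculation: group points by closest centroid, then average
        groups = {i: [] for i in range(len(centroids))}
        for p in points:
            groups[nearest(p, centroids)].append(p)
        new_centroids = []
        for i, old in enumerate(centroids):
            members = groups[i]
            if not members:
                new_centroids.append(old)  # keep an empty cluster's centroid
                continue
            new_centroids.append(
                tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
            )
        # Converged when no centroid moved more than convergence_delta
        converged = all(
            euclidean(o, n) <= convergence_delta
            for o, n in zip(centroids, new_centroids)
        )
        centroids = new_centroids
        if converged:
            break
    # Optional clustering step: assign every point to its final centroid
    assignments = [nearest(p, centroids) for p in points]
    return centroids, assignments
```

In the Map-Reduce version each pass of the loop becomes one job, and the final assignment pass becomes the optional clustering step.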
    • 20. Map-Reduce Iteration
      Each Iteration calculates the Centroids using:
      KMeansMapper
      KMeansCombiner
      KMeansReducer
      Clustering Step
      Calculate the points for each Centroid using:
      KMeansClusterMapper
    • 21. KMeansMapper
      During Setup:
      Load the initial Centroids (or the Centroids from the last iteration)
      Map Phase
      For each input
      Calculate its distance from each Centroid and output the closest one
      Distance Measures are pluggable
      Manhattan, Euclidean, Squared Euclidean, Cosine, others
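
A rough Python sketch of what the map phase described here does: given the centroids loaded during setup, find the closest one for each input vector using whichever distance measure was plugged in. The function names and the dict-of-centroids layout are assumptions for illustration, not Mahout's classes.

```python
import math


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    return math.sqrt(squared_euclidean(a, b))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 1.0  # convention chosen for this sketch
    return 1.0 - dot / norm


def kmeans_map(point, centroids, distance=euclidean):
    """Return (cluster_id, point) for the centroid closest to the point.

    centroids is a dict of {cluster_id: vector}, standing in for the
    centroids loaded during mapper setup.
    """
    cluster_id = min(centroids, key=lambda cid: distance(point, centroids[cid]))
    return cluster_id, point
```

The choice of measure matters: with centroids `{"a": (5.0, 0.0), "b": (3.0, 3.0)}` and the point `(0.0, 0.0)`, Manhattan distance picks `"a"` while Euclidean distance picks `"b"`.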
    • 22. KMeansReducer
      Setup:
      Load up clusters
      Convergence information
      Partial sums from KMeansCombiner (more in a moment)
      Reduce Phase
      Sum all the vectors in the cluster (and divide by the number of points) to produce a new Centroid
      Check for Convergence
      Output cluster
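
The reduce phase can be sketched as follows: merge the partial sums for one cluster, divide by the point count to get the new centroid, and call the cluster converged if the centroid barely moved. This is an illustrative Python sketch; `kmeans_reduce`, the `(vector_sum, count)` tuple layout, and `convergence_delta` are assumed names, not Mahout's.

```python
import math


def kmeans_reduce(old_centroid, partials, convergence_delta=0.001):
    """Merge (vector_sum, count) partials for one cluster.

    Returns (new_centroid, converged).
    """
    total = [0.0] * len(old_centroid)
    count = 0
    for vector_sum, n in partials:
        for d, value in enumerate(vector_sum):
            total[d] += value
        count += n
    if count == 0:
        # No points assigned: keep the old centroid (a choice made for this sketch)
        return tuple(old_centroid), True
    new_centroid = tuple(value / count for value in total)
    # Convergence: how far did the centroid move from the previous one?
    moved = math.sqrt(
        sum((old - new) ** 2 for old, new in zip(old_centroid, new_centroid))
    )
    return new_centroid, moved <= convergence_delta
```

For example, `kmeans_reduce((0.0, 0.0), [((2.0, 4.0), 2), ((4.0, 2.0), 1)])` merges three points into the centroid `(2.0, 2.0)` and reports that the cluster has not yet converged.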
    • 23. KMeansCombiner
      A Combiner is like a Map-side Reducer which helps save on IO
      Just like KMeansReducer, but only produces a partial sum of the cluster based on the data local to the Mapper
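
A small Python sketch of the combiner idea: collapse one mapper's local output into a single partial sum per cluster, so far fewer records cross the network to the reducer. The function name and the `(vector_sum, count)` layout are assumptions made for this example, not Mahout's types.

```python
def kmeans_combine(mapper_output):
    """Collapse one mapper's (cluster_id, point) pairs into a single
    (vector_sum, count) partial per cluster."""
    partials = {}
    for cluster_id, point in mapper_output:
        if cluster_id not in partials:
            partials[cluster_id] = (list(point), 1)
        else:
            vector_sum, count = partials[cluster_id]
            for d, value in enumerate(point):
                vector_sum[d] += value
            partials[cluster_id] = (vector_sum, count + 1)
    return {cid: (tuple(s), n) for cid, (s, n) in partials.items()}


# Hypothetical output of one mapper: three points, two clusters
local_output = [
    ("c0", (1.0, 1.0)),
    ("c1", (9.0, 9.0)),
    ("c0", (1.0, 2.0)),
]
combined = kmeans_combine(local_output)
```

Here three mapper records shrink to two. Carrying the count alongside the sum is what lets the reducer compute a correct mean from partials of different sizes.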
    • 24. KMeansClusterMapper
      Some applications only care about what the Centroids are, so this step is optional
      Setup:
      Load up the clusters and the DistanceMeasure used
      Map Phase
      Calculate which Cluster the point belongs to
      Output <ClusterId, Vector>
    • 25. Summary
      Machine learning is all over the web today
      Mahout is about scalable machine learning
      Mahout has functionality for many of today’s common machine learning tasks
      Many Mahout implementations use Hadoop
      KMeans clustering is an example of a machine learning algorithm in Mahout that is implemented using Map Reduce
    • 26. Resources
      http://mahout.apache.org
      http://cwiki.apache.org/MAHOUT
      {user|dev}@mahout.apache.org
      http://svn.apache.org/repos/asf/mahout/trunk
      http://hadoop.apache.org
    • 27. Resources
      “Mahout in Action” by Owen, Anil, Dunning and Friedman
      http://awe.sm/5FyNe
      “Introducing Apache Mahout”
      http://www.ibm.com/developerworks/java/library/j-mahout/
      “Programming Collective Intelligence” by Toby Segaran
      “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank
    • 28. References
      HAL: http://en.wikipedia.org/wiki/File:Hal-9000.jpg
      Terminator: http://en.wikipedia.org/wiki/File:Terminator1984movieposter.jpg
      Matrix: http://en.wikipedia.org/wiki/File:The_Matrix_Poster.jpg
      Google News: http://news.google.com
      Amazon.com: http://www.amazon.com
      Facebook: http://www.facebook.com
      Couple: http://www.vlemx.com/
      Beer and Diapers: http://www.flickr.com/photos/baubcat/2484459070/
      http://www.theregister.co.uk/2006/08/15/beer_diapers/
      DMOZ: http://www.dmoz.org
      Shopping Cart: http://themeanestmom.blogspot.com/2010/09/shopping-carts.html
