Intro to Mahout -- DC Hadoop
Introduction to Apache Mahout -- talk given at DC Hadoop Meetup on April 28

  • 3C: the three C’s (clustering, classification, and collaborative filtering); FPM: frequent pattern set mining; O: other (math, collections, etc.)
  • Convergence just checks to see how far the centroid has moved from the previous centroid


  • 1. Intro to Apache Mahout
    Grant Ingersoll
    Lucid Imagination
  • 2. Anyone Here Use Machine Learning?
    Any users of:
    Priority Inbox?
  • 3. Topics
    Background and Use Cases
    What can you do in Mahout?
    Where’s the community at?
    K-Means in Hadoop (time permitting)
  • 4. Definition
    “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
    Introduction to Machine Learning by E. Alpaydin
    Subset of Artificial Intelligence
    Lots of related fields:
    Information Retrieval
    Linear algebra
    Many more
  • 5. Common Use Cases
    Recommend friends/dates/products
    Classify content into predefined groups
    Find similar content
    Find associations/patterns in actions/behaviors
    Identify key topics/summarize text
    Documents and Corpora
    Detect anomalies/fraud
    Ranking search results
  • 6. Apache Mahout
    An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
    Why Mahout?
    Many Open Source ML libraries either:
    Lack Community
    Lack Documentation and Examples
    Lack Scalability
    Lack the Apache License
    Or are research-oriented
  • 7. What does scalable mean to us?
    Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm
    Some algorithms won’t scale to massive machine clusters
    Others fit logically on a Map Reduce framework like Apache Hadoop
    Still others will need different distributed programming models
    Others are already fast (SGD)
    Be pragmatic
  • 8. Sampling of Who uses Mahout?
  • 9. What Can I Do with Mahout Right Now?
    3C + FPM + O = Mahout
  • 10. Collaborative Filtering
    Extensive framework for collaborative filtering (recommenders)
    User based
    Item based
    Online and Offline support
    Offline can utilize Hadoop
    Many different Similarity measures
    Cosine, LLR, Tanimoto, Pearson, others
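The similarity measures named above can be sketched in a few lines. This is an illustrative Python sketch of the math only, not Mahout's Java recommender API (which exposes these as pluggable `UserSimilarity`/`ItemSimilarity` implementations):

```python
import math

def cosine(a, b):
    # Cosine similarity: the angle between two preference vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def tanimoto(a_items, b_items):
    # Tanimoto (Jaccard) coefficient: overlap of the item sets two
    # users interacted with; useful when there are no rating values.
    a, b = set(a_items), set(b_items)
    return len(a & b) / len(a | b)

def pearson(a, b):
    # Pearson correlation between two users' co-rated preferences.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)
```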
  • 11. Clustering
    Document level
    Group documents based on a notion of similarity
    K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift, EigenCuts (Spectral)
    All Map/Reduce
    Distance Measures
    Manhattan, Euclidean, other
    Topic Modeling
    Cluster words across documents to identify topics
    Latent Dirichlet Allocation (M/R)
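The pluggable distance measures behind these clustering algorithms are simple vector norms. A minimal sketch (not Mahout's `DistanceMeasure` interface itself):

```python
import math

def manhattan(a, b):
    # L1 norm: sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # L2 norm: straight-line distance between the two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```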
  • 12. Categorization
    Place new items into predefined categories:
    Sports, politics, entertainment
    Naïve Bayes (M/R)
    Compl. Naïve Bayes (M/R)
    Decision Forests (M/R)
    Linear Regression (Seq. but Fast!)
    See Chapter 17 of Mahout in Action for the Shop It To Me use case
  • 13. Freq. Pattern Mining
    Identify frequently co-occurring items
    Useful for:
    Query Recommendations
    Apple -> iPhone, orange, OS X
    Related product placement
    Basket Analysis
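The core of basket analysis is counting item pairs that co-occur across transactions. A toy single-machine sketch (Mahout's Parallel FP-Growth does this at scale over Hadoop; the basket data below is made up):

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    # Count every item pair that appears together in a basket,
    # then keep the pairs that meet the minimum support threshold.
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical query baskets echoing the slide's Apple example.
baskets = [
    ["apple", "iphone"],
    ["apple", "iphone", "os x"],
    ["apple", "orange"],
]
```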
  • 14. Other
    Primitive Collections!
    Collocations (M/R)
    Math library
    Vectors, Matrices, etc.
    Noise Reduction via Singular Value Decomp (M/R)
  • 15. Prepare Data from Raw content
    Data Sources:
    Lucene integration
    bin/mahout lucene.vector…
    Document Vectorizer
    bin/mahout seqdirectory …
    bin/mahout seq2sparse …
    See the Utils module in Mahout and the Iterator<Vector> classes
    File system
  • 16. How to: Command Line
    Most algorithms have a Driver program
    $MAHOUT_HOME/bin/mahout.sh helps with most tasks
    Prepare the Data
    Different algorithms require different setup
    Run the algorithm
    Single Node
    Print out the results or incorporate into application
    Several helper classes:
    LDAPrintTopics, ClusterDumper, etc.
  • 17. What’s Happening Now?
    Unified Framework for Clustering and Classification
    0.5 release on the horizon (May?)
    Working towards 1.0 release by focusing on:
    Tests, examples, documentation
    API cleanup and consistency
    Gearing up for Google Summer of Code
    New M/R work for Hidden Markov Models
  • 18. Summary
    Machine learning is all over the web today
    Mahout is about scalable machine learning
    Mahout has functionality for many of today’s common machine learning tasks
    Many Mahout implementations use Hadoop
  • 19. Resources
  • 20. Resources
    “Mahout in Action”
    Owen, Anil, Dunning and Friedman
    “Introducing Apache Mahout”
    “Taming Text” by Ingersoll, Morton, Farris
    “Programming Collective Intelligence” by Toby Segaran
    “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank
    “Data-Intensive Text Processing with MapReduce” by Jimmy Lin and Chris Dyer
  • 21. K-Means
    Clustering Algorithm
    Nicely parallelizable!
  • 22. K-Means in Map-Reduce
    Mahout Vectors representing the original content
    A predefined set of initial centroids (Can be from Canopy)
    --k – The number of clusters to produce
    Do the centroid calculation (more in a moment)
    Clustering Step (optional)
    Centroids (as Mahout Vectors)
    Points for each Centroid (if Clustering Step was taken)
  • 23. Map-Reduce Iteration
    Each Iteration calculates the Centroids using:
    Clustering Step
    Calculate the points for each Centroid using:
  • 24. KMeansMapper
    During Setup:
    Load the initial Centroids (or the Centroids from the last iteration)
    Map Phase
    For each input
    Calculate its distance from each Centroid and output the closest one
    Distance Measures are pluggable
    Manhattan, Euclidean, Squared Euclidean, Cosine, others
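The map phase above boils down to a nearest-centroid lookup. A minimal Python sketch of that step, with the distance measure passed in to mirror the pluggable DistanceMeasure idea (not the actual KMeansMapper Java code):

```python
def emit_closest_centroid(point, centroids, distance):
    # Map phase of k-means: for one input vector, find the nearest
    # centroid and emit (centroid id, point) for the reducer.
    best_id = min(range(len(centroids)),
                  key=lambda i: distance(point, centroids[i]))
    return best_id, point

def squared_euclidean(a, b):
    # One of the pluggable measures; skips the sqrt since only
    # the ordering of distances matters here.
    return sum((x - y) ** 2 for x, y in zip(a, b))
```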
  • 25. KMeansReducer
    Load up clusters
    Convergence information
    Partial sums from KMeansCombiner (more in a moment)
    Reduce Phase
    Sum all the vectors in the cluster to produce a new Centroid
    Check for Convergence
    Output cluster
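The reduce phase is an average plus a convergence test. A hedged sketch of that logic (the threshold name and default are illustrative, not Mahout's actual parameter):

```python
import math

def reduce_cluster(points, old_centroid, threshold=1e-3):
    # Reduce phase of k-means: average the cluster's points into a
    # new centroid, and flag convergence when it has barely moved
    # from the previous iteration's centroid.
    n = len(points)
    new_centroid = [sum(coords) / n for coords in zip(*points)]
    moved = math.sqrt(sum((a - b) ** 2
                          for a, b in zip(new_centroid, old_centroid)))
    return new_centroid, moved < threshold
```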
  • 26. KMeansCombiner
    Just like KMeansReducer, but only produces partial sum of the cluster based on the data local to the Mapper
  • 27. KMeansClusterMapper
    Some applications only care about what the Centroids are, so this step is optional
    Load up the clusters and the DistanceMeasure used
    Map Phase
    Calculate which Cluster the point belongs to
    Output <ClusterId, Vector>
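Putting the mapper and reducer steps together, the whole iteration loop can be simulated in a single process. This is a local walk-through of the algorithm for intuition only; the real Mahout job shards the map step across the cluster and runs one MapReduce job per iteration:

```python
import math

def kmeans_local(points, centroids, iterations=10, threshold=1e-4):
    # Single-process sketch of the M/R loop: assign each point to its
    # nearest centroid ("map"), average each group ("reduce"), and
    # stop once every centroid has converged.
    for _ in range(iterations):
        groups = {i: [] for i in range(len(centroids))}
        for p in points:  # "map": nearest centroid wins
            i = min(groups, key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[c])))
            groups[i].append(p)
        converged = True
        for i, members in groups.items():  # "reduce": recompute centroids
            if not members:
                continue
            new = [sum(xs) / len(members) for xs in zip(*members)]
            if math.dist(new, centroids[i]) > threshold:
                converged = False
            centroids[i] = new
        if converged:
            break
    return centroids
```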