Apache Mahout: Driving the Yellow Elephant
  • A few things come to mind
  • Convergence just checks to see how far the centroid has moved from the previous centroid

Apache Mahout: Driving the Yellow Elephant Presentation Transcript

  • Apache Mahout – Driving the Yellow Elephant
    Grant Ingersoll
    TriHUG http://www.trihug.org
  • Anyone Here Use Machine Learning?
    Any users of:
    Google?
    Search?
    Priority Inbox?
    Facebook?
    Twitter?
    LinkedIn?
  • Topics
    What is Machine Learning?
    ML Use Cases
    What is Mahout?
    A Word on Scaling
    What can I do with it right now?
    Mahout and Hadoop: An Example
  • What is Machine Learning?
    Amazon.com
    Google News
  • Really it’s…
    “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
    Introduction to Machine Learning by E. Alpaydin
    Subset of Artificial Intelligence
    Lots of related fields:
    Information Retrieval
    Stats
    Biology
    Linear algebra
    Many more
  • Common Use Cases
    Recommend friends/dates/products
    Classify content into predefined groups
    Find similar content based on object properties
    Find associations/patterns in actions/behaviors
    Identify key topics in large collections of text
    Detect anomalies in machine output
    Ranking search results
    Others?
  • Apache Mahout
    http://dictionary.reference.com/browse/mahout
    An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
    http://mahout.apache.org
    Why Mahout?
    Many Open Source ML libraries either:
    Lack Community
    Lack Documentation and Examples
    Lack Scalability
    Lack the Apache License ;-)
    Or are research-oriented
  • Who uses Mahout?
    https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
  • What does scalable mean?
    Ted Dunning (Mahout committer):
    As data grows linearly, either scale linearly in time or in machines
    2X data requires 2X time or 2X machines (or less!)
    Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm
    Some algorithms won’t scale to massive machine clusters
    Others fit logically on a Map Reduce framework like Apache Hadoop
    Still others will need different distributed programming models
    Be pragmatic
  • What Can I do with Mahout Right Now?
  • Recommendations
    Extensive framework for collaborative filtering
    Recommenders
    User based
    Item based
    Online and Offline support
    Offline can utilize Hadoop
    Many different Similarity measures
    Cosine, LLR, Tanimoto, Pearson, others
    It’s Valentine’s Day soon!
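
To make the similarity measures named above concrete, here is a minimal plain-Python sketch of three of them (cosine, Tanimoto, Pearson). It is not Mahout's API; the function names and the toy inputs are assumptions chosen for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # undefined for a zero vector; treat as "no similarity"
    return dot / (norm_a * norm_b)

def tanimoto_similarity(items_a, items_b):
    """Tanimoto (Jaccard) coefficient: shared items over all distinct items."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def pearson_correlation(a, b):
    """Pearson correlation of two equal-length rating vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / math.sqrt(var_a * var_b)
```

Cosine and Pearson work on numeric ratings, while Tanimoto only needs to know which items each user touched, which is why it suits data without explicit preference values.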
  • Clustering
    Document level
    Group documents based on a notion of similarity
    K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift
    Distance Measures
    Manhattan, Euclidean, other
    Topic Modeling
    Cluster words across documents to identify topics
    Latent Dirichlet Allocation
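
The two distance measures named on this slide can be sketched in a few lines of plain Python. This is an illustration over simple lists of numbers, not Mahout's vector classes; the function names are assumptions.

```python
import math

def manhattan_distance(a, b):
    """Sum of absolute per-dimension differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

The choice of measure changes which documents count as "similar", so the same clustering algorithm can produce different groups depending on which one is plugged in.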
  • Categorization
    Place new items into predefined categories:
    Sports, politics, entertainment
    Recommenders
    Implementations
    Naïve Bayes
    Complementary Naïve Bayes
    Decision Forests
    Logistic Regression
    See Chapter 17 of Mahout in Action for the Shop It To Me use case:
    http://awe.sm/5FyNe
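
As a rough illustration of placing new items into predefined categories, the following is a tiny multinomial Naïve Bayes sketch with add-one smoothing in plain Python. It is not Mahout's implementation; the function names, the toy training data, and the smoothing choice are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Count documents and words per category from (category, words) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for category, words in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(words)
        vocabulary.update(words)
    return doc_counts, word_counts, vocabulary

def classify(words, model):
    """Return the category with the highest log-probability for the words."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category in doc_counts:
        # Log prior: fraction of training documents in this category.
        score = math.log(doc_counts[category] / total_docs)
        # Add-one smoothing so unseen words do not zero out the score.
        denom = sum(word_counts[category].values()) + len(vocabulary)
        for w in words:
            score += math.log((word_counts[category][w] + 1) / denom)
        if score > best_score:
            best, best_score = category, score
    return best

# Hypothetical toy training set.
model = train_naive_bayes([
    ("sports", ["goal", "match", "team"]),
    ("politics", ["vote", "election", "party"]),
    ("sports", ["team", "win"]),
])
```

Calling `classify(["team", "goal"], model)` picks the category whose word counts best explain the input.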
  • Frequent Pattern Mining
    Identify frequently co-occurrent items
    Useful for:
    Query Recommendations
    Apple -> iPhone, orange, OS X
    Related product placement
    Basket Analysis
    http://www.amazon.com
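
The idea of finding frequently co-occurring items can be sketched with brute-force pair counting. This is deliberately the simplest possible approach and is not the algorithm Mahout uses; the function name, the `min_support` parameter, and the basket contents are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Keep item pairs that appear together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort the distinct items so each unordered pair has one canonical key.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical shopping baskets.
baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["diapers", "wipes"],
    ["beer", "chips"],
]
pairs = frequent_pairs(baskets, min_support=2)
```

Pair counting grows quadratically with basket size, which is one reason dedicated pattern-mining algorithms exist for larger data.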
  • Evolutionary
    Map-Reduce ready fitness functions for genetic programming
    Integration with Watchmaker
    http://watchmaker.uncommons.org/index.php
    Problems solved:
    Traveling salesman
    Class discovery
    Many others
    Caveat: Hasn’t received as much attention as others
  • Other
    Primitive Collections!
    Math library
    Vectors, Matrices, etc.
    Noise Reduction via Singular Value Decomposition
    Export from Lucene/Solr and other formats
  • Mahout and Hadoop
    Most Mahout implementations are built on Map-Reduce
    Many also have sequential implementations
    Logistic Regression is blazingly fast without needing M/R
    Let’s look at how K-Means is implemented in Mahout
  • K-Means
    Clustering Algorithm
    Nicely parallelizable!
    http://en.wikipedia.org/wiki/K-means_clustering
  • K-Means in Map-Reduce
    Input:
    Mahout Vectors representing the original content
    Either:
    A predefined set of initial centroids (Can be from Canopy)
    --k – The number of clusters to produce
    Iterate
    Do the centroid calculation (more in a moment)
    Clustering Step (optional)
    Output
    Centroids (as Mahout Vectors)
    Points for each Centroid (if Clustering Step was taken)
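
The input, iterate, and output steps above can be sketched as a single-machine loop. This is a sequential plain-Python illustration of the general K-Means procedure, not Mahout's driver; the parameter names `max_iterations` and `convergence_delta` are assumptions, and points are plain tuples rather than Mahout Vectors.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, centroids, max_iterations=10, convergence_delta=0.001):
    """Repeat the centroid calculation until no centroid moves more than
    convergence_delta, or until max_iterations is reached."""
    for _ in range(max_iterations):
        # Assign each point to its closest centroid.
        members = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)),
                      key=lambda i: euclidean(p, centroids[i]))
            members[idx].append(p)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = []
        for old, group in zip(centroids, members):
            if group:
                new_centroids.append(
                    tuple(sum(dim) / len(group) for dim in zip(*group)))
            else:
                new_centroids.append(old)  # leave an empty cluster in place
        moved = max(euclidean(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if moved <= convergence_delta:
            break
    return centroids
```

In the Map-Reduce version each pass of this loop becomes one job, with the assignment done in the map phase and the averaging done in the reduce phase.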
  • Map-Reduce Iteration
    Each Iteration calculates the Centroids using:
    KMeansMapper
    KMeansCombiner
    KMeansReducer
    Clustering Step
    Calculate the points for each Centroid using:
    KMeansClusterMapper
  • KMeansMapper
    During Setup:
    Load the initial Centroids (or the Centroids from the last iteration)
    Map Phase
    For each input
    Calculate its distance from each Centroid and output the closest one
    Distance Measures are pluggable
    Manhattan, Euclidean, Squared Euclidean, Cosine, others
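
The map phase described above can be sketched as a single function. This is a plain-Python illustration, not the actual KMeansMapper code; the function names and the `(cluster index, point)` output shape are assumptions.

```python
def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def kmeans_map(point, centroids, distance=squared_euclidean):
    """Emit (index of the closest centroid, point) for one input vector.
    The distance function is passed in, mirroring the pluggable measures."""
    closest = min(range(len(centroids)),
                  key=lambda i: distance(point, centroids[i]))
    return closest, point
```

Because each point is handled independently, this step needs only the current centroids and can run in parallel on any split of the input.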
  • KMeansReducer
    Setup:
    Load up clusters
    Convergence information
    Partial sums from KMeansCombiner (more in a moment)
    Reduce Phase
    Sum all the vectors in the cluster and divide by the point count to produce a new Centroid
    Check for Convergence
    Output cluster
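
A minimal sketch of the reduce phase, assuming each partial arrives as a `(vector_sum, count)` pair and that convergence means the centroid moved no more than a threshold from its previous position. This is plain Python for illustration, not the actual KMeansReducer; the names and the `convergence_delta` default are assumptions.

```python
import math

def kmeans_reduce(old_centroid, partials, convergence_delta=0.001):
    """Merge partial sums for one cluster into a new centroid and report
    whether the centroid has converged."""
    total = [0.0] * len(old_centroid)
    count = 0
    for vector_sum, n in partials:
        total = [t + v for t, v in zip(total, vector_sum)]
        count += n
    if count == 0:
        return tuple(old_centroid), True  # nothing assigned; nothing moved
    new_centroid = tuple(t / count for t in total)
    # Converged when the centroid barely moved since the last iteration.
    moved = math.sqrt(sum((o - c) ** 2
                          for o, c in zip(old_centroid, new_centroid)))
    return new_centroid, moved <= convergence_delta
```

The iteration can stop once every cluster reports convergence.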
  • KMeansCombiner
    A Combiner is like a Map-side Reducer that helps save on IO
    Just like KMeansReducer, but produces only a partial sum of the cluster based on the data local to the Mapper
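
The partial-sum idea can be sketched as follows: one mapper's output is collapsed into a single `(vector_sum, count)` pair per cluster before anything is sent to the reducer. This is a plain-Python illustration, not the actual KMeansCombiner; the function name and data shapes are assumptions.

```python
def kmeans_combine(mapped):
    """Collapse (cluster_id, point) pairs from one mapper into
    per-cluster (vector_sum, count) partials."""
    partials = {}
    for cluster_id, point in mapped:
        if cluster_id in partials:
            running, n = partials[cluster_id]
            partials[cluster_id] = (
                [r + x for r, x in zip(running, point)], n + 1)
        else:
            partials[cluster_id] = (list(point), 1)
    return partials
```

Carrying the count alongside the sum is what lets the reducer compute a correct mean from several partials; instead of shipping every point, each mapper ships at most one record per cluster.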
  • KMeansClusterMapper
    Some applications only care about what the Centroids are, so this step is optional
    Setup:
    Load up the clusters and the DistanceMeasure used
    Map Phase
    Calculate which Cluster the point belongs to
    Output <ClusterId, Vector>
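
The optional clustering step can be sketched as a final pass that groups points under the finished centroids. This is plain Python for illustration, not the actual KMeansClusterMapper; the function names and the dictionary output are assumptions standing in for the `<ClusterId, Vector>` pairs.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_points(points, centroids, distance=euclidean):
    """Group each point under the id of its closest final centroid."""
    clusters = defaultdict(list)
    for point in points:
        cluster_id = min(range(len(centroids)),
                         key=lambda i: distance(point, centroids[i]))
        clusters[cluster_id].append(point)
    return dict(clusters)
```

Unlike the iteration itself, this pass runs once and does not change the centroids.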
  • Summary
    Machine learning is all over the web today
    Mahout is about scalable machine learning
    Mahout has functionality for many of today’s common machine learning tasks
    Many Mahout implementations use Hadoop
    KMeans clustering is an example of a machine learning algorithm in Mahout that is implemented using Map Reduce
  • Resources
    http://mahout.apache.org
    http://cwiki.apache.org/MAHOUT
    {user|dev}@mahout.apache.org
    http://svn.apache.org/repos/asf/mahout/trunk
    http://hadoop.apache.org
  • Resources
    “Mahout in Action” by Owen, Anil, Dunning and Friedman
    http://awe.sm/5FyNe
    “Introducing Apache Mahout”
    http://www.ibm.com/developerworks/java/library/j-mahout/
    “Programming Collective Intelligence” by Toby Segaran
    “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank
  • References
    HAL: http://en.wikipedia.org/wiki/File:Hal-9000.jpg
    Terminator: http://en.wikipedia.org/wiki/File:Terminator1984movieposter.jpg
    Matrix: http://en.wikipedia.org/wiki/File:The_Matrix_Poster.jpg
    Google News: http://news.google.com
    Amazon.com: http://www.amazon.com
    Facebook: http://www.facebook.com
    Couple: http://www.vlemx.com/
    Beer and Diapers: http://www.flickr.com/photos/baubcat/2484459070/
    http://www.theregister.co.uk/2006/08/15/beer_diapers/
    DMOZ: http://www.dmoz.org
    Shopping Cart: http://themeanestmom.blogspot.com/2010/09/shopping-carts.html