Intro to Mahout -- DC Hadoop
Upcoming SlideShare
Loading in...5
×
 

Intro to Mahout -- DC Hadoop

on

  • 12,948 views

Introduction to Apache Mahout -- talk given at DC Hadoop Meetup on April 28

Introduction to Apache Mahout -- talk given at DC Hadoop Meetup on April 28

Statistics

Views

Total Views
12,948
Views on SlideShare
12,896
Embed Views
52

Actions

Likes
11
Downloads
369
Comments
0

5 Embeds 52

http://hadoopbigdata.wordpress.com 26
http://students.kennesaw.edu 20
http://localhost 4
http://twitter.com 1
http://paper.li 1

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment
  • 3C: The three C’s: clustering, classification and collaborative filteringFPM: Frequent patternset miningO: Other (math, collections, etc.)
  • Convergence just checks to see how far the centroid has moved from the previous centroid

Intro to Mahout -- DC Hadoop Intro to Mahout -- DC Hadoop Presentation Transcript

  • Intro to Apache Mahout
    Grant Ingersoll
    Lucid Imagination
    http://www.lucidimagination.com
  • Anyone Here Use Machine Learning?
    Any users of:
    Google?
    Search?
    Priority Inbox?
    Facebook?
    Twitter?
    LinkedIn?
  • Topics
    Background and Use Cases
    What can you do in Mahout?
    Where’s the community at?
    Resources
    K-Means in Hadoop (time permitting)
  • Definition
    “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
    Intro. To Machine Learning by E. Alpaydin
    Subset of Artificial Intelligence
    Lots of related fields:
    Information Retrieval
    Stats
    Biology
    Linear algebra
    Many more
  • Common Use Cases
    Recommend friends/dates/products
    Classify content into predefined groups
    Find similar content
    Find associations/patterns in actions/behaviors
    Identify key topics/summarize text
    Documents and Corpora
    Detect anomalies/fraud
    Ranking search results
    Others?
  • Apache Mahout
    An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
    http://mahout.apache.org
    Why Mahout?
    Many Open Source ML libraries either:
    Lack Community
    Lack Documentation and Examples
    Lack Scalability
    Lack the Apache License
    Or are research-oriented
    Definition:http://dictionary.reference.com/browse/mahout
  • What does scalable mean to us?
    Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm
    Some algorithms won’t scale to massive machine clusters
    Others fit logically on a Map Reduce framework like Apache Hadoop
    Still others will need different distributed programming models
    Others are already fast (SGD)
    Be pragmatic
  • Sampling of Who uses Mahout?
    https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
  • What Can I do with Mahout Right Now?3C + FPM + O = Mahout
  • Collaborative Filtering
    Extensive framework for collaborative filtering (recommenders)
    Recommenders
    User based
    Item based
    Online and Offline support
    Offline can utilize Hadoop
    Many different Similarity measures
    Cosine, LLR, Tanimoto, Pearson, others
  • Clustering
    Document level
    Group documents based on a notion of similarity
    K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift, EigenCuts (Spectral)
    All Map/Reduce
    Distance Measures
    Manhattan, Euclidean, other
    Topic Modeling
    Cluster words across documents to identify topics
    Latent Dirichlet Allocation (M/R)
  • Categorization
    Place new items into predefined categories:
    Sports, politics, entertainment
    Recommenders
    Implementations
    Naïve Bayes (M/R)
    Compl. Naïve Bayes (M/R)
    Decision Forests (M/R)
    Linear Regression (Seq. but Fast!)
    • See Chapter 17 of Mahout in Action for Shop It To Me use case:
    • http://awe.sm/5FyNe
  • Freq. Pattern Mining
    Identify frequently co-occurrent items
    Useful for:
    Query Recommendations
    Apple -> iPhone, orange, OS X
    Related product placement
    Basket Analysis
    Map/Reduce
    http://www.amazon.com
  • Other
    Primitive Collections!
    Collocations (M/R)
    Math library
    Vectors, Matrices, etc.
    Noise Reduction via Singular Value Decomp (M/R)
  • Prepare Data from Raw content
    Data Sources:
    Lucene integration
    bin/mahout lucene.vector…
    Document Vectorizer
    bin/mahout seqdirectory …
    bin/mahout seq2sparse …
    Programmatically
    See the Utils module in Mahout and the Iterator<Vector> classes
    Database
    File system
  • How to: Command Line
    Most algorithms have a Driver program
    $MAHOUT_HOME/bin/mahout.shhelps with most tasks
    Prepare the Data
    Different algorithms require different setup
    Run the algorithm
    Single Node
    Hadoop
    Print out the results or incorporate into application
    Several helper classes:
    LDAPrintTopics, ClusterDumper, etc.
  • What’s Happening Now?
    Unified Framework for Clustering and Classification
    0.5 release on the horizon (May?)
    Working towards 1.0 release by focusing on:
    Tests, examples, documentation
    API cleanup and consistency
    Gearing up for Google Summer of Code
    New M/R work for Hidden Markov Models
  • Summary
    Machine learning is all over the web today
    Mahout is about scalable machine learning
    Mahout has functionality for many of today’s common machine learning tasks
    Many Mahout implementations use Hadoop
  • Resources
    http://mahout.apache.org
    http://cwiki.apache.org/MAHOUT
    {user|dev}@mahout.apache.org
    http://svn.apache.org/repos/asf/mahout/trunk
    http://hadoop.apache.org
  • Resources
    “Mahout in Action”
    Owen, Anil, Dunning and Friedman
    http://awe.sm/5FyNe
    “Introducing Apache Mahout”
    http://www.ibm.com/developerworks/java/library/j-mahout/
    “Taming Text” by Ingersoll, Morton, Farris
    “Programming Collective Intelligence” by Toby Segaran
    “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank
    “Data-Intensive Text Processing with MapReduce” by Jimmy Lin and Chris Dyer
  • K-Means
    Clustering Algorithm
    Nicely parallelizable!
    http://en.wikipedia.org/wiki/K-means_clustering
  • K-Means in Map-Reduce
    Input:
    Mahout Vectors representing the original content
    Either:
    A predefined set of initial centroids (Can be from Canopy)
    --k – The number of clusters to produce
    Iterate
    Do the centroid calculation (more in a moment)
    Clustering Step (optional)
    Output
    Centroids (as Mahout Vectors)
    Points for each Centroid (if Clustering Step was taken)
  • Map-Reduce Iteration
    Each Iteration calculates the Centroids using:
    KMeansMapper
    KMeansCombiner
    KMeansReducer
    Clustering Step
    Calculate the points for each Centroid using:
    KMeansClusterMapper
  • KMeansMapper
    During Setup:
    Load the initial Centroids (or the Centroids from the last iteration)
    Map Phase
    For each input
    Calculate it’s distance from each Centroid and output the closest one
    Distance Measures are pluggable
    Manhattan, Euclidean, Squared Euclidean, Cosine, others
  • KMeansReducer
    Setup:
    Load up clusters
    Convergence information
    Partial sums from KMeansCombiner (more in a moment)
    Reduce Phase
    Sum all the vectors in the cluster to produce a new Centroid
    Check for Convergence
    Output cluster
  • KMeansCombiner
    Just like KMeansReducer, but only produces partial sum of the cluster based on the data local to the Mapper
  • KMeansClusterMapper
    Some applications only care about what the Centroids are, so this step is optional
    Setup:
    Load up the clusters and the DistanceMeasure used
    Map Phase
    Calculate which Cluster the point belongs to
    Output <ClusterId, Vector>