SDEC2011 Essentials of Mahout
 

    SDEC2011 Essentials of Mahout: Presentation Transcript

    • Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners.
    • Essentials of Mahout
      • Mastering Hadoop Map-reduce for Data Analysis
      • Shashank Tiwari
      • blog: shanky.org | twitter: @tshanky
      • st@treasuryofideas.com
    • What is Apache Mahout?
      • A scalable machine learning infrastructure
      • Built on top of Hadoop MapReduce
      • Currently supports clustering, classification, collaborative filtering, and more
    • A Little History
      • Founded by folks active in the Lucene community
      • Inspired by work at Stanford: “Map-Reduce for Machine Learning on Multicore” -- http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf
    • Project Goal
      • Create a community-driven, scalable, and robust machine learning infrastructure
      • Leverage Hadoop for parallel processing and scalability
      • Provide an abstraction on top of Hadoop so that machine-learning users do not have to deal with the map and reduce primitives when they build their solutions
    • Supported Algorithms
      • Collaborative Filtering
        • User and Item based recommenders
      • K-Means, Fuzzy K-Means clustering
      • Mean Shift clustering
      • Dirichlet process clustering
    • More Supported Algorithms
      • Latent Dirichlet Allocation
      • Singular value decomposition
      • Parallel Frequent Pattern mining
      • Complementary Naive Bayes classifier
      • Random forest decision tree based classifier
      • ...and growing
    • Focus Areas
      • Collaborative Filtering
      • Clustering
      • Classification
    • Build and Install
      • Required Software:
        • Java 1.6.x
        • Maven 2.0.11+
      • Get source: svn co http://svn.apache.org/repos/asf/mahout/trunk mahout
      • Compile & install core & examples: mvn install
        • Alternatively, run mvn compile, mvn package, and mvn install individually
    • Recommendation Examples
      • mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommenderEvaluatorRunner" -Dexec.args="-i /Users/tshanky/workspace/hadoop_workspace/grouplens/ratings.dat"
      • https://cwiki.apache.org/confluence/display/MAHOUT/RecommendationExamples
    • Common Use Cases
      • Shopping: Amazon, Netflix
      • Who to follow/friend: Twitter/Facebook
      • Web resource classification, spam filtering, financial markets pattern recognition, classification
    • Collaborative Filtering Basis
      • User-based: recommend items by finding similar users. User preferences keep changing, so this method poses challenges.
      • Item-based: calculate similarity between items and make recommendations. Items usually don’t change much, so this method is often reliable.
      • Slope-one: fast and efficient item-based recommendation when user ratings are more than a boolean yes/no or like/dislike.
      • Model-based: provide recommendations by building a model of users and their ratings.
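The slope-one idea can be made concrete with a small, self-contained sketch. It is plain Java and does not use Mahout's own slope-one recommender; the class name, method names, and the three sample users are all hypothetical. It assumes the weighted variant: for each item the user has already rated, take the average rating difference between the target item and that item across users who rated both, add it to the user's own rating, and weight the result by the number of co-rating users.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative weighted slope-one predictor (hypothetical names, not Mahout code).
class SlopeOneSketch {

    // Hypothetical sample data: user -> (item -> rating).
    static Map<String, Map<String, Double>> sampleRatings() {
        Map<String, Map<String, Double>> ratings = new HashMap<>();
        Map<String, Double> john = new HashMap<>();
        john.put("A", 5.0); john.put("B", 3.0); john.put("C", 2.0);
        Map<String, Double> mark = new HashMap<>();
        mark.put("A", 3.0); mark.put("B", 4.0);
        Map<String, Double> lucy = new HashMap<>();
        lucy.put("B", 2.0); lucy.put("C", 5.0);
        ratings.put("John", john);
        ratings.put("Mark", mark);
        ratings.put("Lucy", lucy);
        return ratings;
    }

    // Predicts the rating `user` would give `target`, or NaN if no co-rated items exist.
    static double predict(Map<String, Map<String, Double>> ratings, String user, String target) {
        double numerator = 0.0;
        int totalCount = 0;
        for (Map.Entry<String, Double> own : ratings.get(user).entrySet()) {
            String other = own.getKey();
            if (other.equals(target)) {
                continue;
            }
            // Average difference (target - other) over users who rated both items.
            double diffSum = 0.0;
            int count = 0;
            for (Map<String, Double> row : ratings.values()) {
                if (row.containsKey(target) && row.containsKey(other)) {
                    diffSum += row.get(target) - row.get(other);
                    count++;
                }
            }
            if (count == 0) {
                continue;
            }
            numerator += (diffSum / count + own.getValue()) * count;
            totalCount += count;
        }
        return totalCount == 0 ? Double.NaN : numerator / totalCount;
    }

    public static void main(String[] args) {
        // Lucy has not rated item A; the prediction is 13/3, roughly 4.33.
        System.out.println(predict(sampleRatings(), "Lucy", "A"));
    }
}
```

Because the prediction depends on rating differences, it only makes sense when ratings are numeric rather than boolean, which is the condition the slide names.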
    • Clustering Basis
      • Clustering algorithms also use the notion of similarity to group similar items into a cluster.
      • Both collaborative filtering and clustering use the notion of a distance, which could be calculated using a number of different techniques.
      • Example: Euclidean distance
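As a rough illustration of the distance notion, the sketch below computes the Euclidean distance between two rating vectors and uses it for the assignment step of a k-means style clustering (pick the nearest centroid). It is plain Java with hypothetical names; the `1 / (1 + distance)` mapping from distance to similarity is one common convention, assumed here for illustration rather than taken from the slides.

```java
// Illustrative distance helpers (hypothetical names, not Mahout code).
class EuclideanDistanceSketch {

    // Straight-line distance between two vectors of equal length.
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Assumed convention: identical vectors score 1, distant vectors approach 0.
    static double similarity(double[] a, double[] b) {
        return 1.0 / (1.0 + distance(a, b));
    }

    // Index of the centroid closest to the point (the k-means assignment step).
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.length; i++) {
            double d = distance(point, centroids[i]);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] first = {5, 3, 4};
        double[] second = {2, 3, 0};
        System.out.println(distance(first, second)); // sqrt(9 + 0 + 16) = 5.0
        double[][] centroids = {{0, 0, 0}, {5, 3, 5}};
        System.out.println(nearestCentroid(first, centroids)); // 1
    }
}
```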
    • Mahout Taste Framework
      • Taste Collaborative Filtering:
        • Taste is an open source project for CF started by Sean Owen on SourceForge and donated to Mahout in 2008.
        • Has been applied to a number of different data sets successfully.
      • Mahout supports building recommendation engines primarily on the basis of the Taste library.
        • The library supports both user-based and item-based recommendations.
      • Can be used with Java or over RESTful web-service endpoints.
    • Taste Framework : Primary Classes
      • DataModel: Model for Users, Items, and Preferences
      • UserSimilarity: Interface defining the similarity between two users
      • ItemSimilarity: Interface defining the similarity between two items
      • Recommender: Interface for providing recommendations
      • UserNeighborhood: Interface for computing a neighborhood of similar users. These are used by the Recommenders.
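To show how these roles fit together, here is a deliberately simplified sketch in plain Java. It does not use the Taste interfaces; each method only mirrors one role (a nested map standing in for the data model, a similarity function, a neighborhood selector, and a recommender), and every name and sample value is hypothetical. The similarity is assumed to be `1 / (1 + Euclidean distance)` over co-rated items, and estimates are assumed to be similarity-weighted averages of neighbors' ratings.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified user-based recommender mirroring the roles listed above (not the Taste API).
class UserBasedSketch {

    // Data model role: userID -> (itemID -> preference). Sample values are made up.
    static Map<Long, Map<Long, Double>> sampleModel() {
        Map<Long, Map<Long, Double>> model = new HashMap<>();
        Map<Long, Double> u1 = new HashMap<>();
        u1.put(10L, 5.0); u1.put(11L, 1.0);
        Map<Long, Double> u2 = new HashMap<>();
        u2.put(10L, 5.0); u2.put(11L, 1.0); u2.put(12L, 4.0);
        Map<Long, Double> u3 = new HashMap<>();
        u3.put(10L, 1.0); u3.put(11L, 5.0); u3.put(13L, 5.0);
        model.put(1L, u1); model.put(2L, u2); model.put(3L, u3);
        return model;
    }

    // User similarity role: 1 / (1 + Euclidean distance) over co-rated items.
    static double similarity(Map<Long, Double> a, Map<Long, Double> b) {
        double sum = 0.0;
        int shared = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                double diff = e.getValue() - other;
                sum += diff * diff;
                shared++;
            }
        }
        return shared == 0 ? 0.0 : 1.0 / (1.0 + Math.sqrt(sum));
    }

    // Neighborhood role: the n users most similar to the given user.
    static List<Long> neighborhood(Map<Long, Map<Long, Double>> model, long user, int n) {
        Map<Long, Double> own = model.get(user);
        List<Long> others = new ArrayList<>();
        for (Long id : model.keySet()) {
            if (id != user) {
                others.add(id);
            }
        }
        others.sort(Comparator.comparingDouble((Long id) -> -similarity(own, model.get(id))));
        return new ArrayList<>(others.subList(0, Math.min(n, others.size())));
    }

    // Recommender role: rank unseen items by similarity-weighted average rating.
    static List<Long> recommend(Map<Long, Map<Long, Double>> model, long user, int n, int howMany) {
        Map<Long, Double> own = model.get(user);
        Map<Long, Double> weightedSum = new HashMap<>();
        Map<Long, Double> weightTotal = new HashMap<>();
        for (Long neighbor : neighborhood(model, user, n)) {
            double sim = similarity(own, model.get(neighbor));
            if (sim <= 0.0) {
                continue;
            }
            for (Map.Entry<Long, Double> e : model.get(neighbor).entrySet()) {
                if (!own.containsKey(e.getKey())) {
                    weightedSum.merge(e.getKey(), sim * e.getValue(), Double::sum);
                    weightTotal.merge(e.getKey(), sim, Double::sum);
                }
            }
        }
        List<Long> items = new ArrayList<>(weightedSum.keySet());
        items.sort(Comparator.comparingDouble(
                (Long item) -> -(weightedSum.get(item) / weightTotal.get(item))));
        return new ArrayList<>(items.subList(0, Math.min(howMany, items.size())));
    }

    public static void main(String[] args) {
        // User 2 agrees with user 1 on every shared item, so user 2's extra item is suggested.
        System.out.println(recommend(sampleModel(), 1L, 1, 1));
    }
}
```

Keeping similarity, neighborhood, and recommendation as separate steps is what lets each be swapped independently, which is the point of defining them as separate interfaces.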
    • Taste Framework : Online vs Offline
      • Can do online recommendations for a few thousand data sets.
      • Leverages Hadoop for offline recommendation calculations on large data sets.
    • Understanding the Group Lens Implementation
      • Provides insight into a sample Mahout Taste Framework implementation.
      • Uses the publicly available GroupLens data set.
      • Part of the distribution, so you can analyze it, modify it, and use it as an inspiration for your own implementation.
      • Easy-to-follow example.
    • Group Lens Implementation Source
      • GroupLensDataModel.java
      • GroupLensRecommender.java
      • GroupLensRecommenderBuilder.java
      • GroupLensRecommenderEvaluatorRunner.java
    • Group Lens Runner -- evaluator
      • Instantiates an evaluator:
        • RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
        • a “mean absolute error” measure: the average absolute difference between estimated and actual preferences
      • Parses input parameters:
        • File ratingsFile = TasteOptionParser.getRatings(args);
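The score such an evaluator reports can be illustrated with a few lines of plain Java. This is a sketch of the mean-absolute-error arithmetic only, with hypothetical names and made-up sample values; it is not the evaluator's actual source.

```java
// Illustrative mean absolute error between actual and estimated preferences.
class MeanAbsoluteErrorSketch {

    static double meanAbsoluteError(double[] actual, double[] estimated) {
        if (actual.length != estimated.length || actual.length == 0) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
        double total = 0.0;
        for (int i = 0; i < actual.length; i++) {
            total += Math.abs(actual[i] - estimated[i]);
        }
        return total / actual.length;
    }

    public static void main(String[] args) {
        double[] actual = {4.0, 3.0, 5.0};
        double[] estimated = {3.5, 3.0, 4.0};
        // (0.5 + 0.0 + 1.0) / 3 = 0.5
        System.out.println(meanAbsoluteError(actual, estimated));
    }
}
```

A lower score means estimates sit closer to the real ratings; 0.0 would mean every held-out rating was predicted exactly.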
    • Group Lens Runner -- data model
      • Parses a colon-delimited ratings file:
        • DataModel model = ratingsFile == null ? new GroupLensDataModel() : new GroupLensDataModel(ratingsFile);
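For a feel of what parsing such a file involves, the sketch below splits one line into user, item, and rating. It assumes the double-colon layout `UserID::MovieID::Rating::Timestamp` used by the GroupLens MovieLens `ratings.dat` file; the class, the `Rating` holder, and the sample line are illustrative and are not taken from `GroupLensDataModel`.

```java
// Illustrative parser for one "::"-delimited rating line (hypothetical names).
class RatingLineParser {

    // Minimal immutable holder for one parsed preference.
    static final class Rating {
        final long userId;
        final long itemId;
        final float value;

        Rating(long userId, long itemId, float value) {
            this.userId = userId;
            this.itemId = itemId;
            this.value = value;
        }
    }

    // Parses "user::item::rating[::timestamp]"; the timestamp, if present, is ignored.
    static Rating parse(String line) {
        String[] fields = line.trim().split("::");
        if (fields.length < 3) {
            throw new IllegalArgumentException("expected at least 3 fields: " + line);
        }
        return new Rating(
                Long.parseLong(fields[0]),
                Long.parseLong(fields[1]),
                Float.parseFloat(fields[2]));
    }

    public static void main(String[] args) {
        Rating r = parse("1::1193::5::978300760"); // sample line for illustration
        System.out.println(r.userId + " rated " + r.itemId + " with " + r.value);
    }
}
```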
    • Group Lens Runner -- evaluate with recommendation builder
      • Evaluates using GroupLensRecommender:
        • double evaluation = evaluator.evaluate(new GroupLensRecommenderBuilder(), null, model, 0.9, 0.3);
        • 0.9 is the fraction of each user's preferences used for training; 0.3 is the fraction of users included in the evaluation
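The two numeric arguments describe a hold-out evaluation, which can be sketched in plain Java as follows. This is an illustration of the idea under stated assumptions, not Mahout's implementation: each preference is assumed to go to the training set with probability equal to the training fraction, and each user is assumed to be sampled for evaluation with probability equal to the evaluation fraction. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Illustrative hold-out split for recommender evaluation (hypothetical names).
class HoldOutSplitSketch {

    // Holds one user's preferences after splitting.
    static final class Split {
        final List<Double> training = new ArrayList<>();
        final List<Double> test = new ArrayList<>();
    }

    // Sends each preference to training with probability trainingFraction, else to test.
    static Split split(List<Double> preferences, double trainingFraction, Random random) {
        Split result = new Split();
        for (Double preference : preferences) {
            if (random.nextDouble() < trainingFraction) {
                result.training.add(preference);
            } else {
                result.test.add(preference);
            }
        }
        return result;
    }

    // Decides whether a user takes part in the evaluation at all.
    static boolean includeUser(double evaluationFraction, Random random) {
        return random.nextDouble() < evaluationFraction;
    }

    public static void main(String[] args) {
        Random random = new Random(42L);
        List<Double> preferences = Arrays.asList(5.0, 3.0, 4.0, 1.0, 2.0);
        if (includeUser(0.3, random)) {
            Split s = split(preferences, 0.9, random);
            System.out.println("train=" + s.training + " test=" + s.test);
        } else {
            System.out.println("user skipped in this evaluation run");
        }
    }
}
```

The held-out test preferences are the ones compared against the recommender's estimates, which is what produces the single evaluation score.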
    • Questions?
      • blog: shanky.org | twitter: @tshanky
      • st@treasuryofideas.com