Introduction to Collaborative Filtering with Apache Mahout

Transcript of "Introduction to Collaborative Filtering with Apache Mahout"

  1. An Introduction to Collaborative Filtering with Apache Mahout
     Sebastian Schelter
     Recommender Systems Challenge at ACM RecSys 2012
     Database Systems and Information Management Group (DIMA), Technische Universität Berlin
     13.09.2012
     http://www.dima.tu-berlin.de/
  2. Overview
     ■ Apache Mahout: Apache-licensed library with the goal of providing highly scalable data mining and machine learning
     ■ its collaborative filtering module is based on Sean Owen's Taste framework
     ■ mostly aimed at production scenarios, with a focus on
       □ processing efficiency
       □ easy integration with different datastores, web applications, Amazon EC2
       □ scalability: recommendations, item similarities and matrix decompositions can be computed via MapReduce on Apache Hadoop
     ■ not used much in recommender challenges
       □ not enough different algorithms implemented?
       □ not enough tooling for evaluation?
       → it's open source, so it's up to you to change that!
  3. Preference & DataModel
     ■ Preference encapsulates a user-item interaction as a (user, item, value) triple
       □ only numeric userIDs and itemIDs are allowed, for memory efficiency
       □ PreferenceArray encapsulates a set of preferences
     ■ DataModel encapsulates a dataset
       □ lots of convenient accessor methods like getNumUsers(), getPreferencesForItem(itemID), ...
       □ allows temporal information to be added to preferences
       □ lots of options for storing the data (in-memory, file, database, key-value store)
       □ drawback: for many use cases, all the data has to fit into memory to allow efficient recommendation

         DataModel dataModel = new FileDataModel(new File("movielens.csv"));
         PreferenceArray prefsOfUser1 = dataModel.getPreferencesFromUser(1);
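To make the memory-efficiency argument for numeric IDs concrete, here is a minimal sketch of one user's preferences held in two parallel primitive arrays instead of one object per preference. The class `CompactUserPreferences` and its methods are hypothetical names chosen for illustration; this is not Mahout's PreferenceArray implementation.

```java
import java.util.Arrays;

/** Hypothetical sketch: one user's preferences in parallel primitive arrays. */
final class CompactUserPreferences {
    private final long userID;
    private final long[] itemIDs;  // kept sorted ascending for binary search
    private final float[] values;  // values[k] belongs to itemIDs[k]

    CompactUserPreferences(long userID, long[] itemIDs, float[] values) {
        if (itemIDs.length != values.length) {
            throw new IllegalArgumentException("itemIDs and values must have equal length");
        }
        this.userID = userID;
        this.itemIDs = itemIDs.clone();
        this.values = values.clone();
        // Insertion sort by itemID, moving each value together with its ID.
        for (int i = 1; i < this.itemIDs.length; i++) {
            long id = this.itemIDs[i];
            float value = this.values[i];
            int j = i - 1;
            while (j >= 0 && this.itemIDs[j] > id) {
                this.itemIDs[j + 1] = this.itemIDs[j];
                this.values[j + 1] = this.values[j];
                j--;
            }
            this.itemIDs[j + 1] = id;
            this.values[j + 1] = value;
        }
    }

    long userID() {
        return userID;
    }

    int size() {
        return itemIDs.length;
    }

    /** Returns the stored value, or NaN if the user has no preference for the item. */
    float valueFor(long itemID) {
        int index = Arrays.binarySearch(itemIDs, itemID);
        return index >= 0 ? values[index] : Float.NaN;
    }

    public static void main(String[] args) {
        CompactUserPreferences prefs = new CompactUserPreferences(
                1L, new long[] {25L, 7L, 13L}, new float[] {4.0f, 2.5f, 5.0f});
        System.out.println("item 13: " + prefs.valueFor(13L));
        System.out.println("item 99: " + prefs.valueFor(99L));
    }
}
```

With this layout each preference costs one long and one float and no per-object header, which is the kind of saving that string IDs would rule out.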
  4. Recommender
     ■ Recommender is the basic interface for all of Mahout's recommenders
       □ recommend n items for a particular user
       □ estimate the preference of a user towards an item
     ■ a CandidateItemsStrategy fetches all items that might be recommended for a particular user
     ■ a Rescorer allows postprocessing of recommendations

         List<RecommendedItem> topItems = recommender.recommend(1, 10);
         float preference = recommender.estimatePreference(1, 25);
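The "recommend n items" step can be pictured as scoring every candidate item and keeping the n best. The following is a minimal sketch under that assumption; `TopNSketch` and `topN` are hypothetical names, and the code does not reproduce how Mahout selects its top items internally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical sketch: keep the n highest-scored candidate items. */
final class TopNSketch {

    /** Returns up to n item IDs ordered from highest to lowest estimated preference. */
    static List<Long> topN(Map<Long, Double> estimatedPreferences, int n) {
        // Min-heap on the score: the weakest of the current best n sits on top.
        PriorityQueue<Map.Entry<Long, Double>> heap = new PriorityQueue<>(
                Comparator.comparingDouble((Map.Entry<Long, Double> e) -> e.getValue()));
        for (Map.Entry<Long, Double> entry : estimatedPreferences.entrySet()) {
            if (Double.isNaN(entry.getValue())) {
                continue; // no estimate available for this candidate
            }
            heap.offer(entry);
            if (heap.size() > n) {
                heap.poll(); // drop the weakest candidate
            }
        }
        List<Long> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result); // heap yields ascending scores
        return result;
    }

    public static void main(String[] args) {
        Map<Long, Double> scores = new HashMap<>();
        scores.put(10L, 4.5);
        scores.put(11L, 2.0);
        scores.put(12L, 4.9);
        System.out.println(topN(scores, 2));
    }
}
```

The bounded heap keeps memory proportional to n rather than to the number of candidates, which matters when a candidate strategy returns many items.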
  5. Item-Based Collaborative Filtering
     ■ ItemBasedRecommender
       □ can also compute item similarities
       □ can provide preferences for items as justification for recommendations
     ■ lots of similarity measures available (Pearson correlation, Jaccard coefficient, ...)
     ■ also allows the use of precomputed item similarities stored in a file (via FileItemSimilarity)

         ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel,
             new PearsonCorrelationSimilarity(dataModel));
         List<RecommendedItem> similarItems = recommender.mostSimilarItems(5, 10);
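As an illustration of what an item similarity measure computes, here is a minimal sketch of the Pearson correlation between two items, taken over the users who rated both. `ItemPearsonSketch` is a hypothetical name, and details such as returning NaN for too little overlap are assumptions of this sketch rather than a description of PearsonCorrelationSimilarity.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: Pearson correlation between two items over co-rating users. */
final class ItemPearsonSketch {

    /**
     * Each map goes from userID to that user's rating of the item.
     * Returns NaN if fewer than two users rated both items or if either
     * item has no variance among the common users.
     */
    static double pearson(Map<Long, Double> ratingsA, Map<Long, Double> ratingsB) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (Map.Entry<Long, Double> entry : ratingsA.entrySet()) {
            Double other = ratingsB.get(entry.getKey());
            if (other == null) {
                continue; // user did not rate item B
            }
            double a = entry.getValue();
            double b = other;
            n++;
            sumA += a;
            sumB += b;
            sumAA += a * a;
            sumBB += b * b;
            sumAB += a * b;
        }
        if (n < 2) {
            return Double.NaN;
        }
        double covariance = sumAB - sumA * sumB / n;
        double varianceA = sumAA - sumA * sumA / n;
        double varianceB = sumBB - sumB * sumB / n;
        if (varianceA <= 0 || varianceB <= 0) {
            return Double.NaN;
        }
        return covariance / Math.sqrt(varianceA * varianceB);
    }

    public static void main(String[] args) {
        Map<Long, Double> itemA = new HashMap<>();
        Map<Long, Double> itemB = new HashMap<>();
        itemA.put(1L, 1.0); itemA.put(2L, 2.0); itemA.put(3L, 3.0);
        itemB.put(1L, 2.0); itemB.put(2L, 4.0); itemB.put(3L, 6.0); itemB.put(4L, 1.0);
        System.out.println(pearson(itemA, itemB));
    }
}
```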
  6. Latent factor models
     ■ SVDRecommender
       □ uses a decomposition of the user-item interaction matrix to compute recommendations
     ■ uses a Factorizer to compute a Factorization from a DataModel; several implementations are available
       □ Simon Funk's SGD
       □ Alternating Least Squares
       □ Weighted matrix factorization for implicit feedback data

         Factorizer factorizer = new ALSWRFactorizer(dataModel, numFeatures, lambda, numIterations);
         Recommender svdRecommender = new SVDRecommender(dataModel, factorizer);
         List<RecommendedItem> topItems = svdRecommender.recommend(1, 10);
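To show what "decomposing the user-item interaction matrix" means in practice, here is a minimal sketch of matrix factorization trained with plain stochastic gradient descent on the observed ratings. It is written in the spirit of SGD-based factorizers but is not Mahout's Factorizer code; `SgdFactorizationSketch`, its parameters and the initialization range are assumptions made for illustration.

```java
import java.util.Random;

/** Hypothetical sketch: rating matrix factorization via plain SGD. */
final class SgdFactorizationSketch {
    final double[][] userFactors;
    final double[][] itemFactors;

    SgdFactorizationSketch(int numUsers, int numItems, int numFeatures, long seed) {
        Random random = new Random(seed);
        userFactors = randomMatrix(numUsers, numFeatures, random);
        itemFactors = randomMatrix(numItems, numFeatures, random);
    }

    private static double[][] randomMatrix(int rows, int cols, Random random) {
        double[][] matrix = new double[rows][cols];
        for (double[] row : matrix) {
            for (int f = 0; f < cols; f++) {
                row[f] = 0.1 + 0.1 * random.nextDouble(); // small positive start values
            }
        }
        return matrix;
    }

    /** Predicted rating: dot product of the user and item feature vectors. */
    double predict(int user, int item) {
        double dot = 0;
        for (int f = 0; f < userFactors[user].length; f++) {
            dot += userFactors[user][f] * itemFactors[item][f];
        }
        return dot;
    }

    /** Rating k is (users[k], items[k], values[k]); lambda is the L2 regularization weight. */
    void train(int[] users, int[] items, double[] values,
               int numEpochs, double learningRate, double lambda) {
        for (int epoch = 0; epoch < numEpochs; epoch++) {
            for (int k = 0; k < values.length; k++) {
                int u = users[k];
                int i = items[k];
                double error = values[k] - predict(u, i);
                for (int f = 0; f < userFactors[u].length; f++) {
                    double pu = userFactors[u][f];
                    double qi = itemFactors[i][f];
                    userFactors[u][f] += learningRate * (error * qi - lambda * pu);
                    itemFactors[i][f] += learningRate * (error * pu - lambda * qi);
                }
            }
        }
    }

    /** Root mean squared error on the given ratings. */
    double rmse(int[] users, int[] items, double[] values) {
        double sum = 0;
        for (int k = 0; k < values.length; k++) {
            double error = values[k] - predict(users[k], items[k]);
            sum += error * error;
        }
        return Math.sqrt(sum / values.length);
    }

    public static void main(String[] args) {
        int[] users = {0, 0, 0, 1, 1, 2, 2, 3, 3, 3};
        int[] items = {0, 1, 2, 0, 3, 1, 2, 0, 2, 3};
        double[] values = {5, 3, 4, 4, 1, 2, 5, 5, 4, 2};
        SgdFactorizationSketch model = new SgdFactorizationSketch(4, 4, 2, 42L);
        System.out.println("RMSE before: " + model.rmse(users, items, values));
        model.train(users, items, values, 200, 0.02, 0.02);
        System.out.println("RMSE after:  " + model.rmse(users, items, values));
    }
}
```

The learned feature vectors play the role of the Factorization that a recommender then uses: an unseen (user, item) pair is scored with the same dot product.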
  7. Evaluating recommenders
     ■ RecommenderEvaluator, RecommenderIRStatsEvaluator
       □ measure the prediction quality of a recommender by using a random split of the dataset
       □ support for MAE, RMSE, Precision, Recall, ...
       □ need a DataModel, a RecommenderBuilder, and a DataModelBuilder for the training data

         RecommenderEvaluator maeEvaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
         maeEvaluator.evaluate(
             new BiasedRecommenderBuilder(lambda2, lambda3, numIterations),
             new InteractionCutDataModelBuilder(maxPrefsPerUser),
             dataModel, trainingPercentage, 1 - trainingPercentage);
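For reference, the two rating-prediction metrics named on the slide can be sketched in a few lines. `ErrorMetricsSketch` is a hypothetical helper that works on plain arrays of held-out ratings and the corresponding predictions; it says nothing about how Mahout's evaluators organize the split.

```java
/** Hypothetical sketch: MAE and RMSE over held-out ratings and their predictions. */
final class ErrorMetricsSketch {

    /** Mean absolute error: average of |actual - predicted|. */
    static double meanAbsoluteError(double[] actual, double[] predicted) {
        checkLengths(actual, predicted);
        double sum = 0;
        for (int k = 0; k < actual.length; k++) {
            sum += Math.abs(actual[k] - predicted[k]);
        }
        return sum / actual.length;
    }

    /** Root mean squared error: penalizes large individual errors more than MAE does. */
    static double rootMeanSquaredError(double[] actual, double[] predicted) {
        checkLengths(actual, predicted);
        double sum = 0;
        for (int k = 0; k < actual.length; k++) {
            double error = actual[k] - predicted[k];
            sum += error * error;
        }
        return Math.sqrt(sum / actual.length);
    }

    private static void checkLengths(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
    }

    public static void main(String[] args) {
        double[] actual = {4, 3, 5};
        double[] predicted = {3, 3, 3};
        System.out.println("MAE:  " + meanAbsoluteError(actual, predicted));
        System.out.println("RMSE: " + rootMeanSquaredError(actual, predicted));
    }
}
```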
  9. Starting to work on Mahout
     ■ Prerequisites
       □ Java 6
       □ Maven
       □ svn client
     ■ check out the source code from http://svn.apache.org/repos/asf/mahout/trunk
     ■ import it as a Maven project into your favorite IDE
  10. Project: novel item similarity measure
     ■ in the Million Song Dataset Challenge, a novel item similarity measure was used in the winning solution
     ■ it would be great to see this one also featured in Mahout
     ■ Task
       □ implement the novel item similarity measure as a subclass of Mahout's ItemSimilarity
     ■ Future Work
       □ this novel similarity measure is asymmetric; ensure that it is correctly applied in all scenarios
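The slide does not say which measure the winning solution used. Purely to illustrate what an asymmetric item similarity looks like and why argument order matters, the sketch below assumes a set-based asymmetric cosine, |U(i) ∩ U(j)| / (|U(i)|^alpha * |U(j)|^(1 - alpha)), where U(i) is the set of users who interacted with item i. The formula choice, the class `AsymmetricCosineSketch` and the parameter `alpha` are assumptions of this sketch, and the code does not implement Mahout's ItemSimilarity interface.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: an asymmetric, set-based item similarity. */
final class AsymmetricCosineSketch {

    /**
     * similarity(i, j) = |U(i) ∩ U(j)| / (|U(i)|^alpha * |U(j)|^(1 - alpha)).
     * alpha = 0.5 gives the symmetric cosine; any other alpha makes
     * similarity(i, j) differ from similarity(j, i) when the sets differ in size.
     */
    static double similarity(Set<Long> usersOfI, Set<Long> usersOfJ, double alpha) {
        if (usersOfI.isEmpty() || usersOfJ.isEmpty()) {
            return 0.0;
        }
        // Iterate over the smaller set when counting common users.
        Set<Long> smaller = usersOfI.size() <= usersOfJ.size() ? usersOfI : usersOfJ;
        Set<Long> larger = smaller == usersOfI ? usersOfJ : usersOfI;
        int common = 0;
        for (Long user : smaller) {
            if (larger.contains(user)) {
                common++;
            }
        }
        return common / (Math.pow(usersOfI.size(), alpha) * Math.pow(usersOfJ.size(), 1.0 - alpha));
    }

    public static void main(String[] args) {
        Set<Long> usersOfI = new HashSet<>(Arrays.asList(1L, 2L, 3L, 4L));
        Set<Long> usersOfJ = new HashSet<>(Arrays.asList(3L, 4L));
        System.out.println("sim(i, j) = " + similarity(usersOfI, usersOfJ, 1.0));
        System.out.println("sim(j, i) = " + similarity(usersOfJ, usersOfI, 1.0));
    }
}
```

Because sim(i, j) and sim(j, i) can differ, any code that caches or looks up similarities by unordered item pair would silently return the wrong value, which is the concern raised under Future Work.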
  11. Project: temporal split evaluator
     ■ currently Mahout's standard RecommenderEvaluator randomly splits the data into training and test sets
     ■ for datasets with timestamps it would be much more interesting to use this temporal information to split the data into training and test sets
     ■ Task
       □ create a TemporalSplitRecommenderEvaluator similar to the existing AbstractDifferenceRecommenderEvaluator
     ■ Future Work
       □ factor out the logic for splitting datasets into training and test sets
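The core of a temporal split can be sketched independently of Mahout: sort the preferences by timestamp and train on the oldest part. `TemporalSplitSketch`, `TimedPreference` and `Split` are hypothetical names, and a single global cutoff is only one possible design (a per-user cutoff would be another).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: split timestamped preferences into older training and newer test data. */
final class TemporalSplitSketch {

    static final class TimedPreference {
        final long userID;
        final long itemID;
        final float value;
        final long timestamp;

        TimedPreference(long userID, long itemID, float value, long timestamp) {
            this.userID = userID;
            this.itemID = itemID;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    static final class Split {
        final List<TimedPreference> training;
        final List<TimedPreference> test;

        Split(List<TimedPreference> training, List<TimedPreference> test) {
            this.training = training;
            this.test = test;
        }
    }

    /** The oldest trainingFraction of all preferences become training data, the rest test data. */
    static Split split(List<TimedPreference> preferences, double trainingFraction) {
        List<TimedPreference> sorted = new ArrayList<>(preferences);
        sorted.sort(Comparator.comparingLong((TimedPreference p) -> p.timestamp));
        int cut = (int) Math.round(sorted.size() * trainingFraction);
        cut = Math.max(0, Math.min(sorted.size(), cut));
        return new Split(
                new ArrayList<>(sorted.subList(0, cut)),
                new ArrayList<>(sorted.subList(cut, sorted.size())));
    }

    public static void main(String[] args) {
        List<TimedPreference> preferences = new ArrayList<>();
        long[] timestamps = {50L, 10L, 40L, 20L, 30L};
        for (long t : timestamps) {
            preferences.add(new TimedPreference(1L, t, 3.0f, t));
        }
        Split result = split(preferences, 0.6);
        System.out.println(result.training.size() + " training, " + result.test.size() + " test");
    }
}
```

Unlike a random split, no test preference is older than any training preference, so the evaluation never lets the recommender learn from the future.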
  12. Project: baseline method for rating prediction
     ■ port MyMediaLite's UserItemBaseline to Mahout (preliminary port already available)
     ■ user-item baseline estimation is a simple approach that estimates the global tendency of a user or an item to deviate from the average rating (described in Y. Koren: Factor in the Neighbors: Scalable and Accurate Collaborative Filtering, TKDD 2009)
     ■ Task
       □ polish the code
       □ make it work with Mahout's DataModel
     ■ Future Work
       □ create an ItemBasedRecommender that makes use of the estimated biases
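A baseline predictor of this kind estimates a rating as the global mean plus an item bias plus a user bias. The sketch below uses a single-pass, regularized estimate: item biases first, then user biases on what remains. `UserItemBaselineSketch` and the names `lambda2` and `lambda3` are assumptions for illustration; it is neither MyMediaLite's UserItemBaseline nor the preliminary Mahout port.

```java
/** Hypothetical sketch: global mean + regularized item bias + regularized user bias. */
final class UserItemBaselineSketch {
    final double globalMean;
    final double[] userBias;
    final double[] itemBias;

    /** Rating k is (users[k], items[k], values[k]); lambda2 and lambda3 shrink the biases. */
    UserItemBaselineSketch(int numUsers, int numItems, int[] users, int[] items,
                           double[] values, double lambda2, double lambda3) {
        double sum = 0;
        for (double value : values) {
            sum += value;
        }
        globalMean = values.length == 0 ? 0 : sum / values.length;

        // Item bias: shrunken average deviation of the item's ratings from the mean.
        itemBias = new double[numItems];
        int[] itemCount = new int[numItems];
        for (int k = 0; k < values.length; k++) {
            itemBias[items[k]] += values[k] - globalMean;
            itemCount[items[k]]++;
        }
        for (int i = 0; i < numItems; i++) {
            double denominator = lambda2 + itemCount[i];
            itemBias[i] = denominator > 0 ? itemBias[i] / denominator : 0;
        }

        // User bias: shrunken average of what mean and item bias leave unexplained.
        userBias = new double[numUsers];
        int[] userCount = new int[numUsers];
        for (int k = 0; k < values.length; k++) {
            userBias[users[k]] += values[k] - globalMean - itemBias[items[k]];
            userCount[users[k]]++;
        }
        for (int u = 0; u < numUsers; u++) {
            double denominator = lambda3 + userCount[u];
            userBias[u] = denominator > 0 ? userBias[u] / denominator : 0;
        }
    }

    double predict(int user, int item) {
        return globalMean + userBias[user] + itemBias[item];
    }

    public static void main(String[] args) {
        int[] users = {0, 0, 1, 1};
        int[] items = {0, 1, 0, 1};
        double[] values = {5, 3, 4, 2};
        UserItemBaselineSketch baseline =
                new UserItemBaselineSketch(2, 2, users, items, values, 0.0, 0.0);
        System.out.println(baseline.predict(0, 0));
    }
}
```

Larger values of `lambda2` and `lambda3` pull the biases of rarely rated items and rarely rating users towards zero, so their predictions stay close to the global mean.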
  13. Thank you. Questions?
     Sebastian Schelter
     Database Systems and Information Management Group (DIMA)
     Technische Universität Berlin