Introduction to Collaborative Filtering with Apache Mahout
Transcript

  • 1. An Introduction to Collaborative Filtering with Apache Mahout
       Sebastian Schelter
       Recommender Systems Challenge at ACM RecSys 2012
       Database Systems and Information Management Group (DIMA), Technische Universität Berlin
       13.09.2012, http://www.dima.tu-berlin.de/
  • 2. Overview
      ■ Apache Mahout: Apache-licensed library with the goal of providing highly scalable data mining and machine learning
      ■ its collaborative filtering module is based on the Taste framework of Sean Owen
      ■ mostly aimed at production scenarios, with a focus on
          □ processing efficiency
          □ ease of integration with different datastores, web applications, Amazon EC2
          □ scalability: recommendations, item similarities and matrix decompositions can be computed via MapReduce on Apache Hadoop
      ■ not used that much in recommender challenges
          □ not enough different algorithms implemented?
          □ not enough tooling for evaluation?
          → it's open source, so it's up to you to change that!
  • 3. Preference & DataModel
      ■ Preference encapsulates a user-item interaction as a (user, item, value) triple
          □ only numeric userIDs and itemIDs are allowed, for memory efficiency
          □ PreferenceArray encapsulates a set of preferences
      ■ DataModel encapsulates a dataset
          □ lots of convenient accessor methods like getNumUsers(), getPreferencesForItem(itemID), ...
          □ allows adding temporal information to preferences
          □ lots of options for storing the data (in-memory, file, database, key-value store)
          □ drawback: for a lot of use cases, all the data has to fit into memory to allow efficient recommendation

        DataModel dataModel = new FileDataModel(new File("movielens.csv"));
        PreferenceArray prefsOfUser1 = dataModel.getPreferencesFromUser(1);
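To illustrate why numeric IDs help with memory, here is a minimal plain-Java sketch of the idea behind a per-user preference array: the (user, item, value) triples of one user are kept in parallel primitive arrays instead of one object per preference. The class names `UserPrefsSketch` and `TripleParserSketch` and the `userID,itemID,value` line format are assumptions made for this example; they are not Mahout classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a per-user preference array; not a Mahout class. */
class UserPrefsSketch {
    final long userId;
    final long[] itemIds;  // primitive arrays: no object overhead per preference
    final float[] values;

    UserPrefsSketch(long userId, long[] itemIds, float[] values) {
        this.userId = userId;
        this.itemIds = itemIds;
        this.values = values;
    }

    int length() {
        return itemIds.length;
    }
}

/** Hypothetical parser for "userID,itemID,value" lines (format assumed for this sketch). */
class TripleParserSketch {
    static Map<Long, UserPrefsSketch> parse(List<String> lines) {
        // First pass: group the raw rows by user, keeping input order.
        Map<Long, List<String[]>> rowsByUser = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split(",");
            long userId = Long.parseLong(parts[0]);
            rowsByUser.computeIfAbsent(userId, k -> new ArrayList<>()).add(parts);
        }
        // Second pass: pack each user's rows into primitive arrays.
        Map<Long, UserPrefsSketch> result = new HashMap<>();
        for (Map.Entry<Long, List<String[]>> entry : rowsByUser.entrySet()) {
            List<String[]> rows = entry.getValue();
            long[] itemIds = new long[rows.size()];
            float[] values = new float[rows.size()];
            for (int i = 0; i < rows.size(); i++) {
                itemIds[i] = Long.parseLong(rows.get(i)[1]);
                values[i] = Float.parseFloat(rows.get(i)[2]);
            }
            result.put(entry.getKey(), new UserPrefsSketch(entry.getKey(), itemIds, values));
        }
        return result;
    }
}
```

Calling `TripleParserSketch.parse(...)` on a list of lines yields one packed array pair per user, which is roughly the shape of data an in-memory model has to hold.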
  • 4. Recommender
      ■ Recommender is the basic interface for all of Mahout's recommenders
          □ recommend n items for a particular user
          □ estimate the preference of a user towards an item
      ■ a CandidateItemsStrategy fetches all items that might be recommended for a particular user
      ■ a Rescorer allows postprocessing recommendations

        List<RecommendedItem> topItems = recommender.recommend(1, 10);
        float preference = recommender.estimatePreference(1, 25);
  • 5. Item-Based Collaborative Filtering
      ■ ItemBasedRecommender
          □ can also compute item similarities
          □ can provide preferences for items as justification for recommendations
      ■ lots of similarity measures available (Pearson correlation, Jaccard coefficient, ...)
      ■ also allows usage of precomputed item similarities stored in a file (via FileItemSimilarity)

        ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel,
            new PearsonCorrelationSimilarity(dataModel));
        List<RecommendedItem> similarItems = recommender.mostSimilarItems(5, 10);
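The two similarity measures named on the slide above can be sketched in plain Java as follows. This is an illustration of the textbook definitions, not Mahout's implementation; the class name `SimilaritySketch` and the choice to return `NaN` for undefined correlations are assumptions of this example.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper with textbook item similarity measures; not a Mahout class. */
class SimilaritySketch {

    /** Jaccard coefficient of the sets of users who interacted with item A and item B. */
    static double jaccard(Set<Long> usersOfA, Set<Long> usersOfB) {
        if (usersOfA.isEmpty() && usersOfB.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        Set<Long> intersection = new HashSet<>(usersOfA);
        intersection.retainAll(usersOfB);
        int unionSize = usersOfA.size() + usersOfB.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    /**
     * Pearson correlation of two items over their co-rating users:
     * x[k] and y[k] are the ratings the k-th common user gave to each item.
     */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("rating vectors must have equal length");
        }
        int n = x.length;
        if (n < 2) {
            return Double.NaN; // correlation is undefined for fewer than two points
        }
        double meanX = 0.0;
        double meanY = 0.0;
        for (int k = 0; k < n; k++) {
            meanX += x[k];
            meanY += y[k];
        }
        meanX /= n;
        meanY /= n;
        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int k = 0; k < n; k++) {
            double dx = x[k] - meanX;
            double dy = y[k] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        double denom = Math.sqrt(varX * varY);
        return denom == 0.0 ? Double.NaN : cov / denom;
    }
}
```

Jaccard only looks at who interacted with an item, so it suits implicit feedback; Pearson needs explicit rating values from users who rated both items.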
  • 6. Latent factor models
      ■ SVDRecommender
          □ uses a decomposition of the user-item interaction matrix to compute recommendations
      ■ uses a Factorizer to compute a Factorization from a DataModel; several different implementations are available
          □ Simon Funk's SGD
          □ Alternating Least Squares
          □ Weighted matrix factorization for implicit feedback data

        Factorizer factorizer = new ALSWRFactorizer(dataModel, numFeatures, lambda, numIterations);
        Recommender svdRecommender = new SVDRecommender(dataModel, factorizer);
        List<RecommendedItem> topItems = svdRecommender.recommend(1, 10);
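As a rough illustration of what an SGD-based factorizer does, the sketch below performs one stochastic gradient step on a single observed rating in a regularized latent factor model, where the predicted rating is the dot product of a user vector and an item vector. It is a simplified textbook update that changes all features at once; it is not Mahout's code, and the names `SgdStepSketch`, `learningRate` and `lambda` are chosen for this example.

```java
/** Hypothetical single-rating SGD update for matrix factorization; not a Mahout class. */
class SgdStepSketch {

    /** Predicted rating: dot product of user and item feature vectors. */
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int k = 0; k < a.length; k++) {
            sum += a[k] * b[k];
        }
        return sum;
    }

    /**
     * One gradient step on the regularized squared error of a single rating.
     * Both vectors are modified in place.
     */
    static void update(double[] userFeatures, double[] itemFeatures,
                       double rating, double learningRate, double lambda) {
        double err = rating - dot(userFeatures, itemFeatures);
        for (int k = 0; k < userFeatures.length; k++) {
            // Read both old values before writing, so the step uses a consistent state.
            double u = userFeatures[k];
            double i = itemFeatures[k];
            userFeatures[k] = u + learningRate * (err * i - lambda * u);
            itemFeatures[k] = i + learningRate * (err * u - lambda * i);
        }
    }
}
```

A full factorizer would loop such updates over all known ratings for several epochs; Alternating Least Squares instead fixes one side (users or items) and solves a least squares problem for the other.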
  • 7. Evaluating recommenders
      ■ RecommenderEvaluator, RecommenderIRStatsEvaluator
          □ measure the prediction quality of a recommender using a random split of the dataset
          □ support for MAE, RMSE, Precision, Recall, ...
          □ need a DataModel, a RecommenderBuilder, and a DataModelBuilder for the training data

        RecommenderEvaluator maeEvaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
        maeEvaluator.evaluate(
            new BiasedRecommenderBuilder(lambda2, lambda3, numIterations),
            new InteractionCutDataModelBuilder(maxPrefsPerUser),
            dataModel, trainingPercentage, 1 - trainingPercentage);
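For reference, the two rating-prediction metrics mentioned above reduce to a few lines of arithmetic over pairs of predicted and held-out ratings. The sketch below is plain Java with a hypothetical class name `ErrorMetricsSketch`; it only shows the definitions of MAE and RMSE and says nothing about how Mahout's evaluators split the data.

```java
/** Hypothetical helper computing MAE and RMSE; not a Mahout class. */
class ErrorMetricsSketch {

    /** Mean absolute error: average of |predicted - actual|. */
    static double mae(double[] predicted, double[] actual) {
        checkInput(predicted, actual);
        double sum = 0.0;
        for (int k = 0; k < predicted.length; k++) {
            sum += Math.abs(predicted[k] - actual[k]);
        }
        return sum / predicted.length;
    }

    /** Root mean squared error: square root of the average squared difference. */
    static double rmse(double[] predicted, double[] actual) {
        checkInput(predicted, actual);
        double sum = 0.0;
        for (int k = 0; k < predicted.length; k++) {
            double diff = predicted[k] - actual[k];
            sum += diff * diff;
        }
        return Math.sqrt(sum / predicted.length);
    }

    private static void checkInput(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
    }
}
```

RMSE penalizes large individual errors more heavily than MAE, which is why the two can rank recommenders differently.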
  • 9. Starting to work on Mahout
      ■ Prerequisites
          □ Java 6
          □ Maven
          □ svn client
      ■ check out the source code from http://svn.apache.org/repos/asf/mahout/trunk
      ■ import it as a Maven project into your favorite IDE
  • 10. Project: novel item similarity measure
      ■ in the Million Song Dataset Challenge, a novel item similarity measure was used in the winning solution
      ■ it would be great to see this one also featured in Mahout
      ■ Task
          □ implement the novel item similarity measure as a subclass of Mahout's ItemSimilarity
      ■ Future Work
          □ this novel similarity measure is asymmetric; ensure that it is correctly applied in all scenarios
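The slide above does not spell out the measure, so the following is only an illustration of what asymmetry means for an item similarity. It assumes an asymmetric cosine over the sets of users who interacted with each item, with a parameter `alpha` that shifts weight between the two items; the class name `AsymmetricSimilaritySketch` and the exact formula are assumptions of this sketch, not a statement about the challenge solution or about Mahout.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical asymmetric item similarity over user sets; not a Mahout class. */
class AsymmetricSimilaritySketch {

    /**
     * Assumed formula: |U(i) ∩ U(j)| / (|U(i)|^alpha * |U(j)|^(1 - alpha)).
     * alpha = 0.5 gives the ordinary (symmetric) cosine on binary data;
     * alpha = 1.0 gives the fraction of i's users who also interacted with j.
     */
    static double similarity(Set<Long> usersOfI, Set<Long> usersOfJ, double alpha) {
        if (usersOfI.isEmpty() || usersOfJ.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        Set<Long> common = new HashSet<>(usersOfI);
        common.retainAll(usersOfJ);
        double denom = Math.pow(usersOfI.size(), alpha) * Math.pow(usersOfJ.size(), 1.0 - alpha);
        return common.size() / denom;
    }
}
```

With any `alpha` other than 0.5, `similarity(i, j, alpha)` and `similarity(j, i, alpha)` generally differ, so code that caches or looks up similarities by unordered item pair would give wrong results. That is the kind of scenario the "Future Work" item asks to check.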
  • 11. Project: temporal split evaluator
      ■ currently, Mahout's standard RecommenderEvaluator randomly splits the data into a training and a test set
      ■ for datasets with timestamps, it would be much more interesting to use this temporal information to split the data into a training and a test set
      ■ Task
          □ create a TemporalSplitRecommenderEvaluator similar to the existing AbstractDifferenceRecommenderEvaluator
      ■ Future Work
          □ factor out the logic for splitting datasets into a training and a test set
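The core of a temporal split can be sketched independently of Mahout: sort the preferences by timestamp and put the oldest share into the training set, so that the recommender is always tested on interactions that happened after everything it was trained on. The types below (`TemporalSplitSketch`, `TimedPref`) are hypothetical and only illustrate the splitting logic; wiring this into a RecommenderEvaluator is the actual project task.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical temporal train/test split; not a Mahout class. */
class TemporalSplitSketch {

    /** Hypothetical timestamped preference used only in this sketch. */
    static final class TimedPref {
        final long userId;
        final long itemId;
        final float value;
        final long timestamp;

        TimedPref(long userId, long itemId, float value, long timestamp) {
            this.userId = userId;
            this.itemId = itemId;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /**
     * Returns a two-element list [training, test]. The oldest
     * floor(n * trainingFraction) preferences form the training set.
     */
    static List<List<TimedPref>> split(List<TimedPref> prefs, double trainingFraction) {
        if (trainingFraction < 0.0 || trainingFraction > 1.0) {
            throw new IllegalArgumentException("trainingFraction must be in [0, 1]");
        }
        List<TimedPref> sorted = new ArrayList<>(prefs); // leave the input untouched
        sorted.sort((a, b) -> Long.compare(a.timestamp, b.timestamp));
        int cut = (int) Math.floor(sorted.size() * trainingFraction);
        List<List<TimedPref>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(sorted.subList(0, cut)));
        parts.add(new ArrayList<>(sorted.subList(cut, sorted.size())));
        return parts;
    }
}
```

This version uses one global cut point. A per-user split (the latest preferences of each user go to the test set) is another reasonable design and keeps every test user present in the training data.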
  • 12. Project: baseline method for rating prediction
      ■ port MyMediaLite's UserItemBaseline to Mahout (preliminary port already available)
      ■ user-item baseline estimation is a simple approach that estimates the global tendency of a user or an item to deviate from the average rating (described in Y. Koren: Factor in the Neighbors: Scalable and Accurate Collaborative Filtering, TKDD 2009)
      ■ Task
          □ polish the code
          □ make it work with Mahout's DataModel
      ■ Future Work
          □ create an ItemBasedRecommender that makes use of the estimated biases
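A minimal sketch of such a baseline, assuming the simple one-pass estimates from the cited Koren paper: the prediction is the global mean plus an item bias plus a user bias, where each bias is a damped average of residuals. The class name `BaselineSketch`, the `double[][]` input layout and the parameter names `lambda2` and `lambda3` are assumptions of this example; MyMediaLite's UserItemBaseline and the preliminary port may differ, for instance by iterating the estimates.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical user-item baseline predictor; not MyMediaLite's or Mahout's code. */
class BaselineSketch {
    final double globalMean;
    final Map<Long, Double> itemBias = new HashMap<>();
    final Map<Long, Double> userBias = new HashMap<>();

    /**
     * ratings[k] = {userId, itemId, value}.
     * lambda2 damps item biases, lambda3 damps user biases.
     */
    BaselineSketch(double[][] ratings, double lambda2, double lambda3) {
        double sum = 0.0;
        for (double[] r : ratings) {
            sum += r[2];
        }
        globalMean = ratings.length == 0 ? 0.0 : sum / ratings.length;

        // Item bias: sum of (rating - mean) over the item's ratings / (lambda2 + count).
        Map<Long, double[]> itemAcc = new HashMap<>(); // value = {residual sum, count}
        for (double[] r : ratings) {
            double[] acc = itemAcc.computeIfAbsent((long) r[1], k -> new double[2]);
            acc[0] += r[2] - globalMean;
            acc[1] += 1.0;
        }
        for (Map.Entry<Long, double[]> e : itemAcc.entrySet()) {
            itemBias.put(e.getKey(), e.getValue()[0] / (lambda2 + e.getValue()[1]));
        }

        // User bias: sum of (rating - mean - item bias) over the user's ratings / (lambda3 + count).
        Map<Long, double[]> userAcc = new HashMap<>();
        for (double[] r : ratings) {
            double[] acc = userAcc.computeIfAbsent((long) r[0], k -> new double[2]);
            acc[0] += r[2] - globalMean - itemBias.get((long) r[1]);
            acc[1] += 1.0;
        }
        for (Map.Entry<Long, double[]> e : userAcc.entrySet()) {
            userBias.put(e.getKey(), e.getValue()[0] / (lambda3 + e.getValue()[1]));
        }
    }

    /** Baseline estimate; unknown users or items contribute a bias of zero. */
    double estimate(long userId, long itemId) {
        return globalMean
            + userBias.getOrDefault(userId, 0.0)
            + itemBias.getOrDefault(itemId, 0.0);
    }
}
```

The damping terms pull the biases of rarely seen users and items towards zero, which keeps a single extreme rating from dominating the estimate.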
  • 13. Thank you. Questions?
       Sebastian Schelter
       Database Systems and Information Management Group (DIMA), Technische Universität Berlin