Introduction to Collaborative Filtering with Apache Mahout

An Introduction to Collaborative Filtering
with Apache Mahout

Sebastian Schelter
Recommender Systems Challenge
at ACM RecSys 2012

Database Systems and Information Management Group (DIMA)
Technische Universität Berlin

13.09.2012
http://www.dima.tu-berlin.de/

Overview

■ Apache Mahout: Apache-licensed library
that aims to provide highly scalable
data mining and machine learning

■ its collaborative filtering module is based on the Taste
framework of Sean Owen

■ mostly aimed at production scenarios, with a focus on
□ processing efficiency
□ integration with different datastores, web applications, Amazon EC2
□ scalability: allows computation of recommendations, item similarities and
matrix decompositions via MapReduce on Apache Hadoop

■ not widely used in recommender challenges
□ not enough different algorithms implemented?
□ not enough tooling for evaluation?

→ it’s open source, so it’s up to you to change that!


Preference & DataModel

■ Preference encapsulates a user-item interaction as a
(user, item, value) triple
□ only numeric userIDs and itemIDs allowed for memory efficiency
□ PreferenceArray encapsulates a set of preferences

■ DataModel encapsulates a dataset
□ lots of convenient accessor methods like getNumUsers(),
getPreferencesForItem(itemID), ...
□ allows adding temporal information to preferences
□ lots of options to store the data (in-memory, file, database, key-value
store)
□ drawback: for a lot of use cases, all the data has to fit into memory to allow
efficient recommendation

DataModel dataModel = new FileDataModel(new File("movielens.csv"));

PreferenceArray prefsOfUser1 = dataModel.getPreferencesFromUser(1);
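
To make the (user, item, value) triple concrete, here is a minimal sketch in plain Java (8 or later) that does not use Mahout at all. It assumes a comma-separated input with one userID,itemID,value triple per line, which is the kind of file FileDataModel is typically pointed at. The names PreferenceSketch, Pref and parse are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PreferenceSketch {

    /** Hypothetical stand-in for a (user, item, value) triple; not Mahout's Preference. */
    static final class Pref {
        final long userID;
        final long itemID;
        final float value;

        Pref(long userID, long itemID, float value) {
            this.userID = userID;
            this.itemID = itemID;
            this.value = value;
        }
    }

    /** Parses "userID,itemID,value" lines and groups the resulting triples by user. */
    static Map<Long, List<Pref>> parse(List<String> lines) {
        Map<Long, List<Pref>> prefsByUser = new HashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.trim().split(",");
            Pref pref = new Pref(
                Long.parseLong(fields[0].trim()),
                Long.parseLong(fields[1].trim()),
                Float.parseFloat(fields[2].trim()));
            prefsByUser.computeIfAbsent(pref.userID, id -> new ArrayList<>()).add(pref);
        }
        return prefsByUser;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1,10,4.0", "1,11,3.5", "2,10,5.0");
        Map<Long, List<Pref>> prefsByUser = parse(lines);
        System.out.println("user 1 has " + prefsByUser.get(1L).size() + " preferences");
    }
}
```

Keeping IDs as primitive long values mirrors the "numeric IDs only" restriction above; a real implementation would avoid the boxed map keys for memory efficiency.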


Recommender

■ Recommender is the basic interface for all of Mahout’s
recommenders
□ recommend n items for a particular user
□ estimate the preference of a user towards an item

■ a CandidateItemsStrategy fetches all items that might be
recommended for a particular user

■ a Rescorer allows postprocessing of recommendations

List<RecommendedItem> topItems = recommender.recommend(1, 10);

float preference = recommender.estimatePreference(1, 25);
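
The interplay of candidate selection, preference estimation and rescoring can be sketched roughly as follows. This is a simplified standalone illustration in plain Java (8 or later), not Mahout's implementation: the class TopNSketch, the estimate function and the isFiltered predicate are hypothetical stand-ins for a CandidateItemsStrategy, estimatePreference and a Rescorer's filtering role.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.LongPredicate;
import java.util.function.LongToDoubleFunction;

final class TopNSketch {

    /** Hypothetical pair of item ID and estimated preference. */
    static final class ScoredItem {
        final long itemID;
        final double score;

        ScoredItem(long itemID, double score) {
            this.itemID = itemID;
            this.score = score;
        }
    }

    /** Scores every unfiltered candidate and returns the n best, highest score first. */
    static List<ScoredItem> topN(long[] candidates, LongToDoubleFunction estimate,
                                 LongPredicate isFiltered, int n) {
        List<ScoredItem> scored = new ArrayList<>();
        for (long itemID : candidates) {
            if (isFiltered.test(itemID)) {
                continue; // excluded by the postprocessing step
            }
            double score = estimate.applyAsDouble(itemID);
            if (!Double.isNaN(score)) {
                scored.add(new ScoredItem(itemID, score)); // NaN means "no estimate possible"
            }
        }
        scored.sort(Comparator.comparingDouble((ScoredItem s) -> s.score).reversed());
        return new ArrayList<>(scored.subList(0, Math.min(n, scored.size())));
    }

    public static void main(String[] args) {
        long[] candidates = {1L, 2L, 3L, 4L};
        // Toy estimator: the score equals the item ID; item 4 is filtered out.
        List<ScoredItem> top = topN(candidates, id -> (double) id, id -> id == 4L, 2);
        for (ScoredItem item : top) {
            System.out.println(item.itemID + " -> " + item.score);
        }
    }
}
```

Sorting all candidates is the simplest option for a sketch; for large candidate sets a bounded priority queue of size n would avoid the full sort.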


Item-Based Collaborative Filtering

■ ItemBasedRecommender
□ can also compute item similarities
□ can provide preferences for items as justification for recommendations

■ lots of similarity measures available (Pearson correlation,
Jaccard coefficient, ...)

■ also allows usage of precomputed item similarities stored in a
file (via FileItemSimilarity)

ItemBasedRecommender recommender =
    new GenericItemBasedRecommender(dataModel,
        new PearsonCorrelationSimilarity(dataModel));

List<RecommendedItem> similarItems =
    recommender.mostSimilarItems(5, 10);
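
As an illustration of what an item similarity measure computes, the following standalone sketch (plain Java 8 or later, no Mahout dependency) calculates the Pearson correlation of two items over the users who rated both. The class name and the map-based representation (userID to rating) are assumptions made for the example; Mahout's own PearsonCorrelationSimilarity may differ in details such as weighting or handling of degenerate cases.

```java
import java.util.HashMap;
import java.util.Map;

final class PearsonItemSimilaritySketch {

    /**
     * Pearson correlation between two items, computed only over users who rated both.
     * Each map goes from userID to that user's rating of the item.
     * Returns NaN if fewer than two users overlap or one side has zero variance.
     */
    static double pearson(Map<Long, Double> ratingsOfA, Map<Long, Double> ratingsOfB) {
        int n = 0;
        double sumA = 0.0, sumB = 0.0, sumAA = 0.0, sumBB = 0.0, sumAB = 0.0;
        for (Map.Entry<Long, Double> entry : ratingsOfA.entrySet()) {
            Double other = ratingsOfB.get(entry.getKey());
            if (other == null) {
                continue; // user did not rate item B
            }
            double a = entry.getValue();
            double b = other;
            n++;
            sumA += a;
            sumB += b;
            sumAA += a * a;
            sumBB += b * b;
            sumAB += a * b;
        }
        if (n < 2) {
            return Double.NaN;
        }
        double covariance = sumAB - sumA * sumB / n;
        double varianceA = sumAA - sumA * sumA / n;
        double varianceB = sumBB - sumB * sumB / n;
        double denominator = Math.sqrt(varianceA * varianceB);
        return denominator == 0.0 ? Double.NaN : covariance / denominator;
    }

    public static void main(String[] args) {
        Map<Long, Double> itemA = new HashMap<>();
        Map<Long, Double> itemB = new HashMap<>();
        itemA.put(1L, 1.0); itemA.put(2L, 2.0); itemA.put(3L, 3.0);
        itemB.put(1L, 2.0); itemB.put(2L, 4.0); itemB.put(3L, 6.0);
        System.out.println("similarity = " + pearson(itemA, itemB));
    }
}
```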


Latent factor models

■ SVDRecommender
□ uses a decomposition of the user-item-interaction matrix to compute
recommendations

■ uses a Factorizer to compute a Factorization from a
DataModel; several different implementations are available

□ Simon Funk’s SGD
□ Alternating Least Squares
□ Weighted matrix factorization for implicit feedback data

Factorizer factorizer = new ALSWRFactorizer(dataModel, numFeatures,
    lambda, numIterations);

Recommender svdRecommender =
    new SVDRecommender(dataModel, factorizer);

List<RecommendedItem> topItems = svdRecommender.recommend(1, 10);
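
The idea behind a latent factor model can be sketched in a few lines: each user and each item gets a small feature vector, and the predicted preference is their dot product. The standalone example below (plain Java, no Mahout dependency) also shows a single regularized stochastic gradient descent update in the general style of Funk's approach. It is a generic textbook sketch, not Mahout's Factorizer code; the class name, learning rate and regularization values are assumptions chosen for the illustration.

```java
final class FactorizationSketch {

    /** Predicted preference: dot product of the user and item feature vectors. */
    static double predict(double[] userFeatures, double[] itemFeatures) {
        double dot = 0.0;
        for (int f = 0; f < userFeatures.length; f++) {
            dot += userFeatures[f] * itemFeatures[f];
        }
        return dot;
    }

    /** One regularized SGD update of both vectors for a single observed rating. */
    static void sgdStep(double[] userFeatures, double[] itemFeatures, double rating,
                        double learningRate, double regularization) {
        double error = rating - predict(userFeatures, itemFeatures);
        for (int f = 0; f < userFeatures.length; f++) {
            double u = userFeatures[f]; // keep old values so both updates use them
            double v = itemFeatures[f];
            userFeatures[f] += learningRate * (error * v - regularization * u);
            itemFeatures[f] += learningRate * (error * u - regularization * v);
        }
    }

    public static void main(String[] args) {
        double[] user = {0.1, 0.1};
        double[] item = {0.1, 0.1};
        for (int step = 0; step < 200; step++) {
            sgdStep(user, item, 4.0, 0.05, 0.01);
        }
        System.out.println("prediction after training: " + predict(user, item));
    }
}
```

A real factorizer repeats such updates over all observed ratings for several passes; Alternating Least Squares instead solves for all user vectors with the item vectors fixed, and vice versa.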


Evaluating recommenders

■ RecommenderEvaluator, RecommenderIRStatsEvaluator
□ allow measuring the prediction quality of a recommender by using a
random split of the dataset
□ support for MAE, RMSE, Precision, Recall, ...
□ need a DataModel, a RecommenderBuilder, and a DataModelBuilder for the
training data

RecommenderEvaluator maeEvaluator =
    new AverageAbsoluteDifferenceRecommenderEvaluator();

maeEvaluator.evaluate(
    new BiasedRecommenderBuilder(lambda2, lambda3, numIterations),
    new InteractionCutDataModelBuilder(maxPrefsPerUser),
    dataModel, trainingPercentage, 1 - trainingPercentage);
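
For reference, the two error metrics mentioned above are easy to state in code. The following standalone sketch (plain Java, hypothetical class name, no Mahout dependency) computes the mean absolute error (MAE) and root mean squared error (RMSE) for held-out ratings and the corresponding predictions.

```java
final class ErrorMetricsSketch {

    /** Mean absolute error: average of |actual - predicted|. */
    static double mae(double[] actual, double[] predicted) {
        double sum = 0.0;
        for (int k = 0; k < actual.length; k++) {
            sum += Math.abs(actual[k] - predicted[k]);
        }
        return sum / actual.length;
    }

    /** Root mean squared error: square root of the average squared difference. */
    static double rmse(double[] actual, double[] predicted) {
        double sum = 0.0;
        for (int k = 0; k < actual.length; k++) {
            double diff = actual[k] - predicted[k];
            sum += diff * diff;
        }
        return Math.sqrt(sum / actual.length);
    }

    public static void main(String[] args) {
        double[] actual = {4.0, 3.0, 5.0};
        double[] predicted = {3.0, 3.0, 3.0};
        System.out.println("MAE  = " + mae(actual, predicted));
        System.out.println("RMSE = " + rmse(actual, predicted));
    }
}
```

RMSE penalizes large individual errors more strongly than MAE, which is why the two can rank recommenders differently.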


Starting to work on Mahout

■ Prerequisites
□ Java 6
□ Maven
□ svn client

■ check out the source code from
http://svn.apache.org/repos/asf/mahout/trunk

■ import it as a maven project into your favorite IDE


Project: novel item similarity measure

■ in the Million Song DataSet Challenge, a novel item
similarity measure was used in the winning solution

■ would be great to see this one also featured in Mahout

■ Task
□ implement the novel item similarity measure as an implementation of Mahout’s
ItemSimilarity interface

■ Future Work
□ this novel similarity measure is asymmetric; ensure that it is correctly
applied in all scenarios
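
The slides do not name the measure. Assuming it is the asymmetric cosine similarity for binary feedback, sim(i, j) = |U(i) ∩ U(j)| / (|U(i)|^α · |U(j)|^(1−α)) with U(i) the set of users who interacted with item i, a standalone sketch could look like the following. The class name, the set-based representation and the parameter alpha are assumptions for illustration; this is not wired into Mahout's ItemSimilarity.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class AsymmetricCosineSketch {

    /**
     * Asymmetric cosine similarity between two items, each given as the set of
     * users who interacted with it. alpha = 0.5 yields the ordinary cosine for
     * binary data; other values make sim(i, j) differ from sim(j, i).
     */
    static double similarity(Set<Long> usersOfI, Set<Long> usersOfJ, double alpha) {
        if (usersOfI.isEmpty() || usersOfJ.isEmpty()) {
            return 0.0;
        }
        int common = 0;
        for (Long user : usersOfI) {
            if (usersOfJ.contains(user)) {
                common++;
            }
        }
        double normalization =
            Math.pow(usersOfI.size(), alpha) * Math.pow(usersOfJ.size(), 1.0 - alpha);
        return common / normalization;
    }

    public static void main(String[] args) {
        Set<Long> popular = new HashSet<>(Arrays.asList(1L, 2L, 3L, 4L));
        Set<Long> niche = new HashSet<>(Arrays.asList(1L, 2L));
        System.out.println("sim(popular, niche) = " + similarity(popular, niche, 1.0));
        System.out.println("sim(niche, popular) = " + similarity(niche, popular, 1.0));
    }
}
```

The example makes the asymmetry visible: code that assumes sim(i, j) equals sim(j, i), for instance a cache keyed on unordered item pairs, would silently return wrong values for such a measure.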


Project: temporal split evaluator

■ currently Mahout’s standard RecommenderEvaluator
randomly splits the data into training and test set

■ for datasets with timestamps it would be much more
interesting to use this temporal information to split the data
into training and test set

■ Task
□ create a TemporalSplitRecommenderEvaluator similar to the existing
AbstractDifferenceRecommenderEvaluator

■ Future Work
□ factor out the logic for splitting datasets into training and test set
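
One possible shape for the splitting logic, shown as a standalone sketch in plain Java (8 or later) rather than as a Mahout evaluator: sort the preferences by timestamp, put the oldest fraction into the training set and the remainder into the test set. The names TemporalSplitSketch, TimedPref and split are hypothetical, and splitting globally by time (instead of per user) is just one of several reasonable choices.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TemporalSplitSketch {

    /** Hypothetical preference with a timestamp; not a Mahout class. */
    static final class TimedPref {
        final long userID;
        final long itemID;
        final float value;
        final long timestamp;

        TimedPref(long userID, long itemID, float value, long timestamp) {
            this.userID = userID;
            this.itemID = itemID;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /**
     * Returns a two-element list [training, test]. The oldest trainingFraction
     * of the preferences goes to training, everything newer goes to test.
     */
    static List<List<TimedPref>> split(List<TimedPref> prefs, double trainingFraction) {
        List<TimedPref> sorted = new ArrayList<>(prefs); // leave the input untouched
        sorted.sort(Comparator.comparingLong((TimedPref p) -> p.timestamp));
        int cut = (int) (sorted.size() * trainingFraction);
        List<List<TimedPref>> result = new ArrayList<>();
        result.add(new ArrayList<>(sorted.subList(0, cut)));
        result.add(new ArrayList<>(sorted.subList(cut, sorted.size())));
        return result;
    }

    public static void main(String[] args) {
        List<TimedPref> prefs = new ArrayList<>();
        prefs.add(new TimedPref(1L, 10L, 4.0f, 300L));
        prefs.add(new TimedPref(1L, 11L, 3.0f, 100L));
        prefs.add(new TimedPref(2L, 10L, 5.0f, 400L));
        prefs.add(new TimedPref(2L, 12L, 2.0f, 200L));
        List<List<TimedPref>> parts = split(prefs, 0.5);
        System.out.println("training: " + parts.get(0).size() + ", test: " + parts.get(1).size());
    }
}
```

Unlike a random split, this never trains on interactions that happened after the ones being predicted.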


Project: baseline method for rating prediction

■ port MyMediaLite’s UserItemBaseline to Mahout
(preliminary port already available)

■ user-item-baseline estimation is a simple approach that
estimates the global tendency of a user or an item to
deviate from the average rating
(described in Y. Koren: Factor in the Neighbors: Scalable
and Accurate Collaborative Filtering, TKDD 2009)

■ Task
□ polish the code
□ make it work with Mahout’s DataModel

■ Future Work
□ create an ItemBasedRecommender that makes use of the estimated
biases
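
A rough standalone sketch of the baseline idea, in plain Java without Mahout: the prediction is the global mean plus a user bias plus an item bias, where each bias is a shrunk average of residuals. The shrinkage parameters are called lambda2 and lambda3 here as an assumption, echoing the builder call shown earlier. This is not the MyMediaLite code or its port; the class name and the use of dense int indices instead of long IDs are simplifications for the example.

```java
final class BaselineSketch {
    final double mean;
    final double[] userBias;
    final double[] itemBias;

    /**
     * Fits biases from parallel arrays: rating k was given by users[k] to items[k].
     * Item biases are shrunk by lambda2, user biases by lambda3.
     */
    BaselineSketch(int[] users, int[] items, double[] ratings,
                   int numUsers, int numItems,
                   double lambda2, double lambda3, int numIterations) {
        double sum = 0.0;
        int[] userCount = new int[numUsers];
        int[] itemCount = new int[numItems];
        for (int k = 0; k < ratings.length; k++) {
            sum += ratings[k];
            userCount[users[k]]++;
            itemCount[items[k]]++;
        }
        mean = sum / ratings.length;
        userBias = new double[numUsers];
        itemBias = new double[numItems];

        for (int iter = 0; iter < numIterations; iter++) {
            // Item biases: shrunk average residual with user biases held fixed.
            double[] itemSum = new double[numItems];
            for (int k = 0; k < ratings.length; k++) {
                itemSum[items[k]] += ratings[k] - mean - userBias[users[k]];
            }
            for (int i = 0; i < numItems; i++) {
                itemBias[i] = shrink(itemSum[i], lambda2 + itemCount[i]);
            }
            // User biases: shrunk average residual with item biases held fixed.
            double[] userSum = new double[numUsers];
            for (int k = 0; k < ratings.length; k++) {
                userSum[users[k]] += ratings[k] - mean - itemBias[items[k]];
            }
            for (int u = 0; u < numUsers; u++) {
                userBias[u] = shrink(userSum[u], lambda3 + userCount[u]);
            }
        }
    }

    private static double shrink(double sum, double denominator) {
        return denominator > 0.0 ? sum / denominator : 0.0; // no data: neutral bias
    }

    double predict(int user, int item) {
        return mean + userBias[user] + itemBias[item];
    }

    public static void main(String[] args) {
        BaselineSketch baseline = new BaselineSketch(
            new int[] {0, 0, 1, 1}, new int[] {0, 1, 0, 1},
            new double[] {5.0, 3.0, 4.0, 2.0}, 2, 2, 0.0, 0.0, 1);
        System.out.println("prediction for user 0, item 0: " + baseline.predict(0, 0));
    }
}
```

Larger shrinkage values pull the biases of rarely seen users and items toward zero, so their predictions stay close to the global mean.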


Thank you.

Questions?

Sebastian Schelter
Database Systems and Information Management Group (DIMA)
Technische Universität Berlin
