An Introduction to Collaborative Filtering
             with Apache Mahout



                         Sebastian Schelter
                   Recommender Systems Challenge
                        at ACM RecSys 2012




         Database Systems and Information Management Group (DIMA)
                        Technische Universität Berlin

13.09.2012
                       http://www.dima.tu-berlin.de/
                               DIMA – TU Berlin                 1
Overview


■ Apache Mahout: Apache-licensed library
  that aims to provide highly scalable
  data mining and machine learning

■ its collaborative filtering module is based on the Taste
  framework of Sean Owen

■ mostly aimed at production scenarios, with a focus on
    □ processing efficiency
    □ integration with different datastores, web applications, Amazon EC2
    □ scalability: recommendations, item similarities and matrix
      decompositions can be computed via MapReduce on Apache Hadoop

■ not widely used in recommender challenges
    □ not enough different algorithms implemented?
    □ not enough tooling for evaluation?

    → it's open source, so it's up to you to change that!


Preference & DataModel

■ Preference encapsulates a user-item-interaction as
  (user,item,value) triple
    □ only numeric userIDs and itemIDs allowed for memory efficiency
    □ PreferenceArray encapsulates a set of preferences


■ DataModel encapsulates a dataset
    □ lots of convenient accessor methods like getNumUsers(),
      getPreferencesForItem(itemID), ...
    □ allows adding temporal information to preferences
    □ lots of options to store the data (in-memory, file, database, key-value
      store)
    □ drawback: for many use cases, all the data has to fit into memory to allow
      efficient recommendation

DataModel dataModel = new FileDataModel(new File("movielens.csv"));

PreferenceArray prefsOfUser1 = dataModel.getPreferencesFromUser(1);
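
To illustrate why numeric IDs help with memory efficiency, here is a minimal sketch that keeps one user's preferences in parallel primitive arrays instead of one object per preference. It is not Mahout's PreferenceArray implementation; the class UserPrefs and all of its methods are hypothetical names chosen for this example.

```java
import java.util.Arrays;

/** Hypothetical sketch: one user's preferences held in parallel primitive arrays. */
final class UserPrefs {
    private final long userID;
    private long[] itemIDs = new long[4];
    private float[] values = new float[4];
    private int size = 0;

    UserPrefs(long userID) {
        this.userID = userID;
    }

    /** Appends an (item, value) pair, doubling the arrays when they are full. */
    void add(long itemID, float value) {
        if (size == itemIDs.length) {
            itemIDs = Arrays.copyOf(itemIDs, size * 2);
            values = Arrays.copyOf(values, size * 2);
        }
        itemIDs[size] = itemID;
        values[size] = value;
        size++;
    }

    long getUserID() { return userID; }
    int length() { return size; }
    long getItemID(int i) { return itemIDs[i]; }
    float getValue(int i) { return values[i]; }

    /** Linear scan for this user's preference for an item; NaN if there is none. */
    float valueFor(long itemID) {
        for (int i = 0; i < size; i++) {
            if (itemIDs[i] == itemID) {
                return values[i];
            }
        }
        return Float.NaN;
    }
}
```

Each preference costs one long and one float here, with no per-preference object header or boxed ID, which is the kind of saving that string IDs would rule out.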


Recommender

■ Recommender is the basic interface for all of Mahout's
  recommenders
    □ recommend n items for a particular user
    □ estimate the preference of a user towards an item


■ a CandidateItemsStrategy fetches all items that might be
  recommended for a particular user

■ a Rescorer allows postprocessing recommendations


List<RecommendedItem> topItems = recommender.recommend(1, 10);

float preference = recommender.estimatePreference(1, 25);
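
How candidate selection, preference estimation and rescoring fit together can be sketched in plain Java. This is an illustration of the flow, not Mahout's code: the class TopNSketch, the nested Scored and Rescorer types, and the "every unrated item is a candidate" strategy are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.function.LongToDoubleFunction;

final class TopNSketch {
    /** An item together with its (possibly rescored) estimated preference. */
    static final class Scored {
        final long itemID;
        final double value;
        Scored(long itemID, double value) {
            this.itemID = itemID;
            this.value = value;
        }
    }

    /** Hypothetical rescorer: may filter an item out or adjust its score. */
    interface Rescorer {
        boolean isFiltered(long itemID);
        double rescore(long itemID, double originalScore);
    }

    static List<Scored> recommend(Set<Long> allItems, Set<Long> alreadyRated,
                                  LongToDoubleFunction estimator,
                                  Rescorer rescorer, int n) {
        List<Scored> scored = new ArrayList<>();
        for (long itemID : allItems) {
            // Assumed candidate strategy: every item the user has not rated yet.
            if (alreadyRated.contains(itemID)) continue;
            if (rescorer != null && rescorer.isFiltered(itemID)) continue;
            double estimate = estimator.applyAsDouble(itemID);
            if (rescorer != null) estimate = rescorer.rescore(itemID, estimate);
            // Items without a usable estimate are skipped.
            if (!Double.isNaN(estimate)) scored.add(new Scored(itemID, estimate));
        }
        // Highest estimate first, then keep at most n items.
        scored.sort(Comparator.comparingDouble((Scored s) -> s.value).reversed());
        return new ArrayList<>(scored.subList(0, Math.min(n, scored.size())));
    }
}
```

Sorting all candidates keeps the sketch short; with many candidates a bounded priority queue of size n would avoid the full sort.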




Item-Based Collaborative Filtering

■ ItemBasedRecommender
    □ can also compute item similarities
    □ can provide preferences for items as justification for recommendations


■ lots of similarity measures available (Pearson correlation,
  Jaccard coefficient, ...)

■ also allows usage of precomputed item similarities stored in a
  file (via FileItemSimilarity)

ItemBasedRecommender recommender =
   new GenericItemBasedRecommender(dataModel,
   new PearsonCorrelationSimilarity(dataModel));

List<RecommendedItem> similarItems =
   recommender.mostSimilarItems(5, 10);
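
As a rough illustration of what a Pearson-based item similarity computes, the sketch below correlates two items over the users who rated both. It is a plain-Java example, not Mahout's PearsonCorrelationSimilarity; the class name and the map-based input (userID to rating) are assumptions made to keep it self-contained.

```java
import java.util.Map;

final class PearsonItemSimilaritySketch {
    /**
     * Pearson correlation of two items over the users who rated both.
     * Each map goes from userID to that user's rating of the item.
     * Returns NaN when fewer than two users overlap or a variance is zero.
     */
    static double similarity(Map<Long, Double> ratingsA, Map<Long, Double> ratingsB) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (Map.Entry<Long, Double> e : ratingsA.entrySet()) {
            Double b = ratingsB.get(e.getKey());
            if (b == null) continue; // this user did not rate both items
            double a = e.getValue();
            n++;
            sumA += a;
            sumB += b;
            sumAA += a * a;
            sumBB += b * b;
            sumAB += a * b;
        }
        if (n < 2) return Double.NaN;
        // Covariance and variances, each scaled by n (the scale cancels out).
        double cov = sumAB - sumA * sumB / n;
        double varA = sumAA - sumA * sumA / n;
        double varB = sumBB - sumB * sumB / n;
        if (varA <= 0 || varB <= 0) return Double.NaN;
        return cov / Math.sqrt(varA * varB);
    }
}
```

Only co-rated users contribute, so two items with a small overlap can get an extreme similarity from very little evidence.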



Latent factor models

■ SVDRecommender
    □ uses a decomposition of the user-item-interaction matrix to compute
      recommendations


■ uses a Factorizer to compute a Factorization from a
  DataModel, several different implementations available

    □ Simon Funk's SGD
    □ Alternating Least Squares
    □ Weighted matrix factorization for implicit feedback data

Factorizer factorizer = new ALSWRFactorizer(dataModel, numFeatures,
   lambda, numIterations);

Recommender svdRecommender =
   new SVDRecommender(dataModel, factorizer);

List<RecommendedItem> topItems = svdRecommender.recommend(1, 10);
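
To make the idea of a factorization concrete, here is a small SGD matrix factorization in the spirit of the first approach listed above. It is a sketch under stated assumptions, not one of Mahout's Factorizer implementations: users and items are dense 0-based indices, all features are updated together in each step, and the class name, learning rate and regularization parameter are chosen for illustration.

```java
import java.util.Random;

final class SgdFactorizationSketch {
    final double[][] userFactors;
    final double[][] itemFactors;

    SgdFactorizationSketch(int numUsers, int numItems, int numFeatures, long seed) {
        Random random = new Random(seed);
        userFactors = randomMatrix(numUsers, numFeatures, random);
        itemFactors = randomMatrix(numItems, numFeatures, random);
    }

    /** Small positive random start values. */
    private static double[][] randomMatrix(int rows, int cols, Random random) {
        double[][] m = new double[rows][cols];
        for (double[] row : m) {
            for (int f = 0; f < cols; f++) row[f] = 0.1 * random.nextDouble();
        }
        return m;
    }

    /** Predicted preference: dot product of the user and item feature vectors. */
    double predict(int user, int item) {
        double dot = 0;
        for (int f = 0; f < userFactors[user].length; f++) {
            dot += userFactors[user][f] * itemFactors[item][f];
        }
        return dot;
    }

    /** One SGD pass; userItem[k] = {user, item} belongs to ratings[k]. */
    void trainEpoch(int[][] userItem, double[] ratings, double learningRate, double lambda) {
        for (int k = 0; k < ratings.length; k++) {
            int u = userItem[k][0];
            int i = userItem[k][1];
            double err = ratings[k] - predict(u, i);
            for (int f = 0; f < userFactors[u].length; f++) {
                double pu = userFactors[u][f];
                double qi = itemFactors[i][f];
                // Gradient step on the squared error with L2 regularization.
                userFactors[u][f] += learningRate * (err * qi - lambda * pu);
                itemFactors[i][f] += learningRate * (err * pu - lambda * qi);
            }
        }
    }

    /** Root mean squared error on the given ratings. */
    double rmse(int[][] userItem, double[] ratings) {
        double sum = 0;
        for (int k = 0; k < ratings.length; k++) {
            double err = ratings[k] - predict(userItem[k][0], userItem[k][1]);
            sum += err * err;
        }
        return Math.sqrt(sum / ratings.length);
    }
}
```

Calling trainEpoch repeatedly and watching rmse on held-out ratings is the usual way to pick the number of iterations and the regularization strength.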



Evaluating recommenders

■ RecommenderEvaluator, RecommenderIRStatsEvaluator
    □ measure the prediction quality of a recommender using a
      random split of the dataset
    □ support for MAE, RMSE, Precision, Recall, ...
    □ need a DataModel, a RecommenderBuilder and a DataModelBuilder for the
      training data



RecommenderEvaluator maeEvaluator = new
   AverageAbsoluteDifferenceRecommenderEvaluator();

maeEvaluator.evaluate(
   new BiasedRecommenderBuilder(lambda2, lambda3, numIterations),
   new InteractionCutDataModelBuilder(maxPrefsPerUser),
   dataModel, trainingPercentage, 1 - trainingPercentage);
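
For reference, the two rating-prediction metrics named above can be written out in a few lines. This is a standalone sketch over arrays of held-out ratings and the matching predictions, not the code of Mahout's evaluators; the class name is hypothetical.

```java
final class ErrorMetricsSketch {
    /** Mean absolute error: the average of |actual - predicted|. */
    static double mae(double[] actual, double[] predicted) {
        checkInput(actual, predicted);
        double sum = 0;
        for (int k = 0; k < actual.length; k++) {
            sum += Math.abs(actual[k] - predicted[k]);
        }
        return sum / actual.length;
    }

    /** Root mean squared error: penalizes large errors more strongly than MAE. */
    static double rmse(double[] actual, double[] predicted) {
        checkInput(actual, predicted);
        double sum = 0;
        for (int k = 0; k < actual.length; k++) {
            double err = actual[k] - predicted[k];
            sum += err * err;
        }
        return Math.sqrt(sum / actual.length);
    }

    private static void checkInput(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
    }
}
```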




Starting to work on Mahout

■ Prerequisites
    □ Java 6
    □ Maven
    □ svn client


■ check out the source code from
  http://svn.apache.org/repos/asf/mahout/trunk

■ import it as a Maven project into your favorite IDE




Project: novel item similarity measure

■ in the Million Song DataSet Challenge, a novel item
  similarity measure was used in the winning solution

■ would be great to see this one also featured in Mahout

■ Task
    □ implement the novel item similarity measure as subclass of Mahout’s
      ItemSimilarity


■ Future Work
    □ this novel similarity measure is asymmetric, ensure that it is correctly
      applied in all scenarios
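
The slides do not define the measure, so the formula below is an assumption chosen only to show what asymmetry means for an item similarity and why argument order matters. It uses an asymmetric cosine over the sets of users who interacted with each item, |U(a) ∩ U(b)| / (|U(a)|^alpha * |U(b)|^(1 - alpha)); the class name and the parameter alpha are hypothetical.

```java
import java.util.Set;

final class AsymmetricCosineSketch {
    /**
     * Asymmetric cosine between items a and b, given the users of each item.
     * With alpha = 0.5 this is the ordinary cosine on binary data;
     * any other alpha makes similarity(a, b) differ from similarity(b, a)
     * whenever the two items have different numbers of users.
     */
    static double similarity(Set<Long> usersOfA, Set<Long> usersOfB, double alpha) {
        if (usersOfA.isEmpty() || usersOfB.isEmpty()) return 0.0;
        // Iterate over the smaller set to count the common users.
        Set<Long> smaller = usersOfA.size() <= usersOfB.size() ? usersOfA : usersOfB;
        Set<Long> larger = smaller == usersOfA ? usersOfB : usersOfA;
        int common = 0;
        for (Long user : smaller) {
            if (larger.contains(user)) common++;
        }
        return common
            / (Math.pow(usersOfA.size(), alpha) * Math.pow(usersOfB.size(), 1.0 - alpha));
    }
}
```

Any code path that caches or looks up similarities as unordered pairs would silently return the wrong direction for such a measure, which is the concern raised under Future Work.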




Project: temporal split evaluator

■ currently Mahout's standard RecommenderEvaluator
  randomly splits the data into training and test set

■ for datasets with timestamps it would be much more
  interesting to use this temporal information to split the data
  into training and test set

■ Task
    □ create a TemporalSplitRecommenderEvaluator similar to the existing
      AbstractDifferenceRecommenderEvaluator


■ Future Work
    □ factor out the logic for splitting datasets into training and test set
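
The core of a temporal split can be sketched independently of Mahout: sort the preferences by timestamp and train on the oldest share. The sketch below is one possible reading of the task, a single global cut in time; the class and field names are hypothetical, and a per-user cut would be an equally valid design.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TemporalSplitSketch {
    /** A preference with the time at which it was expressed. */
    static final class TimedPref {
        final long userID;
        final long itemID;
        final float value;
        final long timestamp;
        TimedPref(long userID, long itemID, float value, long timestamp) {
            this.userID = userID;
            this.itemID = itemID;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /** Result of the split: older preferences for training, newer ones for testing. */
    static final class Split {
        final List<TimedPref> training;
        final List<TimedPref> test;
        Split(List<TimedPref> training, List<TimedPref> test) {
            this.training = training;
            this.test = test;
        }
    }

    /** Puts the oldest trainingPercentage of the preferences into the training set. */
    static Split split(List<TimedPref> prefs, double trainingPercentage) {
        if (trainingPercentage < 0.0 || trainingPercentage > 1.0) {
            throw new IllegalArgumentException("trainingPercentage must be in [0, 1]");
        }
        List<TimedPref> sorted = new ArrayList<>(prefs);
        sorted.sort(Comparator.comparingLong((TimedPref p) -> p.timestamp));
        int cut = (int) Math.round(sorted.size() * trainingPercentage);
        return new Split(new ArrayList<>(sorted.subList(0, cut)),
                         new ArrayList<>(sorted.subList(cut, sorted.size())));
    }
}
```

Unlike a random split, no test preference is older than any training preference, so the evaluation never predicts the past from the future.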




Project: baseline method for rating prediction

■ port MyMediaLite’s UserItemBaseline to Mahout
  (preliminary port already available)

■ user-item-baseline estimation is a simple approach that
  estimates the global tendency of a user or an item to
  deviate from the average rating
  (described in Y. Koren: Factor in the Neighbors: Scalable
  and Accurate Collaborative Filtering, TKDD 2009)

■ Task
    □ polish the code
    □ make it work with Mahout’s DataModel


■ Future Work
    □ create an ItemBasedRecommender that makes use of the estimated
      biases
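
A simple way to estimate such biases is the decoupled, regularized averaging described in the cited Koren paper: first item biases from the deviations from the global mean, then user biases from what remains. The sketch below follows that one-pass scheme; it is neither MyMediaLite's UserItemBaseline nor the preliminary Mahout port, and the class name and array-based input are assumptions. lambda2 and lambda3 are the regularization constants for items and users.

```java
import java.util.HashMap;
import java.util.Map;

final class BaselineSketch {
    final double mu; // global average rating
    final Map<Long, Double> itemBias = new HashMap<>();
    final Map<Long, Double> userBias = new HashMap<>();

    /** users[k], items[k] and ratings[k] together form one (user, item, value) triple. */
    BaselineSketch(long[] users, long[] items, double[] ratings,
                   double lambda2, double lambda3) {
        double sum = 0;
        for (double r : ratings) sum += r;
        mu = sum / ratings.length;

        // Item bias: regularized average deviation from the global mean.
        Map<Long, Double> sums = new HashMap<>();
        Map<Long, Integer> counts = new HashMap<>();
        for (int k = 0; k < ratings.length; k++) {
            sums.merge(items[k], ratings[k] - mu, Double::sum);
            counts.merge(items[k], 1, Integer::sum);
        }
        for (Map.Entry<Long, Double> e : sums.entrySet()) {
            itemBias.put(e.getKey(), e.getValue() / (lambda2 + counts.get(e.getKey())));
        }

        // User bias: regularized average of what mean and item bias leave unexplained.
        sums.clear();
        counts.clear();
        for (int k = 0; k < ratings.length; k++) {
            sums.merge(users[k], ratings[k] - mu - itemBias.get(items[k]), Double::sum);
            counts.merge(users[k], 1, Integer::sum);
        }
        for (Map.Entry<Long, Double> e : sums.entrySet()) {
            userBias.put(e.getKey(), e.getValue() / (lambda3 + counts.get(e.getKey())));
        }
    }

    /** Baseline estimate; unknown users or items contribute a bias of zero. */
    double estimate(long user, long item) {
        return mu + userBias.getOrDefault(user, 0.0) + itemBias.getOrDefault(item, 0.0);
    }
}
```

Larger lambda values shrink the biases of users and items with few ratings toward zero; an alternating variant would repeat both passes for several iterations.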



Thank you.




                 Questions?




Sebastian Schelter
Database Systems and Information Management Group (DIMA)
Technische Universität Berlin

 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 

Introduction to Collaborative Filtering with Apache Mahout

  • 1. An Introduction to Collaborative Filtering with Apache Mahout Sebastian Schelter Recommender Systems Challenge at ACM RecSys 2012 Database Systems and Information Management Group (DIMA) Technische Universität Berlin 13.09.2012 http://www.dima.tu-berlin.de/ DIMA – TU Berlin 1
  • 2. Overview
      ■ Apache Mahout: Apache-licensed library with the goal of providing highly scalable data mining and machine learning
      ■ its collaborative filtering module is based on the Taste framework of Sean Owen
      ■ mostly aimed at production scenarios, with a focus on
          □ processing efficiency
          □ integration with different datastores, web applications, Amazon EC2
          □ scalability: allows computation of recommendations, item similarities and matrix decompositions via MapReduce on Apache Hadoop
      ■ not used that much in recommender challenges
          □ not enough different algorithms implemented?
          □ not enough tooling for evaluation?
      → it's open source, so it's up to you to change that!
  • 3. Preference & DataModel
      ■ Preference encapsulates a user-item interaction as a (user, item, value) triple
          □ only numeric userIDs and itemIDs are allowed, for memory efficiency
          □ PreferenceArray encapsulates a set of preferences
      ■ DataModel encapsulates a dataset
          □ lots of convenient accessor methods like getNumUsers(), getPreferencesForItem(itemID), ...
          □ allows adding temporal information to preferences
          □ lots of options for storing the data (in-memory, file, database, key-value store)
          □ drawback: for a lot of use cases, all the data has to fit into memory to allow efficient recommendation
      DataModel dataModel = new FileDataModel(new File("movielens.csv"));
      PreferenceArray prefsOfUser1 = dataModel.getPreferencesFromUser(1);
  • 4. Recommender
      ■ Recommender is the basic interface for all of Mahout's recommenders
          □ recommend n items for a particular user
          □ estimate the preference of a user towards an item
      ■ a CandidateItemsStrategy fetches all items that might be recommended for a particular user
      ■ a Rescorer allows postprocessing recommendations
      List<RecommendedItem> topItems = recommender.recommend(1, 10);
      float preference = recommender.estimatePreference(1, 25);
  • 5. Item-Based Collaborative Filtering
      ■ ItemBasedRecommender
          □ can also compute item similarities
          □ can provide preferences for items as justification for recommendations
      ■ lots of similarity measures available (Pearson correlation, Jaccard coefficient, ...)
      ■ also allows usage of precomputed item similarities stored in a file (via FileItemSimilarity)
      ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel, new PearsonCorrelationSimilarity(dataModel));
      List<RecommendedItem> similarItems = recommender.mostSimilarItems(5, 10);
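To make the Pearson correlation mentioned on this slide concrete, here is a minimal, self-contained sketch of how such an item similarity could be computed over the users who rated both items. It is not Mahout's PearsonCorrelationSimilarity; the class name ItemPearsonSketch and the map-based input are assumptions made for illustration only.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of Mahout.
final class ItemPearsonSketch {
    private ItemPearsonSketch() {}

    /**
     * Pearson correlation of two items over the users who rated both.
     * Each map goes from userID to that user's rating of the item.
     * Returns NaN when fewer than two users overlap or one variance is zero.
     */
    static double similarity(Map<Long, Double> ratingsOfA, Map<Long, Double> ratingsOfB) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (Map.Entry<Long, Double> entry : ratingsOfA.entrySet()) {
            Double b = ratingsOfB.get(entry.getKey());
            if (b == null) {
                continue; // this user did not rate item B
            }
            double a = entry.getValue();
            n++;
            sumA += a;
            sumB += b;
            sumAA += a * a;
            sumBB += b * b;
            sumAB += a * b;
        }
        if (n < 2) {
            return Double.NaN;
        }
        // Centered sums; the common factor n cancels in the final ratio.
        double covariance = sumAB - sumA * sumB / n;
        double varianceA = sumAA - sumA * sumA / n;
        double varianceB = sumBB - sumB * sumB / n;
        if (varianceA <= 0 || varianceB <= 0) {
            return Double.NaN;
        }
        return covariance / Math.sqrt(varianceA * varianceB);
    }
}
```

Restricting the computation to co-rating users means that items with very few common raters can get extreme similarity values, which is one reason to compare several of the available similarity measures.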
  • 6. Latent factor models
      ■ SVDRecommender
          □ uses a decomposition of the user-item interaction matrix to compute recommendations
      ■ uses a Factorizer to compute a Factorization from a DataModel, several different implementations available
          □ Simon Funk's SGD
          □ Alternating Least Squares
          □ Weighted matrix factorization for implicit feedback data
      Factorizer factorizer = new ALSWRFactorizer(dataModel, numFeatures, lambda, numIterations);
      Recommender svdRecommender = new SVDRecommender(dataModel, factorizer);
      List<RecommendedItem> topItems = svdRecommender.recommend(1, 10);
  • 7. Evaluating recommenders
      ■ RecommenderEvaluator, RecommenderIRStatsEvaluator
          □ allow measuring the prediction quality of a recommender by using a random split of the dataset
          □ support for MAE, RMSE, Precision, Recall, ...
          □ need a DataModel, a RecommenderBuilder, a DataModelBuilder for the training data
      RecommenderEvaluator maeEvaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
      maeEvaluator.evaluate(
          new BiasedRecommenderBuilder(lambda2, lambda3, numIterations),
          new InteractionCutDataModelBuilder(maxPrefsPerUser),
          dataModel, trainingPercentage, 1 - trainingPercentage);
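As a reminder of what the MAE and RMSE figures reported by such an evaluator mean, the following self-contained sketch computes both from arrays of predicted and held-out ratings. The class ErrorMetricsSketch is a hypothetical helper for illustration and is not how Mahout's evaluators are implemented.

```java
// Hypothetical helper for illustration; not part of Mahout.
final class ErrorMetricsSketch {
    private ErrorMetricsSketch() {}

    /** Mean absolute error: average of |predicted - actual|. */
    static double mae(double[] predicted, double[] actual) {
        checkInput(predicted, actual);
        double sum = 0;
        for (int i = 0; i < predicted.length; i++) {
            sum += Math.abs(predicted[i] - actual[i]);
        }
        return sum / predicted.length;
    }

    /** Root mean squared error: square root of the average squared difference. */
    static double rmse(double[] predicted, double[] actual) {
        checkInput(predicted, actual);
        double sum = 0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - actual[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum / predicted.length);
    }

    private static void checkInput(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
    }
}
```

RMSE penalizes large individual errors more strongly than MAE, so the two metrics can rank recommenders differently on the same test split.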
  • 9. Starting to work on Mahout
      ■ Prerequisites
          □ Java 6
          □ Maven
          □ svn client
      ■ check out the source code from http://svn.apache.org/repos/asf/mahout/trunk
      ■ import it as a Maven project into your favorite IDE
  • 10. Project: novel item similarity measure
      ■ in the Million Song Dataset Challenge, a novel item similarity measure was used in the winning solution
      ■ would be great to see this one also featured in Mahout
      ■ Task
          □ implement the novel item similarity measure as an implementation of Mahout's ItemSimilarity interface
      ■ Future Work
          □ this novel similarity measure is asymmetric, so ensure that it is correctly applied in all scenarios
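The slide does not name the measure. Assuming it refers to the asymmetric cosine similarity described by the challenge winner, sim(i, j) = |U(i) ∩ U(j)| / (|U(i)|^α · |U(j)|^(1−α)), where U(i) is the set of users who interacted with item i, a minimal sketch could look as follows. The class name and the set-based input are hypothetical, and the code is not wired into Mahout's ItemSimilarity.

```java
import java.util.Set;

// Hypothetical helper for illustration; assumes the asymmetric cosine similarity.
final class AsymmetricCosineSketch {
    private AsymmetricCosineSketch() {}

    /**
     * Asymmetric cosine similarity from item i to item j on binary feedback.
     * alpha = 0.5 gives the ordinary cosine; other values make
     * similarity(i, j) differ from similarity(j, i).
     */
    static double similarity(Set<Long> usersOfI, Set<Long> usersOfJ, double alpha) {
        if (usersOfI.isEmpty() || usersOfJ.isEmpty()) {
            return 0.0;
        }
        // Iterate over the smaller set to count common users.
        Set<Long> smaller = usersOfI.size() <= usersOfJ.size() ? usersOfI : usersOfJ;
        Set<Long> larger = smaller == usersOfI ? usersOfJ : usersOfI;
        int overlap = 0;
        for (Long user : smaller) {
            if (larger.contains(user)) {
                overlap++;
            }
        }
        double norm = Math.pow(usersOfI.size(), alpha) * Math.pow(usersOfJ.size(), 1.0 - alpha);
        return overlap / norm;
    }
}
```

Because the result depends on argument order whenever alpha differs from 0.5, any code that caches or looks up similarities by an unordered item pair would silently return the wrong value, which is the concern raised under Future Work.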
  • 11. Project: temporal split evaluator
      ■ currently Mahout's standard RecommenderEvaluator randomly splits the data into training and test set
      ■ for datasets with timestamps it would be much more interesting to use this temporal information to split the data into training and test set
      ■ Task
          □ create a TemporalSplitRecommenderEvaluator similar to the existing AbstractDifferenceRecommenderEvaluator
      ■ Future Work
          □ factor out the logic for splitting datasets into training and test set
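The core of a temporal split can be sketched independently of Mahout: sort the interactions by timestamp and cut at a given fraction, so that the test set only contains interactions that happened after everything in the training set. The names TemporalSplitSketch, Rating and Split below are hypothetical; an actual TemporalSplitRecommenderEvaluator would have to work on Mahout's DataModel instead.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; not part of Mahout.
final class TemporalSplitSketch {
    private TemporalSplitSketch() {}

    /** One timestamped user-item interaction. */
    static final class Rating {
        final long userID;
        final long itemID;
        final float value;
        final long timestamp;

        Rating(long userID, long itemID, float value, long timestamp) {
            this.userID = userID;
            this.itemID = itemID;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /** Result of a split: older interactions for training, newer ones for testing. */
    static final class Split {
        final List<Rating> training;
        final List<Rating> test;

        Split(List<Rating> training, List<Rating> test) {
            this.training = training;
            this.test = test;
        }
    }

    /** Puts the oldest trainingPercentage of all interactions into the training set. */
    static Split split(List<Rating> ratings, double trainingPercentage) {
        if (trainingPercentage < 0.0 || trainingPercentage > 1.0) {
            throw new IllegalArgumentException("trainingPercentage must be in [0, 1]");
        }
        List<Rating> sorted = new ArrayList<>(ratings);
        sorted.sort(Comparator.comparingLong((Rating r) -> r.timestamp));
        int cut = (int) Math.round(sorted.size() * trainingPercentage);
        return new Split(
            new ArrayList<>(sorted.subList(0, cut)),
            new ArrayList<>(sorted.subList(cut, sorted.size())));
    }
}
```

This sketch uses one global cutoff for all users. Splitting each user's history separately is an equally plausible design and would guarantee that every user with enough interactions appears in both sets.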
  • 12. Project: baseline method for rating prediction
      ■ port MyMediaLite's UserItemBaseline to Mahout (preliminary port already available)
      ■ user-item baseline estimation is a simple approach that estimates the global tendency of a user or an item to deviate from the average rating (described in Y. Koren: Factor in the Neighbors: Scalable and Accurate Collaborative Filtering, TKDD 2009)
      ■ Task
          □ polish the code
          □ make it work with Mahout's DataModel
      ■ Future Work
          □ create an ItemBasedRecommender that makes use of the estimated biases
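The baseline estimate from the cited paper predicts a rating as μ + b_u + b_i, where μ is the global average, b_i = Σ (r_ui − μ) / (λ2 + |R(i)|) over the ratings of item i, and b_u = Σ (r_ui − μ − b_i) / (λ3 + |R(u)|) over the ratings of user u. The sketch below implements this single-pass variant in plain Java. It is neither MyMediaLite's nor Mahout's code, and the names UserItemBaselineSketch and Rating are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not MyMediaLite's or Mahout's implementation.
final class UserItemBaselineSketch {

    /** One explicit rating. */
    static final class Rating {
        final long userID;
        final long itemID;
        final double value;

        Rating(long userID, long itemID, double value) {
            this.userID = userID;
            this.itemID = itemID;
            this.value = value;
        }
    }

    private final double globalMean;
    private final Map<Long, Double> itemBias = new HashMap<>();
    private final Map<Long, Double> userBias = new HashMap<>();

    /** lambda2 regularizes the item biases, lambda3 the user biases. */
    UserItemBaselineSketch(List<Rating> ratings, double lambda2, double lambda3) {
        double sum = 0;
        for (Rating r : ratings) {
            sum += r.value;
        }
        globalMean = ratings.isEmpty() ? 0.0 : sum / ratings.size();

        // Item biases: regularized average deviation from the global mean.
        Map<Long, Double> itemSums = new HashMap<>();
        Map<Long, Integer> itemCounts = new HashMap<>();
        for (Rating r : ratings) {
            itemSums.merge(r.itemID, r.value - globalMean, Double::sum);
            itemCounts.merge(r.itemID, 1, Integer::sum);
        }
        for (Map.Entry<Long, Double> e : itemSums.entrySet()) {
            itemBias.put(e.getKey(), e.getValue() / (lambda2 + itemCounts.get(e.getKey())));
        }

        // User biases: regularized average of what the item biases leave unexplained.
        Map<Long, Double> userSums = new HashMap<>();
        Map<Long, Integer> userCounts = new HashMap<>();
        for (Rating r : ratings) {
            userSums.merge(r.userID, r.value - globalMean - itemBias.get(r.itemID), Double::sum);
            userCounts.merge(r.userID, 1, Integer::sum);
        }
        for (Map.Entry<Long, Double> e : userSums.entrySet()) {
            userBias.put(e.getKey(), e.getValue() / (lambda3 + userCounts.get(e.getKey())));
        }
    }

    /** Baseline estimate; unknown users or items contribute a bias of zero. */
    double estimate(long userID, long itemID) {
        return globalMean
            + userBias.getOrDefault(userID, 0.0)
            + itemBias.getOrDefault(itemID, 0.0);
    }
}
```

Larger values of lambda2 and lambda3 shrink the biases of rarely rated items and rarely rating users towards zero, so that their estimates fall back to the global mean.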
  • 13. Thank you. Questions? Sebastian Schelter, Database Systems and Information Management Group (DIMA), Technische Universität Berlin