
Mahout Tutorial and Hands-on (version 2015)

A recent tutorial and hands-on about Mahout.
Examples are based on version 0.9 of the Library.

Published in: Technology

  1. 1. Apache Mahout – Tutorial (2015) Cataldo Musto, Ph.D. Course on Intelligent Information Access and Natural Language Processing, Università degli Studi di Bari, Department of Computer Science, academic year 2014/2015, 07/01/2015 1
  2. 2. Outline • What is Mahout ? – Overview • How to use Mahout ? – Hands-on session 2
  3. 3. Part 1 What is Mahout? 3
  4. 4. What is (a) Mahout? An elephant driver 4
  7. 7. What is (a) Mahout? • Mahout is a Java library – Implementing Machine Learning techniques • Clustering • Classification • Recommendation • Frequent ItemSet (removed) 7
  8. 8. What can we do? • Currently Mahout supports mainly three use cases: – Recommendation - takes users' behavior and tries to find items users might like. – Clustering - takes e.g. text documents and groups them into sets of topically related documents. – Classification - learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category. 8
  9. 9. Why Mahout? • Mahout is not the only ML framework – Weka (http://www.cs.waikato.ac.nz/ml/weka/) – R (http://www.r-project.org/) • Why do we prefer Mahout? 9
  12. 12. Why Mahout? • Why do we prefer Mahout? – Apache License – Good Community – Good Documentation – Scalable • Based on Hadoop (not mandatory!) 12
  13. 13. Why do we need a scalable framework? Big Data! 13
  14. 14. When do we need a scalable framework? (e.g. Recommendation Task) Over 100M user-preference connections http://mahout.apache.org/users/recommender/recommender-first-timer-faq.html 14
  15. 15. Use Cases 15
  16. 16. Use Cases Recommendation Engine on Foursquare 16
  17. 17. Use Cases User Interest Modeling on Twitter 17
  18. 18. Use Cases Pattern Mining on Yahoo! (as anti-spam) 18
  19. 19. Algorithms • Recommendation – User-based Collaborative Filtering – Item-based Collaborative Filtering – Matrix Factorization-based CF • Several factorization techniques 19
  20. 20. Algorithms • Clustering – K-Means – Fuzzy K-Means – Streaming K-Means – etc. • Topic Modeling – LDA (Latent Dirichlet Allocation) 20
  21. 21. Algorithms • Classification – Logistic Regression – Bayes (only on Hadoop, from release 0.9) – Random Forests (only on Hadoop, from release 0.9) – Hidden Markov Models – Perceptrons 21
  22. 22. Algorithms • Other – Text processing • Creation of sparse vectors from text – Dimensionality Reduction techniques • Principal Component Analysis (PCA) • Singular Value Decomposition (SVD) – (and much more.) 22
  23. 23. Mahout in the Apache Software Foundation 23
  24. 24. Mahout in the Apache Software Foundation Original Mahout Project 24
  25. 25. Mahout in the Apache Software Foundation Taste: collaborative filtering framework 25
  26. 26. Mahout in the Apache Software Foundation Lucene: information retrieval software library 26
  27. 27. Mahout in the Apache Software Foundation Hadoop: framework for distributed storage and programming based on MapReduce 27
  28. 28. Next Releases: Watch out! Hadoop is going to be replaced by Apache Spark (http://spark.apache.org) 28
  29. 29. General Architecture Three-tier architecture (Application, Algorithms and Shared Libraries) 29
  30. 30. General Architecture Business Logic 30
  31. 31. General Architecture Data Storage and Shared Libraries 31
  32. 32. General Architecture External Applications invoking Mahout APIs 32
  33. 33. In this tutorial we will focus on Recommendation 33
  34. 34. Recommendation • Mahout implements a Collaborative Filtering framework – Uses historical data (ratings, clicks, and purchases) to provide recommendations • User-based: recommend items by finding similar users. This is often harder to scale because of the dynamic nature of users; • Item-based: calculate similarity between items and make recommendations. Items usually don't change much, so this can often be computed offline; – Popularized by Amazon and others • Matrix factorization-based: split the original user-item matrix into smaller matrices in order to analyze item rating patterns and learn latent factors explaining users' behavior and item characteristics. – Popularized by the Netflix Prize 34
  35. 35. Recommendation - Architecture 35
  36. 36. Recommendation Workflow Basic idea: a Java/J2EE application invokes a Mahout Recommender whose DataModel is based on a set of user preferences built from a physical data store 36
  40. 40. Recommendation Workflow: Physical Storage (database, files, etc.) → Data Model → Recommender → External Application 40
  41. 41. Recommendation in Mahout • Input: raw data (user preferences) • Output: preference estimates • Step 1 – Mapping raw data into a Mahout-compliant DataModel • Step 2 – Tuning recommender components • Similarity measure, neighborhood, etc. • Step 3 – Computing rating estimates • Step 4 – Evaluating recommendations 41
  42. 42. Recommendation Components • Mahout key abstractions are implemented through Java interfaces: – DataModel interface • Methods for mapping raw data to a Mahout-compliant form – UserSimilarity interface • Methods to calculate the degree of correlation between two users – ItemSimilarity interface • Methods to calculate the degree of correlation between two items – UserNeighborhood interface • Methods to define the concept of 'neighborhood' – Recommender interface • Methods to implement the recommendation step itself 42
  43. 43. Recommendation Components • Mahout key abstractions are implemented through Java interfaces: – example: DataModel interface • Each way of mapping raw data to a Mahout-compliant form is an implementation of the generic interface • e.g. MySQLJDBCDataModel feeds a DataModel from a MySQL database • (and so on) 43
  44. 44. Components: DataModel • A DataModel is the interface used to access information about user preferences. • Which sources can it draw from? – Database • MySQLJDBCDataModel, PostgreSQLJDBCDataModel • NoSQL databases supported: MongoDBDataModel, CassandraDataModel – External files • FileDataModel – Generic (preferences fed directly through Java code) • GenericDataModel (They are all implementations of the DataModel interface) 44
  45. 45. Components: DataModel • GenericDataModel – Fed through Java calls • FileDataModel – CSV (Comma Separated Values) • JDBCDataModel – JDBC driver – Standard database structure 45
  46. 46. FileDataModel – CSV input 46
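The CSV input used in the exercises later in this deck holds one preference per line, in the form `userID,itemID,value`. As a rough illustration of what reading such a file involves, here is a plain-Java sketch. It is not Mahout's parser: the class name `CsvPreferencesSketch` and the sample rows are made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java sketch (not Mahout code): parses "userID,itemID,value" lines. */
final class CsvPreferencesSketch {

    /** Returns userID -> (itemID -> value), keeping the input order. */
    static Map<Long, Map<Long, Float>> parse(String csv) {
        Map<Long, Map<Long, Float>> prefs = new LinkedHashMap<>();
        for (String line : csv.split("\n")) {
            line = line.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.split(",");
            long userId = Long.parseLong(fields[0].trim());
            long itemId = Long.parseLong(fields[1].trim());
            float value = Float.parseFloat(fields[2].trim());
            prefs.computeIfAbsent(userId, k -> new LinkedHashMap<>()).put(itemId, value);
        }
        return prefs;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, invented for this illustration.
        String csv = "1,101,5.0\n1,102,3.0\n2,101,2.0\n";
        System.out.println(parse(csv));
    }
}
```

In Mahout itself this step is a single call, `new FileDataModel(new File("intro.csv"))`, as the exercises show.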
  47. 47. Components: DataModel • Regardless of the source, they all share a common representation. • Basic object: Preference – A Preference is a triple (user, item, score) – Stored in a PreferenceArray 47
  48. 48. Components: DataModel • Basic object: Preference – A Preference is a triple (user, item, score) – Stored in a PreferenceArray • Two implementations – GenericUserPreferenceArray • Stores the numerical preference values as well. – BooleanUserPreferenceArray • Omits the numerical preference values. 48
  49. 49. Components: UserSimilarity • UserSimilarity defines a notion of similarity between two users. – (respectively) ItemSimilarity defines a notion of similarity between two items. • Which definitions of similarity are available? – Pearson Correlation – Spearman Correlation – Euclidean Distance – Tanimoto Coefficient – LogLikelihood Similarity – Already implemented! 49
  50. 50. Example: TanimotoDistance 50
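The slide image is not part of this transcript, so here is a minimal plain-Java sketch of the Tanimoto coefficient in its textbook form: the number of items two users have in common divided by the number of items either of them has. It ignores rating values. The class name and the item IDs are invented for the illustration, and this is not Mahout's implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Plain-Java sketch (not Mahout code): Tanimoto coefficient of two item sets. */
final class TanimotoSketch {

    static double tanimoto(Set<Long> a, Set<Long> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        Set<Long> common = new HashSet<>(a);
        common.retainAll(b); // intersection
        int union = a.size() + b.size() - common.size();
        return (double) common.size() / union;
    }

    public static void main(String[] args) {
        // Hypothetical item IDs preferred by two users.
        Set<Long> user1 = new HashSet<>(Arrays.asList(101L, 102L, 103L));
        Set<Long> user2 = new HashSet<>(Arrays.asList(102L, 103L, 104L));
        System.out.println(tanimoto(user1, user2)); // 2 shared items out of 4 -> 0.5
    }
}
```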
  51. 51. Example: CosineSimilarity 51
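As with the previous slide, the figure is missing from the transcript. The sketch below shows the textbook cosine similarity between two rating vectors of equal length (the dot product divided by the product of the vector lengths). The class name and the ratings are assumptions made for the illustration; this is not Mahout's code.

```java
/** Plain-Java sketch (not Mahout code): cosine similarity of two rating vectors. */
final class CosineSketch {

    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // convention chosen for this sketch: no ratings, no similarity
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        // Hypothetical ratings of two users on the same three items.
        System.out.println(cosine(new double[] {5, 3, 4}, new double[] {4, 2, 5}));
    }
}
```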
  52. 52. Different Similarity definitions influence neighborhood formation 52
  53. 53. Pearson’s vs. Euclidean distance 53
  54. 54. Pearson’s vs. Euclidean distance 54
  55. 55. Pearson’s vs. Euclidean distance 55
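To make the contrast concrete, here is a small plain-Java sketch with made-up ratings. It assumes the textbook Pearson correlation and a similarity derived from the Euclidean distance as 1 / (1 + d); Mahout's own classes may normalize differently, so treat the numbers as illustrative only.

```java
/** Plain-Java sketch (not Mahout code): Pearson correlation vs. a distance-based similarity. */
final class PearsonVsEuclideanSketch {

    /** Textbook Pearson correlation over items rated by both users. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
            syy += (y[i] - meanY) * (y[i] - meanY);
        }
        return sxy / Math.sqrt(sxx * syy); // NaN if one user gave the same rating to every item
    }

    /** Assumed mapping from distance to similarity: 1 / (1 + d). */
    static double euclideanSimilarity(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sum += d * d;
        }
        return 1.0 / (1.0 + Math.sqrt(sum));
    }

    public static void main(String[] args) {
        // Hypothetical users: same taste profile, but the second one rates 2 points higher.
        double[] u1 = {1, 2, 3};
        double[] u2 = {3, 4, 5};
        System.out.println("Pearson:   " + pearson(u1, u2));
        System.out.println("Euclidean: " + euclideanSimilarity(u1, u2));
    }
}
```

With these ratings Pearson sees a perfect correlation, because it only looks at how ratings move around each user's mean, while the distance-based measure penalizes the constant offset. This is one reason different similarity definitions can produce different neighborhoods.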
  56. 56. Components: UserNeighborhood • Which definitions of neighborhood are available? – Nearest N users • The first N users with the highest similarity are labeled as 'neighbors' – Thresholds • Users whose similarity is above a threshold are labeled as 'neighbors' – Already implemented! 56
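The two neighborhood definitions can be sketched in plain Java as follows. The input is a hypothetical map from candidate user ID to similarity with the target user. The class and method names are invented for this example, and treating a similarity equal to the threshold as "in" is a choice made for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Plain-Java sketch (not Mahout code): the two neighborhood definitions. */
final class NeighborhoodSketch {

    /** The n users with the highest similarity, most similar first. */
    static List<Long> nearestN(Map<Long, Double> similarities, int n) {
        return similarities.entrySet().stream()
                .sorted(Map.Entry.<Long, Double>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    /** All users whose similarity reaches the threshold, sorted by user ID. */
    static List<Long> aboveThreshold(Map<Long, Double> similarities, double threshold) {
        return similarities.entrySet().stream()
                .filter(e -> e.getValue() >= threshold)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical similarities between user 1 and four other users.
        Map<Long, Double> sims = new LinkedHashMap<>();
        sims.put(2L, 0.9);
        sims.put(3L, 0.1);
        sims.put(4L, 0.7);
        sims.put(5L, -0.5);
        System.out.println(nearestN(sims, 2));         // [2, 4]
        System.out.println(aboveThreshold(sims, 0.0)); // [2, 3, 4]
    }
}
```

A fixed N always yields the same number of neighbors, even if some of them are barely similar; a threshold guarantees a minimum similarity but may return very few users, or none.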
  57. 57. Components: Recommender • Given a DataModel, a definition of similarity between users (items) and a definition of neighborhood, a recommender produces as output an estimation of relevance for each unseen item • Which recommendation algorithms are implemented? – User-based CF – Item-based CF – SVD-based CF (and much more…) 57
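For user-based CF, a common way to turn a neighborhood into an estimate is a similarity-weighted average of the neighbors' ratings for the unseen item. The sketch below assumes that textbook formula; it is not a copy of Mahout's recommender, and the class name and the numbers are made up.

```java
/** Plain-Java sketch (not Mahout code): similarity-weighted rating estimate. */
final class UserBasedPredictionSketch {

    /**
     * similarities[k] and ratings[k] belong to the k-th neighbor.
     * A rating of NaN means that the neighbor has not rated the item.
     */
    static double predict(double[] similarities, double[] ratings) {
        double weighted = 0.0;
        double totalWeight = 0.0;
        for (int k = 0; k < similarities.length; k++) {
            if (Double.isNaN(ratings[k])) {
                continue; // this neighbor cannot contribute
            }
            weighted += similarities[k] * ratings[k];
            totalWeight += Math.abs(similarities[k]);
        }
        return totalWeight == 0.0 ? Double.NaN : weighted / totalWeight;
    }

    public static void main(String[] args) {
        // Hypothetical neighborhood: two neighbors rated the item, one did not.
        double[] sims = {0.9, 0.5, 0.8};
        double[] ratings = {4.0, 2.0, Double.NaN};
        System.out.println(predict(sims, ratings)); // (0.9*4 + 0.5*2) / (0.9 + 0.5)
    }
}
```

The estimate is pulled toward the rating of the most similar neighbor, which is why the choice of similarity measure and neighborhood matters for the final recommendations.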
  58. 58. Recap • Many implementations of a CF-based recommender! – Different recommendation algorithms – Different neighborhood definitions – Different similarity definitions • Evaluation of the different implementations is actually very time-consuming – A strength of Mahout is that it saves time when evaluating different combinations of parameters! – Standard interface for the evaluation of a Recommender System 58
  59. 59. Evaluation • Mahout provides classes for the evaluation of a recommender system – Prediction-based measures • MAE (Mean Absolute Error) • RMSE (Root Mean Square Error) – IR-based measures • Precision, Recall, F1-measure, F1@n • NDCG (ranking measure) 59
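The two prediction-based measures are easy to state in code. The following plain-Java sketch uses the standard definitions; the class name and the sample ratings are invented, and Mahout's evaluators compute these values for you on a held-out test set.

```java
/** Plain-Java sketch (not Mahout code): the two prediction-based error measures. */
final class ErrorMetricsSketch {

    /** Mean of the absolute differences between actual and predicted ratings. */
    static double mae(double[] actual, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - predicted[i]);
        }
        return sum / actual.length;
    }

    /** Square root of the mean squared difference; penalizes large errors more. */
    static double rmse(double[] actual, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double d = actual[i] - predicted[i];
            sum += d * d;
        }
        return Math.sqrt(sum / actual.length);
    }

    public static void main(String[] args) {
        // Hypothetical held-out ratings and the recommender's estimates for them.
        double[] actual = {4.0, 3.0, 5.0};
        double[] predicted = {3.5, 3.0, 4.0};
        System.out.println("MAE:  " + mae(actual, predicted)); // (0.5 + 0 + 1) / 3 = 0.5
        System.out.println("RMSE: " + rmse(actual, predicted));
    }
}
```

For both measures, lower is better; a value of 0 means every held-out rating was predicted exactly.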
  60. 60. Evaluation • Prediction-based Measures – Class: AverageAbsoluteDifferenceRecommenderEvaluator – Method: evaluate() – Parameters: • Recommender implementation • DataModel implementation • Training set size (e.g. 70%) • % of the data to use in the evaluation (smaller % for fast prototyping) 60
  61. 61. Evaluation • IR-based Measures – Class: GenericRecommenderIRStatsEvaluator – Method: evaluate() – Parameters: • Recommender implementation • DataModel implementation • Relevance Threshold (mean+standard deviation) • % of the data to use in the evaluation (smaller % for fast prototyping) 61
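The IR-based measures and the relevance threshold can be sketched in plain Java as below. The sketch assumes the standard definitions of precision, recall and F1, and a threshold equal to the mean plus the population standard deviation of a user's ratings; whether Mahout uses the population or the sample deviation is not stated in the slides, so that detail is an assumption. All names and numbers are invented.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java sketch (not Mahout code): IR-style measures for a top-n list. */
final class IrMetricsSketch {

    /** Assumed relevance threshold: mean + population standard deviation. */
    static double relevanceThreshold(double[] ratings) {
        double mean = Arrays.stream(ratings).average().orElse(0.0);
        double variance = Arrays.stream(ratings).map(r -> (r - mean) * (r - mean)).average().orElse(0.0);
        return mean + Math.sqrt(variance);
    }

    private static long hits(List<Long> recommended, Set<Long> relevant) {
        return recommended.stream().filter(relevant::contains).count();
    }

    /** Fraction of the recommended items that are relevant. */
    static double precision(List<Long> recommended, Set<Long> relevant) {
        return recommended.isEmpty() ? 0.0 : (double) hits(recommended, relevant) / recommended.size();
    }

    /** Fraction of the relevant items that were recommended. */
    static double recall(List<Long> recommended, Set<Long> relevant) {
        return relevant.isEmpty() ? 0.0 : (double) hits(recommended, relevant) / relevant.size();
    }

    /** Harmonic mean of precision and recall. */
    static double f1(double precision, double recall) {
        return precision + recall == 0.0 ? 0.0 : 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Hypothetical top-4 list and set of relevant items for one user.
        List<Long> recommended = Arrays.asList(101L, 102L, 103L, 104L);
        Set<Long> relevant = new HashSet<>(Arrays.asList(102L, 104L, 105L));
        double p = precision(recommended, relevant); // 2 hits out of 4
        double r = recall(recommended, relevant);    // 2 hits out of 3 relevant items
        System.out.println(p + " " + r + " " + f1(p, r));
    }
}
```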
  62. 62. Part 2 How to use Mahout? Hands-on 62
  63. 63. Download Mahout • Download – The latest Mahout release is 0.9 – Available at: http://archive.apache.org/dist/mahout/0.9/mahout-distribution-0.9.zip – Extract all the libraries and include them in a new NetBeans (Eclipse) project • Requirement: Java 1.6.x or greater. • Hadoop is not mandatory! 63
  64. 64. Important JavaDoc https://builds.apache.org/job/Mahout-Quality/javadoc/ 64
  65. 65. Exercise 1 • Create a Preference object • Set preferences through some simple Java calls • Print some statistics about preferences – How many preferences? On which items? – Whether a user has expressed a preference on a certain item. – Which one is the item with the highest score? 65
  66. 66. Hints • Hints about objects to be used: – Preference • A single (user, item, score) triple, read with getUserID(), getItemID(), getValue(); – GenericUserPreferenceArray • Dimension = number of preferences to be defined; • Methods: setUserID(), setItemID(), setValue(), getIDs(), sortByValueReversed(), hasPrefWithItemID(id); 66
  67. 67. Exercise 1: preferences

          import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
          import org.apache.mahout.cf.taste.model.Preference;
          import org.apache.mahout.cf.taste.model.PreferenceArray;

          class CreatePreferenceArray {

            private CreatePreferenceArray() {
            }

            public static void main(String[] args) {
              // One array holds the preferences of a single user (here: 2 preferences)
              PreferenceArray user1Prefs = new GenericUserPreferenceArray(2);
              user1Prefs.setUserID(0, 1L);
              // Score 2 for item 101
              user1Prefs.setItemID(0, 101L);
              user1Prefs.setValue(0, 2.0f);
              // Score 3 for item 102
              user1Prefs.setItemID(1, 102L);
              user1Prefs.setValue(1, 3.0f);
              // Read the second preference back as a Preference object
              Preference pref = user1Prefs.get(1);
              System.out.println(pref);
            }
          }
  69. 69. Exercise 2 • Create a DataModel • Feed the DataModel through some simple Java calls • Print some statistics about data (how many users, how many items, maximum ratings, etc.) 69
  70. 70. Exercise 2 • Hints about objects to be used: – FastByIDMap • A PreferenceArray stores the preferences of a single user • Where are the preferences of all the users stored? – A HashMap? No. – Mahout introduces data structures optimized for recommendation tasks – HashMap is replaced by FastByIDMap – DataModel • Methods: getNumItems(), getNumUsers(), getMaxPreference() • General statistics about the model. 70
  71. 71. Exercise 2: data model

          import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
          import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
          import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
          import org.apache.mahout.cf.taste.model.DataModel;
          import org.apache.mahout.cf.taste.model.PreferenceArray;

          class CreateGenericDataModel {

            private CreateGenericDataModel() {
            }

            public static void main(String[] args) throws Exception {
              // userID -> preferences of that user
              FastByIDMap<PreferenceArray> preferences = new FastByIDMap<PreferenceArray>();
              // Size the array to the number of preferences actually set (2),
              // so that no empty entries end up in the model
              PreferenceArray prefsForUser1 = new GenericUserPreferenceArray(2);
              prefsForUser1.setUserID(0, 1L);
              prefsForUser1.setItemID(0, 101L);
              prefsForUser1.setValue(0, 3.0f);
              prefsForUser1.setItemID(1, 102L);
              prefsForUser1.setValue(1, 4.5f);
              preferences.put(1L, prefsForUser1);
              DataModel model = new GenericDataModel(preferences);
              System.out.println(model);
              // Some statistics about the data
              System.out.println("Users: " + model.getNumUsers());
              System.out.println("Items: " + model.getNumItems());
            }
          }
  72. 72. Exercise 3 • Create a DataModel • Feed the DataModel through a CSV file • Calculate similarities between users – CSV file should contain enough data! 72
  73. 73. Exercise 3 • Hints about objects to be used: – FileDataModel • Argument: new File with the path of the CSV – PearsonCorrelationSimilarity, TanimotoCoefficientSimilarity, etc. • Argument: the model 73
  74. 74. Exercise 3: similarity

          import java.io.File;
          import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
          import org.apache.mahout.cf.taste.impl.similarity.*;
          import org.apache.mahout.cf.taste.model.DataModel;
          import org.apache.mahout.cf.taste.similarity.UserSimilarity;

          class Example3_Similarity {

            public static void main(String[] args) throws Exception {
              // FileDataModel: instantiate the DataModel from the CSV file
              DataModel model = new FileDataModel(new File("intro.csv"));
              // Similarity definitions
              UserSimilarity pearson = new PearsonCorrelationSimilarity(model);
              UserSimilarity euclidean = new EuclideanDistanceSimilarity(model);
              // Output: similarity of user 1 with users 2 and 3
              System.out.println("Pearson:" + pearson.userSimilarity(1, 2));
              System.out.println("Euclidean:" + euclidean.userSimilarity(1, 2));
              System.out.println("Pearson:" + pearson.userSimilarity(1, 3));
              System.out.println("Euclidean:" + euclidean.userSimilarity(1, 3));
            }
          }
  78. 78. Exercise 4 • Create a DataModel • Feed the DataModel through a CSV file • Calculate similarities between users – CSV file should contain enough data! • Generate neighborhood • Generate recommendations – Compare different combinations of parameters! 78
  81. 81. Exercise 4 • Hints about objects to be used: – NearestNUserNeighborhood – GenericUserBasedRecommender • Parameters: – data model (already shown) – neighborhood » Class: NearestNUserNeighborhood(n, similarity, model) » Class: ThresholdUserNeighborhood(thr, similarity, model) – similarity measure (already shown) 81
  82. 82. Exercise 4: First Recommender

          import java.io.File;
          import java.util.List;
          import org.apache.mahout.cf.taste.impl.model.file.*;
          import org.apache.mahout.cf.taste.impl.neighborhood.*;
          import org.apache.mahout.cf.taste.impl.recommender.*;
          import org.apache.mahout.cf.taste.impl.similarity.*;
          import org.apache.mahout.cf.taste.model.*;
          import org.apache.mahout.cf.taste.neighborhood.*;
          import org.apache.mahout.cf.taste.recommender.*;
          import org.apache.mahout.cf.taste.similarity.*;

          class RecommenderIntro {

            private RecommenderIntro() {
            }

            public static void main(String[] args) throws Exception {
              // FileDataModel
              DataModel model = new FileDataModel(new File("intro.csv"));
              UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
              // 2 neighbours
              UserNeighborhood neighborhood =
                  new NearestNUserNeighborhood(2, similarity, model);
              Recommender recommender =
                  new GenericUserBasedRecommender(model, neighborhood, similarity);
              // Top-1 recommendation for user 1
              List<RecommendedItem> recommendations = recommender.recommend(1, 1);
              for (RecommendedItem recommendation : recommendations) {
                System.out.println(recommendation);
              }
            }
          }
  86. 86. Exercise 5: MovieLens Recommender • Download the MovieLens 100k dataset from GroupLens – Its format is already Mahout-compliant – http://files.grouplens.org/datasets/movielens/ml-100k.zip • Preparatory exercise: repeat Exercise 3 (similarity calculations) with a bigger dataset • Next: now we can run the recommendation framework against a standard benchmark dataset 86
  87. 87. Exercise 5: MovieLens Recommender

          import java.io.File;
          import java.util.List;
          import org.apache.mahout.cf.taste.impl.model.file.*;
          import org.apache.mahout.cf.taste.impl.neighborhood.*;
          import org.apache.mahout.cf.taste.impl.recommender.*;
          import org.apache.mahout.cf.taste.impl.similarity.*;
          import org.apache.mahout.cf.taste.model.*;
          import org.apache.mahout.cf.taste.neighborhood.*;
          import org.apache.mahout.cf.taste.recommender.*;
          import org.apache.mahout.cf.taste.similarity.*;

          class RecommenderIntro {

            private RecommenderIntro() {
            }

            public static void main(String[] args) throws Exception {
              // ua.base is one of the files of the MovieLens 100k dataset
              DataModel model = new FileDataModel(new File("ua.base"));
              UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
              UserNeighborhood neighborhood =
                  new NearestNUserNeighborhood(100, similarity, model);
              Recommender recommender =
                  new GenericUserBasedRecommender(model, neighborhood, similarity);
              // Top-20 recommendations for user 1.
              // We can play with parameters, e.g. recommender.recommend(10, 50)
              List<RecommendedItem> recommendations = recommender.recommend(1, 20);
              for (RecommendedItem recommendation : recommendations) {
                System.out.println(recommendation);
              }
            }
          }
  89. 89. Exercise 5: MovieLens Recommender • Analyze Recommender behavior with different combinations of parameters – Do the recommendations change with a different similarity measure? – Do the recommendations change with different neighborhood sizes? – Which one is the best one? • …. Let’s go to the next exercise! 89
  90. 90. Exercise 6: Recommender Evaluation • Evaluate different CF recommender configurations on MovieLens data • Metrics: RMSE, MAE, Precision 90
  91. 91. Exercise 6: Recommender Evaluation • Evaluate different CF recommender configurations on MovieLens data • Metrics: RMSE, MAE • Hints: useful classes – Implementations of the RecommenderEvaluator interface • AverageAbsoluteDifferenceRecommenderEvaluator • RMSRecommenderEvaluator 91
  92. 92. Exercise 6: Recommender Evaluation • Further hints: – Use RandomUtils.useTestSeed() to ensure consistency among different evaluation runs – Invoke the evaluate() method • Parameters – RecommenderBuilder: builds the recommender instance (as in previous exercises) – DataModelBuilder: specific criterion for training – Training/test split: double value (e.g. 0.7 for 70% training) – Amount of data to use in the evaluation: double value (e.g. 1.0 for 100%) 92
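Why a fixed seed matters can be shown with a plain-Java sketch of a random training/test split. The splitting rule used here (each data point goes to the training set with probability equal to the training fraction) is an assumption made for the illustration and not a description of Mahout's internals; the class and method names are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Plain-Java sketch (not Mahout code): a reproducible random training/test split. */
final class SeededSplitSketch {

    /** Returns [training, test]; the same seed always yields the same split. */
    static List<List<Integer>> split(List<Integer> data, double trainingFraction, long seed) {
        Random random = new Random(seed);
        List<Integer> training = new ArrayList<>();
        List<Integer> test = new ArrayList<>();
        for (Integer element : data) {
            if (random.nextDouble() < trainingFraction) {
                training.add(element);
            } else {
                test.add(element);
            }
        }
        return Arrays.asList(training, test);
    }

    public static void main(String[] args) {
        // Hypothetical data: ten preference indices, 70% training, arbitrary seed.
        List<Integer> data = Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
        System.out.println(split(data, 0.7, 42L));
        System.out.println(split(data, 0.7, 42L)); // identical to the line above
    }
}
```

Without a fixed seed, two runs would evaluate on different test sets, and a difference in scores between two recommender configurations could be due to the split rather than to the configuration.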
  93. 93. Exercise 6: evaluation

          import java.io.File;
          import org.apache.mahout.cf.taste.common.TasteException;
          import org.apache.mahout.cf.taste.eval.*;
          import org.apache.mahout.cf.taste.impl.eval.*;
          import org.apache.mahout.cf.taste.impl.model.file.*;
          import org.apache.mahout.cf.taste.impl.neighborhood.*;
          import org.apache.mahout.cf.taste.impl.recommender.*;
          import org.apache.mahout.cf.taste.impl.similarity.*;
          import org.apache.mahout.cf.taste.model.*;
          import org.apache.mahout.cf.taste.neighborhood.*;
          import org.apache.mahout.cf.taste.recommender.*;
          import org.apache.mahout.cf.taste.similarity.*;
          import org.apache.mahout.common.RandomUtils;

          class EvaluatorIntro {

            private EvaluatorIntro() {
            }

            public static void main(String[] args) throws Exception {
              // Ensures the consistency between different evaluation runs
              RandomUtils.useTestSeed();
              DataModel model = new FileDataModel(new File("ua.base"));
              RecommenderEvaluator evaluator =
                  new AverageAbsoluteDifferenceRecommenderEvaluator();
              // Recommendation engine: the same recommender built in the previous exercise
              RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
                @Override
                public Recommender buildRecommender(DataModel model) throws TasteException {
                  UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
                  UserNeighborhood neighborhood =
                      new NearestNUserNeighborhood(100, similarity, model);
                  return new GenericUserBasedRecommender(model, neighborhood, similarity);
                }
              };
              // 70% training, whole dataset used in the evaluation
              double score = evaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
              System.out.println(score);
            }
          }
  97. 97. Exercise 6: evaluation (we can add more measures)

          import java.io.File;
          import org.apache.mahout.cf.taste.common.TasteException;
          import org.apache.mahout.cf.taste.eval.*;
          import org.apache.mahout.cf.taste.impl.eval.*;
          import org.apache.mahout.cf.taste.impl.model.file.*;
          import org.apache.mahout.cf.taste.impl.neighborhood.*;
          import org.apache.mahout.cf.taste.impl.recommender.*;
          import org.apache.mahout.cf.taste.impl.similarity.*;
          import org.apache.mahout.cf.taste.model.*;
          import org.apache.mahout.cf.taste.neighborhood.*;
          import org.apache.mahout.cf.taste.recommender.*;
          import org.apache.mahout.cf.taste.similarity.*;
          import org.apache.mahout.common.RandomUtils;

          class EvaluatorIntro {

            private EvaluatorIntro() {
            }

            public static void main(String[] args) throws Exception {
              RandomUtils.useTestSeed();
              DataModel model = new FileDataModel(new File("ua.base"));
              // One evaluator per measure: MAE and RMSE
              RecommenderEvaluator maeEvaluator =
                  new AverageAbsoluteDifferenceRecommenderEvaluator();
              RecommenderEvaluator rmseEvaluator = new RMSRecommenderEvaluator();
              // Build the same recommender used in the previous exercise
              RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
                @Override
                public Recommender buildRecommender(DataModel model) throws TasteException {
                  UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
                  UserNeighborhood neighborhood =
                      new NearestNUserNeighborhood(100, similarity, model);
                  return new GenericUserBasedRecommender(model, neighborhood, similarity);
                }
              };
              double mae = maeEvaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
              double rmse = rmseEvaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
              System.out.println(mae);
              System.out.println(rmse);
            }
          }
  99. 99. Exercise 7: item-based recommender • Mahout provides Java classes for building an item-based recommender system – Amazon-like – Recommendations are based on similarities among items (generally pre-computed offline) – Evaluate it with the MovieLens dataset! 99
  100. 100. Exercise 7: item-based recommender
      import java.io.File;
      import org.apache.mahout.cf.taste.common.TasteException;
      import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
      import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
      import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
      import org.apache.mahout.cf.taste.impl.eval.RMSRecommenderEvaluator;
      import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
      import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
      import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
      import org.apache.mahout.cf.taste.model.DataModel;
      import org.apache.mahout.cf.taste.recommender.Recommender;
      import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
      import org.apache.mahout.common.RandomUtils;

      class IREvaluatorIntro {
        private IREvaluatorIntro() { }

        public static void main(String[] args) throws Exception {
          RandomUtils.useTestSeed();
          DataModel model = new FileDataModel(new File("ua.base"));
          RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
          RecommenderEvaluator rmseEvaluator = new RMSRecommenderEvaluator();
          // Build an item-based recommender: only an item-item similarity is needed
          RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
            @Override
            public Recommender buildRecommender(DataModel model) throws TasteException {
              ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
              return new GenericItemBasedRecommender(model, similarity);
            }
          };
          double score = evaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
          double rmse = rmseEvaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
          System.out.println(score);
          System.out.println(rmse);
        }
      }
  101. 101. Exercise 7: item-based recommender • In the listing on the previous slide the similarity is declared as ItemSimilarity: PearsonCorrelationSimilarity is used here to compare items rather than users.
  102. 102. Exercise 7: item-based recommender • No neighborhood definition is needed for item-based recommenders.
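To make the item-based idea concrete, here is a small library-free sketch: compute a similarity between item rating columns, then predict a user's rating for an item as the similarity-weighted average of that user's ratings on other items. This is a simplification under stated assumptions, not Mahout's implementation. It uses cosine similarity on co-rated entries instead of the Pearson correlation used on the slides, it treats 0 as "not rated", and the class name `ItemBasedSketch` and the toy matrix are made up.

```java
class ItemBasedSketch {

    /** Cosine similarity between item columns a and b, over users who rated both. */
    static double cosine(double[][] ratings, int a, int b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (double[] row : ratings) {
            if (row[a] > 0 && row[b] > 0) { // assumption: 0 means "not rated"
                dot += row[a] * row[b];
                normA += row[a] * row[a];
                normB += row[b] * row[b];
            }
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Similarity-weighted average of the user's own ratings on the other items. */
    static double predict(double[][] ratings, int user, int item) {
        double weighted = 0.0, totalWeight = 0.0;
        for (int j = 0; j < ratings[user].length; j++) {
            if (j == item || ratings[user][j] == 0) {
                continue; // skip the target item and items the user has not rated
            }
            double sim = cosine(ratings, item, j);
            weighted += sim * ratings[user][j];
            totalWeight += Math.abs(sim);
        }
        return totalWeight == 0 ? Double.NaN : weighted / totalWeight;
    }

    public static void main(String[] args) {
        // Rows are users, columns are items; user 2 has not rated item 0
        double[][] ratings = {
            {5, 5, 1},
            {4, 4, 2},
            {0, 5, 1},
        };
        System.out.println("Predicted rating: " + predict(ratings, 2, 0));
    }
}
```

Because item-item similarities depend only on the rating columns, they can be computed once offline and reused for every user, which is why no per-user neighborhood appears in the item-based listing.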
  103. 103. Exercise 8: MF-based recommender • Mahout provides Java classes for building a CF recommender system based on state-of-the-art matrix factorization techniques – Class: SVDRecommender • Parameters: DataModel, Factorizer • Factorizer: a factorization algorithm – Alternating Least Squares (ALSWRFactorizer) – SVD++ (SVDPlusPlusFactorizer) – Stochastic Gradient Descent (ParallelSGDFactorizer), etc. – Several parameters to tune! – Evaluate it with the MovieLens dataset!
  104. 104. Exercise 8: MF-based recommender
      import java.io.File;
      import org.apache.mahout.cf.taste.common.TasteException;
      import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
      import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
      import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
      import org.apache.mahout.cf.taste.impl.eval.RMSRecommenderEvaluator;
      import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
      import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
      import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
      import org.apache.mahout.cf.taste.model.DataModel;
      import org.apache.mahout.cf.taste.recommender.Recommender;
      import org.apache.mahout.common.RandomUtils;

      class IREvaluatorIntro {
        private IREvaluatorIntro() { }

        public static void main(String[] args) throws Exception {
          RandomUtils.useTestSeed();
          DataModel model = new FileDataModel(new File("ua.base"));
          RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
          RecommenderEvaluator rmseEvaluator = new RMSRecommenderEvaluator();
          // Build a matrix-factorization recommender:
          RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
            @Override
            public Recommender buildRecommender(DataModel model) throws TasteException {
              // 10 latent factors, lambda = 0.065, 60 iterations
              ALSWRFactorizer factorizer = new ALSWRFactorizer(model, 10, 0.065, 60);
              return new SVDRecommender(model, factorizer);
            }
          };
          double score = evaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
          double rmse = rmseEvaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
          System.out.println(score);
          System.out.println(rmse);
        }
      }
  105. 105. Exercise 8: MF-based recommender • Hyperparameters of the ALSWRFactorizer call: number of latent factors (10), lambda (0.065), number of iterations (60)
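The three hyperparameters can be related to the model itself with a small sketch. In a matrix-factorization recommender each user and each item is represented by a vector of latent factors, and a rating is predicted as their dot product; lambda controls the regularization and the iteration count controls how long training runs. The code below is a library-free illustration that uses plain stochastic gradient descent on a single rating, not the alternating least squares procedure of `ALSWRFactorizer`. The class name `MatrixFactorizationSketch`, the `learningRate` parameter, and the starting values are assumptions made for the example.

```java
class MatrixFactorizationSketch {

    /** Predicted rating: dot product of the user and item latent-factor vectors. */
    static double predict(double[] userFactors, double[] itemFactors) {
        double dot = 0.0;
        for (int k = 0; k < userFactors.length; k++) {
            dot += userFactors[k] * itemFactors[k];
        }
        return dot;
    }

    /** One regularized SGD update for a single observed rating (updates both vectors in place). */
    static void sgdStep(double[] userFactors, double[] itemFactors,
                        double rating, double learningRate, double lambda) {
        double error = rating - predict(userFactors, itemFactors);
        for (int k = 0; k < userFactors.length; k++) {
            double u = userFactors[k]; // keep the old value for the item update
            userFactors[k] += learningRate * (error * itemFactors[k] - lambda * u);
            itemFactors[k] += learningRate * (error * u - lambda * itemFactors[k]);
        }
    }

    public static void main(String[] args) {
        double[] user = {0.1, 0.1}; // 2 latent factors, small starting values
        double[] item = {0.1, 0.1};
        for (int iteration = 0; iteration < 200; iteration++) {
            sgdStep(user, item, 4.0, 0.05, 0.065);
        }
        System.out.println("Prediction after training: " + predict(user, item));
    }
}
```

More latent factors give the model more capacity, a larger lambda pulls the factors toward zero to limit overfitting, and more iterations give the optimizer more time to converge. These are the settings to vary when tuning the factorizer.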
  106. 106. Mahout Strengths • Fast-prototyping and evaluation – To evaluate a different configuration of the same algorithm we just need to update a parameter and run again. – Example • Different Neighborhood Size • Different similarity measures, etc. 106
  107. 107. 5 minutes to look for the best configuration :)
  108. 108. • Evaluation of CF algorithms through IR measures • Metrics: Precision, Recall Exercise 9: Recommender Evaluation 108
  109. 109. Exercise 9: Recommender Evaluation • Evaluation of CF algorithms through IR measures • Metrics: Precision, Recall • Hints: useful classes – GenericRecommenderIRStatsEvaluator – evaluate() method • Same parameters as in exercises 6 and 7
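  110. 110. Exercise 9: IR-based evaluation • Precision@5, Recall@5, etc.
      import java.io.File;
      import org.apache.mahout.cf.taste.common.TasteException;
      import org.apache.mahout.cf.taste.eval.IRStatistics;
      import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
      import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
      import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
      import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
      import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
      import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
      import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
      import org.apache.mahout.cf.taste.model.DataModel;
      import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
      import org.apache.mahout.cf.taste.recommender.Recommender;
      import org.apache.mahout.cf.taste.similarity.UserSimilarity;
      import org.apache.mahout.common.RandomUtils;

      class IREvaluatorIntro {
        private IREvaluatorIntro() { }

        public static void main(String[] args) throws Exception {
          RandomUtils.useTestSeed();
          DataModel model = new FileDataModel(new File("ua.base"));
          RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
          // Build the same recommender for testing that we did last time:
          RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
            @Override
            public Recommender buildRecommender(DataModel model) throws TasteException {
              UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
              UserNeighborhood neighborhood = new NearestNUserNeighborhood(100, similarity, model);
              return new GenericUserBasedRecommender(model, neighborhood, similarity);
            }
          };
          // at = 5 (Precision@5, Recall@5); the evaluator chooses the relevance threshold;
          // 1.0 = evaluate on all users
          IRStatistics stats = evaluator.evaluate(recommenderBuilder, null, model, null, 5,
              GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
          System.out.println(stats.getPrecision());
          System.out.println(stats.getRecall());
          System.out.println(stats.getF1Measure());
        }
      }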
  112. 112. Exercise 9: IR-based evaluation • Set the neighborhood size to 500: in the listing from slide 110, change the neighborhood line to UserNeighborhood neighborhood = new NearestNUserNeighborhood(500, similarity, model); and leave everything else unchanged.
  113. 113. Exercise 9: IR-based evaluation • Set Euclidean distance: keep the neighborhood size at 500 and change the similarity line to UserSimilarity similarity = new EuclideanDistanceSimilarity(model); everything else in the listing from slide 110 is unchanged.
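As a reminder of what the IR measures in Exercise 9 mean, the sketch below computes Precision@k, Recall@k, and F1 for a single user from a ranked recommendation list and a set of relevant items. It is a library-free illustration; the class name `IrMetricsSketch` and the item IDs are made up, and Mahout's evaluator additionally handles the choice of relevant items and the aggregation over users.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class IrMetricsSketch {

    /** Number of relevant items among the first k recommendations. */
    static int hitsAtK(List<Long> recommended, Set<Long> relevant, int k) {
        int hits = 0;
        for (int i = 0; i < Math.min(k, recommended.size()); i++) {
            if (relevant.contains(recommended.get(i))) {
                hits++;
            }
        }
        return hits;
    }

    /** Fraction of the top-k recommendations that are relevant. */
    static double precisionAtK(List<Long> recommended, Set<Long> relevant, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        return (double) hitsAtK(recommended, relevant, k) / k;
    }

    /** Fraction of the relevant items that appear in the top-k recommendations. */
    static double recallAtK(List<Long> recommended, Set<Long> relevant, int k) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) hitsAtK(recommended, relevant, k) / relevant.size();
    }

    /** Harmonic mean of precision and recall. */
    static double f1(double precision, double recall) {
        return (precision + recall == 0) ? 0.0 : 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        List<Long> recommended = Arrays.asList(10L, 20L, 30L, 40L, 50L);
        Set<Long> relevant = new HashSet<>(Arrays.asList(20L, 50L, 70L));
        double p = precisionAtK(recommended, relevant, 5);
        double r = recallAtK(recommended, relevant, 5);
        System.out.println("P@5 = " + p + ", R@5 = " + r + ", F1 = " + f1(p, r));
    }
}
```

Unlike MAE and RMSE, higher values are better here, so a parameter search over these metrics should maximize rather than minimize.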
  114. 114. • Write a class that automatically runs evaluation with different parameters – e.g. fixed neighborhood sizes from an Array of values – Print the best scores and the configuration Exercise 10: Recommender Evaluation 114
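One possible shape for the search loop of Exercise 10 is sketched below. To keep the example self-contained, the scoring step is passed in as a function from neighborhood size to score; in the actual exercise that function would build a recommender with the given size and return the value of `evaluator.evaluate(...)`. The class name `NeighborhoodSweep` and the stand-in scoring function in `main` are assumptions, and the loop assumes an error measure such as MAE or RMSE, where lower is better.

```java
import java.util.function.IntToDoubleFunction;

class NeighborhoodSweep {

    /** Returns the size with the lowest score, or -1 if no size produced a usable score. */
    static int bestSize(int[] sizes, IntToDoubleFunction scoreForSize) {
        int best = -1;
        double bestScore = Double.POSITIVE_INFINITY;
        for (int size : sizes) {
            double score = scoreForSize.applyAsDouble(size);
            System.out.println("neighborhood=" + size + " score=" + score);
            // Skip NaN defensively: an evaluation may fail to produce any estimate
            if (!Double.isNaN(score) && score < bestScore) {
                bestScore = score;
                best = size;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] sizes = {10, 50, 100, 200, 500};
        // Stand-in scoring function, for illustration only; replace it with a call
        // that runs the Mahout evaluation for the given neighborhood size
        int best = bestSize(sizes, size -> 0.8 + Math.abs(size - 100) / 1000.0);
        System.out.println("Best neighborhood size: " + best);
    }
}
```

The same loop extends to a second dimension, for example a list of similarity measures, by nesting it and recording the best pair of settings.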
  115. 115. Exercise 11: more datasets! • Find the best configuration for several datasets – Download datasets from http://mahout.apache.org/users/basics/collections.html – Write classes to transform input data into a Mahout-compliant form – Extend exercise 10!
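A transformation class for Exercise 11 can be quite small. FileDataModel reads comma- or tab-separated lines of the form userID,itemID,rating, so the sketch below rewrites each input line into that shape. It assumes the input stores user, item, and rating in its first three fields and uses `::` as the delimiter, as the MovieLens 1M ratings file does; other datasets will need a different delimiter or field order. The class name `RatingsConverter` and the command-line usage are illustrative.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class RatingsConverter {

    /** Converts one "user<delim>item<delim>rating[<delim>...]" line to "user,item,rating". */
    static String toMahoutCsv(String line, String delimiterRegex) {
        String[] fields = line.trim().split(delimiterRegex);
        if (fields.length < 3) {
            throw new IllegalArgumentException("expected user, item and rating: " + line);
        }
        return fields[0].trim() + "," + fields[1].trim() + "," + fields[2].trim();
    }

    /** Converts a whole file line by line, skipping blank lines. */
    static void convert(Path input, Path output, String delimiterRegex) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue;
                }
                writer.write(toMahoutCsv(line, delimiterRegex));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: RatingsConverter <input file> <output file>");
            return;
        }
        convert(Paths.get(args[0]), Paths.get(args[1]), "::");
    }
}
```

Datasets whose user or item identifiers are not numeric need an extra step that maps each identifier to a long value, since the data model works with numeric IDs.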
  116. 116. End. Do you want more? 116
  117. 117. Do you want more? • Recommendation – Deployment of a Mahout-based Web Recommender – Integration with Hadoop/Spark – Integration of content-based information – Custom similarities, custom recommenders, re-scoring functions • Content-based Recommender Systems through Classification Algorithms!
