A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)

A quick tutorial on Mahout's recommendation algorithm.

Published in: Education, Technology
1 Comment
  • On slides 4 and 5, why do the diagonal yellow cells have different numbers? They represent the same items, so they should have 5 or be empty.
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)

  1. A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)
     Jee Vang, Ph.D. [email_address]
     A Quick Tutorial on Mahout's Recommendation Engine is licensed under a Creative Commons Attribution 3.0 Unported License.
     Slide Version 3.1
  2. What is recommendation?
     • Recommendation predicts which new items a user will like or dislike, based on the user's preferences for, or associations with, previous items
     • (Made-up) example:
         • A user, John Doe, likes the following books (items):
             • A Tale of Two Cities
             • The Great Gatsby
             • For Whom the Bell Tolls
         • A recommender predicts which new books (items) John Doe will like:
             • Jane Eyre
             • The Adventures of Tom Sawyer
  3. What is Mahout?
     • Mahout is a machine learning application programming interface (API) built on Hadoop
         • MapReduce (MR or M/R)
         • Hadoop Distributed File System (HDFS)
     • Mahout is written in Java
     • Mahout has machine learning algorithms in the following areas:
         • Clustering
         • Pattern mining
         • Classification
         • Regression
         • Evolutionary algorithms
         • Recommenders/Collaborative filtering
  4. How does Mahout’s Recommendation Engine Work?
     S × U = R
     • S is the similarity matrix between items
     • U holds the user’s preferences for items
     • R holds the predicted recommendations
  5. What is the similarity matrix, S?
     • S is an n × n (square) matrix
         • Each element, e, in S is indexed by row (j) and column (k): e_jk
         • Each e_jk in S holds a value that describes how similar the j-th and k-th items are
         • In this example, the similarity of the j-th and k-th items is determined by the frequency of their co-occurrence (when the j-th item is seen, the k-th item is seen as well)
             • In general, any similarity measure may be used to produce these values
         • We see in this example that
             • Items 1 and 2 co-occur 3 times,
             • Items 1 and 3 co-occur 4 times,
             • and so on…
     [Figure: similarity matrix S, with rows and columns labeled Item 1 through Item 7]
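The co-occurrence counting described on this slide can be sketched in plain Java. This is a minimal illustration, not Mahout code: the class name `CooccurrenceSketch`, the method `cooccurrence`, and the three small user "baskets" are all made up, and each basket is assumed to list an item at most once. The diagonal is left at zero here; that is a choice made for the sketch only.

```java
import java.util.Arrays;

// Hypothetical sketch: builds an item-item co-occurrence matrix S
// from made-up per-user item lists. Not part of Mahout.
class CooccurrenceSketch {

    // baskets[u] lists the 0-based item indices that user u has a preference for.
    static int[][] cooccurrence(int[][] baskets, int numItems) {
        int[][] s = new int[numItems][numItems];
        for (int[] basket : baskets) {
            for (int j : basket) {
                for (int k : basket) {
                    if (j != k) {
                        s[j][k]++; // items j and k were seen together once more
                    }
                }
            }
        }
        return s;
    }

    public static void main(String[] args) {
        int[][] baskets = { {0, 1}, {0, 1, 2}, {1, 2} };
        for (int[] row : cooccurrence(baskets, 3)) {
            System.out.println(Arrays.toString(row));
        }
        // prints:
        // [0, 2, 1]
        // [2, 0, 2]
        // [1, 2, 0]
    }
}
```

The resulting matrix is symmetric because every pair is counted in both directions.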
  6. What are the user’s preferences, U?
     • The user’s preferences are represented as a column vector
         • Each value in the vector represents the user’s preference for the j-th item
         • In general, this column vector is sparse
         • A value of zero (0) means no preference has been recorded for the j-th item
     [Figure: preference vector U, with rows labeled Item 1 through Item 7]
  7. What is the recommendation, R?
     • R is a column vector whose j-th value is the predicted recommendation score of the j-th item for the user
     • R is computed by multiplying S and U
         • S × U = R
     • In this running example, the user has already expressed positive preferences for Items 1, 4, 5, and 7, so we look only at Items 2, 3, and 6
     • We would recommend Items 3, 2, and 6 to the user, in this order
     [Figure: recommendation vector R, with rows labeled Item 1 through Item 7]
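A small worked example may make the multiplication and the ranking step concrete. The sketch below uses a made-up 4 × 4 similarity matrix and preference vector (the slide's own values are not reproduced here), and the names `RecommendSketch`, `multiply`, and `rankUnseen` are hypothetical. Items for which the user already has a preference are dropped before ranking, as the slide describes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of R = S x U followed by ranking. Not part of Mahout.
class RecommendSketch {

    // r[j] is the dot product of row j of S with the preference vector U.
    static double[] multiply(double[][] s, double[] u) {
        double[] r = new double[s.length];
        for (int j = 0; j < s.length; j++) {
            for (int k = 0; k < u.length; k++) {
                r[j] += s[j][k] * u[k];
            }
        }
        return r;
    }

    // Indices of items with no recorded preference (u[j] == 0), highest score first.
    static List<Integer> rankUnseen(double[] r, double[] u) {
        List<Integer> unseen = new ArrayList<>();
        for (int j = 0; j < r.length; j++) {
            if (u[j] == 0.0) {
                unseen.add(j);
            }
        }
        unseen.sort((a, b) -> Double.compare(r[b], r[a]));
        return unseen;
    }

    public static void main(String[] args) {
        // Made-up data: 4 items, the user has preferences for items 0 and 3.
        double[][] s = { {0, 3, 4, 1}, {3, 0, 2, 0}, {4, 2, 0, 5}, {1, 0, 5, 0} };
        double[] u = { 2.0, 0.0, 0.0, 4.0 };
        double[] r = multiply(s, u);         // r = [4.0, 6.0, 28.0, 2.0]
        System.out.println(rankUnseen(r, u)); // prints [2, 1]
    }
}
```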
  8. What data format does Mahout’s recommendation engine expect?
     • For Mahout v0.4, look at RecommenderJob (org.apache.mahout.cf.taste.hadoop.item.RecommenderJob)
     • Each line of the input file should have the following format
         • userID,itemID[,preferencevalue]
             • userID is parsed as a long
             • itemID is parsed as a long
             • preferencevalue is parsed as a double and is optional
     Format 1 (no preference values):
         123,345
         123,456
         123,789
         …
         789,458
     Format 2 (with preference values):
         123,345,1.0
         123,456,2.2
         123,789,3.4
         …
         789,458,1.2
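The line format above can be parsed with a few lines of plain Java. This is only a sketch of the parsing rule stated on the slide: the class `PreferenceLineSketch` and its `parse` method are hypothetical, and treating a missing preference value as 1.0 is an assumption made here, not something the slide states.

```java
// Hypothetical sketch: parses one "userID,itemID[,preferencevalue]" line.
// Not part of Mahout.
class PreferenceLineSketch {
    final long userID;
    final long itemID;
    final double value;

    PreferenceLineSketch(long userID, long itemID, double value) {
        this.userID = userID;
        this.itemID = itemID;
        this.value = value;
    }

    static PreferenceLineSketch parse(String line) {
        String[] tokens = line.trim().split(",");
        if (tokens.length < 2 || tokens.length > 3) {
            throw new IllegalArgumentException(
                "Expected userID,itemID[,preferencevalue] but got: " + line);
        }
        long userID = Long.parseLong(tokens[0].trim());
        long itemID = Long.parseLong(tokens[1].trim());
        // Assumption: a line without a preference value counts as 1.0.
        double value = tokens.length == 3 ? Double.parseDouble(tokens[2].trim()) : 1.0;
        return new PreferenceLineSketch(userID, itemID, value);
    }

    public static void main(String[] args) {
        PreferenceLineSketch p = parse("123,456,2.2");
        System.out.println(p.userID + " " + p.itemID + " " + p.value); // prints 123 456 2.2
    }
}
```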
  9. How do you run Mahout’s recommendation engine?
     • Requirements
         • Hadoop cluster on GNU/Linux
         • Java 1.6.x
         • SSH
     • Assuming you have a Hadoop cluster installed and configured correctly, with the data loaded into HDFS, run:
         $HADOOP_INSTALL$/bin/hadoop jar $TARGET$/mahout-core-0.4-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=$INPUT$ -Dmapred.output.dir=$OUTPUT$
         • $HADOOP_INSTALL$ is the location where you installed Hadoop
         • $TARGET$ is the directory where you have the Mahout jar file
         • $INPUT$ is the input file name
         • $OUTPUT$ is the output file name
     • There are plenty of runtime options (check the Javadocs)
         • --userFile (path): optional; a file containing userIDs; recommendations are computed only for these userIDs
         • --itemsFile (path): optional; a file containing itemIDs; only these items will be used in the recommendation predictions
         • --numRecommendations (integer): number of recommendations to compute per user; default 10
         • --booleanData (boolean): treat input data as having no preference values; default false
         • --maxPrefsPerUser (integer): maximum number of preferences considered per user in the final recommendation phase; default 10
         • --similarityClassname (classname): similarity measure (co-occurrence, Euclidean, log-likelihood, Pearson, Tanimoto coefficient, uncentered cosine, cosine)
  10. What are the mechanics of Mahout’s recommendation engine?
      • Mahout is built on Hadoop’s MapReduce (MR) API
          • <K1,V1> → map → <K2,V2>
          • <K2,List(V2)> → reduce → <K3,V3>
      • A series of MR phases (jobs) is run to predict recommendations
          • ItemIDIndexMapper, ItemIDIndexReducer
          • ToItemPrefsMapper, ToUserVectorReducer
          • CountUsersMapper, CountUsersReducer
          • …
          • PartialMultiplyMapper, AggregateAndRecommendReducer
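The map, group, and reduce steps named above can be imitated in memory to show how the key/value pairs flow. The sketch below does not use the Hadoop API; the class `MapReduceSketch` and the task it performs (counting preferences per item) are invented purely to illustrate the <K1,V1> → <K2,V2> → <K3,V3> pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of one map/shuffle/reduce pass.
// Not Hadoop or Mahout code.
class MapReduceSketch {

    static Map<Long, Integer> countPrefsPerItem(List<String> lines) {
        // Map: each "userID,itemID" line yields one <itemID, 1> pair.
        // Shuffle: pairs with the same key are collected into one list.
        Map<Long, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            long itemID = Long.parseLong(line.split(",")[1]);
            grouped.computeIfAbsent(itemID, k -> new ArrayList<>()).add(1);
        }
        // Reduce: each <itemID, List(1, 1, ...)> group becomes <itemID, count>.
        Map<Long, Integer> reduced = new TreeMap<>();
        for (Map.Entry<Long, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int one : e.getValue()) {
                sum += one;
            }
            reduced.put(e.getKey(), sum);
        }
        return reduced;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("123,345", "123,456", "789,345");
        System.out.println(countPrefsPerItem(lines)); // prints {345=2, 456=1}
    }
}
```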
  11. Mahout’s Recommender Engine: Phase 1, Generate List of ItemIDs
      ItemIDIndexMapper
      • Input: <LongWritable,Text>
      • Output: <VarIntWritable,VarLongWritable>
      • Parses out the itemID as a long, itemID_long
      • Converts itemID_long to an int, itemID_int
      • Emits <itemID_int, itemID_long>
      ItemIDIndexReducer
      • Input: <VarIntWritable,List(VarLongWritable)>
      • Output: <VarIntWritable,VarLongWritable>
      • Finds the smallest value in the list of values, itemID_long_min
      • Emits <itemID_int, itemID_long_min>
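The reason the reducer picks the smallest value becomes clearer once two long IDs collide on the same int index. The sketch below is a plain-Java illustration with hypothetical names (`ItemIdIndexSketch`, `idToIndex`, `buildIndex`). The XOR fold used to turn a long into a non-negative int is an assumption chosen for the example; consult the Mahout source for the exact conversion it performs.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of Phase 1: long itemIDs are mapped to int indices,
// and collisions are resolved by keeping the smallest itemID. Not Mahout code.
class ItemIdIndexSketch {

    // Assumption: fold the high and low 32 bits together and clear the sign bit.
    static int idToIndex(long itemID) {
        return 0x7FFFFFFF & ((int) itemID ^ (int) (itemID >>> 32));
    }

    // Mapper and reducer combined: for each index keep the smallest itemID seen.
    static Map<Integer, Long> buildIndex(long[] itemIDs) {
        Map<Integer, Long> index = new TreeMap<>();
        for (long id : itemIDs) {
            index.merge(idToIndex(id), id, Long::min);
        }
        return index;
    }

    public static void main(String[] args) {
        // 5L << 32 and 5L fold to the same index, 5; the smaller ID wins.
        long[] ids = { 5L << 32, 5L, 345L };
        System.out.println(buildIndex(ids)); // prints {5=5, 345=345}
    }
}
```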
  12. Mahout’s Recommender Engine: Phase 2, Create Preference Vector
      ToItemPrefsMapper
      • Input: <LongWritable,Text>
      • Output: <VarLongWritable,VarLongWritable>
      • Parses out userID and itemID
      • Emits <userID,itemID>
      ToUserVectorReducer
      • Input: <VarLongWritable,List(VarLongWritable)>
      • Output: <VarLongWritable,VectorWritable>
      • Creates preferences, U
          • U is a sparse Vector
      • Emits <userID, U>
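Grouping the input lines by user to obtain one sparse preference vector per user can be sketched as follows. This is plain Java with hypothetical names (`UserVectorSketch`, `toUserVectors`); a sorted map from itemID to value stands in for Mahout's sparse Vector, and a missing preference value is assumed to mean 1.0.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of Phase 2: one sparse preference vector per user.
// Not Mahout code.
class UserVectorSketch {

    // Returns userID -> (itemID -> preference value).
    static Map<Long, Map<Long, Double>> toUserVectors(String[] lines) {
        Map<Long, Map<Long, Double>> vectors = new TreeMap<>();
        for (String line : lines) {
            String[] t = line.split(",");
            long userID = Long.parseLong(t[0]);
            long itemID = Long.parseLong(t[1]);
            // Assumption: no preference value on the line means 1.0.
            double pref = t.length > 2 ? Double.parseDouble(t[2]) : 1.0;
            vectors.computeIfAbsent(userID, k -> new TreeMap<>()).put(itemID, pref);
        }
        return vectors;
    }

    public static void main(String[] args) {
        String[] lines = { "123,345,1.0", "123,456,2.2", "789,458,1.2" };
        System.out.println(toUserVectors(lines));
        // prints {123={345=1.0, 456=2.2}, 789={458=1.2}}
    }
}
```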
  13. Mahout’s Recommender Engine: Phase 3, Count Unique Users
      CountUsersMapper
      • Input: <LongWritable,Text>
      • Output: <CountUsersKeyWritable,VarLongWritable>
      • Parses out userID
      • Emits <userID,userID>
      CountUsersReducer
      • Input: <CountUsersKeyWritable,List(VarLongWritable)>
      • Output: <VarIntWritable,NullWritable>
      • Counts all unique users, numUsers
      • Emits <numUsers, null>
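The value this phase produces is simply the number of distinct userIDs in the input. The sketch below computes it with an in-memory set, which is only meant to show the result; a distributed job cannot rely on a single set like this, and the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of Phase 3's result: the number of distinct users.
// Not Mahout code.
class CountUsersSketch {

    static int countUniqueUsers(String[] lines) {
        Set<Long> seen = new HashSet<>();
        for (String line : lines) {
            // The userID is the first comma-separated field of each line.
            seen.add(Long.parseLong(line.split(",")[0]));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        String[] lines = { "123,345", "123,456", "789,458" };
        System.out.println(countUniqueUsers(lines)); // prints 2
    }
}
```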
  14. Mahout’s Recommender Engine: Phase 4, Transpose Preferences Vectors
      MaybePruneRowsMapper
      • Input: <VarLongWritable,VectorWritable>
          • Uses MR output from Phase 2
      • Output: <IntWritable,DistributedRowMatrix.MatrixEntryWritable>
      • Transposes MR output from Phase 2
          • MR Phase 2 output had users as rows and items as columns
          • Now, items are rows and users are columns
      • Each element, e_jk, is transposed to e_kj
      • Emits <k, e_kj>
      ToItemVectorsReducer
      • Input: <IntWritable,List(DistributedRowMatrix.MatrixEntryWritable)>
      • Output: <IntWritable,VectorWritable>
      • Writes transposed user preferences vectors, U’
      • Emits <row, U’>
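The transpose can be pictured as re-keying every non-zero entry by its column and then collecting the entries that share a key. The following plain-Java sketch does this in memory on a small made-up matrix; `TransposeSketch` and `transpose` are hypothetical names, and nested sorted maps stand in for the sparse vectors.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of Phase 4: user rows become item rows. Not Mahout code.
class TransposeSketch {

    // prefs[user][item] is a preference value; 0 means no recorded preference.
    // Returns item -> (user -> value), containing only the non-zero entries.
    static Map<Integer, Map<Integer, Double>> transpose(double[][] prefs) {
        Map<Integer, Map<Integer, Double>> itemVectors = new TreeMap<>();
        for (int user = 0; user < prefs.length; user++) {
            for (int item = 0; item < prefs[user].length; item++) {
                double value = prefs[user][item];
                if (value != 0.0) {
                    // "Map": re-key the entry by item. "Reduce": collect per item.
                    itemVectors.computeIfAbsent(item, k -> new TreeMap<>()).put(user, value);
                }
            }
        }
        return itemVectors;
    }

    public static void main(String[] args) {
        double[][] prefs = { {1.0, 0.0, 3.0}, {0.0, 2.0, 4.0} };
        System.out.println(transpose(prefs));
        // prints {0={0=1.0}, 1={1=2.0}, 2={0=3.0, 1=4.0}}
    }
}
```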
  15. Mahout’s Recommender Engine: Phase 5.1, RowSimilarityJob, Compute Weights
      RowWeightMapper
      • Input: <IntWritable,VectorWritable>
          • Uses MR output from Phase 4
      • Output: <VarIntWritable,WeightedOccurrences>
      • For each element, e_jk, computes its weighted occurrence, w_jk
      • Emits <k, w_jk>
      WeightedOccurrencesPerColumnReducer
      • Input: <VarIntWritable,List(WeightedOccurrences)>
      • Output: <VarIntWritable,WeightedOccurrenceArray>
      • Transfers the weighted occurrences to an array and writes the results
      • Emits <k, w_jk>
  16. Mahout’s Recommender Engine: Phase 5.2, RowSimilarityJob, Compute Similarities
      CooccurrencesMapper
      • Input: <VarIntWritable,WeightedOccurrenceArray>
          • Uses MR output from Phase 5.1
      • Output: <WeightedRowPair,Cooccurrence>
      • For each pair of rows, p, writes its column co-occurrences, c
      • Emits <p, c>
      SimilarityReducer
      • Input: <WeightedRowPair,List(Cooccurrence)>
      • Output: <SimilarityMatrixEntryKey,MatrixEntryWritable>
      • Computes the similarity between row a and row b, and writes the corresponding position in the matrix
      • Emits <row_j, matrix entry>
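One way to read Phases 5.1 and 5.2 together is: compute a weight per row, collect the column-wise co-occurrences for each pair of rows, then combine them into a similarity. The sketch below follows that outline for cosine similarity, which is only one of the measures the job supports. It is an in-memory illustration with made-up data and hypothetical names (`RowSimilaritySketch`, `cosineSimilarities`), and it assumes the row norm plays the role of the weight.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of row-pair similarity using cosine. Not Mahout code.
class RowSimilaritySketch {

    // rows[j][c] is item j's value in column (user) c; 0 means no entry.
    // Returns "j,k" -> cosine similarity for each pair j < k sharing a column.
    static Map<String, Double> cosineSimilarities(double[][] rows) {
        // Weights: the Euclidean norm of each row (assumed role of the weight).
        double[] norm = new double[rows.length];
        for (int j = 0; j < rows.length; j++) {
            double sumSq = 0.0;
            for (double v : rows[j]) {
                sumSq += v * v;
            }
            norm[j] = Math.sqrt(sumSq);
        }
        // Co-occurrences: in each column, every pair of rows with non-zero
        // entries contributes the product of those two entries.
        Map<String, Double> dot = new TreeMap<>();
        int numCols = rows.length == 0 ? 0 : rows[0].length;
        for (int c = 0; c < numCols; c++) {
            for (int j = 0; j < rows.length; j++) {
                for (int k = j + 1; k < rows.length; k++) {
                    if (rows[j][c] != 0.0 && rows[k][c] != 0.0) {
                        dot.merge(j + "," + k, rows[j][c] * rows[k][c], Double::sum);
                    }
                }
            }
        }
        // Similarity: summed products divided by the two row weights.
        Map<String, Double> sim = new TreeMap<>();
        for (Map.Entry<String, Double> e : dot.entrySet()) {
            String[] jk = e.getKey().split(",");
            double weights = norm[Integer.parseInt(jk[0])] * norm[Integer.parseInt(jk[1])];
            sim.put(e.getKey(), e.getValue() / weights);
        }
        return sim;
    }

    public static void main(String[] args) {
        double[][] rows = { {3, 4}, {3, 4}, {0, 5} };
        System.out.println(cosineSimilarities(rows));
    }
}
```

With the made-up rows in `main`, rows 0 and 1 are identical and get a similarity of 1, while each of them shares only the second column with row 2 and gets roughly 0.8.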
  17. Mahout’s Recommender Engine: Phase 5.3, RowSimilarityJob, Similarity Matrix
      Mapper
      • Input: <SimilarityMatrixEntryKey,MatrixEntryWritable>
          • Uses MR output from Phase 5.2
      • Output: <SimilarityMatrixEntryKey,MatrixEntryWritable>
      • Writes similarity matrix entry key, sme, and matrix entry, me, as is
      • sme is basically each row
      • me is basically each row-col entry of the similarity matrix
      • Emits <sme,me>
      EntriesToVectorsReducer
      • Input: <SimilarityMatrixEntryKey,List(MatrixEntryWritable)>
      • Output: <IntWritable,VectorWritable>
      • Writes the row and its associated vector out
      • Emits <row, vector>
  18. Mahout’s Recommender Engine: Phase 6, Pre-partial multiply, Similarity Matrix
      SimilarityMatrixRowWrapperMapper
      • Input: <IntWritable,VectorWritable>
          • Uses MR output from Phase 5.3
      • Output: <IntWritable,VectorOrPrefWritable>
      • Wraps the similarity vector, v_1, into a different vector format, v_2
      • Emits <row, v_2>
      Reducer
      • Input: <IntWritable,List(VectorOrPrefWritable)>
      • Output: <IntWritable,VectorOrPrefWritable>
      • Writes the row and each of its associated vectors out
      • Emits <row, vector>
  19. Mahout’s Recommender Engine: Phase 7, Pre-partial multiply, Preferences
      UserVectorSplitterMapper
      • Input: <VarLongWritable,VectorWritable>
          • Uses MR output from Phase 2
      • Output: <VarIntWritable,VectorOrPrefWritable>
      • Maps userID and preference vector, U
      • Emits <userID,U>
      Reducer
      • Input: <IntWritable,List(VectorOrPrefWritable)>
      • Output: <IntWritable,VectorOrPrefWritable>
      • Writes the row and each of its associated vectors out
      • Emits <row, vector>
  20. Mahout’s Recommender Engine: Phase 8, Partial Multiply
      Mapper
      • Input: <VarLongWritable,VectorWritable>
          • Uses MR outputs from Phases 6 and 7
      • Output: <VarIntWritable,VectorOrPrefWritable>
      • Maps row and vector, v
      • Emits <row,v>
      ToVectorAndPrefReducer
      • Input: <VarIntWritable,List(VectorOrPrefWritable)>
      • Output: <IntWritable,VectorOrPrefWritable>
      • Writes the row with its associated similarity vector, userIDs, and preference values
      • Emits <row, vector>
  21. Mahout’s Recommender Engine: Phase 9, Filters Items
      ItemFilterMapper
      • Input: <LongWritable,Text>
      • Output: <VarLongWritable,VarLongWritable>
      • Parses userID and itemID
      • Emits <itemID,userID>
      ItemFilterAsVectorAndPrefReducer
      • Input: <VarLongWritable,List(VarLongWritable)>
      • Output: <VarIntWritable,VectorOrPrefWritable>
      • Writes itemID and vector of userIDs and preferences
      • Emits <itemID, vector>
  22. Mahout’s Recommender Engine: Phase 10, Aggregate and Recommend
      PartialMultiplyMapper
      • Input: <VarIntWritable,VectorAndPrefsWritable>
          • Uses MR outputs from Phases 8 and 9
      • Output: <VarLongWritable,PrefAndSimilarityColumnWritable>
      • Writes userID and recommendations
      • Emits <userID,recommendation>
      AggregateAndRecommendReducer
      • Input: <VarLongWritable,List(PrefAndSimilarityColumnWritable)>
      • Output: <VarLongWritable,RecommendedItemsWritable>
      • Writes userID and vector of recommendations
      • Emits <userID, vector>
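The partial multiply and aggregation steps amount to computing S × U one column at a time: each item the user has a preference for contributes its preference value times that item's column of S, and the contributions are summed per user. The sketch below shows this for a single user in plain Java. The data is made up, the names `AggregateSketch` and `recommend` are hypothetical, and the top-n cut mirrors the `--numRecommendations` option only in spirit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of partial multiply plus aggregate-and-recommend
// for one user. Not Mahout code.
class AggregateSketch {

    // prefs: item index -> preference value, only for items with a preference.
    // s: item-item similarity matrix. Returns at most n unseen items, best first.
    static List<Integer> recommend(Map<Integer, Double> prefs, double[][] s, int n) {
        double[] score = new double[s.length];
        // Partial multiply: preferred item k adds pref_k * (column k of S).
        for (Map.Entry<Integer, Double> p : prefs.entrySet()) {
            int k = p.getKey();
            for (int j = 0; j < s.length; j++) {
                score[j] += p.getValue() * s[j][k];
            }
        }
        // Aggregate and recommend: drop known items, sort by score, keep top n.
        List<Integer> candidates = new ArrayList<>();
        for (int j = 0; j < s.length; j++) {
            if (!prefs.containsKey(j)) {
                candidates.add(j);
            }
        }
        candidates.sort((a, b) -> Double.compare(score[b], score[a]));
        return new ArrayList<>(candidates.subList(0, Math.min(n, candidates.size())));
    }

    public static void main(String[] args) {
        double[][] s = { {0, 3, 4, 1}, {3, 0, 2, 0}, {4, 2, 0, 5}, {1, 0, 5, 0} };
        Map<Integer, Double> prefs = new TreeMap<>();
        prefs.put(0, 2.0);
        prefs.put(3, 4.0);
        System.out.println(recommend(prefs, s, 10)); // prints [2, 1]
    }
}
```

Working column by column means only the columns of S for items the user has actually rated are touched, which suits a sparse preference vector.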
  23. Summary and Conclusion
      • Mahout is a machine learning API built on top of Hadoop; it includes clustering, pattern mining, classification, regression, evolutionary algorithms, and recommenders
      • Mahout’s recommender engine takes input in the expected format and produces predicted recommendations
          • It uses a series of MR phases to predict recommendations
