Scientific Article Recommendation with Mahout

I gave this presentation as part of the Data Science meetup in London on 23rd May, 2012.

This describes how I've been making use of Mahout's item-based collaborative filtering recommender system to produce personalised scientific article recommendations for researchers. I discuss how well Mahout performs out of the box and how I manage to reduce processing time by 95% by tuning it to our data set.
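
The 95% figure can be checked against the chart values given in the transcript (slides 22 to 44). As a worked sketch, assuming each labelled point reads as (normalised Amazon hours, good recommendations out of 10), the annotated percentages follow from simple relative differences:

```latex
\begin{align*}
\text{item-based tuning (hours):} \quad & \frac{6.5\text{K} - 2.4\text{K}}{6.5\text{K}} = \frac{4.1}{6.5} \approx 0.63 \\
\text{item-based to user-based (hours):} \quad & \frac{2.4\text{K} - 1.0\text{K}}{2.4\text{K}} = \frac{1.4}{2.4} \approx 0.58 \\
\text{user-based tuning (hours):} \quad & \frac{1.0\text{K} - 0.3\text{K}}{1.0\text{K}} = 0.70 \\
\text{overall (hours):} \quad & \frac{6.5\text{K} - 0.3\text{K}}{6.5\text{K}} = \frac{6.2}{6.5} \approx 0.95 \\
\text{good recommendations:} \quad & \frac{2.5 - 1.5}{1.5} = \frac{1}{1.5} \approx 0.67
\end{align*}
```

The last line corresponds to the 66.6% precision increase quoted in the conclusions, which the charts round to 67%.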

Published in: Technology, Education

Transcript of "Scientific Article Recommendation with Mahout"

  1. Scientific Article Recommendation with Mahout. Kris Jack, PhD, Senior Data Mining Engineer
  2. Use Case
     ➔ Good researchers are on top of their game
     ➔ Large amount of research produced
     ➔ Takes time to get at what you need
     ➔ Help researchers by recommending relevant research
  3. 1.5 million+ users; 50m research articles. The 20 largest user bases: University of Cambridge, Stanford University, MIT, University of Michigan, Harvard University, University of Oxford, Sao Paulo University, Imperial College London, University of Edinburgh, Cornell University, University of California at Berkeley, RWTH Aachen, Columbia University, Georgia Tech, University of Wisconsin, UC San Diego, University of California at LA, University of Florida, University of North Carolina
  4. Same figures and list as slide 3, with the callout: We need a recommender that scales up, coping with our data and future growth
  5. Questions
     ➔ How does Mahout's recommender work?
     ➔ How well does it perform out of the box?
     ➔ How well does it perform after some tuning?
  6. Mahout's Recommender
  7. Generating recommendations through matrix multiplication. This is item-based recommendation, as similarity is based on items, not users. org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
  8. Input (all user preferences): a matrix with Researchers (Turing, Babbage, Einstein, Newton) across the top and Research Articles (Comp Sci 1, Comp Sci 2, Physics 1, Physics 2) down the side
  9. The same input matrix at full scale: 1.5M researchers, 50M research articles, 300M prefs
  10. item.RecommenderJob: 1. Prep. pref. matrix (1-3) 2. Gen. sim. matrix (4-6) 3. Multiply matrices (7-10). Diagram: All User Preferences (item x user), Researchers by Research Articles
  11. Same steps and diagram as slide 10, adding: A User's Preferences (item x user), shown for Turing
  12. Same as slide 11, adding: Item Similarity (item x item)
  13. Same as slide 12, completing the product: Item Similarity (item x item) X A User's Preferences (item x user) = Recommendations (item x user), shown for Turing
  14. How well does it work?
  15. Mendeley Suggest
  16. Running on Amazon's Elastic Map Reduce. On demand use and easy to cost
  17. Mahout's Performance. [Chart axes: Normalised Amazon Hours (vertical) against No. Good Recommendations/10 (horizontal)]
  18. Same chart, with quadrants labelled: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good
  19. Same chart as slide 18
  20. Same chart as slide 18
  21. Same chart, with scales added: Normalised Amazon Hours from 0 to 7K, No. Good Recommendations/10 from 0 to 3
  22. Same chart, with the first data point: Orig. item-based at 6.5K, 1.5
  23. Let's tune it!
  24. 1. Reduce processing time 2. Improve quality
  25. 1. Reduce processing time
      ➔ Mahout's recommender is already efficient
      ➔ But your data may have unusual properties
      ➔ Hadoop may need a helping hand
      ➔ Let's see what's going on...
  26. Task Allocation: 37 hours to complete. 1 reducer allocated, despite having 48 available...
  27. Task Allocation
      Allocating more reducers on a per job basis:
          job.getConfiguration().setInt("mapred.reduce.tasks", numReducers);
      Allocating more mappers on a per job basis:
          job.getConfiguration().set("mapred.max.split.size", String.valueOf(splitSize));
  28. Task Allocation: 37 hours to complete reduced to 14 hours. From 1 → 40 reducers
  29. Partitioners: 14 hours to complete
  30. Partitioners: 14 hours to complete. Sizes shown on the slide: ~50KB and ~500MB
  31. InputSampler.Sampler<IntWritable, Text> sampler =
          new InputSampler.RandomSampler<IntWritable, Text>(...);
      InputSampler.writePartitionFile(conf, sampler);
      conf.setPartitionerClass(TotalOrderPartitioner.class);
      http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/
  32. Partitioners: 14 hours to complete reduced to 2 hours. Evenly distributed
  33. Mahout's Performance chart (as slide 22): Orig. item-based at 6.5K, 1.5
  34. Same chart, adding: Cust. item-based at 2.4K, 1.5
  35. Same chart, annotating the change from Orig. to Cust. item-based: -4.1K (63%)
  36. 2. Improve quality
      ➔ Mahout provides item-based CF
      ➔ We have many more items than users
      ➔ Typically, user-based is more appropriate
      ➔ So let's make one!
  37. The item-based diagram again (as slide 13): Item Similarity (item x item) X A User's Preferences (item x user) = Recommendations (item x user)
  38. The same diagram relabelled for a user-based job, with user.RecommenderJob in place of item.RecommenderJob and User Similarity (user x user) in place of Item Similarity (item x item)
  39. Mahout's Performance chart (as slide 34): Orig. item-based at 6.5K, 1.5; Cust. item-based at 2.4K, 1.5
  40. Same chart, adding: Orig. user-based at 1K, 2.5
  41. Same chart, annotating the change from Cust. item-based to Orig. user-based: +1 (67%) good recommendations, -1.4K (58%) hours
  42. Same chart, adding: Cust. user-based at 0.3K, 2.5
  43. Same chart, annotating the tuning gains: -4.1K (63%) for item-based, -0.7K (70%) for user-based
  44. Same chart, annotating the overall change from Orig. item-based to Cust. user-based: +1 (67%), -6.2K (95%)
  45. Conclusions
  46. Conclusions
      ➔ Mahout is doing a great job of powering Mendeley Suggest
        ➔ Large scale data set
        ➔ Good quality recommendations
      ➔ Tuning helps
        ➔ Help Hadoop with task allocation if necessary
        ➔ Partition your data appropriately
        ➔ We save 95% resources
      ➔ Use an appropriate algorithm
        ➔ Item- vs user-based (MAHOUT-1004)
        ➔ We increase precision by 66.6%
  47. Same chart as slide 44. http://www.mendeley.com/profiles/kris-jack/
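
Slides 7 to 13 describe item-based recommendation as a matrix product: Item Similarity (item x item) multiplied by A User's Preferences (item x user) gives Recommendations (item x user). The following is a minimal in-memory sketch of that idea, not Mahout's Hadoop implementation. The preference values are hypothetical, the class and method names are invented for illustration, and using plain co-occurrence counts as the similarity measure is an assumption.

```java
import java.util.Arrays;

// Illustrative only: a tiny in-memory version of the matrix product on slide 13.
final class ItemBasedSketch {

    /** Item x item co-occurrence matrix S = P * P^T for an item x user matrix P (assumed similarity). */
    static double[][] cooccurrence(double[][] prefs) {
        int items = prefs.length;
        double[][] sim = new double[items][items];
        for (int i = 0; i < items; i++) {
            for (int j = 0; j < items; j++) {
                double sum = 0.0;
                for (int u = 0; u < prefs[i].length; u++) {
                    sum += prefs[i][u] * prefs[j][u];
                }
                sim[i][j] = sum;
            }
        }
        return sim;
    }

    /** Extracts one user's preference vector (a column of the item x user matrix). */
    static double[] column(double[][] matrix, int col) {
        double[] out = new double[matrix.length];
        for (int i = 0; i < matrix.length; i++) {
            out[i] = matrix[i][col];
        }
        return out;
    }

    /** Recommendation scores: similarity matrix times the user's preference vector. */
    static double[] scores(double[][] sim, double[] userPrefs) {
        double[] out = new double[sim.length];
        for (int i = 0; i < sim.length; i++) {
            for (int j = 0; j < userPrefs.length; j++) {
                out[i] += sim[i][j] * userPrefs[j];
            }
        }
        return out;
    }

    /** Index of the highest-scoring item the user has no preference for, or -1 if none scores above zero. */
    static int bestUnseenItem(double[] scores, double[] userPrefs) {
        int best = -1;
        for (int i = 0; i < scores.length; i++) {
            boolean unseen = userPrefs[i] == 0.0;
            if (unseen && scores[i] > 0.0 && (best < 0 || scores[i] > scores[best])) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] items = {"Comp Sci 1", "Comp Sci 2", "Physics 1", "Physics 2"};
        // Hypothetical preferences; rows are items, columns are Turing, Babbage, Einstein, Newton.
        double[][] prefs = {
            {1, 1, 0, 0},
            {0, 1, 0, 0},
            {0, 0, 1, 1},
            {0, 0, 1, 1},
        };
        double[][] sim = cooccurrence(prefs);
        double[] turing = column(prefs, 0);
        double[] s = scores(sim, turing);
        System.out.println("Scores for Turing: " + Arrays.toString(s));
        int best = bestUnseenItem(s, turing);
        System.out.println("Recommend: " + (best < 0 ? "nothing" : items[best]));
    }
}
```

With these made-up preferences, Turing's only article (Comp Sci 1) co-occurs with Comp Sci 2 in Babbage's library, so Comp Sci 2 is the one unseen item with a non-zero score. The user-based variant on slide 38 swaps the roles: a user x user similarity matrix takes the place of the item x item one.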
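
Slide 31 pairs InputSampler with TotalOrderPartitioner, which routes each key to a reducer by comparing it against split points derived from a sample of the keys. The sketch below illustrates that idea in plain Java, without Hadoop: it samples keys, picks split points at evenly spaced positions in the sorted sample, and assigns keys by binary search. All names are hypothetical, the skewed key set is invented, and the fixed-stride sample stands in for the random sampler on the slide so that the result is deterministic.

```java
import java.util.Arrays;

// Illustrative only: sample-based range partitioning, the idea behind a total order partitioner.
final class RangePartitionSketch {

    /** Takes every stride-th key as a sample (a deterministic stand-in for random sampling). */
    static int[] everyNth(int[] keys, int stride) {
        int[] sample = new int[(keys.length + stride - 1) / stride];
        for (int i = 0; i < sample.length; i++) {
            sample[i] = keys[i * stride];
        }
        return sample;
    }

    /** Chooses numPartitions - 1 split points at evenly spaced positions in the sorted sample. */
    static int[] splitPoints(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] splits = new int[numPartitions - 1];
        for (int p = 1; p < numPartitions; p++) {
            splits[p - 1] = sorted[(int) ((long) p * sorted.length / numPartitions)];
        }
        return splits;
    }

    /** Partition p receives keys with splits[p - 1] <= key < splits[p]. */
    static int partitionFor(int key, int[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    /** Counts how many keys land in each partition. */
    static int[] countPerPartition(int[] keys, int[] splits) {
        int[] counts = new int[splits.length + 1];
        for (int key : keys) {
            counts[partitionFor(key, splits)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented skewed key set: squares are dense near zero and sparse at the top of the range.
        int[] keys = new int[1000];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = i * i;
        }
        int[] splits = splitPoints(everyNth(keys, 10), 4);
        System.out.println("Split points: " + Arrays.toString(splits));
        System.out.println("Keys per partition: " + Arrays.toString(countPerPartition(keys, splits)));
    }
}
```

Because the split points follow the sampled key distribution rather than the key range, each of the four partitions receives 250 of the 1000 keys here, even though the keys are far from uniform. Equal-width ranges over the same keys would put half of them in the first partition.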