Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley


I gave this presentation as part of the Big Data Week Conferences in London, 25th April, 2012.

Mendeley Suggest is a research article recommendation system powered by Mahout. This presentation explores how Mahout's distributed recommender works and how well it performs when applied to the problem of recommending research to Mendeley users. Based on experimentation, some tips are provided on how to speed Mahout up by tuning it to the characteristics of the training data set. A new recommendation algorithm is also presented that implements user-based collaborative filtering, complementing Mahout's existing item-based collaborative filtering algorithm. The user-based implementation will soon be contributed back to the Mahout community.

Published in: Education, Technology

Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

  1. Mahout becomes a researcher. Kris Jack, PhD, Senior Data Mining Engineer
  2. Overview ➔ What's Mendeley? ➔ Applications of Mahout's Recommender ➔ Under Mahout's Bonnet ➔ Mahout's Research Career so Far ➔ Conclusions
  3. What's Mendeley?
  4. ➔ Mendeley is a data platform for researchers ➔ We're bringing together researchers and the research that they produce from all over the world ➔ We're structuring this data in a machine-readable format ➔ We're opening this data up for you to build applications on top of it using our API ➔ These applications help researchers to do even better research and become more productive ➔ How are we building our community?
  5. Mendeley provides tools to help users... ...organise their research ➔ Reference management ➔ Cite-as-you-write ➔ Full-text article search ➔ Digitalised annotations
  6. Mendeley provides tools to help users... ...organise their research ...collaborate with one another ➔ Research network ➔ Professional research groups
  7. Mendeley provides tools to help users... ...organise their research ...collaborate with one another ...discover new research ➔ Mendeley Suggest ➔ Personalised article recommendations ➔ Weekly batch of 10 recommended articles ➔ Collaborative Filtering ➔ The more data, the better
  8. 1.5 million+ users; the 20 largest user bases: University of Cambridge, Stanford University, MIT, University of Michigan, Harvard University, University of Oxford, Sao Paulo University, Imperial College London, University of Edinburgh, Cornell University, University of California at Berkeley, RWTH Aachen, Columbia University, Georgia Tech, University of Wisconsin, UC San Diego, University of California at LA, University of Florida, University of North Carolina. 50m research articles
  9. Mendeley provides tools to help users... ...organise their research ...collaborate with one another ...discover new research. We need a recommender that scales up, coping with our data and future growth
  10. Applications of Mahout's Recommender
  11. Mahout use cases: ➔ Retrieve related items in large collections. Source: http://www.slideshare.net/kryton/the-data-layer
  12. Mahout use cases: ➔ Retrieve related items in large collections ➔ Discover relevant items that you may have overlooked. Source: http://engineering.foursquare.com/2011/03/22/building-a-recommendation-engine-foursquare-style/
  13. Mahout use cases: ➔ Retrieve related items in large collections ➔ Discover relevant items that you may have overlooked ➔ Find love! ➔ Mahout implements collaborative filtering, a surprisingly powerful algorithm. Source: http://www.speeddate.com/apps/site/views/mp/technology.php
  14. Mahout use cases: ➔ Retrieve related items in large collections ➔ Discover relevant items that you may have overlooked ➔ Find love! ➔ Mahout implements collaborative filtering, a surprisingly powerful algorithm ➔ Mendeley Suggest ➔ Discover new research ➔ Fill in gaps in your library ➔ Your personal advisor. Source: http://krisjack.blogspot.co.uk/2012/02/your-very-own-personalised-research.html
  15. Under Mahout's Bonnet
  16. Generating recommendations through matrix multiplication. This is item-based recommendation, as similarity is based on items, not users. Not convinced? Try reading these... Adomavicius, G., & Tuzhilin, A. (2005). Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6), 734-749. Piscataway, NJ, USA. http://www.slideshare.net/srowen/collaborative-filtering-at-scale-2 http://krisjack.blogspot.co.uk/2012/04/under-bonnet-of-mahouts-item-based.html
  17. Input (all user preferences): Researchers (Turing, Babbage, Einstein, Newton) by Research Articles (Comp Sci 1, Comp Sci 2, Physics 1, Physics 2)
  18. Input (all user preferences): Researchers (Turing, Babbage, Einstein, Newton) by Research Articles (Comp Sci 1, Comp Sci 2, Physics 1, Physics 2). 1.5M researchers, 50M research articles, 300M prefs
  19. item.RecommenderJob: 1. Prep. pref. matrix (1-3) 2. Gen. sim. matrix (4-6) 3. Multiply matrices (7-10). [Diagram: All User Preferences (item x user), Research Articles by Researchers]
  20. item.RecommenderJob: 1. Prep. pref. matrix (1-3) 2. Gen. sim. matrix (4-6) 3. Multiply matrices (7-10). [Diagram: All User Preferences (item x user); A User's Preferences (item x user), Research Articles by Turing]
  21. item.RecommenderJob: 1. Prep. pref. matrix (1-3) 2. Gen. sim. matrix (4-6) 3. Multiply matrices (7-10). [Diagram: All User Preferences (item x user); Item Similarity (item x item) with rows 2 1 0 0, 1 1 0 0, 0 0 2 2, 0 0 2 2; A User's Preferences (item x user), Research Articles by Turing]
  22. Input (all user preferences), Researchers by Research Articles. Item similarity, Research Articles by Research Articles (columns Comp Sci 1, Comp Sci 2, Physics 1, Physics 2): Comp Sci 1: 2 1 0 0; Comp Sci 2: 1 1 0 0; Physics 1: 0 0 2 2; Physics 2: 0 0 2 2
  23. item.RecommenderJob: 1. Prep. pref. matrix (1-3) 2. Gen. sim. matrix (4-6) 3. Multiply matrices (7-10). [Diagram: All User Preferences (item x user); Item Similarity (item x item) with rows 2 1 0 0, 1 1 0 0, 0 0 2 2, 0 0 2 2, X A User's Preferences (item x user) for Turing = Recommendations (item x user) for Turing]
  24. Running on Amazon's Elastic MapReduce. On-demand use and easy to cost
  25. Mahout's Research Career so Far
  26. Mendeley Suggest
  27. Mahout's Performance [chart: Normalised Amazon Hours vs. No. Good Recommendations/10]
  28. Mahout's Performance [chart: Normalised Amazon Hours vs. No. Good Recommendations/10; quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]
  29. Mahout's Performance [chart: Normalised Amazon Hours vs. No. Good Recommendations/10; quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]
  30. Mahout's Performance [chart: Normalised Amazon Hours vs. No. Good Recommendations/10; quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]
  31. Mahout's Performance [chart: Normalised Amazon Hours (0 to 7K) vs. No. Good Recommendations/10 (0 to 3); quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]
  32. Mahout's Performance [chart: Normalised Amazon Hours (0 to 7K) vs. No. Good Recommendations/10 (0 to 3)] ➔ Orig. item-based: 6.5K, 1.5
  33. Mahout's Performance [chart: Normalised Amazon Hours (0 to 7K) vs. No. Good Recommendations/10 (0 to 3)] ➔ Orig. item-based: 6.5K, 1.5 ➔ Cust. item-based: 2.4K, 1.5
  34. Mahout's Performance [chart: Normalised Amazon Hours (0 to 7K) vs. No. Good Recommendations/10 (0 to 3)] ➔ Orig. item-based: 6.5K, 1.5 ➔ Cust. item-based: 2.4K, 1.5 ➔ Difference: -4.1K (63%)
  35. Reducing processing time and cost ➔ Mahout's recommender is already efficient ➔ but your data may have unusual properties ➔ We got improvements by: ➔ tuning Hadoop's mapper and reducer allocation over the 10 steps in the RecommenderJob ➔ using an appropriate partitioner
  36. Task Allocation: 37 hours to complete. 1 reducer allocated, despite having 48 available...
  37. Task Allocation. Allocating more reducers on a per-job basis: job.getConfiguration().setInt("mapred.reduce.tasks", numMappers); Allocating more mappers on a per-job basis: job.getConfiguration().set("mapred.max.split.size", String.valueOf(splitSize));
  38. Task Allocation: 37 hours to complete → 14 hours. From 1 → 40 reducers
  39. Partitioners: 14 hours to complete
  40. Partitioners: 14 hours to complete. ~50KB ~500MB
  41. InputSampler.Sampler<IntWritable, Text> sampler = new InputSampler.RandomSampler<IntWritable, Text>(...); InputSampler.writePartitionFile(conf, sampler); conf.setPartitionerClass(TotalOrderPartitioner.class); Source: http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/
  42. Partitioners: 14 hours to complete → 2 hours. Evenly distributed
  43. Mahout's Performance [chart: Normalised Amazon Hours (0 to 7K) vs. No. Good Recommendations/10 (0 to 3)] ➔ Orig. item-based: 6.5K, 1.5 ➔ Cust. item-based: 2.4K, 1.5 ➔ Difference: -4.1K (63%)
  44. item.RecommenderJob: 1. Prep. pref. matrix (1-3) 2. Gen. sim. matrix (4-6) 3. Multiply matrices (7-10). [Diagram: All User Preferences (item x user); Item Similarity (item x item) with rows 2 1 0 0, 1 1 0 0, 0 0 2 2, 0 0 2 2, X A User's Preferences (item x user) for Turing = Recommendations (item x user) for Turing]
  45. user.RecommenderJob (in place of item.RecommenderJob): 1. Prep. pref. matrix (1-3) 2. Gen. sim. matrix (4-6) 3. Multiply matrices (7-10). [Diagram: as the item-based diagram, with Item Similarity (item x item) replaced by User Similarity (user x user), Researchers by Researchers]
  46. Mahout's Performance [chart: Normalised Amazon Hours (0 to 7K) vs. No. Good Recommendations/10 (0 to 3)] ➔ Orig. item-based: 6.5K, 1.5 ➔ Cust. item-based: 2.4K, 1.5 ➔ Orig. user-based: 1K, 2.5
  47. Mahout's Performance [chart: Normalised Amazon Hours (0 to 7K) vs. No. Good Recommendations/10 (0 to 3)] ➔ Orig. item-based: 6.5K, 1.5 ➔ Cust. item-based: 2.4K, 1.5 ➔ Orig. user-based: 1K, 2.5 ➔ Cust. item-based to orig. user-based: -1.4K (58%), +1 (67%)
  48. Mahout's Performance [chart: Normalised Amazon Hours (0 to 7K) vs. No. Good Recommendations/10 (0 to 3)] ➔ Orig. item-based: 6.5K, 1.5 ➔ Cust. item-based: 2.4K, 1.5 ➔ Orig. user-based: 1K, 2.5 ➔ Cust. user-based: 0.3K, 2.5
  49. Mahout's Performance [chart: Normalised Amazon Hours (0 to 7K) vs. No. Good Recommendations/10 (0 to 3)] ➔ Orig. item-based: 6.5K, 1.5 ➔ Cust. item-based: 2.4K, 1.5, a difference of -4.1K (63%) ➔ Orig. user-based: 1K, 2.5 ➔ Cust. user-based: 0.3K, 2.5, a difference of -0.7K (70%)
  50. Mahout's Performance [chart: Normalised Amazon Hours (0 to 7K) vs. No. Good Recommendations/10 (0 to 3)] ➔ Orig. item-based: 6.5K, 1.5 ➔ Cust. item-based: 2.4K, 1.5 ➔ Orig. user-based: 1K, 2.5 ➔ Cust. user-based: 0.3K, 2.5 ➔ Orig. item-based to cust. user-based: -6.2K (95%), +1 (67%)
  51. Conclusions
  52. Conclusions ➔ Mahout is doing a great job of powering Mendeley Suggest ➔ Large scale data set ➔ Excellent for batch processing requirements ➔ We'll soon be feeding our user-based implementation into Mahout ➔ User-based can outperform item-based ➔ Makes Mahout's offering more rounded ➔ Save resources and money by understanding your data ➔ Help Hadoop with task allocation if necessary ➔ Partition your data appropriately
  53. We're Hiring! ➔ Hadoop Data Architect ➔ design a coherent data model across the company ➔ take ownership of our data ➔ hands-on Hadoop administration ➔ Marie Curie Senior Research Fellow ➔ ensure that Mendeley's research catalogue is of high quality ➔ research and development opportunity ➔ £500 Finder's Fee if you find someone who we hire ➔ http://www.mendeley.com/careers/
  54. www.mendeley.com
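
Slides 16 to 23 describe item-based recommendation as an item similarity matrix multiplied by a user's preferences. The following is a minimal single-machine sketch of that idea, not Mahout's distributed RecommenderJob, which also supports other similarity measures. The class and method names are illustrative. The PREFS matrix is an assumption: the transcript does not preserve who read which article, so the values are chosen only so that a plain co-occurrence count reproduces the item similarity matrix on slide 22.

```java
import java.util.Arrays;

public class ItemBasedSketch {
    // Rows: Comp Sci 1, Comp Sci 2, Physics 1, Physics 2.
    // Columns: Turing, Babbage, Einstein, Newton.
    // ASSUMPTION: hypothetical 0/1 preferences, picked to match slide 22.
    public static final int[][] PREFS = {
        {1, 1, 0, 0}, // Comp Sci 1
        {1, 0, 0, 0}, // Comp Sci 2
        {0, 0, 1, 1}, // Physics 1
        {0, 0, 1, 1}, // Physics 2
    };

    // Item similarity as co-occurrence counts: S = P * P^T (item x item).
    public static int[][] cooccurrence(int[][] prefs) {
        int n = prefs.length;
        int[][] sim = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int u = 0; u < prefs[i].length; u++)
                    sim[i][j] += prefs[i][u] * prefs[j][u];
        return sim;
    }

    // Scores for one user: S multiplied by that user's preference column.
    public static int[] scores(int[][] sim, int[][] prefs, int user) {
        int n = sim.length;
        int[] out = new int[n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                out[i] += sim[i][j] * prefs[j][user];
        return out;
    }

    // Best-scoring article the user does not already have, or -1 if none.
    public static int recommend(int[][] sim, int[][] prefs, int user) {
        int[] s = scores(sim, prefs, user);
        int best = -1;
        for (int i = 0; i < s.length; i++) {
            if (prefs[i][user] != 0 || s[i] <= 0) continue;
            if (best < 0 || s[i] > s[best]) best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        int[][] sim = cooccurrence(PREFS);
        System.out.println(Arrays.deepToString(sim));
        System.out.println(Arrays.toString(scores(sim, PREFS, 1)));
        System.out.println(recommend(sim, PREFS, 1));
    }
}
```

Under these assumed preferences, Babbage (column 1) has only Comp Sci 1, so his scores are the first column of the similarity matrix and the only unseen article with a positive score is Comp Sci 2.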

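
Slides 39 to 42 report a drop from 14 hours to 2 hours once the data was evenly distributed across partitions, using a sampler together with TotalOrderPartitioner. The sketch below illustrates the underlying idea in plain Java with no Hadoop dependency: sample the keys, take split points at the sample's quantiles, and route each key by binary search over those split points. It is not Hadoop's implementation. The skewed key generator, the class and method names, and all sizes are assumptions made purely for illustration, and the equal-width baseline is a stand-in for a partitioning scheme that ignores the data.

```java
import java.util.Arrays;
import java.util.Random;

public class SampledPartitionSketch {
    // Equal-width split points over [0, maxKey): ignores where keys fall.
    public static int[] equalWidthSplits(int maxKey, int partitions) {
        int[] splits = new int[partitions - 1];
        for (int i = 0; i < splits.length; i++)
            splits[i] = (int) ((long) maxKey * (i + 1) / partitions);
        return splits;
    }

    // Split points taken at the quantiles of a random sample of the keys.
    public static int[] sampledSplits(int[] keys, int sampleSize, int partitions, long seed) {
        Random rnd = new Random(seed);
        int[] sample = new int[sampleSize];
        for (int i = 0; i < sampleSize; i++)
            sample[i] = keys[rnd.nextInt(keys.length)];
        Arrays.sort(sample);
        int[] splits = new int[partitions - 1];
        for (int i = 0; i < splits.length; i++)
            splits[i] = sample[sampleSize * (i + 1) / partitions];
        return splits;
    }

    // Range partitioning: binary-search the sorted split points.
    public static int partitionOf(int key, int[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    // Number of keys that land in each partition.
    public static int[] partitionSizes(int[] keys, int[] splits) {
        int[] sizes = new int[splits.length + 1];
        for (int k : keys) sizes[partitionOf(k, splits)]++;
        return sizes;
    }

    // ASSUMPTION: hypothetical skewed keys; cubing a uniform value crowds keys toward 0.
    public static int[] skewedKeys(int n, int maxKey, long seed) {
        Random rnd = new Random(seed);
        int[] keys = new int[n];
        for (int i = 0; i < n; i++) {
            double r = rnd.nextDouble();
            keys[i] = (int) (r * r * r * maxKey);
        }
        return keys;
    }

    public static int max(int[] a) {
        return Arrays.stream(a).max().getAsInt();
    }

    public static void main(String[] args) {
        int maxKey = 1_000_000, partitions = 4;
        int[] keys = skewedKeys(100_000, maxKey, 42L);
        int[] even = partitionSizes(keys, equalWidthSplits(maxKey, partitions));
        int[] sampled = partitionSizes(keys, sampledSplits(keys, 1_000, partitions, 7L));
        System.out.println("equal width: " + Arrays.toString(even));
        System.out.println("sampled:     " + Arrays.toString(sampled));
    }
}
```

The design point is that split points chosen from a sample follow the actual key distribution, so the largest partition, which determines how long the slowest reducer runs, shrinks towards the average size.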