Mendeley’s Research Catalogue: building it, opening it up and making it even more useful for researchers
Kris Jack, PhD

Presentation given at the Workshop on Academic-Industrial Collaborations for Recommender Systems 2013 (http://bit.ly/114XDsE), JCDL'13. A walk through Mendeley as a platform, the growing pains of engineering at a large scale, the data that we're making publicly available, and some demos that have come out of academic collaborations.

  1. Mendeley’s Research Catalogue: building it, opening it up and making it even more useful for researchers
     Kris Jack, PhD, Chief Data Scientist, @_krisjack
  2. Outline
     1. What’s Mendeley?
     2. Under the Bonnet
     3. Opening up Data
     4. Working with Academia
     5. Conclusions
  3. What's Mendeley?
  4. Mendeley’s not just a reference manager
  5. → Mendeley is a platform that connects researchers, research data and apps
     Mendeley Open API
  6. → Mendeley is a platform that connects researchers, research data and apps
     Mendeley Open API
     research catalogue
  7. Mendeley provides tools to help users...
     ...organise their research
     → Reference management
     → Cite-as-you-write
     → Full-text article search
     → Digitalised annotations
  8. Mendeley provides tools to help users...
     ...organise their research
     ...collaborate with one another
     → Professional research groups
     → Social network
     → Annotation sharing
  9. Mendeley provides tools to help users...
     ...organise their research
     ...collaborate with one another
     ...discover new research
     → Explore crowdsourced research catalogue
     → Document statistics
     → Personalised article recommendations
     → Related research
     → Research contact suggestions
  10. Mendeley provides tools to help users...
      ...organise their research
      ...collaborate with one another
      ...discover new research
  11. Mendeley provides tools to help users...
      ...organise their research
      ...collaborate with one another
      ...discover new research
  12. Our community from a data perspective
      Social network (>2.4M users)
      Research catalogue (~85M unique articles)
      Research groups (~240K groups)
      Personal libraries (>425M articles)
      Logging massive set of usage data
  13. Under the Bonnet
  14. Lots of features to build & support
      → Reference management
      → Cite-as-you-write
      → Full-text article search
      → Digitalised annotations
      → Professional research groups
      → Social network
      → Annotation sharing
      → Explore crowdsourced research catalogue
      → Document statistics
      → Personalised article recommendations
      → Related research
      → Research contact suggestions
  15. Lots of features to build & support
      → Reference management
      → Cite-as-you-write
      → Full-text article search
      → Digitalised annotations
      → Professional research groups
      → Social network
      → Annotation sharing
      → Explore crowdsourced research catalogue
      → Document statistics
      → Personalised article recommendations
      → Related research
      → Research contact suggestions
  16. Lots of features to build & support
      → Reference management
      → Cite-as-you-write
      → Full-text article search
      → Digitalised annotations
      → Professional research groups
      → Social network
      → Annotation sharing
      → Explore crowdsourced research catalogue
      → Document statistics
      → Personalised article recommendations
      → Related research
      → Research contact suggestions
  17. Lots of features to build & support
      features
  18. Lots of features to build & support
      features
      Research catalogue (~30M unique articles)
      Personal libraries (>100M articles)
  19. Lots of features to build & support
      features
      Research catalogue (~30M unique articles)
      Personal libraries (>100M articles)
      Crowdsourcing (deduplication, metadata aggregation, statistics)
  20. The curse of success
      • More articles came
      • More users came
      • Keeping catalogue data fresh was a burden
      • Algorithms relied on global counts
      • Iterating over MySQL tables was slow
      • Needed to shard tables to grow catalogue
      • In short, our backend system didn’t scale
  21. Please try again later
  22. ~0.5 million users; the 20 largest user bases:
      University of Cambridge, Stanford University, MIT, University of Michigan, Harvard University, University of Oxford, Sao Paulo University, Imperial College London, University of Edinburgh, Cornell University, University of California at Berkeley, RWTH Aachen, Columbia University, Georgia Tech, University of Wisconsin, UC San Diego, University of California at LA, University of Florida, University of North Carolina
      ~30m research articles
  23. ~0.5 million users; the 20 largest user bases:
      University of Cambridge, Stanford University, MIT, University of Michigan, Harvard University, University of Oxford, Sao Paulo University, Imperial College London, University of Edinburgh, Cornell University, University of California at Berkeley, RWTH Aachen, Columbia University, Georgia Tech, University of Wisconsin, UC San Diego, University of California at LA, University of Florida, University of North Carolina
      ~30m research articles
      The system started to become slow. How long did it take to generate our daily readership statistics?
  24. ~0.5 million users; the 20 largest user bases:
      University of Cambridge, Stanford University, MIT, University of Michigan, Harvard University, University of Oxford, Sao Paulo University, Imperial College London, University of Edinburgh, Cornell University, University of California at Berkeley, RWTH Aachen, Columbia University, Georgia Tech, University of Wisconsin, UC San Diego, University of California at LA, University of Florida, University of North Carolina
      ~30m research articles
      The system started to become slow. How long did it take to generate our daily readership statistics? 23 hours!
  25. We had serious needs
      • Build a catalogue based on billions of articles
      • Support many features that rely on the catalogue
      • Statistics
      • Search
      • Recommendations
      • Sharing
      • Data
      • Freshness
      • Consistency
      • Business context
      • Agile development (rapid prototyping)
      • Cost effective
      • Going viral
      • Technical debt stacking up
  26. Enter Hadoop
      What is Hadoop?
      The Apache Hadoop Project develops open-source software for reliable, scalable, distributed computing
      www.hadoop.apache.org
  27. Hadoop
      • Designed to operate on a cluster of computers
      • 1…thousands
      • Commodity hardware (low cost units)
      • Each node offers local computation and storage
      • Provides framework for working with big data (beyond petabytes)
  28. New tech stack for backend
      features
      Research catalogue (~30M unique articles)
      Personal libraries (>100M articles)
      Crowdsourcing (deduplication, metadata aggregation, statistics)
  29. New tech stack for backend
      features
      Research catalogue (~30M unique articles)
      Personal libraries (>100M articles)
      Crowdsourcing (deduplication, metadata aggregation, statistics)
      23 hr computations now took 15 minutes
  30. New tech stack for backend
      features
      Research catalogue (~30M unique articles)
      Personal libraries (>100M articles)
      Crowdsourcing (deduplication, metadata aggregation, statistics)
      recommended reading
  31. Mendeley Suggest
  32. Generating recommendations through matrix multiplication
      This is item-based recommendation, as similarity is based on items, not users
      org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
  33. Running on Amazon's Elastic MapReduce
      On-demand use and easy to cost
  34. Mahout's Performance
      [Chart: normalised Amazon hours (0 to 7K) against no. of good recommendations per 10 (0 to 3); quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]
      Orig. item-based: 6.5K, 1.5
  35. Mahout's Performance
      Orig. item-based: 6.5K, 1.5
      Cust. item-based → 2.4K, 1.5
  36. Mahout's Performance
      Orig. item-based: 6.5K, 1.5
      Cust. item-based → 2.4K, 1.5
      -4.1K (63%)
  37. Mahout's Performance
      Orig. item-based: 6.5K, 1.5
      Cust. item-based → 2.4K, 1.5
  38. Mahout's Performance
      Orig. item-based: 6.5K, 1.5
      Cust. item-based → 2.4K, 1.5
      Orig. user-based → 1K, 2.5
  39. Mahout's Performance
      Orig. item-based: 6.5K, 1.5
      Cust. item-based → 2.4K, 1.5
      Orig. user-based → 1K, 2.5
      -1.4K (58%), +1 (67%)
  40. Mahout's Performance
      Orig. item-based: 6.5K, 1.5
      Cust. item-based → 2.4K, 1.5
      Orig. user-based → 1K, 2.5
      Cust. user-based → 0.3K, 2.5
  41. Mahout's Performance
      Orig. item-based: 6.5K, 1.5
      Cust. item-based → 2.4K, 1.5
      Orig. user-based → 1K, 2.5
      Cust. user-based → 0.3K, 2.5
      -0.7K (70%), -4.1K (63%)
  42. Mahout's Performance
      Orig. item-based: 6.5K, 1.5
      Cust. item-based → 2.4K, 1.5
      Orig. user-based → 1K, 2.5
      Cust. user-based → 0.3K, 2.5
      -6.2K (95%), +1 (67%)
  43. Disclaimer: these advantages have costs
      • Migrating to a new system (data consistency)
      • Setup costs
      • Learn black magic to configure
      • Hardware for cluster
      • Administrative costs
      • High learning curve to administrate Hadoop
      • Still an immature technology
      • You may need to debug the source code
      • Developing against Mahout
      • Still needs lots of love
  44. Big data backend
      features
      Research catalogue (~30M unique articles)
      Personal libraries (>100M articles)
      Crowdsourcing (deduplication, metadata aggregation, statistics)
  45. Opening up Data
  46. Our community from a data perspective
      Social network (>2.4M users)
      Research catalogue (~85M unique articles)
      Research groups (~240K groups)
      Personal libraries (>425M articles)
      Logging massive set of usage data
  47. Challenge: Build an application with our data, make science more open.
      PLoS/Mendeley's Binary Battle
      More details at http://dev.mendeley.com/api-binary-battle/
  48. Challenge: Build off-line system for scientific recommendations with our API and DataTEL data set
      ScienceRec Challenge 2012
      More details at http://2012.recsyschallenge.com/tracks/sciencerec/
  49. Challenge: Build off-line system for scientific recommendations with our API and DataTEL data set
      ScienceRec Challenge 2012
      More details at http://2012.recsyschallenge.com/tracks/sciencerec/
  50. Challenge: Metadata Extraction Challenge
      The Next Challenge…?
  51. Working with Academia
  52. We have a history of academic collaborations
      Duration    Project
      2009-2011   MAKIN’IT
      2010-2014   TEAM
      2010-2011   DURA
      2012-2012   CSL Editor
      2012-2014   CODE
      2012-2014   ERASM
      2013-2015   EEXCESS
  53. Demo
      CSL Editor
      http://editor.citationstyles.org/
  54. Demo
      CODE Mendeley Desktop
      http://code-research.eu/results
  55. Demo
      Mendeley Labs
      http://labs.mendeley.com/
  56. We have a history of academic collaborations
      Duration    Project
      2009-2011   MAKIN’IT
      2010-2014   TEAM
      2010-2011   DURA
      2012-2012   CSL Editor
      2012-2014   CODE
      2012-2014   ERASM
      2013-2015   EEXCESS
      Want to collaborate?
  57. Conclusions
  58. Conclusions
      → Mendeley is far more than a reference manager – it’s a platform that connects researchers, data and apps
      → Starting small is good, but be prepared for the cost of scaling up
      → We’re opening up our data for you to build apps on our platform
      → We’re always keen to collaborate with academic groups
      Kris Jack, PhD, Chief Data Scientist, @_krisjack
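
Slide 32 describes generating item-based recommendations through matrix multiplication. The following is a minimal single-machine sketch of that idea, not Mahout's RecommenderJob itself: it assumes libraries are plain sets of article ids, uses raw co-occurrence counts as the item similarity, and every name in it (`cooccurrence`, `recommend`, the sample libraries) is hypothetical.

```python
from collections import defaultdict


def cooccurrence(libraries):
    """Count how often two articles sit in the same library.

    With A as the binary user-by-article matrix, these counts are the
    off-diagonal entries of A^T A (an assumed, simple similarity measure).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for articles in libraries.values():
        for a in articles:
            for b in articles:
                if a != b:
                    counts[a][b] += 1
    return counts


def recommend(user, libraries, top_n=3):
    """Multiply the co-occurrence matrix by the user's library vector.

    Articles the user already has are skipped; ties are broken by id.
    """
    counts = cooccurrence(libraries)
    owned = libraries[user]
    scores = defaultdict(int)
    for article in owned:
        for other, count in counts[article].items():
            if other not in owned:
                scores[other] += count
    return sorted(scores, key=lambda a: (-scores[a], a))[:top_n]


# Hypothetical toy data: three users and their personal libraries.
libraries = {
    "alice": {"paper_a", "paper_b"},
    "bob": {"paper_a", "paper_b", "paper_c"},
    "carol": {"paper_b", "paper_c", "paper_d"},
}

print(recommend("alice", libraries))
```

Similarity here depends only on which articles co-occur, which is what makes the approach item-based rather than user-based. At catalogue scale the same product would be computed as distributed jobs rather than in memory.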
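
The percentage changes shown on slides 36 to 42 can be checked with a few lines of arithmetic. This sketch assumes that each pair on the chart reads as (normalised Amazon hours, good recommendations per 10) and that "6.5K" means 6500; the helper name `percent_change` is made up for illustration.

```python
def percent_change(before, after):
    """Relative change from before to after, as a rounded percentage."""
    return round((after - before) / before * 100)


# (normalised Amazon hours, good recommendations per 10), read off the slides.
orig_item = (6500, 1.5)
cust_item = (2400, 1.5)
orig_user = (1000, 2.5)
cust_user = (300, 2.5)

print(percent_change(orig_item[0], cust_item[0]))  # cost, orig. item to cust. item
print(percent_change(cust_item[0], orig_user[0]))  # cost, cust. item to orig. user
print(percent_change(cust_item[1], orig_user[1]))  # quality, item to user
print(percent_change(orig_user[0], cust_user[0]))  # cost, orig. user to cust. user
print(percent_change(orig_item[0], cust_user[0]))  # cost, first to last
```

Under those assumptions the figures agree with the slide annotations: cost falls by 63%, 58%, 70% and 95% at the respective steps, while quality rises by 67%.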
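
Slides 20 to 29 mention algorithms that rely on global counts and a daily readership statistics job that dropped from 23 hours to 15 minutes after the move to Hadoop. The slides do not show that job, so the following is only an illustrative sketch of how a global count splits into map and reduce steps. It runs in a single process, and the function names and sample data are assumptions.

```python
from collections import defaultdict


def map_library(user_id, article_ids):
    """Map step: emit (article_id, 1) once per article in a user's library."""
    for article_id in set(article_ids):
        yield article_id, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values per article id."""
    totals = defaultdict(int)
    for article_id, value in pairs:
        totals[article_id] += value
    return dict(totals)


# Hypothetical personal libraries keyed by user id.
personal_libraries = {
    "u1": ["doc1", "doc2"],
    "u2": ["doc2", "doc3", "doc2"],  # a duplicate entry counts once
    "u3": ["doc2"],
}

emitted = [
    pair
    for user_id, article_ids in personal_libraries.items()
    for pair in map_library(user_id, article_ids)
]
readership = reduce_counts(emitted)
print(readership)
```

Because each library is mapped independently and the sums can be computed per article, the work partitions naturally across nodes that each hold part of the data, which is the property slide 27 highlights.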
