Mendeley’s Research Catalogue:
building it, opening it up and
making it even more useful for researchers
Kris Jack, PhD
Chief Data Scientist, @_krisjack
Outline
1. What’s Mendeley?
2. Under the Bonnet
3. Opening up Data
4. Working with Academia
5. Conclusions
What's Mendeley?
Mendeley’s not just a reference manager
→  Mendeley is a platform that connects researchers, research data and apps
Mendeley Open API
research catalogue
Mendeley provides tools to help users...
...organise their research
→  Reference management
→  Cite-as-you-write
→  Full-text article search
→  Digitalised annotations
...collaborate with one another
→  Professional research groups
→  Social network
→  Annotation sharing
...discover new research
→  Explore crowdsourced research catalogue
→  Document statistics
→  Personalised article recommendations
→  Related research
→  Research contact suggestions
Our community from a data perspective
Social network (>2.4M users)
Research catalogue (~85M unique articles)
Research groups (~240K groups)
Personal libraries (>425M articles)
Logging massive set of usage data
Under the Bonnet
Lots of features to build & support
→  Reference management
→  Cite-as-you-write
→  Full-text article search
→  Digitalised annotations
→  Professional research groups
→  Social network
→  Annotation sharing
→  Explore crowdsourced research catalogue
→  Document statistics
→  Personalised article recommendations
→  Related research
→  Research contact suggestions
Lots of features to build & support
features
Research catalogue (~30M unique articles)
Personal libraries (>100M articles)
Crowdsourcing (deduplication, metadata aggregation, statistics)
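The crowdsourcing step (deduplication, metadata aggregation, statistics) can be pictured with a small sketch. This is an illustration only, not Mendeley's pipeline: the entry schema (`user`, `title`, `year`), the title-normalisation rule in `dedup_key` and the majority-vote aggregation are all assumptions made for the example.

```python
from collections import Counter, defaultdict
import re


def dedup_key(title):
    """Hypothetical matching rule: lowercase the title and drop punctuation."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def build_catalogue(library_entries):
    """Merge per-user library entries into catalogue records.

    Each entry is a dict with 'user', 'title' and 'year' keys (assumed schema).
    """
    # Deduplication: entries with the same normalised title form one cluster.
    clusters = defaultdict(list)
    for entry in library_entries:
        clusters[dedup_key(entry["title"])].append(entry)

    catalogue = []
    for entries in clusters.values():
        # Metadata aggregation: keep the most common value of each field.
        title = Counter(e["title"] for e in entries).most_common(1)[0][0]
        year = Counter(e["year"] for e in entries).most_common(1)[0][0]
        # Statistics: readership is the number of distinct users.
        readers = len({e["user"] for e in entries})
        catalogue.append({"title": title, "year": year, "readers": readers})
    return catalogue


entries = [
    {"user": "u1", "title": "MapReduce: Simplified Data Processing", "year": 2004},
    {"user": "u2", "title": "MapReduce: simplified data processing.", "year": 2004},
    {"user": "u3", "title": "MapReduce: Simplified Data Processing", "year": 2008},
    {"user": "u3", "title": "The PageRank Citation Ranking", "year": 1999},
]
catalogue = build_catalogue(entries)
```

Here three slightly different library entries collapse into one catalogue record with three readers, and the minority year (2008) is outvoted. A real matcher would likely need fuzzier keys (authors, identifiers) than a normalised title.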
The curse of success
•  More articles came
•  More users came
•  Keeping catalogue data fresh was a burden
•  Algorithms relied on global counts
•  Iterating over MySQL tables was slow
•  Needed to shard tables to grow catalogue
•  In short, our backend system didn’t scale
Please try again later
~0.5 million users; the 20 largest user bases:
University of Cambridge
Stanford University
MIT
University of Michigan
Harvard University
University of Oxford
Sao Paulo University
Imperial College London
University of Edinburgh
Cornell University
University of California at Berkeley
RWTH Aachen
Columbia University
Georgia Tech
University of Wisconsin
UC San Diego
University of California at LA
University of Florida
University of North Carolina
~30m research articles
The system started to become slow.
How long did it take to generate our daily readership statistics?
23 hours!
We had serious needs
•  Build a catalogue based on billions of articles
•  Support many features that rely on the catalogue
    •  Statistics
    •  Search
    •  Recommendations
    •  Sharing
•  Data
    •  Freshness
    •  Consistency
•  Business context
    •  Agile development (rapid prototyping)
    •  Cost effective
    •  Going viral
    •  Technical debt stacking up
Enter Hadoop
What is Hadoop?
The Apache Hadoop Project develops
open-source software for reliable,
scalable, distributed computing
www.hadoop.apache.org
Hadoop
•  Designed to operate on a cluster of computers
    •  1…thousands
    •  Commodity hardware (low cost units)
•  Each node offers local computation and storage
•  Provides framework for working with big data (beyond petabytes)
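To make the programming model concrete, here is a minimal single-machine sketch of a MapReduce-style readership count, the kind of global statistic mentioned earlier. It is plain Python rather than Hadoop code; the function names, the `libraries` layout and the sample data are assumptions for illustration. On a cluster, the map and reduce steps would run in parallel on the nodes that hold the data.

```python
from collections import defaultdict
from itertools import chain


def map_library(user_id, article_ids):
    """Map step: emit (article_id, 1) for each distinct article in one library."""
    return [(article_id, 1) for article_id in set(article_ids)]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(article_id, ones):
    """Reduce step: sum the ones to get the reader count for one article."""
    return article_id, sum(ones)


# Hypothetical personal libraries: user id -> article ids (u3 holds a3 twice).
libraries = {
    "u1": ["a1", "a2"],
    "u2": ["a1"],
    "u3": ["a1", "a3", "a3"],
}
mapped = chain.from_iterable(map_library(u, docs) for u, docs in libraries.items())
readership = dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())
```

Because each library is mapped independently and each article is reduced independently, neither step needs to scan a global table, which is what makes the job easy to spread over many machines.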
New tech stack for backend
features
Research catalogue (~30M unique articles)
Personal libraries (>100M articles)
Crowdsourcing (deduplication, metadata aggregation, statistics)
23 hr computations now took 15 minutes
recommended reading
Mendeley Suggest
Generating recommendations through matrix multiplication
These are item-based recommendations, as similarity is based on items, not users
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
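As a rough picture of the idea, the sketch below builds an item-item co-occurrence matrix from personal libraries and multiplies it by one user's library vector to score unseen articles. It is a toy, single-machine illustration in plain Python and not the Mahout job itself: the co-occurrence similarity, the function names and the sample libraries are assumptions for the example.

```python
def cooccurrence(libraries, items):
    """Item-item matrix: C[i][j] = number of libraries containing both i and j."""
    index = {item: n for n, item in enumerate(items)}
    matrix = [[0] * len(items) for _ in items]
    for docs in libraries.values():
        for a in docs:
            for b in docs:
                if a != b:
                    matrix[index[a]][index[b]] += 1
    return matrix


def recommend(user_docs, matrix, items, top_n=2):
    """Score = C x user vector; items the user already has are dropped."""
    user_vector = [1 if item in user_docs else 0 for item in items]
    scores = [sum(c * u for c, u in zip(row, user_vector)) for row in matrix]
    ranked = sorted(
        (item for item in items if item not in user_docs),
        key=lambda item: -scores[items.index(item)],
    )
    return ranked[:top_n]


items = ["a1", "a2", "a3", "a4"]
# Hypothetical personal libraries: user id -> set of article ids.
libraries = {
    "u1": {"a1", "a2"},
    "u2": {"a1", "a2", "a3"},
    "u3": {"a2", "a3"},
    "u4": {"a1", "a4"},
}
matrix = cooccurrence(libraries, items)
suggestions = recommend({"a1"}, matrix, items)
```

For a user who only holds a1, article a2 scores highest because it shares two libraries with a1. The similarity lives entirely in the item-item matrix, which is why this counts as item-based.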
Running on Amazon's Elastic MapReduce
On-demand use and easy to cost
Mahout's Performance
[Chart: normalised Amazon hours (0 to 7K) against no. of good recommendations out of 10 (0 to 3); quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]
Orig. item-based → 6.5K, 1.5
Cust. item-based → 2.4K, 1.5
Orig. user-based → 1K, 2.5
Cust. user-based → 0.3K, 2.5
Orig. to cust. item-based: -4.1K (63%)
Cust. item-based to orig. user-based: -1.4K (58%), +1 (67%)
Orig. to cust. user-based: -0.7K (70%)
Orig. item-based to cust. user-based: -6.2K (95%), +1 (67%)
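The savings quoted on the chart follow directly from the four plotted points. The short check below recomputes them; the dictionary keys and the `change` helper are names invented for this example, and the hours are written out in full (6.5K as 6500) to keep the arithmetic exact.

```python
# (normalised Amazon hours, good recommendations out of 10), as read off the chart
runs = {
    "orig_item": (6500, 1.5),
    "cust_item": (2400, 1.5),
    "orig_user": (1000, 2.5),
    "cust_user": (300, 2.5),
}


def change(before, after):
    """Return (hours saved, % hours saved, % gain in good recommendations)."""
    hours_before, good_before = runs[before]
    hours_after, good_after = runs[after]
    saved = hours_before - hours_after
    pct_saved = round(100 * saved / hours_before)
    pct_gain = round(100 * (good_after - good_before) / good_before)
    return saved, pct_saved, pct_gain


steps = {
    "item: orig -> cust": change("orig_item", "cust_item"),
    "cust item -> orig user": change("cust_item", "orig_user"),
    "user: orig -> cust": change("orig_user", "cust_user"),
    "overall": change("orig_item", "cust_user"),
}
```

Each percentage is relative to the starting point of its own step, so the individual savings (63%, 58%, 70%) do not add up to the overall 95%.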
Disclaimer: these advantages have costs
•  Migrating to a new system (data consistency)
•  Setup costs
•  Learn black magic to configure
•  Hardware for cluster
•  Administrative costs
•  Steep learning curve to administer Hadoop
•  Still an immature technology
•  You may need to debug the source code
•  Developing against Mahout
•  Still needs lots of love
Big data backend
features
Research catalogue (~30M unique articles)
Personal libraries (>100M articles)
Crowdsourcing (deduplication, metadata aggregation, statistics)
Opening up Data
PLoS/Mendeley's Binary Battle
Challenge: Build an application with our data, make science more open.
More details at http://dev.mendeley.com/api-binary-battle/
ScienceRec Challenge 2012
Challenge: Build off-line system for scientific recommendations with our API and DataTEL data set
More details at http://2012.recsyschallenge.com/tracks/sciencerec/
Challenge: Metadata Extraction Challenge
The Next Challenge…?
Working with Academia
We have a history of academic collaborations
Duration    Project
2009-2011   MAKIN’IT
2010-2014   TEAM
2010-2011   DURA
2012-2012   CSL Editor
2012-2014   CODE
2012-2014   ERASM
2013-2015   EEXCESS
Demo
CSL Editor
http://editor.citationstyles.org/
Demo
CODE Mendeley Desktop
http://code-research.eu/results
Demo
Mendeley Labs
http://labs.mendeley.com/
Want to collaborate?
Conclusions
→  Mendeley is far more than a reference manager: it’s a platform that connects researchers, data and apps
→  Starting small is good, but be prepared for the cost of scaling up
→  We’re opening up our data for you to build apps on our platform
→  We’re always keen to collaborate with academic groups
Kris Jack, PhD
Chief Data Scientist, @_krisjack
