Mendeley’s Research Catalogue: building it, opening it up and making it even more useful for researchers

Kris Jack, PhD
Chief Data Scientist, @_krisjack

Outline
1. What’s Mendeley?
2. Under the Bonnet
3. Opening up Data
4. Working with Academia
5. Conclusions

Mendeley’s not just a reference manager

→  Mendeley is a platform that connects researchers, research data and apps
Mendeley Open API

Mendeley Open API
research catalogue
è  Mendeley is a platform that connects
researchers, research data and apps

Mendeley provides tools to help users...
...organise their research
→  Reference management
→  Cite-as-you-write
→  Full-text article search
→  Digitalised annotations

...organise their research
...collaborate with one another
→  Professional research groups
→  Social network
→  Annotation sharing

...organise their research
...collaborate with one another
...discover new research
→  Explore crowdsourced research catalogue
→  Document statistics
→  Personalised article recommendations
→  Related research
→  Research contact suggestions

Our community from a data perspective
Social network (>2.4M users)
Research catalogue (~85M unique articles)
Research groups (~240K groups)
Personal libraries (>425M articles)
Logging a massive set of usage data

Lots of features to build & support
→  Reference management
→  Cite-as-you-write
→  Full-text article search
→  Digitalised annotations
→  Professional research groups
→  Social network
→  Annotation sharing
→  Explore crowdsourced research catalogue
→  Document statistics
→  Personalised article recommendations
→  Related research
→  Research contact suggestions

features
Research catalogue
Personal libraries (>100M articles)
Crowdsourcing (deduplication, metadata aggregation, statistics)
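
To illustrate how crowdsourcing can turn personal libraries into catalogue records, here is a minimal sketch in Python. It is not Mendeley's pipeline: the deduplication key (a normalised title), the majority-vote aggregation and the toy entries are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict


def dedup_key(title):
    """Hypothetical deduplication key: lowercase the title and drop every
    character that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]", "", title.lower())


def build_catalogue(library_entries):
    """Group (user_id, title, year) library entries into catalogue records.

    Entries that share a key are treated as the same article. The most
    common title and year win (metadata aggregation), and the number of
    distinct users holding the article is its readership statistic.
    """
    clusters = defaultdict(list)
    for user_id, title, year in library_entries:
        clusters[dedup_key(title)].append((user_id, title, year))

    catalogue = []
    for entries in clusters.values():
        titles = Counter(title for _, title, _ in entries)
        years = Counter(year for _, _, year in entries)
        catalogue.append({
            "title": titles.most_common(1)[0][0],
            "year": years.most_common(1)[0][0],
            "readers": len({user_id for user_id, _, _ in entries}),
        })
    return catalogue


# Toy data: three users hold slightly different copies of the same article.
entries = [
    ("u1", "A Survey of Recommender Systems", 2010),
    ("u2", "a survey of recommender systems.", 2010),
    ("u3", "A Survey of Recommender Systems", 2011),
    ("u1", "Scaling Reference Management", 2012),
]
catalogue = build_catalogue(entries)
by_title = {record["title"]: record for record in catalogue}
```

A real system would need a far more robust key (identifiers, authors, fuzzy matching), but the shape is the same: cluster, vote on metadata, count readers.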

The curse of success
•  More articles came
•  More users came
•  Keeping catalogue data fresh was a burden
•  Algorithms relied on global counts
•  Iterating over MySQL tables was slow
•  Needed to shard tables to grow catalogue
•  In short, our backend system didn’t scale

~0.5 million users; the 20 largest user bases:
University of Cambridge
Stanford University
MIT
University of Michigan
Harvard University
University of Oxford
Sao Paulo University
Imperial College London
University of Edinburgh
Cornell University
University of California at Berkeley
RWTH Aachen
Columbia University
Georgia Tech
University of Wisconsin
UC San Diego
University of California at LA
University of Florida
University of North Carolina
~30m research articles

The system started to become slow.
How long did it take to generate our daily readership statistics?
23 hours!

We had serious needs
•  Build a catalogue based on billions of articles
•  Support many features that rely on the catalogue
    •  Statistics
    •  Search
    •  Recommendations
    •  Sharing
•  Data
    •  Freshness
    •  Consistency
•  Business context
    •  Agile development (rapid prototyping)
    •  Cost effective
    •  Going viral
    •  Technical debt stacking up

Enter Hadoop
What is Hadoop?
The Apache Hadoop Project develops
open-source software for reliable,
scalable, distributed computing
www.hadoop.apache.org

Hadoop
•  Designed to operate on a cluster of computers
    •  1…thousands
    •  Commodity hardware (low cost units)
•  Each node offers local computation and storage
•  Provides framework for working with big data (beyond petabytes)
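
The programming model Hadoop is best known for is MapReduce. As a rough sketch of the idea, the Python below counts readers per article in three phases (map, shuffle, reduce) on a single machine. The function names and toy libraries are assumptions for illustration; on a real cluster the framework spreads the map and reduce calls across nodes and performs the shuffle itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(library):
    """Map: emit (article_id, 1) once for each distinct article in one
    user's library."""
    _user_id, article_ids = library
    return [(article_id, 1) for article_id in set(article_ids)]


def shuffle(pairs):
    """Shuffle: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(article_id, counts):
    """Reduce: sum the ones to get the reader count for one article."""
    return article_id, sum(counts)


def readership(libraries):
    """Run map, shuffle and reduce in sequence over all libraries."""
    mapped = chain.from_iterable(map_phase(lib) for lib in libraries)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Toy data: (user_id, [article_ids]) pairs standing in for personal libraries.
libraries = [
    ("u1", ["a1", "a2"]),
    ("u2", ["a2", "a3", "a2"]),
    ("u3", ["a2"]),
]
stats = readership(libraries)
```

Because each map call sees only one library and each reduce call sees only one article, both phases can run in parallel without a shared global counter.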

New tech stack for backend
features
Research catalogue
Personal libraries (>100M articles)
Crowdsourcing (deduplication, metadata aggregation, statistics)

23-hour computations now took 15 minutes

features
recommended reading

Generating recommendations through matrix multiplication
These are item-based recommendations, as similarity is based on items, not users
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
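
As a small illustration of item-based recommendation by matrix multiplication, the Python sketch below builds an item-by-item co-occurrence matrix from a binary user-by-item matrix and multiplies it back against each user's row. It is not Mahout's code: co-occurrence as the similarity measure, the helper names and the toy matrix are assumptions for the example.

```python
def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    b_columns = transpose(b)
    return [[sum(x * y for x, y in zip(row, column)) for column in b_columns]
            for row in a]


def recommend(interactions, top_n=1):
    """Recommend unseen items to each user.

    interactions is a users x items matrix of 0/1 flags. Item similarity is
    the number of users who hold both items (co-occurrence); a user's score
    for an item is the sum of its similarities to the items they hold.
    """
    similarity = matmul(transpose(interactions), interactions)  # items x items
    for i in range(len(similarity)):
        similarity[i][i] = 0  # an item should not recommend itself
    scores = matmul(interactions, similarity)  # users x items

    recommendations = []
    for owned, row in zip(interactions, scores):
        # Keep unseen items with a positive score; ties go to the lower index.
        candidates = [(score, -item) for item, score in enumerate(row)
                      if not owned[item] and score > 0]
        candidates.sort(reverse=True)
        recommendations.append([-neg_item for _, neg_item in candidates[:top_n]])
    return recommendations


# Toy data: 3 users x 4 articles.
interactions = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
]
recs = recommend(interactions)
```

The similarity matrix depends only on items, which is what makes the approach item-based: it can be computed once and reused for every user.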

Running on Amazon's Elastic MapReduce
On-demand use and easy to cost

Mahout's Performance
[Chart: Normalised Amazon Hours (0 to 7K) plotted against No. Good Recommendations/10 (0 to 3); quadrants labelled Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]

Algorithm           Normalised Amazon Hours   No. Good Recommendations/10
Orig. item-based    6.5K                      1.5
Cust. item-based    2.4K                      1.5
Orig. user-based    1K                        2.5
Cust. user-based    0.3K                      2.5

•  Orig. item-based → Cust. item-based: -4.1K hours (63%)
•  Cust. item-based → Orig. user-based: -1.4K hours (58%), +1 good recommendation (67%)
•  Orig. user-based → Cust. user-based: -0.7K hours (70%)
•  Orig. item-based → Cust. user-based: -6.2K hours (95%), +1 good recommendation (67%)
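
The percentages on the chart follow from the plotted points. The short Python check below recomputes them; the dictionary keys and function name are made up for the example, and the values are the (Normalised Amazon Hours in thousands, No. Good Recommendations/10) pairs read from the slides.

```python
def percent_change(before, after):
    """Relative change from before to after, as a whole-number percentage."""
    return round(100 * (after - before) / before)


# (hours in thousands, good recommendations out of 10), as plotted.
points = {
    "orig_item": (6.5, 1.5),
    "cust_item": (2.4, 1.5),
    "orig_user": (1.0, 2.5),
    "cust_user": (0.3, 2.5),
}

# Change in cost (hours) for each step shown on the chart.
item_tuning = percent_change(points["orig_item"][0], points["cust_item"][0])
switch_to_user = percent_change(points["cust_item"][0], points["orig_user"][0])
user_tuning = percent_change(points["orig_user"][0], points["cust_user"][0])
overall_cost = percent_change(points["orig_item"][0], points["cust_user"][0])

# Change in quality (good recommendations) from item-based to user-based.
overall_quality = percent_change(points["orig_item"][1], points["cust_user"][1])
```

Read together, the figures say the customised user-based approach cost about a twentieth of the original item-based run while producing one more good recommendation in ten.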

Disclaimer: these advantages have costs
•  Migrating to a new system (data consistency)
•  Setup costs
    •  Learn black magic to configure
    •  Hardware for cluster
•  Administrative costs
    •  Steep learning curve to administer Hadoop
•  Still an immature technology
    •  You may need to debug the source code
•  Developing against Mahout
    •  Still needs lots of love

Big data backend
features
Research catalogue
Personal libraries (>100M articles)
Crowdsourcing (deduplication, metadata aggregation, statistics)

Challenge: Build an application with our data, make science more open.
PLoS/Mendeley's Binary Battle
More details at http://dev.mendeley.com/api-binary-battle/

Challenge: Build an offline system for scientific recommendations with our API and the DataTEL data set
ScienceRec Challenge 2012
More details at http://2012.recsyschallenge.com/tracks/sciencerec/

Challenge: Metadata Extraction Challenge
The Next Challenge…?

We have a history of academic collaborations
Duration    Project
2009-2011   MAKIN’IT
2010-2014   TEAM
2010-2011   DURA
2012-2012   CSL Editor
2012-2014   CODE
2012-2014   ERASM
2013-2015   EEXCESS

Demo
CSL Editor
http://editor.citationstyles.org/

Demo
CODE Mendeley Desktop
http://code-research.eu/results

Demo
Mendeley Labs
http://labs.mendeley.com/

Want to collaborate?

Conclusions
è  Mendeley is far more than a reference manager – it‘s
a platform that connects researchers, data and apps
è  Starting small is good, but be prepared for the cost of
scaling up
è  We‘re opening up our data for you to build apps on
our platform
è  We‘re always keen to collaborate with academic
groups
Kris Jack, PhD
Chief Data Scientist, @_krisjack
