I gave this talk to an MSc class on Semantic Technologies at the Technical University of Graz (TUG) on 2012/01/12.
It presents what recommendation systems are and how they are commonly used, before delving into how they are applied at Mendeley. Real-world results from Mendeley’s article recommendation system are also presented.
The work presented here has been partially funded by the European Commission as part of the TEAM IAPP project (grant no. 251514) within the FP7 People Programme (Marie Curie).
Mendeley: Recommendation Systems for Academic Literature
1. Mendeley: Recommendation Systems for Academic Literature
Kris Jack, PhD
Data Mining Team Lead
2. “All the time we are very conscious of the huge challenges that human society has now – curing cancer, understanding the brain for Alzheimer’s [...]. But a lot of the state of knowledge of the human race is sitting in the scientists’ computers, and is currently not shared […] We need to get it unlocked so we can tackle those huge problems.”
3. Overview
➔ what's a recommender and what does it look like?
➔ what's Mendeley?
➔ the secrets behind recommenders
➔ recommenders @ Mendeley
4. What's a recommender and what does it look like?
5. What's a recommender?
Definition: A recommendation system (recommender) is a subclass of information filtering system that aims to predict a user's interest in items.
11. What is Mendeley?
...a large data technology startup company
...and it's on a mission to change the way that research is done!
12. Mendeley ↔ Last.fm
Last.fm works like this:
1) Install “Audioscrobbler”
2) Listen to music
3) Last.fm builds your music profile and recommends music you could also like... and it's the world's biggest open music database
13. Mendeley ↔ Last.fm
music libraries → research libraries
artists → researchers
songs → papers
genres → disciplines
15. Mendeley provides tools to help users...
...organise their research
...collaborate with one another
16. US National Academy of Engineering “Grand Challenges”:
Climate change
Sustainable food supplies
Clean energy
Clean water
Pandemic diseases
Artificial Intelligence
Terrorist violence
Tools of scientific discovery
17. Mendeley provides tools to help users...
...organise their research
...collaborate with one another
...discover new research
18.
19. Mendeley provides tools to help users...
...organise their research
...collaborate with one another
...discover new research
20. 1.4 million+ users; the 20 largest userbases:
University of Cambridge
Stanford University
MIT
University of Michigan
Harvard University
University of Oxford
Sao Paulo University
Imperial College London
University of Edinburgh
Cornell University
University of California at Berkeley
RWTH Aachen
Columbia University
Georgia Tech
University of Wisconsin
UC San Diego
University of California at LA
University of Florida
University of North Carolina
21. [Chart] Real-time data on 28m unique papers:
Thomson Reuters’ Web of Knowledge (dating from 1934): ~50m
Mendeley after 16 months: 28m
22. The secrets behind recommenders
Q1/2: How can a tool generate recommendations?
Q2/2: How can you measure the tool's performance?
23. Q1/2: How can a tool generate recommendations?
Content-based Filtering:
● Find items with characteristics (e.g. title, discipline) similar to what the user previously liked
● Techniques: TF-IDF, BM25, Bayesian classifiers, decision trees, artificial neural networks
● Quickly absorbs new items (overcomes the cold-start problem)
● Can make good recommendations from very few examples
Collaborative Filtering:
● Find items that users who are similar to you also liked (wisdom of the crowds)
● Techniques: user-based and item-based variations, matrix factorisation
● No need to understand item characteristics
● Tends to give more novel recommendations
Hybrid tools too...
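To make the content-based side of this comparison concrete, here is a minimal, self-contained sketch (not Mendeley's code) of tf-idf weighting plus cosine similarity, the kind of scoring that the techniques named above build on; the toy documents are invented for illustration:

```java
import java.util.*;

public class TfIdfSimilarity {
    // Turn each document into a tf-idf vector, then compare with cosine similarity.
    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("semantic", "web", "recommender"),
            Arrays.asList("collaborative", "filtering", "recommender"),
            Arrays.asList("semantic", "web", "ontology"));

        // document frequency: how many documents each term appears in
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs)
            for (String term : new HashSet<>(doc))
                df.merge(term, 1, Integer::sum);

        int n = docs.size();
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Double> vec = new HashMap<>();
            for (String term : doc)
                vec.merge(term, 1.0, Double::sum);            // raw term frequency
            vec.replaceAll((term, tf) ->
                tf * Math.log((double) n / df.get(term)));    // weight by idf
            vectors.add(vec);
        }

        System.out.printf("sim(doc0, doc1) = %.3f%n", cosine(vectors.get(0), vectors.get(1)));
        System.out.printf("sim(doc0, doc2) = %.3f%n", cosine(vectors.get(0), vectors.get(2)));
    }

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            na += e.getValue() * e.getValue();
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```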
24. Q2/2: How can you measure the tool's performance?
➔ Cross validation with hold-outs
● get yourself a good ground truth
● hide a fraction of your data from the system
● try to predict the hidden fraction from the remaining data
● calculate precision and recall
➔ Let users decide
● set up evaluations with real users (experimental)
● track tool usage by users
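As a rough illustration of the hold-out procedure above, here is a minimal sketch; the five-paper library and the recommendation list are invented stand-ins (a real evaluation would retrain the recommender on the visible fraction and average over many users):

```java
import java.util.*;

public class HoldoutEval {
    // Score a recommendation list against the items that were hidden from the system.
    static double[] precisionRecallAtK(Set<String> hidden, List<String> recommended, int k) {
        List<String> topK = recommended.subList(0, Math.min(k, recommended.size()));
        long hits = topK.stream().filter(hidden::contains).count();
        double precision = topK.isEmpty() ? 0 : (double) hits / topK.size();
        double recall = hidden.isEmpty() ? 0 : (double) hits / hidden.size();
        return new double[] {precision, recall};
    }

    public static void main(String[] args) {
        // ground truth: a user's full library
        List<String> library = new ArrayList<>(Arrays.asList(
            "paper-a", "paper-b", "paper-c", "paper-d", "paper-e"));
        Collections.shuffle(library, new Random(42));     // randomise before splitting

        int holdout = Math.max(1, library.size() / 5);    // hide ~20% of the library
        Set<String> hidden = new HashSet<>(library.subList(0, holdout));

        // stand-in for a recommender trained on the remaining, visible items
        List<String> recommended = Arrays.asList("paper-c", "paper-x", "paper-y");

        double[] pr = precisionRecallAtK(hidden, recommended, 3);
        System.out.printf("precision@3 = %.2f, recall = %.2f%n", pr[0], pr[1]);
    }
}
```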
25. Recommenders @ Mendeley
1) Related Research
● given 1 research article
● find other related articles
2) Personalised Recommendations
● given a user's profile (e.g. interests)
● find new articles of interest to them
26.
27. Use Case 1: Related Research
Strategy:
● content-based approach (tf-idf, with a Lucene implementation)
● search for articles with the same metadata (e.g. title, tags)
Evaluation:
● cross-validation with hold-outs on a ground truth data set
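The slide names tf-idf with a Lucene implementation; here is a minimal sketch of how such a lookup can be done with Lucene's MoreLikeThis (Lucene 5+-style API). The index path, field names, and seed document id are assumptions for illustration, not Mendeley's actual setup:

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;

public class RelatedResearch {
    public static void main(String[] args) throws Exception {
        IndexReader reader = DirectoryReader.open(
            FSDirectory.open(Paths.get("article-index")));   // hypothetical index path
        IndexSearcher searcher = new IndexSearcher(reader);

        MoreLikeThis mlt = new MoreLikeThis(reader);          // tf-idf "find similar" query builder
        mlt.setAnalyzer(new StandardAnalyzer());
        mlt.setFieldNames(new String[] {"title", "tag"});     // metadata fields to match on
        mlt.setMinTermFreq(1);                                // keep terms that occur only once
        mlt.setMinDocFreq(2);                                 // ignore terms seen in a single doc

        int seedDoc = 0;                                      // Lucene doc id of the seed article
        Query query = mlt.like(seedDoc);
        for (ScoreDoc hit : searcher.search(query, 6).scoreDocs) {
            if (hit.doc == seedDoc) continue;                 // the seed usually ranks first; skip it
            System.out.println(searcher.doc(hit.doc).get("title") + "  " + hit.score);
        }
        reader.close();
    }
}
```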
28.
29. Use Case 1: Related Research
Q2/2: What are our results?
[Chart: tf-idf Precision per Field when Field is Available; y-axis: precision @ 5, from 0 to 0.5; x-axis: metadata field (tag, abstract, mesh-term, title, general-keyword, author, keyword)]
Results 1) tags are the most informative field for finding related research
30. Use Case 1: Related Research
[Chart: tf-idf Precision for Field Combos when Field is Available; y-axis: precision @ 5, from 0 to 0.5; x-axis: metadata field(s) (tag, bestCombo = abstract+author+general-keyword+tag+title, abstract, mesh-term, title, general-keyword, author, keyword)]
Results 2) tags outperform combinations of fields
31. How does Mendeley use recommendation technologies?
2/2: Personalised Recommendations
● given a user's profile (e.g. interests)
● find new articles of interest to them
32.
33. Use Case 2: Personalised Recommendations
Strategy:
● collaborative filtering (item-based, with Apache Mahout)
● recommend articles to researchers that would interest them
Evaluation:
● cross-validation with hold-outs on a ground truth data set
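The slide names item-based collaborative filtering with Apache Mahout; here is a minimal sketch using Mahout's Taste API. The input file of userID,articleID pairs and the choice of Tanimoto similarity are assumptions; Tanimoto (Jaccard) suits boolean "in library / not in library" data:

```java
import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class PersonalisedRecommendations {
    public static void main(String[] args) throws Exception {
        // one "userID,articleID" line per library entry (boolean preference)
        DataModel model = new FileDataModel(new File("libraries.csv"));

        // item-item similarity over the boolean library-membership signal
        ItemSimilarity similarity = new TanimotoCoefficientSimilarity(model);
        GenericItemBasedRecommender recommender =
            new GenericItemBasedRecommender(model, similarity);

        // top 10 articles for user 42 that are not already in their library
        for (RecommendedItem item : recommender.recommend(42L, 10)) {
            System.out.println(item.getItemID() + "  score=" + item.getValue());
        }
    }
}
```

Item-based CF precomputes item-item similarities, which parallelises naturally; that fits the deck's later point about running the distributed version on EC2.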
34.
35. Use Case 2: Personalised Recommendations
Strategy:
● collaborative filtering (item-based, with Apache Mahout)
● recommend articles to researchers that would interest them
Evaluation:
● cross-validation with hold-outs on a ground truth data set
41. [Chart: Precision by Library Size; y-axis: precision at 10 articles, x-axis: number of articles in user library]
42. Test: 10-fold cross validation on 50,000 user libraries
So, results are comparable to a non-distributed recommender
Completely distributed, so it can easily run on EC2 within 24 hours...
43.
44. Conclusions
Summary
➔ Recommendations can be complementary to search
➔ They can help users to discover interesting items
➔ They can exploit item metadata (content-based)
➔ They can exploit the 'wisdom of the crowds' (CF)
45. Conclusions
Summary
➔ Crowd-sourced metadata can have a powerful informative value (e.g. article tags)
➔ Sometimes you need to let data grow
➔ Evaluations under lab conditions don't always predict real-world results well
➔ Recommenders don't just have to be about making money … remember where we started...?
46. “All the time we are very conscious of the huge challenges that human society has now – curing cancer, understanding the brain for Alzheimer’s [...]. But a lot of the state of knowledge of the human race is sitting in the scientists’ computers, and is currently not shared […] We need to get it unlocked so we can tackle those huge problems.”