Presentation made on December 7th 2016 during ICADL'16
Full text can be found at http://link.springer.com/chapter/10.1007/978-3-319-49304-6_12
Extended version can be found at https://arxiv.org/abs/1609.01415
What papers should I cite from my reading list? User evaluation of a manuscri...Aravind Sesagiri Raamkumar
Long paper presented during the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2016)
A task-based scientific paper recommender system for literature review and ma...Aravind Sesagiri Raamkumar
My PhD oral defense presentation (as of Oct 3rd 2017)
The dissertation can be requested at this link https://www.researchgate.net/publication/323308750_A_task-based_scientific_paper_recommender_system_for_literature_review_and_manuscript_preparation
Navigation through citation network based on content similarity using cosine ...Salam Shah
The volume of scientific literature has grown rapidly over the past few decades; new topics and information are added in the form of articles, papers, text documents, web logs, and patents. This rapid growth has added tremendously to current and past knowledge; in the process, new topics have emerged, some topics have split into sub-topics, and other topics have merged into single topics. Manually searching for and selecting a topic in such a huge body of information is an expensive and labour-intensive task. To meet the emerging need for an automatic process to locate, organize, connect, and make associations among these sources, researchers have proposed different techniques that automatically extract components of information presented in various formats and organize or structure them. The data targeted for component extraction may take the form of text, video, or audio. Various algorithms structure the information, group similar information into clusters, and weight the clusters according to their importance. The organized, structured, and weighted data is then compared with other structures to find similarities. Semantic patterns can be found by employing visualization techniques that show similarity or relations between topics over time or with respect to a specific event. In this paper, we propose a model for citation networks based on the cosine similarity algorithm, which answers questions such as: how can documents be connected through citation and content similarity, and how can the resulting network be visualized and navigated?
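The content-similarity step the abstract refers to can be illustrated with a minimal cosine-similarity sketch over raw term-frequency vectors; this is a generic illustration, not the authors' implementation (which may use TF-IDF weighting or other preprocessing):

```python
from collections import Counter
from math import sqrt

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between the raw term-frequency vectors of two texts."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Two abstracts sharing terms score higher than unrelated ones.
sim = cosine_similarity("citation network analysis", "citation network visualization")
```

In a citation network, such pairwise scores can serve as edge weights between documents, complementing the explicit citation links.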
The goal of information retrieval (IR) is to provide users with those documents that will satisfy their information need. The information need can be understood as forming a pyramid, where only its peak is made visible by users in the form of a conceptual query.
Hot Topic Detection and Technology Trend Tracking for Patents utilizing Term ...Ly Nguyen
This paper proposes a methodology for identifying hot topics and tracking technology trends in the patent domain. The methodology uses frequency information in combination with the International Patent Classification (IPC) to capture semantic information on word categorization, in a way that has not previously been employed for topic detection and trend tracking. Term Frequency and Proportional Document Frequency (TF*PDF) is employed to detect hot topics from patents, and IPCs are used to calculate the semantic importance of terms based on the IPCs across which the terms are distributed. Aging Theory is also used to calculate the variation of trends over time. Four types of trends (very stable, stable, normal, and unstable) are defined and evaluated based on TF*PDF alone and on TF*PDF combined with Aging Theory. Experimental results show that for very stable trends, the combination of TF*PDF and Aging Theory achieves a precision of 0.976; for stable trends and all trends, TF*PDF achieves precisions of 0.959 and 0.84, respectively. By applying TF*PDF in consideration of semantic information, we also demonstrate a new criterion for weighting hot topics and tracking technology trends.
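The TF*PDF weighting named in the abstract (originally proposed by Bun and Ishizuka) can be sketched roughly as follows; this is a simplified illustration, assuming each "channel" is just a list of document strings, and it omits the paper's IPC-based semantic weighting and the Aging Theory component:

```python
from collections import Counter
from math import exp, sqrt

def tf_pdf(channels: list[list[str]]) -> dict[str, float]:
    """TF*PDF weight per term: over each channel, the channel-normalized
    term frequency times exp(proportional document frequency), summed
    across channels. Each channel is a list of document strings."""
    weights: Counter = Counter()
    for docs in channels:
        n_docs = len(docs)
        tf = Counter()  # term frequency within the channel
        df = Counter()  # number of documents in the channel containing the term
        for doc in docs:
            terms = doc.lower().split()
            tf.update(terms)
            df.update(set(terms))
        norm = sqrt(sum(f * f for f in tf.values())) or 1.0
        for term, f in tf.items():
            weights[term] += (f / norm) * exp(df[term] / n_docs)
    return dict(weights)

# Terms that are frequent and spread across many documents get boosted.
hot = tf_pdf([["patent topic trend", "patent analysis"], ["topic trend tracking"]])
```

The exponential term is what distinguishes TF*PDF from plain TF-IDF: document frequency *increases* a term's weight, so widely discussed terms surface as hot topics rather than being damped.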
An introduction to system-oriented evaluation in Information RetrievalMounia Lalmas-Roelleke
Slides for my lecture on IR evaluation, presented at 11th European Summer School in Information Retrieval (ESSIR 2017) at Universitat Pompeu Fabra, Barcelona.
These slides were based on:
1. Evaluation lecture @ QMUL; Thomas Roelleke & Mounia Lalmas
2. Lecture 8: Evaluation @ Stanford University; Pandu Nayak & Prabhakar Raghavan
3. Retrieval Evaluation @ University of Virginia; Hongning Wang
4. Lectures 11 and 12 on Evaluation @ Berkeley; Ray Larson
5. Evaluation of Information Retrieval Systems @ Penn State University; Lee Giles
Textbooks:
1. Information Retrieval, 2nd edition, C.J. van Rijsbergen (1979)
2. Introduction to Information Retrieval, C.D. Manning, P. Raghavan & H. Schuetze (2008)
3. Modern Information Retrieval: The Concepts and Technology behind Search, 2nd ed; R. Baeza-Yates & B. Ribeiro-Neto (2011)
A Federated Search Approach to Facilitate Systematic Literature Review in Sof...ijseajournal
To have an impact on industry, researchers developing technologies in academia need to provide tangible evidence of the advantages of using them. Systematic Literature Review (SLR) has become a prominent methodology in evidence-based research. Although the adoption of SLRs in software engineering is not yet widespread in practice, it has resulted in valuable research and is becoming more common. However, digital libraries and scientific databases, the best research resources, do not provide sufficient mechanisms for SLRs, especially in software engineering. Moreover, any loss of data may change the results of an SLR and lead to research bias. Accordingly, the search process and evidence collection are critical points of an SLR. This paper provides some tips to enhance the SLR process. The main contribution of this work is a federated search tool that provides an automatic, integrated search mechanism over well-known software engineering databases. The results of a case study show that this approach not only reduces the time required to do an SLR and facilitates its search process, but also improves its reliability, resulting in an increasing trend to use SLRs.
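The integrated search mechanism described above can be sketched in miniature; the source names and search functions below are hypothetical stand-ins, not the paper's actual tool:

```python
from concurrent.futures import ThreadPoolExecutor

def federated_search(query, sources):
    """Query several digital libraries in parallel and merge the results,
    deduplicating by normalized title. `sources` maps a source name to a
    search function that returns a list of {'title': ..., 'year': ...} dicts."""
    with ThreadPoolExecutor() as pool:
        result_lists = pool.map(lambda fn: fn(query), sources.values())
    merged, seen = [], set()
    for results in result_lists:
        for record in results:
            # Normalize whitespace and case so the same paper found in
            # two databases is counted once.
            key = " ".join(record["title"].lower().split())
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged
```

Deduplication matters for SLRs in particular: the same primary study indexed in several databases would otherwise be screened (and possibly counted) more than once.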
Europe PMC has implemented a section tagging pipeline that automatically classifies scientific article sections into predefined classes.
Şenay Kafkas will present this work during the ContentMine workshop at EBI on 6th October 2014.
PhD thesis defense.
This manuscript describes a methodology designed and implemented to realise the recommendation of vocabularies based on the content of a given website. The goal of the proposed approach is to generate vocabularies by reusing existing schemas. The automatic recommendation helps turn websites into self-described web entities in the Web of Data, understandable by both humans and machines. The implemented approach is wrapped within a broader methodology for turning a website into a machine-understandable node using technologies developed within the scope of the Semantic Web vision. Transforming a website into a machine-understandable entity is the first step required on the website's side to narrow the gap with web agents and enable structured content consumption without implementing an Application Programming Interface (API) that would provide read-write functionality. The motivation of the thesis stems from the fact that, in most cases, the data provided via an API is already presented on the corresponding website.
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...Angelo Salatino
Classifying research papers according to their research topics is an important task to improve their retrievability, assist the creation of smart analytics, and support a variety of approaches for analysing and making sense of the research environment. In this paper, we present the CSO Classifier, a new unsupervised approach for automatically classifying research papers according to the Computer Science Ontology (CSO), a comprehensive ontology of research areas in the field of Computer Science. The CSO Classifier takes as input the metadata associated with a research paper (title, abstract, keywords) and returns a selection of research concepts drawn from the ontology. The approach was evaluated on a gold standard of manually annotated articles yielding a significant improvement over alternative methods.
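The general idea of matching paper metadata against ontology concept labels can be illustrated with a toy n-gram matcher; this is not the actual CSO Classifier (which uses more sophisticated syntactic and semantic matching), just a hedged sketch of the input/output shape:

```python
def match_concepts(metadata: dict, ontology_labels: set[str], max_n: int = 3) -> set[str]:
    """Toy illustration: return ontology concept labels that appear verbatim
    as n-grams (up to max_n words) in a paper's title, abstract, or keywords."""
    text = " ".join(metadata.get(field, "") for field in ("title", "abstract", "keywords"))
    tokens = text.lower().split()
    found = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram in ontology_labels:
                found.add(gram)
    return found

labels = {"deep learning", "ontology matching"}
paper = {"title": "Deep learning for ontology matching", "abstract": "", "keywords": ""}
concepts = match_concepts(paper, labels)  # finds {'deep learning', 'ontology matching'}
```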
Detection of Embryonic Research Topics by Analysing Semantic Topic NetworksAngelo Salatino
Being aware of new research topics is an important asset for anybody involved in the research environment, including researchers, academic publishers and institutional funding bodies. In recent years, the amount of scholarly data available on the web has increased steadily, allowing the development of several approaches for detecting emerging research topics and assessing their trends. However, current methods focus on the detection of topics which are already associated with a label or a substantial number of documents. In this paper, we address instead the issue of detecting embryonic topics, which do not yet possess these characteristics. We suggest that it is possible to forecast the emergence of novel research topics even at such an early stage and demonstrate that the emergence of a new topic can be anticipated by analysing the dynamics of pre-existing topics. We present an approach to evaluate such dynamics and an experiment on a sample of 3 million research papers, which confirms our hypothesis. In particular, we found that the pace of collaboration in sub-graphs of topics that will give rise to novel topics is significantly higher than in the control group.
Systematic Literature Reviews and Systematic Mapping Studiesalessio_ferrari
Lecture slides on Systematic Literature Reviews and Systematic Mapping Studies in software engineering. It describes the different steps, discusses differences between the two methods, and gives guidelines on how to conduct these types of study.
Managing Ireland's Research Data - 3 Research MethodsRebecca Grant
Slides providing an overview of the research methods used in the author's thesis, "Managing Ireland's Research Data: Recognising Roles for Recordkeepers". The methods discussed are online surveys, comparative case studies, and autoethnography.
Licensed as CC-BY.
Presentation - Systematic Review | Futures Studies TechniquesIgor Sampaio
Presentation of a rapid systematic review of futures-studies techniques, carried out for the course "Estudos do Futuro" (Futures Studies).
Centro de Informática - UFPE - 2016.2
A Model of Decision Support System for Research Topic Selection and Plagiaris...theijes
The paper proposes a model of a decision support system for choosing a research topic in academia. The biggest challenge for a student in the field of research is to identify an area and topic of research. The paper explains a model that helps students identify the most suitable area and/or topic for academic research. The model is also designed to assist supervisors in exploring the latest areas of research, as well as to avoid unintentional plagiarism. The model lets the user select either a keyword-based topic search or a questionnaire-based topic search, and uses a local database and the service of a metasearch engine in the decision-making activity.
This introductory lecture for IA377 will be devoted to the topic of “Literature Review”.
What is a literature review?
Methodology, best practices, tips, tools, etc.
Practical example
Application to IA377 seminar activities.
https://ia377-feec-unicamp.github.io/classes/2023/03/09/Literature-Review.html
Measuring the Outreach Efforts of Public Health Authorities and the Public Re...Aravind Sesagiri Raamkumar
JMIR paper presented during the Annual ID Symposium conducted by the Saw Swee Hock School of Public Health (National University of Singapore)
Main paper accessible at https://www.jmir.org/2020/5/e19334/
Presentation made during the Intelligent User-Adapted Interfaces: Design and Multi-Modal Evaluation (IUadaptME) workshop, conducted as part of UMAP 2018
Evolution and state-of-the art of Altmetric research: Insights from network a...Aravind Sesagiri Raamkumar
Evolution and state-of-the art of Altmetric research: Insights from network analysis and altmetric analysis
Authors: Hiran Lathabai, Thara Prabhakaran, Manoj Changat
Workshop Website: http://www.altmetrics.ntuchess.com/AROSIM2018/
Scientometric Analysis of Research Performance of African Countries in select...Aravind Sesagiri Raamkumar
Scientometric Analysis of Research Performance of African Countries in selected subjects within the field of Science and Technology
Author: Yusuff Utieyineshola
Workshop Website: http://www.altmetrics.ntuchess.com/AROSIM2018/
New Dialog, New Services with Altmetrics: Lingnan University Library ExperienceAravind Sesagiri Raamkumar
New Dialog, New Services with Altmetrics: Lingnan University Library Experience
Authors: Sze Lui, Sheila Cheung, Cindy Kot, Kammy Chan
Workshop Website: http://www.altmetrics.ntuchess.com/AROSIM2018/
Field-weighting readership: how does it compare to field-weighting citations?
Authors: Sarah Huggett, Eleonora Palmaro, Christopher James
Workshop Website: http://www.altmetrics.ntuchess.com/AROSIM2018/
How do Scholars Evaluate and Promote Research Outputs? An NTU Case Study
Authors: Han Zheng, Mojisola Erdt, Yin-Leng Theng
Workshop Website: http://www.altmetrics.ntuchess.com/AROSIM2018/
Monitoring the broad impact of the journal publication output on country leve...Aravind Sesagiri Raamkumar
Monitoring the broad impact of the journal publication output on country level: A case study for Austria
Authors: Juan Gorraiz, Benedikt Blahous, Martin Wieland
Workshop Website: http://www.altmetrics.ntuchess.com/AROSIM2018/
A Comparative Investigation on Citation Counts and Altmetrics between Papers ...Aravind Sesagiri Raamkumar
A Comparative Investigation on Citation Counts and Altmetrics between Papers Authored by Universities and Companies in the Research Field of Artificial Intelligence
Authors: Feiheng Luo, Han Zheng, Mojisola Erdt, Aravind Sesagiri Raamkumar, Yin-Leng Theng
Workshop Website: http://www.altmetrics.ntuchess.com/AROSIM2018/
Presentation deck prepared for the paper 'Object Recognition-based Mnemonics Mobile App for Senior Adults Communication', to be presented during the ICCCNT'15 conference
Globus Connect Server Deep Dive - GlobusWorld 2024Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
top nidhi software solution freedownloadvrstrong314
This presentation emphasizes the importance of data security and legal compliance for Nidhi companies in India. It highlights how online Nidhi software solutions, like Vector Nidhi Software, offer advanced features tailored to these needs. Key aspects include encryption, access controls, and audit trails to ensure data security. The software complies with regulatory guidelines from the MCA and RBI and adheres to Nidhi Rules, 2014. With customizable, user-friendly interfaces and real-time features, these Nidhi software solutions enhance efficiency, support growth, and provide exceptional member services. The presentation concludes with contact information for further inquiries.
Check out the webinar slides to learn more about how XfilesPro transforms Salesforce document management by leveraging its world-class applications. For more details, please connect with sales@xfilespro.com
If you want to watch the on-demand webinar, please click here: https://www.xfilespro.com/webinars/salesforce-document-management-2-0-smarter-faster-better/
Enterprise Resource Planning System includes various modules that reduce any business's workload. Additionally, it organizes the workflows, which drives towards enhancing productivity. Here are a detailed explanation of the ERP modules. Going through the points will help you understand how the software is changing the work dynamics.
To know more details here: https://blogs.nyggs.com/nyggs/enterprise-resource-planning-erp-system-modules/
Into the Box Keynote Day 2: Unveiling amazing updates and announcements for modern CFML developers! Get ready for exciting releases and updates on Ortus tools and products. Stay tuned for cutting-edge innovations designed to boost your productivity.
Developing Distributed High-performance Computing Capabilities of an Open Sci...Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamtakuyayamamoto1800
In this slide, we show the simulation example and the way to compile this solver.
In this solver, the Helmholtz equation can be solved by helmholtzFoam. Also, the Helmholtz equation with uniformly dispersed bubbles can be simulated by helmholtzBubbleFoam.
First Steps with Globus Compute Multi-User EndpointsGlobus
In this presentation we will share our experiences around getting started with the Globus Compute multi-user endpoint. Working with the Pharmacology group at the University of Auckland, we have previously written an application using Globus Compute that can offload computationally expensive steps in the researcher's workflows, which they wish to manage from their familiar Windows environments, onto the NeSI (New Zealand eScience Infrastructure) cluster. Some of the challenges we have encountered were that each researcher had to set up and manage their own single-user globus compute endpoint and that the workloads had varying resource requirements (CPUs, memory and wall time) between different runs. We hope that the multi-user endpoint will help to address these challenges and share an update on our progress here.
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...informapgpstrackings
Keep tabs on your field staff effortlessly with Informap Technology Centre LLC. Real-time tracking, task assignment, and smart features for efficient management. Request a live demo today!
For more details, visit us : https://informapuae.com/field-staff-tracking/
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...Juraj Vysvader
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I didn't get rich from it but it did have 63K downloads (powered possible tens of thousands of websites).
Code reviews are vital for ensuring good code quality. They serve as one of our last lines of defense against bugs and subpar code reaching production.
Yet, they often turn into annoying tasks riddled with frustration, hostility, unclear feedback and lack of standards. How can we improve this crucial process?
In this session we will cover:
- The Art of Effective Code Reviews
- Streamlining the Review Process
- Elevating Reviews with Automated Tools
By the end of this presentation, you'll have the knowledge on how to organize and improve your code review proces
Experience our free, in-depth three-part Tendenci Platform Corporate Membership Management workshop series! In Session 1 on May 14th, 2024, we began with an Introduction and Setup, mastering the configuration of your Corporate Membership Module settings to establish membership types, applications, and more. Then, on May 16th, 2024, in Session 2, we focused on binding individual members to a Corporate Membership and Corporate Reps, teaching you how to add individual members and assign Corporate Representatives to manage dues, renewals, and associated members. Finally, on May 28th, 2024, in Session 3, we covered questions and concerns, addressing any queries or issues you may have.
For more Tendenci AMS events, check out www.tendenci.com/events
Enhancing Research Orchestration Capabilities at ORNL.pdfGlobus
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
How to Position Your Globus Data Portal for Success Ten Good PracticesGlobus
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
Prosigns: Transforming Business with Tailored Technology Solutions (Prosigns)
Unlocking Business Potential: Tailored Technology Solutions by Prosigns
Discover how Prosigns, a leading technology solutions provider, partners with businesses to drive innovation and success. Our presentation showcases our comprehensive range of services, including custom software development, web and mobile app development, AI & ML solutions, blockchain integration, DevOps services, and Microsoft Dynamics 365 support.
Custom Software Development: Prosigns specializes in creating bespoke software solutions that cater to your unique business needs. Our team of experts works closely with you to understand your requirements and deliver tailor-made software that enhances efficiency and drives growth.
Web and Mobile App Development: From responsive websites to intuitive mobile applications, Prosigns develops cutting-edge solutions that engage users and deliver seamless experiences across devices.
AI & ML Solutions: Harnessing the power of Artificial Intelligence and Machine Learning, Prosigns provides smart solutions that automate processes, provide valuable insights, and drive informed decision-making.
Blockchain Integration: Prosigns offers comprehensive blockchain solutions, including development, integration, and consulting services, enabling businesses to leverage blockchain technology for enhanced security, transparency, and efficiency.
DevOps Services: Prosigns' DevOps services streamline development and operations processes, ensuring faster and more reliable software delivery through automation and continuous integration.
Microsoft Dynamics 365 Support: Prosigns provides comprehensive support and maintenance services for Microsoft Dynamics 365, ensuring your system is always up-to-date, secure, and running smoothly.
Learn how our collaborative approach and dedication to excellence help businesses achieve their goals and stay ahead in today's digital landscape. From concept to deployment, Prosigns is your trusted partner for transforming ideas into reality and unlocking the full potential of your business.
Join us on a journey of innovation and growth. Let's partner for success with Prosigns.
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart... (Globus)
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users still consists of transferring data and applying computations on a different system. As part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined, on-demand data workflows that apply data reduction and data analysis operations to the large ESGF data archives and transfer only the resultant analysis products (e.g., visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
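A Globus flow is defined as a JSON state machine of action steps. As a hypothetical sketch only (the endpoint UUIDs, paths, and action URLs below are placeholders, not the ESGF 2.0 project's actual flows), a transfer-then-analyze flow might be structured like this:

```python
# Hypothetical two-step flow definition: move a data subset to a compute
# system, then invoke an analysis action. All IDs/URLs are placeholders.
flow_definition = {
    "StartAt": "TransferSubset",
    "States": {
        "TransferSubset": {
            "Type": "Action",
            "ActionUrl": "https://transfer.actions.globus.org/transfer",
            "Parameters": {
                "source_endpoint": "SOURCE-ENDPOINT-UUID",
                "destination_endpoint": "COMPUTE-ENDPOINT-UUID",
                "DATA": [
                    {"source_path": "/esgf/subset/",
                     "destination_path": "/scratch/subset/"}
                ],
            },
            "ResultPath": "$.TransferResult",
            "Next": "RunAnalysis",
        },
        "RunAnalysis": {
            "Type": "Action",
            "ActionUrl": "https://compute.actions.globus.org/v2",
            "Parameters": {
                "endpoint": "COMPUTE-ENDPOINT-UUID",
                "function": "ANALYSIS-FUNCTION-UUID",
            },
            "ResultPath": "$.AnalysisResult",
            "End": True,
        },
    },
}
```

Once registered with the Flows service, such a definition can be run on demand against different archive paths, so only the small analysis outputs ever leave the data center.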
Proposing a Scientific Paper Retrieval and Recommender Framework
1. Proposing a Scientific Paper Retrieval and Recommender Framework
Aravind Sesagiri Raamkumar, Schubert Foo & Natalie Pang
Wee Kim Wee School of Communication and Information
Nanyang Technological University, Singapore
Presentation for ICADL’16, December 7th 2016
2. BACKGROUND
• Information Retrieval (IR) and Recommender Systems (RS) techniques have been used to find information objects for:
 - Scholarly Communication Lifecycle tasks
 - Literature Review (LR) search tasks
• Examples of such tasks include:
 - Building a reading list of research papers
 - Recommending similar papers based on seed papers
 - Recommending papers based on query logs
 - Serendipitous discovery of interesting papers
 - Recommending publication venues for manuscripts
 - Recommending papers based on citation context
 - Recommending co-authors for papers
 - and a few more…
3. BACKGROUND
Issues
• Proposed techniques and applications are piecemeal approaches
• A wide variety of algorithms and data fields were used in prior studies
What was done?
• A prototype system, Rec4LRW, was built to recommend papers for three tasks:
 1. Building a reading list of research papers
 2. Finding similar papers based on a set of papers
 3. Shortlisting papers from the final reading list for inclusion in a manuscript
• Task recommendation techniques were conceptualized on top of an identified set of base features
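Task 2 (finding similar papers based on a set of papers) is the kind of task commonly built on content-similarity measures such as cosine similarity. A minimal bag-of-words sketch for illustration only; the actual Rec4LRW techniques combine several base features and are described in the full paper:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between two texts using raw term counts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(count * b[term] for term, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

def rank_by_seed(seed_text, candidates):
    """Rank candidate papers (id -> text) by similarity to a seed paper."""
    scored = [(pid, cosine_similarity(seed_text, text))
              for pid, text in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In practice the seed would be a set of papers (a "seed basket"), with scores aggregated across seeds, and term weighting (e.g. TF-IDF) used instead of raw counts.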
7. REC4LRW SYSTEM EVALUATION
• An offline evaluation experiment and a user evaluation study were conducted to evaluate the Rec4LRW system
• An ACM DL extract of 103,739 articles published between 1951 and 2011 was used as the system's corpus
• Postgraduate research students, research staff and academic staff were recruited for the user evaluation study
 - Main entry criterion: participants should have authored at least one research paper
• Participants evaluated the task recommendations and the overall Rec4LRW system, choosing from a list of 43 topics
 - Online questionnaires were provided at the end of each task
9. USER STUDY PARTICIPANTS

Demographic Variable                                     Number of Participants
Position
  Student                                                62 (47%)
  Staff                                                  70 (53%)
Experience Level [Self-Reported]
  Beginner                                               15 (11.4%)
  Intermediate                                           61 (46.2%)
  Advanced                                               34 (25.8%)
  Expert                                                 22 (16.7%)
Discipline Category
  Engineering & Technology                               87 (65.9%)
  Social Sciences                                        42 (31.8%)
  Life Sciences & Medicine                               3 (2.3%)
Discipline
  Computer Science & Information Systems                 51 (38.6%)
  Library and Information Studies                        30 (22.7%)
  Electrical & Electronic Engineering                    30 (22.7%)
  Communication & Media Studies                          8 (6.1%)
  Mechanical, Aeronautical & Manufacturing Engineering   5 (3.8%)
  Biological Sciences                                    2 (1.5%)
  Statistics & Operational Research                      1 (0.8%)
  Education                                              1 (0.8%)
  Politics & International Studies                       1 (0.8%)
  Economics & Econometrics                               1 (0.8%)
  Civil & Structural Engineering                         1 (0.8%)
  Psychology                                             1 (0.8%)
10. DATA ANALYSIS PROCEDURES
Quantitative Data
• Ascertain the agreement percentages of the evaluation measures
• Logistic regression, t-tests and correlation tests
Qualitative Data
• Identify the top preferred and critical aspects of the tasks and the overall system
• Feedback responses were coded by a single coder using an inductive approach
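As an illustration of the quantitative steps above (toy data and a hand-rolled Welch's t statistic; not the study's actual analysis scripts or thresholds):

```python
import math
from statistics import mean, variance

def agreement_percentage(likert_ratings, threshold=4):
    """Percentage of Likert ratings at or above the agreement threshold
    (here, 4 = 'agree' on a 5-point scale; the cutoff is an assumption)."""
    return 100.0 * sum(r >= threshold for r in likert_ratings) / len(likert_ratings)

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples with unequal variances,
    e.g. comparing student vs. staff ratings of a measure."""
    va, vb = variance(sample_a), variance(sample_b)
    return (mean(sample_a) - mean(sample_b)) / math.sqrt(
        va / len(sample_a) + vb / len(sample_b))
```

The resulting t statistic would then be compared against the t distribution (with Welch-Satterthwaite degrees of freedom) to obtain a p-value.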
11. EMERGENT THEMES AND A FRAMEWORK
• Certain dominant themes were apparent from the qualitative feedback
• These themes were consolidated into a single framework: the Scientific Paper Retrieval and Recommender Framework (SPRRF)
Why do we need a framework?
• Most RS and IR studies are single-dimensional, i.e. algorithmic
• The overall context needs to be considered to provide a meaningful experience
• Framework generation is based on empirical data
• It will guide the next round of evaluation of the Rec4LRW system
12. THEMES (1-2)
Theme 1: Distinct User Groups
• Users who want more control
 - Participants required control features in the UI and expressed preferences about the algorithms' logic
 "..Maybe a side window with categories like high reach, survey etc could be put up and upon clicking it, more papers in that category could be loaded."
• Users who tend to trust the system and its output
 - Participants were largely satisfied with the overall system
 "The idea of providing this system is quite good. Such a system if developed and prepared well, can help and speed up the process of literature survey by helping to find better papers…"
Theme 2: Information Cues
• Four cue labels were used in the system: Recent, Popular, High Reach, Survey/Review
• Cues positively impacted participants' perceptions of the system
 "I like the highlighted recommendations - for e.g. Popular, Recent etc. which greatly helps in distinguishing various references and catches the eye!"
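Cue labels like those in Theme 2 can be assigned with simple per-paper rules. A hypothetical sketch; the field names and thresholds below are made up for illustration and are not the actual Rec4LRW definitions ("High Reach" is omitted because its definition is system-specific):

```python
def cue_labels(paper, current_year=2016, recent_window=3, popular_citations=100):
    """Attach illustrative cue labels to a paper record (hypothetical rules)."""
    labels = []
    if paper["year"] >= current_year - recent_window:
        labels.append("Recent")
    if paper["citations"] >= popular_citations:
        labels.append("Popular")
    if paper.get("is_survey"):
        labels.append("Survey/Review")
    return labels
```

Because the labels are computed per paper, they can be rendered as UI badges on each recommendation, which is what participants said caught the eye.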
13. THEMES (3-4)
Theme 3: Forced Serendipity vs Natural Serendipity
• Prior studies have focused mainly on modelling serendipity
• The 'View Papers in the Parent Cluster' feature helped participants notice papers they had not read earlier
 "The view papers in the parent cluster function is very helpful to get a full picture of research field."
 "The user can view many papers in the parent cluster in addition to the shortlisted papers. Thus the user need not spend much time on finding related papers."
Theme 4: Learning Algorithms vs Fixed Algorithms
• Some participants suggested heuristics for identifying papers for Tasks 1 and 2
• These users expect a list of appropriate algorithms to be presented in the system
 "..Take a high impact paper (based on citation and may be exact keyword matching), then go through its own references to understand more about the research conducted. This is because, a good work generally cites other prominent works in the field…"
16. THEMES (5-6)
Theme 5: Inclusion of Control Features in the User Interface
• Many participants felt handicapped by the absence of control features in the Rec4LRW system
• Expected control features were sort options, topical facets and advanced search features
 "Really good for the initial review. It would be nice to see additional filters to focus on a specific topic"
 "More recent papers shall be included, and it is better if the user can sort the recommended paper by sequence such as sort times, date, relevance..."
Theme 6: Inclusion of Bibliometric Data
• Participants explicitly stated the need for metrics such as impact factor and h-index in the UI
• The main challenge is the computing overhead of calculating the new metrics
 "Categorizing the papers based on popularity, journal impact factor, and etc"
 "…In case that an item in the recommendation list is a journal paper, can we also know its impact factor and which databases indexes it?"
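Of the metrics requested under Theme 6, the h-index illustrates the computing-overhead concern: it requires the full citation-count vector for each author or venue, not a single lookup. The standard definition fits in a few lines:

```python
def h_index(citation_counts):
    """Largest h such that at least h papers have h or more citations each."""
    counts = sorted(citation_counts, reverse=True)
    # With counts descending and rank ascending, the predicate flips from
    # True to False exactly once, so the count of True values equals h.
    return sum(1 for rank, cites in enumerate(counts, start=1) if cites >= rank)
```

Scaling this to every author in a 100,000+ article corpus, and keeping it current as citations accrue, is what makes such metrics costly to serve live in a recommender UI.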
17. THEMES (7-8)
Theme 7: Diversification of Corpus
• In prior studies, the evaluation of algorithms has been restricted to datasets from certain disciplines such as computer science
• Future studies should include papers from "far-apart" disciplines in the evaluation
 "…Due to limitation of data sets (as only ACM papers) search result is not of decent quality."
 "But in general the main drawback is that "the papers in the corpus/dataset are from an extract of papers from ACM DL". As I work at the intersection of information systems and business many relevant papers are not included in the list."
Theme 8: Task Interconnectivity
• Participants appreciated the utility of the 'seed basket' and 'reading list' for managing papers across the three tasks
 "I like the idea of giving recommendations based on a seed group of articles, but there needs to be more facets to select from, there needs to be greater selection of seeding articles as well in terms of those facets."
 "The whole idea seems good for me, especially making seed of 5+ for expanding the bunch."
18. THE FRAMEWORK

SPRRF Feature                        Skill-Reliant User   System-Reliant User
UI Customization
  Sort options                               √
  Topical Facets                             √                    √
  Advanced search options                    √
Algorithmic Customization
  Setting the recommendations count          √                    √
  Selecting the retrieval algorithm          √
  Submitting external papers                 √                    √
User Personalization
  Paper collections                          √                    √
  Favourites specification                   √                    √
  Paper anchors                              √
  Relevance feedback                         √
19. FUTURE WORK
• SPRRF to be used in the second round of Rec4LRW evaluation studies
• SPRRF components to be statistically validated through hypothesis testing
• Expand the scope of SPRRF to other information objects in the Scholarly Communication Lifecycle
20. GET ACCESS TO REC4LRW…
Use the link http://goo.gl/XgynzY or scan the QR code on the slide