Presentation made during the Intelligent User-Adapted Interfaces: Design and Multi-Modal Evaluation (IUadaptME) workshop, conducted as part of UMAP 2018
1. Multi-method Evaluation in Scientific Paper Recommender Systems
Aravind Sesagiri Raamkumar
Schubert Foo
Wee Kim Wee School of Communication and Information, NTU
IUadaptME Workshop | UMAP’18
July 8th 2018
5. SPRS Studies
• Major Areas
– Literature Review (LR) tasks
• Task of building an initial reading list at the start of LR
• Task of finding similar papers based on a single paper
• Task of finding similar papers based on multiple papers
• Task of searching papers based on input text
– User footprint
– Researcher’s publication history
– Social network of authors
• Recommendations generated based on:
– Citation network
– Metadata fields
– Text content from papers
– System logs
6. Rec4LRW System
Rec4LRW – Recommender System for Literature Review and Writing
• Task 1 - Building an initial reading list of research papers
– Author-specified Keywords based Retrieval (AKR) Technique
• Task 2 - Finding similar papers based on a set of papers
– Integrated Discovery of Similar Papers (IDSP) Technique
• Task 3 - Shortlisting papers from reading list for inclusion in manuscript
– Citation Network based Shortlisting (CNS) Technique
10. Rec4LRW Evaluation Strategy
Offline evaluation of Task 1
• Rank aggregation method
User evaluation of the three tasks
• Survey-based evaluations
User evaluation of the overall system
• Survey-based evaluations
“Offline evaluations are more prevalent in this SPRS area, accounting for about 69% of all studies”
11. Offline Evaluation of Task 1
Evaluated Techniques
Label | Abbr. | Technique Description
A | AKRv1 | Basic AKR technique with weights WCC = 0.25, WRC = 0.25, WCO = 0.5
B | AKRv2 | Basic AKR technique with weights WCC = 0.1, WRC = 0.1, WCO = 0.8
C | HAKRv1 | HITS-boosted AKR technique with weights WCC = 0.25, WRC = 0.25, WCO = 0.5
D | HAKRv2 | HITS-boosted AKR technique with weights WCC = 0.1, WRC = 0.1, WCO = 0.8
E | CFHITS | IBCF technique boosted with HITS
F | CFPR | IBCF technique boosted with PageRank
G | PR | PageRank technique
Evaluation Approach
• The numbers of Recent (R1), Popular (R2), Survey (R3) and Diverse (R4) papers were enumerated for each of the 186 topics and seven techniques
• For each topic, ranks were assigned to the techniques based on the highest counts in each recommendation list
• The RankAggreg library was used to perform rank aggregation across topics (a minimal analogue is sketched below)
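RankAggreg is an R package that searches (via cross-entropy or genetic heuristics) for the ordering that minimizes a total distance, such as the Spearman footrule, to the input orderings; the “Min. Obj. Function Score” on the results slide is that minimized distance. As a rough Python illustration, here is a brute-force sketch over invented per-topic orderings; the orderings are dummies, not the study’s data, and exhaustive enumeration replaces RankAggreg’s heuristic search.

```python
from itertools import permutations

# Dummy per-topic orderings of the seven techniques (A-G) for one
# paper-type requirement; the study had 186 such orderings per requirement.
per_topic_orders = [
    list("BACDEFG"),
    list("ABCDFEG"),
    list("BACDEGF"),
]
techniques = sorted(per_topic_orders[0])  # ['A', ..., 'G']

def footrule(candidate, order):
    """Spearman footrule distance: total absolute rank displacement."""
    pos_c = {t: i for i, t in enumerate(candidate)}
    pos_o = {t: i for i, t in enumerate(order)}
    return sum(abs(pos_c[t] - pos_o[t]) for t in techniques)

# Brute-force search over all 7! = 5040 candidate orderings for the one
# minimising the summed footrule distance to the per-topic orderings.
best = min(
    permutations(techniques),
    key=lambda cand: sum(footrule(cand, o) for o in per_topic_orders),
)
score = sum(footrule(best, o) for o in per_topic_orders)
print("Optimal aggregated ranks:", " ".join(best), "| objective:", score)
```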
Experiment Setup
• A total of 186 author-specified keywords from the ACM DL dataset were identified as the seed research topics
• The experiment was performed in three sequential steps (see the pipeline sketch after this list):
1. The top 200 papers were retrieved using the BM25 similarity algorithm
2. The top 20 papers were identified using the specific ranking schemes of the seven techniques
3. The evaluation metrics were measured for the seven techniques
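A hedged sketch of the three-step pipeline. The toy corpus, the feature values behind the WCC/WRC/WCO weights (the slide does not expand these names), and the HITS-boosting formula are all placeholder assumptions; rank_bm25 and networkx merely stand in for whatever retrieval and graph tooling the study actually used.

```python
import networkx as nx
from rank_bm25 import BM25Okapi

# Step 1: BM25 retrieval of the top 200 candidates for one seed topic.
# `corpus` is a tiny placeholder for the ACM DL dataset.
corpus = {
    "p1": "recommender systems survey evaluation".split(),
    "p2": "scientific paper recommendation citation network".split(),
    "p3": "user modeling adaptive hypermedia".split(),
}
paper_ids = list(corpus)
bm25 = BM25Okapi([corpus[p] for p in paper_ids])
scores = bm25.get_scores("recommender systems".split())
top200 = [p for p, _ in sorted(zip(paper_ids, scores),
                               key=lambda x: -x[1])[:200]]

# Step 2: technique-specific re-ranking of the candidates.
# AKR: weighted sum of three normalised paper features; the slide gives
# the weights but not the feature definitions, so dummy values are used.
W_CC, W_RC, W_CO = 0.25, 0.25, 0.5          # AKRv1 weights from the table
features = {p: {"cc": 0.5, "rc": 0.3, "co": 0.7} for p in top200}

def akr_score(p):
    f = features[p]
    return W_CC * f["cc"] + W_RC * f["rc"] + W_CO * f["co"]

# HAKRv1 (assumed boosting scheme): scale the AKR score by the paper's
# HITS authority in the citation graph among the candidates.
cites = nx.DiGraph([("p1", "p2"), ("p3", "p1"), ("p3", "p2")])  # toy graph
_, authority = nx.hits(cites, max_iter=1000)

def hakr_score(p):
    return akr_score(p) * (1.0 + authority.get(p, 0.0))

top20 = sorted(top200, key=hakr_score, reverse=True)[:20]
# Step 3 would count recent, popular, survey and diverse papers in
# `top20` per technique, feeding the rank aggregation above.
print(top20)
```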
12. Offline Evaluation of Task 1
Results
Paper Type (Requirement) | Optimal Aggregated Ranks (positions 1-7) | Min. Obj. Function Score
Recent Papers (R1) | B A C D E F G | 10.66
Popular Papers (R2) | F E C D G A B | 11.89
Literature Survey Papers (R3) | C G D A E F B | 13.38
Diverse Papers (R4) | C D G A B F E | 12.15
• The HITS-boosted version of the AKR technique, HAKRv1 (C), was the best all-round performing technique
• The HAKRv1 technique was particularly good for retrieving literature survey papers and papers from different sub-topics, while the basic AKRv1 technique (A) was good for retrieving recent papers
13. Rec4LRW User Study Evaluation Goals
1. Ascertain the agreement percentages of the evaluation measures for the three tasks and the overall system, and identify whether the values are above a preset threshold criterion of 75%
2. Test the hypothesis that students benefit more from the recommendation tasks/system than staff
3. Measure the correlations between the measures and build a regression model with ‘agreeability on a good list’ as the dependent variable
4. Track the change in user perceptions across the three tasks
5. Compare the pre-study and post-study variables to understand whether the target participants benefited from the tasks
6. Identify the top preferred and critical aspects of the task recommendations and the system, using the subjective feedback of the participants
14. User Study Details
• The Rec4LRW system was made available over the internet
• Participants were recruited with the intent of reaching a worldwide audience
• Only researchers with paper-authoring experience were recruited, through a pre-screening survey
• 230 researchers participated in the pre-screening survey
• 149 participants were deemed eligible and invited for the study
• Participants were provided with a user guide
• Participants were required to execute all three tasks
• Evaluation questionnaires were embedded in the screen of each task of the Rec4LRW system
15. Task Evaluation Measures
Common Measures
• Relevance
• Usefulness
• Good_List
Tasks 1 and 2
• Good_Spread
• Diversity
• Interdisciplinarity
• Popularity
• Recency
• Good_Mix
• Familiarity
• Novelty
• Serendipity
• Expansion_Required
• User_Satisfaction
Task 2
• Seedbasket_Similarity
• Shared_Corelations
• Seedbasket_Usefulness
Task 3
• Importance
• Certainty
• Shortlisting_Feature
Qualitative Feedback
1) From the displayed information, what features did you like the most?
2) Please provide your personal feedback about the execution of this task
16. System Evaluation Measures
Effort to use the System (EUS)
• Convenience
• Effort_Required
• Mouse_Clicks
• Little_Time
• Much_Time
Perceived Usefulness (PU)
• Productivity_Improvability
• Enhance_Effectiveness
• Ease_Job
• Work_Usefulness
Perceived System Effectiveness (PSE)
• Recommend
• Pleasant_Experience
• Useless
• Awareness
• Better_Choice
• Findability
• Accomplish_Tasks
• Performance_Improvability
18. Analysis Procedures
Quantitative Data
• Agreement Percentage (AP) calculated by considering only responses of 4 (‘Agree’) and 5 (‘Strongly Agree’) on the 5-point Likert scale (sketched below)
• Independent samples t-test for hypothesis testing
• Spearman coefficient for correlation measurement
• Multiple linear regression (MLR) used for the predictive models
– Paired samples t-test for model validation
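A minimal sketch of the AP, t-test and Spearman computations with NumPy/SciPy. The response vectors are dummies standing in for the survey data (the variable names echo the measures on slide 15); an MLR model with Good_List as the dependent variable would follow the same pattern and is omitted for brevity.

```python
import numpy as np
from scipy import stats

# Dummy 5-point Likert responses for one evaluation measure
good_list = np.array([5, 4, 3, 4, 2, 5, 4, 4, 1, 5])

# Agreement Percentage: share of 4 ('Agree') and 5 ('Strongly Agree')
ap = 100.0 * np.mean(good_list >= 4)
print(f"AP = {ap:.1f}%  (preset threshold: 75%)")

# Hypothesis test: independent samples t-test, students vs. staff
students = np.array([5, 4, 4, 5, 3, 4])
staff = np.array([3, 4, 2, 4, 3, 3])
t_stat, p_val = stats.ttest_ind(students, staff)
print(f"t = {t_stat:.2f}, p = {p_val:.3f}")

# Correlation between two measures via the Spearman coefficient
relevance = np.array([4, 5, 3, 4, 2, 5, 3, 4, 2, 5])
rho, p_rho = stats.spearmanr(relevance, good_list)
print(f"Spearman rho = {rho:.2f}, p = {p_rho:.3f}")
```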
Qualitative Data
• Descriptive coding method was used to code the participant feedback
• Two coders performed the coding in a sequential manner
Task | Preferred Aspects (κ) | Critical Aspects (κ)
Task 1 | 0.918 | 0.727
Task 2 | 0.930 | 0.758
Task 3 | 0.877 | 0.902
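The κ values above are presumably Cohen’s kappa between the two coders (the slide does not name the statistic). A minimal sketch with scikit-learn, using invented codes rather than the study’s:

```python
from sklearn.metrics import cohen_kappa_score

# Codes assigned independently by the two coders to the same comments
coder1 = ["quality", "ui", "dataset", "quality", "speed", "quality"]
coder2 = ["quality", "ui", "dataset", "ui", "speed", "quality"]

print(f"kappa = {cohen_kappa_score(coder1, coder2):.3f}")
```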
19. Participant Demographics
Stage N
Task 1 132
Task 2 121
Task 3 119
Demographic Variable N
Position
Student 62 (47%)
Staff 70 (53%)
Experience Level
Beginner 15 (11.4%)
Intermediate 61 (46.2%)
Advanced 34 (25.8%)
Expert 22 (16.7%)
Discipline N
Computer Science & Information Systems 51 (38.6%)
Library and Information Studies 30 (22.7%)
Electrical & Electronic Engineering 30 (22.7%)
Communication & Media Studies 8 (6.1%)
Mechanical, Aeronautical & Manufacturing Engineering 5 (3.8%)
Biological Sciences 2 (1.5%)
Statistics & Operational Research 1 (0.8%)
Education 1 (0.8%)
Politics & International Studies 1 (0.8%)
Economics & Econometrics 1 (0.8%)
Civil & Structural Engineering 1 (0.8%)
Psychology 1 (0.8%)
Country N
Singapore 107 (81.1%)
India 4 (3%)
Malaysia 3 (2.3%)
Sri Lanka 3 (2.3%)
Pakistan 3 (2.3%)
Indonesia 2 (1.5%)
Germany 2 (1.5%)
Australia 1 (0.8%)
Iran 1 (0.8%)
Thailand 1 (0.8%)
China 1 (0.8%)
USA 1 (0.8%)
Canada 1 (0.8%)
Sweden 1 (0.8%)
Slovenia 1 (0.8%)
23. Results for Goal 6
Top 5 Preferred Aspects
Rank | Task 1 (N=109) | Task 2 (N=100) | Task 3 (N=91)
1 | Information Cue Labels (41%) | Shared Co-citations & Co-references (28%) | Shortlisting Feature & Recommendation Quality (24%)
2 | Rich Metadata (21%) | Recommendation Quality (27%) | Information Cue Labels (15%)
3 | Diversity of Papers (13%) | Information Cue Labels (16%) | View Papers in Clusters (11%)
4 | Recommendation Quality (9%) | Seed Basket (14%) | Rich Metadata (7%)
5 | Recency of Papers (4%) | Rich Metadata (9%) | Ranking of Papers (3%)
Top 5 Critical Aspects
Rank | Task 1 (N=109) | Task 2 (N=100) | Task 3 (N=91)
1 | Broad topics not suitable (20%) | Quality can be improved (16%) | Rote selection of papers for task execution (16%)
2 | Limited dataset (7%) | Limited dataset (12%) | Limited dataset (5%)
3 | Quality can be improved (6%) | Recommendation algorithm could include more dimensions (7%) | Algorithm can be improved (5%)
4 | Different algorithm required (5%) | Speed can be improved (7%) | Not sure of the usefulness (4%)
5 | Free-text search required (4%) | Repeated recommendations from Task 1 (3%) | UI can be improved (3%)
24. SPRRF - Scientific Paper Retrieval and Recommender Framework
• Seven themes identified using the holistic coding method:
– Distinct User Groups
– Usefulness of Information Cue Labels
– Forced Serendipity vs. Natural Serendipity
– Learning Algorithms vs. Fixed-Logic Algorithms
– Inclusion of Control Features in UI
– Inclusion of Bibliometric Data
– Diversification of Corpus
• SPRRF conceptualized as a mental model based on the themes
• The framework needs to be validated
25. Questions for Discussion
• How dependable are the gold standard lists in SPRS evaluation, since relevance is largely dependent on user perspective?
• Should SPRS evaluations be conducted in a parallel or serial manner?
• What type of data should be collected during usability testing in SPRS evaluation?