RecSys 2018, Vancouver, Canada
On the Robustness and Discriminative Power
of IR Metrics for Top-N Recommendation
Daniel Valcarce∗ A. Bellogín† J. Parapar∗ P. Castells†
@dvalcarce @abellogin @jparapar @pcastells
∗University of A Coruña
†Universidad Autónoma de Madrid
Evaluation
Recommender Systems Evaluation
Online evaluation (e.g., A/B testing):
expensive,
measures real user behavior.
Offline evaluation (the focus of this work):
cheap,
highly reproducible,
usually constitutes the first step before deploying a recommender system.
Offline Evaluation of Recommender Systems (RS)
When evaluating recommender systems (RS), which metric should we use?
Many types: error, ranking accuracy, diversity, novelty, etc.
Ranking accuracy metrics are becoming the most popular.
These metrics have been traditionally used in Information Retrieval (IR).
IR Metrics used in RS
Some IR metrics that have been used in RS:
P: Precision
Recall
MAP: Mean Average Precision
nDCG: Normalised Discounted Cumulative Gain
MRR: Mean Reciprocal Rank
bpref: Binary Preference
infAP: Inferred Average Precision
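To make these metrics concrete, here is a minimal sketch (not from the talk) of how two of the listed metrics could be computed for a single user's recommendation list, assuming binary relevance; the item identifiers and helper names are illustrative.

```python
import math

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    hits = sum(1 for item in ranking[:k] if item in relevant)
    return hits / k

def ndcg_at_k(ranking, relevant, k):
    """Binary-relevance nDCG: DCG of the top-k ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranking[:k]) if item in relevant)
    idcg = sum(1.0 / math.log2(rank + 2)
               for rank in range(min(k, len(relevant))))
    return dcg / idcg if idcg > 0 else 0.0

# Illustrative example: a ranking produced by some recommender for one user.
ranking = ["i3", "i7", "i1", "i9", "i4"]
relevant = {"i1", "i4"}                       # the user's relevant (test) items
print(precision_at_k(ranking, relevant, 5))   # 0.4
print(ndcg_at_k(ranking, relevant, 5))        # ≈ 0.54
```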
Analyse how IR Metrics behave in RS
These ranking accuracy metrics have been studied in IR.
We want to study their behavior in top-N recommendation.
Two perspectives:
robustness,
discriminative power.
Proposal
Robustness
Sparsity bias:
Sparsity is intrinsic to the recommendation task.
We take random subsamples from the test set to increase the bias.
Popularity bias:
Missing-not-at-random (long-tail distribution).
We remove the most popular items to study the bias.
We measure the robustness of a metric by computing the Kendall's correlation of system rankings when changing the amount of bias.
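As an illustration of the robustness computation just described, a minimal sketch using scipy.stats.kendalltau; the system names and metric scores are invented for the example.

```python
from scipy.stats import kendalltau

# Hypothetical nDCG@100 scores for the same systems on the full test set
# and on a 50% random subsample of it (increased sparsity bias).
full_test = {"SysA": 0.31, "SysB": 0.27, "SysC": 0.25, "SysD": 0.19}
subsample = {"SysA": 0.22, "SysB": 0.21, "SysC": 0.14, "SysD": 0.15}

systems = sorted(full_test)                       # fix a common system order
tau, _ = kendalltau([full_test[s] for s in systems],
                    [subsample[s] for s in systems])
print(tau)  # τ close to 1 means the biased test set ranks the systems almost identically
```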
Discriminative Power
A metric is discriminative when its differences in value are statistically significant.
We use the permutation test with the difference in means as test statistic.
We run a statistical test between all possible system pairs.
We plot the obtained p-values sorted by decreasing value.
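A minimal sketch of such a paired permutation test with the difference in means as the test statistic; the per-user scores are hypothetical and the number of permutations is an arbitrary choice.

```python
import numpy as np

def permutation_test(scores_a, scores_b, n_perm=10_000, seed=0):
    """Two-sided paired permutation test on per-user metric values.

    Under the null hypothesis the two systems are exchangeable, so each user's
    pair of scores may be swapped; the p-value is the fraction of random
    swappings whose difference in means is at least as extreme as the observed one.
    """
    rng = np.random.default_rng(seed)
    a, b = np.asarray(scores_a, float), np.asarray(scores_b, float)
    observed = abs(a.mean() - b.mean())
    extreme = 0
    for _ in range(n_perm):
        swap = rng.random(len(a)) < 0.5        # swap each user's pair with prob. 0.5
        perm_a = np.where(swap, b, a)
        perm_b = np.where(swap, a, b)
        extreme += abs(perm_a.mean() - perm_b.mean()) >= observed
    return extreme / n_perm

# Hypothetical per-user nDCG@100 values for two recommenders.
sys_a = [0.12, 0.40, 0.05, 0.33, 0.27, 0.18]
sys_b = [0.10, 0.35, 0.07, 0.30, 0.22, 0.15]
print(permutation_test(sys_a, sys_b))
# Repeating the test for every pair of systems and sorting the resulting
# p-values in decreasing order yields the p-value curves shown later.
```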
Experiments
Experimental Settings
Three datasets:
◦ MovieLens 1M,
◦ LibraryThing,
◦ BeerAdvocate.
Methodology:
◦ AllItems: rank all the items in the dataset that have not been rated by the target user.
◦ 80-20% random split.
Systems:
◦ 21 different recommendation algorithms.
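As a rough illustration of this setup, a minimal sketch of an 80-20% random split and the AllItems candidate set for one user, assuming ratings are (user, item, rating) tuples; all names here are illustrative, not the authors' code.

```python
import random

def split_ratings(ratings, test_fraction=0.2, seed=42):
    """Randomly hold out a fraction of the (user, item, rating) tuples as the test set."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]          # train, test

def allitems_candidates(user, all_items, train):
    """AllItems: every item in the dataset the user has not rated in training."""
    rated_in_train = {item for (u, item, _) in train if u == user}
    return [item for item in all_items if item not in rated_in_train]

# Illustrative toy data.
ratings = [("u1", "i1", 5), ("u1", "i2", 3), ("u2", "i1", 4), ("u2", "i3", 2)]
train, test = split_ratings(ratings)
print(allitems_candidates("u1", ["i1", "i2", "i3"], train))
```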
Comparing metric cut-offs
Comparing cut-offs of the same metric (nDCG) 1/4
       @5   @10  @20  @30  @40  @50  @60  @70  @80  @90  @100
@5    1.00 0.95 0.93 0.92 0.92 0.92 0.92 0.91 0.90 0.90 0.90
@10   0.95 1.00 0.98 0.97 0.97 0.97 0.97 0.96 0.95 0.95 0.95
@20   0.93 0.98 1.00 0.99 0.99 0.99 0.99 0.98 0.97 0.97 0.97
@30   0.92 0.97 0.99 1.00 1.00 1.00 1.00 0.99 0.98 0.98 0.98
@40   0.92 0.97 0.99 1.00 1.00 1.00 1.00 0.99 0.98 0.98 0.98
@50   0.92 0.97 0.99 1.00 1.00 1.00 1.00 0.99 0.98 0.98 0.98
@60   0.92 0.97 0.99 1.00 1.00 1.00 1.00 0.99 0.98 0.98 0.98
@70   0.91 0.96 0.98 0.99 0.99 0.99 0.99 1.00 0.99 0.99 0.99
@80   0.90 0.95 0.97 0.98 0.98 0.98 0.98 0.99 1.00 1.00 1.00
@90   0.90 0.95 0.97 0.98 0.98 0.98 0.98 0.99 1.00 1.00 1.00
@100  0.90 0.95 0.97 0.98 0.98 0.98 0.98 0.99 1.00 1.00 1.00
Correlation between cut-offs of nDCG.
Comparing cut-offs of the same metric (nDCG) 2/4
[Figure: Kendall's τ (y-axis, 0.85–1.00) against the % of ratings kept in the test set (x-axis, 100 down to 0), one curve per nDCG cut-off from @5 to @100.]
Kendall's correlation among systems when increasing the sparsity bias using nDCG.
Comparing cut-offs of the same metric (nDCG) 3/4
[Figure: Kendall's τ (y-axis, 0.0–1.0) against the % of least popular items kept in the test set (x-axis, 100 down to 80), one curve per nDCG cut-off from @5 to @100.]
Kendall's correlation among systems when changing the popularity bias using nDCG.
Comparing cut-offs of the same metric (nDCG) 4/4
[Figure: p-value (y-axis, 0.0–1.0) for pairs of recommender systems sorted by decreasing p-value (x-axis), one curve per nDCG cut-off from @5 to @100.]
Discriminative power of nDCG measured with p-value curves.
Comparing metrics at the same cut-off
Comparing metrics with cut-off @100 1/4
        P    Recall MAP  nDCG  MRR  bpref infAP
P      1.00  0.89  0.87  0.89  0.71  0.89  0.91
Recall 0.89  1.00  0.87  0.90  0.72  0.90  0.92
MAP    0.87  0.87  1.00  0.96  0.84  0.92  0.92
nDCG   0.89  0.90  0.96  1.00  0.82  0.94  0.96
MRR    0.71  0.72  0.84  0.82  1.00  0.80  0.80
bpref  0.89  0.90  0.92  0.94  0.80  1.00  0.96
infAP  0.91  0.92  0.92  0.96  0.80  0.96  1.00
Correlation between metrics.
Comparing metrics with cut-off @100 2/4
[Figure: Kendall's τ (y-axis, 0.85–1.00) against the % of ratings kept in the test set (x-axis, 100 down to 0), one curve per metric: P, Recall, MAP, nDCG, MRR, bpref, infAP.]
Kendall's correlation among systems when increasing the sparsity bias.
Comparing metrics with cut-off @100 3/4
[Figure: Kendall's τ (y-axis, 0.0–1.0) against the % of least popular items kept in the test set (x-axis, 100 down to 80), one curve per metric: P, Recall, MAP, nDCG, MRR, bpref, infAP.]
Kendall's correlation among systems when increasing the popularity bias.
Comparing metrics with cut-off @100 4/4
[Figure: p-value (y-axis, 0.0–1.0) for pairs of recommender systems sorted by decreasing p-value (x-axis), one curve per metric: P, Recall, MAP, nDCG, MRR, bpref, infAP.]
Discriminative power measured with p-value curves.
Conclusions and Future Directions
Conclusions
Deeper cut-offs offer greater robustness and discriminative power than shallow cut-offs.
Precision offers high robustness to the sparsity and popularity biases and good discriminative power.
nDCG provides the best discriminative power, high robustness to the sparsity bias and moderate robustness to the popularity bias.
Future work
Explore different types of metrics such as diversity or novelty metrics.
Use other evaluation methodologies instead of AllItems.
◦ For instance, One-Plus-Random: one relevant item and N non-relevant items as candidate set (see the sketch below).
Employ different partitioning schemes such as temporal splits.
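A minimal sketch of how a One-Plus-Random candidate set could be assembled, assuming the N non-relevant items are drawn from the items the user has not rated; the function and parameters are hypothetical.

```python
import random

def one_plus_random(relevant_item, user_rated, all_items, n=100, seed=7):
    """Candidate set of one relevant item plus n randomly sampled non-relevant items."""
    rng = random.Random(seed)
    pool = [i for i in all_items if i not in user_rated and i != relevant_item]
    candidates = rng.sample(pool, min(n, len(pool))) + [relevant_item]
    rng.shuffle(candidates)
    return candidates
```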
Thank you!
