From the PRP to the Low Prior Discovery Recall Principle for Recommender Systems
Rocío Cañamares and Pablo Castells
Universidad Autónoma de Madrid
{rocio.cannamares, pablo.castells}@uam.es
IR Group @ UAM
Recommendation can be cast as a ranking task analogous to document search:

Document search         Recommendation
Query (q ∈ Q)           User (u ∈ 𝒰)
Document (d ∈ D)        Item (i ∈ ℐ)
Relevance               User taste
Relevance judgments     Relevance information: observation of taste (ratings)

New accuracy metrics
Relevant recommendations are useful; relevant and novel ones are more useful. For evaluation we therefore redefine relevance as relevant + not seen before, e.g. "undiscovered precision" UP@k. A small sketch of the redefined metric follows.
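As a toy illustration (our own sketch, not code from the poster; the item sets are made up), undiscovered precision only counts recommended items that are both relevant and previously unseen:

def precision_at_k(recommended, relevant, k):
    """Standard P@k: fraction of the top-k recommended items that are relevant."""
    return sum(1 for i in recommended[:k] if i in relevant) / k

def undiscovered_precision_at_k(recommended, relevant, seen, k):
    """UP@k: a hit must be relevant AND not seen before by the user."""
    return sum(1 for i in recommended[:k] if i in relevant and i not in seen) / k

# Toy data (assumed): item "b" is relevant but the user already knows it.
recommended = ["b", "c", "a", "d"]
relevant, seen = {"a", "b"}, {"b"}
print(precision_at_k(recommended, relevant, 2))                      # 0.5
print(undiscovered_precision_at_k(recommended, relevant, seen, 2))   # 0.0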
Optimal rankings
- Document search, standard accuracy. PRP: documents sorted by p(rel | d)
  [Diagram: document collection D, relevant documents, result set, relevance judgments]
- Recommendation, standard accuracy. DFNP: items sorted by p(rel | ¬rated, i)
- Recommendation, undiscovered accuracy. LDRP: items sorted by p(rel, ¬seen | ¬rated, i)
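The three principles reduce to different sort keys over per-item probabilities. A minimal sketch, assuming the probability estimates are given as inputs (how to estimate them is outside the poster's scope). It uses p(rel | ¬rated, i) = p(rel, ¬rated | i) / p(¬rated | i), and the fact that not seen implies not rated:

def rank_prp(items, p_rel):
    """PRP: sort by p(rel | i)."""
    return sorted(items, key=lambda i: p_rel[i], reverse=True)

def rank_dfnp(items, p_rel_notrated, p_rated):
    """DFNP: sort by p(rel | not rated, i) = p(rel, not rated | i) / (1 - p(rated | i)).
    Assumes p(rated | i) < 1 for every item."""
    return sorted(items, key=lambda i: p_rel_notrated[i] / (1 - p_rated[i]), reverse=True)

def rank_ldrp(items, p_rel_notseen, p_rated):
    """LDRP: sort by p(rel, not seen | not rated, i); since not seen implies not rated,
    this equals p(rel, not seen | i) / (1 - p(rated | i))."""
    return sorted(items, key=lambda i: p_rel_notseen[i] / (1 - p_rated[i]), reverse=True)

# Example with made-up estimates for three items:
items = ["a", "b", "c"]
print(rank_dfnp(items, {"a": .10, "b": .12, "c": .05}, {"a": .5, "b": .7, "c": .1}))
# -> ['b', 'a', 'c']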
Expected precision

Why are these the optimal rankings? Compare the expected precision at the top position in each scenario.
Document search: every ranked document reaches the user, so the expected precision at the top position is simply

\[ \mathbb{E}[P@1] = p(rel \mid i_1) \]

maximized by ranking items by p(rel | i): the PRP.

Recommendation, standard accuracy: items the user has already rated are excluded from the delivered recommendation, so the item at rank k is shown first only if every item ranked above it was rated by the user:
[Diagram: item set ℐ, relevant items, recommended items, and the ratings that act as relevance information]

\[ \mathbb{E}[P@1] = \sum_{k=1}^{|\mathcal{I}|} p(rel, \neg rated \mid i_k) \prod_{j=1}^{k-1} p(rated \mid i_j) \]

maximized by ranking items by p(rel | ¬rated, i): the DFNP.
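A quick sanity check of this claim under made-up toy probabilities (our own sketch, not the poster's data): evaluate the formula above for every permutation of a four-item catalogue and confirm the DFNP order attains the brute-force maximum.

from itertools import permutations

# Made-up per-item probabilities: p(rel, not rated | i) and p(rated | i).
p_rel_nr = {"a": 0.10, "b": 0.12, "c": 0.05, "d": 0.08}
p_rated  = {"a": 0.50, "b": 0.70, "c": 0.10, "d": 0.40}

def expected_p1(ranking):
    """E[P@1] = sum_k p(rel, not rated | i_k) * prod_{j<k} p(rated | i_j)."""
    total, prefix = 0.0, 1.0
    for i in ranking:
        total += p_rel_nr[i] * prefix
        prefix *= p_rated[i]  # rank k+1 is shown first only if i_1..i_k were all rated
    return total

best = max(permutations(p_rel_nr), key=expected_p1)
dfnp = tuple(sorted(p_rel_nr, key=lambda i: p_rel_nr[i] / (1 - p_rated[i]), reverse=True))
assert expected_p1(dfnp) == expected_p1(best)
print(dfnp, expected_p1(dfnp))  # ('b', 'a', 'd', 'c'), 0.225 up to float rounding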
Recommendation, undiscovered accuracy: a hit must now also be previously unseen, and not seen implies not rated:
[Diagram: item set ℐ, relevant items, recommended items, ratings, and seen items]

\[ \mathbb{E}[UP@1] = \sum_{k=1}^{|\mathcal{I}|} p(rel, \neg seen \mid i_k) \prod_{j=1}^{k-1} p(rated \mid i_j) \]

maximized by ranking items by p(rel, ¬seen | ¬rated, i): the LDRP.
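The poster states these optima without the intermediate step; a standard adjacent-swap (exchange) argument recovers them (our reconstruction, not text from the poster). Write a_k = p(rel, ¬rated | i_k) and b_k = p(rated | i_k). Swapping the items at ranks k and k+1 changes only two terms of E[P@1], since the common prefix ∏_{j<k} b_j factors out and every later term depends on the product b_k b_{k+1}, which is order-independent:

\[
a_k + b_k\,a_{k+1} \;\ge\; a_{k+1} + b_{k+1}\,a_k
\;\Longleftrightarrow\;
\frac{a_k}{1-b_k} \;\ge\; \frac{a_{k+1}}{1-b_{k+1}},
\qquad
\frac{a_k}{1-b_k}
= \frac{p(rel, \neg rated \mid i_k)}{p(\neg rated \mid i_k)}
= p(rel \mid \neg rated, i_k).
\]

Hence sorting by p(rel | ¬rated, i) maximizes E[P@1] (the DFNP); substituting a_k = p(rel, ¬seen | i_k) and using that not seen implies not rated yields p(rel, ¬seen | ¬rated, i) (the LDRP).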
Experiment

Typical offline evaluation setup: all available rating data is split into training and test. Training data are given to the recommenders → ratings; test data are used for evaluating the recommendations → relevance information.

Problem: the relevance information is biased. Users rate what they want, so some items typically have many more ratings than others.

Goal: collect unbiased relevance information for evaluation.

Dataset: we build a dataset with unbiased observations (judgments). We sample 1,000 music tracks from deezer.com uniformly at random, and ask ~1,000 anonymous workers on CrowdFlower to rate 100 tracks each, sampled uniformly at random → ~100,000 ratings in total. Available at http://ir.ii.uam.es/cm100k

Setup: we split all judgments into training and test (5-fold). The input for the algorithms is a random subsample (ratio ρ) of the training judgments on known music → MNAR (naturally biased, since users can only rate music they know); all remaining judgments are used as relevance information → MAR (unbiased). Relevant = positive rating + not seen before.

[Diagram: judgments split by known vs. unknown music; a fraction ρ of the known-music judgments becomes the training ratings and the rest becomes the relevance information, illustrated with ρ = 0.8]
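A sketch of this split under stated assumptions (the function names, the judgment-tuple format and the rating threshold are ours; the actual CM100k files may differ):

import random

def split_judgments(judgments, rho, seed=0):
    """judgments: list of (user, item, rating, known_before) tuples -- assumed format.
    Training ratings: a random fraction rho of the judgments on known music,
    mimicking the natural MNAR bias (people can only rate music they know).
    Relevance information: all remaining judgments -> MAR (unbiased)."""
    rng = random.Random(seed)
    known = [j for j in judgments if j[3]]                  # judgments on known music
    train = set(rng.sample(known, int(rho * len(known))))   # ratings given to recommenders
    relevance = [j for j in judgments if j not in train]    # held out for evaluation
    return sorted(train), relevance

# Evaluation then takes "relevant" = positive rating + not seen before:
def is_relevant(judgment, threshold=3):
    user, item, rating, known_before = judgment
    return rating >= threshold and not known_before        # threshold is an assumption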
[Figures: expected P@1, nDCG@10 and undiscovered nDCG@10 as a function of the training ratio ρ, for the optimal ranking, DFNP, PRP, Popularity, Relevant Popularity, Average Rating and Random]

Under standard accuracy, PRP and the popularity rankings are very close to the optimal ranking, while average rating seems to underperform considerably. Under undiscovered accuracy, PRP and the popularity rankings are worse than random!
The comparison between algorithms changes completely when we consider relevant and novel recommendations
Results
- DFNP is more robust than PRP to the split ratio: PRP is worse than random when ρ = 1.
- Popularity seems to follow PRP.
- Average rating is better than popularity when ρ is close to 1.
- Potential implications on state-of-the-art personalized algorithms.
- Under undiscovered accuracy, average rating is the only algorithm standing above random.