SIGIR 2018 - Should I Follow the Crowd? A Probabilistic Analysis of the Effectiveness of Popularity in Recommender Systems

IRGIRGroup @UAM
Should I Follow the Crowd? A Probabilistic Analysis of the Effectiveness of Popularity in Recommender Systems
41st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2018)
Ann Arbor, MI, USA, 10 July 2018
Rocío Cañamares and Pablo Castells
Universidad Autónoma de Madrid
http://ir.ii.uam.es

Popularity in recommendation
Would you find it useful
to recommend these items?

Outline
1. The popularity bias in recommender systems
2. Formal analysis
3. Experiments
4. Conclusions

Recommender systems
Users who bought this also bought…
Music discovery
Related videos
People you may know…
. . .

The recommender system’s task
[Figure: the “rating” matrix, an abstraction of user-item interaction (users × items); a few cells hold rating values, most are empty, and “?” marks missing ratings to predict]
• Rating matrix with some available cell values, most cells empty
• Rank items by predicting missing ratings
• Offline evaluation: split the data into training and test
• Evaluate with IR metrics → test ratings = relevance judgments
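The offline evaluation protocol above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names, the 80/20 split ratio and the binary-gain form of nDCG are assumptions made here for the example.

```python
import math
import random


def split_ratings(ratings, test_fraction=0.2, seed=0):
    """Randomly split (user, item, value) triples into training and test lists.

    The split ratio and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    shuffled = ratings[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked items that appear in the relevant set."""
    return sum(1 for item in ranking[:k] if item in relevant) / k


def ndcg_at_k(ranking, relevant, k):
    """Binary-gain nDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranking[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(pos + 2)
                for pos in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0
```

Here `relevant` would be the set of items that a user rated positively in the test split, so the test ratings play the role of relevance judgments.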

The popularity bias in recommender systems
• In the data
[Figure: number of ratings per item in the users × items matrix; a few popular items versus the rest of items (long tail)]
• In algorithms
[Figure: number of times each item appears in the top 10 vs. its number of positive ratings, in three panels: matrix factorization, popularity, user-based kNN]
• In offline evaluation?
[Figure: nDCG@10 on MovieLens 1M; legend: random, user-based kNN, matrix factorization]

Avoiding popularity
Popularity is non-personalized, trivial, suspicious…
Metrics and algorithms have been proposed
that remove or cope with popularity biases
But… is popularity
“good” or “bad”?

Should I follow the crowd?
• Lack of novelty is not a sufficient answer: there are degrees of popularity
• The majority provides useful default choices
• Technically the simplest and cheapest solution
– Most apps have majority listings
• The majority is not always right
– Randomness factors, conformity, manipulation, etc.

Our formal analysis
Research questions
• Is popularity really effective (accurate) in recommendation?
– Which popularity: positive rating count or ratio?
• Are we measuring its effectiveness properly in offline experiments?
Steps of the analysis: expected precision, observed vs. true metric values, random variables, rankings, dependencies, theoretical findings

Formalization: observed vs. true accuracy
• Observed metric value: computed on available user taste observations
• True metric value: computed with full knowledge of user tastes
• Observed ≈ true?
[Figure: users × items matrices for the observed and the true setting; legend: relevant, non-relevant, missing ratings]

Formalization: expectation
• Analysis of the expected accuracy of popularity-based recommendation
– How good is popularity compared to “something else”?
– Do observed and true accuracy agree (in expectation)?
• The simplest possible metric: $\mathbb{E}[P@1]$
• Expectation → random variables, probabilities

Formalization: key random variables
Random variables defined over Users × Items: $rel$ and $rated$

Optimal and popularity rankings
• Popularity variants as probability ranking functions
Nr. relevant ratings ⟶ $pop(i) \propto p(rel, train \mid i)$
Average rating value ⟶ $avg(i) = p(rel \mid train, i)$
• Lemma: the optimal non-personalized rankings are
For true $\mathbb{E}[P@1]$ ⟶ $opt(i) = p(rel \mid \neg train, i)$
For observed $\mathbb{E}[P@1]$ ⟶ $opt(i) = p(rel, test \mid \neg train, i)$
• We reach several findings by reasoning on rank equivalence
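As a rough illustration of the two popularity variants above, the sketch below estimates both scores from training ratings. It is a minimal example under stated assumptions, not the authors' implementation: the function names, the toy data and the relevance threshold (rating of 4 or more counts as relevant) are hypothetical choices made here.

```python
from collections import defaultdict


def popularity_and_average(train, n_users, threshold=4):
    """Estimate the two non-personalized scores from (user, item, rating) triples.

    pop[i]: relevant training ratings of i per user, an estimate of p(rel, train | i)
    avg[i]: relevant share of i's training ratings, an estimate of p(rel | train, i)
    The threshold that turns a rating into binary relevance is an assumption.
    """
    n_rated = defaultdict(int)
    n_relevant = defaultdict(int)
    for _user, item, rating in train:
        n_rated[item] += 1
        if rating >= threshold:
            n_relevant[item] += 1
    pop = {i: n_relevant[i] / n_users for i in n_rated}
    avg = {i: n_relevant[i] / n_rated[i] for i in n_rated}
    return pop, avg


def rank(scores):
    """Items sorted by decreasing score, ties broken by item id."""
    return sorted(scores, key=lambda i: (-scores[i], i))


# Toy data: item "a" has many ratings, half of them positive;
# item "b" has a single, positive rating.
train = [(0, "a", 5), (1, "a", 2), (2, "a", 4), (3, "a", 1), (0, "b", 5)]
pop, avg = popularity_and_average(train, n_users=4)
```

On this toy data the two rankings disagree: the rating count prefers "a" (two positive ratings against one), while the average rating prefers "b" (all of its ratings are positive).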

Conditional (in)dependences between variables
[Figure: a user discovers, likes and rates an item; the rating distribution $p(rated \mid i)$ is the number of ratings per item]
$p(rated \mid i) = p(rated \mid rel, i)\, p(rel \mid i) + p(rated \mid \neg rel, i)\, p(\neg rel \mid i)$
$p(rated \mid rel, i) = p(rated \mid seen, rel, i)\, p(seen \mid rel, i)$
[Figure: three dependence graphs over $seen$, $i$ and $rel$, one per case]
1. Rating depends just on relevance
2. Rating independent from relevance
3. Rating depends both on items and relevance
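The decomposition of the rating distribution can be checked numerically. The sketch below is illustrative only: the parameter names and the example probabilities are made up, and applying the discovery factorization to non-relevant items as well is an assumption by analogy with the relevant case shown on the slide.

```python
def p_rated_given_item(p_rel, p_seen_rel, p_rate_seen_rel,
                       p_seen_nrel, p_rate_seen_nrel):
    """Probability that an item gets rated, by the law of total probability.

    p_rel            = p(rel | i)
    p_seen_rel       = p(seen | rel, i)
    p_rate_seen_rel  = p(rated | seen, rel, i)
    p_seen_nrel      = p(seen | not rel, i)         (assumed analogous)
    p_rate_seen_nrel = p(rated | seen, not rel, i)  (assumed analogous)
    """
    # An item can only be rated after it has been discovered (seen).
    p_rated_rel = p_rate_seen_rel * p_seen_rel
    p_rated_nrel = p_rate_seen_nrel * p_seen_nrel
    return p_rated_rel * p_rel + p_rated_nrel * (1 - p_rel)


# Hypothetical item: liked by 30% of users, discovered more often by
# those who like it, and rated more readily when liked.
example = p_rated_given_item(0.3, 0.5, 0.8, 0.2, 0.1)
```

With these made-up numbers, p(rated | rel, i) = 0.8 × 0.5 = 0.4 and p(rated | ¬rel, i) = 0.1 × 0.2 = 0.02, so p(rated | i) = 0.4 × 0.3 + 0.02 × 0.7 = 0.134.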

Theoretical findings
1. Rating depends only on relevance
Observed and true optimals agree: $pop \propto avg \propto$ optimal
2. Rating independent from relevance
Observed precision: random < $avg$ < $pop$ ∝ optimal
True precision: random < $pop$ < $avg$ ∝ optimal
3. General case: no independence assumption
Monte Carlo: $\mathbb{E}[P@1 \mid \theta] = \int_{\Omega^n} \mathbb{E}[P@1 \mid \theta, \omega]\, d\omega$
Observed precision: random ∼ $avg$ < $pop$ ∼ optimal
True precision: random ≼ $pop$ ≼ $avg$ ≺ optimal
[Figure: observed and true P@1 (scale 0 to 1) for random, optimal, popularity and avg rating; annotations: “Observed and true precision agree”, “Observed and true precision disagree”]
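One way to picture the Monte Carlo step for the general case is the simulation sketched below. It is not the authors' procedure: the uniform sampling of item parameters, the training share `rho`, the closed-form per-item quantities and the simplification that a ranking is scored by its single top item are all assumptions made here for illustration.

```python
import random


def item_stats(p_rel, p_rated_rel, p_rated_nrel, rho):
    """Per-item quantities, assuming a uniformly random training share rho of the ratings."""
    p_rated = p_rated_rel * p_rel + p_rated_nrel * (1 - p_rel)
    p_not_train = 1 - rho * p_rated
    return {
        "pop": rho * p_rated_rel * p_rel,                        # p(rel, train | i)
        "avg": p_rated_rel * p_rel / p_rated,                    # p(rel | train, i)
        "true": p_rel * (1 - rho * p_rated_rel) / p_not_train,   # p(rel | not train, i)
        "observed": (1 - rho) * p_rated_rel * p_rel / p_not_train,  # p(rel, test | not train, i)
    }


def monte_carlo_p_at_1(n_items=50, n_samples=2000, rho=0.8, seed=0):
    """Average observed and true P@1 of four rankers over random item configurations."""
    rng = random.Random(seed)
    rankers = ("random", "pop", "avg", "optimal")
    metrics = ("observed", "true")
    totals = {(r, m): 0.0 for r in rankers for m in metrics}
    for _ in range(n_samples):
        # One sampled configuration: independent uniform parameters per item.
        items = [item_stats(rng.uniform(0.01, 0.99), rng.uniform(0.01, 0.99),
                            rng.uniform(0.01, 0.99), rho)
                 for _ in range(n_items)]
        for metric in metrics:
            picks = {
                "random": rng.choice(items),
                "pop": max(items, key=lambda s: s["pop"]),
                "avg": max(items, key=lambda s: s["avg"]),
                "optimal": max(items, key=lambda s: s[metric]),
            }
            for r in rankers:
                totals[(r, metric)] += picks[r][metric]
    return {key: total / n_samples for key, total in totals.items()}


if __name__ == "__main__":
    for (ranker, metric), value in sorted(monte_carlo_p_at_1().items()):
        print(f"{metric:8s} {ranker:8s} {value:.3f}")
```

Changing how the three item parameters are sampled (for example, tying the rating probability to relevance only, or making it independent of relevance) is one way to explore the special cases listed above within this toy model.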

Experiments
We wish to…
1. Run recommendations by rating count,
average rating, random, optimal
2. Compute observed and true precision
3. See what comes out

Data – Crowdsourced dataset
• We build a dataset free of observational (popularity) bias
1. We sample 1,000 music tracks from deezer.com uniformly at random
2. We ask anonymous workers on CrowdFlower to rate 100 tracks each, sampled uniformly at random
→ 1,000 tracks × ~100 judgments per track = ~100,000 judgments total (~100 judgments per user)
• CM100k dataset available at http://ir.ii.uam.es/cm100k

Computing observed vs. true precision: judgments and ratings
• The 100k judgments cover items the user had seen before and items never seen before
• Judgments on items seen before play the role of “ratings”
• Observed precision: random split of the “ratings” into training (input for the algorithm) and test (relevance judgments)
• True precision estimate: training is the biased input for the algorithm, test is the full relevance judgment sample
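The two evaluation settings can be sketched as follows. This is a hypothetical reconstruction from the slide labels, not the authors' code: the tuple layout, the 80/20 ratio and the choice to leave training pairs out of the true test set are assumptions.

```python
import random


def build_splits(judgments, train_fraction=0.8, seed=0):
    """Build training, observed test and true test sets.

    judgments: list of (user, item, relevant, seen_before) tuples.
    Only judgments on items seen before count as "ratings".
    """
    rng = random.Random(seed)
    ratings = [j for j in judgments if j[3]]
    shuffled = ratings[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    training = shuffled[:n_train]          # biased input for the algorithm
    observed_test = shuffled[n_train:]     # held-out "ratings"
    in_training = {(user, item) for user, item, _rel, _seen in training}
    # Assumption: the true test set is every judgment not used for training.
    true_test = [j for j in judgments if (j[0], j[1]) not in in_training]
    return training, observed_test, true_test


# Toy data: 4 users judge 10 items each; items 0 to 4 were seen before.
judgments = [(u, i, (u + i) % 2 == 0, i < 5) for u in range(4) for i in range(10)]
training, observed_test, true_test = build_splits(judgments)
```

The same trained recommender is then scored twice: against `observed_test` for the observed metric value, and against `true_test` for the estimate of the true value.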

Results
[Figure: nDCG@10 of random, optimal, average rating and popularity; MovieLens 1M (observed only, scale 0 to 0.2) and CM100k (observed and true, scale 0 to 0.06)]
• Observed nDCG@10: similar qualitative outcome on MovieLens 1M and CM100k
• Observed: popularity almost optimal, avg rating < popularity
• True: popularity ∼ random, avg rating > popularity

Test relevance dependence extremes
Shuffling the discovery distribution in different ways…
[Figure: observed and true nDCG@10 (scale 0 to 0.06) of random, optimal, avg rating and popularity, in two panels]
1. Full relevance dependence
• Popularity and avg rating are just ok: random < avg rating < popularity < optimal
• Observed and true roughly agree
2. Relevance independence
• Observed: popularity near optimal, avg rating near random
• True: random < popularity < avg rating < optimal

Implications on personalized algorithms: user-based kNN
[Figure: nDCG@10 of non-normalized kNN (biased to popularity) and normalized kNN (biased to avg rating); observed on MovieLens 1M, observed and true on CM100k (full dependencies)]
• Observed: non-normalized > normalized
• True: non-normalized < normalized

Conclusions
• So… is it good or bad to recommend popular items?
– It depends on the relation between rating, discovery and relevance
– Tends to be good; worst case: complex simultaneous dependencies
– Weak relevance dependence can distort offline measurements
• The average rating can work better than the rating count
– Observed accuracy tends to be unfair to the average rating
• Implications on collaborative filtering algorithms
– Understanding popularity can help improve state-of-the-art algorithms
• Evaluation with random samples can uncover new findings

Future work
• Further questions can be addressed with a similar formal approach
• Undiscovered accuracy → stop by our poster!
• Other data split procedures, e.g. temporal
• Dynamic taste development
• Recommender system in the discovery loop
• Further research on personalized algorithms
• Denser (full?) unbiased judgment matrix

