IRG · IR Group @ UAM
On Target Item Sampling in Offline Recommender System Evaluation
14th ACM Conference on Recommender Systems (RecSys 2020)
Virtual Event, Brazil, September 2020
Rocío Cañamares and Pablo Castells
Universidad Autónoma de Madrid
http://ir.ii.uam.es
Offline evaluation
Is system A better than B?
Offline evaluation
[Diagram: systems A and B are trained on the training data and rank the set of all items (training, unrated, test); metrics computed on the test data decide whether A > B.]
Getting it right
• Correlate with production setting / online evaluation
• Consistent & comparable with other offline experiments
Offline evaluation – Target items
• Exclude items with training data
• Include a certain number of non-relevant items (e.g. to reduce cost)
• Can this change the outcome?
[Diagram: systems A and B rank only a set of target items taken from the set of all items (training, unrated, test); metrics computed on the test data decide whether A > B.]
Target items
[Diagram: a user's items split into training, test (liked / not liked) and unrated, with the target items marked among them.]
Target items
• Largest target set: test + all unrated (all items except training items)
• Smallest target set: test + no unrated (just test ratings)
Target items
• Test + all unrated
• Test + some unrated
• Test + no unrated
May the number of unrated target items affect the evaluation outcome?
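The three target-set options can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `build_target_items`, the toy catalog, and the choice of uniform sampling for the unrated items are all assumptions made for the example.

```python
import random

def build_target_items(all_items, training_items, test_items, n_unrated, seed=0):
    """Target items for one user: the test items plus sampled unrated items.

    n_unrated=0    -> "test + no unrated" (smallest target set)
    n_unrated=None -> "test + all unrated" (all items except training items)
    Uniform sampling of unrated items is an assumption of this sketch.
    """
    rated = set(training_items) | set(test_items)
    unrated = sorted(set(all_items) - rated)  # sorted so the sample is reproducible
    if n_unrated is None or n_unrated >= len(unrated):
        sampled = unrated
    else:
        sampled = random.Random(seed).sample(unrated, n_unrated)
    return set(test_items) | set(sampled)

# Hypothetical toy catalog of 10 items for a single user.
all_items = range(10)
training_items = {0, 1, 2}
test_items = {3, 4}

smallest = build_target_items(all_items, training_items, test_items, 0)
some = build_target_items(all_items, training_items, test_items, 3)
largest = build_target_items(all_items, training_items, test_items, None)
```

Training items never enter the target set, and the test items are always kept, so only the number of unrated items varies between the three cases.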
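Result inconsistency
[Figure: rankings A and B of the same five items (test items, liked or not liked, plus unrated items), evaluated with all unrated items and with no unrated items. With all unrated items the two P@2 values are 0 and 0.5; with no unrated items they are 1 and 0.5, so the comparison between A and B flips from > to <.]
May this affect the evaluation outcome?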
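A small Python sketch can reproduce this kind of flip. The item names and the two rankings below are made up for illustration. They are chosen only so that the P@2 values match the ones on the slide, and which value belongs to A and which to B is an assumption.

```python
def precision_at_k(ranking, liked, k):
    """Fraction of the top-k ranked items that the user liked."""
    return sum(1 for item in ranking[:k] if item in liked) / k

def restrict(ranking, target_items):
    """Keep only the target items, preserving the system's order."""
    return [item for item in ranking if item in target_items]

# Hypothetical user: two liked test items, one disliked test item, two unrated items.
liked = {"L1", "L2"}
test_items = {"L1", "L2", "N1"}
unrated = {"U1", "U2"}

# Hypothetical system outputs over all five items.
ranking_a = ["L1", "N1", "U1", "U2", "L2"]
ranking_b = ["U1", "U2", "L1", "L2", "N1"]

all_unrated = test_items | unrated  # test + all unrated
no_unrated = test_items             # test + no unrated

p_a_all = precision_at_k(restrict(ranking_a, all_unrated), liked, 2)
p_b_all = precision_at_k(restrict(ranking_b, all_unrated), liked, 2)
p_a_none = precision_at_k(restrict(ranking_a, no_unrated), liked, 2)
p_b_none = precision_at_k(restrict(ranking_b, no_unrated), liked, 2)

print(p_a_all, p_b_all)    # A beats B with all unrated items
print(p_a_none, p_b_none)  # B beats A with no unrated items
```

Neither system changed between the two evaluations; only the set of items it was allowed to rank did.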
Result inconsistency
A simple offline experiment on MovieLens 1M
[Figure: P@10 (axis from 0 to 0.6) of 8 systems under the two extreme target sets, all unrated ("Full") and no unrated ("Test").]
8 systems
• iMF (full)
• iMF (test)
• kNN (full/test)
• Normalized kNN (full)
• Normalized kNN (test)
• Average rating
• Popularity
• Random
Result inconsistency
A simple offline experiment on MovieLens 1M
[Figure: the 8 systems ordered from best (1) to worst (8) by P@10 under each target set, all unrated vs. no unrated.]
Kendall τ = 0.14
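For reference, the agreement between two system rankings can be measured with a few lines of Python. This is a sketch of Kendall's tau-a for rankings without ties; the slides do not say which tau variant was used, and the four system names and positions below are hypothetical.

```python
from itertools import combinations

def kendall_tau(rank_x, rank_y):
    """Kendall tau-a between two rankings of the same systems.

    Each ranking is a dict {system: position}; tied positions are assumed
    not to occur (an assumption of this sketch).
    """
    concordant = discordant = 0
    for s, t in combinations(sorted(rank_x), 2):
        agreement = (rank_x[s] - rank_x[t]) * (rank_y[s] - rank_y[t])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    n_pairs = len(rank_x) * (len(rank_x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical rankings of four systems (1 = best).
base = {"s1": 1, "s2": 2, "s3": 3, "s4": 4}
one_swap = {"s1": 2, "s2": 1, "s3": 3, "s4": 4}
reverse = {"s1": 4, "s2": 3, "s3": 2, "s4": 1}

tau_same = kendall_tau(base, base)
tau_swap = kendall_tau(base, one_swap)
tau_reverse = kendall_tau(base, reverse)
```

With 8 systems there are 28 pairs, so under these assumptions a value of about 0.14 would correspond to 16 concordant and 12 discordant pairs (4/28).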
Biased disagreement
Systematic disagreements, depending on the number of unrated target items
Many unrated target items favor:
• Popular items
• Unrated items in objective function
Few unrated target items favor:
• Items with high average rating
• Ignoring unrated items in objective
Which one is right?
Biased disagreement – Is either one right?
Since we want to match online evaluation, let’s compare to unbiased evaluation.
[Diagram: evaluation with few unrated items and evaluation with many unrated items, each compared against unbiased evaluation.]
Yahoo! R3
• MAR (missing at random) ratings → Unbiased evaluation
• MNAR (missing not at random) ratings → Biased evaluation
Comparison to unbiased evaluation
Biased vs. unbiased evaluation with Yahoo! R3
[Figure: system rankings (1 to 8) under no unrated, unbiased and all unrated evaluation; Kendall τ = 0.79 and τ = 0.57 against the unbiased ranking.]
Neither all nor zero unrated targets match unbiased evaluation well
How about something in between?
Let’s explore the target size range…
Comparison to unbiased evaluation – Correlation
[Figure, Yahoo! R3: Kendall τ (0 to 1) with the unbiased system ranking vs. # unrated target items (0, 1, 2, 5, 10, 20, 50, 100, 200, 500, All), with a “sweet spot” marked. Shown next to the system rankings (1 to 8) under no unrated, unbiased and all unrated evaluation, τ = 0.79 and τ = 0.57.]
Comparison to unbiased evaluation – Correlation
[Figure, Yahoo! R3: Kendall τ (0 to 1) vs. # unrated target items (0, 1, 2, 5, 10, 20, 50, 100, 200, 500, All), with a “sweet spot” marked.]
[Figure, MovieLens 1M: # unrated target items (0, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, All); sweet spot unknown (“?”).]
No MAR data
Check discriminative power
Comparison to unbiased evaluation – Discriminative power: ties
[Figure, Yahoo! R3: Kendall τ and # ties vs. # unrated target items (0 to 500, then All), with the “sweet spot” marked.]
[Figure, MovieLens 1M: # ties vs. # unrated target items (0 to 2000, then All); sweet spot?]
Almost opposite monotonicity
Check discriminative power
Comparison to unbiased evaluation – Discriminative power: ties
[Figure, Yahoo! R3: Kendall τ and # ties vs. # unrated target items (0 to 500, then All), with the “sweet spot” marked.]
[Figure, MovieLens 1M: # ties vs. # unrated target items (0 to 2000, then All), with a region marked “Many ties”; sweet spot?]
Almost opposite monotonicity
Why do ties increase in the extremes?
• Few unrated items: small set of items to rank
• Many unrated items: metric → 0 as # unrated → ∞
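Both effects can be seen in a contrived Python example. The slides do not spell out how ties are counted, so this sketch assumes a tie is a user for whom two systems obtain exactly the same metric value; the rankings, item names and the helper `count_ties` are hypothetical.

```python
def precision_at_k(ranking, liked, k):
    """Fraction of the top-k ranked items that the user liked."""
    return sum(1 for item in ranking[:k] if item in liked) / k

def restrict(ranking, target_items):
    """Keep only the target items, preserving the system's order."""
    return [item for item in ranking if item in target_items]

def count_ties(values_a, values_b):
    """Number of users on which both systems get the same metric value
    (an assumed definition of a tie, for illustration only)."""
    return sum(1 for a, b in zip(values_a, values_b) if a == b)

# One hypothetical user: one liked and one disliked test item, six unrated items.
liked = {"L1"}
test_items = {"L1", "N1"}
ranking_a = ["U0", "U1", "L1", "N1", "U2", "U3", "U4", "U5"]
ranking_b = ["U0", "U1", "U4", "U5", "N1", "L1", "U2", "U3"]

target_sets = {
    "no unrated": test_items,
    "some unrated": test_items | {"U4", "U5"},
    "all unrated": test_items | {"U0", "U1", "U2", "U3", "U4", "U5"},
}

ties = {}
for name, targets in target_sets.items():
    p_a = [precision_at_k(restrict(ranking_a, targets), liked, 2)]
    p_b = [precision_at_k(restrict(ranking_b, targets), liked, 2)]
    ties[name] = count_ties(p_a, p_b)
```

With no unrated items both systems can only reorder the same two test items, so P@2 is 0.5 for both. With all unrated items both top-2 lists are filled with unrated items, so P@2 is 0 for both. Only the intermediate target set separates the two systems.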
Comparison to unbiased evaluation – Discriminative power: 𝑝-values
[Figure, Yahoo! R3: Kendall τ, # ties and sum of 𝑝-values vs. # unrated target items (0 to 500, then All), with the “sweet spot” marked.]
[Figure, MovieLens 1M: # ties and sum of 𝑝-values vs. # unrated target items (0 to 2000, then All); sweet spot?]
The number of ties seems more informative than 𝑝-values
Loss of coverage
Small target sets can easily cause incomplete rankings
Risk of highly misleading results depending on how the metric deals with this
[Diagram: with all unrated items the ranking extends past the metric cutoff; with no unrated items it ends before the cutoff (coverage loss).]
[Figures, Yahoo! R3 and MovieLens 1M: Coverage@10 (0 to 1) vs. # unrated target items (Yahoo! R3: 0 to 500, then All; MovieLens 1M: 0 to 2000, then All); kNN with small 𝑘 is called out in both panels.]
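The dependence on how the metric handles short rankings can be made concrete with a short Python sketch. The slides do not define Coverage@10, so the definition used here (average fraction of the cutoff positions that are filled) is an assumption, as are the `pad` parameter and the toy ranking.

```python
def coverage_at_k(rankings, k):
    """Average fraction of the k cutoff positions that are actually filled
    (an assumed definition, for illustration only)."""
    return sum(min(len(r), k) / k for r in rankings) / len(rankings)

def precision_at_k(ranking, liked, k, pad=True):
    """P@k for a possibly incomplete ranking.

    pad=True  divides by k, so missing positions count as misses.
    pad=False divides by the number of items actually returned.
    """
    top = ranking[:k]
    hits = sum(1 for item in top if item in liked)
    denominator = k if pad else max(len(top), 1)
    return hits / denominator

# Hypothetical user whose target set holds only two test items,
# evaluated at a cutoff of 4.
liked = {"L1"}
short_ranking = ["L1", "N1"]

coverage = coverage_at_k([short_ranking], 4)
p_padded = precision_at_k(short_ranking, liked, 4, pad=True)
p_unpadded = precision_at_k(short_ranking, liked, 4, pad=False)
```

The same ranking scores 0.25 or 0.5 depending only on the convention for unfilled positions, which is the kind of discrepancy the slide warns about.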
Conclusion
• Different target sets produce different evaluation outcomes
– The disagreements are systematic on specific algorithms and configurations
• Weakness of small target sets
– More difficult to produce different rankings → discrimination power loss
– Incomplete rankings
• Weakness of large target sets
– Exposure to observation bias (popularity or any other MNAR bias)
– More difficult to produce metric values > 0 → discrimination power loss
• Neither seems ideal! Sweet spot → balance
• Tie analysis can provide helpful orientation
Future work
• Target items introduce a pre-filter that may alter the evaluated algorithms
– Different target sampling distributions (e.g. popularity)
– Different split protocols (e.g. temporal) also affect this
• Further research on offline evaluation bias
– Does unbiased Yahoo! R3 match a real setting?
• Also check out:
– Krichene & Rendle, On Sampled Metrics for Item Recommendation, KDD 2020
– Li et al., On Sampling Top-K Recommendation Evaluation, KDD 2020

More Related Content

Similar to RecSys 2020 - On Target Item Sampling in Offline Recommender System Evaluation

Opticon 2017 Decisions at Scale
Opticon 2017 Decisions at ScaleOpticon 2017 Decisions at Scale
Opticon 2017 Decisions at Scale
Optimizely
 
Monetization - The Right Business Model for Your Digital Assets
Monetization - The Right Business Model for Your Digital AssetsMonetization - The Right Business Model for Your Digital Assets
Monetization - The Right Business Model for Your Digital Assets
Apigee | Google Cloud
 
TargetSummit Berlin - Lovoo Lele Canfora
TargetSummit Berlin -  Lovoo Lele CanforaTargetSummit Berlin -  Lovoo Lele Canfora
TargetSummit Berlin - Lovoo Lele Canfora
TargetSummit
 
IRJET- Sentiment Analysis of Customer Reviews on Laptop Products for Flip...
IRJET-  	  Sentiment Analysis of Customer Reviews on Laptop Products for Flip...IRJET-  	  Sentiment Analysis of Customer Reviews on Laptop Products for Flip...
IRJET- Sentiment Analysis of Customer Reviews on Laptop Products for Flip...
IRJET Journal
 
Artificial Intelligence in Action
Artificial Intelligence in ActionArtificial Intelligence in Action
Artificial Intelligence in Action
Benjamin Ejzenberg
 
Beyond Simple A/B testing
Beyond Simple A/B testingBeyond Simple A/B testing
Beyond Simple A/B testing
Ratio
 
What's Next: Cloudy with a chance of AI 3
What's Next: Cloudy with a chance of AI 3What's Next: Cloudy with a chance of AI 3
What's Next: Cloudy with a chance of AI 3
Ogilvy Consulting
 
Google Analytics location data visualised with CARTO & BigQuery
Google Analytics location data visualised with CARTO & BigQueryGoogle Analytics location data visualised with CARTO & BigQuery
Google Analytics location data visualised with CARTO & BigQuery
CARTO
 
Intro to Data Analytics with Oscar's Director of Product
 Intro to Data Analytics with Oscar's Director of Product Intro to Data Analytics with Oscar's Director of Product
Intro to Data Analytics with Oscar's Director of Product
Product School
 
Building Analytics for Growth
Building Analytics for GrowthBuilding Analytics for Growth
Building Analytics for Growth
Kareem Azees
 
Wrong Tool, Wrong Time: Re-Thinking Test Automation -- w/ State of Visual Tes...
Wrong Tool, Wrong Time: Re-Thinking Test Automation -- w/ State of Visual Tes...Wrong Tool, Wrong Time: Re-Thinking Test Automation -- w/ State of Visual Tes...
Wrong Tool, Wrong Time: Re-Thinking Test Automation -- w/ State of Visual Tes...
Applitools
 
Get Scrappy: Start Measuring Customer LTV With Digital
Get Scrappy: Start Measuring Customer LTV With DigitalGet Scrappy: Start Measuring Customer LTV With Digital
Get Scrappy: Start Measuring Customer LTV With Digital
Joshua Stauffer
 
Intelligence Data Day 2020
Intelligence Data Day 2020Intelligence Data Day 2020
Intelligence Data Day 2020
Patrick Deglon
 
Digital Decisioning for the New Decade - 2020 and Beyond
Digital Decisioning for the New Decade - 2020 and BeyondDigital Decisioning for the New Decade - 2020 and Beyond
Digital Decisioning for the New Decade - 2020 and Beyond
SCL HUB Conference
 
A/B Testing Data-Driven Algorithms in the Cloud - Webinar
A/B Testing Data-Driven Algorithms in the Cloud - WebinarA/B Testing Data-Driven Algorithms in the Cloud - Webinar
A/B Testing Data-Driven Algorithms in the Cloud - Webinar
Roberto Turrin
 
Peak Ace on Air #31 - Apple's iOS 14.5 Update
Peak Ace on Air #31 - Apple's iOS 14.5 UpdatePeak Ace on Air #31 - Apple's iOS 14.5 Update
Peak Ace on Air #31 - Apple's iOS 14.5 Update
Paul Drägert
 
Supervised vs Unsupervised vs Reinforcement Learning | Edureka
Supervised vs Unsupervised vs Reinforcement Learning | EdurekaSupervised vs Unsupervised vs Reinforcement Learning | Edureka
Supervised vs Unsupervised vs Reinforcement Learning | Edureka
Edureka!
 
Projects
ProjectsProjects
MeasureCamp #10 - WTF are Related Products in Google Analytics Ecommerce?
MeasureCamp #10 - WTF are Related Products in Google Analytics Ecommerce?MeasureCamp #10 - WTF are Related Products in Google Analytics Ecommerce?
MeasureCamp #10 - WTF are Related Products in Google Analytics Ecommerce?
Michaela Linhart
 

Similar to RecSys 2020 - On Target Item Sampling in Offline Recommender System Evaluation (20)

Opticon 2017 Decisions at Scale
Opticon 2017 Decisions at ScaleOpticon 2017 Decisions at Scale
Opticon 2017 Decisions at Scale
 
Monetization - The Right Business Model for Your Digital Assets
Monetization - The Right Business Model for Your Digital AssetsMonetization - The Right Business Model for Your Digital Assets
Monetization - The Right Business Model for Your Digital Assets
 
TargetSummit Berlin - Lovoo Lele Canfora
TargetSummit Berlin -  Lovoo Lele CanforaTargetSummit Berlin -  Lovoo Lele Canfora
TargetSummit Berlin - Lovoo Lele Canfora
 
IRJET- Sentiment Analysis of Customer Reviews on Laptop Products for Flip...
IRJET-  	  Sentiment Analysis of Customer Reviews on Laptop Products for Flip...IRJET-  	  Sentiment Analysis of Customer Reviews on Laptop Products for Flip...
IRJET- Sentiment Analysis of Customer Reviews on Laptop Products for Flip...
 
Series A Deck
Series A DeckSeries A Deck
Series A Deck
 
Artificial Intelligence in Action
Artificial Intelligence in ActionArtificial Intelligence in Action
Artificial Intelligence in Action
 
Beyond Simple A/B testing
Beyond Simple A/B testingBeyond Simple A/B testing
Beyond Simple A/B testing
 
What's Next: Cloudy with a chance of AI 3
What's Next: Cloudy with a chance of AI 3What's Next: Cloudy with a chance of AI 3
What's Next: Cloudy with a chance of AI 3
 
Google Analytics location data visualised with CARTO & BigQuery
Google Analytics location data visualised with CARTO & BigQueryGoogle Analytics location data visualised with CARTO & BigQuery
Google Analytics location data visualised with CARTO & BigQuery
 
Intro to Data Analytics with Oscar's Director of Product
 Intro to Data Analytics with Oscar's Director of Product Intro to Data Analytics with Oscar's Director of Product
Intro to Data Analytics with Oscar's Director of Product
 
Building Analytics for Growth
Building Analytics for GrowthBuilding Analytics for Growth
Building Analytics for Growth
 
Wrong Tool, Wrong Time: Re-Thinking Test Automation -- w/ State of Visual Tes...
Wrong Tool, Wrong Time: Re-Thinking Test Automation -- w/ State of Visual Tes...Wrong Tool, Wrong Time: Re-Thinking Test Automation -- w/ State of Visual Tes...
Wrong Tool, Wrong Time: Re-Thinking Test Automation -- w/ State of Visual Tes...
 
Get Scrappy: Start Measuring Customer LTV With Digital
Get Scrappy: Start Measuring Customer LTV With DigitalGet Scrappy: Start Measuring Customer LTV With Digital
Get Scrappy: Start Measuring Customer LTV With Digital
 
Intelligence Data Day 2020
Intelligence Data Day 2020Intelligence Data Day 2020
Intelligence Data Day 2020
 
Digital Decisioning for the New Decade - 2020 and Beyond
Digital Decisioning for the New Decade - 2020 and BeyondDigital Decisioning for the New Decade - 2020 and Beyond
Digital Decisioning for the New Decade - 2020 and Beyond
 
A/B Testing Data-Driven Algorithms in the Cloud - Webinar
A/B Testing Data-Driven Algorithms in the Cloud - WebinarA/B Testing Data-Driven Algorithms in the Cloud - Webinar
A/B Testing Data-Driven Algorithms in the Cloud - Webinar
 
Peak Ace on Air #31 - Apple's iOS 14.5 Update
Peak Ace on Air #31 - Apple's iOS 14.5 UpdatePeak Ace on Air #31 - Apple's iOS 14.5 Update
Peak Ace on Air #31 - Apple's iOS 14.5 Update
 
Supervised vs Unsupervised vs Reinforcement Learning | Edureka
Supervised vs Unsupervised vs Reinforcement Learning | EdurekaSupervised vs Unsupervised vs Reinforcement Learning | Edureka
Supervised vs Unsupervised vs Reinforcement Learning | Edureka
 
Projects
ProjectsProjects
Projects
 
MeasureCamp #10 - WTF are Related Products in Google Analytics Ecommerce?
MeasureCamp #10 - WTF are Related Products in Google Analytics Ecommerce?MeasureCamp #10 - WTF are Related Products in Google Analytics Ecommerce?
MeasureCamp #10 - WTF are Related Products in Google Analytics Ecommerce?
 

More from Pablo Castells

REVEAL @ RecSys 2018 - Characterization of Fair Experiments for Recommender S...
REVEAL @ RecSys 2018 - Characterization of Fair Experiments for Recommender S...REVEAL @ RecSys 2018 - Characterization of Fair Experiments for Recommender S...
REVEAL @ RecSys 2018 - Characterization of Fair Experiments for Recommender S...
Pablo Castells
 
SIGIR 2017 - A Probabilistic Reformulation of Memory-Based Collaborative Filt...
SIGIR 2017 - A Probabilistic Reformulation of Memory-Based Collaborative Filt...SIGIR 2017 - A Probabilistic Reformulation of Memory-Based Collaborative Filt...
SIGIR 2017 - A Probabilistic Reformulation of Memory-Based Collaborative Filt...
Pablo Castells
 
RSWeb @ ACM RecSys 2014 - Exploring social network effects on popularity bias...
RSWeb @ ACM RecSys 2014 - Exploring social network effects on popularity bias...RSWeb @ ACM RecSys 2014 - Exploring social network effects on popularity bias...
RSWeb @ ACM RecSys 2014 - Exploring social network effects on popularity bias...
Pablo Castells
 
SIGIR 2011 Poster - Intent-Oriented Diversity in Recommender Systems
SIGIR 2011 Poster - Intent-Oriented Diversity in Recommender SystemsSIGIR 2011 Poster - Intent-Oriented Diversity in Recommender Systems
SIGIR 2011 Poster - Intent-Oriented Diversity in Recommender Systems
Pablo Castells
 
SIGIR 2012 - Explicit Relevance Models in Intent-Oriented Information Retrie...
SIGIR 2012 - Explicit Relevance Models in Intent-Oriented  Information Retrie...SIGIR 2012 - Explicit Relevance Models in Intent-Oriented  Information Retrie...
SIGIR 2012 - Explicit Relevance Models in Intent-Oriented Information Retrie...
Pablo Castells
 
ACM RecSys 2011 - Rank and Relevance in Novelty and Diversity Metrics for Rec...
ACM RecSys 2011 - Rank and Relevance in Novelty and Diversity Metrics for Rec...ACM RecSys 2011 - Rank and Relevance in Novelty and Diversity Metrics for Rec...
ACM RecSys 2011 - Rank and Relevance in Novelty and Diversity Metrics for Rec...
Pablo Castells
 

More from Pablo Castells (6)

REVEAL @ RecSys 2018 - Characterization of Fair Experiments for Recommender S...
REVEAL @ RecSys 2018 - Characterization of Fair Experiments for Recommender S...REVEAL @ RecSys 2018 - Characterization of Fair Experiments for Recommender S...
REVEAL @ RecSys 2018 - Characterization of Fair Experiments for Recommender S...
 
SIGIR 2017 - A Probabilistic Reformulation of Memory-Based Collaborative Filt...
SIGIR 2017 - A Probabilistic Reformulation of Memory-Based Collaborative Filt...SIGIR 2017 - A Probabilistic Reformulation of Memory-Based Collaborative Filt...
SIGIR 2017 - A Probabilistic Reformulation of Memory-Based Collaborative Filt...
 
RSWeb @ ACM RecSys 2014 - Exploring social network effects on popularity bias...
RSWeb @ ACM RecSys 2014 - Exploring social network effects on popularity bias...RSWeb @ ACM RecSys 2014 - Exploring social network effects on popularity bias...
RSWeb @ ACM RecSys 2014 - Exploring social network effects on popularity bias...
 
SIGIR 2011 Poster - Intent-Oriented Diversity in Recommender Systems
SIGIR 2011 Poster - Intent-Oriented Diversity in Recommender SystemsSIGIR 2011 Poster - Intent-Oriented Diversity in Recommender Systems
SIGIR 2011 Poster - Intent-Oriented Diversity in Recommender Systems
 
SIGIR 2012 - Explicit Relevance Models in Intent-Oriented Information Retrie...
SIGIR 2012 - Explicit Relevance Models in Intent-Oriented  Information Retrie...SIGIR 2012 - Explicit Relevance Models in Intent-Oriented  Information Retrie...
SIGIR 2012 - Explicit Relevance Models in Intent-Oriented Information Retrie...
 
ACM RecSys 2011 - Rank and Relevance in Novelty and Diversity Metrics for Rec...
ACM RecSys 2011 - Rank and Relevance in Novelty and Diversity Metrics for Rec...ACM RecSys 2011 - Rank and Relevance in Novelty and Diversity Metrics for Rec...
ACM RecSys 2011 - Rank and Relevance in Novelty and Diversity Metrics for Rec...
 

Recently uploaded

一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 

Recently uploaded (20)

一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 

RecSys 2020 - On Target Item Sampling in Offline Recommender System Evaluation

  • 1. IRGIRGroup @UAM On Target Item Sampling in Offline Recommender System Evaluation 14th ACM Conference on Recommender Systems (RecSys 2020) Virtual Event, Brazil, September 2020 On Target Item Sampling in Offline Recommender System Evaluation 14th ACM Conference on Recommender Systems (RecSys 2020) Rocío Cañamares and Pablo Castells Universidad Autónoma de Madrid http://ir.ii.uam.es Virtual Event, Brazil, September 2020
  • 2. IRGIRGroup @UAM On Target Item Sampling in Offline Recommender System Evaluation 14th ACM Conference on Recommender Systems (RecSys 2020) Virtual Event, Brazil, September 2020 Offline evaluation Is system A better than B? A B
  • 3. IRGIRGroup @UAM On Target Item Sampling in Offline Recommender System Evaluation 14th ACM Conference on Recommender Systems (RecSys 2020) Virtual Event, Brazil, September 2020 ··· ··· Offline evaluation Rank Compute metrics Test data Training Unrated Test Set of all items Getting it right • Correlate with production setting / online evaluation • Consistent & comparable with other offline experiments A > B? A B Training data
  • 4. Offline evaluation: target items. Exclude items with training data. Include a certain number of non-relevant items (e.g. to reduce cost). Can this change the outcome? [Diagram: systems A and B rank only the target items selected from the set of all items (training, unrated, test); metrics computed on the test data decide whether A > B.]
  • 5. Target items. [Diagram: the items are split into training, test (liked / not liked) and unrated; the target items are chosen among them.]
  • 6. Target items. Largest target set: test + all unrated (all items except training items). Smallest target set: test + no unrated (just test ratings). [Legend: test (liked / not liked), unrated.]
  • 7. Target items. Three options: test + all unrated, test + some unrated, test + no unrated. May the number of unrated target items affect the evaluation outcome? [Legend: test (liked / not liked), unrated.]
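
The three target-set options on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `build_target_items`, its parameters, and the use of uniform random sampling for the "some unrated" case are all assumptions made for the example.

```python
import random


def build_target_items(all_items, training_items, test_items, n_unrated, seed=0):
    """Return one user's target set: test items plus sampled unrated items.

    n_unrated=0 gives the smallest set (just test ratings); n_unrated=None
    gives the largest set (all items except training items).
    Uniform sampling is an assumption of this sketch.
    """
    # Unrated items have no rating in training or test for this user.
    unrated = sorted(set(all_items) - set(training_items) - set(test_items))
    if n_unrated is None or n_unrated >= len(unrated):
        sampled = unrated
    else:
        sampled = random.Random(seed).sample(unrated, n_unrated)
    return set(test_items) | set(sampled)


# Hypothetical toy user: 10 items, 2 in training, 2 in test, 6 unrated.
all_items = range(10)
training = {0, 1}
test = {2, 3}

smallest = build_target_items(all_items, training, test, 0)
largest = build_target_items(all_items, training, test, None)
some = build_target_items(all_items, training, test, 3)
```

Training items never enter the target set, and the test items are always kept, so only the number of unrated items varies between the three configurations.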
  • 8. Result inconsistency. [Diagram: rankings A and B over five items (test liked / not liked, unrated).] With all unrated items as targets the two rankings get P@2 = 0 and P@2 = 0.5; with no unrated items they get P@2 = 1 and P@2 = 0.5, so the comparison between A and B is reversed. May this affect the evaluation outcome?
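
A small worked example makes the reversal concrete. The item layout below is hypothetical (the slide's exact rankings are not recoverable from the transcript); it was chosen so that the P@2 values match the ones on the slide. `precision_at_k` is an illustrative helper, not taken from the talk.

```python
def precision_at_k(ranking, liked, targets, k):
    """P@k after restricting a system's ranking to the target items."""
    # Keep only target items, preserving the system's order, then cut at k.
    filtered = [item for item in ranking if item in targets]
    return sum(item in liked for item in filtered[:k]) / k


# Hypothetical user: two liked test items, one not-liked test item, two unrated.
liked = {"L1", "L2"}
test = {"L1", "L2", "N1"}
unrated = {"U1", "U2"}

ranking_a = ["U1", "U2", "L1", "L2", "N1"]
ranking_b = ["L1", "N1", "U1", "L2", "U2"]

all_unrated = test | unrated  # largest target set
no_unrated = test             # smallest target set

p_a_full = precision_at_k(ranking_a, liked, all_unrated, 2)  # 0.0
p_b_full = precision_at_k(ranking_b, liked, all_unrated, 2)  # 0.5
p_a_test = precision_at_k(ranking_a, liked, no_unrated, 2)   # 1.0
p_b_test = precision_at_k(ranking_b, liked, no_unrated, 2)   # 0.5
```

In this toy case system B wins when all unrated items are targets and system A wins when none are, even though neither ranking changed.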
  • 9. Result inconsistency. A simple offline experiment on MovieLens 1M. [Figure: P@10 of 8 systems with all unrated (full) vs. no unrated (test) target items. Systems: iMF (full), iMF (test), kNN (full/test), normalized kNN (full), normalized kNN (test), average rating, popularity, random.]
  • 10. Result inconsistency. A simple offline experiment on MovieLens 1M. [Figure: the 8 systems ranked by P@10 from best (1) to worst (8), with all unrated vs. no unrated target items. Systems: iMF (full), iMF (test), kNN (full/test), normalized kNN (full), normalized kNN (test), average rating, most popular, random.]
  • 11. Result inconsistency. A simple offline experiment on MovieLens 1M. [Figure: the 8 systems ranked from best to worst with no unrated vs. all unrated target items.] Kendall τ = 0.14 between the two system rankings.
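
For readers unfamiliar with the statistic, Kendall τ between two system rankings can be computed as below. This is a plain tau-a sketch that assumes no tied positions; the function name and the four-system rankings are hypothetical and do not reproduce the slide's data.

```python
from itertools import combinations


def kendall_tau(rank_x, rank_y):
    """Kendall tau-a between two rankings given as dicts system -> position.

    Assumes both dicts cover the same systems and contain no tied positions.
    """
    systems = list(rank_x)
    concordant = discordant = 0
    for a, b in combinations(systems, 2):
        # A pair is concordant when both rankings order it the same way.
        sign = (rank_x[a] - rank_x[b]) * (rank_y[a] - rank_y[b])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical rankings of four systems under two target-set configurations.
full_ranking = {"s1": 1, "s2": 2, "s3": 3, "s4": 4}
test_ranking = {"s1": 2, "s2": 1, "s3": 3, "s4": 4}
reversed_ranking = {"s1": 4, "s2": 3, "s3": 2, "s4": 1}

tau_swap = kendall_tau(full_ranking, test_ranking)         # one swapped pair out of six
tau_reversed = kendall_tau(full_ranking, reversed_ranking)  # fully reversed order
```

A value near 1 means the two configurations rank the systems almost identically, a value near 0 means the rankings are close to unrelated, and -1 means a full reversal.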
  • 12. Biased disagreement. The disagreements are systematic. With many unrated target items, the evaluation favors popular items and algorithms that include unrated items in the objective function. With few unrated target items, it favors items with high average rating and algorithms that ignore unrated items in the objective.
  • 13. Biased disagreement. Many unrated target items: popular items; unrated items in the objective function. Few unrated target items: items with high average rating; ignoring unrated items in the objective. Which one is right?
  • 14. Biased disagreement: is either one right? Which one is right, few unrated items or many unrated items? Since we want to match online evaluation, let's compare to unbiased evaluation. Yahoo! R3: MAR ratings → unbiased evaluation; MNAR ratings → biased evaluation.
  • 15. Comparison to unbiased evaluation. Biased vs. unbiased evaluation with Yahoo! R3. [Figure: system ranking (8 systems) with no unrated targets, with all unrated targets, and under unbiased evaluation; τ = 0.79 and τ = 0.57.] Neither all nor zero unrated targets match unbiased evaluation well. How about something in between? Let's explore the target size range…
  • 16. Comparison to unbiased evaluation: correlation. [Figure, Yahoo! R3: Kendall τ with the unbiased system ranking vs. # unrated target items (0, 1, 2, 5, 10, 20, 50, 100, 200, 500, all), with a "sweet spot" marked; inset repeats the system rankings for no unrated, all unrated and unbiased evaluation, τ = 0.79 and τ = 0.57.]
  • 17. Comparison to unbiased evaluation: correlation. [Figure, Yahoo! R3: Kendall τ vs. # unrated target items (0 to 500, all), with a "sweet spot" marked. Figure, MovieLens 1M: # unrated target items (0 to 2000, all), marked "?".] MovieLens 1M has no MAR data. Check discriminative power.
  • 18. Comparison to unbiased evaluation: discriminative power (ties). [Figure, Yahoo! R3: # ties and Kendall τ vs. # unrated target items (0 to 500, all), with the "sweet spot" marked. Figure, MovieLens 1M: # ties vs. # unrated target items (0 to 2000, all), marked "Sweet spot?".] Almost opposite monotonicity between # ties and Kendall τ.
  • 19. Comparison to unbiased evaluation: discriminative power (ties). Why do ties increase in the extremes? Few unrated items: small set of items to rank. Many unrated items: metric → 0 as # unrated → ∞. [Figure, Yahoo! R3: # ties and Kendall τ vs. # unrated target items, with the "sweet spot" marked; almost opposite monotonicity. Figure, MovieLens 1M: # ties vs. # unrated target items, marked "Sweet spot?"; many ties at the extremes.]
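
The tie count used as a proxy for discriminative power can be sketched as follows. The slides do not spell out the exact definition, so this example assumes that a tie is a user for whom two systems obtain exactly the same metric value; `count_ties` and the per-user values are hypothetical.

```python
from itertools import combinations


def count_ties(metric_by_system):
    """Count (system pair, user) combinations with identical metric values.

    metric_by_system maps a system name to its per-user metric values,
    with users listed in the same order for every system.
    """
    ties = 0
    for a, b in combinations(metric_by_system, 2):
        ties += sum(x == y for x, y in zip(metric_by_system[a], metric_by_system[b]))
    return ties


# Hypothetical per-user P@2 values for three systems and three users.
per_user_p2 = {
    "A": [0.0, 0.5, 1.0],
    "B": [0.0, 0.5, 0.5],
    "C": [0.0, 0.0, 1.0],
}
n_ties = count_ties(per_user_p2)  # A-B: 2 ties, A-C: 2 ties, B-C: 1 tie
```

Under this reading, both extremes push the count up: with very few target items the systems have little room to produce different rankings, and with very many the metric collapses to 0 for most users, so systems tie at zero.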
  • 20. Comparison to unbiased evaluation: discriminative power (p-values). [Figure, Yahoo! R3: sum of p-values, # ties and Kendall τ vs. # unrated target items (0 to 500, all), with the "sweet spot" marked. Figure, MovieLens 1M: sum of p-values and # ties vs. # unrated target items (0 to 2000, all), marked "Sweet spot?".] The number of ties seems more informative than p-values.
  • 21. Loss of coverage. Small target sets can easily cause incomplete rankings. Risk of highly misleading results depending on how the metric deals with this. [Diagram: with no unrated targets a ranking can end before the metric cutoff (coverage loss), unlike with all unrated targets. Figures, Yahoo! R3 and MovieLens 1M: Coverage@10 vs. # unrated target items; kNN with small k is highlighted in both.]
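
A coverage check along these lines can flag incomplete rankings before metrics are compared. The definition used here is an assumption of this sketch (the fraction of users whose filtered ranking still has at least k items), since the slide only plots Coverage@10 without defining it; all names and data are hypothetical.

```python
def coverage_at_k(rankings_by_user, targets_by_user, k):
    """Fraction of users whose ranking holds at least k target items.

    Assumed definition for illustration; the talk may define coverage differently.
    """
    covered = 0
    for user, ranking in rankings_by_user.items():
        # Restrict the system's output to this user's target items.
        filtered = [item for item in ranking if item in targets_by_user[user]]
        if len(filtered) >= k:
            covered += 1
    return covered / len(rankings_by_user)


# Hypothetical output of a system that cannot score every item.
rankings = {"u1": ["a", "b", "c", "d"], "u2": ["a", "e"]}
targets = {"u1": {"a", "b", "c"}, "u2": {"a"}}

cov_at_2 = coverage_at_k(rankings, targets, 2)  # only u1 reaches the cutoff
cov_at_1 = coverage_at_k(rankings, targets, 1)  # both users reach the cutoff
```

How a metric treats the missing positions (as non-relevant, or by shrinking the cutoff) then determines how misleading the comparison can become.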
  • 22. Conclusion
      • Different target sets produce different evaluation outcomes
          – The disagreements are systematic on specific algorithms and configurations
      • Weakness of small target sets
          – More difficult to produce different rankings → discrimination power loss
          – Incomplete rankings
      • Weakness of large target sets
          – Exposure to observation bias (popularity or any other MNAR bias)
          – More difficult to produce metric values > 0 → discrimination power loss
      • Neither seems ideal! Sweet spot → balance
      • Tie analysis can provide helpful orientation
  • 23. Future work
      • Target items introduce a pre-filter that may alter the evaluated algorithms
          – Different target sampling distributions (e.g. popularity)
          – Different split protocols (e.g. temporal) also affect this
      • Further research on offline evaluation bias
          – Does unbiased Yahoo! R3 match a real setting?
      • Also check out:
          – Krichene & Rendle, On Sampled Metrics for Item Recommendation, KDD 2020
          – Li et al., On Sampling Top-K Recommendation Evaluation, KDD 2020