This document discusses target item sampling in offline recommender system evaluation. It finds that different target item sets (e.g., all unrated items vs. no unrated items) can produce inconsistent evaluation outcomes and system rankings. Small target sets have low discriminative power due to many ties, while large target sets are exposed to popularity and other observation biases. Analyzing the number of ties provides useful guidance. The recommended approach is to select a "sweet spot" target set size that balances discriminative power and bias.
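As a quick illustration of the setup the deck evaluates, here is a minimal sketch (our own naming, not the authors' code) of target-set construction and precision over a target set: the test items are always included, and a configurable number of unrated items is sampled in.

```python
import random

def build_target_set(test_items, unrated_items, n_unrated):
    """Target set = test items plus a sample of n_unrated unrated items.
    n_unrated = 0 gives "no unrated"; n_unrated = len(unrated_items) gives "all unrated"."""
    sample = random.sample(list(unrated_items), min(n_unrated, len(unrated_items)))
    return set(test_items) | set(sample)

def precision_at_k(scores, target_items, relevant_items, k=10):
    """P@k over a target set: rank only the target items by the system's
    scores and count how many of the top k are relevant test items."""
    ranking = sorted(target_items, key=lambda i: scores.get(i, 0.0), reverse=True)
    return sum(1 for item in ranking[:k] if item in relevant_items) / k
```

Running the same system through `precision_at_k` with target sets of different sizes is exactly the experiment the deck explores.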
Simple TRIZ Function Attribute Analysis in combination with a "what if" questioning algorithm provides a powerful framework to enlist what can go wrong in systems
How do we protect privacy of users when building large-scale AI based systems? How do we develop machine learned models and systems taking fairness, accountability, and transparency into account? With the ongoing explosive growth of AI/ML models and systems, these are some of the ethical, legal, and technical challenges encountered by researchers and practitioners alike. In this talk, we will first motivate the need for adopting a "fairness and privacy by design" approach when developing AI/ML models and systems for different consumer and enterprise applications. We will then focus on the application of fairness-aware machine learning and privacy-preserving data mining techniques in practice, by presenting case studies spanning different LinkedIn applications (such as fairness-aware talent search ranking, privacy-preserving analytics, and LinkedIn Salary privacy & security design), and conclude with the key takeaways and open challenges.
Monetization - The Right Business Model for Your Digital Assets (Apigee | Google Cloud)
As enterprises build and grow their mobile value chain with app, data and API technologies, digital assets become not only a competitive advantage, but also a source of revenue.
Join Anita Paul and Bryan Kirschner as they discuss the opportunities for value creation presented by APIs and data, share monetization models that apply to any industry, and explain how Apigee Monetization Services can help you deliver on the right business model for your digital assets.
We will discuss:
- The business context in the new digital world
- Business use cases and revenue opportunities
- How Apigee Monetization Services changes the game
How can you implement machine learning and artificial intelligence without having to build your own? In this webinar, we explore APIs, how companies are providing them "as-a-service" and how Ogilvy is applying machine learning to shopper reviews.
Intro to Data Analytics with Oscar's Director of Product (Product School)
The Director of Product at Oscar, Vasudev Vadlamudi, went over key types of quantitative analysis that B2C product managers use on the job, including funnels, cohorts, and A/B testing. For each one, he looked into when and why it is used, illustrated with examples.
A guide to helping start-ups with building and operating data systems for growth. Highlights what makes a good metric, how to define the right metrics for your business, and then how to build data infrastructure so that you can collect the relevant data.
Presented at Grow Camp 2018 in MaRS Discovery District in Toronto, Canada.
Wrong Tool, Wrong Time: Re-Thinking Test Automation -- w/ State of Visual Tes... (Applitools)
Full webinar recording:
Go through this presentation and on-demand session to learn: What Are The World’s Most Innovative Testing Teams Doing That You Are Not?
As much as we all hate to admit it, our test automation efforts are struggling. Coverage is dropping. Bugs are escaping to production. Our apps are visually complex, growing rapidly, delivered continuously, and changing constantly - so much so that our functional framework is now bloated, broken, and unable to keep up with Agile and CI-CD release best practices.
No wonder that in our latest State of Visual Testing research, the majority of companies surveyed reported that their CI-CD and automation processes are not helping them to successfully compete in today's fast-paced ecosystem, and are not effective in ensuring software quality in a scalable and robust way.
But what about those elite testing teams that got it right? What's their secret? Can we copy what they did, instead of setting ourselves to fail?
With this presentation, and on-demand session discussing it, learn how the 10% of the world’s most innovative testing teams have reinvented their test automation to support a fully automated CI-CD process, and guaranteed their company's digital transformation was a success.
Use these resources to learn:
-- Why the majority of test automation efforts are falling behind
-- How your QA and testing efforts compare to these elite teams -- via live polling results
-- 4 modern techniques that the top 10% of testing teams globally are doing every day, and that you can do too
A/B Testing Data-Driven Algorithms in the Cloud - Webinar (Roberto Turrin)
We present how A/B testing can be used to evaluate the performance of Machine Learning algorithms.
We explore the different evaluation approaches - from offline evaluation to online evaluation - with a particular focus on long-term KPIs and on the recent Cloud-based technologies that can facilitate the development and integration of A/B testing.
Supervised vs Unsupervised vs Reinforcement Learning | Edureka (Edureka!)
YouTube: https://youtu.be/xtOg44r6dsE
(** Python Data Science Training: https://www.edureka.co/python **)
In this PPT on Supervised vs Unsupervised vs Reinforcement learning, we’ll be discussing the types of machine learning and we’ll differentiate them based on a few key parameters. The following topics are covered in this session:
1. Introduction to Machine Learning
2. Types of Machine Learning
3. Supervised vs Unsupervised vs Reinforcement learning
4. Use Cases
Python Training Playlist: https://goo.gl/Na1p9G
Python Blog Series: https://bit.ly/2RVzcVE
Follow us to never miss an update in the future.
YouTube: https://www.youtube.com/user/edurekaIN
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Learnbay provides industry-accredited data science courses in Bangalore. We understand how technologies combine in the field of data science, so we offer courses covering machine learning, TensorFlow, IBM Watson, Google Cloud Platform, Tableau, Hadoop, time series, R, and Python, with authentic real-time industry projects. Students are certified by IBM, and hundreds of students have been placed in promising companies in data science roles. With Learnbay you can reach one of the most sought-after jobs of the present and future.
The Learnbay data science course covers Data Science with Python, Artificial Intelligence with Python, and Deep Learning using TensorFlow. These topics are covered and co-developed with IBM.
SIGIR 2011 Poster - Intent-Oriented Diversity in Recommender Systems (Pablo Castells)
Diversity as a relevant dimension of retrieval quality is receiving increasing attention in the Information Retrieval and Recommender Systems (RS) fields. The problem has nonetheless been approached under different views and formulations in IR and RS respectively, giving rise to different models, methodologies, and metrics, with little convergence between both fields. In this poster we explore the adaptation of diversity metrics, techniques, and principles from ad hoc IR to the recommendation task, by introducing the notion of user profile aspect as an analogue of query intent. As a particular approach, user aspects are automatically extracted from latent item features. Empirical results support the proposed approach and provide further insights.
SIGIR 2012 - Explicit Relevance Models in Intent-Oriented Information Retrie... (Pablo Castells)
The intent-oriented search diversification methods developed in the field so far tend to build on generative views of the retrieval system to be diversified. Core algorithm components –in particular redundancy assessment– are expressed in terms of the probability to observe documents, rather than the probability that the documents be relevant. This has been sometimes described as a view considering the selection of a single document in the underlying task model. In this paper we propose an alternative formulation of aspect-based diversification algorithms which explicitly includes a formal relevance model. We develop means for the effective computation of the new formulation, and we test the resulting algorithm empirically. We report experiments on search and recommendation tasks showing competitive or better performance than the original diversification algorithms. The relevance-based formulation has further interesting properties, such as unifying two well-known state of the art algorithms into a single version. The relevance-based approach opens alternative possibilities for further formal connections and developments as natural extensions of the framework. We illustrate this by modeling tolerance to redundancy as an explicit configurable parameter, which can be set to better suit the characteristics of the IR task, or the evaluation metrics, as we illustrate empirically.
ACM RecSys 2011 - Rank and Relevance in Novelty and Diversity Metrics for Rec... (Pablo Castells)
Slides of the paper presentation at RecSys 2011.
Abstract: The Recommender Systems community is paying increasing attention to novelty and diversity as key qualities beyond accuracy in real recommendation scenarios. Despite the rise of interest and work on the topic in recent years, we find that a clear common methodological and conceptual ground for the evaluation of these dimensions is still to be consolidated. Different evaluation metrics have been reported in the literature but the precise relation, distinction or equivalence between them has not been explicitly studied. Furthermore, the metrics reported so far miss important properties such as taking into consideration the ranking of recommended items, or whether items are relevant or not, when assessing the novelty and diversity of recommendations.
We present a formal framework for the definition of novelty and diversity metrics that unifies and generalizes several state of the art metrics. We identify three essential ground concepts at the roots of novelty and diversity: choice, discovery and relevance, upon which the framework is built. Item rank and relevance are introduced through a probabilistic recommendation browsing model, building upon the same three basic concepts. Based on the combination of ground elements, and the assumptions of the browsing model, different metrics and variants unfold. We report experimental observations which validate and illustrate the properties of the proposed metrics.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Adjusting primitives for graph : SHORT REPORT / NOTES (Subhajit Sahu)
Notes on adjusting primitives for graph algorithms, like PageRank. Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is…
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
RecSys 2020 - On Target Item Sampling in Offline Recommender System Evaluation
1. On Target Item Sampling in Offline Recommender System Evaluation
Rocío Cañamares and Pablo Castells
IR Group @ UAM, Universidad Autónoma de Madrid
http://ir.ii.uam.es
14th ACM Conference on Recommender Systems (RecSys 2020)
Virtual Event, Brazil, September 2020
2. Offline evaluation
Is system A better than B?
3. Offline evaluation
(Diagram: each system ranks the set of all items – training, test, and unrated – and metrics are computed against the held-out test data to decide whether A > B.)
Getting it right:
• Correlate with production setting / online evaluation
• Consistent & comparable with other offline experiments
4. Offline evaluation – Target items
(Diagram: systems rank a subset of target items drawn from the set of all items, rather than the full item set; metrics are still computed on the test data to decide whether A > B.)
• Exclude items with training data
• Include a certain number of non-relevant items (e.g. to reduce cost)
• Can this change the outcome?
5. Target items
(Diagram: legend of item types – test items, split into liked and not liked, plus training and unrated items – and the target item set drawn from them.)
6. Target items
Test + all unrated: all items except training items – the largest target set.
Test + no unrated: just the test ratings – the smallest target set.
7. Target items
Test + some unrated: anything between the two extremes.
May the number of unrated target items affect the evaluation outcome?
8. Result inconsistency
(Worked example: the same two rankings A and B, with items marked as test – liked or not liked – or unrated, evaluated under the two extreme target sets.)
All unrated: ranking A gets P@2 = 0.5, ranking B gets P@2 = 0 → A > B
No unrated: ranking A gets P@2 = 0.5, ranking B gets P@2 = 1 → A < B
May this affect the evaluation outcome?
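The inversion above can be reproduced in a few lines. A minimal sketch, with illustrative rankings chosen to match the P@2 values on the slide (the labels and helper are ours, not the authors' code): removing unrated items before cutting the ranking at k flips which system wins.

```python
LIKED, NOT_LIKED, UNRATED = "liked", "not_liked", "unrated"

def p_at_k(ranking, k, include_unrated=True):
    """P@k counting liked items as hits. With include_unrated=False,
    unrated items are removed before cutting at k ("no unrated" targets)."""
    if not include_unrated:
        ranking = [label for label in ranking if label != UNRATED]
    return sum(1 for label in ranking[:k] if label == LIKED) / k

# Two fixed recommendation lists, labeled by item status.
ranking_a = [UNRATED, LIKED, NOT_LIKED, UNRATED, LIKED]
ranking_b = [UNRATED, UNRATED, LIKED, LIKED, NOT_LIKED]

print(p_at_k(ranking_a, 2), p_at_k(ranking_b, 2))                # 0.5 0.0 -> A > B
print(p_at_k(ranking_a, 2, False), p_at_k(ranking_b, 2, False))  # 0.5 1.0 -> A < B
```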
9. Result inconsistency
A simple offline experiment on MovieLens 1M with 8 systems: iMF (full), iMF (test), kNN (full/test), Normalized kNN (full), Normalized kNN (test), Average rating, Popularity, Random.
(Bar charts: P@10 of each system, on the full and test splits, under "all unrated" vs. "no unrated" target sets; P@10 axis from 0 to 0.6.)
10. Result inconsistency
(Figure: the 8 systems of the MovieLens 1M experiment, ordered from best to worst by P@10 under "all unrated" vs. "no unrated" targets; the two orderings differ substantially.)
11. Result inconsistency
(Same figure as the previous slide: best-to-worst system orderings under "all unrated" vs. "no unrated" targets on MovieLens 1M.)
The two rankings barely agree: Kendall τ = 0.14.
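Rank agreement between two evaluations can be quantified exactly as on the slide. A minimal sketch using scipy's kendalltau; the P@10 scores below are placeholders for illustration, not the paper's numbers.

```python
from scipy.stats import kendalltau

systems = ["iMF (full)", "iMF (test)", "kNN (full/test)", "Normalized kNN (full)",
           "Normalized kNN (test)", "Average rating", "Popularity", "Random"]

# Placeholder P@10 values under each target set (illustrative only).
p10_all_unrated = dict(zip(systems, [0.21, 0.14, 0.18, 0.12, 0.10, 0.05, 0.19, 0.01]))
p10_no_unrated  = dict(zip(systems, [0.35, 0.55, 0.42, 0.44, 0.48, 0.58, 0.33, 0.52]))

def positions(scores):
    """Rank position of every system (0 = best) under a given evaluation."""
    order = sorted(systems, key=scores.get, reverse=True)
    return [order.index(s) for s in systems]

tau, _ = kendalltau(positions(p10_all_unrated), positions(p10_no_unrated))
print(f"Kendall tau = {tau:.2f}")  # low tau means the two evaluations disagree
```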
12. Biased disagreement
The disagreements between target set sizes are systematic.
With many unrated target items, the evaluation favors: popular items; systems with unrated items in their objective function.
With few unrated target items, it favors: items with high average rating; systems ignoring unrated items in their objective.
13. Biased disagreement
(Same systematic disagreements as the previous slide.)
Which one is right?
14. Biased disagreement – Is either one right?
Since we want to match online evaluation, let's compare to an unbiased evaluation.
Yahoo! R3: MAR (missing at random) ratings → unbiased evaluation; MNAR (missing not at random) ratings → biased evaluation.
(Diagram: system rankings with few vs. many unrated items, each compared against the unbiased evaluation.)
15. Comparison to unbiased evaluation
Biased vs. unbiased evaluation with Yahoo! R3.
(Figure: system rankings under "no unrated" and "all unrated" targets, each compared to the unbiased ranking: τ = 0.79 for no unrated, τ = 0.57 for all unrated.)
Neither all nor zero unrated targets match unbiased evaluation well.
How about something in between? Let's explore the target size range…
16. Comparison to unbiased evaluation – Correlation
(Plot: Kendall τ between the sampled-target system ranking and the unbiased ranking on Yahoo! R3, as a function of the number of unrated target items: 0, 1, 2, 5, 10, 20, 50, 100, 200, 500, all. The correlation peaks at an intermediate target size: the "sweet spot".)
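The sweep behind this plot is straightforward to sketch. A minimal version, assuming a hypothetical evaluate(system, n_unrated) that returns a system's average metric when ranking test items plus n_unrated sampled unrated items, and an unbiased_order list holding the MAR-based system ranking (neither helper is from the authors' code):

```python
from scipy.stats import kendalltau

def tau_vs_unbiased(systems, evaluate, unbiased_order, n_unrated):
    """Kendall tau between the system ranking induced by a sampled target
    set of a given size and the reference unbiased ranking."""
    scores = {s: evaluate(s, n_unrated) for s in systems}
    sampled_order = sorted(systems, key=scores.get, reverse=True)
    tau, _ = kendalltau([sampled_order.index(s) for s in systems],
                        [unbiased_order.index(s) for s in systems])
    return tau

target_sizes = [0, 1, 2, 5, 10, 20, 50, 100, 200, 500]  # plus "all"
# curve = [tau_vs_unbiased(systems, evaluate, unbiased_order, n) for n in target_sizes]
# The "sweet spot" is the size maximizing correlation with the unbiased ranking.
```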
17. Comparison to unbiased evaluation – Correlation
(Plots: Kendall τ vs. number of unrated target items, on Yahoo! R3 with the "sweet spot" marked, and on MovieLens 1M with target sizes up to 2000 and full.)
MovieLens 1M has no MAR data, so there is no unbiased reference: where is its sweet spot?
Check discriminative power instead.
18. Comparison to unbiased evaluation – Discriminative power: ties
(Plots: number of tied system pairs vs. number of unrated target items, on Yahoo! R3 and MovieLens 1M, overlaid with the Kendall τ curve. The tie count shows almost opposite monotonicity to the correlation: ties are fewest around the Yahoo! R3 sweet spot, suggesting a sweet spot for MovieLens 1M as well.)
19. Comparison to unbiased evaluation – Discriminative power: ties
Why do ties increase in the extremes?
• Few unrated items: small set of items to rank
• Many unrated items: metric → 0 as # unrated → ∞
(Same plots as the previous slide: many ties at both ends of the target size range.)
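Counting ties is easy to operationalize. A minimal sketch under the assumption that a "tie" means two systems obtaining the same average metric value (the paper may use a different tie criterion):

```python
from itertools import combinations

def count_ties(metric_by_system, tol=1e-12):
    """Number of system pairs whose average metric values are equal
    (within tol) and hence indistinguishable by the experiment."""
    return sum(1 for a, b in combinations(metric_by_system.values(), 2)
               if abs(a - b) <= tol)

# Example: with a tiny target set, P@10 collapses to a few possible values
# and many systems tie; these numbers are made up for illustration.
print(count_ties({"A": 0.1, "B": 0.1, "C": 0.0, "D": 0.1}))  # -> 3 tied pairs
```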
20. Comparison to unbiased evaluation – Discriminative power: 𝑝-values
(Plots: sum of 𝑝-values over system pairs and number of ties vs. number of unrated target items, on Yahoo! R3 and MovieLens 1M, with the Kendall τ curve.)
The number of ties seems more informative than 𝑝-values.
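The p-value curve can be sketched similarly. The slide does not specify which significance test is used; a paired t-test over per-user metric values is one common choice, assumed here:

```python
from itertools import combinations
from scipy.stats import ttest_rel

def sum_of_p_values(per_user_metric):
    """Sum of paired-test p-values over all system pairs (lower means the
    experiment statistically separates more pairs of systems)."""
    total = 0.0
    for a, b in combinations(per_user_metric, 2):
        _, p = ttest_rel(per_user_metric[a], per_user_metric[b])
        total += p
    return total

# Per-user P@10 values for three systems (made-up numbers, same user order).
scores = {"A": [0.2, 0.4, 0.1, 0.3], "B": [0.1, 0.3, 0.1, 0.2], "C": [0.0, 0.1, 0.0, 0.1]}
print(sum_of_p_values(scores))
```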
21. Loss of coverage
Small target sets can easily cause incomplete rankings: fewer target items than the metric cutoff.
Risk of highly misleading results depending on how the metric deals with this.
(Plots: Coverage@10 vs. number of unrated target items, on Yahoo! R3 and MovieLens 1M; coverage drops sharply for small target sets, most visibly for kNN with small 𝑘.)
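One plausible definition of Coverage@k (the paper's exact definition may differ) is the fraction of the top-k slots a system can actually fill, averaged over users; a minimal sketch:

```python
def coverage_at_k(ranking_lengths, k=10):
    """Fraction of the recommendation cutoff actually filled, averaged over
    users: with few target items a system may return fewer than k items."""
    return sum(min(n, k) for n in ranking_lengths) / (k * len(ranking_lengths))

# E.g. with only 2 test items and no unrated targets, every user's ranking
# has at most 2 items, so Coverage@10 <= 0.2 regardless of the system.
print(coverage_at_k([2, 2, 2], k=10))  # 0.2
```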
22. Conclusion
Different target sets produce different evaluation outcomes
– The disagreements are systematic on specific algorithms and configurations
Weakness of small target sets
– More difficult to produce different rankings → loss of discriminative power
– Incomplete rankings
Weakness of large target sets
– Exposure to observation bias (popularity or any other MNAR bias)
– More difficult to produce metric values > 0 → loss of discriminative power
Neither seems ideal! Sweet spot → balance.
Tie analysis can provide helpful orientation.
23. Future work
Target items introduce a pre-filter that may alter the evaluated algorithms
– Different target sampling distributions (e.g. popularity)
– Different split protocols (e.g. temporal) also affect this
Further research on offline evaluation bias
– Does unbiased Yahoo! R3 match a real setting?
Also check out:
– Krichene & Rendle, On Sampled Metrics for Item Recommendation, KDD 2020
– Li et al., On Sampling Top-K Recommendation Evaluation, KDD 2020