Slides for my lecture on IR evaluation, presented at the 11th European Summer School in Information Retrieval (ESSIR 2017) at Universitat Pompeu Fabra, Barcelona.
These slides were based on
1. Evaluation lecture @ QMUL; Thomas Roelleke & Mounia Lalmas
2. Lecture 8: Evaluation @ Stanford University; Pandu Nayak & Prabhakar Raghavan
3. Retrieval Evaluation @ University of Virginia; Hongning Wang
4. Lectures 11 and 12 on Evaluation @ Berkeley; Ray Larson
5. Evaluation of Information Retrieval Systems @ Penn State University; Lee Giles
Textbooks:
1. Information Retrieval, 2nd edition, C.J. van Rijsbergen (1979)
2. Introduction to Information Retrieval, C.D. Manning, P. Raghavan & H. Schuetze (2008)
3. Modern Information Retrieval: The Concepts and Technology behind Search, 2nd ed; R. Baeza-Yates & B. Ribeiro-Neto (2011)
An introduction to system-oriented evaluation in Information Retrieval
1. An introduction to system-oriented evaluation
in Information Retrieval
Mounia Lalmas
2. Outline
o What to evaluate in IR
o Test collection methodology
- Document, information need, query, relevance
- TREC
o Precision and recall
- Average precision, interpolated, mean average precision (MAP)
- P@r, R-Precision, MRR
- E and F measures
o Other measures (DCG, bpref)
o Significance testing
o Large-scale evaluation (web search & clicks)
o Evaluating classifiers
2
Information Retrieval = IR
IR vs. Search
3. Outline
o What to evaluate in IR
o Test collection methodology
- Document, information need, query, relevance
- TREC
o Precision and recall
- Average precision, interpolated, mean average precision (MAP)
- P@r, R-Precision, MRR
- E and F measures
o Other measures (DCG, bpref)
o Significance testing
o Large-scale evaluation (web search & clicks)
o Evaluating classifiers
3
Information Retrieval = IR
IR vs. Search
4. Evaluation in general versus evaluation in IR
o Evaluating a system in computer science is often concerned with
time and space → system performance
o With large collections of documents, system performance is still very
important
o However, in IR, we care a lot about retrieval performance: are the
retrieved documents “relevant” to a “user information need”?
4
5. Why do we need to evaluate an IR system?
o The user wants to find recipes about
“couscous” as cooked in various
countries
o User uses 2 IR systems
o How can we say which one is better?
5
6. Acknowledgements
6
These slides were based on
- Evaluation lecture @ QMUL; Thomas Roelleke & Mounia Lalmas
- Lecture 8: Evaluation @ Stanford University; Pandu Nayak & Prabhakar Raghavan
- Retrieval Evaluation @ University of Virginia; Hongning Wang
- Lectures 11 and 12 on Evaluation @ Berkeley; Ray Larson
- Evaluation of Information Retrieval Systems @ Penn State University; Lee Giles
o Information Retrieval, 2nd edition, C.J. van Rijsbergen (1979)
o Introduction to Information Retrieval, C.D. Manning, P. Raghavan & H. Schuetze (2008)
o Modern Information Retrieval: The Concepts and Technology behind Search, 2nd edition; R.
Baeza-Yates & B. Ribeiro-Neto (2011)
7. What to evaluate in IR
o coverage of the collection: extent to which the system includes
relevant material
o time lag (efficiency): average interval between the time a query is
submitted and the answer is given
o presentation of the output
o effort involved by user in obtaining answers to a query
o recall of the system: proportion of relevant documents retrieved
o precision of the system: proportion of the retrieved documents that
are actually relevant
7
8. o coverage has to do with the quality of the collection
o efficiency in terms of speed, memory usage, etc
o presentation has to do with interface and visualisation
issues
o effort has to do with user issues, e.g. user satisfaction.
o recall and precision have to do with retrieval effectiveness
or effectiveness for short → system-oriented evaluation
8
What to evaluate in IR
9. System-oriented evaluation
o Measuring effectiveness has been the most predominant in IR evaluation
o Test collection methodology
- Benchmark (dataset) upon which effectiveness is measured and compared
- Dataset tells for a given query what are the relevant documents
o Metrics to measure effectiveness
- Precision and recall, and variants
- E and F measures
- Others (DCG, bpref)
9
10. Test Collection methodology
o Compare retrieval performance using a test collection
- Document collection, that is the documents themselves. The document collection
depends on the task, e.g. evaluating web retrieval requires a collection of HTML
documents.
- Queries, which simulate real user information needs.
- Relevance judgements, stating for a query the relevant documents.
o To compare the performance of two techniques:
- each technique used to answer queries
- results (set or ranked list) compared using some effectiveness performance measure
- most common measures are precision and recall
o Usually use multiple measures to get different views of performance
o Usually test with multiple collections as performance can be collection
dependent
10
11. Information need, query and relevance
o The information need is translated into a query
o Relevance is assessed relative to the information need not the query
- Information need: I am looking for information on what are the best places to go on
holiday near the beach and play tennis
- Query: tennis beach holiday
- Evaluate whether the document addresses the information need, not whether it has the
three words “tennis”, “beach” and “holiday”
Sec. 8.1
11
12. Relevance … as defined in system-oriented
evaluation
o A document is relevant if it “has significant and demonstrable bearing
on the matter at hand”.
o There are common assumptions about the nature of relevance in
system-centred evaluation:
- Objectivity: everybody agrees on whether a document is relevant or not to a
query
- Topicality: relevance is about whether the document is about the topic
expressed in the query
- Binary nature: either a document is relevant or not
- Independence: the fact that a document is relevant to a query has no effect
of the relevance of another document for that same query
12
13. Relevance is difficult to define satisfactorily
o A document is relevant within the context of a query
- Who judges the relevance? → humans are not very consistent (see next slide)
- Is the document useful? → Utility
- Judgment on whether a document is relevant or not depends on more than the document
and query
o With real collections, we never know the full set of relevant documents
o Retrieval model incorporates notion of relevance
- Satisfiability of a logical expression in Boolean model
- P(relevance | query, document) in BIRM
- Similarity to query in VSM
- P(query generated | document model) in LM
13
14. Kappa measure for inter-judge relevance
agreement
o Kappa measure
- Agreement measure among judges (assessing document
relevance)
- Designed for categorical judgments (relevant or not)
o Kappa = [ P(A) – P(E) ] / [ 1 – P(E) ]
o P(A) – proportion of time judges agree
o P(E) – what agreement would be by chance
o Kappa = 0 for chance agreement, 1 for total agreement
Sec. 8.5
14
15. Kappa Measure: Example
Number of documents assessed   Judge 1        Judge 2
300                            Relevant       Relevant        (judges agree)
70                             Non-relevant   Non-relevant    (judges agree)
20                             Relevant       Non-relevant    (judges disagree)
10                             Non-relevant   Relevant        (judges disagree)
Sec. 8.5
15
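The slide with the worked kappa computation (slide 16) is not in this transcript. As a minimal Python sketch, assuming chance agreement is estimated from the pooled marginals of the two judges (as in the Manning et al. textbook), the counts above give kappa of roughly 0.78:

# Kappa for the example: 300 rel/rel, 70 nonrel/nonrel, 20 rel/nonrel, 10 nonrel/rel.
agree_rel, agree_nonrel, rel_nonrel, nonrel_rel = 300, 70, 20, 10
total = agree_rel + agree_nonrel + rel_nonrel + nonrel_rel        # 400 judged pairs

p_agree = (agree_rel + agree_nonrel) / total                      # P(A) = 370/400 = 0.925

# Chance agreement P(E) from the pooled marginals of both judges.
p_rel = (2 * agree_rel + rel_nonrel + nonrel_rel) / (2 * total)   # pooled P(relevant) = 0.7875
p_nonrel = 1 - p_rel                                              # pooled P(non-relevant) = 0.2125
p_chance = p_rel ** 2 + p_nonrel ** 2                             # about 0.665

kappa = (p_agree - p_chance) / (1 - p_chance)
print(round(kappa, 3))                                            # about 0.776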
17. Impact of inter-judge agreement on IR systems
comparisons
o Impact on absolute effectiveness performance measure can be
significant (0.32 vs 0.39)
o But little impact on ranking of different systems or relative
effectiveness performance
o If we just want to know if IR system A is better than IR system B
→ test collection methodology gives reliable comparison
Sec. 8.5
17
18. Find the relevant documents in the collection
o Did the IR system find all relevant documents?
o To answer accurately, we need complete judgments
- i.e., “yes,” “no,” or some score for every query-document pair
o For small test collections, we can review all documents for all queries
o Not practical for large or even medium-sized collection
- TREC collections have millions of documents
o Pooling method
o Click-based evaluation in web search (later in the lecture)
18
19. Test collection creation
o Manual method:
- Every document in the collection is judged against every query by one of several judges
(human assessors)
- This is feasible for small document collections.
o Pooling method (used for large document collection):
- The queries are run against several IR systems first
- The top, for example 100, documents retrieved by each system are pooled together
- The pool is then judged for relevance (by human assessors)
- This is what TREC does
o Query logs (web search) → see later about “evaluation with clicks”
19
20. Sample test collections (ad hoc retrieval)
Characteristics                  Cranfield   CACM     ISI     West       TREC2
Collection size (docs)           1400        3204     1460    11953      742611
Collection size (MB)             1.5         2.3      2.2     254        2162
Year created                     1968        1983     1983    1990       1991
Unique stems                     8226        5493     5448    196707     1040415
Stem occurrences                 123200      117578   98304   21798833   243800000
Max within document frequency    27          27       1309
Mean document length (words)     88          36.5     67.3    1823       328
Number of queries                225         50       35      44         100
20
ad hoc retrieval: query, document, ranking
21. CIS
o 1239 documents about cystic fibrosis from MEDLINE collection
o Fields: author, title, source, major and minor subjects, abstracts, references and
citations
o 100 queries, developed by relevance judges
o Unusual features:
- 4 judges per document per query (3 experts,
1 medical bibliographer)
- 3 levels of relevance (0-2)
- Combined relevance on scale of 0-8
[Table: combinations of the individual judges' 0-2 ratings and the resulting combined relevance values]
21
Added so we do not forget history
22. CACM
o 3204 articles on computer science from CACM, 1958 - 1979
o Fields: author, date, word stems for titles and abstracts, categories, direct
referencing, bibliography coupling, number of co-citations for each pair of articles
o 52 queries, each with 2 Boolean formulations
o Unusual features:
- Citation links to other documents, so often used for hypertext-type
experiments
22
Added so we do not forget history
23. TREC o Text REtrieval Conference/
Competition
- http://trec.nist.gov
- Run by NIST (National
Institute of Standards &
Technology)
o Collections: > Terabytes,
o Datasets
- Newswire & full text news
(AP, WSJ, Ziff, FT)
- Government documents
(federal register,
Congressional Record)
- Radio Transcripts (FBIS)
- Web “subsets”
- …
23
25. Queries & relevance judgments at TREC
o Queries devised and judged by “information
specialists” → TREC Topics
o Relevance judgments done only for those documents
retrieved and not entire collection!
- E.g. merge top 100 retrieved documents from systems experimented
with (TREC participants)
- Pooling method
25
26. Example (excerpt) of a TREC document
<doc>
<docno> WSJ880406-0090 </docno>
<hl> AT&T Unveils Services to Upgrade Phone Networks
Under Global Plan </hl>
<author> Janet Guyon (WSJ Staff) </author>
<dateline> New York </dateline>
<text>
American Telephone & Telegraph Co. introduced the rest of a new generation of phone
services with broad ...
</text>
</doc>
26
27. Example (excerpt) of a TREC topic
<top>
<num> Number: 168 </docno>
<title> Topic: Financing AMTRAK
<desc> Description
A document will address the role of the Federal Government in financing the operation of
the National Railroad Transportation Corporation (AMTRAK)
<nar> Narrative:
A relevant document must provide information on the government's responsibility to make
AMTRAK an economically viable entity. It could also discuss the privatisation of AMTRAK
as an alternative to continuing government subsidies. Documents comparing government
subsidies given to air and bus transportation with those provided to AMTRAK would also be
relevant.
</top>
27
28. TREC legacy
o Pros:
- made research systems scale to large collections (pre-WWW)
- allows for controlled comparisons
o Cons:
- emphasis on high recall, often unrealistic for what most users want → but
recall-oriented search exists (patent retrieval, e-discovery)
- very long queries, unrealistic → systems optimized for long queries and
hence perform worse for shorter, more realistic queries
- focus on batch ranking (one-off result) rather than interaction (but session track
was introduced to evaluate a “user search session”)
28
29. Other evaluation forums
o CLEF (Cross-Language Evaluation Forum)
o NTCIR (NII Testbeds and Community for Information access Research)
o FIRE (Forum for Information Retrieval Evaluation)
o INEX (The Initiative for the Evaluation of XML retrieval)
29
30. Effectiveness
o We recall that the goal of an IR system is to retrieve as
many relevant documents as possible and as few non-
relevant documents as possible.
o Evaluating the above consists of a comparative evaluation
of technical performance of IR system(s):
- In traditional IR, technical performance means the effectiveness of the IR
system: the ability of the IR system to retrieve relevant documents and
suppress non-relevant documents
- Effectiveness is measured by the combination of recall and precision
30
31. Intuition behind precision and recall
o Collection of 10,000 documents, 50 relevant to a given topic
o Ideal system finds these 50 documents and reject all others
o An actual system likely identifies 25 documents; 20 are relevant
and 5 are on other topics
Precision: 20/25 = 0.8 (80% of retrieved document are relevant)
Recall: 20/50 = 0.4 (40% of the relevant document are found)
31
32. Measuring Precision and Recall
Precision is easy to measure:
o Look at each document retrieved and decide whether it is relevant or not
o In previous example, only the 25 documents that are found need to be
examined
Recall is difficult to measure:
o To know all relevant items, we must go through entire collection, looking
at every document to decide if it is relevant or not
o In previous example, all 10,000 documents must be examined! → remember
the pooling method at TREC
32
33. Recall / Precision
[Figure: Venn diagram of the document collection, the retrieved set, the relevant set, and their intersection (retrieved and relevant)]
Knowing which documents are relevant to which queries comes from the test collection
For a given query, the document collection can be divided into three sets:
the set of retrieved document, the set of relevant documents,
and the rest of the documents.
33
34. Recall / Precision
In the ideal case, the set of retrieved documents is equal to the set of relevant
documents. However, in most cases, the two sets will be different.
This difference is formally measured with precision and recall.
34
[Figure: Venn diagram of the retrieved set, the relevant set, and their intersection]
precision = (number of relevant documents retrieved) / (number of documents retrieved)
recall = (number of relevant documents retrieved) / (number of relevant documents)
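As a minimal sketch (not part of the original slides), the two definitions can be computed directly from the retrieved and relevant document sets; the hypothetical document ids below only serve to reproduce the earlier intuition example (25 retrieved, 20 of them relevant, 50 relevant overall):

def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a relevant set."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

# Illustrative ids: d1..d25 retrieved, d6..d55 relevant (overlap of 20 documents).
retrieved = [f"d{i}" for i in range(1, 26)]
relevant = [f"d{i}" for i in range(6, 56)]
p, r = precision_recall(retrieved, relevant)
print(p, r)   # 0.8, 0.4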
35. Retrieved vs. Relevant Documents
[Figure: retrieved vs. relevant sets: very high precision, very low recall]
35
A high precision rate is achieved by returning documents that we know for sure
are relevant → Is this a good idea?
36. Retrieved vs. Relevant Documents
[Figure: retrieved vs. relevant sets: high recall, but low precision]
36
100% recall can be achieved by returning all documents in the collection
→ This is for sure a bad idea!
37. Retrieved vs. Relevant Documents
[Figure: retrieved vs. relevant sets: very low precision, very low recall (0 for both)]
37
Total failure!
38. Retrieved vs. Relevant Documents
[Figure: retrieved vs. relevant sets: high precision, high recall]
38
The perfect scenario!
39. Recall and Precision
The above two measures do not take into account where the relevant documents
are retrieved, that is, at which rank (crucial since the output of most IR systems
is a ranked list of documents).
This is very important because an effective IR system should not only retrieve
as many relevant documents as possible and as few non-relevant documents as
possible, but also it should retrieve relevant documents before the non-relevant
ones.
39
precision = (number of relevant documents retrieved) / (number of documents retrieved)
recall = (number of relevant documents retrieved) / (number of relevant documents)
40. Recall and Precision
o Let us assume that for a given query, the following documents are relevant (10 relevant
documents)
{d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
o Now suppose that the following documents are retrieved for that query:
For each relevant document (in red bold), we calculate the precision value and the recall value. For
example, for d56, we have 3 retrieved documents, and 2 among them are relevant, so the precision
is 2/3. We have 2 of the relevant documents so far retrieved (the total number of relevant documents
being 10), so recall is 2/10.
rank   doc    precision   recall
1      d123   1/1         1/10
2      d84
3      d56    2/3         2/10
4      d6
5      d8
6      d9     3/6         3/10
7      d511
8      d129
9      d187
10     d25    4/10        4/10
11     d48
12     d250
13     d113
14     d3     5/14        5/10
40
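A minimal Python sketch (not from the slides) that recomputes the precision/recall pairs in the table above from the ranked list and the relevance judgements:

relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511",
           "d129", "d187", "d25", "d48", "d250", "d113", "d3"]

hits = 0
for rank, doc in enumerate(ranking, start=1):
    if doc in relevant:
        hits += 1
        # precision at this rank and recall so far, e.g. d56 -> precision 2/3, recall 2/10
        print(doc, f"precision={hits}/{rank}", f"recall={hits}/{len(relevant)}")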
41. Recall and Precision
o For each query, we obtain pairs of recall and precision values
- In our example, we would obtain
(1/10, 1/1) (2/10, 2/3) (3/10, 3/6) (4/10, 4/10) (5/10, 5/14) …
which are usually expressed in %
(10%, 100%) (20%, 66.66%) (30%, 50%) (40%, 40%), (50%, 35.71%) …
- This can be read for instance: at 20% recall, we have 66.66% precision; at 50%
recall, we have 35.71% precision
The pairs of values are
plotted into a graph, which
has the following curve
[Figure: precision (%) vs. recall (%) curve for the example query]
41
42. Recall and Precision
o We have shown how to derive the recall and precision curve for a
given query
o Now we describe how, using the above for all queries, the
effectiveness of an IR system is evaluated and thus compared to
other IR systems.
o Note that we can also compare the same system, but with different
versions (e.g. different parameters are used). The idea here is to
find out the best version of the IR system.
42
43. The complete methodology
For each IR system / IR system version
1. For each query in the test collection
a. We first run the query against the system to obtain a ranked list of retrieved
documents
b. We use the ranking and relevance judgements to calculate recall/precision pairs
2. Then we average recall / precision values across all queries, to
obtain an overall measure of the effectiveness
43
44. Averaging across queries
o Hard to compare precision and recall graphs or tables for
individual queries (too much data)
- Need to average over many queries
o Two main types of averaging
- Macro-average: each query is a point in the average
- Micro-average: each relevant document is a point in the average
- Macro is mostly used (all queries count equally)
44
45. (Macro) Interpolated average precision
o Average precision at standard recall points
o For a given query, compute precision and recall point for every relevant
document
o Interpolate precision at standard recall levels
- 11-pt is usually 100%, 90%, 80%, ..., 10%, 0%
o Average over all queries to get average precision at each recall level
45
46. Interpolation
[Figure: observed precision-recall values and their interpolation to standard recall levels]
It is often the case that recall values are not given for standard recall values (10%,
20%, ….). We therefore need to interpolate to obtain standard recall values.
For example, the
value is 25%, and is
interpolated to the
nearest standard
recall value on the
right, that is 30%.
46
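As an illustrative sketch (not necessarily the exact rule used on the slide, which assigns an observed point to the nearest standard level on the right), one common way to realise the interpolation is the TREC-style rule: the interpolated precision at a standard recall level is the highest precision observed at that recall level or beyond.

def interpolated_precision(pr_pairs, levels=None):
    """11-point interpolated precision from observed (recall, precision) pairs."""
    if levels is None:
        levels = [i / 10 for i in range(11)]          # 0.0, 0.1, ..., 1.0
    return {level: max((p for r, p in pr_pairs if r >= level), default=0.0)
            for level in levels}

# Observed pairs from the example query of slide 40.
pairs = [(0.1, 1.0), (0.2, 2/3), (0.3, 0.5), (0.4, 0.4), (0.5, 5/14)]
print(interpolated_precision(pairs))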
47. Interpolated average precision
[Figure: interpolated precision-recall curves for query 1, query 2, and their average]
We have precision values at standard recall values for two queries. The
precision values for query 1 are higher than those for query 2. This means that the
effectiveness of the IR system is better for query 1 than for query 2. We can plot
the average of the two queries.
47
48. Averaging
The same information
can be displayed in
a table.
48
Precision in %
Recall in %   Query 1   Query 2   Average
10            80        60        70
20            80        50        65
30            60        40        50
40            60        30        45
50            40        25        32.5
60            40        20        30
70            30        15        22.5
80            30        10        20
90            20        5         12.5
100           20        5         12.5
49. Comparison of systems
[Figure: precision-recall curves for system 1 and system 2]
We can now compare IR systems / system versions. For example, here we see that at low
recall, system 2 is better than system 1, but this changes from recall value 30%, etc. It is
common to calculate an average precision value across all recall levels, so that to have a
single value to compare.
49
50. Averaging across averages
o Average interpolated recall levels to get single result
- Called “interpolated average precision”
- Not used much anymore; “mean average precision” more common
- Values at specific interpolated points still commonly used
o Mean average precision (MAP)
- (“Average average precision” sounds weird)
- Average precision over all relevant documents, non-interpolated
- Reward systems that retrieve relevant documents quickly (highly ranked)
50
51. Mean Average Precision
Consider rank position of each relevant document (n) for given query
r1, r2, … rn
Compute precision@r (denoted P@r) for each r1, r2, … rn
Average precision = average of P@r for given query
MAP is Average Precision across multiple queries
(1/3) · (1/1 + 2/3 + 3/5) ≈ 0.76
51
52. Mean Average Precision (MAP)
52
average precision query 1 (AP) = (1.0 + 0.67 + 0.5 + 0.44 + 0.5)/5 = 0.62
average precision query 2 (AP) = (0.5 + 0.4 + 0.43)/3 = 0.44
mean average precision (MAP) = (0.62 + 0.44)/2 = 0.53
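A minimal sketch of (non-interpolated) average precision and MAP with binary relevance judgements; the two hypothetical rankings below are chosen only to reproduce the per-query AP values quoted on the slide:

def average_precision(ranking, relevant):
    """Non-interpolated AP: mean of P@r at the rank r of each relevant document.
    Relevant documents never retrieved contribute zero."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over a list of (ranking, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical rankings: query 1 has relevant documents at ranks 1, 3, 6, 9, 10 (AP about 0.62),
# query 2 has relevant documents at ranks 2, 5, 7 (AP about 0.44).
q1 = (["R1", "N", "R2", "N", "N", "R3", "N", "N", "R4", "R5"], {"R1", "R2", "R3", "R4", "R5"})
q2 = (["N", "R1", "N", "N", "R2", "N", "R3"], {"R1", "R2", "R3"})
print(round(mean_average_precision([q1, q2]), 2))   # about 0.53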
53. More about mean average precision (MAP)
o If a relevant document is not retrieved, precision corresponding to that
relevant document is zero
o Most commonly used measure in research papers … with issues
o Not so good for web search evaluation (precision oriented)
- MAP assumes user is interested in finding many relevant documents
53
54. TREC (trec_eval) evaluation results
Recall Level Precision Averages
Recall   Precision
0.0      0.61
0.1      0.45
…        …
1.0      0.003
Average precision over all relevant documents, non-interpolated (MAP): 0.23
54
55. Average precision per query
[Figure: difference in average precision per query, for topic ids 200, 201, 202, 203, 204, …; y-axis from -1.0 to 1.0]
55
A system may perform badly for some information needs (MAP = 0.1) and excellently
on others (MAP = 0.7)
→ often the case that variance in performance of the same system across queries is much greater
than variance of different systems on the same query
There are easy information needs and hard ones!
56. Rank-based measures
o Binary relevance
- Mean Average Precision (MAP)
- P@r
- R-Precision
- Mean Reciprocal Rank (MRR)
- bpref
o Multiple levels of relevance
- Normalized Discounted Cumulative Gain (NDCG)
56
57. P@r or Precision @ rank r
Set a rank threshold r
Compute % relevant documents in top r
Ignores documents ranked lower than r
P@3 = 2/3
P@4 = 2/4
P@5 = 3/5
actual performance as a user
might see it
often used in web retrieval
used at fixed rank values:
P@5, P@10
57
Note the slight difference with P@r in slide 51
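A minimal sketch of P@r; the ranking below is an assumed one, since the figure behind the slide's P@3, P@4 and P@5 values is not in this transcript:

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d2", "d5"}                                  # relevant at ranks 1, 2 and 5
for k in (3, 4, 5):
    print(f"P@{k} =", precision_at_k(ranking, relevant, k))    # 2/3, 2/4, 3/5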
58. R-Precision
o Precision after R documents are retrieved
o R = number of relevant documents for the topic
o De-emphasize exact ranking of retrieved relevant documents, which can
be useful for topics with large number of relevant documents
o Perfect system could score 1.0
o Average R-precision
- Example: 2 topics, with 50 and 10 relevant documents respectively.
- Assume IR system return 17 relevant documents in the top 50 documents for
1st topic and 7 relevant documents in top 10 for 2nd topic
- Average R-precision for this IR system is (17/50 + 7/10) / 2 = 0.52
58
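A minimal sketch reproducing the average R-precision example above (17 relevant documents in the top 50 for the first topic, 7 in the top 10 for the second); the document ids are hypothetical:

def r_precision(ranking, relevant):
    """Precision after R documents are retrieved, with R = number of relevant documents."""
    R = len(relevant)
    return sum(1 for doc in ranking[:R] if doc in relevant) / R

topic1_relevant = {f"a{i}" for i in range(50)}
topic1_ranking = [f"a{i}" for i in range(17)] + [f"x{i}" for i in range(33)]   # 17 relevant in top 50
topic2_relevant = {f"b{i}" for i in range(10)}
topic2_ranking = [f"b{i}" for i in range(7)] + [f"y{i}" for i in range(3)]     # 7 relevant in top 10

avg = (r_precision(topic1_ranking, topic1_relevant) +
       r_precision(topic2_ranking, topic2_relevant)) / 2
print(avg)   # (17/50 + 7/10) / 2 = 0.52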
59. Mean Reciprocal Rank (MRR)
o Suppose there is only one relevant document
o Scenarios: known-item search, navigational queries, looking for a fact
o Search duration → rank of the answer
measures a user's effort in finding that one and only document
Consider rank position, r, of the first relevant document
Reciprocal Rank score = 1/r
MRR is the mean RR across multiple queries
59
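A minimal sketch of reciprocal rank and MRR (the document ids and queries are illustrative):

def reciprocal_rank(ranking, relevant):
    """1/r for the rank r of the first relevant document, 0 if none is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

queries = [
    (["d7", "d2", "d9"], {"d2"}),   # first relevant at rank 2 -> RR = 0.5
    (["d4", "d1", "d3"], {"d4"}),   # first relevant at rank 1 -> RR = 1.0
]
mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
print(mrr)   # 0.75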
60. E-measure
o Used to emphasize precision (or recall)
- Essentially a weighted average of precision and recall
- Large α increases importance of precision
o Can be transformed by α = 1/(β²+1) leading to
- When β =1 (α=1/2) equal importance of precision and recall
- Normalised symmetric difference of retrieved and relevant sets
60
E = 1 - 1 / (α(1/P) + (1-α)(1/R))
E = 1 - (β² + 1)PR / (β²P + R)
61. Symmetric Difference and E
A is the retrieved set of documents
B is the set of relevant documents
P = |A∩B|/|A|
R = |A∩B|/|B|
A⊗B (the symmetric difference) is the shaded area
|A⊗B| = |A∪B| - |A∩B|
= |A| + |B| - 2|A∩B|
E(β=1) = 1 - 2PR/(P+R) = (P + R - 2PR)/(P + R)
= …
= |A⊗B| / (|A| + |B|)
symmetric difference
normalised
61
[Figure: Venn diagram of sets A (retrieved) and B (relevant) with intersection A∩B]
62. F measure
o F = 1-E often used
- Good results mean larger values of F
- “F1” measure is popular: F with β=1
- particular popular with evaluating classification approaches
harmonic mean
of P and R
62
F = 1 - E = (β² + 1)PR / (β²P + R)
F1 = 2PR / (P + R) = 1 / ((1/2)(1/R + 1/P))
Harmonic mean is a conservative average
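A minimal sketch of the F (and E) computation; the precision/recall values reuse the earlier intuition example and beta weights recall relative to precision as on the slides:

def f_measure(p, r, beta=1.0):
    """F = (beta^2 + 1) P R / (beta^2 P + R); with beta = 1 this is the harmonic mean of P and R."""
    if p == 0 and r == 0:
        return 0.0
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

p, r = 0.8, 0.4
print(f_measure(p, r))          # F1 about 0.533
print(1 - f_measure(p, r))      # E with beta = 1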
63. F measure, geometric interpretation
A is the retrieved set of documents
B is the set of relevant documents
P = |A∩B|/|A|
R = |A∩B|/|B|
[Figure: Venn diagram of sets A (retrieved) and B (relevant) with intersection A∩B]
63
F(β=1) = 2PR/(P + R)
= 2 (|A∩B|²/(|A||B|)) / (|A∩B| (1/|A| + 1/|B|))
= 2|A∩B| / (|A| + |B|)
64. Relation to Contingency Table
Why is accuracy not much used in IR in large documents collections?
- Most document are NOT relevant
- Most documents are NOT retrieved
- Inflates the accuracy value
                             Document is relevant    Document is NOT relevant
Document is retrieved        a                       b
Document is NOT retrieved    c                       d
64
Accuracy : (a + d)/(a + b + c + d)
Precision : a/(a + b)
Recall : a/(a + c)
66. Discounted Cumulative Gain (DCG)
o Popular measure for evaluating web search
o Two assumptions:
- Highly relevant documents are more useful than marginally relevant
documents
- The lower the ranked position of a relevant document, the less useful it is for
the user, since it is less likely to be examined
66
67. Discounted Cumulative Gain (DCG)
o Uses graded relevance as a measure of usefulness, or gain, from
examining a document
o Gain is accumulated starting at the top of the ranking and can be
reduced, or discounted, at lower ranks
o Typical discount is 1/log(rank)
- With base 2, the discount at rank 4 is 1/2 , and at rank 8 it is 1/3
67
68. Summarize a Ranking with DCG
o Relevance judgments in a scale of [0,r] with r>2
o Cumulative Gain (CG) at rank n
- Let the ratings of the n documents be r1, r2, …rn (in ranked order)
- CG = r1+r2+…+rn
o Discounted Cumulative Gain (DCG) at rank n
- DCG = r1 + r2/log₂2 + r3/log₂3 + … + rn/log₂n (we may use any base for the logarithm)
68
DCG_n = rel_1 + Σ_{i=2..n} rel_i / log₂(i)
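A minimal Python sketch of this formula; the graded ratings below are illustrative (on an assumed 0-3 scale):

import math

def dcg(gains):
    """DCG over graded relevance ratings r1, r2, ... in ranked order:
    r1 + sum over i >= 2 of r_i / log2(i)."""
    return sum(g if i == 1 else g / math.log2(i)
               for i, g in enumerate(gains, start=1))

print(dcg([3, 2, 3, 0, 1, 2]))   # about 8.10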
70. Summarize a Ranking with NDCG
o Normalized Discounted Cumulative Gain (NDCG) at rank n
- Normalize DCG at rank n by the DCG value at rank n of the ideal ranking
- Ideal ranking would first return the documents with the highest relevance level, then
the next highest relevance level, and so on (we get Max DCG)
o Normalization useful for contrasting queries with varying numbers of
relevant documents
o NDCG is popular in evaluating web search
70
NDCG = DCG / MaxDCG
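A minimal sketch of DCG and NDCG using the discount described above (the graded judgments in the example are invented):

```python
import math

def dcg(gains):
    """DCG with the slide's discount: no discount at rank 1, 1/log2(rank) afterwards."""
    return gains[0] + sum(g / math.log2(i) for i, g in enumerate(gains[1:], start=2))

def ndcg(gains):
    """Normalise by the DCG of the ideal (descending) ordering of the same judgments."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Made-up graded judgments (scale 0-3) for a ranking of six documents.
print(ndcg([3, 2, 3, 0, 1, 2]))
```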
72. Problem with the test collection methodology
o Building larger test collections with complete relevance judgments is difficult
or impossible
- requires assessor time, which is very expensive
- requires many diverse retrieval "runs"
o Recall is difficult, if not impossible, to measure correctly, as there is no way to find all the
relevant documents for each query
o Precision at top n often not stable enough
o Issue:
- Non-judged documents are assumed non-relevant
- Can we reuse the test collection later on?
72
73. bpref measure
o Binary preference-based measure
- Introduced in 2004
- Unlike MAP, P@10, precision, and recall, it only uses information from judged documents
o A function of how frequently relevant documents are retrieved before non-relevant
documents
o R is the number of judged relevant documents, r is a relevant retrieved
document, and n is a member of the first R judged non-relevant retrieved documents.
Non-judged documents are ignored.
73
bpref = (1/R) Σr (1 − |n ranked higher than r| / R)
74. bpref measure
o When comparing systems over test collections with complete judgments, MAP
and bpref are reported to be equivalent
o With incomplete judgments, bpref is shown to be more stable
- We look at what happens when we use fewer or more queries
- We look at what happens when we swap documents in the ranking
74
75. bpref - Example
Retrieved result set with D2 and D5 being relevant:
D1
D2
D3 not judged
D4
-------- (cut-off after the first R judged non-relevant documents)
D5
D6
D7
D8
D9
D10
R = 2
bpref = 1/2 · [1 − (1/2)] = 0.25 (the term for D5 is 0)
75
76. bpref - Example
Retrieved result set with D2, D5 and D7 being relevant:
D1
D2
D3 not judged
D4 not judged
D5
D6
D7
D8
----------
D9
D10
R = 3
bpref = 1/3 · [(1 − 1/3) + (1 − 1/3) + (1 − 2/3)] ≈ 0.56
76
77. bpref - Example
Retrieved result set with D2, D4, D6 and D9 being relevant:
D1
D2
D3
D4
D6
D7
D8
----------
D9
D10
R = 4
bpref = 1/4 · [(1 − 1/4) + (1 − 2/4) + (1 − 2/4)] ≈ 0.44 (the term for D9 is 0)
77
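A sketch of the bpref computation, reproducing the first example above; the function name and the judgment dictionary are illustrative only:

```python
def bpref(ranking, judgments):
    """judgments maps doc -> True (relevant) / False (non-relevant); unjudged docs are absent."""
    R = sum(1 for is_rel in judgments.values() if is_rel)
    score, nonrel_seen = 0.0, 0
    for doc in ranking:
        if doc not in judgments:        # non-judged documents are ignored
            continue
        if judgments[doc]:              # relevant: penalise by judged non-relevant seen so far
            score += 1 - min(nonrel_seen, R) / R
        else:
            nonrel_seen += 1
    return score / R

# First example above: D2 and D5 relevant, D3 not judged, the rest judged non-relevant.
judged = {"D1": False, "D2": True, "D4": False, "D5": True, "D6": False,
          "D7": False, "D8": False, "D9": False, "D10": False}
print(bpref([f"D{i}" for i in range(1, 11)], judged))  # 1/2 * (1 - 1/2) = 0.25
```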
78. Evaluating interaction with the IR systems
o Empirical data involving human users is time-consuming to
gather, and it is difficult to draw universal conclusions from it
o Evaluation metrics for user interaction (interface)
- Time required to learn the system
- Time to achieve goals on benchmark tasks
- Error rates
- Retention of the use of the interface over time
- User satisfaction
78
79. Why significance testing
o System A beats System B on one query
- Is it just a lucky query for System A?
- Maybe System B does better on some other query?
- Need as many queries as possible
Empirical research suggests 25 is the minimum needed
TREC tracks generally aim for at least 50 queries
o Systems A and B identical on all but one query
- If System A beats System B by enough on that one query, the average will make A look better than B
As above, this could just be a lucky break for System A
- Need A to beat B frequently to believe A is really better
o System A is only 0.00001% better than System B
- Even if true on all queries, does it mean much?
o Significance testing considers these issues
79
80. Significance tests
o Are observed differences statistically different?
- Make use of statistics
o Generally we cannot make assumptions about the underlying score distribution
- Most significance tests do make such assumptions
o Significance tests are easier to apply to single-valued effectiveness measures (MAP, bpref)
o Example: Sign test
- Does not require that the data be normally distributed
- For systems A and B, compare the average precision for each pair of results generated by the queries in
the test collection
- If the difference is large enough, count it as + or −, otherwise ignore the query
- Use the number of +'s and the total number of counted differences to determine the significance level
80
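A minimal sketch of the sign test on paired per-query scores; the per-query AP values below are made up, and the exact two-sided binomial formulation is one common way to do it:

```python
from math import comb

def sign_test(scores_a, scores_b, min_diff=0.0):
    """Two-sided sign test on paired per-query scores (e.g. average precision).
    Differences smaller than min_diff are ignored, as suggested on the slide."""
    plus = sum(1 for a, b in zip(scores_a, scores_b) if a - b > min_diff)
    minus = sum(1 for a, b in zip(scores_a, scores_b) if b - a > min_diff)
    n = plus + minus
    if n == 0:
        return 1.0
    k = min(plus, minus)
    # Under H0 each counted difference is + or - with probability 1/2:
    # exact binomial tail probability, doubled for a two-sided test.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# Made-up per-query average precision for systems A and B over ten queries.
ap_a = [0.42, 0.55, 0.31, 0.60, 0.48, 0.52, 0.39, 0.61, 0.44, 0.58]
ap_b = [0.40, 0.50, 0.33, 0.55, 0.45, 0.49, 0.35, 0.58, 0.41, 0.52]
print(sign_test(ap_a, ap_b))
```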
81. Measures for large-scale systems … web search
o Typical user behavior in web search shows a preference for high precision
o Graded scales of relevance seem more useful than binary → NDCG
o Recall is difficult to measure on the web
- Often use precision at top k, such as k = 5, k = 10, …
o . . . or measures that reward you more for getting rank 1 right than for getting
rank 10 right → NDCG
o Use non-relevance-based datasets such as click-through data (query logs)
o A/B testing
81
82. A/B testing
o Test a single new "innovation"
o Have most users use the old system
o Divert a small proportion of traffic (e.g., 1%) to the new system that includes
the innovation
o Evaluate with an "automatic" measure like click-through rate
o Now we can directly see whether the innovation improves retrieval performance
(e.g. click-through rate)
o Probably the evaluation methodology that large search engines trust most
Sec. 8.6.3
82
83. Bias in where users click
[Chart: # of clicks received vs. result position]
Strong position bias, so absolute click rates unreliable
83
84. Relative vs absolute ratings
o Hard to conclude Result1 > Result3
o Probably can conclude Result3 > Result2
o From the user's click sequence, derive pairwise relative ratings instead of individual ratings
o Assess in terms of conformance with historical pairwise preferences recorded from user clicks
84
85. Comparing two rankings via clicks and
interleaving method
Query: [support vector machines]
System A: Kernel machines; SVM-light; Lucent SVM demo; Royal Holl. SVM; SVM software; SVM tutorial
System B: Kernel machines; SVMs; Intro to SVMs; Archives of SVM; SVM-light; SVM software
85
(Joachims, 2002)
87. Count user clicks
[Figure: the result lists of System A (Kernel machines, SVM-light, Lucent SVM demo, Royal Holl. SVM, …)
and System B (Kernel machines, SVMs, Intro to SVMs, Archives of SVM, SVM-light, …),
with the results the user clicked marked: the first clicked result appears in both rankings,
the other two clicked results come from Ranking A only]
Clicks — Ranking A: 3, Ranking B: 1
⇒ System A is better than System B
87
88. Evaluation of classifiers
o Focus on measuring a classifier's effectiveness rather than its efficiency
o We recall that:
- Effectiveness is the ability to make the right classification decision
- Efficiency is concerned with time and space requirements
89. Evaluation of classifiers
o After a classifier is constructed using a training set, the
effectiveness is evaluated using a test set
o For each category ci, we calculate the following sets:
- TPi: true positives
- FPi: false positives
- TNi: true negatives
- FNi: false negatives
90. True and false positives with respect to a category
o TPi: true positives with respect to category ci
- the set of documents that both the classifier and the previous
judgments (as recorded in the test set) classify under ci
o FPi: false positives with respect to category ci
- the set of documents that the classifier classifies under ci, but the test
set indicates that they do not belong to ci
91. True and false negatives with respect to a category
o TNi: true negatives with respect to category ci
- both the classifier and the test set agree that the documents in
TNi do not belong to ci
o FNi: false negatives with respect to category ci
- the classifier does not classify the documents in FNi under ci, but
the test set indicates that they should be classified under ci
92. Evaluation measures for classifiers
o Precision with respect to category ci
o Recall with respect to category ci
[Diagram: documents the classifier assigns to ci ("what it returns") vs. the test-set class ci
("what it should return"), partitioned into TPi, FPi, FNi and TNi]
Pi = TPi / (TPi + FPi)
Ri = TPi / (TPi + FNi)
93. Evaluation measures for classifiers
o For obtaining estimates of precision and recall over the collection as
a whole, two different methods may be adopted:
- Micro-averaging
counts for true positives, false positives and false negatives for all categories are first
summed up
precision and recall are calculated using the global values
- Macro-averaging
average of precision (recall) for individual categories
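A short sketch contrasting the two averaging schemes for precision; the per-category counts are invented to highlight how the two values can diverge:

```python
def micro_macro_precision(per_category):
    """per_category: list of (TPi, FPi) counts, one pair per category."""
    # Micro-averaging: sum the counts over all categories, then compute one global precision.
    tp_sum = sum(tp for tp, fp in per_category)
    fp_sum = sum(fp for tp, fp in per_category)
    micro = tp_sum / (tp_sum + fp_sum)
    # Macro-averaging: compute precision per category, then average those values.
    macro = sum(tp / (tp + fp) for tp, fp in per_category) / len(per_category)
    return micro, macro

# Made-up counts: one high-generality category and two small categories with low precision.
print(micro_macro_precision([(90, 10), (1, 9), (2, 8)]))  # micro ≈ 0.78, macro = 0.40
```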
94. Micro- vs macro-averaging
o Micro-averaging and macro-averaging may give quite
different results if the different categories have very
different generality
o e.g. the ability of a classifier to behave well also on
categories with low generality (i.e. categories with few
positive training instances) will be emphasized by
macroaveraging
o choice depends on the application
95. Conclusions … a few words
o Here we solely focused on system-oriented evaluation. We should not
forget about user-oriented evaluation.
o Here we focused on batch-style evaluation. We should not forget that
search is part of a bigger task.
o In the end, it is all about making users "happy". We should not forget
about long-term engagement.
o Lots of work and research has looked beyond precision and recall, in terms of
validation, extensions or alternatives.
o Lots of work on "significance testing" lets us be confident that IR
system A is indeed better than IR system B.
o Here we focused on "documents" and text. We should not forget
multimedia, mobile, social media, etc., where evaluating effectiveness
may mean something a bit different.
95