IR Evaluation:
Designing an End-to-End
Offline Evaluation Pipeline (2)
Jin Young Kim, Microsoft
jink@microsoft.com
Emine Yilmaz, University College London
emine.yilmaz@ucl.ac.uk
Speaker Bio
• Graduated from UMass Amherst with a Ph.D. in 2012
• Spent the past 3 years in Bing’s Relevance Measurement / Science Team
• Taught an MSFT course on offline evaluation
• Passionate about working with data of all kinds
(search, personal, baseball, …)
Evaluating a Data Product
• How would you evaluate Web Search, App Recommendations, and
even an Intelligent Agent?
Better Evaluation = Better Data Product
• Investment decisions
• Shipping decisions
• Compensation decisions
• More effective ML models
Tutorial Objective
• Overview the end-to-end process of how evaluation works
in a large-scale commercial web search engine
• Learn about various decisions and tips for each step
• Practice designing a judging interface for a specific task
• Review related literature on various fronts
What Makes Evaluation in Industry different?
• Larger scale / team / business at stake
• More diverse signals for evaluation (online + offline)
• More diverse evaluation targets (not just documents)
• Need for a sustainable evaluation pipeline
Agenda: Steps for Offline Evaluation
• Preparing tasks
• Designing a judging interface
• Designing an experiment
• Running the experiment
• Evaluating the Experiment
Preparing tasks
What constitutes a task?
• Goal
• You want to evaluate the target against the task description provided
• Task description
• Some (expression of) information need
• Search query / user profile / …
• Target
• System response to satisfy the need
• SERP / webpage / answer / …
Sampling tasks (queries)
• A random sample of user queries is the most common method
• What can go wrong with this approach?
• Sampling criteria (see the sketch below)
• Representative: Are the samples representative of the user traffic?
• Actionable: Are they targeted at what we’re trying to improve?
• Need for more context
• Are queries specific enough for consistent judgment?
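As an illustration of the representative vs. actionable trade-off, here is a minimal sketch of traffic-weighted query sampling with pandas. The `query_log` DataFrame and its `query` / `segment` columns are hypothetical names for whatever log store you have; the technique, not the schema, is the point.

```python
import pandas as pd

def sample_tasks(query_log: pd.DataFrame, n: int, seed: int = 42) -> pd.DataFrame:
    """Sample query impressions proportionally to traffic within each segment.

    Assumes a hypothetical `query_log` with one row per impression and
    columns 'query' and 'segment'. Sampling impressions (not unique queries)
    keeps the task set representative of user traffic.
    """
    total = len(query_log)
    return (
        query_log
        .groupby("segment", group_keys=False)
        .apply(lambda g: g.sample(n=round(n * len(g) / total), random_state=seed))
    )

# Actionable variant: restrict the log to the segment a feature targets, e.g.
# tasks = sample_tasks(query_log[query_log["segment"] == "local"], n=500)
```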
Add context if the query alone is not enough
• Context examples:
• User’s location
• Task description
• Session history
• …
• Cost of contextual judging
• Potentially needs more judgments
• Increases the judge’s cognitive load
Designing a judging interface
Goals in designing a judging interface
• Maximum information
• Minimum effort
• Minimum errors
Designing a judging interface: SERP*
• Questions
• Responses
• Judging Target
Q: How would you rate
the search results?
Not Relevant
Fair
Good
Excellent
Q: Why do you think so?
*SERP: Search Engine Results Page
Practice: Design your own Judging Interface
• What can go wrong with the evaluation interface?
• How can you improve the evaluation interface?
What can go wrong here?
• Judges may like some part of the page, but not others
• Judges may not understand the query at all
• Each judge may understand the task differently
• Rating can be very subjective without a clear baseline
• …
Designing a judging interface: web result
Given ‘crowdsourcing’ as
a query, how would you
rate the webpage?
Not Relevant
Fair
Good
Excellent
Q: Why do you think so?
Now the judging target is specific enough
Judging Guideline
• A document for judges to read
before starting the task
• Needs to be kept simple (e.g., one page), especially for crowd judges
• Can’t rely on the guideline for all instructions: use training / tooltips
Designing a judging interface: side-by-side
Q: How would you
compare two results?
Left much better
Left better
About the same
Right better
Right much better
Q: Why do you think so?
The other page establishes a clear baseline for the judgment
Evaluation by Comparing Result Sets in Context
[Thomas’06]
Here or There: Preference Judgments for
Relevance [Carterette et al. 2008]
Higher inter-judge agreement for preference judgments
Tips on judging interface design
• Use plain language (i.e., avoid jargon)
• Make the UI light and simple (e.g., no scrolling)
• Provide an ‘I don’t know’ (skip) option (to avoid random responses)
• Collect optional textual comments (for rationale or feedback)
• Collect judging time and behavioral log data (for quality control)
Using Hidden Tasks for Quality Control [Alonso ’15]
• Ask simple questions that
require judges to read the
contents
• This prepares the judge for the actual judging task
• This provides a way to verify whether a response is bogus
Designing an experiment
From judgments to an experiment
• Experiment
• A set of judgments collected with a particular goal
• A typical experiment consists of many tasks and judgments
• Multiple judgments are collected for each task (overlap)
• Types of goals
• Resource planning: where to invest in next few months?
• Feature debugging: what can go wrong with this feature?
• Shipping decision: should we ship the feature to production?
(Diagram: an experiment of 9 tasks × 3 overlapping judgments per task)
Breakdown of Experimental Cost
• How much money (time) spent per task?
• How many (overlap) judgments per task?
• How many tasks within experiment?
Cost per judgment: 10 cents ≈ 30 seconds of judge time (at $12/hr)
Judgments per task (overlap): 3
Tasks in the experiment: 9
Total cost: 9 tasks × 3 judgments × $0.10 = $2.70
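The same arithmetic as a tiny helper, useful when sizing an experiment; the defaults simply mirror the numbers on the slide and are not a real pricing model.

```python
def experiment_cost(n_tasks: int, overlap: int, cost_per_judgment: float = 0.10,
                    seconds_per_judgment: float = 30.0) -> dict:
    """Estimate total cost and judge time for an offline experiment.

    Defaults mirror the slide: 10 cents ~= 30 seconds per judgment ($12/hr).
    """
    n_judgments = n_tasks * overlap
    return {
        "judgments": n_judgments,
        "cost_usd": n_judgments * cost_per_judgment,
        "judge_hours": n_judgments * seconds_per_judgment / 3600.0,
    }

# 9 tasks x 3 overlap at 10 cents each -> $2.70 and ~13.5 minutes of judge time.
print(experiment_cost(n_tasks=9, overlap=3))
```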
Effect of Pay per Task
• Higher pay per task improves throughput, not judging quality
[Mason and Watts, 2009]
Why overlap judgments?
• Better task understanding
• What’s the distribution of labels?
• What is the judges’ collective feedback?
• Quality control for labels / judges
• What is the majority opinion for each task?
• Who tends to disagree with the majority opinion?
The majority opinion is not always right, especially
before you have enough good judges
Majority Voting and Label Quality
• Ask multiple labellers, keep majority label as “true” label
• Quality is the probability of the majority label being correct (a sketch of this calculation follows below)
(p: probability of an individual labeller being correct)
[Kuncheva et al., PA&A, 2003]
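The Kuncheva-style quality curve can be reproduced with a small binomial calculation. This is a minimal sketch that assumes judges err independently and all share the same accuracy p, which is an idealization of real judge pools.

```python
from math import comb

def majority_vote_quality(p: float, n_judges: int) -> float:
    """Probability that the majority label is correct, assuming n_judges
    independent judges who are each correct with probability p.
    Uses an odd number of judges so there are no ties."""
    assert n_judges % 2 == 1, "use an odd overlap to avoid ties"
    needed = n_judges // 2 + 1
    return sum(comb(n_judges, k) * p**k * (1 - p)**(n_judges - k)
               for k in range(needed, n_judges + 1))

# With judges who are right 70% of the time, 3-way overlap gives ~0.78 and
# 5-way overlap ~0.84; below p = 0.5, adding overlap makes things worse.
for n in (1, 3, 5):
    print(n, round(majority_vote_quality(0.7, n), 3))
```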
High vs. Low overlap experiment
• High-overlap
• Early iteration stage
• Information-centric tasks
• Low-overlap
• Mature / production stage
• Number-centric tasks
(Diagrams: 3 tasks × 9 overlap vs. 9 tasks × 3 overlap)
Summary: Evaluation Goals & Guidelines
Evaluation Goal                   | Judgment Design  | Experiment Design
Feature Planning / Debugging      | Label + Comments | Information-centric (High overlap)
Training Data                     | Label + Comments | Specific to the algorithm
Shipping Decision (ExpA vs. ExpB) | Label + Comments | Number-centric (Low overlap)
Running the experiment
Choosing judge pools
• Development Team
• In-house (managed) judges
• Crowdsourcing judges
(From the development team to in-house to crowd judges: less expertise, more judgments, closer to users. Each pool collects ground-truth labels that feed the next stage.)
Choosing judges within the pool
• Considerations
• Do judges have necessary knowledge?
• Do judge profiles match with target users?
• Can they perform the task with reasonable accuracy?
• Methods
• Pre-screen judges by profile
• Filter out judges by screening task
• Kick out ‘bad’ judges regularly (a monitoring sketch follows below)
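As a sketch of the "kick out bad judges regularly" step: if judgments are kept in a table with the judge id, the given label, and the gold label where one exists (hypothetical column names below), per-judge accuracy on golden hits can be monitored and low performers flagged. This is an illustration only, not the production pipeline described in the tutorial.

```python
import pandas as pd

def flag_bad_judges(judgments: pd.DataFrame,
                    min_gold_hits: int = 10,
                    min_accuracy: float = 0.7) -> list:
    """Return judge ids whose accuracy on golden hits falls below a threshold.

    Assumes hypothetical columns 'judge_id', 'label', and 'gold_label'
    (gold_label is NaN for regular, non-golden tasks).
    """
    gold = judgments.dropna(subset=["gold_label"]).copy()
    gold["correct"] = gold["label"] == gold["gold_label"]
    stats = gold.groupby("judge_id")["correct"].agg(n_gold="size", accuracy="mean")
    bad = stats[(stats["n_gold"] >= min_gold_hits) & (stats["accuracy"] < min_accuracy)]
    return bad.index.tolist()
```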
Training judges: Training tasks
Given ‘crowdsourcing’ as
a query, how would you
rate the webpage?
Bad
Fair
Good
Excellent
Perfect
Q: Why do you think so?
The Answer is ‘Excellent’
This document satisfies the user’s main intent by providing well-curated information about the topic
Initial qualification task / Interleaved training task / Interleaved QA task
Crowd workers communicate with each other!
You need to manage
your reputation as a
requester.
(Quick payment /
Responsive to
workers’ feedback)
Answers shared with one worker are likely shared with all.
Cost of Qualification Test [Alonso’13]
• Judges become an order of magnitude slower in the presence of qualification tasks
• However, depending on the type of task, the results may be worth the delay and cost
Tips on running an experiment
• Scale up judging tasks slowly
• Beware of the quality of golden hits
• Submit a big task in small batches
(for task debugging / judge engagement)
• Monitor & respond to judges’ feedback
Evaluating the Experiment
Analyzing the judgment quality
• Agreement with ground truth (aka golden hits)
• Inter-rater agreement
• Behavioral signals (time, label distribution)
• Agreement with other metrics
Comparing Inter-rater Metrics
• Percentage agreement: the number of cases that received the same rating from both judges, divided by the total number of cases rated by the two judges.
• Cohen’s kappa: estimates the degree of consensus between two judges, correcting for the agreement expected if they were operating by chance alone.
• Fleiss’ kappa: generalization of Cohen’s kappa to n raters instead of just two.
• Krippendorff’s alpha: accepts any number of observers and is applicable to nominal, ordinal, interval, and ratio levels of measurement.
https://en.wikipedia.org/wiki/Inter-rater_reliability
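As a concrete illustration, percentage agreement is a one-liner and Cohen's kappa is available in scikit-learn (`cohen_kappa_score`). The two small rating lists below are made up purely for illustration.

```python
from sklearn.metrics import cohen_kappa_score

judge_1 = ["Good", "Excellent", "Fair", "Good", "Not Relevant", "Good"]
judge_2 = ["Good", "Excellent", "Good", "Good", "Fair", "Good"]

# Percentage agreement: cases with the same rating / total cases rated.
pct_agreement = sum(a == b for a, b in zip(judge_1, judge_2)) / len(judge_1)

# Cohen's kappa corrects percentage agreement for chance agreement.
kappa = cohen_kappa_score(judge_1, judge_2)

print(f"percentage agreement = {pct_agreement:.2f}, kappa = {kappa:.2f}")
```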
Analyzing the judgment quality
Automating Crowdsourcing Tasks in an Industrial Environment
Vasilis Kandylas, Omar Alonso, Shiroy Choksey, Kedar Rudre, Prashant Jaiswal
Using Behavior of Crowd Judges for QA
• Predictive models of task performance can be built based on behavioral traces, and these models generalize to related tasks.
Instrumenting the Crowd: Using Implicit Behavioral Measures to Predict
Task Performance, UIST’11, Jeffrey M. Rzeszotarski, Aniket Kittur
Case Study: Relevance Dimensions in
Preference-based IR Evaluation [Kim et al. ’13]
Q: How would you
compare two results?
Overall
Relevance
Diversity
Freshness
Authority
Caption
Q: Why do you think so?
Left Tie Right
Allow judges to break down their judgments along several dimensions
Case Study: Relevance Dimensions in
Preference-based IR Evaluation [Kim et al. ’13]
• Inter-judge agreement
• Correlation between preference judgments and delta NDCG@{1,3}
All achieved with only a 10% increase in judging time
Conclusions
Building a Production Evaluation Pipeline
Omar Alonso, Implementing crowdsourcing-based relevance
experimentation: an industrial perspective. Inf. Retr. 16(2): 101-120 (2013)
Recap: Steps for Offline Evaluation
• Preparing tasks
• Designing a judging interface
• Designing an experiment
• Running the experiment
• Evaluating the Experiment
Main References
• Implementing crowdsourcing-based relevance experimentation: an
industrial perspective. Omar Alonso
• Tutorial on Crowdsourcing. Panos Ipeirotis
• Amazon Mechanical Turk: Requester Best Practices Guide
• Quantifying the User Experience. Sauro and Lewis. (book)
Optional
Impact of Highlights on Document Relevance
• Highlighted versions of the document were perceived to be more relevant than plain versions. [Alonso, 2013]
• Subtle interface change can affect the outcome significantly
Architecture Example: BingDAT
Automating Crowdsourcing Tasks in an Industrial Environment
Vasilis Kandylas, Omar Alonso, Shiroy Choksey, Kedar Rudre, Prashant Jaiswal
Computing Cohen’s Kappa
• Statistic used for measuring inter-rater agreement
• Can be used to measure
• Agreement with gold data
• Agreement between two workers
• More robust than error rate as it takes into account agreement by
chance
Computing Quality Score: Cohen’s Kappa
Kappa = (Pr(a) − Pr(e)) / (1 − Pr(e))
Pr(a): observed agreement among raters
Pr(e): hypothetical probability of chance agreement (agreement due to chance)
Computing Cohen’s Kappa
• Computing the probability of agreement, Pr(a)
• Generate the contingency table of the two workers’ ratings (expected chance agreements in parentheses)
• Compute the number of cases of agreement / total number of ratings

             Worker 1
             a          b          c          Total
Worker 2  a  9 (5.42)   3          1          13
          b  4          8 (4.67)   2          14
          c  2          1          6 (2.25)    9
Total        15         12         9          Overall total: 36

• Pr(a) = (9 + 8 + 6) / 36 = 23/36
• Computing the probability of agreement due to chance, Pr(e)
• Compute the expected frequency for agreements that would occur due to chance
• E.g., the probability that worker 1 and worker 2 both label any item as an a:
Pr(w1=a & w2=a) = (15/36) × (13/36)
• The expected number of items labelled a by both worker 1 and worker 2:
E[w1=a & w2=a] = (15/36) × (13/36) × 36 = 5.42
• Likewise, E[w1=b & w2=b] = 4.67 and E[w1=c & w2=c] = 2.25, so
Pr(e) = (5.42 + 4.67 + 2.25) / 36 = 12.34/36
• Kappa = (Pr(a) − Pr(e)) / (1 − Pr(e)) = (23 − 12.34) / (36 − 12.34) = 0.45
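The same computation can be scripted directly from the contingency table. Below is a minimal NumPy sketch that reproduces the numbers above; it is an illustration, not part of any production tooling mentioned in the tutorial.

```python
import numpy as np

# Contingency table from the worked example: rows = Worker 2, cols = Worker 1.
table = np.array([[9, 3, 1],
                  [4, 8, 2],
                  [2, 1, 6]], dtype=float)

n = table.sum()                      # 36 ratings in total
p_a = np.trace(table) / n            # observed agreement: (9+8+6)/36
row = table.sum(axis=1) / n          # Worker 2 marginals: 13/36, 14/36, 9/36
col = table.sum(axis=0) / n          # Worker 1 marginals: 15/36, 12/36, 9/36
p_e = (row * col).sum()              # chance agreement: ~12.34/36

kappa = (p_a - p_e) / (1 - p_e)
print(round(kappa, 2))               # ~0.45, matching the slide
```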
What is a good value for Kappa?
• Kappa >= 0.70 => reliable inter-rater agreement
• For the above example, inter-rater reliability is not satisfactory
• If Kappa<0.70, need ways to improve worker quality
• Better incentives
• Better interface for the task
• Better guidelines/clarifications for the task
• Training before the task…
Calculating the Confidence Interval
Drawing Conclusions
• Hypothesis testing (covered in Part I)
• How confident can we be about our conclusion?
• Confidence interval
• How big is the improvement?
• How precise is our estimate?
Both statistical significance and confidence interval
should be reported!
Confidence Interval and Hypothesis Testing
• Confidence Interval
• Does the 95% C.I. of sample mean include zero?
• Hypothesis Testing
• Does the 95% C.I. under H0 include the critical value?
(Diagram: the 95% confidence interval around the sample mean vs. the 95% confidence interval under H0 around zero, with the critical value marked)
Sampling Distribution and Confidence Interval
• 95% confidence interval: 95% of sample means will fall within this interval
• Equivalently, 95% of intervals constructed this way will contain the original population mean
http://rpsychologist.com/d3/CI/
Computing the Confidence Interval
• Determine confidence level (typically 95%)
• Estimate a sampling distribution (sample mean & variance)
• Calculate confidence interval
• ConfInterval_95 = X̄ ± Z × (σ / √n)
Z: 1.96 (for a 95% C.I.)
X̄: sample mean
σ: sample standard deviation
n: sample size
(Figure: sampling distribution with the 95% confidence interval around X̄)
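A minimal sketch of this computation in Python, assuming SciPy is available for the normal quantile; the per-query metric deltas below are synthetic, purely for illustration.

```python
import numpy as np
from scipy import stats

def confidence_interval(samples, confidence: float = 0.95):
    """Normal-approximation confidence interval for the mean of `samples`."""
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean()
    sem = samples.std(ddof=1) / np.sqrt(len(samples))   # standard error of the mean
    z = stats.norm.ppf(0.5 + confidence / 2)            # 1.96 for 95%
    return mean - z * sem, mean + z * sem

# Example: synthetic per-query metric delta between control and treatment.
deltas = np.random.default_rng(0).normal(loc=0.02, scale=0.1, size=200)
low, high = confidence_interval(deltas)
print(f"mean delta 95% CI: [{low:.3f}, {high:.3f}]")
# If the interval excludes zero, the difference is significant at the 5% level.
```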
Editor's Notes
  1. Different from software evaluation: Output depends on task & user / Subjective quality
  2. Evaluation is critical in every stage of development. Harry Shum: ‘We are as good as having the perfect WSE if we perfect the evaluation’
  3. Compared to Pt.1 where Emine focused on Academic IR evaluation, I’ll focus on what people in industry care about
  4. For the rest of this talk, I’ll follow the steps for …
  5. Mention TREC topic desc.
  6. No ground for comparison / What if the judge doesn’t understand the intent?
  7. No ground for comparison / What if the judge doesn’t understand the intent?
  8. Should we use ‘about the same’ vs. ‘the same’?
  9. One judgment is not enough!
  10. Pay per task: how much of judges’ time do you want to borrow?
  11. Different layout?
  12. Dev team should definitely be the first judges
  13. Screenshot?
  14. Tasks with known answers are interleaved with regular tasks
  15. Judges need a regular stream of jobs to stay engaged
  16. For the rest of this talk, I’ll follow the steps for …
  17. Need to be careful if you want to change the judging interface suddenly…
  18. These can be derived from the sampling distribution
  19. , which is our best guess for the population mean
  20. If you flip this, you can have hypothesis testing