Ecir2015 hjung final_public

A Discriminative Approach to Predicting Assessor Accuracy

Hyun Joon Jung and Matthew Lease
University of Texas at Austin

Presented by HyunJoon Jung

Introduction - Crowdsourcing for IR Evaluation

5/20/15, ECIR '15, Hyun Joon Jung and Matthew Lease

Manual human judgments:
too expensive (cost) and
too slow (time)

Crowdsourcing for IR Evaluation
•  Origin: Alonso et al. (SIGIR Forum 2008)
•  Collecting relevance judgments from a globally distributed online crowd via the Internet

Benefits of Crowdsourcing-based IR Evaluation
•  Faster time
•  Less cost
•  Broader demographics
Quality concern

Alternative efforts in IR:
An effort to reduce the number of relevance judgments to collect (MTC, StatAP, and Pooling)

Introduction | Method | Evaluation | Conclusion

Introduction - Quality Control in Crowdsourcing

Requester, Online marketplace, Crowd workers

Existing Quality Control Methods
•  Task Design
•  Workflow Design
•  Label Aggregation
•  Worker Management

Who is more accurate? (worker performance estimation and prediction)

Introduction - Problem Setting

Problem
Find a worker who is most likely to make a correct label

Alice: 1 1 0 0 ?
Bob:   1 0 1 0 ?

Why
Improve data quality and lower cost

How to use
•  Task routing
•  Label aggregation
•  Worker filtering
•  Intervention

Introduction - Related Work

Problem: find a worker who is most likely to make a correct label.

Alice: 1 1 0 0 ?
Bob:   1 0 1 0 ?
Each value is the correctness of the i-th task instance against a gold label (1 -> correct, 0 -> wrong).

A typical way: measure accuracy (= 2/4 = 0.5)
Observations are independently and identically distributed (i.i.d.).

Consider this problem from a temporal perspective
Let's relax this assumption -> conditional independence (Jung et al., HCOMP 2014)
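The accuracy comparison on this slide can be checked in a few lines of Python. This is only a minimal sketch: the two label sequences are the ones shown for Alice and Bob, while the helper name `sample_accuracy` is hypothetical and not taken from the talk.

```python
def sample_accuracy(labels):
    """Fraction of correct labels in a history (1 = correct, 0 = wrong)."""
    return sum(labels) / len(labels)


# Correctness histories from the slide; the next label ("?") is still unknown.
alice = [1, 1, 0, 0]
bob = [1, 0, 1, 0]

print(sample_accuracy(alice))  # 0.5
print(sample_accuracy(bob))    # 0.5
```

Both workers receive the same score although the order of their correct and wrong labels differs, which is the information a temporal view tries to use.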

Method - Idea

Integrate multi-dimensional features of a crowd assessor

Fig. 1. Two examples of failures of existing assessor models and success of GAM in predicting
assessors' next label quality ((a) high-accuracy assessor and (b) low-accuracy assessor). While
an actual assessor's next label quality (GOLD) oscillates over time, the existing assessor models
(Time-series (TS) [9], Sample Running Accuracy (SA), Bayesian uniform beta prior (BA-UNI) [8])
do not follow the temporal variation of the gold labels, since they are not able to capture the
dynamics of labels properly. In contrast, our proposed model, GAM, is very sensitive to
such dynamics of labels over time, which yields higher-quality prediction.

[...] strong accuracy (0.8) which continually degrades over time, whereas the accuracy of the
right assessor (b) hovers steadily around 0.5. Suppose that a crowd worker's next label
quality (y_t) is binary (correct/wrong) with respect to ground truth. While y_t oscillates
over time, the existing models are not able to capture such temporal dynamics, and thus
prediction based on these models is almost always wrong. In particular, when an assessor's
labeling accuracy is greater than 0.5 (e.g., avg. accuracy = 0.67 in Figure 1 (a)), the
prediction based on the existing models is always 1 (correct) even though the actual
assessor's next label quality oscillates over time. A similar problem happens in Figure 1
(b) with another worker whose average accuracy is below 0.5.
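The failure mode described above can be reproduced with a small Python sketch. It assumes that an accuracy-based baseline predicts "correct" whenever the running accuracy is at least 0.5; that thresholding rule, the function name, and the label sequence are illustrative assumptions, not data or code from the paper.

```python
def running_accuracy_predictions(labels, threshold=0.5):
    """Predict the next label's correctness from sample running accuracy.

    labels: observed 0/1 correctness values (1 = correct).
    Returns one prediction per observed prefix: 1 if SA_t >= threshold, else 0.
    The thresholding rule is an assumption made for this illustration.
    """
    predictions = []
    correct = 0
    for t, y in enumerate(labels, start=1):
        correct += y
        predictions.append(1 if correct / t >= threshold else 0)
    return predictions


# Hypothetical oscillating history with average accuracy 6/9 = 0.67.
gold = [1, 1, 0, 1, 1, 0, 1, 1, 0]
preds = running_accuracy_predictions(gold)

# The prediction made after label t is compared with label t + 1.
hits = sum(p == y for p, y in zip(preds[:-1], gold[1:]))
print(preds)  # every prediction is 1
print(hits, "of", len(gold) - 1, "next labels predicted correctly")
```

Because the running accuracy never drops below 0.5 here, the baseline never predicts a wrong label, so every 0 in the sequence is missed.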
Multiple features

Alice
accuracy   time   temporal effect   topic familiarity   # of labels   label
0.7        10.3   0.6               0.8                 20            0
0.6        8.5    0.5               0.2                 21            1
0.65       7.5    0.4               0.4                 22            0
0.63       11.5   0.3               0.5                 23            ?

Predict an assessor's next label quality based on a single feature

Alice
temporal effect   label
0.6               0
0.5               1
0.4               0
0.3               ?

Method - Crowd Assessor Features

How do we flexibly capture a wider range of assessor behaviors by
incorporating multi-dimensional features?

Table 1. Features of generalized assessor model (GAM). n is the number of total judgments and
x is the number of correct judgments at time t.

Feature Name: Description

Observable
•  Bayesian Optimistic Accuracy (BAopt) [4]: a Bayesian-style accuracy with a prior Beta(16, 1). $BA_{opt} = (x_t + 16)/(n_t + 17)$
•  Bayesian Pessimistic Accuracy (BApes) [4]: a Bayesian-style accuracy with a prior Beta(1, 16). $BA_{pes} = (x_t + 1)/(n_t + 17)$
•  Bayesian Uniform Accuracy (BAuni) [8]: a Bayesian-style accuracy with a prior Beta(0.5, 0.5). $BA_{uni} = (x_t + 0.5)/(n_t + 1)$
•  Sample Running Accuracy (SA): $SA_t = x_t/n_t$
•  CurrentLabelQuality: a binary value indicating whether the current label is correct or wrong.
•  TaskTime: time spent completing this judgment task (ms).
•  AccuracyChangeDirection (ACD): a binary value indicating the absolute difference between $SA_{t-1}$ and $SA_t$.
•  TopicChange: a binary value indicating a topic change between time t-1 and time t.
•  NumLabels: a cumulative number of completed relevance judgments at time t.
•  TopicEverSeen: a real value [0~1] indicating the familiarity of a topic. 1 / (number of judgments on topic k at time t)

Latent
•  Asymptotic Accuracy (AA) [9]: a time-series accuracy estimated by the latent time-series model proposed by Jung et al. $AA = c/(1 - \phi)$
•  $\phi$ [9]: a temporal correlation indicating how frequently a sequence of correct/wrong observations has changed over time.
•  c [9]: a variable indicating the direction of judgments between correct and wrong.

Slide annotations on the table: [1], [1], [2] on the Bayesian accuracy rows; [3], [3], [3] on the latent rows.
Feature groups: various accuracy measures, task features, temporal features.

[1] Carterette, B., Soboroff, I.: The effect of assessor error on IR system evaluation. SIGIR '10
[2] Ipeirotis, P.G., Gabrilovich, E.: Quizz: targeted crowdsourcing with a billion (potential) users. WWW '14
[3] Jung, H., et al.: Predicting Next Label Quality: A Time-Series Model of Crowdwork. HCOMP '14
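The accuracy features in Table 1 all follow one pattern: add the Beta prior's pseudo-counts to the observed counts. The sketch below shows that arithmetic under that reading; the function and key names are hypothetical, and returning 0.0 for SA when no labels exist yet is an assumption made only to avoid division by zero.

```python
def bayesian_accuracy(x_t, n_t, alpha, beta):
    """Posterior-mean accuracy under a Beta(alpha, beta) prior.

    x_t: number of correct judgments so far; n_t: total judgments so far.
    """
    return (x_t + alpha) / (n_t + alpha + beta)


def accuracy_features(x_t, n_t):
    """Compute the four accuracy features of Table 1 for one assessor."""
    return {
        "BA_opt": bayesian_accuracy(x_t, n_t, 16, 1),     # (x_t + 16) / (n_t + 17)
        "BA_pes": bayesian_accuracy(x_t, n_t, 1, 16),     # (x_t + 1) / (n_t + 17)
        "BA_uni": bayesian_accuracy(x_t, n_t, 0.5, 0.5),  # (x_t + 0.5) / (n_t + 1)
        # SA is undefined for n_t = 0; 0.0 is an assumed fallback.
        "SA": x_t / n_t if n_t > 0 else 0.0,
    }


features = accuracy_features(x_t=2, n_t=4)
print(features)
```

With only four labels, the optimistic and pessimistic priors dominate (18/21 versus 3/21), while BA_uni and SA both sit at 0.5.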

Method - Prediction Model

Generalizable feature-based Assessor Model (GAM)

Input: X (features for crowd assessor model)

Learning Framework

In prediction, we consider a supervised learning task where we [...] instances
$\{(x_i, y_i), i = 1, \ldots, N\}$. Here, each $x_i \in \mathbb{R}^M$ is an M-dimensional [...]
vector, and $y_i \in \{0, 1\}$ is a class label indicating whether an assessor's [...]
is correct (1) or wrong (0). Before fitting a model to our features [...]
first normalize our features in order to ensure that normalized feature values implicitly
weight all features equally in a model learning process. Logistic regression models the
probability distribution of the class label y given a feature vector x as follows:

\[ p(y = 1 \mid x; \theta) = \sigma(\theta^{T} x) = \frac{1}{1 + \exp(-\theta^{T} x)} \tag{1} \]

Here $\theta = \{\beta_0, \beta_1^{T}, \ldots, \beta_M^{T}\}$ are the parameters of the logistic regression model, and
$\sigma(\cdot)$ is the sigmoid function, defined by the second equality. The model is fit to the
training data by maximizing the following penalized log-likelihood:

\[ \max_{\theta} \Big\{ \sum_{i=1}^{N} \big[ y_i(\beta_0 + \beta^{T} x_i) - \log(1 + e^{\beta_0 + \beta^{T} x_i}) \big] - \lambda \sum_{j=1}^{M} |\beta_j| \Big\} \tag{2} \]

3.3 Prediction with Decision Reject Option
Our predictive model can generate two types of outputs: a probabilistic label ($y_{i+1} \in [0, 1]$)
indicating the degree of polarity and a binary label (0 or 1). While binary labels
(hard labels) can be used as they are, probabilistic labels (soft labels) can be used after a [...]

Output: Y (likelihood of getting correct label at t)
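To make the learning framework concrete, the sketch below evaluates a logistic prediction and an L1-penalized log-likelihood in plain Python. It is an illustration under assumptions rather than the authors' implementation: the names `beta0`, `beta`, and `lam`, the toy feature matrix, and the parameter values are all hypothetical, and no optimizer is included.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(beta0, beta, x):
    """Probability that the next label is correct, for one feature vector x."""
    return sigmoid(beta0 + sum(b * xj for b, xj in zip(beta, x)))


def penalized_log_likelihood(beta0, beta, X, y, lam):
    """Log-likelihood of (X, y) minus lam times the L1 norm of beta."""
    ll = 0.0
    for xi, yi in zip(X, y):
        z = beta0 + sum(b * xj for b, xj in zip(beta, xi))
        ll += yi * z - math.log(1.0 + math.exp(z))
    return ll - lam * sum(abs(b) for b in beta)


# Hypothetical normalized features (two per instance) and correctness labels.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
y = [1, 0, 1, 0]

null_model = penalized_log_likelihood(0.0, [0.0, 0.0], X, y, lam=0.1)
candidate = penalized_log_likelihood(-1.0, [2.0, 0.0], X, y, lam=0.1)
print(null_model, candidate)
print(predict_proba(-1.0, [2.0, 0.0], [1.0, 0.0]))
```

A fitting routine would search for the parameters that maximize this objective; here the candidate that weights only the first feature already scores higher than the all-zero model, and the penalty leaves the unused second weight at zero.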

Evaluation Setting

Dataset
NIST TREC Crowdsourcing track 2011 dataset
•  Prolific workers (numLabels >= 20) -> 54 workers
•  Avg. number of labels per worker: 163.

Method
•  GAM (Generalized Assessor Model)
•  TS (Time-series Model, Jung et al. '14)
•  BA (Bayesian Accuracy, Carterette & Soboroff '10, Ipeirotis & Gabrilovich '14)
•  SA (Sample Accuracy)

How important are the features?

Relative feature importance across 54 individual prediction models.

Feature               Value (axis: 0 to 50)
AA                    49
BA_opt                43
BA_pes                39
C                     28
NumLabels             27
CurrentLabelQuality   23
AccChangeDirection    22
SA                    20
Phi                   19
BA_uni                16
TaskTime              10
TopicChange           7
TopicEverSeen         5

Fig. 4. Summary of relative feature importance across 54 regression models.

[...] cases (27), which implicitly indicates that task familiarity affects an assessor's [...]

A GAM with only the top 5 features shows good performance
(7-10% less than the full-featured GAM)

Prediction Performance

                 Proposed model   Baselines
Metric           GAM              TS      BAuni   BAopt   BApes   SA
Accuracy         0.802*           0.621   0.599   0.601   0.522   0.599
% Improvement    NA               29.1    33.9    33.4    53.6    33.9
# of Wins        NA               50      52      50      54      52
# of Ties        NA               3       1       3       0       1
# of Losses      NA               1       1       1       0       1
MAE              0.340*           0.444   0.459   0.448   0.488   0.458
% Improvement    NA               23.4    25.9    24.1    33.0    25.8
# of Wins        NA               53      53      53      54      53
# of Losses      NA               1       1       1       0       1

[...] Prediction performance (Accuracy and Mean Absolute Error) of different predictive models.
% Improvement indicates an improvement in prediction performance between GAM vs. each
baseline ((GAM - baseline) / baseline). # of Wins indicates the number of assessors that GAM outperforms
[...] while # of Losses indicates the opposite of # of Wins. # of Ties indicates the
[...] assessors that both a method and GAM show the same prediction performance for [...]

% improvement = (GAM / baseline) - 1
e.g. GAM vs. TS: (0.802 / 0.621) - 1 = 0.291 = 29.1%
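The worked example for GAM vs. TS extends to the other baselines in the Accuracy row. The following sketch recomputes those percentages from the accuracy values in the table; the function and variable names are hypothetical.

```python
def pct_improvement(gam, baseline):
    """Relative improvement of GAM over a baseline, in percent."""
    return (gam / baseline - 1.0) * 100.0


gam_accuracy = 0.802
baseline_accuracy = {
    "TS": 0.621,
    "BAuni": 0.599,
    "BAopt": 0.601,
    "BApes": 0.522,
    "SA": 0.599,
}

improvements = {
    name: round(pct_improvement(gam_accuracy, acc), 1)
    for name, acc in baseline_accuracy.items()
}
print(improvements)
```

The rounded results match the % Improvement row reported for Accuracy (29.1, 33.9, 33.4, 53.6, 33.9).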

Prediction Coverage vs. Performance

[Figure: MAE (y-axis, 0.1 to 0.5) vs. Coverage (x-axis, 0.25 to 1.00) for GAM, TS, BA_uni, BA_opt, BA_pes, and SA, with points at δ = 0, 0.05, 0.1, 0.15, 0.2, 0.25]
Fig. 3. Prediction performance (MAE) of assessors' next judgments and corresponding coverage
across varying decision rejection options (δ = [0~0.25] by 0.05). While the other methods show a
significant decrease in coverage, under all the given reject options, GAM shows better coverage
as well as prediction performance.

Decision Reject Option
Delta (δ): confidence threshold parameter for decision rejection
(a larger δ increases accuracy and decreases coverage)

Scale: 0, 0.5-δ, 0.5, 0.5+δ, 1

if y_t < 0.5 - δ or y_t ≥ 0.5 + δ then use y_t
else discard y_t
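The reject rule can be sketched as a small filter that also reports coverage, the fraction of predictions that survive. The function name and the probability values below are hypothetical, and the sketch only illustrates the rule as stated on the slide.

```python
def apply_reject_option(probabilities, delta):
    """Discard predictions inside [0.5 - delta, 0.5 + delta); keep the rest.

    Returns the kept predictions and the coverage (kept / total).
    """
    kept = [p for p in probabilities if p < 0.5 - delta or p >= 0.5 + delta]
    coverage = len(kept) / len(probabilities) if probabilities else 0.0
    return kept, coverage


# Hypothetical predicted probabilities of a correct next label.
probs = [0.95, 0.55, 0.48, 0.10, 0.70, 0.52]

for delta in (0.0, 0.1, 0.25):
    kept, coverage = apply_reject_option(probs, delta)
    print(delta, kept, round(coverage, 2))
```

With δ = 0 nothing is rejected; raising δ keeps only the confident predictions, so coverage falls as the threshold grows.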

Task Routing: Judgment Quality vs. Cost

Quality Improvement
•  GAM >> baselines (under varying # of judges)
•  Higher-quality labels with fewer judges

Cost Saving
•  3.7 judges per task

                   Prediction Models for Task Routing                          No Routing
Number of Judges   GAM       TS      BAuni   BAopt   BApes   SA      Random   All labels
1                  0.786*    0.604   0.578   0.582   0.558   0.569   0.556    0.595
% Improvement      NA        30.1    36.0    35.1    40.9    38.1    41.4
2                  0.816**   0.617   0.592   0.595   0.574   0.582   0.572
% Improvement      NA        32.3    37.8    37.1    42.2    40.2    42.7
3                  0.880*    0.647   0.608   0.623   0.598   0.608   0.581
% Improvement      NA        36.0    44.7    41.3    47.2    44.7    51.5

Table 3. Accuracy of relevance judgments via predictive models. Number of Judges indicates
the number of judges per query-document pair. When the Number of Judges > 1, majority
voting is used for label aggregation. Accuracy is measured against NIST expert gold labels.
% Improvement indicates an improvement in label accuracy between GAM vs. each baseline
((GAM - baseline) / baseline). The average number of judges per query-document pair is 3.7. (*) indicates
that GAM prediction outperforms the other six methods with high statistical significance
(p<0.01).
[...] quality is measured with accuracy, and a paired t-test is conducted to check whether [...]
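Routing followed by majority voting can be sketched as follows. This is a minimal illustration under assumptions: the worker IDs, predicted probabilities, and labels are invented, the function names are hypothetical, and breaking ties in favor of label 1 is an assumed choice that the source does not specify.

```python
def route_task(predicted_quality, k):
    """Pick the k workers with the highest predicted probability of a correct label."""
    ranked = sorted(predicted_quality, key=predicted_quality.get, reverse=True)
    return ranked[:k]


def majority_vote(labels):
    """Aggregate binary labels; ties go to 1 (an assumption, not from the source)."""
    return 1 if 2 * sum(labels) >= len(labels) else 0


# Hypothetical model outputs and worker labels for one query-document pair.
predicted_quality = {"w1": 0.91, "w2": 0.42, "w3": 0.77, "w4": 0.63}
worker_labels = {"w1": 1, "w2": 0, "w3": 1, "w4": 0}

chosen = route_task(predicted_quality, k=3)
final_label = majority_vote([worker_labels[w] for w in chosen])
print(chosen, final_label)
```

Setting `k` to 1, 2, or 3 corresponds to the Number of Judges rows in Table 3; with `k=1` the vote reduces to the single routed worker's label.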

Conclusion

Multi-dimensional Assessor Features
Build multi-dimensional features of crowd assessors (time-series, Bayesian, behavioral)

Discriminative Prediction Model
Integrate multi-dimensional features via a discriminative prediction model

Results
Better prediction accuracy, prediction coverage, and higher-quality relevance judgments in task routing

Future work
Effect of limited supervision, more realistic online task routing (with bandit approaches)
