1. A Discriminative Approach to Predicting Assessor Accuracy
Hyun Joon Jung and Matthew Lease
University of Texas at Austin
Presented by Hyun Joon Jung
2. Introduction - Crowdsourcing for IR evaluation
5/20/15 ECIR '15 - Hyun Joon Jung and Matthew Lease
Manual human judgments: too expensive (cost) and too slow (time)
Crowdsourcing for IR Evaluation
• Origin: Alonso et al. (SIGIR Forum 2008)
• Collecting relevance judgments from a globally distributed online crowd via the Internet
Benefits of crowdsourcing-based IR evaluation
• Faster time
• Less cost
• Broader demographics
Quality concern
Alternative efforts in IR: reduce the number of relevance judgments to collect (MTC, StatAP, and pooling)
Outline: Introduction, Method, Evaluation, Conclusion
3. Introduction - Quality Control in Crowdsourcing
Existing Quality Control Methods
• Task design
• Workflow design
• Label aggregation
• Worker management
Who is more accurate? (worker performance estimation and prediction)
Diagram: Requester -> Online marketplace -> Crowd workers
4. Introduction - Problem setting
Problem: find a worker who is most likely to make a correct label
Why: improve data quality and lower cost
How to use
• Task routing
• Label aggregation
• Worker filtering
• Intervention
Alice: 1 1 0 0 ?
Bob: 1 0 1 0 ?
5. Introduction - Related work
Alice: 1 1 0 0 ?
Bob: 1 0 1 0 ?
Each value is the correctness of the ith task instance against a gold label (1 -> correct, 0 -> wrong).
Problem: find a worker who is most likely to make a correct label.
A typical way: measure accuracy (= 2/4 = 0.5).
Assumption: observations are independent and identically distributed (i.i.d.).
Consider this problem from a temporal perspective: let's relax this assumption -> conditional independence (Jung et al., HCOMP 2014).
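To make the limitation of the typical approach concrete, here is a minimal Python sketch. The function name `sample_accuracy` is hypothetical and chosen only for this illustration. It computes the usual accuracy estimate for the two example correctness histories: both come out at 0.5 even though the two sequences differ in their temporal pattern.

```python
def sample_accuracy(correctness):
    """Fraction of correct labels (1 = correct, 0 = wrong) observed so far."""
    return sum(correctness) / len(correctness)


# Correctness histories from the slide; the next label ("?") is unknown.
alice = [1, 1, 0, 0]
bob = [1, 0, 1, 0]

print(sample_accuracy(alice))  # 0.5
print(sample_accuracy(bob))    # 0.5
```

Because the estimate ignores the order of observations, it cannot tell these two workers apart.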
6. Method - Idea
Integrate multi-dimensional features of a crowd assessor
Fig. 1. Two examples of failures of existing assessor models and success of GAM in predicting assessors' next label quality ((a) high accuracy assessor and (b) low accuracy assessor). While an actual assessor's next label quality (GOLD) oscillates over time, the existing assessor models (Time-series (TS) [9], Sample Running Accuracy (SA), Bayesian uniform beta prior (BA-UNI) [8]) do not follow the temporal variation of the gold labels since they are not able to capture dynamics of labels properly. On the contrary, our proposed model, GAM, is very sensitive to such dynamics of labels over time for higher quality prediction.
strong accuracy (0.8) which continually degrades over time, whereas accuracy of the right assessor (b) hovers steadily around 0.5. Suppose that a crowd worker's next label quality (yt) is binary (correct/wrong) with respect to ground truth. While yt oscillates over time, the existing models are not able to capture such temporal dynamics and thus prediction based on these models is almost always wrong. In particular, when an assessor's labeling accuracy is greater than 0.5 (e.g., avg. accuracy = 0.67 in Figure 1 (a)), the prediction based on the existing models is always 1 (correct) even though the actual assessor's next label quality oscillates over time. A similar problem happens in Figure 1 (b) with another worker whose average accuracy is below 0.5.
Predict an assessor's next label quality based on a single feature (Alice)
temporal effect   next label quality
0.6               0
0.5               1
0.4               0
0.3               ?
Multiple features (Alice)
accuracy   time   temporal effect   topic familiarity   # of labels   next label quality
0.7        10.3   0.6               0.8                 20            0
0.6        8.5    0.5               0.2                 21            1
0.65       7.5    0.4               0.4                 22            0
0.63       11.5   0.3               0.5                 23            ?
7. Method - Crowd Assessor Features
[1] Carterette, B., Soboroff, I.: The effect of assessor error on IR system evaluation. SIGIR '10
[2] Ipeirotis, P.G., Gabrilovich, E.: Quizz: targeted crowdsourcing with a billion (potential) users. WWW '14
[3] Jung, H., et al.: Predicting Next Label Quality: A Time-Series Model of Crowdwork. HCOMP '14
How do we flexibly capture a wider range of assessor behaviors by incorporating multi-dimensional features?
Table 1. Features of generalized assessor model (GAM). n is the number of total judgments and x is the number of relevance judgments at time t.
Feature Name | Description
Observable
Bayesian Optimistic Accuracy (BAopt) [4] | a Bayesian-style accuracy with a prior Beta(16,1): $BA_{opt} = (x_t + 16)/(n_t + 17)$
Bayesian Pessimistic Accuracy (BApes) [4] | a Bayesian-style accuracy with a prior Beta(1,16): $BA_{pes} = (x_t + 1)/(n_t + 17)$
Bayesian Uniform Accuracy (BAuni) [8] | a Bayesian-style accuracy with a prior Beta(0.5,0.5): $BA_{uni} = (x_t + 0.5)/(n_t + 1)$
Sample Running Accuracy (SA) | $SA_t = x_t/n_t$
CurrentLabelQuality | a binary value indicating whether a current label is correct or wrong
TaskTime | time spent completing this judgment task (ms)
AccuracyChangeDirection (ACD) | a binary value indicating the absolute difference between $SA_{t-1}$ and $SA_t$
TopicChange | a binary value indicating a topic change between time t-1 and time t
NumLabels | a cumulative number of completed relevance judgments at time t
TopicEverSeen | a real value [0~1] indicating the familiarity of a topic: 1 / (number of judgments on topic k at time t)
Latent
Asymptotic Accuracy (AA) [9] | a time-series accuracy estimated by the latent time-series model proposed by Jung et al.: $c/(1 - \phi)$
$\phi$ [9] | a temporal correlation indicating how frequently a sequence of correct/wrong observations has changed over time
c [9] | a variable indicating the direction of judgments between correct and wrong
Slide reference labels for the cited features: BAopt [1], BApes [1], BAuni [2], AA [3], $\phi$ [3], c [3]
Feature groups highlighted on the slide: various accuracy measures, task features, temporal features
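As an illustration of the accuracy-style features, here is a minimal Python sketch. It assumes that x_t counts the correct labels among the first n_t labels, so that SA_t is a running accuracy. That reading, the function name `accuracy_features`, and the dictionary keys are assumptions made for this example, not definitions taken from the paper.

```python
def accuracy_features(correctness):
    """Compute the four accuracy features after a correctness history.

    Assumption: x_t = number of correct labels so far, n_t = number of
    labels so far (each entry of `correctness` is 1 = correct, 0 = wrong).
    """
    n = len(correctness)
    x = sum(correctness)
    return {
        "BA_opt": (x + 16) / (n + 17),  # optimistic prior Beta(16, 1)
        "BA_pes": (x + 1) / (n + 17),   # pessimistic prior Beta(1, 16)
        "BA_uni": (x + 0.5) / (n + 1),  # uniform-style prior Beta(0.5, 0.5)
        "SA": x / n if n else None,     # undefined before the first label
    }


features = accuracy_features([1, 1, 0, 0])
print(features)
```

With only four observations the optimistic and pessimistic estimates stay close to their priors, while BA_uni and SA already track the observed 2/4.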
8. Method - Prediction model
Input: X (features for crowd assessor model)
Learning Framework
In prediction, we consider a supervised learning task where w… instances $\{(x_i, y_i), i = 1, \ldots, N\}$. Here, each $x_i \in \mathbb{R}^M$ is an M… vector, and $y_i \in \{0, 1\}$ is a class label indicating whether an ass… is correct (1) or wrong (0). Before fitting a model to our features, we first normalize our features in order to ensure that normalized feature values implicitly weight all features equally in a model learning process. Logistic regression models the probability distribution of the class label y given a feature vector X as follows:
$$p(y = 1 \mid x; \theta) = \sigma(\theta^T x) = \frac{1}{1 + \exp(-\theta^T x)} \quad (1)$$
Here $\theta = \{\beta_0, \beta_1^T, \ldots, \beta_M^T\}$ are the parameters of the logistic regression model; and $\sigma(\cdot)$ is the sigmoid function, defined by the second equality. The following function attempts to maximize the log-likelihood in order to fit a model to given training data.
$$\max_{\theta} \left\{ \sum_{i=1}^{N} \left[ y_i(\beta_0 + \beta^T x_i) - \log\left(1 + e^{\beta_0 + \beta^T x_i}\right) \right] - \lambda \sum_{j=1}^{M} |\beta_j| \right\} \quad (2)$$
3.3 Prediction with Decision Reject Option
Our predictive model can generate two types of outputs: a probabilistic label ($y_{i+1} \in \{0, 1\}$) indicating the degree of polarity and a binary label (0 or 1). While binary labels (hard labels) can be used as they are, probabilistic labels (soft labels) can be used after a …
Output: Y (likelihood of getting correct label at t)
Generalizable feature-based Assessor Model (GAM)
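The learning framework can be sketched in a few lines of plain Python. This is an illustrative sketch under assumptions, not the authors' implementation: it assumes the standard L1-regularized logistic regression form with an intercept, and the names `beta0`, `beta`, and `lam` are chosen here for the intercept, the weights, and the penalty weight.

```python
import math


def sigmoid(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, beta0, beta):
    """Probability that the next label is correct, as in equation (1)."""
    return sigmoid(beta0 + sum(b * v for b, v in zip(beta, x)))


def penalized_log_likelihood(X, y, beta0, beta, lam):
    """Objective of equation (2): log-likelihood minus an L1 penalty.

    Assumption: `lam` scales the sum of absolute weights; the intercept
    `beta0` is not penalized.
    """
    ll = 0.0
    for xi, yi in zip(X, y):
        z = beta0 + sum(b * v for b, v in zip(beta, xi))
        ll += yi * z - math.log(1.0 + math.exp(z))
    return ll - lam * sum(abs(b) for b in beta)


# Two normalized feature vectors with their observed correctness labels.
X = [[0.2, -1.0], [1.5, 0.3]]
y = [0, 1]
print(predict_proba(X[1], beta0=0.0, beta=[1.0, 0.5]))
print(penalized_log_likelihood(X, y, beta0=0.0, beta=[1.0, 0.5], lam=0.1))
```

Fitting would search for the parameters that maximize this objective; the sketch only evaluates it for given parameters.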
9. Evaluation Setting
Dataset
NIST TREC Crowdsourcing track 2011 dataset
• Prolific workers (numLabels >= 20) -> 54 workers
• Avg. number of labels per worker: 163
Method
• GAM (Generalized Assessor Model)
• TS (Time-series model, Jung et al. '14)
• BA (Bayesian Accuracy, Carterette & Soboroff '10, Ipeirotis & Gabrilovich '14)
• SA (Sample Accuracy)
10. How important are the features?
Fig. 4. Summary of relative feature importance across 54 regression models. Bar values (axis 0 to 50):
AA 49
BA_opt 43
BA_pes 39
C 28
NumLabels 27
CurrentLabelQuality 23
AccChangeDirection 22
SA 20
Phi 19
BA_uni 16
TaskTime 10
TopicChange 7
TopicEverSeen 5
… cases (27), which implicitly indicates that task familiarity affects an assessor's …
A GAM with only the top 5 features shows good performance (7-10% less than the full-featured GAM).
Relative feature importance across 54 individual prediction models.
11. Prediction Performance
Metric GAM TS BAuni BAopt BApes SA
Accuracy 0.802* 0.621 0.599 0.601 0.522 0.599
% Improvement NA 29.1 33.9 33.4 53.6 33.9
# of Wins NA 50 52 50 54 52
# of Ties NA 3 1 3 0 1
# of Losses NA 1 1 1 0 1
MAE 0.340* 0.444 0.459 0.448 0.488 0.458
% Improvement NA 23.4 25.9 24.1 33.0 25.8
# of Wins NA 53 53 53 54 53
# of Losses NA 1 1 1 0 1
… Prediction performance (Accuracy and Mean Average Error) of different predictive models. % Improvement indicates an improvement in prediction performance between GAM vs. each baseline ((GAM - baseline) / baseline). # of Wins indicates the number of assessors that GAM outperforms a method while # of Losses indicates the opposite of # of Wins. # of Ties indicates the number of assessors that both a method and GAM show the same prediction performance for …
GAM is the proposed model; TS, BAuni, BAopt, BApes, and SA are baselines.
% improvement = (GAM / baseline) - 1
e.g., GAM vs. TS: (0.802 / 0.621) - 1 = 0.291 = 29.1%
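The improvement formula can be checked against the accuracy row of the table with a short Python sketch. The helper name `pct_improvement` is hypothetical, and the numbers are the accuracy values reported on this slide.

```python
def pct_improvement(gam, baseline):
    """Relative improvement of GAM over a baseline, in percent (1 decimal)."""
    return round(100 * (gam / baseline - 1), 1)


gam_accuracy = 0.802
baseline_accuracy = {
    "TS": 0.621,
    "BAuni": 0.599,
    "BAopt": 0.601,
    "BApes": 0.522,
    "SA": 0.599,
}

improvements = {
    name: pct_improvement(gam_accuracy, acc)
    for name, acc in baseline_accuracy.items()
}
print(improvements)
# {'TS': 29.1, 'BAuni': 33.9, 'BAopt': 33.4, 'BApes': 53.6, 'SA': 33.9}
```

The computed values match the "% Improvement" row reported for accuracy.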
12. Prediction Coverage vs. Performance
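Fig. 3. Prediction performance (MAE) of assessors' next judgments and corresponding coverage across varying decision rejection options (δ = [0~0.25] by 0.05). While the other methods show a significant decrease in coverage, under all the given reject options, GAM shows better coverage as well as prediction performance.
[Plot: MAE (y-axis, 0.1 to 0.5) vs. coverage (x-axis, 0.25 to 1.00) for GAM, TS, BA_uni, BA_opt, BA_pes, and SA at δ = 0, 0.05, 0.1, 0.15, 0.2, 0.25]
Decision Reject Option
Delta (δ): confidence threshold parameter for decision rejection (larger δ increases accuracy, decreases coverage)
[Diagram: prediction scale from 0 to 1 with a rejection band from 0.5-δ to 0.5+δ around 0.5]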
if $\hat{y} < 0.5 - \delta$ or $\hat{y} \ge 0.5 + \delta$ then use $\hat{y}$, else discard $\hat{y}$
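The reject rule and its effect on coverage can be sketched as follows. This is a minimal illustration, assuming the model outputs a probability in [0, 1] for each prediction; the function name `apply_reject_option` and the example probabilities are hypothetical.

```python
def apply_reject_option(probabilities, delta):
    """Keep predictions outside the band [0.5 - delta, 0.5 + delta).

    Returns (kept, coverage): `kept` is a list of (index, binary_label)
    for the predictions that are used, and `coverage` is the fraction of
    predictions that were not discarded.
    """
    kept = []
    for i, p in enumerate(probabilities):
        if p < 0.5 - delta or p >= 0.5 + delta:
            kept.append((i, 1 if p >= 0.5 else 0))
    coverage = len(kept) / len(probabilities)
    return kept, coverage


probs = [0.9, 0.55, 0.48, 0.2]
kept, coverage = apply_reject_option(probs, delta=0.1)
print(kept, coverage)  # [(0, 1), (3, 0)] 0.5
```

With delta = 0 every prediction is kept, so coverage is 1.0; raising delta trades coverage for confidence in the predictions that remain.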
13. Task Routing: Judgment Quality vs. Cost
Quality Improvement
• GAM >> baselines (under varying # of judges)
Cost Saving
• Higher quality labels with fewer judges
• 3.7 judges per task
Column groups: Prediction Models for Task Routing | No Routing
Number of Judges GAM TS BAuni BAopt BApes SA Random
1 0.786* 0.604 0.578 0.582 0.558 0.569 0.556
% Improvement NA 30.1 36.0 35.1 40.9 38.1 41.4
2 0.816** 0.617 0.592 0.595 0.574 0.582 0.572
% Improvement NA 32.3 37.8 37.1 42.2 40.2 42.7
3 0.880* 0.647 0.608 0.623 0.598 0.608 0.581
% Improvement NA 36.0 44.7 41.3 47.2 44.7 51.5
All labels: 0.595
Table 3. Accuracy of relevance judgments via predictive models. Number of Judges indicates the number of judges per query-document pair. When the Number of Judges > 1, majority voting is used for label aggregation. Accuracy is measured against NIST expert gold labels. % Improvement indicates an improvement in label accuracy between GAM vs. each baseline ((GAM - baseline) / baseline). The average number of judges per query-document pair is 3.7. (*) indicates that GAM prediction outperforms the other six methods with high statistical significance (p<0.01).
… quality is measured with accuracy, and a paired t-test is conducted to check whether …
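The routing-plus-aggregation step can be illustrated with a small Python sketch. This is a simplified illustration under assumptions, not the experimental pipeline: for one query-document pair it selects the workers with the highest predicted probability of a correct label and aggregates their labels by majority vote. The function name, the worker ids, and the tie-breaking rule (fall back to the top-ranked worker) are all hypothetical.

```python
def route_and_aggregate(predicted_quality, labels, num_judges):
    """Route one item to the top `num_judges` workers and majority-vote.

    predicted_quality: dict worker -> predicted probability of a correct label
    labels: dict worker -> binary relevance label (0 or 1) for this item
    Assumption: ties are broken by the top-ranked worker's label.
    """
    ranked = sorted(labels, key=lambda w: predicted_quality[w], reverse=True)
    chosen = ranked[:num_judges]
    votes = [labels[w] for w in chosen]
    ones = sum(votes)
    if 2 * ones == len(votes):  # tie, possible with an even number of judges
        return labels[chosen[0]]
    return 1 if 2 * ones > len(votes) else 0


predicted_quality = {"w1": 0.9, "w2": 0.6, "w3": 0.4, "w4": 0.3}
labels = {"w1": 1, "w2": 0, "w3": 0, "w4": 0}

print(route_and_aggregate(predicted_quality, labels, num_judges=1))  # 1
print(route_and_aggregate(predicted_quality, labels, num_judges=3))  # 0
```

The example shows why the quality of the predictions matters: the aggregated label depends on which workers are selected, not only on how many.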
14. Conclusion
Multi-dimensional Assessor Features: build multi-dimensional features of crowd assessors (time-series, Bayesian, behavioral)
Discriminative Prediction Model: integrate multi-dimensional features via a discriminative prediction model
Results: better prediction accuracy, prediction coverage, higher quality relevance judgment in task routing
Future work: effect of limited supervision, more realistic online task routing (with Bandit approaches)