A Discriminative Approach to Predicting Assessor Accuracy
Hyun Joon Jung and Matthew Lease
University of Texas at Austin
Presented by HyunJoon Jung
Introduction - Crowdsourcing for IR evaluation
5/20/15 ECIR '15 - Hyun Joon Jung and Matthew Lease
Manual human judgments:
too expensive (cost) and
too slow (time)
Crowdsourcing for IR Evaluation
•  Origin: Alonso et al. (SIGIR Forum 2008)
•  Collecting relevance judgments from a globally distributed online crowd via the Internet
Benefits of Crowdsourcing-based IR Evaluation: Faster Time, Less Cost, Broader Demographics
Quality concern
Alternative efforts in IR: an effort to reduce the number of relevance judgments to collect (MTC, StatAP, and Pooling)
Introduction   Method   Evaluation   Conclusion

Introduction - Quality Control in Crowdsourcing
Existing Quality Control Methods
•  Task Design
•  Workflow Design
•  Worker Management
•  Label Aggregation
Requester, Online marketplace, Crowd workers
Who is more accurate? (worker performance estimation and prediction)
Introduction - Problem setting
Problem
Find a worker who is most likely to make a correct label
Alice: 1 1 0 0 ?
Bob: 1 0 1 0 ?
Why
Improve data quality and lower cost
How to use
•  Task routing
•  Label aggregation
•  Worker filtering
•  Intervention
Introduction - Related work
Problem: find a worker who is most likely to make a correct label.
Correctness of the i-th task instance against a gold label (1 -> correct, 0 -> wrong)
Alice: 1 1 0 0 ?
Bob: 1 0 1 0 ?
A typical way: measure accuracy (= 2/4 = 0.5)
This assumes that observations are independent and identically distributed (i.i.d.).
Consider this problem from a temporal perspective
Let's relax this assumption -> conditional independence
Jung et al., HCOMP 2014
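The limitation of plain accuracy can be illustrated with a minimal Python sketch. It uses the two correctness sequences shown for Alice and Bob; the helper names `sample_accuracy` and `label_switches` are hypothetical and are not taken from the slides. Both workers get the same accuracy even though their sequences change at different rates, which is the kind of temporal information an i.i.d. estimate ignores.

```python
def sample_accuracy(correctness):
    """Fraction of past labels that matched the gold label (1 = correct, 0 = wrong)."""
    return sum(correctness) / len(correctness)


def label_switches(correctness):
    """Count how often consecutive observations differ (a crude temporal signal)."""
    return sum(a != b for a, b in zip(correctness, correctness[1:]))


alice = [1, 1, 0, 0]
bob = [1, 0, 1, 0]

print(sample_accuracy(alice), sample_accuracy(bob))  # 0.5 0.5
print(label_switches(alice), label_switches(bob))    # 1 3
```

The switch count is only an illustration of order dependence; it is not the temporal feature used by the authors.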
  
Method - Idea
Integrate multi-dimensional features of a crowd assessor
Fig. 1. Two examples of failures of existing assessor models and success of GAM in predicting assessors' next label quality ((a) high accuracy assessor and (b) low accuracy assessor). While an actual assessor's next label quality (GOLD) oscillates over time, the existing assessor models (Time-series (TS) [9], Sample Running Accuracy (SA), and Bayesian uniform beta prior (BA-UNI) [8]) do not follow the temporal variation of the gold labels since they are not able to capture the dynamics of labels properly. In contrast, our proposed model, GAM, is very sensitive to such dynamics of labels over time, which yields higher quality prediction.

[...] strong accuracy (0.8) which continually degrades over time, whereas the accuracy of the right assessor (b) hovers steadily around 0.5. Suppose that a crowd worker's next label quality (y_t) is binary (correct/wrong) with respect to ground truth. While y_t oscillates over time, the existing models are not able to capture such temporal dynamics, and thus prediction based on these models is almost always wrong. In particular, when an assessor's labeling accuracy is greater than 0.5 (e.g., avg. accuracy = 0.67 in Figure 1 (a)), the prediction based on the existing models is always 1 (correct) even though the actual assessor's next label quality oscillates over time. A similar problem happens in Figure 1 (b) with another worker whose average accuracy is below 0.5.
Predict an assessor's next label quality based on a single feature
Alice
temporal effect   label quality
0.6               0
0.5               1
0.4               0
0.3               ?

Multiple features
Alice
accuracy   time   temporal effect   topic familiarity   # of labels   label quality
0.7        10.3   0.6               0.8                 20            0
0.6        8.5    0.5               0.2                 21            1
0.65       7.5    0.4               0.4                 22            0
0.63       11.5   0.3               0.5                 23            ?
Method - Crowd Assessor Features
[1] Carterette, B., Soboroff, I.: The effect of assessor error on IR system evaluation. SIGIR '10
[2] Ipeirotis, P.G., Gabrilovich, E.: Quizz: targeted crowdsourcing with a billion (potential) users. WWW '14
[3] Jung, H., et al.: Predicting Next Label Quality: A Time-Series Model of Crowdwork. HCOMP '14
How do we flexibly capture a wider range of assessor behaviors by incorporating multi-dimensional features?
Feature Name: Description

Observable
•  Bayesian Optimistic Accuracy (BAopt) [4]: a Bayesian-style accuracy with a prior Beta(16,1). BAopt = (x_t + 16)/(n_t + 17)
•  Bayesian Pessimistic Accuracy (BApes) [4]: a Bayesian-style accuracy with a prior Beta(1,16). BApes = (x_t + 1)/(n_t + 17)
•  Bayesian Uniform Accuracy (BAuni) [8]: a Bayesian-style accuracy with a prior Beta(0.5,0.5). BAuni = (x_t + 0.5)/(n_t + 1)
•  Sample Running Accuracy (SA): SA_t = x_t/n_t
•  CurrentLabelQuality: a binary value indicating whether the current label is correct or wrong.
•  TaskTime: time spent completing this judgment task (ms).
•  AccuracyChangeDirection (ACD): a binary value indicating the absolute difference between SA_{t-1} and SA_t.
•  TopicChange: a binary value indicating a topic change between time t-1 and time t.
•  NumLabels: the cumulative number of completed relevance judgments at time t.
•  TopicEverSeen: a real value in [0, 1] indicating the familiarity of a topic. 1 / (number of judgments on topic k at time t)

Latent
•  Asymptotic Accuracy (AA) [9]: a time-series accuracy estimated by the latent time-series model proposed by Jung et al. AA = c/(1 - φ)
•  φ [9]: a temporal correlation indicating how frequently a sequence of correct/wrong observations has changed over time.
•  c [9]: a variable indicating the direction of judgments between correct and wrong.

Table 1. Features of generalized assessor model (GAM). n is the number of total judgments and x is the number of relevance judgments at time t.

Slide reference markers for the cited rows, in order: [1], [1], [2], [3], [3], [3] (the excerpt's [4], [8], and [9] correspond to slide references [1], [2], and [3]).
Feature groups: various accuracy measures, task features, temporal features
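The four accuracy features in Table 1 follow one pattern: a Beta(a, b) prior turns the running counts into (x_t + a)/(n_t + a + b). Below is a minimal Python sketch of that arithmetic. It assumes x_t counts the labels judged correct so far and n_t counts all labels so far; the function names are hypothetical.

```python
def bayesian_accuracy(x_t, n_t, a, b):
    """Posterior-mean accuracy under a Beta(a, b) prior: (x_t + a) / (n_t + a + b)."""
    return (x_t + a) / (n_t + a + b)


def accuracy_features(x_t, n_t):
    """Compute the four accuracy features of Table 1 from running counts.

    Assumption: x_t = number of correct labels so far, n_t = number of labels so far.
    """
    return {
        "BA_opt": bayesian_accuracy(x_t, n_t, 16, 1),     # optimistic prior Beta(16, 1)
        "BA_pes": bayesian_accuracy(x_t, n_t, 1, 16),     # pessimistic prior Beta(1, 16)
        "BA_uni": bayesian_accuracy(x_t, n_t, 0.5, 0.5),  # uniform-style prior Beta(0.5, 0.5)
        "SA": x_t / n_t if n_t > 0 else None,             # undefined before the first label
    }


# A worker with 2 correct labels out of 4, as in the Alice/Bob example.
features = accuracy_features(2, 4)
for name, value in features.items():
    print(name, value)
```

With few observations the optimistic and pessimistic priors dominate the estimate; as n_t grows, all four values converge toward the sample accuracy.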
  
Method - Prediction model
Generalizable feature-based Assessor Model (GAM)
Input: X (features for crowd assessor model)
Learning Framework:

In prediction, we consider a supervised learning task with training instances $\{(x_i, y_i), i = 1, \dots, N\}$. Here, each $x_i \in \mathbb{R}^M$ is an M-dimensional feature vector, and $y_i \in \{0, 1\}$ is a class label indicating whether an assessor's label is correct (1) or wrong (0). Before fitting a model to our features, we first normalize them so that the normalized feature values implicitly weight all features equally in the model learning process. Logistic regression models the probability distribution of the class label y given a feature vector x as follows:

$$p(y = 1 \mid x; \theta) = \sigma(\theta^T x) = \frac{1}{1 + \exp(-\theta^T x)} \quad (1)$$

Here $\theta = \{\beta_0, \beta_1^T, \dots, \beta_M^T\}$ are the parameters of the logistic regression model, and $\sigma(\cdot)$ is the sigmoid function, defined by the second equality. The following objective maximizes the L1-penalized log-likelihood in order to fit a model to the given training data:

$$\max_{\theta} \left\{ \sum_{i=1}^{N} \left[ y_i(\beta_0 + \beta^T x_i) - \log\left(1 + e^{\beta_0 + \beta^T x_i}\right) \right] - \lambda \sum_{j=1}^{M} |\beta_j| \right\} \quad (2)$$

3.3 Prediction with Decision Reject Option
Our predictive model can generate two types of outputs: a probabilistic label ($y_{i+1} \in [0, 1]$) indicating the degree of polarity and a binary label (0 or 1). While binary labels (hard labels) can be used as they are, probabilistic labels (soft labels) can be used after a [...]

Output: Y (likelihood of getting correct label at t)
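To make the learning framework concrete, here is a minimal Python sketch of the logistic model and the L1-penalized log-likelihood it maximizes. It only evaluates the two expressions for given parameters and does not implement the authors' training procedure. The names `beta0`, `beta`, and `lam`, and the toy data, are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(beta0, beta, x):
    """Probability that the next label is correct, given feature vector x."""
    return sigmoid(beta0 + sum(b * v for b, v in zip(beta, x)))


def penalized_log_likelihood(beta0, beta, X, y, lam):
    """Log-likelihood of logistic regression minus an L1 penalty on beta.

    X is a list of feature vectors, y a list of 0/1 labels, and lam the
    (hypothetical) regularization strength.
    """
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = beta0 + sum(b * v for b, v in zip(beta, x_i))
        total += y_i * z - math.log(1.0 + math.exp(z))
    return total - lam * sum(abs(b) for b in beta)


# Toy data (not from the paper): two normalized features per instance.
X = [[0.7, 0.6], [0.6, 0.5], [0.65, 0.4]]
y = [0, 1, 0]

print(predict_proba(0.0, [0.0, 0.0], X[0]))                   # 0.5 with all-zero parameters
print(penalized_log_likelihood(0.0, [0.0, 0.0], X, y, 0.1))   # equals -3 * log(2)
```

In practice an off-the-shelf solver would search for the parameters that maximize this objective; the L1 term pushes uninformative feature weights to exactly zero.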
  
Evaluation Setting
Dataset
NIST TREC Crowdsourcing track 2011 dataset
•  Prolific workers (numLabels >= 20) -> 54 workers
•  Avg. number of labels per worker: 163
Method
•  GAM (Generalized Assessor Model)
•  TS (Time-series Model, Jung et al. '14)
•  BA (Bayesian Accuracy, Carterette & Soboroff '10, Ipeirotis & Gabrilovich '14)
•  SA (Sample Accuracy)
How important are the features?
Relative feature importance across 54 individual prediction models (bar chart, x-axis 0 to 50):
AA                    49
BA_opt                43
BA_pes                39
C                     28
NumLabels             27
CurrentLabelQuality   23
AccChangeDirection    22
SA                    20
Phi                   19
BA_uni                16
TaskTime              10
TopicChange           7
TopicEverSeen         5
Fig. 4. Summary of relative feature importance across 54 regression models.
[...] cases (27), which implicitly indicates that task familiarity affects an assessor's [...]
A GAM with only the top 5 features shows good performance (7-10% less than full-featured GAM)
Prediction Performance

Proposed model: GAM. Baselines: TS, BAuni, BAopt, BApes, SA.

Metric          GAM      TS      BAuni   BAopt   BApes   SA
Accuracy        0.802*   0.621   0.599   0.601   0.522   0.599
% Improvement   NA       29.1    33.9    33.4    53.6    33.9
# of Wins       NA       50      52      50      54      52
# of Ties       NA       3       1       3       0       1
# of Losses     NA       1       1       1       0       1
MAE             0.340*   0.444   0.459   0.448   0.488   0.458
% Improvement   NA       23.4    25.9    24.1    33.0    25.8
# of Wins       NA       53      53      53      54      53
# of Losses     NA       1       1       1       0       1

Table 2. Prediction performance (Accuracy and Mean Absolute Error) of different predictive models. % Improvement indicates an improvement in prediction performance between GAM vs. each baseline ((GAM - baseline) / baseline). # of Wins indicates the number of assessors for which GAM outperforms a method, while # of Losses indicates the opposite of # of Wins. # of Ties indicates the number of assessors for which both a method and GAM show the same prediction performance [...]

% improvement = (GAM / baseline) - 1
e.g. GAM vs. TS: (0.802 / 0.621) - 1 = 0.291 = 29.1%
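The % improvement arithmetic for the accuracy row can be checked with a short Python sketch. The accuracy values are the ones reported in the table above; the function name `pct_improvement` is hypothetical.

```python
def pct_improvement(gam, baseline):
    """Relative improvement of GAM over a baseline, in percent: (GAM / baseline - 1) * 100."""
    return (gam / baseline - 1.0) * 100.0


gam_accuracy = 0.802
baseline_accuracy = {
    "TS": 0.621,
    "BA_uni": 0.599,
    "BA_opt": 0.601,
    "BA_pes": 0.522,
    "SA": 0.599,
}

for name, acc in baseline_accuracy.items():
    print(f"GAM vs. {name}: {pct_improvement(gam_accuracy, acc):.1f}%")
```

Rounded to one decimal place, these reproduce the accuracy row's % Improvement entries (29.1, 33.9, 33.4, 53.6, 33.9).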
  
Prediction Coverage vs. Performance
[Plot: MAE (y-axis, 0.1 to 0.5) vs. Coverage (x-axis, 0.25 to 1.00). Methods: GAM, TS, BA_uni, BA_opt, BA_pes, SA. Points for δ = 0, 0.05, 0.1, 0.15, 0.2, 0.25.]
Fig. 3. Prediction performance (MAE) of assessors' next judgments and corresponding coverage across varying decision rejection options (δ = [0~0.25] by 0.05). While the other methods show a significant decrease in coverage, under all the given reject options, GAM shows better coverage as well as prediction performance.

Decision Reject Option
[Diagram: prediction scale from 0 to 1 with thresholds at 0.5-δ, 0.5, and 0.5+δ]
Delta (δ): confidence threshold parameter for decision rejection (a larger δ increases accuracy, decreases coverage)
if ŷ < 0.5 - δ or ŷ ≥ 0.5 + δ then use ŷ
else discard ŷ
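The decision reject rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `p` stands for the model's predicted probability of a correct label, a kept prediction is mapped to a hard label (0 or 1), and a rejected one is represented by `None`. The function names and the sample probabilities are hypothetical.

```python
def apply_reject_option(p, delta):
    """Keep a prediction only if it lies outside the band [0.5 - delta, 0.5 + delta).

    Returns 0 or 1 for a kept prediction and None for a discarded one.
    """
    if p < 0.5 - delta:
        return 0
    if p >= 0.5 + delta:
        return 1
    return None


def coverage(probabilities, delta):
    """Fraction of predictions that survive the reject option."""
    kept = [p for p in probabilities if apply_reject_option(p, delta) is not None]
    return len(kept) / len(probabilities)


predicted = [0.9, 0.6, 0.2, 0.55]  # made-up probabilities for illustration
for delta in (0.0, 0.1, 0.25):
    print(delta, coverage(predicted, delta))
```

A larger delta discards more borderline predictions, which is why coverage falls as the remaining predictions become more reliable.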
Task Routing: Judgment Quality vs. Cost
Quality Improvement: GAM >> baselines (under varying # of judges)
Cost Saving: higher quality labels with fewer judges
No routing: 3.7 judges per task on average
                   Prediction Models for Task Routing                   No Routing
Number of Judges   GAM       TS      BAuni   BAopt   BApes   SA      Random   All labels
1                  0.786*    0.604   0.578   0.582   0.558   0.569   0.556    0.595
% Improvement      NA        30.1    36.0    35.1    40.9    38.1    41.4
2                  0.816**   0.617   0.592   0.595   0.574   0.582   0.572
% Improvement      NA        32.3    37.8    37.1    42.2    40.2    42.7
3                  0.880*    0.647   0.608   0.623   0.598   0.608   0.581
% Improvement      NA        36.0    44.7    41.3    47.2    44.7    51.5

Table 3. Accuracy of relevance judgments via predictive models. Number of Judges indicates the number of judges per query-document pair. When the Number of Judges > 1, majority voting is used for label aggregation. Accuracy is measured against NIST expert gold labels. % Improvement indicates an improvement in label accuracy between GAM vs. each baseline ((GAM - baseline) / baseline). The average number of judges per query-document pair is 3.7. (*) indicates that GAM prediction outperforms the other six methods with high statistical significance (p<0.01).
Quality is measured with accuracy, and a paired t-test is conducted to check whether [...]
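One way to picture routing followed by majority voting is the Python sketch below. It is an assumption-laden illustration, not the authors' procedure: it assumes routing picks the k candidate judges with the highest predicted probability of a correct label, and that a tie falls back to the most-trusted judge's label. The function name and the candidate data are hypothetical.

```python
def route_and_aggregate(candidates, k):
    """Pick the k judges most likely to be correct and majority-vote their labels.

    candidates: list of (predicted_prob_correct, binary_label) pairs for one
    query-document pair.
    """
    chosen = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    votes = [label for _, label in chosen]
    ones = sum(votes)
    zeros = len(votes) - ones
    if ones == zeros:
        # Tie-break assumption: defer to the judge with the highest predicted probability.
        return chosen[0][1]
    return 1 if ones > zeros else 0


# Made-up candidates: (predicted probability of being correct, submitted label)
candidates = [(0.9, 1), (0.4, 0), (0.8, 0), (0.3, 0)]

for k in (1, 2, 3):
    print(k, route_and_aggregate(candidates, k))
```

With k = 1 the result is simply the label from the judge predicted to be most reliable, which corresponds to the single-judge row of the table.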
Conclusion
Multi-dimensional Assessor Features
Build multi-dimensional features of crowd assessors (time-series, Bayesian, behavioral)
Discriminative Prediction Model
Integrate multi-dimensional features via a discriminative prediction model
Results
Better prediction accuracy, prediction coverage, higher quality relevance judgment in task routing
Future work
Effect of limited supervision, more realistic online task routing (with Bandit approaches)

 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy Presentation
 

Ecir2015 hjung final_public

  • 1. A Discriminative Approach to Predicting Assessor Accuracy. Hyun Joon Jung and Matthew Lease, University of Texas at Austin. Presented by HyunJoon Jung.
  • 2. Introduction - Crowdsourcing for IR evaluation
    5/20/15, ECIR '15, Hyun Joon Jung and Matthew Lease
    Outline: Introduction, Method, Evaluation, Conclusion
    Manual human judgments: too expensive (cost) and too slow (time)
    Crowdsourcing for IR evaluation
    •  Origin: Alonso et al. (SIGIR Forum 2008)
    •  Collecting relevance judgments from a globally distributed online crowd via the Internet
    Benefits of crowdsourcing-based IR evaluation: faster time, less cost, broader demographics
    Concern: quality
    Alternative efforts in IR: reducing the number of relevance judgments to collect (MTC, StatAP, and Pooling)
  • 3. Introduction - Quality Control in Crowdsourcing
    Diagram: Requester, Online marketplace, Crowd workers
    Existing quality control methods: label aggregation, workflow design, worker management, task design
    Who is more accurate? (worker performance estimation and prediction)
  • 4. Introduction - Problem setting
    Problem: find a worker who is most likely to produce a correct label
        Alice   1   1   0   0   ?
        Bob     1   0   1   0   ?
    Why: improve data quality and lower cost
    How to use:
    •  Task routing
    •  Label aggregation
    •  Worker filtering
    •  Intervention
  • 5. Introduction - Related work
    Problem: find a worker who is most likely to produce a correct label.
        Alice   1   1   0   0   ?
        Bob     1   0   1   0   ?
    Each entry is the correctness of the i-th task instance against a gold label (1 -> correct, 0 -> wrong).
    A typical way: measure accuracy (= 2/4 = 0.5), assuming observations are independent and identically distributed (i.i.d.).
    Let's relax this assumption -> conditional independence (Jung et al., HCOMP 2014).
    Consider this problem from a temporal perspective.
  • 6. Method - Idea
    Integrate multi-dimensional features of a crowd assessor.

    Predict an assessor's next label quality based on a single feature (Alice, temporal effect):
        temporal effect   label quality (1 = correct, 0 = wrong)
        0.6               0
        0.5               1
        0.4               0
        0.3               ?

    Multiple features (Alice):
        accuracy   time   temporal effect   topic familiarity   # of labels   label quality
        0.7        10.3   0.6               0.8                 20            0
        0.6        8.5    0.5               0.2                 21            1
        0.65       7.5    0.4               0.4                 22            0
        0.63       11.5   0.3               0.5                 23            ?

    Fig. 1. Two examples of failures of existing assessor models and success of GAM in predicting assessors' next label quality ((a) high-accuracy assessor and (b) low-accuracy assessor). While an actual assessor's next label quality (GOLD) oscillates over time, the existing assessor models (Time-series (TS) [9], Sample Running Accuracy (SA), Bayesian uniform beta prior (BA-UNI) [8]) do not follow the temporal variation of the gold labels since they are not able to capture the dynamics of labels properly. On the contrary, our proposed model, GAM, is very sensitive to such dynamics of labels over time for higher quality prediction.

    Paper excerpt: "... strong accuracy (0.8) which continually degrades over time, whereas accuracy of the right assessor (b) hovers steadily around 0.5. Suppose that a crowd worker's next label quality (y_t) is binary (correct/wrong) with respect to ground truth. While y_t oscillates over time, the existing models are not able to capture such temporal dynamics and thus prediction based on these models is almost always wrong. In particular, when an assessor's labeling accuracy is greater than 0.5 (e.g., avg. accuracy = 0.67 in Figure 1 (a)), the prediction based on the existing models is always 1 (correct) even though the actual assessor's next label quality oscillates over time. A similar problem happens in Figure 1 (b) with another worker whose average accuracy is below 0.5."
  • 7. Method - Crowd Assessor Features
    How do we flexibly capture a wider range of assessor behaviors by incorporating multi-dimensional features?

    Table 1. Features of the generalized assessor model (GAM). n_t is the number of total judgments and x_t is the number of correct relevance judgments at time t.

    Observable features
    •  Bayesian Optimistic Accuracy (BAopt) [4]: a Bayesian-style accuracy with a prior Beta(16, 1). BAopt = (x_t + 16) / (n_t + 17)
    •  Bayesian Pessimistic Accuracy (BApes) [4]: a Bayesian-style accuracy with a prior Beta(1, 16). BApes = (x_t + 1) / (n_t + 17)
    •  Bayesian Uniform Accuracy (BAuni) [8]: a Bayesian-style accuracy with a prior Beta(0.5, 0.5). BAuni = (x_t + 0.5) / (n_t + 1)
    •  Sample Running Accuracy (SA): SA_t = x_t / n_t
    •  CurrentLabelQuality: a binary value indicating whether the current label is correct or wrong.
    •  TaskTime: time spent completing this judgment task (ms).
    •  AccuracyChangeDirection (ACD): a binary value indicating the absolute difference between SA_{t-1} and SA_t.
    •  TopicChange: a binary value indicating a topic change between time t-1 and time t.
    •  NumLabels: the cumulative number of completed relevance judgments at time t.
    •  TopicEverSeen: a real value in [0, 1] indicating the familiarity of a topic. 1 / (number of judgments on topic k at time t)

    Latent features
    •  Asymptotic Accuracy (AA) [9]: a time-series accuracy estimated by the latent time-series model proposed by Jung et al. AA = c / (1 - φ)
    •  φ [9]: a temporal correlation indicating how frequently a sequence of correct/wrong observations has changed over time.
    •  c [9]: a variable indicating the direction of judgments between correct and wrong.

    Slide annotations: BAopt and BApes [1]; BAuni [2]; AA, φ, and c [3]. Feature groups: various accuracy measures, task features, temporal features. Citation numbers [4], [8], and [9] inside the table follow the paper's reference list; [1]-[3] are the slide's own references below.
    [1] Carterette, B., Soboroff, I.: The effect of assessor error on IR system evaluation. SIGIR '10
    [2] Ipeirotis, P.G., Gabrilovich, E.: Quizz: targeted crowdsourcing with a billion (potential) users. WWW '14
    [3] Jung, H., et al.: Predicting Next Label Quality: A Time-Series Model of Crowdwork. HCOMP '14
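The four observable accuracy features in Table 1 can be computed directly from a worker's history of correct/wrong outcomes. The following is a minimal sketch, not the authors' code: the function name `accuracy_features` and the choice to return 0.0 for SA when there is no history are assumptions made for illustration.

```python
def accuracy_features(correct_history):
    """Compute the observable accuracy features of Table 1.

    correct_history: list of 0/1 values, where 1 means the label matched gold.
    (Hypothetical helper; the name and return shape are illustrative.)
    """
    n = len(correct_history)  # n_t: total judgments so far
    x = sum(correct_history)  # x_t: correct judgments so far
    return {
        "BA_opt": (x + 16) / (n + 17),   # prior Beta(16, 1)
        "BA_pes": (x + 1) / (n + 17),    # prior Beta(1, 16)
        "BA_uni": (x + 0.5) / (n + 1),   # prior Beta(0.5, 0.5)
        "SA": x / n if n else 0.0,       # assumption: 0.0 when no history exists
    }


# Alice and Bob from the problem-setting slide.
alice = accuracy_features([1, 1, 0, 0])
bob = accuracy_features([1, 0, 1, 0])
```

Both workers have two correct labels out of four, so every feature here is identical for Alice and Bob even though their label sequences differ. This is the limitation that the temporal features are meant to address.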
  • 8. Method - Prediction model
    Generalizable feature-based Assessor Model (GAM)
    Input: X (features for the crowd assessor model)
    Learning framework: logistic regression (paper excerpt below)
    Output: Y (likelihood of getting a correct label at t)

    Paper excerpt: "In prediction, we consider a supervised learning task with N training instances {(x_i, y_i), i = 1, ..., N}. Here, each x_i \in R^M is an M-dimensional feature vector, and y_i \in {0, 1} is a class label indicating whether an assessor's next label is correct (1) or wrong (0). Before fitting a model to our features, we first normalize our features in order to ensure that normalized feature values implicitly weight all features equally in a model learning process. Logistic regression models the probability distribution of the class label y given a feature vector x as follows:

        p(y = 1 \mid x; \theta) = \sigma(\theta^T x) = \frac{1}{1 + \exp(-\theta^T x)}    (1)

    Here \theta = \{\beta_0, \beta_1^T, \ldots, \beta_M^T\} are the parameters of the logistic regression model, and \sigma(\cdot) is the sigmoid function, defined by the second equality. The following function attempts to maximize the log-likelihood in order to fit a model to a given training data.

        \max_{\theta} \left\{ \sum_{i=1}^{N} \left[ y_i(\beta_0 + \beta^T x_i) - \log\left(1 + e^{\beta_0 + \beta^T x_i}\right) \right] - \lambda \sum_{j=1}^{M} |\beta_j| \right\}    (2)

    3.3 Prediction with Decision Reject Option. Our predictive model can generate two types of outputs: a probabilistic label (y_{i+1} \in [0, 1]) indicating the degree of polarity and a binary label (0 or 1). While binary labels (hard labels) can be used as they are, probabilistic labels (soft labels) can be used after a ..."
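Equations (1) and (2) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the Greek symbols were lost in the slide text, so the names `beta0`, `beta`, and `lam` (intercept, feature weights, and L1 penalty weight) are assumed, and no optimizer is shown.

```python
import math


def sigmoid(z):
    """Logistic function used in equation (1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(beta0, beta, x):
    """p(y = 1 | x): probability that the next label is correct."""
    return sigmoid(beta0 + sum(b * v for b, v in zip(beta, x)))


def penalized_log_likelihood(beta0, beta, X, y, lam):
    """Objective of equation (2): log-likelihood minus an L1 penalty.

    lam is the (assumed) regularization weight; the intercept is not penalized.
    """
    ll = 0.0
    for xi, yi in zip(X, y):
        z = beta0 + sum(b * v for b, v in zip(beta, xi))
        ll += yi * z - math.log(1.0 + math.exp(z))
    return ll - lam * sum(abs(b) for b in beta)


# With all-zero parameters, every prediction is 0.5 and each instance
# contributes -log(2) to the objective.
p = predict_proba(0.0, [0.0, 0.0], [1.0, 2.0])
obj = penalized_log_likelihood(0.0, [0.0], [[1.0], [2.0]], [1, 0], lam=0.1)
```

The L1 term pushes the weights of uninformative features toward zero, which fits the feature-importance analysis later in the talk, where only some features carry weight in each per-assessor model.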
  • 9. Evaluation Setting
    Dataset: NIST TREC Crowdsourcing track 2011 dataset
    •  Prolific workers (numLabels >= 20) -> 54 workers
    •  Avg. number of labels per worker: 163
    Methods:
    •  GAM (Generalized Assessor Model)
    •  TS (Time-series Model, Jung et al. '14)
    •  BA (Bayesian Accuracy, Carterette & Soboroff '10, Ipeirotis & Gabrilovich '14)
    •  SA (Sample Accuracy)
  • 10. How important are the features?
    Fig. 4. Summary of relative feature importance across 54 individual prediction (regression) models. Bar values from the chart (axis 0 to 50):
        AA                    49
        BA_opt                43
        BA_PES                39
        C                     28
        NumLabels             27
        CurrentLabelQuality   23
        AccChangeDirection    22
        SA                    20
        Phi                   19
        BA_uni                16
        TaskTime              10
        TopicChange           7
        TopicEverSeen         5
    Paper excerpt (cropped on the slide): "... (27), which implicitly indicates that task familiarity affects an assessor's ..."
    A GAM with only the top 5 features shows good performance (7-10% less than the full-featured GAM).
  • 11. Prediction Performance
    Proposed model: GAM. Baselines: TS, BAuni, BAopt, BApes, SA.

        Metric          GAM      TS      BAuni   BAopt   BApes   SA
        Accuracy        0.802*   0.621   0.599   0.601   0.522   0.599
        % Improvement   NA       29.1    33.9    33.4    53.6    33.9
        # of Wins       NA       50      52      50      54      52
        # of Ties       NA       3       1       3       0       1
        # of Losses     NA       1       1       1       0       1
        MAE             0.340*   0.444   0.459   0.448   0.488   0.458
        % Improvement   NA       23.4    25.9    24.1    33.0    25.8
        # of Wins       NA       53      53      53      54      53
        # of Losses     NA       1       1       1       0       1

    Caption (cropped on the slide): Prediction performance (Accuracy and Mean Absolute Error) of different predictive models. % Improvement indicates an improvement in prediction performance between GAM vs. each baseline ((GAM - baseline) / baseline). # of Wins indicates the number of assessors for which GAM outperforms a method, while # of Losses indicates the opposite. # of Ties indicates the number of assessors for which both a method and GAM show the same prediction performance ...

    % improvement = (GAM / baseline) - 1
    e.g. GAM vs. TS: (0.802 / 0.621) - 1 = 0.291 = 29.1%
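The accuracy row of % Improvement can be reproduced from the slide's formula. This is a small check rather than part of the original talk: the function name `pct_improvement` is illustrative, and the values are copied from the table above.

```python
def pct_improvement(gam, baseline):
    """Relative improvement of GAM over a baseline, in percent.

    Implements (GAM / baseline) - 1 for a metric where higher is better.
    """
    return (gam / baseline - 1) * 100


gam_accuracy = 0.802
baseline_accuracy = {
    "TS": 0.621,
    "BAuni": 0.599,
    "BAopt": 0.601,
    "BApes": 0.522,
    "SA": 0.599,
}

# Round to one decimal place, as reported in the table.
improvements = {
    name: round(pct_improvement(gam_accuracy, acc), 1)
    for name, acc in baseline_accuracy.items()
}
```

The rounded values match the accuracy row in the table (29.1, 33.9, 33.4, 53.6, 33.9). For MAE, lower is better, so the same formula does not apply as written there.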
  • 12. Prediction Coverage vs. Performance
    [Figure: MAE (y-axis, 0.1 to 0.5) against coverage (x-axis, 0.25 to 1.00) for GAM, TS, BA_uni, BA_opt, BA_pes, and SA at δ = 0, 0.05, 0.1, 0.15, 0.2, 0.25]
    Fig. 3. Prediction performance (MAE) of assessors' next judgments and corresponding coverage across varying decision rejection options (δ = [0~0.25] by 0.05). While the other methods show a significant decrease in coverage, under all the given reject options, GAM shows better coverage as well as prediction performance.
    Decision reject option
    δ: confidence threshold parameter for decision rejection (a larger δ increases accuracy and decreases coverage)
    Probability scale: 0, 0.5 - δ, 0.5, 0.5 + δ, 1
    Rule: if y < 0.5 - δ or y >= 0.5 + δ then use y, else discard y (y is the predicted probability)
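The decision reject rule and its effect on coverage can be sketched as follows. This is a minimal illustration, assuming that an accepted probability is turned into a hard label by thresholding at 0.5; the function names `apply_reject_option` and `coverage` are hypothetical.

```python
def apply_reject_option(p, delta):
    """Return a hard label (1 or 0) if the prediction is confident, else None.

    p: predicted probability that the next label is correct.
    delta: confidence threshold; predictions inside
           [0.5 - delta, 0.5 + delta) are rejected.
    """
    if p < 0.5 - delta or p >= 0.5 + delta:
        return 1 if p >= 0.5 else 0  # assumption: hard label by 0.5 cutoff
    return None


def coverage(probs, delta):
    """Fraction of predictions that survive the reject option."""
    kept = [p for p in probs if apply_reject_option(p, delta) is not None]
    return len(kept) / len(probs)


probs = [0.9, 0.6, 0.1, 0.45]
full = coverage(probs, 0.0)     # delta = 0 keeps every prediction
strict = coverage(probs, 0.25)  # only 0.9 and 0.1 are confident enough
```

With δ = 0 nothing is rejected. Raising δ discards the predictions closest to 0.5, which is the trade-off between accuracy and coverage described on the slide.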
  • 13. Task Routing: Judgment Quality vs. Cost
    Quality improvement: GAM >> baselines (under varying # of judges)
    Higher-quality labels with fewer judges
    Cost saving: 3.7 judges per task

                           Prediction models for task routing                   No routing
        Number of Judges   GAM       TS      BAuni   BAopt   BApes   SA      Random   All labels
        1                  0.786*    0.604   0.578   0.582   0.558   0.569   0.556    0.595
          % Improvement    NA        30.1    36.0    35.1    40.9    38.1    41.4
        2                  0.816**   0.617   0.592   0.595   0.574   0.582   0.572
          % Improvement    NA        32.3    37.8    37.1    42.2    40.2    42.7
        3                  0.880*    0.647   0.608   0.623   0.598   0.608   0.581
          % Improvement    NA        36.0    44.7    41.3    47.2    44.7    51.5

    Table 3. Accuracy of relevance judgments via predictive models. Number of Judges indicates the number of judges per query-document pair. When the Number of Judges > 1, majority voting is used for label aggregation. Accuracy is measured against NIST expert gold labels. % Improvement indicates an improvement in label accuracy between GAM vs. each baseline ((GAM - baseline) / baseline). The average number of judges per query-document pair is 3.7. (*) indicates that GAM prediction outperforms the other six methods with high statistical significance (p < 0.01).
    Paper excerpt (cropped on the slide): "... quality is measured with accuracy, and a paired t-test is conducted to check whether ..."
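One way to picture routing followed by aggregation is sketched below. The slides do not spell out the routing procedure, so this is an assumed setup: each task goes to the k workers with the highest predicted probability of a correct label, and their labels are combined by majority vote. The names `route` and `majority_vote`, the worker names, and all numbers are made up for illustration.

```python
from collections import Counter


def route(predicted_correctness, k):
    """Pick the k workers with the highest predicted probability of being correct.

    predicted_correctness: dict mapping worker id -> predicted probability.
    """
    ranked = sorted(predicted_correctness, key=predicted_correctness.get, reverse=True)
    return ranked[:k]


def majority_vote(labels):
    """Return the most common label.

    Assumption: ties go to the label seen first, which is how
    Counter.most_common orders equal counts.
    """
    return Counter(labels).most_common(1)[0][0]


# Hypothetical predictions and submitted labels for one query-document pair.
predictions = {"alice": 0.9, "bob": 0.4, "carol": 0.7, "dave": 0.8}
submitted = {"alice": 1, "bob": 0, "carol": 0, "dave": 1}

chosen = route(predictions, k=3)
final_label = majority_vote([submitted[w] for w in chosen])
```

Here the three selected workers submit labels 1, 1, and 0, so the aggregated label is 1. An odd number of judges avoids ties for binary labels.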
  • 14. Conclusion
    Multi-dimensional assessor features: build multi-dimensional features of crowd assessors (time-series, Bayesian, behavioral)
    Discriminative prediction model: integrate multi-dimensional features via a discriminative prediction model
    Results: better prediction accuracy, better prediction coverage, and higher-quality relevance judgments in task routing
    Future work: effect of limited supervision; more realistic online task routing (with bandit approaches)