Crowdsourcing for Multimedia Retrieval
Lecture of Marco Tagliasacchi (Politecnico di Milano) for the Summer School on Social Media Modeling and Search and the European Chapter of the ACM SIGMM event, supported by the CUbRIK and Social Sensor projects.

10-14 September, Fira, Santorini, Greece

Usage Rights: CC Attribution-NonCommercial-ShareAlike License

Presentation Transcript

    • Crowdsourcing for Multimedia Retrieval. Marco Tagliasacchi, Politecnico di Milano, Italy
    • Outline
      - Crowdsourcing applications in multimedia retrieval
      - Aggregating annotations
      - Aggregating and learning
      - Crowdsourcing at work
    • Crowdsourcing applications in multimedia retrieval
    • Crowdsourcing
      - Crowdsourcing is an example of human computing
      - Use an online community of human workers to complete useful tasks
      - The task is outsourced to an undefined public
      - Main idea: design tasks that are easy for humans and hard for machines
    • Crowdsourcing platforms
      - Paid contributors: Amazon Mechanical Turk (www.mturk.com), CrowdFlower (crowdflower.com), oDesk (www.odesk.com), ...
      - Volunteers: Foldit (www.fold.it), Duolingo (www.duolingo.com), ...
    • Applications in multimedia retrieval
      - Create annotated data sets for training: reduces both cost and time needed to gather annotations, but annotations might be noisy!
      - Validate the output of multimedia retrieval systems
      - Query expansion / reformulation
    • Creating annotated training sets [Sorokin and Forsyth, 2008]
      - Collect annotations for computer vision data sets
      - People segmentation
      [Figure: example segmentations obtained with annotation Protocols 1 and 2]
    • Creating annotated training sets [Sorokin and Forsyth, 2008]
      - Collect annotations for computer vision data sets
      - People segmentation and pose annotation
      [Figure 1: example results from the annotation experiments, Protocols 2-4]
    • Creating annotated training sets [Sorokin and Forsyth, 2008]
      - Observations:
        - Annotators make errors
        - The quality of annotators is heterogeneous
        - The quality of the annotations depends on the difficulty of the task
      [Figures 5-6: annotation quality for Experiment 3 (trace the boundary of the person; area(XOR)/area(AND), mean 0.21, std 0.14, median 0.16) and Experiment 4 (click on 14 landmarks; mean error in pixels, mean 8.71, std 6.29, median 7.35)]
    • Creating annotated training sets [Soleymani and Larson, 2010]
      - MediaEval 2010 Affect Task
      - Use of Amazon Mechanical Turk to annotate the Affect Task Corpus
      - 126 videos (2-5 mins in length)
      - Annotate:
        - Mood (e.g., pleased, helpless, energetic, etc.)
        - Emotion (e.g., sadness, joy, anger, etc.)
        - Boredom (nine-point rating scale)
        - Liking (nine-point rating scale)
    • Creating annotated training sets [Nowak and Ruger, 2010]
      - Crowdsourcing image concepts: 53 concepts, e.g.,
        - Abstract categories: partylife, beach holidays, snow, etc.
        - Time of the day: day, night, no visual cue
        - ...
      - Subset of 99 images from the ImageCLEF2009 dataset
      [Figure: MTurk HIT template for the annotation task, arranged as a question survey with sections Scene Description, Representation, and Pictured Objects]
    • Creating annotated training sets [Nowak and Ruger, 2010]
      - Study of expert and non-expert labeling
      - Inter-annotation agreement among experts: very high
      - Influence of the expert ground truth on concept-based retrieval ranking: very limited
      - Inter-annotation agreement among non-experts: high, although not as good as among experts
      - Influence of averaged annotations (experts vs. non-experts) on concept-based retrieval ranking: averaging filters out noisy non-expert annotations
    • Creating annotated training sets [Vondrick et al., 2010]
      - Crowdsourcing object tracking in video
      - Annotators draw bounding boxes
      [Figure 2: video labeling user interface; all previously labeled entities are shown]
    • Creating annotated training sets [Vondrick et al., 2010]
      - Annotators label the enclosing bounding box of an entity every T frames
      - Bounding boxes at intermediate time instants are interpolated
      - Interesting trade-off between:
        - Cost of MTurk workers
        - Cost of interpolation on the Amazon EC2 cloud
      [Figure: example sequences (field drills, basketball players)]
    • Creating annotated training sets [Urbano et al., 2010]
      - Goal: evaluation of music information retrieval systems
      - Use crowdsourcing as an alternative to experts to create ground truths of partially ordered lists
      - Workers listen to two incipits and judge which variation is more similar to the original melody (preference judgments)
      - Good agreement (92% complete + partial) with experts
    • Validate the output of MIR systems [Snoek et al., 2010][Freiburg et al., 2011]
      - Search engine for archival rock 'n' roll concert video
      - Use of crowdsourcing to improve, extend and share automatically detected concepts in video fragments
      [Figure 1: eleven common concert concepts detected automatically (audience, close-up, hands, Pinkpop hat, keyboard, guitar player, singer, stage, Pinkpop logo, drummer, over the shoulder); Figure 2: timeline-based video player where colored dots correspond to automated visual detections and collect user feedback]
    • Validate the output of MIR systems [Steiner et al., 2011]
      - Crowdsourcing event detection in YouTube videos
      - Propose a browser extension to navigate detected events in videos:
        - Visual events (shot changes)
        - Occurrence events (analysis of metadata by means of NLP to detect named entities)
        - Interest-based events (click counters on detected visual events)
    • Validate the output of MIR systems [Goeau et al., 2011]
      - Visual plant species identification
      - Based on local visual features
      - Crowdsourced validation
      [Figure 1: GUI of the web application]
    • Validate the output of MIR systems [Yan et al., 2010]
      - CrowdSearch combines:
        - Automated image search: local processing on mobile phones + backend processing
        - Real-time human validation of search results: Amazon Mechanical Turk
      - Studies the trade-off in terms of:
        - Delay
        - Accuracy
        - Cost
      - More on this later...
    • Query expansion / reformulation [Harris, 2012]
      - Search YouTube user-generated content
      - Natural language queries are restated and given as input to the YouTube search interface by:
        - Students
        - Crowd workers in MTurk
      - Both the student search and the crowdsourcing search strategies performed better than the plain YouTube search, as measured by MAP
    • Aggregating annotations
    • Annotation model
      - A set of objects to annotate: i = 1, ..., I
      - A set of annotators: j = 1, ..., J
      - Types of annotations:
        - Binary
        - Categorical (multi-class)
        - Numerical
        - Other
    • Annotation model
      - Objects have true labels y_i; annotator j provides the annotation y_i^j ∈ L
      - Binary: |L| = 2
      - Multi-class: |L| > 2
      [Figure: bipartite graph linking objects (with true labels) to annotators (providing annotations)]
    • Aggregating annotations
      - Majority voting (baseline): for each object, assign the label that received the largest number of votes
      - Aggregating annotations: [Dawid and Skene, 1979], [Snow et al., 2008], [Whitehill et al., 2009], ...
      - Aggregating and learning: [Sheng et al., 2008], [Donmez et al., 2009], [Raykar et al., 2010], ...
    • Aggregating annotations: majority voting
      - Assume that:
        - The annotator quality is independent from the object: P(y_i^j = y_i) = p^j
        - All annotators have the same quality: p^j = p
      - The integrated quality of majority voting using I = 2N + 1 annotators is
        q = P(y^{MV} = y) = Σ_{i=0}^{N} C(2N+1, i) p^{2N+1-i} (1 - p)^i
    • Aggregating annotations: majority voting
      [Figure 2 from Sheng et al.: the relationship between integrated labeling quality q, individual quality p (from 0.4 to 1.0), and the number of labelers (1 to 13)]
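As a quick sanity check of the formula above, here is a minimal Python sketch (not part of the original slides) that computes the integrated quality q for the values of p and the odd numbers of labelers shown in the plot:

```python
from math import comb

def majority_vote_quality(p: float, num_labelers: int) -> float:
    """Probability that the majority vote of 2N+1 labelers, each correct
    with probability p, equals the true label."""
    assert num_labelers % 2 == 1, "use an odd number of labelers (2N+1)"
    n_wrong_max = num_labelers // 2  # N: at most N labelers may be wrong
    return sum(
        comb(num_labelers, i) * p ** (num_labelers - i) * (1 - p) ** i
        for i in range(n_wrong_max + 1)
    )

if __name__ == "__main__":
    for p in (0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
        print(p, [round(majority_vote_quality(p, k), 3) for k in (1, 3, 5, 7, 9, 11, 13)])
```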
    • Aggregating annotations [Snow et al., 2008]
      - Binary labels: y_i^j ∈ {0, 1}
      - The true label is estimated evaluating the posterior log-odds, i.e.,
        log [ P(y_i = 1 | y_i^1, ..., y_i^J) / P(y_i = 0 | y_i^1, ..., y_i^J) ]
      - Applying Bayes' theorem:
        log [ P(y_i = 1 | y_i^1, ..., y_i^J) / P(y_i = 0 | y_i^1, ..., y_i^J) ]
          = Σ_j log [ P(y_i^j | y_i = 1) / P(y_i^j | y_i = 0) ] + log [ P(y_i = 1) / P(y_i = 0) ]
        (posterior = likelihood + prior)
    • Aggregating annotations [Snow et al., 2008]
      - How to estimate P(y_i^j | y_i = 1) and P(y_i^j | y_i = 0)?
      - Gold standard:
        - Some objects have known labels
        - Ask to annotate these objects
        - Compute the empirical p.m.f. from the objects with known labels:
          P(y^j = 1 | y = 1) = (number of correct annotations) / (number of annotations of objects with label 1)
      - Compute the performance of annotator j (independent from the object):
        P(y_1^j | y_1 = 1) = P(y_2^j | y_2 = 1) = ... = P(y_I^j | y_I = 1) = P(y^j | y = 1)
    • Aggregating annotations [Snow et al., 2008]
      - Each annotator vote is weighted by the log-likelihood ratio for their given response (Naïve Bayes)
      - More reliable annotators are weighted more:
        log [ P(y_i = 1 | y_i^1, ..., y_i^J) / P(y_i = 0 | y_i^1, ..., y_i^J) ]
          = Σ_j log [ P(y_i^j | y_i = 1) / P(y_i^j | y_i = 0) ] + log [ P(y_i = 1) / P(y_i = 0) ]
      - Issue: obtaining a gold standard is costly!
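A small Python sketch of this Naïve Bayes aggregation rule, assuming the per-annotator true positive and true negative rates have already been estimated on a gold-standard set; the array layout (one row per object, one column per annotator, NaN for missing labels) is an illustrative choice, not something prescribed by the slides:

```python
import numpy as np

def naive_bayes_aggregate(labels, tpr, tnr, prior_pos=0.5):
    """labels: (I, J) array in {0, 1}, NaN where annotator j did not label i.
    tpr[j] = P(y^j = 1 | y = 1), tnr[j] = P(y^j = 0 | y = 0),
    e.g. estimated on a small gold-standard set.
    Returns posterior log-odds and hard labels."""
    labels = np.asarray(labels, dtype=float)
    log_odds = np.full(labels.shape[0], np.log(prior_pos / (1 - prior_pos)))
    for j in range(labels.shape[1]):
        obs = ~np.isnan(labels[:, j])
        y = labels[obs, j]
        # log P(y^j | y=1) - log P(y^j | y=0) for the observed response
        llr = np.where(y == 1,
                       np.log(tpr[j]) - np.log(1 - tnr[j]),
                       np.log(1 - tpr[j]) - np.log(tnr[j]))
        log_odds[obs] += llr
    return log_odds, (log_odds > 0).astype(int)

# Toy usage with made-up annotator qualities
print(naive_bayes_aggregate([[1, 1, 0], [0, np.nan, 0]],
                            tpr=[0.9, 0.8, 0.6], tnr=[0.9, 0.7, 0.6]))
```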
    • Aggregating annotations [Kumar and Lease, 2011]
      - With very accurate annotators (p^j ~ U(0.6, 1.0)), generating multiple labels per example provides little benefit: labeling effort is better spent single-labeling more examples
      - With very noisy annotators (p^j ~ U(0.3, 0.7)), aggregating labels helps, if annotator accuracies are taken into account: majority voting just aggregates the noise, while modeling worker accuracies and weighting their labels appropriately improves consensus label accuracy (and thereby classifier accuracy)
      - SL: Single Labeling, MV: Majority Voting, NB: Naïve Bayes
    • Aggregating annotations [Dawid and Skene, 1979]
      - Multi-class labels: y_i^j ∈ {1, ..., K}
      - Each annotator is characterized by the (unknown) error rates
        π_{lk}^j = P(y^j = l | y = k),  k, l = 1, ..., K
      - Given a set of observed labels D = {y_i^1, ..., y_i^J}_{i=1}^I, estimate:
        - The error rates π_{lk}^j
        - The a-posteriori probabilities P(y_i = k | D)
    • Aggregating annotations [Dawid and Skene, 1979]
      - For simplicity, consider the case with binary labels y_i^j ∈ {0, 1}
      - Each annotator is characterized by the (unknown) error rates:
        P(y_i^j = 1 | y_i = 1) = α_1^j   (true positive rate)
        P(y_i^j = 0 | y_i = 0) = α_0^j   (true negative rate)
      - Also assume that the prior is known, i.e., P(y_i = 1) = 1 - P(y_i = 0) = p_i
    • Aggregating annotations [Dawid and Skene, 1979]
      [Graphical model: the true labels y_1, ..., y_I generate the observed labels y_i^j through the annotator accuracies α^1, ..., α^J]
    • Aggregating annotations [Dawid and Skene, 1979]
      - The likelihood function of the parameters {α_1, α_0} given the observations D = {y_i^1, ..., y_i^J}_{i=1}^I is factored as
        P(D | α_1, α_0) = Π_{i=1}^I P(y_i^1, ..., y_i^J | α_1, α_0)
          = Π_{i=1}^I [ P(y_i^1, ..., y_i^J | y_i = 1, α_1) P(y_i = 1)
                      + P(y_i^1, ..., y_i^J | y_i = 0, α_0) P(y_i = 0) ]
    • Aggregating annotations [Dawid and Skene, 1979]
      - The per-class likelihoods factor over annotators:
        P(y_i^1, ..., y_i^J | y_i = 1, α_1) = Π_{j=1}^J [α_1^j]^{y_i^j} [1 - α_1^j]^{1 - y_i^j}
        P(y_i^1, ..., y_i^J | y_i = 0, α_0) = Π_{j=1}^J [α_0^j]^{1 - y_i^j} [1 - α_0^j]^{y_i^j}
    • Aggregating annotations [Dawid and Skene, 1979]
      - The parameters are found by maximizing the log-likelihood function:
        {α̂_1, α̂_0} = arg max_θ log P(D | θ),   θ = {α_1, α_0}
      - The solution is based on Expectation-Maximization
      - Expectation step:
        μ_i = P(y_i = 1 | y_i^1, ..., y_i^J, θ)
            ∝ P(y_i^1, ..., y_i^J | y_i = 1, θ) P(y_i = 1 | θ)
            = a_{1,i} p_i / [ a_{1,i} p_i + a_{0,i} (1 - p_i) ]
        where p_i = P(y_i = 1) (prior),
        a_{1,i} = Π_{j=1}^J [α_1^j]^{y_i^j} [1 - α_1^j]^{1 - y_i^j},
        a_{0,i} = Π_{j=1}^J [α_0^j]^{1 - y_i^j} [1 - α_0^j]^{y_i^j}
    • Aggregating annotations [Dawid and Skene, 1979]
      - Maximization step: the true positive and true negative rates can be estimated in closed form:
        α_1^j = Σ_{i=1}^I μ_i y_i^j / Σ_{i=1}^I μ_i
        α_0^j = Σ_{i=1}^I (1 - μ_i)(1 - y_i^j) / Σ_{i=1}^I (1 - μ_i)
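A compact Python sketch of the resulting EM iteration for the binary case, alternating the E-step and M-step above; the vote-share initialization and the numerical clipping are implementation choices of this sketch, not part of the original formulation:

```python
import numpy as np

def dawid_skene_binary(labels, prior_pos=0.5, n_iter=50):
    """labels: (I, J) array in {0, 1}, NaN where annotator j did not label i.
    Returns the posterior P(y_i = 1) and per-annotator (alpha1, alpha0)."""
    labels = np.asarray(labels, dtype=float)
    I, J = labels.shape
    obs = ~np.isnan(labels)
    mu = np.where(obs, labels, 0).sum(1) / np.maximum(obs.sum(1), 1)  # init: vote share
    for _ in range(n_iter):
        # M-step: true-positive / true-negative rates per annotator
        a1 = np.array([(mu[obs[:, j]] * labels[obs[:, j], j]).sum()
                       / max(mu[obs[:, j]].sum(), 1e-12) for j in range(J)])
        a0 = np.array([((1 - mu[obs[:, j]]) * (1 - labels[obs[:, j], j])).sum()
                       / max((1 - mu[obs[:, j]]).sum(), 1e-12) for j in range(J)])
        a1, a0 = np.clip(a1, 1e-6, 1 - 1e-6), np.clip(a0, 1e-6, 1 - 1e-6)
        # E-step: posterior of the true label for every object
        log_a1 = np.where(obs, np.where(labels == 1, np.log(a1), np.log(1 - a1)), 0).sum(1)
        log_a0 = np.where(obs, np.where(labels == 0, np.log(a0), np.log(1 - a0)), 0).sum(1)
        num = np.exp(log_a1) * prior_pos
        mu = num / (num + np.exp(log_a0) * (1 - prior_pos))
    return mu, a1, a0

# Toy usage with made-up labels
L = np.array([[1, 1, 0], [1, np.nan, 1], [0, 0, 0], [0, 1, 0]], dtype=float)
mu, a1, a0 = dawid_skene_binary(L)
print(np.round(mu, 2), np.round(a1, 2), np.round(a0, 2))
```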
    • Aggregating annotations [Tang and Lease, 2011]
      - A semi-supervised approach between:
        - A supervised approach based on a gold standard: Naïve Bayes [Snow et al., 2008]
        - An unsupervised approach: Expectation-Maximization [Dawid and Skene, 1979]
      - A very modest amount of supervision can provide significant benefit
      [Figures 4-7: supervised NB vs. unsupervised MV and EM, and semi-supervised SNB vs. supervised NB, on a synthetic dataset and on an MTurk dataset]
    • Aggregating annotations [Whitehill et al., 2009]
      - Binary labels: y_i^j ∈ {0, 1}
      - Annotators have different expertise:
        p(y_i^j = y_i | α_j, β_i) = 1 / (1 + e^{-α_j β_i})
      - More skilled annotators (higher α_j) have a higher probability of labeling correctly
      - As the difficulty of the image 1/β_i increases, the probability of the label being correct moves towards 0.5
      - GLAD (Generative model of Labels, Abilities, and Difficulties)
    • Aggregating annotations [Whitehill et al., 2009]
      [Graphical model: object difficulties β_1, ..., β_I and annotator accuracies α^1, ..., α^J jointly generate the observed labels from the true labels y_1, ..., y_I]
    • Aggregating annotations [Whitehill et al., 2009]
      - The observed labels are samples from the {y_i^j} random variables
      - The unobserved variables are:
        - The true image labels y_i, i = 1, ..., I
        - The object difficulty parameters β_i, i = 1, ..., I
        - The annotator accuracies α^j, j = 1, ..., J
      - Goal: find the most likely values of the unobserved variables given the observed data
      - Solution: Expectation-Maximization (EM)
    • Aggregating annotations [Whitehill et al., 2009]
      - Expectation step: compute the posterior probabilities of all y_i ∈ {0, 1}, given the α, β values from the last M step
      - By Bayes' theorem and annotator independence:
        P(y_i | y_i^1, ..., y_i^J, α, β_i) ∝ P(y_i) Π_j P(y_i^j | y_i, α_j, β_i)
        p(y_i^j | y_i = 1, α_j, β_i) = [1 / (1 + e^{-α_j β_i})]^{y_i^j} [1 - 1 / (1 + e^{-α_j β_i})]^{1 - y_i^j}
    • Aggregating annotations [Whitehill et al., 2009]
      - Maximization step: maximize the auxiliary function
        Q(α, β) = E[log p(y_1^1, ..., y_I^J, y | α, β)]
        where the expectation is with respect to the posterior probabilities of all y_i ∈ {0, 1} computed in the E-step:
        Q(α, β) = Σ_i E[log p(y_i)] + Σ_{ij} E[log p(y_i^j | y_i, α_j, β_i)]
      - The parameters α, β are estimated using gradient ascent:
        (α*, β*) = arg max_{α, β} Q(α, β)
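A simplified Python sketch of this EM procedure, assuming a uniform prior P(y_i = 1) = 0.5 and using a generic L-BFGS optimizer for the M-step in place of hand-written gradient ascent; it follows the GLAD model only in its basic form (no priors on alpha and beta), so it is an illustration rather than the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

def glad_fit(labels, n_iter=20):
    """labels: (I, J) in {0, 1}, NaN = missing. Returns P(y_i = 1), alpha, beta."""
    labels = np.asarray(labels, float)
    I, J = labels.shape
    obs = ~np.isnan(labels)
    alpha, log_beta = np.ones(J), np.zeros(I)   # beta parameterized as exp(log_beta) > 0
    mu = np.where(obs, labels, 0).sum(1) / np.maximum(obs.sum(1), 1)

    def log_sigmoid(z):
        return -np.logaddexp(0.0, -z)

    def neg_Q(params):
        a, lb = params[:J], params[J:]
        z = np.exp(lb)[:, None] * a[None, :]                   # alpha_j * beta_i
        lp_correct, lp_wrong = log_sigmoid(z), log_sigmoid(-z)
        # E[log p(y_ij | y_i)] under the posterior mu_i from the E-step
        agree1 = np.where(labels == 1, lp_correct, lp_wrong)   # if y_i = 1
        agree0 = np.where(labels == 0, lp_correct, lp_wrong)   # if y_i = 0
        q = np.where(obs, mu[:, None] * agree1 + (1 - mu)[:, None] * agree0, 0).sum()
        return -q

    for _ in range(n_iter):
        # E-step: posterior over y_i given the current alpha, beta (uniform prior cancels)
        z = np.exp(log_beta)[:, None] * alpha[None, :]
        lp_c, lp_w = log_sigmoid(z), log_sigmoid(-z)
        ll1 = np.where(obs, np.where(labels == 1, lp_c, lp_w), 0).sum(1)
        ll0 = np.where(obs, np.where(labels == 0, lp_c, lp_w), 0).sum(1)
        mu = 1.0 / (1.0 + np.exp(ll0 - ll1))
        # M-step: gradient-based maximization of Q(alpha, beta)
        res = minimize(neg_Q, np.concatenate([alpha, log_beta]), method="L-BFGS-B")
        alpha, log_beta = res.x[:J], res.x[J:]
    return mu, alpha, np.exp(log_beta)
```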
    • Aggregating annotations [Whitehill et al., 2009]
      [Figure 2: accuracy of GLAD vs. majority vote, and recovery of the true alpha and beta parameters, as a function of the number of labelers (simulation data)]
      - Simulated labeler accuracy by image type: good labelers 0.95 on hard images, 1 on easy images; bad labelers 0.54 on hard images, 1 on easy images
      - Error in the estimated labels: GLAD 4.5%, Majority Vote 11.2%; modeling image difficulty can result in significant performance improvements
    • Aggregating annotations [Welinder and Perona, 2010]
      - Setting similar to [Whitehill et al., 2009], with some differences:
        - Object difficulty is not explicitly modeled
        - Annotator quality distinguishes between true positive and true negative rate:
          α^j = [α_0^j, α_1^j]^T
          P(y_i^j = 1 | y_i = 1) = α_1^j
          P(y_i^j = 0 | y_i = 0) = α_0^j
      - A prior distribution is set on α^j to capture two kinds of annotators:
        - Honest annotators (with different qualities, from unreliable to experts)
        - Adversarial annotators
    • Aggregating annotations [Welinder and Perona, 2010]
      - Batch algorithm
      - Expectation step:
        P(y_i | y_i^1, ..., y_i^J, α) ∝ P(y_i) Π_j P(y_i^j | y_i, α^j)
        P(y_i^j | y_i = 1, α^j) = (α_1^j)^{y_i^j} (1 - α_1^j)^{1 - y_i^j}
        P(y_i^j | y_i = 0, α^j) = (α_0^j)^{1 - y_i^j} (1 - α_0^j)^{y_i^j}
      - Maximization step (including the prior on α^j):
        Q(α) = Σ_i E[log P(y_i)] + Σ_{ij} E[log P(y_i^j | y_i, α^j)] + Σ_j log P(α^j)
        α* = arg max_α Q(α)
    • Aggregating annotations [Welinder and Perona, 2010]
      [Figure 7: comparison between the majority rule, GLAD [Whitehill et al., 2009], and the batch algorithm on synthetic data as the number of assignments per image is increased; the proposed algorithm achieves a consistently lower error rate]
    • Aggregating annotations [Welinder and Perona, 2010]
      - Online algorithm
      - For each annotator:
        - Estimate α^j
        - If the estimate of the annotator quality is reliable (var(α^j) < θ):
          - If annotator j is an expert, add to the expert list: E ← E ∪ {j}
          - Otherwise, add to the bot list: B ← B ∪ {j}
    • Aggregating annotations [Welinder and Perona, 2010]
      - Online algorithm (continued); see the sketch below
      - For each object to be annotated:
        - Compute P(y_i) from the available labels y_i^j and α
        - If the estimated label is unreliable (max_{y_i} P(y_i) < τ), ask the experts in list E
        - If labels cannot be obtained from experts, ask annotators not in the bot list B
        - Stop asking for labels when max_{y_i} P(y_i) ≥ τ or the maximum number of annotations is exceeded
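A self-contained toy sketch of this online policy in Python. The simulated crowd, the fixed confidence threshold tau, and the fact that the annotator qualities are taken as known (rather than re-estimated online as in the actual algorithm) are simplifications for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_positive(labels, alphas, prior=0.5):
    """labels: dict annotator j -> observed y^j in {0, 1};
    alphas: dict j -> (tnr, tpr). Returns P(y = 1 | labels)."""
    log_odds = np.log(prior / (1 - prior))
    for j, y in labels.items():
        tnr, tpr = alphas[j]
        log_odds += (np.log(tpr) - np.log(1 - tnr)) if y == 1 else \
                    (np.log(1 - tpr) - np.log(tnr))
    return 1.0 / (1.0 + np.exp(-log_odds))

def label_one_object(true_y, alphas, experts, bots, tau=0.95, max_labels=15):
    """Ask for labels, preferring experts, until the posterior is confident."""
    labels = {}
    candidates = [j for j in experts] + \
                 [j for j in alphas if j not in experts and j not in bots]
    for j in candidates[:max_labels]:
        tnr, tpr = alphas[j]
        p_correct = tpr if true_y == 1 else tnr
        labels[j] = true_y if rng.random() < p_correct else 1 - true_y  # simulated answer
        p1 = posterior_positive(labels, alphas)
        if max(p1, 1 - p1) >= tau:          # estimated label is reliable: stop asking
            break
    return int(posterior_positive(labels, alphas) > 0.5), len(labels)

# Toy run: three good annotators (0.9/0.9) and seven mediocre ones (0.6/0.6)
alphas = {j: ((0.9, 0.9) if j < 3 else (0.6, 0.6)) for j in range(10)}
print(label_one_object(true_y=1, alphas=alphas, experts={0, 1, 2}, bots=set()))
```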
    • Aggregating annotations [Welinder and Perona, 2010]
      - The online algorithm allows to reduce the number of annotations per object, for the same target error rate
      [Figure 8: error rates vs. the number of labels used per image on the Presence datasets, for the online algorithm and the batch version]
    • Aggregating annotations [Karger et al., 2011]
      - Infers labels and annotator qualities
      - No prior knowledge on annotator qualities
      - Inspired by belief propagation and message passing
      - Binary labels: y_i^j ∈ {-1, +1}
      - Define an I × J matrix A, such that A_{ij} = y_i^j
    • Aggregating annotations [Karger et al., 2011]
      [Figure: bipartite graph between objects and annotators; each edge (i, j) carries the observed label A_{ij} ∈ {-1, +1}, collected in the matrix A]
    • Aggregating annotations [Karger et al., 2011]
      - Object-to-annotator messages:
        x_{i→j}^{(k)} = Σ_{j' ∈ ∂i \ j} A_{ij'} y_{j'→i}^{(k-1)}
      - x_{i→j}: estimated (soft) label of object i using all annotators but j
      - y_{j→i}: reliability of annotator j in estimating object i
    • Aggregating annotations [Karger et al., 2011]
      - Annotator-to-object messages:
        y_{j→i}^{(k)} = Σ_{i' ∈ ∂j \ i} A_{i'j} x_{i'→j}^{(k)}
      - y_{j→i}: reliability of annotator j in estimating object i
    • Aggregating annotations [Karger et al., 2011]
      - Final estimate:
        ŷ_i = sgn( Σ_{j ∈ ∂i} A_{ij} y_{j→i} )
    • Aggregating annotations [Karger et al., 2011]
      - The final estimate is the sum of the answers weighted by each worker's reliability:
        ŝ_i = sign( Σ_{j ∈ ∂i} A_{ij} y_{j→i} )
        (when there is a tie, flip a fair coin to make a decision)
      - Iterative Algorithm
        Input: E, {A_{ij}}_{(i,j)∈E}, k_max
        Output: estimate ŝ({A_{ij}})
        1: For all (i, j) ∈ E: initialize y_{j→i}^{(0)} with random Z_{ij} ~ N(1, 1)
        2: For k = 1, ..., k_max:
             For all (i, j) ∈ E: x_{i→j}^{(k)} ← Σ_{j' ∈ ∂i \ j} A_{ij'} y_{j'→i}^{(k-1)}
             For all (i, j) ∈ E: y_{j→i}^{(k)} ← Σ_{i' ∈ ∂j \ i} A_{i'j} x_{i'→j}^{(k)}
        3: For all i: x_i ← Σ_{j ∈ ∂i} A_{ij} y_{j→i}^{(k_max - 1)}
        4: Output the estimate vector ŝ({A_{ij}}) = [sign(x_i)]
      - The inference algorithm requires no information about the prior distribution of the workers' quality p_j; it is inspired by power iteration
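A possible Python implementation of the iterative algorithm above, using a dense I × J matrix with zeros for missing labels; random tie-breaking mirrors the fair-coin rule:

```python
import numpy as np

def karger_oh_shah(A, k_max=10, seed=0):
    """A: (I, J) array with entries in {-1, +1}, and 0 where annotator j did
    not label object i. Returns the estimated labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=float)
    mask = (A != 0).astype(float)
    # Step 1: initialize worker-to-task messages y_{j->i} with N(1, 1) samples
    y = rng.normal(1.0, 1.0, size=A.shape) * mask
    for _ in range(k_max):
        # x_{i->j} = sum over j' != j of A_{ij'} y_{j'->i}
        x = ((A * y).sum(axis=1, keepdims=True) - A * y) * mask
        # y_{j->i} = sum over i' != i of A_{i'j} x_{i'->j}
        y = ((A * x).sum(axis=0, keepdims=True) - A * x) * mask
    # Final decision: sign of the reliability-weighted vote
    x_final = (A * y).sum(axis=1)
    signs = np.sign(x_final)
    ties = signs == 0
    signs[ties] = rng.choice([-1.0, 1.0], size=ties.sum())   # fair coin on ties
    return signs.astype(int)

# Toy example: 4 objects, 5 annotators (0 = missing label)
A_demo = np.array([[ 1,  1, -1,  1,  0],
                   [ 1,  1,  1,  0, -1],
                   [-1, -1, -1, -1,  1],
                   [ 0,  1,  1,  1,  1]])
print(karger_oh_shah(A_demo))
```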
    • Aggregating and learning
    • Aggregating and learning [Sheng et al., 2008]
      - Use several noisy labels to create labeled data for training classifiers
      - Training samples ⟨y_i, x_i⟩ (true label, feature vector)
      - Labels might be noisy
      [Figure 1: learning curves under different quality levels of the training data, where q is the probability of a label being correct]
    • Aggregating and learning [Sheng et al., 2008]
      - When training a classifier, consider two options:
        - Acquire a new training example ⟨y_i, x_i⟩
        - Get another label for an existing example
      - Compare two strategies:
        - SL (single labeling): acquires additional examples, each with one noisy label
        - MV (majority voting): acquires additional noisy labels for existing examples
    • Aggregating and learning [Sheng et al., 2008]
      - When labels are noisy, repeated labeling + majority voting helps
      - Otherwise, acquiring additional training samples might be better
      [Figure 5: accuracy vs. number of acquired labels for SL and MV on the mushroom data set, for labeling qualities p = 0.6 and p = 0.8]
    • Aggregating and learning [Sheng et al., 2008]
      - How to select which sample to re-label?
      - Assume that:
        - The annotator quality is independent from the object: P(y_i^j = y_i) = p^j
        - All annotators have the same quality: p^j = p
        - The annotator quality is unknown, i.e., uniformly distributed in [0, 1]
      - Let L_0 and L_1 denote the number of labels equal to 0 or 1 assigned to an object:
        L_0 = |{y^j | y^j = 0}|,  L_1 = |{y^j | y^j = 1}|
    • Aggregating and learning [Sheng et al., 2008]
      - If y = 1 is the true label, the probability of observing L_0 and L_1 labels is given by the binomial distribution:
        P(L_0, L_1 | p) = C(L_0 + L_1, L_1) p^{L_1} (1 - p)^{L_0}
      - The posterior can be expressed as:
        P(p | L_1, L_0) = P(L_0, L_1 | p) P(p) / P(L_0, L_1)
                        = P(L_0, L_1 | p) P(p) / ∫_0^1 P(L_0, L_1 | s) ds
                        = p^{L_1} (1 - p)^{L_0} / B(L_0 + 1, L_1 + 1)
                        = β_p(L_0 + 1, L_1 + 1)
        (B: Beta function; β_p: Beta distribution)
    • Aggregating and learning [Sheng et al., 2008]
      - Let I_p(L_0, L_1) denote the CDF of the Beta distribution
      - The uncertainty of an object due to noisy labels is defined as
        SLU = min{ I_{0.5}(L_0 + 1, L_1 + 1), 1 - I_{0.5}(L_0 + 1, L_1 + 1) }
      [Figure: CDF of the Beta posterior for (L_0, L_1) = (0, 4), (1, 3), (2, 2)]
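A short Python sketch of the SLU score, assuming SciPy's Beta CDF; note that SciPy's (a, b) arguments are the exponents of p and (1 - p) plus one, so the call below corresponds to the I_{0.5} term above:

```python
from scipy.stats import beta

def label_uncertainty(L0: int, L1: int) -> float:
    """SLU score: tail mass of the Beta posterior over the (unknown,
    uniform-prior) labeler accuracy p on the 'wrong side' of 0.5.
    Larger values = more uncertain consensus label."""
    # CDF at 0.5 of the posterior with density proportional to p^L1 (1-p)^L0
    cdf_half = beta.cdf(0.5, L1 + 1, L0 + 1)
    return min(cdf_half, 1.0 - cdf_half)

# Example: the more balanced the votes, the higher the uncertainty
for L0, L1 in [(0, 4), (1, 3), (2, 2)]:
    print(L0, L1, round(label_uncertainty(L0, L1), 3))
```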
    • Aggregating and learning [Sheng et al., 2008]
      - Different strategies to select the next labeling action:
        - GRR: generalized round robin
        - Selective repeated labeling:
          - LU: label uncertainty
          - MU: model uncertainty (as in active learning)
          - LMU: label and model uncertainty
      [Figures: accuracy vs. number of labels for GRR, MU, LU, LMU on the waveform, expedia and mushroom data sets; Figure 9: what not to do: entropy-based selective repeated labeling vs. round-robin repeated labeling]
    • Aggregating and learning [Donmez et al., 2009]
      - Unlike [Sheng et al., 2008], annotators can have different (unknown) qualities
      - IEThresh (Interval Estimate Threshold): a strategy to select the annotators with the highest estimated labeling quality
      1. Fit logistic regression to the training data ⟨y_i, x_i⟩, i = 1, ..., I
      2. Pick the most uncertain unlabeled instance:
         x* = arg max_{x_i} (1 - max_{y ∈ {0,1}} P(y | x_i))
         (a posteriori probability computed by the classifier)
    • Aggregating and learning [Donmez et al., 2009]
      3. For each annotator:
         - Compute whether she/he agrees with the majority vote:
           r_i^j = 1 if y_i^j = y_i^{MV}, 0 otherwise
         - Compute the mean and the sample standard deviation of the agreement, averaged over multiple objects:
           μ^j = E[r_i^j],  σ^j = std[r_i^j]
         - Compute the upper confidence interval of the annotator:
           UI^j = μ^j + t_{α/2}^{(I^j - 1)} σ^j / √n
           (t_{α/2}^{(I^j - 1)}: critical value of the Student's t-distribution with I^j - 1 degrees of freedom)
    • Aggregating and learning [Donmez et al., 2009]
      4. Choose all annotators with the largest upper confidence intervals:
         {j | UI^j ≥ ε · max_j UI^j}
      5. Compute the majority vote of the selected annotators, y_i^{MV}
      6. Update the training data: T = T ∪ ⟨y_i^{MV}, x*⟩
      7. Repeat steps 2-6
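A minimal Python sketch of the annotator-selection part (steps 3-4 above); the function name, the optimistic handling of annotators with fewer than two observations, and the default alpha and epsilon values are choices of this sketch rather than prescribed by the method:

```python
import numpy as np
from scipy import stats

def ie_thresh_select(agreement, alpha=0.05, eps=0.8):
    """agreement[j] = list of 0/1 indicators of annotator j agreeing with the
    majority vote on the objects she has labeled so far (r_i^j in the slides).
    Returns the indices of the annotators to query next."""
    upper = []
    for r in agreement:
        r = np.asarray(r, dtype=float)
        n = len(r)
        if n < 2:                       # not enough data: be optimistic
            upper.append(np.inf)
            continue
        t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
        upper.append(r.mean() + t_crit * r.std(ddof=1) / np.sqrt(n))
    upper = np.asarray(upper)
    finite = upper[np.isfinite(upper)]
    thresh = eps * (finite.max() if finite.size else 0.0)
    return [j for j, u in enumerate(upper) if u >= thresh]

# Example: annotator 0 agrees often with the majority vote, annotator 2 rarely
print(ie_thresh_select([[1, 1, 1, 0, 1], [1, 0, 1, 1, 0], [0, 0, 1, 0, 0]]))
```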
    • Aggregating and learning [Donmez et al., 2009]
      - Achieves a trade-off between:
        - Exploration (at the beginning, to estimate annotator qualities)
        - Exploitation (once annotator qualities are estimated, ask the more reliable ones)
      [Figure 2: number of times each oracle is queried vs. the true oracle accuracy, broken down by iteration counts 1-10, 11-40, 41-150, on the image and phoneme data sets]
    • Aggregating and learning [Dekel and Shamir, 2009]
      - In some cases, the number of annotators is of the same order as the number of objects to annotate: I/J = Θ(1)
      - Majority voting cannot help
      - Estimating the annotator qualities might be problematic
      - Goal: prune low-quality annotators, when each one annotates at most one object
    • Aggregating and learning [Dekel and Shamir, 2009]
      - Consider a training set ⟨y_i, x_i⟩, i = 1, ..., I
      - Let f(w, x_i) denote a binary classifier that assigns a label in {0, 1} to x_i
      - Let h^j(x_i) denote a randomized classifier which represents the way annotator j labels data
      - Let S^j denote the set of objects annotated by annotator j
      - Prune away any annotator for which
        ε^j = ( Σ_{i ∈ S^j} 1_{h^j(x_i) ≠ f(w, x_i)} ) / |S^j| > T
      - In words, the method prunes those annotators that are in disagreement with a classifier trained on the labels of all annotators
    • Aggregating and learning [Dekel and Shamir, 2009]
      [Figure: example with three annotators and disagreement rates ε_1 = 1/9, ε_2 = 0, ε_3 = 1]
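A hypothetical Python sketch of the pruning rule, with scikit-learn's logistic regression standing in for the classifier f(w, x) trained on all annotators' labels; the threshold value and the single-label-per-example data layout are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def prune_annotators(X, y, annotator_ids, threshold=0.5):
    """X: (N, d) features; y: (N,) labels in {0, 1}, one (possibly noisy) label
    per example; annotator_ids: (N,) id of the annotator who provided each label.
    Prunes annotators whose disagreement rate with a classifier trained on all
    labels exceeds `threshold`, then returns the surviving example indices."""
    clf = LogisticRegression().fit(X, y)          # f(w, x): trained on everyone's labels
    pred = clf.predict(X)
    keep = np.ones(len(y), dtype=bool)
    for j in np.unique(annotator_ids):
        S_j = annotator_ids == j                  # objects annotated by j
        eps_j = np.mean(pred[S_j] != y[S_j])      # empirical disagreement with f
        if eps_j > threshold:                     # low-quality annotator: prune
            keep[S_j] = False
    return np.where(keep)[0]
```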
    • Aggregating and learning [Raykar et al., 2010]
      - Consider a training set ⟨y_i, x_i⟩, i = 1, ..., I
      - Let f(w, x_i) denote a binary classifier that assigns a label in {0, 1} to x_i
      - Consider the family of linear classifiers:
        ŷ_i = 1 if w^T x_i ≥ γ, ŷ_i = 0 otherwise
      - The probability of the positive class is modeled as a logistic sigmoid:
        P(y_i = 1 | x_i, w) = σ(w^T x_i),   σ(z) = 1 / (1 + e^{-z})
    • Aggregating and learning [Raykar et al., 2010]
      - Similarly to [Welinder and Perona, 2010], the annotator quality distinguishes between true positive and true negative rate:
        P(y_i^j = 1 | y_i = 1) = α_1^j
        P(y_i^j = 0 | y_i = 0) = α_0^j
      - Goal:
        - Given the observed labels and the feature vectors D = {x_i, y_i^1, ..., y_i^J}_{i=1}^I
        - Estimate the unknown parameters θ = {w, α_1, α_0}
    • Aggregating and learning [Raykar et al., 2010]
      - The likelihood function of the parameters θ = {w, α_1, α_0} given the observations D = {x_i, y_i^1, ..., y_i^J}_{i=1}^I is factored as
        P(D | θ) = Π_{i=1}^I P(y_i^1, ..., y_i^J | x_i, θ)
          = Π_{i=1}^I [ P(y_i^1, ..., y_i^J | y_i = 1, α_1) P(y_i = 1 | x_i, w)
                      + P(y_i^1, ..., y_i^J | y_i = 0, α_0) P(y_i = 0 | x_i, w) ]
    • Aggregating and learning [Raykar et al., 2010]
      - The factors are
        P(y_i^1, ..., y_i^J | y_i = 1, α_1) = Π_{j=1}^J [α_1^j]^{y_i^j} [1 - α_1^j]^{1 - y_i^j},   P(y_i = 1 | x_i, w) = σ(w^T x_i)
        P(y_i^1, ..., y_i^J | y_i = 0, α_0) = Π_{j=1}^J [α_0^j]^{1 - y_i^j} [1 - α_0^j]^{y_i^j},   P(y_i = 0 | x_i, w) = 1 - σ(w^T x_i)
• +   Aggregating and learning   [Raykar et al., 2010]
n  The parameters are found by maximizing the log-likelihood function:
      {α̂_1, α̂_0, ŵ} = arg max_θ log P(D | θ)
n  The solution is based on Expectation-Maximization
n  Expectation step:
      μ_i = P(y_i = 1 | y_i^1, ..., y_i^J, x_i, θ) ∝ P(y_i^1, ..., y_i^J | y_i = 1, θ) P(y_i = 1 | x_i, θ)
          = a_{1,i} p_i / (a_{1,i} p_i + a_{0,i} (1 − p_i))
      where p_i = σ(w^T x_i),  a_{1,i} = ∏_{j=1}^J [α_1^j]^{y_i^j} [1 − α_1^j]^{1 − y_i^j},  a_{0,i} = ∏_{j=1}^J [α_0^j]^{1 − y_i^j} [1 − α_0^j]^{y_i^j}
• +   Aggregating and learning   [Raykar et al., 2010]
n  Maximization step:
n  The true positive and true negative rates can be estimated in closed form:
      α_1^j = Σ_{i=1}^I μ_i y_i^j / Σ_{i=1}^I μ_i        α_0^j = Σ_{i=1}^I (1 − μ_i)(1 − y_i^j) / Σ_{i=1}^I (1 − μ_i)
n  The classifier w is updated with a Newton-Raphson step on the expected log-likelihood:
      w_{t+1} = w_t − η H^{−1} g
      where g is the gradient and H the Hessian
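Putting the two steps together, here is a compact sketch of the full EM loop, assuming a dense I×J label matrix with no missing annotations and a plain gradient update for w in place of the Newton-Raphson step; variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def em_learning_from_crowds(X, Y, n_iter=50, lr=0.1):
    """EM for the two-coin model with a logistic-regression classifier.

    X: features, shape (I, D); Y: crowd labels in {0, 1}, shape (I, J).
    Returns (w, alpha1, alpha0, mu), where mu[i] estimates P(y_i = 1 | data).
    """
    I, D = X.shape
    w = np.zeros(D)
    mu = Y.mean(axis=1)                                   # initialize with a soft majority vote
    for _ in range(n_iter):
        # M-step: closed-form annotator rates, gradient steps for the classifier
        alpha1 = (mu @ Y) / mu.sum()                      # per-annotator sensitivity
        alpha0 = ((1 - mu) @ (1 - Y)) / (1 - mu).sum()    # per-annotator specificity
        for _ in range(20):
            w += lr * X.T @ (mu - sigmoid(X @ w)) / I     # ascent on the expected log-likelihood
        # E-step: posterior probability that each object is positive
        p = sigmoid(X @ w)
        log_a1 = Y @ np.log(alpha1 + 1e-12) + (1 - Y) @ np.log(1 - alpha1 + 1e-12)
        log_a0 = (1 - Y) @ np.log(alpha0 + 1e-12) + Y @ np.log(1 - alpha0 + 1e-12)
        num = np.exp(log_a1) * p
        mu = num / (num + np.exp(log_a0) * (1 - p) + 1e-12)
    return w, alpha1, alpha0, mu
```

Run on the synthetic labels from the sketch above, the recovered alpha1 and alpha0 should approach the sensitivities and specificities used in the simulation, while mu provides soft estimates of the true labels.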
• +   Aggregating and learning   [Raykar et al., 2010]
n  Log-odds:
      logit(μ_i) = log( μ_i / (1 − μ_i) ) = log( P(y_i = 1 | y_i^1, ..., y_i^J, x_i, θ) / P(y_i = 0 | y_i^1, ..., y_i^J, x_i, θ) )
                 = c + w^T x_i + Σ_{j=1}^J y_i^j [logit(α_1^j) + logit(α_0^j)]
n  Contribution of the classifier: w^T x_i
n  Contribution of the annotators: a weighted linear combination of the labels from all annotators, with weights determined by the estimated annotator qualities
• +   Aggregating and learning   [Raykar et al., 2010]
(Figure: ROC curves for the classifier, true positive rate vs. false positive rate. Golden ground truth AUC = 0.915; proposed EM algorithm AUC = 0.913; majority-voting baseline AUC = 0.882. A second panel reports the ROC curve for the estimated true labels.)
• +   Aggregating and learning   [Raykar et al., 2010]
n  Extensions
n  Bayesian approach, with priors on the true positive and true negative rates
n  Adoption of different types of classifiers
n  Multi-class classification: y_i ∈ {l_1, ..., l_K}
n  Ordinal regression: y_i ∈ {l_1, ..., l_K}, l_1 < ... < l_K
n  Regression: y_i ∈ ℝ
• +   Aggregating and learning   [Yan et al., 2010b]
n  Setting similar to [Raykar et al., 2010], with two main differences
n  No distinction between the true positive and the true negative rate:
      α_1^j = α_0^j,  j = 1, ..., J
n  The quality of the annotator depends on the object:
      α^j(x) = 1 / (1 + e^{−(w_j)^T x})
• +   Aggregating and learning   [Yan et al., 2010b]
n  Log-odds:
      logit(μ_i) = log( μ_i / (1 − μ_i) ) = w^T x_i + Σ_{j=1}^J (−1)^{(1 − y_i^j)} (w_j)^T x_i
n  Contribution of the classifier: w^T x_i
n  Contribution of the annotators: a weighted linear combination of the labels from all annotators, where the weights (w_j)^T x_i depend on the object, i.e., on the object difficulty
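A small sketch of the object-dependent accuracy and of the resulting per-object combination of classifier score and crowd labels; the weight vectors are assumed to have been learned already.

```python
import numpy as np

def annotator_accuracy(x, w_j):
    """Symmetric, object-dependent accuracy alpha_j(x) = sigmoid(w_j^T x)."""
    return 1.0 / (1.0 + np.exp(-x @ w_j))

def log_odds(x, labels, w, W):
    """Combine the classifier score and the crowd labels for a single object.

    x: feature vector (D,);  labels: crowd labels (J,) in {0, 1}
    w: classifier weights (D,);  W: per-annotator weight vectors, shape (J, D)
    """
    signs = 2 * labels - 1      # (-1)^(1 - y^j): +1 for a "1" label, -1 for a "0" label
    return x @ w + np.sum(signs * (W @ x))
```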
• +   Aggregating and learning   [Yan et al., 2011]
n  Active learning from crowds
n  Which training point to pick?
n  Pick the example that is closest to the separating hyperplane of the classifier:
      i* = arg min_i |w^T x_i|
n  Which expert to pick?
      j* = arg min_j 1 / (1 + e^{−(w_j)^T x_{i*}})
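A minimal sketch of the two selection rules as written above; W collects the per-annotator weight vectors w_j of the object-dependent quality model, and the unlabeled pool, like everything else here, is assumed to be given.

```python
import numpy as np

def pick_instance_and_annotator(X_pool, w, W):
    """One active-learning step: most uncertain object, then the annotator to ask.

    X_pool: unlabeled pool, shape (N, D);  w: classifier weights (D,)
    W: per-annotator weight vectors, shape (J, D)
    """
    i_star = int(np.argmin(np.abs(X_pool @ w)))              # closest to the separating hyperplane
    scores = 1.0 / (1.0 + np.exp(-(W @ X_pool[i_star])))     # sigma((w_j)^T x_i*) per annotator
    j_star = int(np.argmin(scores))                          # arg min, as on the slide
    return i_star, j_star
```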
• +   Crowdsourcing at work
• +   CrowdSearch   [Yan et al., 2010]
n  CrowdSearch combines
n  Automated image search
n  Local processing on mobile phones + backend processing
n  Real-time human validation of search results
n  Amazon Mechanical Turk
n  Studies the trade-off in terms of
n  Delay
n  Accuracy
n  Cost
(Figure 2: an image search query, its candidate images, and the corresponding human validation tasks. CrowdSearch uses an adaptive algorithm, based on delay and result prediction models of human responses, to decide when to use human validation; once a candidate image is validated, it is returned to the user as a valid search result.)
• +   CrowdSearch   [Yan et al., 2010]
n  Delay-cost trade-offs
n  Parallel posting
n  Minimizes delay
n  Expensive in terms of monetary cost
n  Serial posting
n  Posts top-ranked candidates for validation
n  Cheap in terms of monetary cost
n  Much higher delay
n  Adaptive strategy → CrowdSearch
• +   CrowdSearch   [Yan et al., 2010]
n  Example: a candidate image has received the sequence of responses S_i = {'Y', 'N'}
n  Enumerate all possible continuations of the sequence, i.e.,
      S_i^(1) = {'Y', 'N', 'Y'}
      S_i^(2) = {'Y', 'N', 'N'}
      S_i^(3) = {'Y', 'N', 'Y', 'Y'}
      ...
n  For each sequence, estimate
n  The probability of observing S_i^(j) given S_i
n  Whether it would lead to success under majority voting
n  The probability of obtaining the responses before the deadline
n  Estimate the probability of success. If P_succ < τ, post a new candidate
• +   CrowdSearch   [Yan et al., 2010]
n  Predicting validation results
n  Training:
n  Enumerate all sequences of fixed length (e.g., five)
n  Compute their empirical probabilities
n  Example:
n  Observed sequence: S_i = {'Y', 'N', 'Y'}
n  Sequences that lead to a positive result:
      P({'Y', 'N', 'Y', 'Y'}) = 0.16 / 0.25
      P({'Y', 'N', 'Y', 'N', 'Y'}) = 0.03 / 0.25
(Figure 5: a SeqTree to predict validation results; the received sequence is 'YNY'.)
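A minimal sketch of this prediction step, assuming a training log of complete five-response sequences; it conditions on the observed prefix and checks majority voting, but leaves out the deadline term described on the previous slide.

```python
from collections import Counter

def success_probability(prefix, training_sequences):
    """Empirical P(candidate validated | observed response prefix).

    prefix:             responses observed so far, e.g. ('Y', 'N', 'Y')
    training_sequences: complete response tuples of length 5 from past tasks
    """
    prefix = tuple(prefix)
    total, success = Counter(), Counter()
    for seq in training_sequences:
        key = seq[:len(prefix)]
        total[key] += 1
        if seq.count('Y') >= 3:          # majority of the five duplicates says "yes"
            success[key] += 1
    return success[prefix] / total[prefix] if total[prefix] else 0.0

# post one more duplicate task whenever the estimate drops below the threshold tau:
# if success_probability(('Y', 'N'), past_sequences) < tau: post_new_candidate()
```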
• +   CrowdSearch   [Yan et al., 2010]
n  Delay prediction: each component of the delay is modeled with a (shifted) exponential distribution
      Acceptance delay:     f_a(t) = λ_a e^{−λ_a (t − c_a)}
      Inter-arrival delay:  f_i(t) = λ_i e^{−λ_i t}
      Submission delay:     f_s(t) = λ_s e^{−λ_s (t − c_s)}
      Overall delay:        f_o(t) = (f_a ∗ f_s)(t)
n  Inter-arrival times are independent, so the density of the delay between two responses is the convolution of the inter-arrival densities of the intermediate response pairs
(Figure 3: delay models for the overall delay and the inter-arrival delay; the overall delay is decomposed into acceptance and submission delay.)
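A small sketch of how such a model could be used to estimate the probability of meeting a deadline, by sampling acceptance and submission delays from the shifted exponentials; all parameter values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

def prob_before_deadline(deadline, lam_a, c_a, lam_s, c_s, n_samples=100_000):
    """Monte-Carlo estimate of P(overall delay <= deadline).

    The overall delay is acceptance delay + submission delay, each drawn from a
    shifted exponential f(t) = lam * exp(-lam * (t - c)) for t >= c.
    """
    accept = c_a + rng.exponential(1.0 / lam_a, n_samples)
    submit = c_s + rng.exponential(1.0 / lam_s, n_samples)
    return float(np.mean(accept + submit <= deadline))

# e.g. probability that a single validation response arrives within 60 seconds
p = prob_before_deadline(60.0, lam_a=0.05, c_a=5.0, lam_s=0.02, c_s=10.0)
```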
• +   CrowdSearch   [Yan et al., 2010]
(Figure 8: precision of automated image search and of human validation with four different validation criteria.)
• +   The CUBRIK project
n  36-month large-scale integrating project
n  Partially funded by the European Commission's 7th Framework ICT Programme for Research and Technological Development
n  www.cubrikproject.eu
• +   Objectives   [Fraternali et al., 2012]
n  The technical goal of CUbRIK is to build an open search platform grounded on four objectives:
n  Advance the architecture of multimedia search
n  Place humans in the loop
n  Open the search box
n  Start up a search business ecosystem
• +   Objective: Advance the architecture of multimedia search
n  Multimedia search is the coordinated result of three main processes:
n  Content processing: acquisition, analysis, indexing and knowledge extraction from multimedia content
n  Query processing: derivation of an information need from a user and production of a sensible response
n  Feedback processing: quality feedback on the appropriateness of search results
• +   Objective: Advance the architecture of multimedia search
n  Objective:
n  The content processing, query processing and feedback processing phases will be implemented by means of independent components
n  Components are organized in pipelines
n  Each application defines ad-hoc pipelines that provide unique multimedia search capabilities in that scenario
• +   Objective: Humans in the loop
n  Problem: the uncertainty of analysis algorithms leads to low-confidence results and conflicting opinions on automatically extracted features
n  Solution: humans have a superior capacity for understanding the content of audiovisual material
n  State of the art: humans replace automatic feature extraction processes (human annotations)
n  Our contribution: integration of human judgment and algorithms
n  Goal: improve the performance of multimedia content processing
• +   CUbRIK architecture
CUbRIK relies on a framework for executing processes (a.k.a. pipelines), consisting of collections of tasks to be executed in a distributed fashion. Each pipeline is described by a workflow of tasks, allocated to executors.
(Figure 1.2: CUBRIK Architecture, a high-level overview of the system.)
• Trademark Logo Detection: problems in automatic logo detection
n  Problems in automatic logo detection:
n  Object recognition is affected by the quality of the input set of images
n  Uncertain matches, i.e., the ones with a low matching score, might not contain the searched logo
• Trademark Logo Detection: contribution of human computation
n  Contribution of human computation:
n  Filter the input logos, eliminating the irrelevant ones
n  Segment the input logos
n  Validate the matching results
• +   Human-powered logo detection   [Bozzon et al., 2012]
n  Goal: integrate human and automatic computation to increase precision and recall w.r.t. fully automatic solutions
(Pipeline figure, see the sketch below: given a logo name, the system retrieves logo images, has them validated, matches the validated logo images against the video collection for logo detection in videos, sends low-confidence results to human validation while keeping high-confidence results, and finally joins the results and emits a report.)
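A schematic sketch of such a pipeline; the crowd, matcher and retrieval functions are hypothetical placeholders injected as parameters, and the confidence threshold is an assumption.

```python
def logo_detection_pipeline(logo_name, video_collection, crowd, matcher,
                            retrieve_logo_images, confidence_threshold=0.7):
    """Route low-confidence automatic matches to crowd validation."""
    # 1. Gather example images of the logo and let the crowd filter out irrelevant ones
    candidates = retrieve_logo_images(logo_name)
    logo_images = [img for img in candidates if crowd.validate_logo(img, logo_name)]

    # 2. Automatic logo detection over the video collection, with a confidence score per match
    report = []
    for frame, score in matcher(logo_images, video_collection):
        if score >= confidence_threshold:
            report.append(frame)                       # high confidence: trust the algorithm
        elif crowd.validate_match(frame, logo_name):   # low confidence: ask the crowd
            report.append(frame)

    # 3. Join the validated and high-confidence results and emit the report
    return report
```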
• Experimental evaluation
n  Three experimental settings:
n  No human intervention
n  Logo validation performed by two domain experts
n  Inclusion of the actual crowd knowledge
n  Crowd involvement
n  40 people involved
n  50 task instances generated
n  70 collected answers
• Experimental evaluation
(Figure: recall vs. precision for the three logos Aleve, Chunky and Shout under the No Crowd, Experts and Crowd settings.)
• Experimental evaluation
n  Precision decreases
n  Reasons for the wrong inclusions:
n  Geographical location of the users
n  Expertise of the involved users
(Same recall vs. precision plot as above.)
• Experimental evaluation
n  Precision decreases
n  Similarity between two logos in the data set
(Same recall vs. precision plot as above.)
• Open issues and future directions
n  Reproducibility and experiment design [Paritosh, 2012]
n  Expert finding / task allocation
n  Beyond textual labels
• +   Thanks for your attention   www.cubrikproject.eu
• +   References 1/3
n  [Bozzon et al., 2012] Alessandro Bozzon, Ilio Catallo, Eleonora Ciceri, Piero Fraternali, Davide Martinenghi, Marco Tagliasacchi: A Framework for Crowdsourced Multimedia Processing and Querying. CrowdSearch 2012: 42-47
n  [Dawid and Skene, 1979] A. P. Dawid and A. M. Skene: Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 28, No. 1 (1979), pp. 20-28
n  [Dekel and Shamir, 2009] O. Dekel and O. Shamir: Vox Populi: Collecting High-Quality Labels from a Crowd. In Proceedings of COLT 2009
n  [Donmez et al., 2009] Pinar Donmez, Jaime G. Carbonell, and Jeff Schneider: Efficiently learning the accuracy of labeling sources for selective sampling. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09)
n  [Fraternali et al., 2012] Piero Fraternali, Marco Tagliasacchi, Davide Martinenghi, Alessandro Bozzon, Ilio Catallo, Eleonora Ciceri, Francesco Saverio Nucci, Vincenzo Croce, Ismail Sengör Altingövde, Wolf Siberski, Fausto Giunchiglia, Wolfgang Nejdl, Martha Larson, Ebroul Izquierdo, Petros Daras, Otto Chrons, Ralph Traphöner, Björn Decker, John Lomas, Patrick Aichroth, Jasminko Novak, Ghislain Sillaume, Fernando Sánchez-Figueroa, Carolina Salas-Parra: The CUBRIK project: human-enhanced time-aware multimedia search. WWW (Companion Volume) 2012: 259-262
n  [Freiburg et al., 2011] Bauke Freiburg, Jaap Kamps, and Cees G.M. Snoek: Crowdsourcing visual detectors for video search. In Proceedings of the 19th ACM International Conference on Multimedia (MM '11), ACM, New York, NY, USA, 913-916
n  [Goëau et al., 2011] H. Goëau, A. Joly, S. Selmi, P. Bonnet, E. Mouysset, L. Joyeux, J. Molino, P. Birnbaum, D. Barthelemy, and N. Boujemaa: Visual-based plant species identification from crowdsourced data. In Proceedings of ACM Multimedia 2011, 813-814
n  [Harris, 2012] Christopher G. Harris: An Evaluation of Search Strategies for User-Generated Video Content. In Proceedings of the First International Workshop on Crowdsourcing Web Search, Lyon, France, April 17, 2012
n  [Karger et al., 2011] D. R. Karger, S. Oh, and D. Shah: Budget-Optimal Task Allocation for Reliable Crowdsourcing Systems. CoRR, 2011
n  [Kumar and Lease, 2011] A. Kumar and M. Lease: Modeling annotator accuracies for supervised learning. In WSDM Workshop on Crowdsourcing for Search and Data Mining, 2011
• +   References 2/3
n  [Nowak and Rüger, 2010] Stefanie Nowak and Stefan Rüger: How reliable are annotations via crowdsourcing: a study about inter-annotator agreement for multi-label image annotation. In Proceedings of the International Conference on Multimedia Information Retrieval (MIR '10), ACM, New York, NY, USA, 557-566
n  [Paritosh, 2012] Praveen Paritosh: Human Computation Must Be Reproducible. In Proceedings of the First International Workshop on Crowdsourcing Web Search, Lyon, France, April 17, 2012
n  [Raykar et al., 2010] Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy: Learning From Crowds. Journal of Machine Learning Research 11 (August 2010), 1297-1322
n  [Sheng et al., 2008] Victor S. Sheng, Foster Provost, and Panagiotis G. Ipeirotis: Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), ACM, New York, NY, USA, 614-622
n  [Snoek et al., 2010] Cees G.M. Snoek, Bauke Freiburg, Johan Oomen, and Roeland Ordelman: Crowdsourcing rock n' roll multimedia retrieval. In Proceedings of the International Conference on Multimedia (MM '10), ACM, New York, NY, USA, 1535-1538
n  [Snow et al., 2008] Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng: Cheap and fast -- but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08), Association for Computational Linguistics, Stroudsburg, PA, USA, 254-263
n  [Soleymani and Larson, 2010] M. Soleymani and M. Larson: Crowdsourcing for Affective Annotation of Video: Development of a Viewer-reported Boredom Corpus. In Proceedings of the SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation (CSE 2010)
n  [Sorokin and Forsyth, 2008] A. Sorokin and D. Forsyth: Utility data annotation with Amazon Mechanical Turk. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW '08), pp. 1-8, 23-28 June 2008
n  [Steiner et al., 2011] Thomas Steiner, Ruben Verborgh, Rik Van de Walle, Michael Hausenblas, and Joaquim Gabarró Vallés: Crowdsourcing Event Detection in YouTube Videos. In Proceedings of the 1st Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web, 2011
n  [Tang and Lease, 2011] Wei Tang and Matthew Lease: Semi-Supervised Consensus Labeling for Crowdsourcing. In ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR), 2011
• +   References 3/3
n  [Urbano et al., 2010] J. Urbano, J. Morato, M. Marrero, and D. Martin: Crowdsourcing preference judgments for evaluation of music similarity tasks. In Proceedings of the ACM SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation (CSE 2010), pages 9-16, Geneva, Switzerland, July 2010
n  [Vondrick et al., 2010] Carl Vondrick, Deva Ramanan, and Donald Patterson: Efficiently scaling up video annotation with crowdsourced marketplaces. In Proceedings of the 11th European Conference on Computer Vision: Part IV (ECCV '10)
n  [Welinder and Perona, 2010] P. Welinder and P. Perona: Online crowdsourcing: rating annotators and obtaining cost-effective labels. Workshop on Advancing Computer Vision with Humans in the Loop at CVPR, 2010
n  [Whitehill et al., 2009] Jacob Whitehill, Paul Ruvolo, Jacob Bergsma, Tingfan Wu, and Javier Movellan: Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise. Advances in Neural Information Processing Systems, 2009
n  [Yan et al., 2010] Tingxin Yan, Vikas Kumar, and Deepak Ganesan: CrowdSearch: exploiting crowds for accurate real-time image search on mobile phones. In Proceedings of the 8th International Conference on Mobile Systems, Applications, and Services (MobiSys '10), ACM, New York, NY, USA, 77-90
n  [Yan et al., 2010b] Y. Yan, R. Rosales, G. Fung, M.W. Schmidt, G.H. Valadez, L. Bogoni, L. Moy, and J.G. Dy: Modeling annotator expertise: Learning when everybody knows a bit of something. Journal of Machine Learning Research - Proceedings Track, 2010, 932-939
n  [Yan et al., 2011] Y. Yan, R. Rosales, G. Fung, and J.G. Dy: Active Learning from Crowds. In Proceedings of ICML 2011, 1161-1168