2011 Crowdsourcing Search Evaluation

1. Crowdsourcing search relevance evaluation at eBay
    Brian Johnson
    September 28, 2011

2. Agenda
    • Why
    • What
    • How
    • Cost
    • Quality
    • Measurement

3. Why Ask Real Humans
    • They're our customers
      – Sometimes asking is the best way to find out what you want to know
      – Provide ground truth for automated metrics
    • Provide data for
      – Experimental Evaluation
        • complements A/B testing, surveys
      – Query Diagnosis
      – Judged Test Corpus
        • Machine Learning
        • Offline evaluation
      – Production Quality Control

4. Why Crowdsourcing
    • Fast
      – 1-3 days
    • Low Cost
      – pennies per judgment
    • High Quality
      – Multiple workers
      – Worker evaluation (test questions & inter-worker agreement)
    • Flexible
      – Ask anything

5. Judgment Volume by Day

6. Cost

    Judgments    Cost
    1            $0.01
    10           $0.10
    100          $1.00
    1,000        $10.00
    10,000       $100.00
    100,000      $1,000.00
    1,000,000    $10,000.00

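At a penny per judgment the cost scales linearly with volume, but a budget also depends on how many redundant workers see each item. A minimal sketch, assuming the table's $0.01 unit price and an illustrative five-judgment redundancy (not eBay's actual rates or settings):

```python
# Rough crowdsourcing budget: items x workers-per-item x unit price.
# The unit price and redundancy below are illustrative assumptions only.
PRICE_PER_JUDGMENT = 0.01  # dollars, as in the table above
WORKERS_PER_ITEM = 5       # redundant judgments per query/item pair (assumed)

def estimate_cost(num_items, workers_per_item=WORKERS_PER_ITEM,
                  price=PRICE_PER_JUDGMENT):
    judgments = num_items * workers_per_item
    return judgments, judgments * price

print(estimate_cost(100000))  # (500000, 5000.0): 500k judgments, $5,000
```
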
7. Who are these workers
    • Crowdflower
      – Mechanical Turk
      – Gambit/Facebook
      – TrialPay
      – SamaSource
    • LiveOps
    • CloudCrowd
      – Facebook

8. What Can We Evaluate
    • Search Ranking
      – Query > Item
    • Item/Image Similarity
      – Item > Item
    • Merchandising
      – Query > Item
      – Category > Item
      – Item > Item
    • Product Tagging
      – Item > Product
    • Category Recommendations
      – Item (Title) > Category

9. Crowdsourced Search Relevance Evaluation
    • What are we measuring
      – Relevance
    • What are we not measuring
      – Value
      – Purchase metrics
      – Revenue

10. Industry Standard Sample
    • As in the original DCG formulation, we'll be using a four-point scale for relevance assessment:
      • Irrelevant document (0)
      • Marginally relevant document (1)
      • Fairly relevant document (2)
      • Highly relevant document (3)
    http://www.sigir.org/forum/2008D/papers/2008d_sigirforum_alonso.pdf

11. eBay Search Relevance Crowdsourcing

12. Great Match

13. Good Match

14. Not Matching

15. Quality
    • Testing
      – Train/test workers before they start
      – Mix test questions into the work mix
      – Discard data from unreliable workers
    • Redundancy
      – Cost is low > Ask multiple workers
      – Monitor inter-worker agreement
      – Have trusted workers monitor new workers
      – Track worker "feedback" over time

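A minimal sketch of the two levers above: score each worker on the embedded test (gold) questions, keep only workers above an accuracy threshold, then aggregate the remaining redundant judgments by majority vote. The 70% threshold and the data layout are assumptions for illustration, not the production rules:

```python
from collections import Counter, defaultdict

# judgments: list of (worker_id, item_id, label); gold: item_id -> correct label.
# Threshold and record shapes are illustrative assumptions.
MIN_GOLD_ACCURACY = 0.7

def aggregate(judgments, gold, min_accuracy=MIN_GOLD_ACCURACY):
    # 1) Score each worker on the embedded test questions.
    hits, seen = Counter(), Counter()
    for worker, item, label in judgments:
        if item in gold:
            seen[worker] += 1
            hits[worker] += (label == gold[item])
    trusted = {w for w in seen if hits[w] / seen[w] >= min_accuracy}

    # 2) Majority vote over judgments from trusted workers only.
    votes = defaultdict(Counter)
    for worker, item, label in judgments:
        if worker in trusted and item not in gold:
            votes[item][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}

example = [("w1", "q1-itemA", "Good Match"), ("w1", "gold1", "Great Match"),
           ("w2", "gold1", "Not Matching"), ("w2", "q1-itemA", "Not Matching")]
print(aggregate(example, gold={"gold1": "Great Match"}))  # {'q1-itemA': 'Good Match'}
```
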
16. eBay @ SIGIR '10
    Ensuring quality in crowdsourced search relevance evaluation:
    The effects of training question distribution
    John Le, Andy Edmonds, Vaughn Hester, Lukas Biewald

    The use of crowdsourcing platforms like Amazon Mechanical Turk for evaluating the relevance of search results has become an effective strategy that yields results quickly and inexpensively. One approach to ensure quality of worker judgments is to include an initial training period and subsequent sporadic insertion of predefined gold standard data (training data). Workers are notified or rejected when they err on the training data, and trust and quality ratings are adjusted accordingly. In this paper, we assess how this type of dynamic learning environment can affect the workers' results in a search relevance evaluation task completed on Amazon Mechanical Turk. Specifically, we show how the distribution of training set answers impacts training of workers and aggregate quality of worker results. We conclude that in a relevance categorization task, a uniform distribution of labels across training data labels produces optimal peaks in 1) individual worker precision and 2) majority voting aggregate result accuracy.

    SIGIR '10, July 19-23, 2010, Geneva, Switzerland

17. Metrics
    • There are standard industry metrics
    • Designed to measure value to the end user
    • Older metrics
      – Precision & recall (binary relevance, no notion of position)
    • Current metrics
      – Cumulative Gain (overall value of results on a non-binary relevance scale)
      – Discounted (adjusted for position value)
      – Normalized (common 0-1 scale)

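Written out as formulas (a sketch using the deck's own power-law discount from the next two slides rather than the log2 discount used in many other NDCG write-ups; j_r is the judgment at rank r, c the discount constant, and IDCG_k the DCG of an ideally ordered result list):

```latex
\mathrm{CG}_k   = \sum_{r=1}^{k} j_r, \qquad
\mathrm{DCG}_k  = \sum_{r=1}^{k} \frac{j_r}{r^{c}}, \qquad
\mathrm{NDCG}_k = \frac{\mathrm{DCG}_k}{\mathrm{IDCG}_k}
```
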
18. Judgment Scale Granularity

    Binary       | Web Search | SIGIR               | 3 Point        | 4 Point
                 | Offensive  |                     | -1 Spam        | -2 Spam
                 |            |                     | -1 Off Topic   | -2 Off Topic
    0 Irrelevant | Off Topic  | Irrelevant          | 0 Not Matching | -1 Not Matching
                 | Relevant   | Marginally Relevant |                |
    1 Relevant   | Useful     | Fairly Relevant     | 1 Matching     | 1 Good Match
                 | Vital      | Highly Relevant     |                | 2 Great Match

19. Rank Discount
    [chart: rank discount d = 1/r^constant plotted for ranks 1-10]

20. Cumulative Gain Metrics

    Column definitions:
      r      rank
      j      human judgment (0-1)
      cg     cumulative gain:                cg(n) = cg(n-1) + j
      d      rank discount:                  d = 1/r^c
      dcg    discounted cumulative gain:     dcg(n) = dcg(n-1) + j*d
      io     ideal rank order (observed):    sort(j) descending
      idcgo  ideal DCG (observed):           idcgo(n) = idcgo(n-1) + io*d
      ndcgo  normalized DCG (observed):      ndcgo(n) = dcg(n)/idcgo(n)
      it     ideal rank order (theoretical): 1
      idcgt  ideal DCG (theoretical):        idcgt(n) = idcgt(n-1) + it*d
      ndcgt  normalized DCG (theoretical):   ndcgt(n) = dcg(n)/idcgt(n)

    r    j     cg     d     dcg    io     idcgo   ndcgo   it     idcgt   ndcgt
    1    1.0   1.00   1.00  1.00   1.00   1.00    1.00    1.00   1.00    1.00
    2    1.0   2.00   0.53  1.53   1.00   1.53    1.00    1.00   1.53    1.00
    3    0.8   2.80   0.37  1.83   1.00   1.90    0.96    1.00   1.90    0.96
    4    0.0   2.80   0.28  1.83   1.00   2.18    0.84    1.00   2.18    0.84
    5    1.0   3.80   0.23  2.06   0.80   2.37    0.87    1.00   2.41    0.85
    6    0.2   4.00   0.20  2.10   0.50   2.47    0.85    1.00   2.61    0.80
    7    0.2   4.20   0.17  2.13   0.20   2.50    0.85    1.00   2.78    0.77
    8    0.5   4.70   0.15  2.21   0.20   2.53    0.87    1.00   2.93    0.75
    9    1.0   5.70   0.14  2.34   0.00   2.53    0.93    1.00   3.07    0.76
    10   0.0   5.70   0.12  2.34   0.00   2.53    0.93    1.00   3.19    0.73

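A minimal Python sketch of the same bookkeeping, assuming a discount exponent c of about 0.9 (the deck does not state the constant it used) and a maximum judgment of 1.0 for the theoretical ideal list; it comes close to the last row of the table:

```python
# Sketch of the DCG/NDCG computation behind the table above.
# Assumptions not stated in the deck: the discount exponent C and the
# maximum judgment used for the "theoretical" ideal ranking.
C = 0.9          # discount exponent in d = 1/r^C
MAX_GRADE = 1.0  # best possible judgment on the 0-1 scale

def ndcg(judgments, c=C, max_grade=MAX_GRADE):
    """Return (dcg, ndcg_observed, ndcg_theoretical) at the last rank."""
    discounts = [1.0 / (r ** c) for r in range(1, len(judgments) + 1)]
    ideal_observed = sorted(judgments, reverse=True)   # io: same list, perfectly re-ranked
    ideal_theoretical = [max_grade] * len(judgments)   # it: every result a perfect match

    def dcg(gains):
        return sum(g * d for g, d in zip(gains, discounts))

    actual = dcg(judgments)
    return actual, actual / dcg(ideal_observed), actual / dcg(ideal_theoretical)

# Judgment column j from the table above
js = [1.0, 1.0, 0.8, 0.0, 1.0, 0.2, 0.2, 0.5, 1.0, 0.0]
print(ndcg(js))  # ~(2.36, 0.93, 0.73); the table's last row reads 2.34, 0.93, 0.73
```
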
21. Continuous Production Evaluation
    • Daily query sampling/scraping to facilitate ongoing monitoring, QA, triage, and post-hoc business analysis
    [chart: NDCG over time, by Site, Category, Query ...]

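A sketch of how per-query NDCG scores from such a daily sample could be rolled up into the by-site/by-category trend lines pictured above; the record layout, segment names, and values are assumptions for illustration:

```python
from collections import defaultdict
from statistics import mean

# One record per judged query from today's sample: (site, category, query, ndcg).
# All values below are made up for illustration.
daily_scores = [
    ("ebay.com",   "Electronics", "ipod nano",  0.93),
    ("ebay.com",   "Electronics", "hdmi cable", 0.81),
    ("ebay.co.uk", "Fashion",     "red dress",  0.77),
]

def rollup(scores):
    """Average per-query NDCG by (site, category) for trend monitoring."""
    by_segment = defaultdict(list)
    for site, category, _query, ndcg in scores:
        by_segment[(site, category)].append(ndcg)
    return {segment: mean(values) for segment, values in by_segment.items()}

for segment, avg_ndcg in sorted(rollup(daily_scores).items()):
    print(segment, round(avg_ndcg, 3))
# ('ebay.co.uk', 'Fashion') 0.77
# ('ebay.com', 'Electronics') 0.87
```
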
22. Human Judgment > Query List

23. Best Match Variant Comparison

24. Best Match Variant Comparison

25. Measuring a Ranked List
    Huan Liu, Lei Tang and Nitin Agarwal. Tutorial on Community Detection and Behavior Study for Social Computing. Presented at the 1st IEEE International Conference on Social Computing (SocialCom '09), 2009.
    http://www.iisocialcom.org/conference/socialcom2009/download/SocialCom09-tutorial.pdf

26. Ranking Evaluation
    http://research.microsoft.com/en-us/um/people/kevynct/files/ECIR-2010-ML-Tutorial-FinalToPrint.pdf

27. NDCG - Example
    Huan Liu, Lei Tang and Nitin Agarwal. Tutorial on Community Detection and Behavior Study for Social Computing. Presented at the 1st IEEE International Conference on Social Computing (SocialCom '09), 2009.
    http://www.iisocialcom.org/conference/socialcom2009/download/SocialCom09-tutorial.pdf

28. Open Questions
    • Discrete vs. Continuous relevance scale
    • # of workers
    • Distribution of test questions
    • Generation of test questions
    • Qualification (demographics, interests, region)
    • Dynamic worker assignment based on qualification
    • Mobile workers (untapped pool)

29. References
    • Discounted Cumulative Gain
      – http://en.wikipedia.org/wiki/Discounted_cumulative_gain
    • http://crowdflower.com/
    • http://www.cloudcrowd.com/
    • http://www.trialpay.com
