Managing Crowdsourced Human Computation

        Panos Ipeirotis, New York University



 Slides from the WWW2011 tutorial, 29 March 2011
Outline
•   Introduction: Human computation and crowdsourcing
•   Managing quality for simple tasks
•   Complex tasks using workflows
•   Task optimization
•   Incentivizing the crowd
•   Market design
•   Behavioral aspects and cognitive biases
•   Game design
•   Case studies
Human Computation, Round 1
            • Humans were the first 
              “computers,” used for 
              math computations




                     Grier, When computers were human, 2005
                                      Grier, IEEE Annals 1998
Human Computation, Round 1
            • Humans were the first 
              “computers,” used for 
              math computations

            • Organized computation:
              – Clairaut, astronomy, 1758: 
                Computed the orbit of Halley’s 
                comet (three‐body 
                problem) by dividing the 
                labor of numeric 
                computations across 3 
                astronomers
                     Grier, When computers were human, 2005
                                      Grier, IEEE Annals 1998
Human Computation, Round 1
            • Organized computation:
               – Maskelyne, astronomical almanac 
                 with moon positions, used for 
                 navigation, 1760. Quality 
                 assurance by doing calculations 
                 twice, with a third verifier 
                 comparing the results.

               – De Prony, 1794, hires hairdressers 
                 (unemployed after the French 
                 revolution; knew only addition 
                 and subtraction) to create 
                 logarithmic and trigonometric 
                 tables. He managed the process 
                 by splitting the work into very 
                 detailed workflows. (Hairdressers better 
                 than mathematicians in arithmetic!)


                           Grier, When computers were human, 2005
                                            Grier, IEEE Annals 1998
Human Computation, Round 1
            • Organized computation:
              –   Clairaut, astronomy, 1758
              –   Maskelyne, 1760
              –   De Prony, log/trig tables, 1794
              –   Galton, biology, 1893
              –   Pearson, biology, 1899
              –   …
              –   Cowles, stock market, 1929
              –   Math Tables Project, unskilled 
                  labor, 1938



                         Grier, When computers were human, 2005
                                          Grier, IEEE Annals 1998
Human Computation, Round 1

            • Patterns emerging
              – Division of labor
              – Mass production
              – Professional managers


            • Then we got the 
              “automatic computers” 
Human Computation, Round 2
            • Now we need humans 
              again for the “AI‐complete” 
              tasks

                 – Tag images [ESP Game: von Ahn and 
                    Dabbish 2004, ImageNet]
                 – Determine if a page is relevant 
                    [Alonso et al., 2011]
                 – Determine song genre
                 – Check a page for offensive 
                   content
                 – …


            ImageNet: http://www.image‐net.org/about‐publication
Focus of the tutorial

 Examine cases where humans interact with 
computers in order to solve a computational 
  problem (usually too hard to be solved by 
             computers alone)
Crowdsourcing and human computation

• Crowdsourcing: From macro to micro
  –   Netflix, Innocentive
  –   Quirky, Threadless
  –   oDesk, Guru, eLance, vWorker
  –   Wikipedia et al.
  –   ESP Game, FoldIt, Phylo, …
  –   Mechanical Turk, CloudCrowd, …

• Crowdsourcing greatly facilitates human 
  computation (but they are not equivalent)
Micro‐Crowdsourcing Example:
       Labeling Images
    using the ESP Game                         Luis von Ahn
                                       MacArthur Fellowship
                                             "genius grant"




 • Two player online game
 • Partners don’t know each other and can’t 
   communicate
 • Object of the game: type the same word
 • The only thing in common is an image
PLAYER 1            PLAYER 2

    GUESSING: CAR       GUESSING: BOY
    GUESSING: HAT       GUESSING: CAR
    GUESSING: KID

    SUCCESS! YOU AGREE ON CAR
Paid Crowdsourcing: Amazon Mechanical Turk
Demographics of MTurk workers
                http://bit.ly/mturk‐demographics


Country of residence
• United States: 46.80%
• India: 34.00%
• Miscellaneous: 19.20%
Demographics of MTurk workers
     http://bit.ly/mturk‐demographics
Outline
•   Introduction: Human computation and crowdsourcing
•   Managing quality for simple tasks
•   Complex tasks using workflows
•   Task optimization
•   Incentivizing the crowd
•   Market design
•   Behavioral aspects and cognitive biases
•   Game design
•   Case studies
Managing quality for simple tasks
• Quality through redundancy: Combining votes
    – Majority vote
    – Quality‐adjusted vote
    – Managing dependencies
•   Quality through gold data
•   Estimating worker quality (Redundancy + Gold)
•   Joint estimation of worker quality and difficulty
•   Active data collection
Example: Build an “Adult Web Site” Classifier

• Need a large number of hand‐labeled sites
• Get people to look at sites and classify them as:
G (general audience) PG (parental guidance)  R (restricted) X (porn)
Example: Build an “Adult Web Site” Classifier

 • Need a large number of hand‐labeled sites
 • Get people to look at sites and classify them as:
 G (general audience) PG (parental guidance)  R (restricted) X (porn)

Cost/Speed Statistics
 Undergrad intern: 200 websites/hr, cost: $15/hr
Example: Build an “Adult Web Site” Classifier

 • Need a large number of hand‐labeled sites
 • Get people to look at sites and classify them as:
 G (general audience) PG (parental guidance)  R (restricted) X (porn)

Cost/Speed Statistics
 Undergrad intern: 200 websites/hr, cost: $15/hr
 MTurk: 2500 websites/hr, cost: $12/hr
Bad news: Spammers! 




            Worker ATAMRO447HWJQ
labeled X (porn) sites as G (general audience)
Majority Voting and Label Quality
     Ask multiple labelers, keep majority label as “true” label
     Quality is the probability of the majority label being correct
     (p is the probability of an individual labeler being correct)

     [Figure: quality of the majority vote (0.2–1.0) vs. number of 
      labelers (1–13), for individual labeler quality p = 0.4, 0.5, 
      0.6, 0.7, 0.8, 0.9, 1.0; binary classification]
Kuncheva et al., PA&A, 2003
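
For a binary task with n independent labelers, each correct with probability p, the curve above is the probability that more than half the labelers are correct. A minimal sketch of that computation (odd n assumed, to avoid ties):

```python
from math import comb

def majority_quality(p: float, n: int) -> float:
    """P(majority label is correct) for n independent labelers,
    each correct with probability p; binary labels, odd n (no ties)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# majority_quality(0.7, 1) = 0.70, majority_quality(0.7, 11) ≈ 0.92:
# redundancy amplifies quality when p > 0.5 and degrades it when p < 0.5.
```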
What if qualities of workers are different?

                                 3 workers, qualities: p‐d, p, p+d

     [Figure: region of the (p, d) space where the majority vote 
      beats the single best worker]

• Majority vote works best when workers have similar quality
• Otherwise better to just pick the vote of the best worker
• …or model worker qualities and combine [coming next]
Combining votes with different quality




               Clemen and Winkler, 1990
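
One standard way to combine votes of unequal quality (a sketch of the general idea, not necessarily the exact model of Clemen and Winkler): weight each vote by the log‐odds of that worker being correct, which is the optimal rule for independent binary labelers.

```python
import math

def weighted_vote(votes):
    """votes: list of (label, quality) with label in {+1, -1} and
    quality = P(worker is correct). Returns the combined label."""
    score = sum(label * math.log(q / (1 - q)) for label, q in votes)
    return +1 if score > 0 else -1

# One 0.9-quality worker outvotes two disagreeing 0.6-quality workers:
# weighted_vote([(+1, 0.9), (-1, 0.6), (-1, 0.6)]) == +1
```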
What happens if we have dependencies?




                           Clemen and Winkler, 1985

   Positive dependencies decrease the number of effective labelers
What happens if we have dependencies?




                                                               Yule’s Q
                                                         measure of correlation



                     Kuncheva et al., PA&A, 2003



   Positive dependencies decrease the number of effective labelers
   Negative dependencies can improve results (unlikely that both workers 
    are wrong at the same time)
Vote combination: Meta‐studies
• Simple averages tend to work well

• Complex models are slightly better but less robust




           [Clemen and Winkler, 1999, Ariely et al. 2000]
From aggregate labels to worker quality

   Look at our spammer friend ATAMRO447HWJQ 
   together with 9 other workers




After aggregation, we compute confusion matrix for each worker

      After majority vote, confusion matrix for ATAMRO447HWJQ
                       P[G → G]=100% P[G → X]=0%
                       P[X → G]=100% P[X → X]=0%
Algorithm of Dawid & Skene, 1979

     Iterative process to estimate worker error rates
1. Initialize by aggregating labels for each object (e.g., use majority vote)
2. Estimate confusion matrix for each worker (using aggregate labels)
3. Estimate aggregate labels (using confusion matrices)
   • Keep labels for “gold data” unchanged
4. Go to Step 2 and iterate until convergence

  Confusion matrix for ATAMRO447HWJQ            Our friend ATAMRO447HWJQ
  P[G → G]=99.947%      P[G → X]=0.053%         marked almost all sites as G.
  P[X → G]=99.153%      P[X → X]=0.847%         Seems like a spammer…
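
A minimal sketch of this EM‐style loop (soft majority‐vote initialization, class priors re‐estimated each round; the `labels` input format and smoothing constant are illustrative assumptions, and gold‐data handling is omitted for brevity):

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50):
    """labels: dict mapping (worker, item) -> observed class in 0..n_classes-1.
    Returns (item posteriors T, worker confusion matrices pi)."""
    workers = sorted({w for w, _ in labels})
    items = sorted({i for _, i in labels})
    w_idx = {w: k for k, w in enumerate(workers)}
    i_idx = {i: k for k, i in enumerate(items)}

    # Step 1: initialize item posteriors with a (soft) majority vote
    T = np.zeros((len(items), n_classes))
    for (w, i), l in labels.items():
        T[i_idx[i], l] += 1
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # Step 2: estimate confusion matrices from the current posteriors
        pi = np.full((len(workers), n_classes, n_classes), 1e-9)  # smoothing
        for (w, i), l in labels.items():
            pi[w_idx[w], :, l] += T[i_idx[i]]
        pi /= pi.sum(axis=2, keepdims=True)

        # Step 3: re-estimate item posteriors from the confusion matrices
        prior = T.mean(axis=0)                        # class priors
        logT = np.tile(np.log(prior + 1e-9), (len(items), 1))
        for (w, i), l in labels.items():
            logT[i_idx[i]] += np.log(pi[w_idx[w], :, l])
        T = np.exp(logT - logT.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)             # Step 4: iterate

    return T, pi
```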
And many variations…
• van der Linden et al, 1997: Item‐Response Theory
• Uebersax, Biostatistics 1993: Ordered categories
• Uebersax, JASA 1993: Ordered categories, with worker 
  expertise and bias, item difficulty
• Carpenter, 2008: Hierarchical Bayesian versions

And more recently at NIPS:
• Whitehill et al., 2009: Adding item difficulty
• Welinder et al., 2010: Adding worker expertise
Challenge: From Confusion Matrices to Quality Scores

  All the algorithms will generate “confusion matrices” for workers

                     Confusion matrix for 
                      ATAMRO447HWJQ
            P[X → X]=0.847%        P[X → G]=99.153%
            P[G → X]=0.053%        P[G → G]=99.947%

     How to check if a worker is a spammer 
           using the confusion matrix? 
           (hint: error rate not enough)
Challenge 1: 
           Spammers are lazy and smart!
Confusion matrix for spammer          Confusion matrix for good worker
   P[X → X]=0%   P[X → G]=100%           P[X → X]=80%       P[X → G]=20%
   P[G → X]=0%   P[G → G]=100%           P[G → X]=20%       P[G → G]=80%

      • Spammers figure out how to fly under the radar…

      • In reality, we have 85% G sites and 15% X sites

      • Error rate of spammer = 0% * 85% + 100% * 15% = 15%
      • Error rate of good worker = 85% * 20% + 15% * 20% = 20%

      False negatives: Spam workers pass as legitimate
Challenge 2: 
                     Humans are biased!
Error rates for CEO of AdSafe

    P[G → G]=20.0%       P[G → P]=80.0%   P[G → R]=0.0%     P[G → X]=0.0%
    P[P → G]=0.0%        P[P → P]=0.0%    P[P → R]=100.0%   P[P → X]=0.0%
    P[R → G]=0.0%        P[R → P]=0.0%    P[R → R]=100.0%   P[R → X]=0.0%
    P[X → G]=0.0%        P[X → P]=0.0%    P[X → R]=0.0%     P[X → X]=100.0%

   In reality, we have 85% G sites, 5% P sites, 5% R sites, 5% X sites

   Error rate of spammer (all in G) = 0% * 85% + 100% * 15% = 15%
   Error rate of biased worker = 80% * 85% + 100% * 5% = 73%

False positives: Legitimate workers appear to be spammers
Solution: Reverse errors first, compute 
              error rate afterwards
Error Rates for CEO of AdSafe
    P[G → G]=20.0%       P[G → P]=80.0%   P[G → R]=0.0%     P[G → X]=0.0%
    P[P → G]=0.0%        P[P → P]=0.0%    P[P → R]=100.0%   P[P → X]=0.0%
    P[R → G]=0.0%        P[R → P]=0.0%    P[R → R]=100.0%   P[R → X]=0.0%
    P[X → G]=0.0%        P[X → P]=0.0%    P[X → R]=0.0%     P[X → X]=100.0%

         •   When the biased worker says G, it is 100% G
         •   When the biased worker says P, it is 100% G
         •   When the biased worker says R, it is 50% P, 50% R
         •   When the biased worker says X, it is 100% X

         Small ambiguity for “R‐rated” votes but other than that, fine!
Solution: Reverse errors first, compute 
              error rate afterwards
Error Rates for spammer ATAMRO447HWJQ
   P[G → G]=100.0%   P[G → P]=0.0%   P[G → R]=0.0%   P[G → X]=0.0%
   P[P → G]=100.0%   P[P → P]=0.0%   P[P → R]=0.0%   P[P → X]=0.0%
   P[R → G]=100.0%   P[R → P]=0.0%   P[R → R]=0.0%   P[R → X]=0.0%
   P[X → G]=100.0%   P[X → P]=0.0%   P[X → R]=0.0%   P[X → X]=0.0%

    • When the spammer says G, it is 25% G, 25% P, 25% R, 25% X
    • When the spammer says P, it is 25% G, 25% P, 25% R, 25% X
    • When the spammer says R, it is 25% G, 25% P, 25% R, 25% X
    • When the spammer says X, it is 25% G, 25% P, 25% R, 25% X
    [note: assume equal priors]

    The results are highly ambiguous. No information provided!
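
A sketch of the “reversal” via Bayes’ rule: the posterior over true classes, given the worker’s assigned label, is the prior‐weighted column of the confusion matrix. The priors and matrix below are the ones from the slides:

```python
import numpy as np

def reverse_errors(confusion, prior):
    """confusion[t, l] = P(worker assigns l | true class t).
    Returns posterior[t, l] = P(true class t | worker assigned l)."""
    joint = prior[:, None] * confusion               # P(t, l)
    return joint / joint.sum(axis=0, keepdims=True)  # normalize each column l

# Biased worker from the slides; classes ordered G, P, R, X.
confusion = np.array([
    [0.2, 0.8, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
prior = np.array([0.85, 0.05, 0.05, 0.05])
posterior = reverse_errors(confusion, prior)
# posterior[:, 1] -> worker says P: 100% G; posterior[:, 2] -> worker says R:
# 50% P, 50% R, matching the slide. For the spammer, every column is uniform.
```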
Quality Scores

• High cost when “soft” labels have probability spread across classes
• Low cost when “soft” labels have probability mass concentrated in one class

         Assigned Label   “Soft” Label                             Cost
         G                <G: 25%, P: 25%, R: 25%, X: 25%>         0.75
         G                <G: 99%, P: 1%, R: 0%, X: 0%>            0.0198

        [***Assume equal misclassification costs]



                                                             Ipeirotis, Provost, Wang, HCOMP 2010
Quality Score
• A spammer is a worker who always assigns labels randomly, 
  regardless of what the true class is.

       QualityScore = 1 ‐ ExpCost(Worker)/ExpCost(Spammer)

• QualityScore is useful for the purpose of blocking bad workers and 
  rewarding good ones
• Essentially a multi‐class, cost‐sensitive AUC metric
   • AUC = area under the ROC curve
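
A sketch consistent with the table above: under equal misclassification costs, the expected cost of a soft label p works out to the sum of pᵢ(1−pᵢ) over classes. (This formula is my reading of the slide’s numbers, not stated on it: <.25,.25,.25,.25> → 0.75 and <.99,.01,0,0> → 0.0198 both match.)

```python
import numpy as np

def exp_cost(soft_label):
    """Expected cost of a soft label under equal misclassification costs:
    the chance that a class sampled from p differs from the true class."""
    p = np.asarray(soft_label)
    return float(np.sum(p * (1 - p)))

def quality_score(worker_soft_labels, n_classes):
    spammer = np.full(n_classes, 1.0 / n_classes)    # uniform soft label
    worker_cost = np.mean([exp_cost(p) for p in worker_soft_labels])
    return 1 - worker_cost / exp_cost(spammer)

# quality_score([[0.99, 0.01, 0.0, 0.0]], 4) ≈ 1 - 0.0198 / 0.75 ≈ 0.974;
# a random labeler scores 0, perfect soft labels score 1.
```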
What about Gold testing?

 Naturally integrated into the latent class model
1. Initialize by aggregating labels for each object (e.g., use majority vote)
2. Estimate error rates for workers (using aggregate labels)
3. Estimate aggregate labels (using error rates; weight worker 
   votes according to quality)
   • Keep labels for “gold data” unchanged
4. Go to Step 2 and iterate until convergence

Gold Testing
•   3 labels per example             •   Quality range: 0.55:0.05:1.0
•   2 categories, 50/50              •   200 labelers

                No significant advantage under “good conditions” 
                    (balanced datasets, good worker quality)
http://bit.ly/gold‐or‐repeated
Wang, Ipeirotis, Provost, WCBI 2011
Gold Testing
•   5 labels per example             •   Quality range: 0.55:1.0
•   2 categories, 50/50              •   200 labelers

   No significant advantage under “good conditions” 
       (balanced datasets, good worker quality)
Gold Testing
•   10 labels per example            •   Quality range: 0.55:1.0
•   2 categories, 50/50              •   200 labelers

   No significant advantage under “good conditions” 
       (balanced datasets, good worker quality)
Gold Testing
•   10 labels per example            •   Quality range: 0.55:1.0
•   2 categories, 90/10              •   200 labelers

      Advantage under imbalanced datasets
Gold Testing
•   5 labels per example             •   Quality range: 0.55:0.65
•   2 categories, 50/50              •   200 labelers

       Advantage with bad worker quality
Gold Testing?
•   10 labels per example            •   Quality range: 0.55:0.65
•   2 categories, 90/10              •   200 labelers

    Significant advantage under “bad conditions” 
      (imbalanced datasets, bad worker quality)
Testing workers
• An exploration‐exploitation scheme:
  – Explore: Learn about the quality of the workers
  – Exploit: Label new examples using the quality
Testing workers
• An exploration‐exploitation scheme:
  – Assign gold labels when the benefit of learning a worker’s 
    quality better outweighs the loss of labeling a 
    gold (known label) example [Wang et al, WCBI 2011]
  – Assign an already labeled example (by other 
    workers) and see if it agrees with the majority [Donmez et 
    al., KDD 2009]

  – If worker quality changes over time, assume 
    accuracy given by an HMM with φ(τ) = φ(τ‐1) + Δ 
    [Donmez et al., SDM 2010]
Example: Build an “Adult Web Site” Classifier

     Get people to look at sites and classify them as:
   G (general audience) PG (parental guidance)  R (restricted) X (porn)

But we are not going to label the whole Internet…
Expensive
Slow
Integrating with Machine Learning
• Crowdsourcing is cheap but not free
  – Cannot scale to the web without help


• Solution: Build automatic classification models 
  using crowdsourced data
Simple solution

• Humans label training data
• Use training data to build model

                   Data from existing 
                 crowdsourced answers
                          ↓
 New Case   →   Automatic Model          →   Automatic 
             (through machine learning)        Answer
Quality and Classification Performance

       Noisy labels lead to degraded task performance
       Labeling quality increases → classification quality increases

       [Figure: AUC (40–100) vs. number of training examples (1–300, 
        “Mushroom” data set), for single‐labeler quality (probability of 
        assigning correctly a binary label) of 50%, 60%, 80%, and 100%]
http://bit.ly/gold‐or‐repeated
Sheng, Provost, Ipeirotis, KDD 2008
Tradeoffs for Machine Learning Models
 • Get more data → improve model accuracy
 • Improve data quality → improve classification

       [Figure: accuracy (40–100) vs. number of examples (1–300, 
        “Mushroom” data set), for data quality of 50%, 60%, 80%, and 100%]
Tradeoffs for Machine Learning Models
 • Get more data: Active Learning, select which 
   unlabeled example to label [Settles, http://active‐learning.net/]

 • Improve data quality: 
   Repeated Labeling, label again an already labeled 
   example [Sheng et al. 2008, Ipeirotis et al, 2010]
Scaling Crowdsourcing: Iterative training
• Use model when confident, humans otherwise
• Retrain with new human input → improve 
  model → reduce need for humans

 New Case  →  Automatic Model (through machine learning)
                  → Automatic Answer, when confident
                  → Get human(s) to answer, otherwise; their answers join 
                    the data from existing crowdsourced answers used to retrain
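
A minimal sketch of that loop (the model API, human‐labeling helper, and 0.9 confidence threshold are hypothetical):

```python
def answer(case, model, get_human_labels, training_data, threshold=0.9):
    """Use the model when confident, humans otherwise; retrain on new input."""
    label, confidence = model.predict(case)     # hypothetical model API
    if confidence >= threshold:
        return label                            # confident: answer automatically
    label = get_human_labels(case)              # otherwise, ask the crowd
    training_data.append((case, label))         # new human input...
    model.retrain(training_data)                # ...improves the model,
    return label                                # reducing future human need
```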
Rule of Thumb Results
• With high quality labelers (80% and above): One 
  worker per case (more data better)

• With low quality labelers (~60%): 
  Multiple workers per case (to improve quality)

[Sheng et al, KDD 2008; Kumar and Lease, CSDM 2011]
Dawid & Skene meets a Classifier
• [Raykar et al. JMLR 2010]: Use the 
  Dawid&Skene scheme but add a classifier as 
  an additional worker

• Classifier in each iteration learns from the 
  consensus labeling
Selective Repeated‐Labeling
• We do not need to label everything the same number of times
• Key observation: we have additional information to guide 
  selection of data for repeated labeling 
  → the current multiset of labels 
• Example:  {+,‐,+,‐,‐,+} vs. {+,+,+,+,+,+}
Label Uncertainty: Focus on uncertainty

• If we know worker qualities, we can estimate the log‐odds for each 
  example

• Assign labels first to examples that are most uncertain (log‐
  odds close to 0 for the binary case)
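
The log‐odds formula on the original slide did not survive extraction; for the binary case with independent workers of known quality it takes the standard form sketched below:

```python
import math

def label_log_odds(votes):
    """votes: list of (label, quality) pairs, label in {+1, -1},
    quality = P(worker correct). Log-odds near 0 = most uncertain."""
    return sum(label * math.log(q / (1 - q)) for label, q in votes)

# With worker quality 0.8: {+,-,+,-,-,+} gives log-odds 0 (label it again
# first), while {+,+,+,+,+,+} gives 6*log(4) ≈ 8.3 (confidently positive).
```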
Model Uncertainty (MU)

   [Figure: +/− labeled examples in feature space, with models fit to them]

• Learning models of the data provides an 
  alternative source of information about label 
  certainty

• Model uncertainty: get more labels for instances 
  that cause model uncertainty
• Intuition?
   – for modeling: why improve training data quality if 
     the model already is certain there?
   – for data quality, low‐certainty “regions” may be due to 
     incorrect labeling of corresponding instances

  “Self‐healing” process 
  [Brodley et al, JAIR 1999; Ipeirotis et al, NYU 2010]
Adult content classification

   [Figure: labeling quality over time for the adult content task; 
    selective labeling vs. round robin]
Too much theory?
              Open source implementation available at:
            http://code.google.com/p/get‐another‐label/
• Input: 
   – Labels from Mechanical Turk
   – Cost of incorrect labelings (e.g., X→G costlier than G→X)
• Output: 
   – Corrected labels
   – Worker error rates
   – Ranking of workers according to their quality
Learning from imperfect data

• With inherently noisy 
  data, good to have 
  learning algorithms that 
  are robust to noise.

  [Figure: accuracy (40–100) vs. number of examples (1–300, 
   “Mushroom” data set)]

• Or use techniques 
  designed to handle 
  explicitly noisy data
                 [Lugosi 1992; Smyth, 1995, 1996]
Outline
•   Introduction: Human computation and crowdsourcing
•   Managing quality for simple tasks
•   Complex tasks using workflows
•   Task optimization
•   Incentivizing the crowd
•   Market design
•   Behavioral aspects and cognitive biases
•   Game design
•   Case studies
How to handle free‐form answers?
• Q: “My task does not have discrete answers….”
• A: Break into two HITs: 
   – “Create” HIT
   – “Vote” HIT

     Creation HIT (e.g. find a URL about a topic)  →  Voting HIT: correct or not?

• Vote HIT controls quality of Creation HIT
• Redundancy controls quality of Voting HIT

• Catch: If “creation” very good, in voting workers just vote “yes”
   – Solution: Add some random noise (e.g. add typos)

                             Example: Collect URLs
But my free‐form is 
not just right or wrong…

• “Create” HIT
• “Improve” HIT
• “Compare” HIT

      Creation HIT (e.g. describe the image)
          → Improve HIT (e.g. improve description)
          → Compare HIT (voting: which is better?)
        TurkIt toolkit [Little et al., UIST 2010]: http://groups.csail.mit.edu/uid/turkit/
version 1:
    A parial view of a pocket calculator together with 
    some coins and a pen.
version 2:
     A view of personal items a calculator, and some gold and 
     copper coins, and a round tip pen, these are all pocket 
     and wallet sized item used for business, writting, calculating 
     prices or solving math problems and purchasing items.
version 3:
     A close‐up photograph of the following items:  A CASIO 
     multi‐function calculator. A ball point pen, uncapped. 
     Various coins, apparently European, both copper and gold. 
     Seems to be a theme illustration for a brochure or document 
     cover treating finance, probably personal finance.
version 4:
     …Various British coins; two of £1 value, three of 20p value 
     and one of 1p value. …



version 8: 
    “A close‐up photograph of the following items: A 
    CASIO multi‐function, solar powered scientific 
    calculator. A blue ball point pen with a blue rubber 
    grip and the tip extended. Six British coins; two of £1 
    value, three of 20p value and one of 1p value. Seems 
    to be a  theme illustration for a brochure or 
    document cover treating finance ‐ probably personal 
    finance."
Independence or Not?

    • Building iteratively (lack of independence) allows better 
      outcomes for the image description task…
    • In the FoldIt game, workers built on each other’s results

[Little et al, HCOMP 2010]
Independence or Not?

    • But lack of independence 
      may cause high 
      dependence on starting 
      conditions and create 
      groupthink
    • …but also prevents 
      disasters
[Little et al, HCOMP 2010]
Independence or Not?
                              Collective Problem Solving


    • Exploration / exploitation tradeoff 
         – Can accelerate learning, by sharing good solutions
         – But can lead to premature convergence on 
           suboptimal solution




[Mason and Watts, submitted to Science, 2011]
Individual search strategy affects group success

                         • More players copying 
                           each other (i.e., fewer 
                           exploring) in the current 
                           round
                           → Lower probability of 
                           finding the peak on the 
                           next round
The role of Communication Networks
• Examine various “neighbor” structures 
  (who talks to whom about the oil levels)
Network structure affects individual search strategy

• Higher clustering 
  → Higher probability of 
  neighbors guessing in 
  identical location

• More neighbors guessing 
  in identical location
  → Higher probability of 
  copying
Diffusion of Best Solution
Individual search strategy affects group success

• No significant 
  differences in % of 
  games in which the peak 
  was found

• Network affects 
  willingness to explore
Network structure affects group success
TurKontrol: Decision‐Theoretic Modeling

• Optimizing workflow execution using decision‐
  theoretic approaches [Dai et al, AAAI 2010; Kern et al. 2010]
• Significant work in control theory [Montgomery, 2007]
http://www.workflowpatterns.com

         Common Workflow Patterns

Basic Control Flow:          Iteration:
• Sequence                   • Arbitrary Cycles (goto)
• Parallel Split             • Structured Loop (for, while, repeat)
• Synchronization            • Recursion
• Exclusive Choice
• Simple Merge
Soylent
• Word processor with crowd embedded [Bernstein et al, UIST 2010]

• “Proofread paper”: Ask workers to proofread each paragraph
   – Lazy Turker: Fixes the minimum possible (e.g., single typo)
   – Eager Beaver: Fixes way beyond the necessary but adds 
     extra errors (e.g., inline suggestions on writing style)

• Find‐Fix‐Verify pattern
   – Separating Find from Fix thwarts the Lazy Turker
   – Separating Fix from Verify ensures quality
Find     “Identify at least one area 
         that can be shortened 
         without changing the 
         meaning of the 
         paragraph.”
                         Independent agreement to identify patches

Fix      “Edit the highlighted 
         section to shorten its 
         length without changing           Soylent, a prototype...
         the meaning 
         of the paragraph.”
                       Randomize order of suggestions

Verify   “Choose at least one 
         rewrite that has style 
         errors, and 
         at least one rewrite that 
         changes the meaning 
         of the sentence.”
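
A minimal sketch of the Find‐Fix‐Verify control flow (the `ask_workers` helper and the agreement thresholds are hypothetical, for illustration only):

```python
from collections import Counter
import random

def find_fix_verify(paragraph, ask_workers):
    """Sketch of Soylent's Find-Fix-Verify [Bernstein et al, UIST 2010].
    ask_workers(stage, payload, n) returns one response per worker."""
    # Find: workers independently mark spans; keep spans at least two
    # workers agree on (filters out the Lazy Turker's single typo)
    spans = ask_workers("find", paragraph, n=10)
    patches = [s for s, votes in Counter(spans).items() if votes >= 2]

    result = paragraph
    for patch in patches:
        # Fix: several workers independently rewrite each agreed-upon patch
        rewrites = ask_workers("fix", patch, n=5)
        random.shuffle(rewrites)        # randomize order shown to verifiers

        # Verify: workers flag rewrites with style errors or changed
        # meaning; keep a rewrite flagged by at most one verifier
        for rewrite in rewrites:
            flags = ask_workers("verify", rewrite, n=5)  # booleans
            if sum(flags) <= 1:
                result = result.replace(patch, rewrite)
                break
    return result
```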
Crowd‐created Workflows: CrowdForge
• Map‐Reduce framework for crowds [Kittur et al, CHI 2011]

                     – Identify sights worth checking 
                       out (one tip per worker)
                         • Vote and rank
                     – Brief tips for each monument 
                       (one tip per worker)
                         • Vote and rank
                     – Aggregate tips in meaningful 
                       summary
                         • Iterate to improve…
My Boss is a Robot (mybossisarobot.com),  Nikki Kittur (CMU) + Jim Giles (New Scientist)
Crowd‐created Workflows: TurkoMatic
• Crowd creates workflows

• Turkomatic [Kulkarni et al, CHI 2011]:
   1. Ask workers to decompose task into steps (Map)
   2. Can step be completed within 10 minutes?
       1. Yes: solve it.
       2. No: decompose further (recursion)
   3. Given all partial solutions, solve big problem (Reduce)
Crowdsourcing Patterns

• Generate / Create
• Find                                     Creation
• Improve / Edit / Fix

• Vote for accept‐reject
• Vote up, vote down, to generate rank     Quality Control
• Vote for best / select top‐k

• Split task
• Aggregate                                Flow Control
Outline
•   Introduction: Human computation and crowdsourcing
•   Managing quality for simple tasks
•   Complex tasks using workflows
•   Task optimization
•   Incentivizing the crowd
•   Market design
•   Behavioral aspects and cognitive biases
•   Game design
•   Case studies
Defining Task Parameters
Three main goals:

• Minimize Cost (cheap)
• Maximize Quality (good)
• Minimize Completion Time (fast)
Effect of Payment: Quality
• Cost does not affect quality [Mason and Watts, 2009, AdSafe]
• Similar results for bigger tasks [Ariely et al, 2009]

   [Figure: error rate (0.00–0.45) vs. number of labelers (0–30), 
    for payments of 2, 5, and 10 cents]
Effect of Payment: #Tasks
• Payment incentives increase speed, though




               [Mason and Watts, 2009]
Predicting Completion Time
• Model timing of individual task 
  [Yan, Kumar, Ganesan, 2010]
   – Assume rate of task completion λ
   – Exponential distribution for 
     single task
   – Erlang distribution for sequential 
     tasks
   – On‐the‐fly estimation of λ for 
     parallel tasks
• Optimize using early 
  acceptance/termination 
   – Sequential experiment setting
   – Stop early if confident
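
A sketch of the timing model: if each task completes at rate λ (Exponential), a sequence of n tasks follows an Erlang(n, λ) distribution, whose CDF can be used to predict completion (the numeric values below are illustrative, not from the paper):

```python
from math import exp, factorial

def erlang_cdf(t, n, lam):
    """P(n sequential tasks all finish within time t), one task
    completing at rate lam (Exponential); Erlang(n, lam) CDF."""
    return 1 - sum((lam * t) ** k * exp(-lam * t) / factorial(k)
                   for k in range(n))

# At lam = 2 tasks/hour, the probability that a sequence of 5 tasks
# finishes within 4 hours is erlang_cdf(4, 5, 2) ≈ 0.90.
```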
Predicting Completion Time

• For Freebase, worker completion times follow a 
  log‐normal distribution [Kochhar et al, HCOMP 2010]
Predicting Completion Time
• Exponential assumption usually not realistic
• Heavy‐tailed distribution [Ipeirotis, XRDS 2010]
Effect of #HITs: Monotonic, but sublinear

                     h(t) = 0.998^#HITs

•   10 HITs → 2% slower than 1 HIT
•   100 HITs → 19% slower than 1 HIT 
•   1000 HITs → 87% slower than 1 HIT 
    or, 1 group of 1000 → 7 times faster than 1000 sequential groups of 1
                                                                [Wang et al, CSDM 2011]
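
A quick check of those numbers with h = 0.998^n (small rounding differences from the slide are expected):

```python
# Relative completion speed h = 0.998**n for a group of n HITs:
for n in (10, 100, 1000):
    h = 0.998 ** n
    print(f"{n:5d} HITs: {1 - h:5.1%} slower than 1 HIT")
# -> ~2%, ~18%, ~86% (the slide rounds to 2%, 19%, 87%); one group of
# 1000 is ~1 / 0.135 ≈ 7x faster than 1000 sequential groups of 1.
```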
HIT Topics
topic 1:  cw castingwords  podcast  transcribe  english  mp3  edit  confirm  snippet  grade
topic 2:  data  collection  search  image  entry  listings  website  review  survey  opinion
topic 3:  categorization  product  video  page  smartsheet web  comment  website  opinion
topic 4:  easy  quick  survey  money  research  fast  simple  form  answers  link
topic 5:  question  answer  nanonano dinkle article  write  writing  review  blog  articles
topic 6:  writing  answer  article  question  opinion  short  advice  editing  rewriting  paul
topic 7:  transcribe  transcription  improve  retranscribe edit  answerly voicemail  answer
                                                                         [Wang et al, CSDM 2011]
Effect of Topic: The CastingWords Effect

      topic 1:  cw castingwords  podcast  transcribe  english  mp3  edit  confirm  snippet  grade
      topic 2:  data  collection  search  image  entry  listings  website  review  survey  opinion
      topic 3:  categorization  product  video  page  smartsheet web  comment  website  opinion
      topic 4:  easy  quick  survey  money  research  fast  simple  form  answers  link
      topic 5:  question  answer  nanonano dinkle article  write  writing  review  blog  articles
      topic 6:  writing  answer  article  question  opinion  short  advice  editing  rewriting  paul
      topic 7:  transcribe  transcription  improve  retranscribe edit  answerly voicemail  query  question  answer
                                                                                           [Wang et al, CSDM 2011]
Effect of Topic: Surveys=fast (even with redundancy!)

      topic 1:  cw castingwords  podcast  transcribe  english  mp3  edit  confirm  snippet  grade
      topic 2:  data  collection  search  image  entry  listings  website  review  survey  opinion
      topic 3:  categorization  product  video  page  smartsheet web  comment  website  opinion
      topic 4:  easy  quick  survey  money  research  fast  simple  form  answers  link
      topic 5:  question  answer  nanonano dinkle article  write  writing  review  blog  articles
      topic 6:  writing  answer  article  question  opinion  short  advice  editing  rewriting  paul
      topic 7:  transcribe  transcription  improve  retranscribe edit  answerly voicemail  query  question  answer
                                                                                            [Wang et al, CSDM 2011]
Effect of Topic: Writing takes time

      topic 1:  cw castingwords  podcast  transcribe  english  mp3  edit  confirm  snippet  grade
      topic 2:  data  collection  search  image  entry  listings  website  review  survey  opinion
      topic 3:  categorization  product  video  page  smartsheet web  comment  website  opinion
      topic 4:  easy  quick  survey  money  research  fast  simple  form  answers  link
      topic 5:  question  answer  nanonano dinkle article  write  writing  review  blog  articles
      topic 6:  writing  answer  article  question  opinion  short  advice  editing  rewriting  paul
      topic 7:  transcribe  transcription  improve  retranscribe edit  answerly voicemail  query  question  answer
                                                                                          [Wang et al, CSDM 2011]
Optimizing Completion Time
• Workers pick tasks that have a large number of 
  HITs or are recent [Chilton et al., HCOMP 2010]
• VizWiz optimizations [Bigham, UIST 2011]:
  – Posts HITs continuously (to be recent) 
  – Makes big HIT groups (to be large)
  – HITs are “external HITs” (i.e., IFRAME hosted)
  – HITs populated when the worker accepts them
Optimizing Completion Time
• Completion rate varies with 
  time of day, depending on 
  the audience location (India 
  vs US vs Middle East)

• Quality tends to remain the 
  same, independent of 
  completion time 
  [Huang et al., HCOMP 2010]
Other Optimizations
• Qurk [Marcus et al., CIDR 2011] and CrowdDB [Franklin et al., SIGMOD 2011]: 
  Treat humans as uncertain UDFs + apply relational 
  optimization, plus the “GoodEnough” and “StopAfter” 
  operators

• CrowdFlow [Quinn et al.]: Integrate crowd with machine 
  learning to reach a balance of speed, quality, cost

• Ask humans for directions in a graph: [Parameswaran et 
  al., VLDB 2011]. See also [Kleinberg, Nature 2000; 
  Mitzenmacher, XRDS 2010; Deng, ECCV 2010]
Outline
•   Introduction: Human computation and crowdsourcing
•   Managing quality for simple tasks
•   Complex tasks using workflows
•   Task optimization
•   Incentivizing the crowd
•   Market design
•   Behavioral aspects and cognitive biases
•   Game design
•   Case studies
Incentives
• Monetary
• Self‐serving
• Altruistic
Incentives: Money
• Money does not improve quality but (generally) 
  increases participation [Ariely, 2009; Mason & Watts, 2009]

• But workers may be “target earners” (stop after 
  reaching their daily goal) [Horton & Chilton, 2010 for MTurk; 
  Camerer et al. 1997, Farber 2008, for taxi drivers; Fehr and Goette 2007]
Incentives: Money and Trouble
• Careful: Paying a little is often worse than paying 
  nothing! 
   – “Pay enough or not at all” [Gneezy et al, 2000]
   – Small pay now locks future pay
   – Payment replaces internal motivation (paying kids to collect 
     donations decreased enthusiasm; spam classification; “thanks for 
     dinner, here is $100”)
   – Lesson: Be the Tom Sawyer (“how I like painting the 
     fence”), not the scrooge‐y boss…

• Paying a lot is a counter‐incentive: 
   – People focus on the reward and not on the task
   – On MTurk, spammers routinely attack highly‐paying tasks
Incentives
• Monetary
• Self‐serving
• Altruistic
Incentives: Leaderboards
• Leaderboards (“top participants”) are a frequent 
  motivator
  – Should motivate correct behavior, not just 
    measurable behavior
  – Newcomers should have hope of reaching the top
  – Whatever is measured, workers will optimize for 
    this (e.g., Orkut country leaderboard; complaints for quality score drops)
  – Design guideline: Christmas‐tree dashboard (Green / Red lights only)

                             [Farmer and Glass, 2010]
Incentives: Purpose of Work
• Contrafreeloading: Rats and other animals prefer to 
  “earn” their food

• Destroying work after production demotivates 
  workers [Ariely et al, 2008]

• Showing the result of a “completed task” improves 
  satisfaction
Incentives: Purpose of Work
• Workers enjoy learning new skills (oft‐cited reason for 
  MTurk participation)

• Design tasks to be educational
   – DuoLingo: Translate while learning a new language [von Ahn et al, 
     duolingo.com]
   – Galaxy Zoo, Clickworkers: Classify astronomical objects 
     [Raddick et al, 2010; http://en.wikipedia.org/wiki/Clickworkers]
   – Citizen Science: Learn about biology 
     [http://www.birds.cornell.edu/citsci/]
   – National Geographic “Field Expedition: Mongolia”: tag 
     potential archeological sites, learn about archeology
Incentives: Credit and Participation
• Public credit contributes to a sense of 
  participation
• Credit is also a form of reputation

• (The anonymity of MTurk‐like settings discourages this factor)
Incentives
• Monetary
• Self‐serving
• Altruistic
Incentive: Altruism
• Contributing back (tit for tat): Early reviewers 
  wrote reviews because they had read other useful 
  reviews

• The effect is amplified in social networks: “If all my 
  friends do it…” or “Since all my friends will see 
  this…”

• Contributing to a shared goal
Incentives: Altruism and Purpose
• On MTurk [Chandler and Kapelner, 2010]
   – Americans [older, more leisure‐driven] worked 
     harder for “meaningful work”
   – Indians [more income‐driven] were not affected 
   – Quality was unchanged for both groups
Incentives: Fair share
• Anecdote: Same HIT (spam classification)
  – Case 1: Requester doing it as a side project, to “clean 
    the market”; it would be an out‐of‐pocket expense, with no 
    pay to workers
  – Case 2: Requester is a researcher at a university; spam 
    classification is now a university research project, with 
    payment to workers

               Which setting worked best?
Incentives: FUN!
• Game‐ify the task (design details later)
• Examples
   – ESP Game: Given an image, type the same 
     word (generates image descriptions)
   – Phylo: Align color blocks (used for genome 
     alignment)
   – FoldIt: Fold structures to optimize energy 
     (protein folding)

• Fun factors [Malone 1980, 1982]:
   – timed response
   – score keeping
   – player skill level
   – high‐score lists
   – randomness
Outline
•   Introduction: Human computation and crowdsourcing
•   Managing quality for simple tasks
•   Complex tasks using workflows
•   Task optimization
•   Incentivizing the crowd
•   Market design
•   Behavioral aspects and cognitive biases
•   Game design
•   Case studies
Market Design Organizes the Crowd
• Reputation mechanisms 
   – Seller side: Ensure worker quality 
   – Buyer side: Ensure employer trustworthiness

• Task organization for task discovery (worker finds 
  employer/task)

• Worker expertise recording for task assignment 
  (employer/task finds worker)
Lack of Reputation and the Market for Lemons
• “When the quality of a sold good is uncertain and hidden before 
  the transaction, the price drops to the value of the lowest‐valued good” 
  [Akerlof, 1970; Nobel prize winner]

Market evolution steps (see the simulation sketch below):
1. Employer pays $10 to a good worker, $0.10 to a bad worker
2. 50% good workers, 50% bad; indistinguishable from each other
3. Employer offers a price in the middle: $5
4. Some good workers leave the market (pay too low)
5. Employer revises prices downwards as the % of bad workers increases
6. More good workers leave the market… death spiral 

              http://en.wikipedia.org/wiki/The_Market_for_Lemons
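To make the death spiral concrete, here is a minimal simulation sketch in Python. The pool sizes, worker values, reservation wage, and exit rate are illustrative assumptions, not numbers from the tutorial.

```python
# A minimal sketch of Akerlof-style "market for lemons" dynamics in a
# crowdsourcing market. All numbers below are illustrative assumptions.

def lemons_spiral(good=50, bad=50, good_value=10.0, bad_value=0.1,
                  reservation_wage=6.0, rounds=10):
    """The employer offers the average value of the current pool; good
    workers exit when the offer is below their reservation wage, while
    bad workers always stay."""
    for t in range(rounds):
        offer = (good * good_value + bad * bad_value) / (good + bad)
        print(f"round {t}: good={good:3d} bad={bad:3d} offer=${offer:.2f}")
        if offer < reservation_wage:
            good = good // 2   # half of the remaining good workers exit
        if good == 0:
            print("no good workers left: the no-trade equilibrium")
            break

lemons_spiral()
```

Each revision of the offer drives out more good workers, which lowers the next offer, until only bad workers remain.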
Lack of Reputation and the Market for Lemons
• Market for lemons also on the employer side:
      – Workers distrust (good) newcomer employers: They charge a risk premium, 
        or work only a little bit. Good newcomers get disappointed.
      – Bad newcomers have no downside (they will not pay) and continue to offer 
        work.
      – The market floods with bad employers
•    TurkOpticon: an external reputation system
•    “Mechanical Turk: Now with 40.92% spam” http://bit.ly/ew6vg4 

• Gresham's Law: the bad drives out the good
• No‐trade equilibrium: no good employer offers work in a 
  market with bad workers, no good worker wants to work for 
  bad employers…
• In reality, we need to take into consideration that this is a 
  repeated game (but participation follows a heavy tail…)
                  http://en.wikipedia.org/wiki/The_Market_for_Lemons
Reputation systems
• Significant number of reputation mechanisms 
  [Dellarocas et al., 2007]

• Link analysis techniques [TrustRank, EigenTrust, 
  NodeRanking, NetProbe, Snare] are often applicable 
  (a minimal sketch follows below)
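Several of these link-analysis methods reduce to a power iteration over a normalized trust graph. Below is a minimal sketch in the spirit of EigenTrust; the toy trust matrix, the damping factor, and the uniform pre-trust vector are illustrative assumptions.

```python
import numpy as np

# A minimal EigenTrust-style sketch: global reputation as the fixed point
# of a power iteration over row-normalized local trust scores, blended
# with a pre-trust vector for robustness. All inputs are illustrative.

def eigentrust(local_trust, damping=0.15, iters=50):
    C = np.asarray(local_trust, dtype=float)
    C = C / C.sum(axis=1, keepdims=True)   # each rater's scores sum to 1
    n = C.shape[0]
    p = np.full(n, 1.0 / n)                # uniform pre-trusted distribution
    t = p.copy()
    for _ in range(iters):
        t = (1 - damping) * (C.T @ t) + damping * p
    return t

# local_trust[i][j]: how much worker i trusts worker j
local = [[0, 5, 1],
         [4, 0, 1],
         [3, 2, 0]]
print(eigentrust(local))   # global reputation scores, summing to 1
```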
Challenges in the Design of Reputation Systems

  • Insufficient participation
  • Overwhelmingly positive feedback
  • Dishonest reports
  • Identity changes
  • Value imbalance exploitation (“milking the reputation”)
Insufficient Participation

      • Free‐riding: Feedback constitutes a public good. Once available, 
        everyone can costlessly benefit from it.
      • Disadvantage of early evaluators: Providing feedback 
        presupposes that the rater will assume the risks of transacting with 
        the ratee (a competitive advantage to others).

      • [Avery et al. 1999] propose a mechanism whereby early 
        evaluators are paid to provide information and later evaluators 
        pay to balance the budget.
Overwhelmingly Positive Feedback (I)

        More than 99% of all feedback posted on eBay is positive. 
        However, Internet auctions accounted for 16% of all consumer 
        fraud complaints received by the Federal Trade Commission in 
        2004. (http://www.consumer.gov/sentinel/)
                                                    Reporting Bias

      The perils of reciprocity:
      • Reciprocity: Seller evaluates buyer, buyer evaluates seller
      • Exchange of courtesies
      • Positive reciprocity: Positive ratings are given in the hope 
        of getting a positive rating in return
      • Negative reciprocity: Negative ratings are avoided because 
        of fear of retaliation from the other party
Overwhelmingly Positive Feedback (II)

       “The sound of silence”: No news is bad news…
       • [Dellarocas and Wood 2008] explore the frequency of 
         different feedback patterns and use the non‐reports to 
         compensate for reporting bias.
          • eBay traders are more likely to post feedback when satisfied 
            than when dissatisfied
          • The data support the presence of positive and negative 
            reciprocation among eBay traders.
Dishonest Reports

       • “Ballot stuffing” (unfairly high ratings): A seller colludes with a 
         group of buyers in order to be given unfairly high ratings by them.
       • “Bad‐mouthing” (unfairly low ratings): Sellers can collude with 
         buyers in order to “bad‐mouth” other sellers that they want to drive 
         out of the market.

      • Design incentive‐compatible mechanisms to elicit honest feedback 
        [Jurca and Faltings 2003: pay the rater if the report matches the next one; 
        Miller et al. 2005: use a proper scoring rule to price the value of a report; 
        Papaioannou and Stamoulis 2005: delay the next transaction over time]
      • Use the “latent class” models described earlier in the tutorial 
        (reputation systems are a form of crowdsourcing, after all…)
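As a flavor of these schemes, the sketch below implements the simplest matching rule, in the spirit of the Jurca and Faltings idea: a rater is paid only when their report agrees with the next report on the same subject. The reward amount and the example data are illustrative assumptions.

```python
from collections import defaultdict

# A minimal sketch of an output-agreement payment rule: pay a rater when
# their report matches the next rater's report on the same subject.
# Reward amounts and example reports are illustrative assumptions.

def settle_payments(reports, reward=0.10):
    """reports: list of (rater, subject, rating) in arrival order.
    Returns total payment per rater."""
    pending = {}              # subject -> (rater, rating) awaiting a match
    pay = defaultdict(float)
    for rater, subject, rating in reports:
        if subject in pending:
            prev_rater, prev_rating = pending[subject]
            if prev_rating == rating:
                pay[prev_rater] += reward  # confirmed by the next report
        pending[subject] = (rater, rating)
    return dict(pay)

reports = [("w1", "seller_A", "good"), ("w2", "seller_A", "good"),
           ("w3", "seller_A", "bad"),  ("w4", "seller_A", "bad")]
print(settle_payments(reports))   # {'w1': 0.1, 'w3': 0.1}
```

Under this rule, lying only pays off if the next rater lies the same way, which is what makes honest reporting an equilibrium under suitable assumptions.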
Identity Changes

        • “Cheap pseudonyms”: It is easy to disappear and re‐register 
          under a new identity with almost zero cost 
          [Friedman and Resnick 2001]

        • This introduces opportunities to misbehave without paying 
          reputational consequences.  

      • Increase the difficulty of online identity changes
      • Impose upfront costs on new entrants: Allow new identities 
        (forget the past) but make it costly to create them
Value Imbalance Exploitation

         Three men attempted to sell a fake painting on eBay for US$135,805. 
         The sale was abandoned just prior to purchase when the buyer 
         became suspicious. (http://news.cnet.com/2100‐1017‐253848.html)

    • Reputation can be seen as an asset, not only to 
      promote oneself, but also as something that can be 
      cashed in through a fraudulent transaction with high 
      gain.

                              “The Market for Evaluations”
The Market for Positive Feedback

        A selling strategy: eBay users are actually using the 
               feedback market for gains in other markets.
                “Riddle for a PENNY! No shipping ‐ Positive Feedback”

    • A 29‐cent loss even in the event of a successful sale
    • Price low, speed up feedback accumulation

Possible solutions:
    • Make the details of the transaction (besides the feedback itself) visible 
      to other users
    • Transaction‐weighted reputation statistics (sketch below)

                                                                    [Brown 2006]
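A transaction-weighted statistic is straightforward to compute: weight each rating by the dollar value of its transaction, so a flood of penny sales cannot buy a high score. The sketch below uses illustrative data and field names.

```python
# A minimal sketch of a transaction-weighted reputation score: each rating
# counts in proportion to the transaction's value, so penny sales cannot
# inflate the score. Example data is illustrative.

def weighted_reputation(transactions):
    """transactions: list of (rating in [0, 1], dollar_value)."""
    total_value = sum(value for _, value in transactions)
    if total_value == 0:
        return None
    return sum(rating * value for rating, value in transactions) / total_value

penny_sales = [(1.0, 0.01)] * 100   # 100 positive one-penny transactions
one_big_fraud = [(0.0, 500.0)]      # a single fraudulent $500 sale
print(weighted_reputation(penny_sales + one_big_fraud))  # ~0.002, not ~0.99
```

An unweighted average over the same history would report 100/101 ≈ 0.99, exactly the exploit the penny-sale strategy relies on.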
Challenges for Crowdsourcing Markets (I)
• Two‐sided opportunistic behavior
   • Reciprocal systems are worse than one‐sided evaluation. In e‐commerce 
     markets, only sellers are likely to behave opportunistically: no need for 
     reciprocal evaluation!
   • In crowdsourcing markets, both sides can be fraudulent. Reciprocal 
     systems are fraught with problems, though!

• Imperfect monitoring and heavy‐tailed participation
   • In e‐commerce markets, buyers can assess product quality directly 
     upon receiving it.
   • In crowdsourcing markets, verifying the answers is sometimes as costly as 
     providing them.
   • Sampling often does not work, due to the heavy‐tailed participation 
     distribution (lognormal, according to self‐reported surveys)
Challenges for Crowdsourcing Markets (II)

 • Constrained capacity of workers
    • In e‐commerce markets, sellers usually have an unlimited supply of 
      products.
    • In crowdsourcing, workers have constrained capacity (they cannot be 
      recommended continuously)

 • No “price premium” for high‐quality workers
    • In e‐commerce markets, sellers with high reputation can sell their 
      products at a relatively high price (premium).
    • In crowdsourcing, it is the requester who sets the prices, which are 
      generally the same for all the workers.
Market Design Organizes the Crowd
• Reputation mechanisms 
   – Seller side: Ensure worker quality 
   – Buyer side: Ensure employer trustworthiness

• Task organization for task discovery (worker finds 
  employer/task)

• Worker expertise recording for task assignment 
  (employer/task finds worker)
The Importance of Task Discovery

• Heavy‐tailed distribution of completion times. Why?
      [Ipeirotis, “Analyzing the Amazon Mechanical Turk marketplace”, XRDS 2010]
The Importance and Danger of Priorities
• [Barabasi, Nature 2005] showed that human actions 
  have power‐law completion times
   – Mainly a result of prioritization
   – When tasks are ranked by priorities, a power law results

• [Cobham, 1954]: If a queuing system completes tasks 
  with two priority queues, and λ=μ, then completion times 
  follow a power law (a simulation sketch follows below)

• [Chilton et al., HCOMP 2010] Workers on MTurk pick 
  tasks from the “most HITs” or “most recent” queues
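The heavy tail is easy to reproduce in simulation. The sketch below is a minimal version of Barabasi's priority-queue model; the list length, the probability of strict prioritization, and the number of steps are illustrative assumptions.

```python
import random

# A minimal sketch of Barabasi's priority-queue model (Nature 2005): keep
# a list of L tasks; at each step, execute the highest-priority task with
# probability p (otherwise a random one) and replace it with a fresh task.
# Waiting times become heavy-tailed as p -> 1 (strict prioritization).

def simulate(L=2, p=0.99999, steps=200_000):
    tasks = [(random.random(), 0) for _ in range(L)]  # (priority, arrival step)
    waits = []
    for step in range(1, steps + 1):
        if random.random() < p:
            i = max(range(L), key=lambda k: tasks[k][0])  # highest priority
        else:
            i = random.randrange(L)                       # rare random pick
        waits.append(step - tasks[i][1])                  # time the task waited
        tasks[i] = (random.random(), step)                # a fresh task arrives
    return waits

waits = simulate()
for w in (1, 10, 100, 1000):
    frac = sum(1 for x in waits if x >= w) / len(waits)
    print(f"P(wait >= {w:4d}) = {frac:.4f}")
```

The survival probabilities decay roughly like 1/w, far slower than the exponential decay of an unprioritized queue: most tasks finish immediately, while an unlucky few wait essentially forever, just like HITs buried deep in the listing.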
The UI hurts the market!

• Practitioners know that HITs on the 3rd page and after 
  are not picked up by workers. 
• Many such HITs are left to expire after months, 
  never completed.

• A badly designed task‐discovery interface hurts every 
  participant in the market! (and is the reason for scientific modeling…)
• Better modeling as a queuing system may 
  demonstrate other such improvements
Market Design Organizes the Crowd
• Reputation mechanisms 
   – Seller side: Ensure worker quality 
   – Buyer side: Ensure employer trustworthiness

• Task organization for task discovery (worker finds 
  employer/task)

• Worker expertise recording for task assignment 
  (employer/task finds worker)
Expert Search
• Find the best worker for a task (or within a task)

• For a task:
   – Significant amount of research on the topic of expert 
     search [TREC track; Macdonald and Ounis, 2006]
   – Check the quality of workers across tasks: 
     http://url‐annotator.appspot.com/Admin/WorkersReport

• Within a task: [Donmez et al., 2009; Welinder, 2010]
Directions for future research
• Optimize the allocation of tasks to workers based on completion 
  time and expected quality

• Explicitly take into consideration competition in the market, and 
  switch tasks for a worker only when the benefit outweighs the 
  switching overhead (cf. task switching by the O/S in a CPU)

• Recommender system for tasks (“workers like you 
  performed well in…”)

• Create a market with dynamic pricing for tasks, following 
  the pricing model of the stock market: prices increase for a 
  task when the work supply is low, and vice versa (see the 
  hypothetical sketch below)
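As one way to make the last idea concrete, a crude market-maker rule could adjust a task's price after each posting interval based on how much of the work was picked up. Everything below (the update rule, step size, target fill rate, and bounds) is a hypothetical sketch of the proposed research direction, not an existing system.

```python
# A hypothetical sketch of supply-responsive task pricing, in the spirit
# of the stock-market idea above: raise the price when few workers pick
# up the task, drift it down when the task is oversubscribed. All the
# parameters here are illustrative assumptions.

def update_price(price, fill_rate, target=0.9, step=0.10,
                 floor=0.01, ceiling=5.00):
    """fill_rate: fraction of posted assignments completed in the interval."""
    if fill_rate < target:
        price *= 1 + step        # work supply low -> pay more
    else:
        price *= 1 - step / 2    # oversubscribed -> lower the price slowly
    return min(max(price, floor), ceiling)

price = 0.05
for fill in [0.2, 0.3, 0.5, 0.95, 1.0]:   # observed fill rate per interval
    price = update_price(price, fill)
    print(f"fill={fill:.2f} -> price ${price:.3f}")
```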
Outline
•   Introduction: Human computation and crowdsourcing
•   Managing quality for simple tasks
•   Complex tasks using workflows
•   Task optimization
•   Incentivizing the crowd
•   Market design
•   Behavioral aspects and cognitive biases
•   Game design
•   Case studies
Human Computation
• Humans are not perfect mathematical models

• They exhibit noisy, stochastic behavior…

• And they exhibit common and systematic biases
Score the following from 1 to 10 
           1: not particularly bad or wrong 
           10: extremely evil 

a) Stealing a towel from a hotel 
b) Keeping a dime you find on the ground 
c) Poisoning a barking dog

                                              [Parducci, 1968]
Score the following from 1 to 10 
           1: not particularly bad or wrong 
           10: extremely evil 

a) Testifying falsely for pay
b) Using guns on striking workers
c) Poisoning a barking dog

                                               [Parducci, 1968]
Anchoring
• “Humans start with a first approximation (anchor) and 
  then make adjustments to that number based on 
  additional information.” [Tversky & Kahneman, 1974]

• [Paolacci et al., 2010]
   – Q1a: More or fewer than 65 African countries in the UN?
   – Q1b: More or fewer than 12 African countries in the UN?

   – Q2: How many countries are in Africa?
   – Group A mean: 42.6
   – Group B mean: 18.5
Anchoring
• Writing down the last digits of their social security 
  number before placing bids for wine bottles: Users 
  with lower SSN numbers bid lower…

• In the Netflix contest, users with high ratings early 
  in a session were biased towards higher ratings later 
  in the session…

• Crowdsourcing tasks can be affected by 
  anchoring. [Mozer et al., NIPS 2010] describe 
  techniques for removing these effects
Priming
• Exposure to one stimulus influences the response to another
• Stereotypes: 
  – Asian‐Americans perform better in math
  – Women perform worse in math

• [Shih et al., 1999] asked Asian‐American women:
  – Questions about race: They did better on a math test
  – Questions about gender: They did worse on a math test
Exposure Effect
• Familiarity leads to liking...

• [Stone and Alonso, 2010]: Evaluators of the Bing 
  search engine increased their ratings of 
  relevance over time, for the same results
Framing
• Presenting the same option in different 
  formats leads to different decisions. People 
  avert options that imply loss [Tversky and 
  Kahneman, 1981]
Framing: 
             600 people affected by a deadly disease
Room 1
a) Save 200 people's lives
b) 33% chance of saving all 600 people and a 66% chance of saving no one

• 72% of participants chose option A
• 28% of participants chose option B

Room 2
c) 400 people die
d) 33% chance that no people will die; a 66% chance that all 600 will die

• 78% of participants chose option D (equivalent to option B)
• 22% of participants chose option C (equivalent to option A)

                    People avert options that imply loss 
Very long list of cognitive biases…
• http://en.wikipedia.org/wiki/List_of_cognitive_biases

• [Mozer et al., 2010] try to learn and remove sequential effects 
  from human computation data…
Outline
•   Introduction: Human computation and crowdsourcing
•   Managing quality for simple tasks
•   Complex tasks using workflows
•   Task optimization
•   Incentivizing the crowd
•   Market design
•   Behavioral aspects and cognitive biases
•   Game design
•   Case studies
Games with a Purpose
       [Luis von Ahn and Laura Dabbish, CACM 2008]

Three generic game structures:

• Output agreement: 
  – Players type the same output
• Input agreement: 
  – Players decide if they have the same input
• Inversion problem: 
  – P1 generates output from an input
  – P2 looks at P1's output and guesses P1's input
Output Agreement: ESP Game
• Players look at a common input
• They need to agree on an output
Improvements
• Game‐theoretic analysis indicates that players 
  will converge to easy words [Jain and Parkes]
• Solution 1: Add “taboo words” to prevent players from 
  guessing easy words (see the matching sketch below)
• Solution 2: KissKissBan, where a third player tries to 
  guess (and block) the agreement
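The matching rule behind output agreement, including taboo words, fits in a few lines. The sketch below is illustrative; the guess streams and the taboo list are assumed inputs, not data from the actual game.

```python
# A minimal sketch of the ESP-style output-agreement rule: the round ends
# at the first word both players have typed, unless the word is on the
# image's taboo list. Guess streams and taboo words are illustrative.

def esp_round(guesses1, guesses2, taboo=()):
    taboo = {w.lower() for w in taboo}
    seen1, seen2 = set(), set()
    for g1, g2 in zip(guesses1, guesses2):      # guesses arrive in rounds
        for g, seen, other in ((g1, seen1, seen2), (g2, seen2, seen1)):
            g = g.lower()
            if g in taboo:
                continue                        # taboo words never match
            seen.add(g)
            if g in other:
                return g                        # agreement: a new image label
    return None

print(esp_round(["car", "boy", "hat"], ["kid", "hat", "car"]))           # 'hat'
print(esp_round(["car", "boy", "hat"], ["kid", "hat", "car"], ["hat"]))  # 'car'
```

Adding already-collected labels to the taboo set is exactly what pushes players past the easy words toward new, more specific descriptions.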
Input Agreement: TagATune
• Sometimes it is difficult to type identical outputs 
  (e.g., “describe this song”)
• Show the same or different inputs, let the users 
  describe them, and ask the players if they have the same input
Inversion Problem: Peekaboom
•   Non‐symmetric players
•   Input: an image with a word
•   Player 1 slowly reveals the picture
•   Player 2 tries to guess the word
[Peekaboom demo slides: successive “HINT” reveals until the word “BUSH” is guessed]
Protein folding
• Protein folding: Proteins fold from long chains into 
  small balls, each with a very specific shape

• The shape is the lowest‐energy configuration, which is the most 
  stable

• The fold shape is very important for understanding interactions 
  with other molecules

• Extremely expensive computationally! (too many 
  degrees of freedom)
FoldIt Game
• Humans are very good at reducing the search 
  space

• Humans try to fold the protein into a minimal‐energy 
  state. 

• Players can leave a protein unfinished and let others try 
  from there…
Outline
•   Introduction: Human computation and crowdsourcing
•   Managing quality for simple tasks
•   Complex tasks using workflows
•   Task optimization
•   Incentivizing the crowd
•   Market design
•   Behavioral aspects and cognitive biases
•   Game design
•   Case studies
Case Study: Freebase

Praveen Paritosh, Google
Crowdsourcing Case Study: AdSafe
A few of the tasks in the past
• Detect pages that discuss swine flu
  – A pharmaceutical firm had a drug “treating” (off-label) swine flu
  – The FDA prohibited the pharmaceutical company from displaying 
    the drug's ad on pages about swine flu
  – Two days to build and go live

• A big fast-food chain does not want its ad to appear:
  – On pages that discuss the brand (99% negative sentiment)
  – On pages discussing obesity
  – Three days to build and go live
Need to build models fast

     • Traditionally, modeling teams have invested substantial 
       internal resources in data formulation, information 
       extraction, cleaning, and other preprocessing
            No time for such things…
     • However, now we can outsource preprocessing tasks, such 
       as labeling, feature extraction, verifying information 
       extraction, etc.
        – using Mechanical Turk, oDesk, etc.
        – quality may be lower than expert labeling (much?) 
        – but low costs can allow massive scale
AdSafe workflow
 • Find URLs for a given topic (hate speech, gambling, alcohol 
   abuse, guns, bombs, celebrity gossip, etc.)
   http://url‐collector.appspot.com/allTopics.jsp
 • Classify URLs into appropriate categories 
   http://url‐annotator.appspot.com/AdminFiles/Categories.jsp 
 • Measure the quality of the labelers and remove spammers
   http://qmturk.appspot.com/
 • Get humans to “beat” the classifier by providing cases where 
   the classifier fails
   http://adsafe‐beatthemachine.appspot.com/
Case Study: OCR and ReCAPTCHA
Scaling Crowdsourcing: Use Machine Learning

 Need to scale crowdsourcing
 Basic idea: Build a machine learning model and use it 
  instead of humans

 [Diagram: New case → Automatic model (through machine learning) → Automatic answer.
  The model is trained on existing data obtained through crowdsourcing.]
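A minimal sketch of the basic idea, using scikit-learn: train a text classifier on crowd-collected labels and let it answer new cases automatically. The data, the model choice, and the category names are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Labels collected (and quality-controlled) through crowdsourcing;
# the pages and categories here are illustrative toy data.
pages  = ["casino poker jackpot", "flu symptoms vaccine",
          "blackjack betting odds", "cold and flu remedies"]
labels = ["gambling", "health", "gambling", "health"]

# Train once on the crowdsourced data...
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(pages, labels)

# ...then answer new cases automatically, without asking a human:
print(model.predict(["poker betting odds"])[0])   # -> "gambling"
```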
Scaling Crowdsourcing: Iterative training
 Triage:
   – machine when confident
   – humans when not confident
 Retrain using the new human input
   → improve the model
   → reduce the need for human input

 [Diagram: New case → Automatic model (through machine learning) → Automatic answer
  when confident; otherwise get human(s) to answer, and add the crowdsourced
  answers to the model's training data.]
Scaling Crowdsourcing: Iterative training, with noise
 Machine when confident, humans otherwise
 Ask as many humans as necessary to ensure quality

 [Diagram: New case → Automatic model → Automatic answer when confident about
  quality; when not confident, get human(s) to answer; the crowdsourced answers
  feed back into the training data.]
Scaling Crowdsourcing: Iterative training, with noise
 Machine when confident, humans otherwise
 Ask as many humans as necessary to ensure quality
   – Or even get other machines…

 [Diagram: as above, but when the model is not confident about quality, get
  human(s) or other machines to answer. A sketch of this triage loop follows.]
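The triage-and-retrain loop in the diagrams above might look like the following sketch, reusing the toy data from the previous sketch: answer automatically when the model's confidence clears a threshold, otherwise route the case to the crowd and fold the (aggregated) answer back into the training set. The threshold, the model choice, and the get_human_label stub are all assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

THRESHOLD = 0.60   # route a case to humans below this model confidence

# Stand-in for real crowd workers; in practice, post the case to MTurk
# with enough redundancy to ensure quality, then aggregate the answers.
HUMAN_ORACLE = {"free casino spins": "gambling",
                "poker betting odds": "gambling"}

def get_human_label(case):
    return HUMAN_ORACLE[case]   # hypothetical stub, assumed for this sketch

def triage(model, train_x, train_y, new_cases):
    answers = {}
    for case in new_cases:
        confidence = model.predict_proba([case])[0].max()
        if confidence >= THRESHOLD:
            answers[case] = model.predict([case])[0]   # automatic answer
        else:
            label = get_human_label(case)              # ask the crowd
            answers[case] = label
            train_x.append(case)
            train_y.append(label)
            model.fit(train_x, train_y)                # retrain on new input
    return answers

train_x = ["casino poker jackpot", "flu symptoms vaccine",
           "blackjack betting odds", "cold and flu remedies"]
train_y = ["gambling", "health", "gambling", "health"]
model = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(train_x, train_y)
print(triage(model, train_x, train_y, ["poker betting odds", "free casino spins"]))
```

Every human answer both resolves a hard case and shrinks the set of future cases that need a human, which is the source of the scaling.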
Example: ReCAPTCHA + Google Books

                      [ReCAPTCHA image: “portion distinguished”]

   Fixes errors of Optical Character Recognition (OCR: ~1% error rate, 
    20%‐30% for 18th‐ and 19th‐century books, according to today's NY Times article)
   Further improves the OCR algorithm, reducing the error rate
   “40 million ReCAPTCHAs a day” (2008). Fixing 40,000 books a day
     – [Unofficial quote from Luis]: 400M/day (2010)
     – All books ever written: 100 million books (~12 yrs??)
Thank you!

Managing Crowdsourced Human Computation: A Tutorial

More Related Content

What's hot

Grokking Techtalk #40: AWS’s philosophy on designing MLOps platform
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platformGrokking Techtalk #40: AWS’s philosophy on designing MLOps platform
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platformGrokking VN
 
How to use Map() Filter() and Reduce() functions in Python | Edureka
How to use Map() Filter() and Reduce() functions in Python | EdurekaHow to use Map() Filter() and Reduce() functions in Python | Edureka
How to use Map() Filter() and Reduce() functions in Python | EdurekaEdureka!
 
Client Side Monitoring With Prometheus
Client Side Monitoring With PrometheusClient Side Monitoring With Prometheus
Client Side Monitoring With PrometheusWeaveworks
 
How to monitor your micro-service with Prometheus?
How to monitor your micro-service with Prometheus?How to monitor your micro-service with Prometheus?
How to monitor your micro-service with Prometheus?Wojciech Barczyński
 
Netflix API - Presentation to PayPal
Netflix API - Presentation to PayPalNetflix API - Presentation to PayPal
Netflix API - Presentation to PayPalDaniel Jacobson
 
ML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsErik Bernhardsson
 
Introduction to matlab lecture 1 of 4
Introduction to matlab lecture 1 of 4Introduction to matlab lecture 1 of 4
Introduction to matlab lecture 1 of 4Randa Elanwar
 
Monitoring on Kubernetes using prometheus
Monitoring on Kubernetes using prometheusMonitoring on Kubernetes using prometheus
Monitoring on Kubernetes using prometheusChandresh Pancholi
 
Cheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learnCheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learnKarlijn Willems
 
Web API testing : A quick glance
Web API testing : A quick glanceWeb API testing : A quick glance
Web API testing : A quick glanceDhanalaxmi K
 
OpenTelemetry For Operators
OpenTelemetry For OperatorsOpenTelemetry For Operators
OpenTelemetry For OperatorsKevin Brockhoff
 
APIdays London 2019 - Selecting the best API Governance for your organisation...
APIdays London 2019 - Selecting the best API Governance for your organisation...APIdays London 2019 - Selecting the best API Governance for your organisation...
APIdays London 2019 - Selecting the best API Governance for your organisation...apidays
 
Cloud Observability mit Loki, Prometheus, Tempo und Grafana
Cloud Observability mit Loki, Prometheus, Tempo und GrafanaCloud Observability mit Loki, Prometheus, Tempo und Grafana
Cloud Observability mit Loki, Prometheus, Tempo und GrafanaQAware GmbH
 
Logging, Metrics, and APM: The Operations Trifecta (P)
Logging, Metrics, and APM: The Operations Trifecta (P)Logging, Metrics, and APM: The Operations Trifecta (P)
Logging, Metrics, and APM: The Operations Trifecta (P)Elasticsearch
 
Matlab-Data types and operators
Matlab-Data types and operatorsMatlab-Data types and operators
Matlab-Data types and operatorsLuckshay Batra
 
OpenTelemetry: From front- to backend (2022)
OpenTelemetry: From front- to backend (2022)OpenTelemetry: From front- to backend (2022)
OpenTelemetry: From front- to backend (2022)Sebastian Poxhofer
 
Top 10 Dying Programming Languages in 2020 | Edureka
Top 10 Dying Programming Languages in 2020 | EdurekaTop 10 Dying Programming Languages in 2020 | Edureka
Top 10 Dying Programming Languages in 2020 | EdurekaEdureka!
 
What are Tableau Functions? Edureka
What are Tableau Functions? EdurekaWhat are Tableau Functions? Edureka
What are Tableau Functions? EdurekaEdureka!
 

What's hot (20)

Grokking Techtalk #40: AWS’s philosophy on designing MLOps platform
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platformGrokking Techtalk #40: AWS’s philosophy on designing MLOps platform
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platform
 
How to use Map() Filter() and Reduce() functions in Python | Edureka
How to use Map() Filter() and Reduce() functions in Python | EdurekaHow to use Map() Filter() and Reduce() functions in Python | Edureka
How to use Map() Filter() and Reduce() functions in Python | Edureka
 
Client Side Monitoring With Prometheus
Client Side Monitoring With PrometheusClient Side Monitoring With Prometheus
Client Side Monitoring With Prometheus
 
How to monitor your micro-service with Prometheus?
How to monitor your micro-service with Prometheus?How to monitor your micro-service with Prometheus?
How to monitor your micro-service with Prometheus?
 
Netflix API - Presentation to PayPal
Netflix API - Presentation to PayPalNetflix API - Presentation to PayPal
Netflix API - Presentation to PayPal
 
ML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive Analytics
 
Grafana
GrafanaGrafana
Grafana
 
Introduction to matlab lecture 1 of 4
Introduction to matlab lecture 1 of 4Introduction to matlab lecture 1 of 4
Introduction to matlab lecture 1 of 4
 
Monitoring on Kubernetes using prometheus
Monitoring on Kubernetes using prometheusMonitoring on Kubernetes using prometheus
Monitoring on Kubernetes using prometheus
 
MATLAB INTRODUCTION
MATLAB INTRODUCTIONMATLAB INTRODUCTION
MATLAB INTRODUCTION
 
Cheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learnCheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learn
 
Web API testing : A quick glance
Web API testing : A quick glanceWeb API testing : A quick glance
Web API testing : A quick glance
 
OpenTelemetry For Operators
OpenTelemetry For OperatorsOpenTelemetry For Operators
OpenTelemetry For Operators
 
APIdays London 2019 - Selecting the best API Governance for your organisation...
APIdays London 2019 - Selecting the best API Governance for your organisation...APIdays London 2019 - Selecting the best API Governance for your organisation...
APIdays London 2019 - Selecting the best API Governance for your organisation...
 
Cloud Observability mit Loki, Prometheus, Tempo und Grafana
Cloud Observability mit Loki, Prometheus, Tempo und GrafanaCloud Observability mit Loki, Prometheus, Tempo und Grafana
Cloud Observability mit Loki, Prometheus, Tempo und Grafana
 
Logging, Metrics, and APM: The Operations Trifecta (P)
Logging, Metrics, and APM: The Operations Trifecta (P)Logging, Metrics, and APM: The Operations Trifecta (P)
Logging, Metrics, and APM: The Operations Trifecta (P)
 
Matlab-Data types and operators
Matlab-Data types and operatorsMatlab-Data types and operators
Matlab-Data types and operators
 
OpenTelemetry: From front- to backend (2022)
OpenTelemetry: From front- to backend (2022)OpenTelemetry: From front- to backend (2022)
OpenTelemetry: From front- to backend (2022)
 
Top 10 Dying Programming Languages in 2020 | Edureka
Top 10 Dying Programming Languages in 2020 | EdurekaTop 10 Dying Programming Languages in 2020 | Edureka
Top 10 Dying Programming Languages in 2020 | Edureka
 
What are Tableau Functions? Edureka
What are Tableau Functions? EdurekaWhat are Tableau Functions? Edureka
What are Tableau Functions? Edureka
 

Similar to Managing Crowdsourced Human Computation: A Tutorial

[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...npinto
 
Game theory for neural networks
Game theory for neural networksGame theory for neural networks
Game theory for neural networksDavid Balduzzi
 
Artificial intelligence: Simulation of Intelligence
Artificial intelligence: Simulation of IntelligenceArtificial intelligence: Simulation of Intelligence
Artificial intelligence: Simulation of IntelligenceAbhishek Upadhyay
 
Utah Code Camp 2014 - Learning from Data by Thomas Holloway
Utah Code Camp 2014 - Learning from Data by Thomas HollowayUtah Code Camp 2014 - Learning from Data by Thomas Holloway
Utah Code Camp 2014 - Learning from Data by Thomas HollowayThomas Holloway
 
Human computation and participatory systems
Human computation and participatory systems Human computation and participatory systems
Human computation and participatory systems Piero Fraternali
 
Intro xmania counting
 Intro xmania counting Intro xmania counting
Intro xmania countingkdtanker
 
Assessing computational thinking
Assessing computational thinkingAssessing computational thinking
Assessing computational thinkingDaniel Duckworth
 
An Introduction to Machine Learning
An Introduction to Machine LearningAn Introduction to Machine Learning
An Introduction to Machine LearningAngelo Simone Scotto
 
Interview questions slide deck
Interview questions slide deckInterview questions slide deck
Interview questions slide deckMikeBegley
 
Understanding Basics of Machine Learning
Understanding Basics of Machine LearningUnderstanding Basics of Machine Learning
Understanding Basics of Machine LearningPranav Ainavolu
 
06-01 Machine Learning and Linear Regression.pptx
06-01 Machine Learning and Linear Regression.pptx06-01 Machine Learning and Linear Regression.pptx
06-01 Machine Learning and Linear Regression.pptxSaharA84
 
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015Codemotion
 
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...Jonathan Stray
 
Introduction to algorithmic aspect of auction theory
Introduction to algorithmic aspect of auction theoryIntroduction to algorithmic aspect of auction theory
Introduction to algorithmic aspect of auction theoryAbner Chih Yi Huang
 

Similar to Managing Crowdsourced Human Computation: A Tutorial (20)

[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
 
Game theory for neural networks
Game theory for neural networksGame theory for neural networks
Game theory for neural networks
 
14 turing wics
14 turing wics14 turing wics
14 turing wics
 
Artificial intelligence: Simulation of Intelligence
Artificial intelligence: Simulation of IntelligenceArtificial intelligence: Simulation of Intelligence
Artificial intelligence: Simulation of Intelligence
 
Utah Code Camp 2014 - Learning from Data by Thomas Holloway
Utah Code Camp 2014 - Learning from Data by Thomas HollowayUtah Code Camp 2014 - Learning from Data by Thomas Holloway
Utah Code Camp 2014 - Learning from Data by Thomas Holloway
 
Human computation and participatory systems
Human computation and participatory systems Human computation and participatory systems
Human computation and participatory systems
 
DeepLearning
DeepLearningDeepLearning
DeepLearning
 
Intro xmania counting
 Intro xmania counting Intro xmania counting
Intro xmania counting
 
Artificial Intelligence Literacy
Artificial Intelligence LiteracyArtificial Intelligence Literacy
Artificial Intelligence Literacy
 
AI.ppt
AI.pptAI.ppt
AI.ppt
 
Assessing computational thinking
Assessing computational thinkingAssessing computational thinking
Assessing computational thinking
 
Artificial intelligence
Artificial intelligenceArtificial intelligence
Artificial intelligence
 
An Introduction to Machine Learning
An Introduction to Machine LearningAn Introduction to Machine Learning
An Introduction to Machine Learning
 
Interview questions slide deck
Interview questions slide deckInterview questions slide deck
Interview questions slide deck
 
Ml ppt at
Ml ppt atMl ppt at
Ml ppt at
 
Understanding Basics of Machine Learning
Understanding Basics of Machine LearningUnderstanding Basics of Machine Learning
Understanding Basics of Machine Learning
 
06-01 Machine Learning and Linear Regression.pptx
06-01 Machine Learning and Linear Regression.pptx06-01 Machine Learning and Linear Regression.pptx
06-01 Machine Learning and Linear Regression.pptx
 
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
 
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
 
Introduction to algorithmic aspect of auction theory
Introduction to algorithmic aspect of auction theoryIntroduction to algorithmic aspect of auction theory
Introduction to algorithmic aspect of auction theory
 

More from Panos Ipeirotis

Quizz: Targeted Crowdsourcing with a Billion (Potential) Users
Quizz: Targeted Crowdsourcing with a Billion (Potential) UsersQuizz: Targeted Crowdsourcing with a Billion (Potential) Users
Quizz: Targeted Crowdsourcing with a Billion (Potential) UsersPanos Ipeirotis
 
Humanities and Technology Unite
Humanities and Technology UniteHumanities and Technology Unite
Humanities and Technology UnitePanos Ipeirotis
 
The Market for Intellect: Discovering economically-rewarding education paths
The Market for Intellect: Discovering economically-rewarding education pathsThe Market for Intellect: Discovering economically-rewarding education paths
The Market for Intellect: Discovering economically-rewarding education pathsPanos Ipeirotis
 
On Mice and Men: The Role of Biology in Crowdsourcing
On Mice and Men: The Role of Biology in CrowdsourcingOn Mice and Men: The Role of Biology in Crowdsourcing
On Mice and Men: The Role of Biology in CrowdsourcingPanos Ipeirotis
 
Crowdsourcing using Mechanical Turk: Quality Management and Scalability
Crowdsourcing using Mechanical Turk: Quality Management and ScalabilityCrowdsourcing using Mechanical Turk: Quality Management and Scalability
Crowdsourcing using Mechanical Turk: Quality Management and ScalabilityPanos Ipeirotis
 
Big Data, Stupid Decisions / Strata Jumpstart 2011 / Panos Ipeirotis / http:/...
Big Data, Stupid Decisions / Strata Jumpstart 2011 / Panos Ipeirotis / http:/...Big Data, Stupid Decisions / Strata Jumpstart 2011 / Panos Ipeirotis / http:/...
Big Data, Stupid Decisions / Strata Jumpstart 2011 / Panos Ipeirotis / http:/...Panos Ipeirotis
 
Crowdsourcing: Lessons from Henry Ford
Crowdsourcing: Lessons from Henry FordCrowdsourcing: Lessons from Henry Ford
Crowdsourcing: Lessons from Henry FordPanos Ipeirotis
 
New York Mechanical Turk Meetup
New York Mechanical Turk MeetupNew York Mechanical Turk Meetup
New York Mechanical Turk MeetupPanos Ipeirotis
 

More from Panos Ipeirotis (8)

Quizz: Targeted Crowdsourcing with a Billion (Potential) Users
Quizz: Targeted Crowdsourcing with a Billion (Potential) UsersQuizz: Targeted Crowdsourcing with a Billion (Potential) Users
Quizz: Targeted Crowdsourcing with a Billion (Potential) Users
 
Humanities and Technology Unite
Humanities and Technology UniteHumanities and Technology Unite
Humanities and Technology Unite
 
The Market for Intellect: Discovering economically-rewarding education paths
The Market for Intellect: Discovering economically-rewarding education pathsThe Market for Intellect: Discovering economically-rewarding education paths
The Market for Intellect: Discovering economically-rewarding education paths
 
On Mice and Men: The Role of Biology in Crowdsourcing
On Mice and Men: The Role of Biology in CrowdsourcingOn Mice and Men: The Role of Biology in Crowdsourcing
On Mice and Men: The Role of Biology in Crowdsourcing
 
Crowdsourcing using Mechanical Turk: Quality Management and Scalability
Crowdsourcing using Mechanical Turk: Quality Management and ScalabilityCrowdsourcing using Mechanical Turk: Quality Management and Scalability
Crowdsourcing using Mechanical Turk: Quality Management and Scalability
 
Big Data, Stupid Decisions / Strata Jumpstart 2011 / Panos Ipeirotis / http:/...
Big Data, Stupid Decisions / Strata Jumpstart 2011 / Panos Ipeirotis / http:/...Big Data, Stupid Decisions / Strata Jumpstart 2011 / Panos Ipeirotis / http:/...
Big Data, Stupid Decisions / Strata Jumpstart 2011 / Panos Ipeirotis / http:/...
 
Crowdsourcing: Lessons from Henry Ford
Crowdsourcing: Lessons from Henry FordCrowdsourcing: Lessons from Henry Ford
Crowdsourcing: Lessons from Henry Ford
 
New York Mechanical Turk Meetup
New York Mechanical Turk MeetupNew York Mechanical Turk Meetup
New York Mechanical Turk Meetup
 

Recently uploaded

“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdfMuhammad Subhan
 
Revolutionizing SAP® Processes with Automation and Artificial Intelligence
Revolutionizing SAP® Processes with Automation and Artificial IntelligenceRevolutionizing SAP® Processes with Automation and Artificial Intelligence
Revolutionizing SAP® Processes with Automation and Artificial IntelligencePrecisely
 
JavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuideJavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuidePixlogix Infotech
 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxFIDO Alliance
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingScyllaDB
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxFIDO Alliance
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe中 央社
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxFIDO Alliance
 
Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...panagenda
 
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Paige Cruz
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfSrushith Repakula
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024Lorenzo Miniero
 
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...ScyllaDB
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfdanishmna97
 
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?Paolo Missier
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPTiSEO AI
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftshyamraj55
 
ChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityVictorSzoltysek
 

Recently uploaded (20)

“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
 
Revolutionizing SAP® Processes with Automation and Artificial Intelligence
Revolutionizing SAP® Processes with Automation and Artificial IntelligenceRevolutionizing SAP® Processes with Automation and Artificial Intelligence
Revolutionizing SAP® Processes with Automation and Artificial Intelligence
 
JavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuideJavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate Guide
 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptx
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream Processing
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptx
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
 
Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdf
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024
 
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cf
 
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoft
 
ChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps Productivity
 

Managing Crowdsourced Human Computation: A Tutorial

  • 8. Human Computation, Round 2 • Now we need humans again for the “AI‐complete” tasks – Tag images [ESP Game: von Ahn and Dabbish 2004, ImageNet] – Determine if a page is relevant [Alonso et al., 2011] – Determine song genre – Check a page for offensive content – … ImageNet: http://www.image‐net.org/about‐publication
  • 9. Focus of the tutorial: Examine cases where humans interact with computers in order to solve a computational problem (usually one too hard to be solved by computers alone)
  • 10. Crowdsourcing and human computation • Crowdsourcing: From macro to micro – Netflix, Innocentive – Quirky, Threadless – oDesk, Guru, eLance, vWorker – Wikipedia et al. – ESP Game, FoldIt, Phylo, … – Mechanical Turk, CloudCrowd, … • Crowdsourcing greatly facilitates human computation (but they are not equivalent)
  • 11. Micro‐Crowdsourcing Example: Labeling Images using the ESP Game [Luis von Ahn, MacArthur Fellowship "genius grant"] • Two‐player online game • Partners don’t know each other and can’t communicate • Object of the game: type the same word • The only thing in common is an image
  • 12.
  • 13. PLAYER 1 PLAYER 2 GUESSING: CAR GUESSING: BOY GUESSING: HAT GUESSING: CAR GUESSING: KID SUCCESS! YOU AGREE ON CAR
  • 15.
  • 16. Demographics of MTurk workers http://bit.ly/mturk‐demographics Country of residence: • United States: 46.80% • India: 34.00% • Miscellaneous: 19.20%
  • 17. Demographics of MTurk workers http://bit.ly/mturk‐demographics
  • 18. Demographics of MTurk workers http://bit.ly/mturk‐demographics
  • 19. Outline • Introduction: Human computation and crowdsourcing • Managing quality for simple tasks • Complex tasks using workflows • Task optimization • Incentivizing the crowd • Market design • Behavioral aspects and cognitive biases • Game design • Case studies
  • 20. Managing quality for simple tasks • Quality through redundancy: Combining votes – Majority vote – Quality‐adjusted vote – Managing dependencies • Quality through gold data • Estimating worker quality (Redundancy + Gold) • Joint estimation of worker quality and difficulty • Active data collection
  • 21. Example: Build an “Adult Web Site” Classifier • Need a large number of hand‐labeled sites • Get people to look at sites and classify them as: G (general audience), PG (parental guidance), R (restricted), X (porn)
  • 22.
  • 23. Example: Build an “Adult Web Site” Classifier • Need a large number of hand‐labeled sites • Get people to look at sites and classify them as: G (general audience), PG (parental guidance), R (restricted), X (porn) • Cost/Speed Statistics – Undergrad intern: 200 websites/hr, cost: $15/hr
  • 24. Example: Build an “Adult Web Site” Classifier • Need a large number of hand‐labeled sites • Get people to look at sites and classify them as: G (general audience), PG (parental guidance), R (restricted), X (porn) • Cost/Speed Statistics – Undergrad intern: 200 websites/hr, cost: $15/hr – MTurk: 2500 websites/hr, cost: $12/hr
  • 25. Bad news: Spammers!  Worker ATAMRO447HWJQ labeled X (porn) sites as G (general audience)
  • 26. Majority Voting and Label Quality • Ask multiple labelers, keep the majority label as the “true” label • Quality is the probability of being correct. [Figure: quality of the majority vote vs. number of labelers (1–13), one curve per individual labeler accuracy p = 0.4 … 1.0, where p is the probability of an individual labeler being correct; binary classification.] Kuncheva et al., PA&A, 2003
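Under the slide's independent-labelers assumption, the majority-vote quality curves can be reproduced with a short binomial computation. A minimal sketch (function name and the coin-flip tie-breaking rule are mine, not from the slide):

```python
from math import comb

def majority_vote_quality(p: float, n: int) -> float:
    """Probability that the majority of n independent binary labelers,
    each correct with probability p, yields the correct label.
    Ties (possible for even n) are broken by a coin flip."""
    q = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))
    if n % 2 == 0:
        k = n // 2
        q += 0.5 * comb(n, k) * p**k * (1 - p)**(n - k)
    return q

for p in (0.6, 0.8, 0.9):
    print(p, [round(majority_vote_quality(p, n), 3) for n in (1, 3, 5, 11)])
# Quality rises with n when p > 0.5, stays flat at 0.5 when p = 0.5,
# and degrades when p < 0.5 -- matching the figure described above.
```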
  • 27. What if the qualities of workers are different? 3 workers, qualities: p‐d, p, p+d. [Figure: region where majority is better.] • Majority vote works best when workers have similar quality • Otherwise better to just pick the vote of the best worker • …or model worker qualities and combine [coming next]
  • 28. Combining votes with different quality Clemen and Winkler, 1990
  • 29. What happens if we have dependencies? Clemen and Winkler, 1985 • Positive dependencies decrease the number of effective labelers
  • 30. What happens if we have dependencies? Yule’s Q measure of correlation. Kuncheva et al., PA&A, 2003 • Positive dependencies decrease the number of effective labelers • Negative dependencies can improve results (unlikely that both workers are wrong at the same time)
  • 31. Vote combination: Meta‐studies • Simple averages tend to work well • Complex models slightly better but less robust [Clemen and Winkler, 1999; Ariely et al. 2000]
  • 32. From aggregate labels to worker quality • Look at our spammer friend ATAMRO447HWJQ together with the other 9 workers • After aggregation, we compute a confusion matrix for each worker • After majority vote, confusion matrix for ATAMRO447HWJQ: P[G → G]=100% P[G → X]=0% P[X → G]=100% P[X → X]=0%
  • 33. Algorithm of Dawid & Skene, 1979: an iterative process to estimate worker error rates. 1. Initialize by aggregating labels for each object (e.g., use majority vote) 2. Estimate the confusion matrix for each worker (using aggregate labels) 3. Estimate aggregate labels (using the confusion matrices) • Keep labels for “gold data” unchanged 4. Go to Step 2 and iterate until convergence. Confusion matrix for our friend ATAMRO447HWJQ, who marked almost all sites as G: P[G → G]=99.947% P[G → X]=0.053% P[X → G]=99.153% P[X → X]=0.847% Seems like a spammer…
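A minimal sketch of the Dawid & Skene iteration for readers who want to experiment; the soft majority-vote initialization, smoothing constant, and fixed iteration count are my choices (the original iterates to convergence, and gold labels would simply be clamped in T):

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50):
    """labels: dict mapping (worker, item) -> reported class in 0..n_classes-1.
    Returns (soft item labels, per-worker confusion matrices)."""
    workers = sorted({w for w, _ in labels}); items = sorted({i for _, i in labels})
    widx = {w: k for k, w in enumerate(workers)}; iidx = {i: k for k, i in enumerate(items)}
    # Step 1: initialize aggregate labels by (soft) majority vote.
    T = np.zeros((len(items), n_classes))
    for (w, i), l in labels.items():
        T[iidx[i], l] += 1
    T /= T.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Step 2: estimate each worker's confusion matrix from current labels.
        conf = np.full((len(workers), n_classes, n_classes), 1e-6)
        for (w, i), l in labels.items():
            conf[widx[w], :, l] += T[iidx[i]]
        conf /= conf.sum(axis=2, keepdims=True)
        # Step 3: re-estimate aggregate labels from the confusion matrices.
        logT = np.tile(np.log(T.mean(axis=0) + 1e-12), (len(items), 1))  # class priors
        for (w, i), l in labels.items():
            logT[iidx[i]] += np.log(conf[widx[w], :, l])
        T = np.exp(logT - logT.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)
    return T, conf
```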
  • 34. And many variations… • van der Linden et al, 1997: Item‐Response Theory • Uebersax, Biostatistics 1993: Ordered categories • Uebersax, JASA 1993: Ordered categories, with worker expertise and bias, item difficulty • Carpenter, 2008: Hierarchical Bayesian versions. And more recently at NIPS: • Whitehill et al., 2009: Adding item difficulty • Welinder et al., 2010: Adding worker expertise
  • 35. Challenge: From Confusion Matrices to Quality Scores • All the algorithms will generate “confusion matrices” for workers • Confusion matrix for ATAMRO447HWJQ: P[X → X]=0.847% P[X → G]=99.153% P[G → X]=0.053% P[G → G]=99.947% • How to check if a worker is a spammer using the confusion matrix? (hint: the error rate is not enough)
  • 36. Challenge 1: Spammers are lazy and smart! Confusion matrix for a spammer: P[X → X]=0%, P[X → G]=100%, P[G → X]=0%, P[G → G]=100%. Confusion matrix for a good worker: P[X → X]=80%, P[X → G]=20%, P[G → X]=20%, P[G → G]=80%. • Spammers figure out how to fly under the radar… • In reality, we have 85% G sites and 15% X sites • Error rate of spammer = 0% * 85% + 100% * 15% = 15% • Error rate of good worker = 20% * 85% + 20% * 15% = 20% • False negatives: Spam workers pass as legitimate
  • 37. Challenge 2: Humans are biased! Error rates for the CEO of AdSafe: P[G → G]=20.0% P[G → P]=80.0% P[G → R]=0.0% P[G → X]=0.0% | P[P → G]=0.0% P[P → P]=0.0% P[P → R]=100.0% P[P → X]=0.0% | P[R → G]=0.0% P[R → P]=0.0% P[R → R]=100.0% P[R → X]=0.0% | P[X → G]=0.0% P[X → P]=0.0% P[X → R]=0.0% P[X → X]=100.0% • In reality, we have 85% G sites, 5% P sites, 5% R sites, 5% X sites • Error rate of spammer (all in G) = 0% * 85% + 100% * 15% = 15% • Error rate of biased worker = 80% * 85% + 100% * 5% = 73% • False positives: Legitimate workers appear to be spammers
  • 38. Solution: Reverse errors first, compute the error rate afterwards. Error rates for the CEO of AdSafe: [same matrix as above] • When the biased worker says G, it is 100% G • When the biased worker says P, it is 100% G • When the biased worker says R, it is 50% P, 50% R • When the biased worker says X, it is 100% X • Small ambiguity for “R‐rated” votes but other than that, fine!
  • 39. Solution: Reverse errors first, compute the error rate afterwards. Error rates for spammer ATAMRO447HWJQ: P[G → G]=100.0% P[G → P]=0.0% P[G → R]=0.0% P[G → X]=0.0% | P[P → G]=100.0% P[P → P]=0.0% P[P → R]=0.0% P[P → X]=0.0% | P[R → G]=100.0% P[R → P]=0.0% P[R → R]=0.0% P[R → X]=0.0% | P[X → G]=100.0% P[X → P]=0.0% P[X → R]=0.0% P[X → X]=0.0% • Whatever the spammer says (G, P, R, or X), it is 25% G, 25% P, 25% R, 25% X [note: assume equal priors] • The results are highly ambiguous. No information provided!
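The “reverse the errors” step is just Bayes' rule applied to the worker's confusion matrix. A small sketch (helper name mine; priors assumed equal, as the slide notes):

```python
import numpy as np

def posterior_soft_label(conf, priors, reported):
    """P(true class | worker reported class `reported`).
    conf[t, l] = P(worker says l | true class is t); priors[t] = P(true class t)."""
    post = priors * conf[:, reported]
    return post / post.sum()

priors = np.array([0.25, 0.25, 0.25, 0.25])        # equal priors over G, P, R, X
spammer = np.array([[1.0, 0, 0, 0]] * 4)           # says G whatever the truth is
print(posterior_soft_label(spammer, priors, 0))    # -> [0.25 0.25 0.25 0.25]: no information
```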
  • 40. Quality Scores • High cost when “soft” labels have probability spread across classes • Low cost when “soft” labels have probability mass concentrated in one class Assigned Label “Soft” Label Cost G <G: 25%, P: 25%, R: 25%, X: 25%> 0.75 G <G: 99%, P: 1%, R: 0%, X: 0%> 0.0198 [***Assume equal misclassification costs] Ipeirotis, Provost, Wang, HCOMP 2010
  • 41. Quality Score • A spammer is a worker who always assigns labels randomly, regardless of what the true class is. QualityScore = 1 − ExpCost(Worker)/ExpCost(Spammer) • QualityScore is useful for the purpose of blocking bad workers and rewarding good ones • Essentially a multi‐class, cost‐sensitive AUC metric (AUC = area under the ROC curve)
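With equal misclassification costs, the expected cost of a soft label reduces to 1 − Σ p_i², which reproduces the numbers on the previous slide. A hedged sketch of the QualityScore computation (function names mine):

```python
import numpy as np

def expected_cost(soft_label, cost=None):
    """Expected cost if the assigned class is drawn from the soft label;
    unit cost for every misclassification unless a cost matrix is given."""
    p = np.asarray(soft_label, dtype=float)
    if cost is None:
        cost = 1.0 - np.eye(len(p))   # 0 on the diagonal, 1 elsewhere
    return float(p @ cost @ p)

spammer_cost = expected_cost([0.25, 0.25, 0.25, 0.25])   # 0.75, as on the slide
worker_cost = expected_cost([0.99, 0.01, 0.0, 0.0])      # 0.0198

quality_score = 1.0 - worker_cost / spammer_cost
print(quality_score)   # ~0.974: far from spammer territory
```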
  • 42. What about Gold testing? Naturally integrated into the latent class model: 1. Initialize by aggregating labels for each object (e.g., use majority vote) 2. Estimate error rates for workers (using aggregate labels) 3. Estimate aggregate labels (using error rates, weighting worker votes according to quality) • Keep labels for “gold data” unchanged 4. Go to Step 2 and iterate until convergence
  • 43. Gold Testing • 3 labels per example • 2 categories, 50/50 • Quality range: 0.55:0.05:1.0 • 200 labelers. No significant advantage under “good conditions” (balanced datasets, good worker quality) http://bit.ly/gold‐or‐repeated Wang, Ipeirotis, Provost, WCBI 2011
  • 44. Gold Testing • 5 labels per example • 2 categories, 50/50 • Quality range: 0.55:1.0 • 200 labelers. No significant advantage under “good conditions” (balanced datasets, good worker quality)
  • 45. Gold Testing • 10 labels per example • 2 categories, 50/50 • Quality range: 0.55:1.0 • 200 labelers. No significant advantage under “good conditions” (balanced datasets, good worker quality)
  • 46. Gold Testing • 10 labels per example • 2 categories, 90/10 • Quality range: 0.55:1.0 • 200 labelers. Advantage under imbalanced datasets
  • 47. Gold Testing • 5 labels per example • 2 categories, 50/50 • Quality range: 0.55:0.65 • 200 labelers. Advantage with bad worker quality
  • 48. Gold Testing • 10 labels per example • 2 categories, 90/10 • Quality range: 0.55:0.65 • 200 labelers. Significant advantage under “bad conditions” (imbalanced datasets, bad worker quality)
  • 49. Testing workers • An exploration‐exploitation scheme: – Explore: Learn about the quality of the workers – Exploit: Label new examples using the quality
  • 50. Testing workers • An exploration‐exploitation scheme: – Assign gold labels when the benefit of learning a worker’s quality better outweighs the loss from labeling a gold (known‐label) example [Wang et al, WCBI 2011] – Assign an already‐labeled example (by other workers) and see if it agrees with the majority [Donmez et al., KDD 2009] – If worker quality changes over time, assume accuracy given by an HMM and φ(τ) = φ(τ‐1) + Δ [Donmez et al., SDM 2010]
  • 51. Example: Build an “Adult Web Site” Classifier • Get people to look at sites and classify them as: G (general audience), PG (parental guidance), R (restricted), X (porn) • But we are not going to label the whole Internet… Expensive. Slow.
  • 52. Integrating with Machine Learning • Crowdsourcing is cheap but not free – Cannot scale to the web without help • Solution: Build automatic classification models using crowdsourced data
  • 53. Simple solution • Humans label training data • Use training data to build a model. [Diagram: data from existing crowdsourced answers → automatic model (through machine learning) → automatic answer for each new case]
  • 54. Quality and Classification Performance • Noisy labels lead to degraded task performance: as labeling quality increases, classification quality increases. [Figure: AUC vs. number of training examples on the “Mushroom” data set, one curve per single‐labeler quality (the probability of assigning a binary label correctly): 100%, 80%, 60%, 50%.] http://bit.ly/gold‐or‐repeated Sheng, Provost, Ipeirotis, KDD 2008
  • 55. Tradeoffs for Machine Learning Models • Get more data → improve model accuracy • Improve data quality → improve classification. [Figure: accuracy vs. number of examples (Mushroom), curves for data quality 100%, 80%, 60%, 50%.]
  • 56. Tradeoffs for Machine Learning Models • Get more data: Active Learning, select which unlabeled example to label [Settles, http://active‐learning.net/] • Improve data quality: Repeated Labeling, label again an already‐labeled example [Sheng et al. 2008; Ipeirotis et al, 2010]
  • 57. Scaling Crowdsourcing: Iterative training • Use the model when confident, humans otherwise • Retrain with new human input → improve model → reduce need for humans. [Diagram: new case → automatic model (through machine learning) → automatic answer if confident; otherwise get human(s) to answer and add to the data from existing crowdsourced answers]
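A minimal sketch of this triage loop, with an sklearn-style model interface; the 0.9 threshold and the `humans.answer` call are illustrative stand-ins, not part of any real API:

```python
def triage(model, case, humans, training_data, threshold=0.9):
    """Answer automatically when the model is confident; otherwise ask the
    crowd, add the answer to the training set, and retrain."""
    probs = model.predict_proba([case])[0]
    if probs.max() >= threshold:
        return int(probs.argmax())          # confident: no human needed
    label = humans.answer(case)             # hypothetical crowd interface
    training_data.append((case, label))
    X, y = zip(*training_data)
    model.fit(list(X), list(y))             # retrain: fewer human calls over time
    return label
```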
  • 58. Rule of Thumb Results • With high‐quality labelers (80% and above): one worker per case (more data is better) • With low‐quality labelers (~60%): multiple workers per case (to improve quality) [Sheng et al, KDD 2008; Kumar and Lease, CSDM 2011]
  • 59. Dawid & Skene meets a Classifier • [Raykar et al. JMLR 2010]: Use the Dawid & Skene scheme but add a classifier as an additional worker • The classifier in each iteration learns from the consensus labeling
  • 60. Selective Repeated‐Labeling • We do not need to label everything the same number of times • Key observation: we have additional information to guide the selection of data for repeated labeling, namely the current multiset of labels • Example: {+,‐,+,‐,‐,+} vs. {+,+,+,+,+,+}
  • 61. Label Uncertainty: Focus on uncertainty • If we know worker qualities, we can estimate the log‐odds for each example • Assign labels first to examples that are most uncertain (log‐odds close to 0 for the binary case)
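For independent workers with known accuracies, the log-odds are the prior log-odds plus a ±log(p/(1−p)) term per vote; values near zero mean the example is most uncertain and should get the next label. A small sketch under those assumptions:

```python
import math

def label_log_odds(votes, prior_pos=0.5):
    """votes: list of (vote, accuracy) pairs with vote = +1 or -1.
    Returns the log-odds that the true binary label is positive."""
    llo = math.log(prior_pos / (1 - prior_pos))
    for vote, p in votes:
        llo += vote * math.log(p / (1 - p))
    return llo

print(label_log_odds([(+1, .8), (-1, .8), (+1, .8), (-1, .8), (-1, .8), (+1, .8)]))  # 0.0: uncertain
print(label_log_odds([(+1, .8)] * 6))  # ~8.3: confident, low priority for relabeling
```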
  • 62. Model Uncertainty (MU) [Figure: clusters of + and − examples; examples that cause model uncertainty; a “self‐healing” process.] • Learning models of the data provides an alternative source of information about label certainty • Model uncertainty: get more labels for instances that cause model uncertainty [Brodley et al, JAIR 1999; Ipeirotis et al, NYU 2010] • Intuition? – for modeling: why improve training data quality where the model already is certain? – for data quality: low‐certainty “regions” may be due to incorrect labeling of the corresponding instances
  • 63. Adult content classification [Figure: Round Robin vs. Selective labeling]
  • 64. Too much theory? Open source implementation available at: http://code.google.com/p/get‐another‐label/ • Input: – Labels from Mechanical Turk – Cost of incorrect labelings (e.g., X→G costlier than G→X) • Output: – Corrected labels – Worker error rates – Ranking of workers according to their quality
  • 65. Learning from imperfect data • With inherently noisy data, it is good to have learning algorithms that are robust to noise [Figure: accuracy vs. number of examples (Mushroom)] • Or use techniques designed to handle explicitly noisy data [Lugosi 1992; Smyth, 1995, 1996]
  • 66. Outline • Introduction: Human computation and crowdsourcing • Managing quality for simple tasks • Complex tasks using workflows • Task optimization • Incentivizing the crowd • Market design • Behavioral aspects and cognitive biases • Game design • Case studies
  • 67. How to handle free‐form answers? • Q: “My task does not have discrete answers….” • A: Break into two HITs: – “Create” HIT – “Vote” HIT. Example: Collect URLs. The Creation HIT (e.g., find a URL about a topic) feeds a Voting HIT (correct or not?) • The Vote HIT controls the quality of the Creation HIT • Redundancy controls the quality of the Voting HIT • Catch: If “creation” is very good, in voting workers just vote “yes” – Solution: Add some random noise (e.g., add typos)
  • 68. But my free‐form answer is not just right or wrong… • “Create” HIT (e.g., describe the image) • “Improve” HIT (e.g., improve the description) • “Compare” HIT (voting: which is better?) TurkIt toolkit [Little et al., UIST 2010]: http://groups.csail.mit.edu/uid/turkit/
  • 69. version 1: A parial view of a pocket calculator together with some coins and a pen. version 2: A view of personal items a calculator, and some gold and copper coins, and a round tip pen, these are all pocket and wallet sized item used for business, writting, calculating prices or solving math problems and purchasing items. version 3: A close‐up photograph of the following items: A CASIO multi‐function calculator. A ball point pen, uncapped. Various coins, apparently European, both copper and gold. Seems to be a theme illustration for a brochure or document cover treating finance, probably personal finance. version 4: …Various British coins; two of £1 value, three of 20p value and one of 1p value. … version 8: “A close‐up photograph of the following items: A CASIO multi‐function, solar powered scientific calculator. A blue ball point pen with a blue rubber grip and the tip extended. Six British coins; two of £1 value, three of 20p value and one of 1p value. Seems to be a theme illustration for a brochure or document cover treating finance ‐ probably personal finance."
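The version history above is produced by a create/improve/compare loop. A minimal sketch of that loop; `create_hit`, `improve_hit`, and `compare_hit` are hypothetical wrappers (TurkIt's actual primitives differ), each posting one HIT and blocking until a worker answers:

```python
def iterative_description(image, rounds=8, n_votes=3):
    """Iteratively improve a text via the crowd, keeping a candidate
    rewrite only when a majority of comparison votes prefers it."""
    best = create_hit(f"Describe this image: {image}")
    for _ in range(rounds - 1):
        candidate = improve_hit(image, best)   # one worker edits the current text
        votes = [compare_hit(image, best, candidate) for _ in range(n_votes)]
        if votes.count("candidate") > n_votes / 2:
            best = candidate                   # accepted: becomes the next version
    return best
```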
  • 70. Independence or Not? • Building iteratively (lack of independence) allows better outcomes for the image description task… • In the FoldIt game, workers built on each other’s results [Little et al, HCOMP 2010]
  • 71. Independence or Not? • But lack of independence may cause high dependence on starting conditions and create groupthink • …but also prevents disasters [Little et al, HCOMP 2010]
  • 72. Independence or Not? Collective Problem Solving • Exploration / exploitation tradeoff  – Can accelerate learning, by sharing good solutions – But can lead to premature convergence on  suboptimal solution [Mason and Watts, submitted to Science, 2011]
  • 73.
  • 74.
  • 75.
  • 76.
  • 77. Individual search strategy affects group success • More players copying each other (i.e., fewer exploring) in the current round → lower probability of finding the peak on the next round
  • 78. The role of Communication Networks • Examine various “neighbor” structures (who talks to whom about the oil levels)
  • 79. Network structure affects individual search strategy • Higher clustering → higher probability of neighbors guessing in the identical location • More neighbors guessing in the identical location → higher probability of copying
  • 80. Diffusion of Best Solution
  • 81. Diffusion of Best Solution
  • 82. Diffusion of Best Solution
  • 83. Diffusion of Best Solution
  • 84. Diffusion of Best Solution
  • 85. Diffusion of Best Solution
  • 86. Diffusion of Best Solution
  • 87. Diffusion of Best Solution
  • 88. Individual search strategy affects group success • No significant differences in the % of games in which the peak was found • Network affects willingness to explore
  • 89. Network structure affects group success
  • 90. TurKontrol: Decision‐Theoretic Modeling • Optimizing workflow execution using decision‐theoretic approaches [Dai et al, AAAI 2010; Kern et al. 2010] • Significant related work in control theory [Montgomery, 2007]
  • 91. Common Workflow Patterns http://www.workflowpatterns.com • Basic Control Flow: Sequence, Parallel Split, Synchronization, Exclusive Choice, Simple Merge • Iteration: Arbitrary Cycles (goto), Structured Loop (for, while, repeat), Recursion
  • 92. Soylent • Word processor with the crowd embedded [Bernstein et al, UIST 2010] • “Proofread paper”: Ask workers to proofread each paragraph – Lazy Turker: Fixes the minimum possible (e.g., a single typo) – Eager Beaver: Fixes way beyond the necessary but adds extra errors (e.g., inline suggestions on writing style) • Find‐Fix‐Verify pattern – Separating Find and Fix does not allow the Lazy Turker – Separating Fix and Verify ensures quality
  • 93. Find: “Identify at least one area that can be shortened without changing the meaning of the paragraph.” (Independent agreement to identify patches.) Fix: “Edit the highlighted section to shorten its length without changing the meaning of the paragraph.” (Randomize order of suggestions.) Verify: “Choose at least one rewrite that has style errors, and at least one rewrite that changes the meaning of the sentence.”
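A sketch of Find-Fix-Verify as code; the `post_find` / `post_fix` / `post_verify` helpers are hypothetical HIT wrappers, and the 20% agreement cutoff mirrors Soylent's independent-agreement idea rather than its exact parameters:

```python
from collections import Counter

def find_fix_verify(paragraph, n_find=10, n_verify=5):
    # Find: independent workers mark spans; keep spans flagged by >= 20%.
    flags = Counter(span for _ in range(n_find) for span in post_find(paragraph))
    patches = [s for s, c in flags.items() if c >= 0.2 * n_find]
    result = paragraph
    for patch in patches:
        fixes = post_fix(paragraph, patch)        # several alternative rewrites
        error_votes = Counter()
        for _ in range(n_verify):                 # Verify: workers flag bad rewrites
            error_votes.update(post_verify(paragraph, patch, fixes))
        best = min(fixes, key=lambda f: error_votes[f])  # fewest flags wins
        result = result.replace(patch, best)
    return result
```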
  • 94. Crowd‐created Workflows: CrowdForge • Map‐Reduce framework for crowds [Kittur et al, CHI 2011] – Identify sights worth checking out (one tip per worker) • Vote and rank – Brief tips for each monument (one tip per worker) • Vote and rank – Aggregate tips into a meaningful summary • Iterate to improve… “My Boss is a Robot” (mybossisarobot.com), Nikki Kittur (CMU) + Jim Giles (New Scientist)
  • 95. Crowd‐created Workflows: Turkomatic • The crowd creates the workflows • Turkomatic [Kulkarni et al, CHI 2011]: 1. Ask workers to decompose the task into steps (Map) 2. Can a step be completed within 10 minutes? – Yes: solve it – No: decompose further (recursion) 3. Given all partial solutions, solve the big problem (Reduce)
  • 96. Crowdsourcing Patterns • Creation: Generate / Create; Find; Improve / Edit / Fix • Quality Control: Vote for accept‐reject; Vote up, vote down, to generate rank; Vote for best / select top‐k • Flow Control: Split task; Aggregate
  • 97. Outline • Introduction: Human computation and crowdsourcing • Managing quality for simple tasks • Complex tasks using workflows • Task optimization • Incentivizing the crowd • Market design • Behavioral aspects and cognitive biases • Game design • Case studies
  • 98. Defining Task Parameters. Three main goals: • Minimize Cost (cheap) • Maximize Quality (good) • Minimize Completion Time (fast)
  • 99. Effect of Payment: Quality • Cost does not affect quality [Mason and Watts, 2009; AdSafe] • Similar results for bigger tasks [Ariely et al, 2009] [Figure: error rate vs. number of labelers at 2, 5, and 10 cents per label.]
  • 100. Effect of Payment: #Tasks • Payment incentives increase speed, though [Mason and Watts, 2009]
  • 101. Predicting Completion Time • Model the timing of an individual task [Yan, Kumar, Ganesan, 2010] – Assume a rate of task completion λ – Exponential distribution for a single task – Erlang distribution for sequential tasks – On‐the‐fly estimation of λ for parallel tasks • Optimize using early acceptance/termination – Sequential experiment setting – Stop early if confident
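Under the slide's assumptions, the time for one task is Exponential(λ) and the total for k sequential tasks is Erlang(k, λ), so the expected total is k/λ. A quick sketch, with λ an assumed value:

```python
import random

RATE = 2.0  # assumed completion rate λ (tasks per minute)

def erlang_quantile(k, lam, q=0.95, n=100_000):
    """Monte-Carlo q-quantile of an Erlang(k, lam) total: the time to
    finish k sequential tasks, each Exponential(lam)."""
    totals = sorted(sum(random.expovariate(lam) for _ in range(k)) for _ in range(n))
    return totals[int(q * (n - 1))]

print(5 / RATE)                    # expected time for 5 sequential tasks: k/λ = 2.5
print(erlang_quantile(5, RATE))    # ~4.6: a 95% deadline sits well above the mean
```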
  • 102. Predicting Completion Time • For Freebase, workers take log‐normal time to complete a task [Kochhar et al, HCOMP 2010]
  • 103. Predicting Completion Time • The exponential assumption is usually not realistic • Heavy‐tailed distribution [Ipeirotis, XRDS 2010]
  • 104. Effect of #HITs: Monotonic, but sublinear. h(t) = 0.998^#HITs • 10 HITs → 2% slower than 1 HIT • 100 HITs → 19% slower than 1 HIT • 1000 HITs → 87% slower than 1 HIT; or, 1 group of 1000 → 7 times faster than 1000 sequential groups of 1 [Wang et al, CSDM 2011]
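Reading “x% slower” as the reduction in completion rate, these numbers follow directly from the fitted multiplier; a quick check (this interpretation is mine):

```python
for n in (10, 100, 1000):
    h = 0.998 ** n                 # rate multiplier for a group of n HITs
    print(f"group of {n}: rate reduced by {1 - h:.0%}")
# -> 2%, 18%, 86%: matching the slide's 2%, 19%, 87% up to rounding.
```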
  • 105. HIT Topics. topic 1: cw castingwords podcast transcribe english mp3 edit confirm snippet grade; topic 2: data collection search image entry listings website review survey opinion; topic 3: categorization product video page smartsheet web comment website opinion; topic 4: easy quick survey money research fast simple form answers link; topic 5: question answer nanonano dinkle article write writing review blog articles; topic 6: writing answer article question opinion short advice editing rewriting paul; topic 7: transcribe transcription improve retranscribe edit answerly voicemail answer [Wang et al, CSDM 2011]
  • 106. Effect of Topic: The CastingWords Effect (topic 1: cw castingwords podcast transcribe english mp3 edit confirm snippet grade, from the list above) [Wang et al, CSDM 2011]
  • 107. Effect of Topic: Surveys = fast, even with redundancy! (topic 4: easy quick survey money research fast simple form answers link) [Wang et al, CSDM 2011]
  • 108. Effect of Topic: Writing takes time (topics 5 and 6: question/article writing, editing, rewriting) [Wang et al, CSDM 2011]
  • 109. Optimizing Completion Time • Workers pick tasks that have a large number of HITs or are recent [Chilton et al., HCOMP 2010] • VizWiz optimizations [Bigham et al., UIST 2010]: – Posts HITs continuously (to be recent) – Makes big HIT groups (to be large) – HITs are “external HITs” (i.e., IFRAME‐hosted) – HITs are populated when the worker accepts them
  • 110. Optimizing Completion Time • Completion rate varies with the time of day, depending on the audience location (India vs US vs Middle East) • Quality tends to remain the same, independent of completion time [Huang et al., HCOMP 2010]
  • 111. Other Optimizations • Qurk [Marcus et al., CIDR 2011] and CrowdDB [Franklin et al., SIGMOD 2011]: Treat humans as uncertain UDFs + apply relational optimization, plus the “GoodEnough” and “StopAfter” operators • CrowdFlow [Quinn et al.]: Integrate the crowd with machine learning to reach a balance of speed, quality, cost • Ask humans for directions in a graph: [Parameswaran et al., VLDB 2011]. See also [Kleinberg, Nature 2000; Mitzenmacher, XRDS 2010; Deng, ECCV 2010]
  • 112. Outline • Introduction: Human computation and crowdsourcing • Managing quality for simple tasks • Complex tasks using workflows • Task optimization • Incentivizing the crowd • Market design • Behavioral aspects and cognitive biases • Game design • Case studies
  • 114. Incentives: Money • Money does not improve quality but (generally) increases participation [Ariely, 2009; Mason & Watts, 2009] • But workers may be “target earners” (stop after reaching their daily goal) [Horton & Chilton, 2010 for MTurk; Camerer et al. 1997, Farber 2008, for taxi drivers; Fehr and Goette 2007]
  • 115. Incentives: Money and Trouble • Careful: Paying a little is often worse than paying nothing! – “Pay enough or not at all” [Gneezy et al, 2000] – Small pay now locks in future pay – Payment replaces internal motivation (paying kids to collect donations decreased enthusiasm; spam classification; “thanks for dinner, here is $100”) – Lesson: Be the Tom Sawyer (“how I like painting the fence”), not the scrooge‐y boss… • Paying a lot is a counter‐incentive: – People focus on the reward and not on the task – On MTurk spammers routinely attack highly‐paying tasks
  • 117. Incentives: Leaderboards • Leaderboards (“top participants”) are a frequent motivator – Should motivate correct behavior, not just measurable behavior – Newcomers should have hope of reaching the top – Whatever is measured, workers will optimize for it (e.g., Orkut country leaderboard; complaints over quality score drops) – Design guideline: Christmas‐tree dashboard (green / red lights only) [Farmer and Glass, 2010]
  • 118. Incentives: Purpose of Work • Contrafreeloading: Rats and other animals prefer to “earn” their food • Destroying work after production demotivates workers [Ariely et al, 2008] • Showing the result of a “completed task” improves satisfaction
  • 119. Incentives: Purpose of Work • Workers enjoy learning new skills (an oft‐cited reason for MTurk participation) • Design tasks to be educational – DuoLingo: Translate while learning a new language [von Ahn et al, duolingo.com] – Galaxy Zoo, Clickworkers: Classify astronomical objects [Raddick et al, 2010; http://en.wikipedia.org/wiki/Clickworkers] – Citizen Science: Learn about biology [http://www.birds.cornell.edu/citsci/] – National Geographic “Field Expedition: Mongolia”: tag potential archeological sites, learn about archeology
  • 120. Incentives: Credit and Participation • Public credit contributes to a sense of participation • Credit is also a form of reputation • (The anonymity of MTurk‐like settings discourages this factor)
  • 122. Incentive: Altruism • Contributing back (tit for tat): Early reviewers wrote reviews because they had read other useful reviews • Effect amplified in social networks: “If all my friends do it…” or “Since all my friends will see this…” • Contributing to a shared goal
  • 123. Incentives: Altruism and Purpose • On MTurk [Chandler and Kapelner, 2010] – Americans [older, more leisure‐driven] work harder for “meaningful work” – Indians [more income‐driven] were not affected – Quality unchanged for both groups
  • 124. Incentives: Fair share • Anecdote: Same HIT (spam classification) – Case 1: Requester doing it as a side‐project, to “clean the market”; would be an out‐of‐pocket expense, no pay to workers – Case 2: Requester is a researcher at a university, spam classification is now a university research project, with payment to workers • What setting worked best?
  • 125. Incentives: FUN! • Game‐ify the task (design details later) • Examples – ESP Game: Given an image, type the same word (generated image descriptions) – Phylo: align colored blocks (used for genome alignment) – FoldIt: fold structures to optimize energy (protein folding) • Fun factors [Malone 1980, 1982]: timed response, score keeping, player skill level, high‐score lists, and randomness
  • 126. Outline • Introduction: Human computation and crowdsourcing • Managing quality for simple tasks • Complex tasks using workflows • Task optimization • Incentivizing the crowd • Market design • Behavioral aspects and cognitive biases • Game design • Case studies
  • 127. Market Design Organizes the Crowd • Reputation Mechanisms – Seller‐side: Ensure worker quality – Buyer‐side: Ensure employer trustworthiness • Task organization for task discovery (worker finds employer/task) • Worker expertise recording for task assignment (employer/task finds worker)
  • 128. Lack of Reputation and the Market for Lemons • “When the quality of a sold good is uncertain and hidden before the transaction, the price drops to the value of the lowest‐valued good” [Akerlof, 1970; Nobel prize winner] Market evolution steps: 1. Employer pays $10 to a good worker, $0.1 to a bad worker 2. 50% good workers, 50% bad; indistinguishable from each other 3. Employer offers a price in the middle: $5 4. Some good workers leave the market (pay too low) 5. Employer revises prices downwards as the % of bad workers increases 6. More good workers leave the market… death spiral. http://en.wikipedia.org/wiki/The_Market_for_Lemons
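A toy simulation of the death spiral above; the dollar values are the slide's, while the exit rule (half of the remaining good workers leave whenever the offer is below their value) is an illustrative assumption:

```python
def lemons_spiral(good_value=10.0, bad_value=0.1, good_share=0.5, rounds=6):
    for r in range(rounds):
        offer = good_share * good_value + (1 - good_share) * bad_value
        print(f"round {r}: good workers {good_share:.0%}, offer ${offer:.2f}")
        if offer >= good_value:
            break
        good_share /= 2  # underpaid good workers exit the market

lemons_spiral()  # the offer falls from $5.05 toward $0.10 as good workers leave
```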
  • 129. Lack of Reputation and the Market for Lemons • The market for lemons also exists on the employer side: – Workers distrust (good) newcomer employers: they charge a risk premium, or work only a little bit. Good newcomers get disappointed – Bad newcomers have no downside (they will not pay), and continue to offer work – The market floods with bad employers • TurkOpticon, an external reputation system • “Mechanical Turk: Now with 40.92% spam” http://bit.ly/ew6vg4 • Gresham's Law: the bad drives out the good • No‐trade equilibrium: no good employer offers work in a market with bad workers, no good worker wants to work for bad employers… • In reality, we need to take into consideration that this is a repeated game (but participation follows a heavy tail…) http://en.wikipedia.org/wiki/The_Market_for_Lemons
  • 130. Reputation systems • Significant number of reputation mechanisms [Dellarocas et al, 2007] • Link analysis techniques [TrustRank, EigenTrust, NodeRanking, NetProbe, Snare] often applicable
  • 131. Challenges in the Design of Reputation Systems • Insufficient participation • Overwhelmingly positive feedback • Dishonest reports • Identity changes • Value imbalance exploitation (“milking the reputation”)
  • 132. Insufficient Participation • Free‐riding: feedback constitutes a public good. Once available, everyone can costlessly benefit from it. • Disadvantage of early evaluators: provision of feedback presupposes that the rater will assume the risks of transacting with the ratee (a competitive advantage to others). • [Avery et al. 1999] propose a mechanism whereby early evaluators are paid to provide information and later evaluators pay to balance the budget.
  • 133. Overwhelmingly Positive Feedback (I) • More than 99% of all feedback posted on eBay is positive. However, Internet auctions accounted for 16% of all consumer fraud complaints received by the Federal Trade Commission in 2004. (http://www.consumer.gov/sentinel/) • Reporting Bias, and the perils of reciprocity: • Reciprocity: Seller evaluates buyer, buyer evaluates seller • Exchange of courtesies • Positive reciprocity: positive ratings are given in the hope of getting a positive rating in return • Negative reciprocity: negative ratings are avoided because of fear of retaliation from the other party
  • 134. Overwhelmingly Positive Feedback (II) • “The sound of silence”: No news, bad news… • [Dellarocas and Wood 2008] Explore the frequency of different feedback patterns and use the non‐reports to compensate for reporting bias • eBay traders are more likely to post feedback when satisfied than when dissatisfied • Support the presence of positive and negative reciprocation among eBay traders
  • 135. Dishonest Reports • “Ballot stuffing” (unfairly high ratings): a seller colludes with a group of buyers in order to be given unfairly high ratings by them • “Bad‐mouthing” (unfairly low ratings): Sellers can collude with buyers in order to “bad‐mouth” other sellers that they want to drive out of the market • Design incentive‐compatible mechanisms to elicit honest feedback [Jurca and Faltings 2003: pay the rater if the report matches the next one; Miller et al. 2005: use a proper scoring rule to price the value of a report; Papaioannou and Stamoulis 2005: delay the next transaction over time] • Use the “latent class” models described earlier in the tutorial (reputation systems are a form of crowdsourcing after all…)
  • 136. Identity Changes • “Cheap pseudonyms”: easy to disappear and re‐register under a new identity with almost zero cost [Friedman and Resnick 2001] • Introduce opportunities to misbehave without paying reputational consequences • Increase the difficulty of online identity changes • Impose upfront costs on new entrants: allow new identities (forget the past) but make it costly to create them
  • 137. Value Imbalance Exploitation • Three men attempted to sell a fake painting on eBay for US$135,805. The sale was abandoned just prior to purchase when the buyer became suspicious. (http://news.cnet.com/2100‐1017‐253848.html) • Reputation can be seen as an asset, not only to promote oneself, but also as something that can be cashed in through a fraudulent transaction with high gain. “The Market for Evaluations”
  • 138. The Market for Positive Feedbacks • A selling strategy whereby eBay users are actually using the feedback market for gains in other markets: “Riddle for a PENNY! No shipping ‐ Positive Feedback” • 29‐cent loss even in the event of a successful sale • Price low, to speed feedback accumulation • Possible solutions: • Make the details of the transaction (besides the feedback itself) visible to other users • Transaction‐weighted reputational statistics [Brown 2006]
  • 139. Challenges for Crowdsourcing Markets (I) • Two‐sided opportunistic behavior • Reciprocal systems are worse than one‐sided evaluation. In e‐commerce markets, only sellers are likely to behave opportunistically, so there is no need for reciprocal evaluation! • In crowdsourcing markets, both sides can be fraudulent. Reciprocal systems are fraught with problems, though! • Imperfect monitoring and heavy‐tailed participation • In e‐commerce markets, buyers can assess product quality directly upon receiving it. • In crowdsourcing markets, verifying the answers is sometimes as costly as providing them. • Sampling often does not work, due to the heavy‐tailed participation distribution (lognormal, according to self‐reported surveys)
  • 140. Challenges for Crowdsourcing Markets (II) • Constrained capacity of workers • In e‐commerce markets, sellers usually have an unlimited supply of products. • In crowdsourcing, workers have constrained capacity (they cannot be recommended continuously) • No “price premium” for high‐quality workers • In e‐commerce markets, sellers with high reputation can sell their products at a relatively high price (premium). • In crowdsourcing, it is the requester who sets the prices, which are generally the same for all the workers.
  • 141. Market Design Organizes the Crowd • Reputation Mechanisms – Seller‐side: Ensure worker quality – Buyer‐side: Ensure employer trustworthiness • Task organization for task discovery (worker finds employer/task) • Worker expertise recording for task assignment (employer/task finds worker)
  • 142. The Importance of Task Discovery • Heavy‐tailed distribution of completion times. Why? [Ipeirotis, “Analyzing the Amazon Mechanical Turk marketplace”, XRDS 2010]
  • 143. The Importance and Danger of Priorities • [Barabasi, Nature 2005] showed that human actions have power‐law completion times – Mainly a result of prioritization – When tasks are ranked by priorities, power laws result • [Cobham, 1954] If a queuing system completes tasks with two priority queues, and λ=μ, then power‐law completion times • [Chilton et al., HCOMP 2010] Workers on Turk pick tasks from the “most HITs” or “most recent” queues
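A small simulation of Cobham's observation: serve the highest-priority task each tick while arrivals keep pace with service (λ≈μ), and the waiting-time distribution grows a heavy tail. All parameters here are illustrative:

```python
import heapq, random

def priority_waits(ticks=200_000):
    """Each tick: ~1 arrival on average (in bursts of 2), 1 task served,
    highest priority first. Low-priority tasks get overtaken repeatedly."""
    queue, waits = [], []
    for t in range(ticks):
        if random.random() < 0.5:                       # burst of 2, so λ ≈ μ
            for _ in range(2):
                heapq.heappush(queue, (random.random(), t))
        if queue:
            _, arrived = heapq.heappop(queue)           # smallest key = top priority
            waits.append(t - arrived)
    waits.sort()
    return {q: waits[int(q * (len(waits) - 1))] for q in (0.5, 0.9, 0.99, 0.999)}

print(priority_waits())  # median wait is tiny; extreme quantiles are orders larger
```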
  • 144. The UI hurts the market! • Practitioners know that HITs on the 3rd page and after are not picked by workers • Many such HITs are left to expire after months, never completed • A badly designed task discovery interface hurts every participant in the market! (and the reason for scientific modeling…) • Better modeling as a queuing system may demonstrate other such improvements
  • 145. Market Design Organizes the Crowd • Reputation Mechanisms – Seller‐side: Ensure worker quality – Buyer‐side: Ensure employer trustworthiness • Task organization for task discovery (worker finds employer/task) • Worker expertise recording for task assignment (employer/task finds worker)
  • 146. Expert Search • Find the best worker for a task (or within a task) • For a task: – Significant amount of research in the topic of expert search [TREC track; Macdonald and Ounis, 2006] – Check the quality of workers across tasks http://url‐annotator.appspot.com/Admin/WorkersReport • Within a task: [Donmez et al., 2009; Welinder, 2010]
  • 147. Directions for future research • Optimize the allocation of tasks to workers based on completion time and expected quality • Explicitly take into consideration competition in the market and switch tasks for a worker only when the benefit outweighs the switching overhead (cf. task switching in a CPU by the O/S) • Recommender system for tasks (“workers like you performed well in…”) • Create a market with dynamic pricing for tasks, following the pricing model of the stock market (prices increase for a task when work supply is low, and vice versa)
  • 148. Outline • Introduction: Human computation and crowdsourcing • Managing quality for simple tasks • Complex tasks using workflows • Task optimization • Incentivizing the crowd • Market design • Behavioral aspects and cognitive biases • Game design • Case studies
  • 149. Human Computation • Humans are not perfect mathematical models • They exhibit noisy, stochastic behavior… • And exhibit common and systematic biases
  • 150. Score the following from 1 to 10 (1: not particularly bad or wrong; 10: extremely evil) a) Stealing a towel from a hotel b) Keeping a dime you find on the ground c) Poisoning a barking dog [Parducci, 1968]
  • 151. Score the following from 1 to 10 (1: not particularly bad or wrong; 10: extremely evil) a) Testifying falsely for pay b) Using guns on striking workers c) Poisoning a barking dog [Parducci, 1968]
  • 152. Anchoring • “Humans start with a first approximation (anchor) and then make adjustments to that number based on additional information.” [Tversky & Kahneman, 1974] • [Paolacci et al, 2010] – Q1a: More or less than 65 African countries in the UN? – Q1b: More or less than 12 African countries in the UN? – Q2: How many countries in Africa? – Group A mean: 42.6 – Group B mean: 18.5
  • 153. Anchoring • Write down the last digit of your social security number before placing a bid for wine bottles: users with lower SSN digits bid lower… • In the Netflix contest, users with high ratings early in a session were biased towards higher ratings later in the session… • Crowdsourcing tasks can be affected by anchoring; [Mozer et al, NIPS 2010] describe techniques for removing such effects
  • 154. Priming • Exposure to one stimulus influences another • Stereotypes: – Asian‐Americans perform better in math – Women perform worse in math • [Shih et al., 1999] asked Asian‐American women: – Questions about race: They did better in the math test – Questions about gender: They did worse in the math test
  • 155. Exposure Effect • Familiarity leads to liking... • [Stone and Alonso, 2010]: Evaluators of the Bing search engine increase their ratings of relevance over time, for the same results
  • 156. Framing • Presenting the same option in different formats leads to different choices. People avert options that imply loss [Tversky and Kahneman, 1981]
  • 157. Framing: 600 people affected by a deadly disease. Room 1: a) save 200 people's lives b) 33% chance of saving all 600 people and a 66% chance of saving no one • 72% of participants chose option A • 28% of participants chose option B. Room 2: c) 400 people die d) 33% chance that no people will die; a 66% chance that all 600 will die • 78% of participants chose option D (equivalent to option B) • 22% of participants chose option C (equivalent to option A). People avert options that imply loss
  • 158. Very long list of cognitive biases… • http://en.wikipedia.org/wiki/List_of_cognitive_biases • [Mozer et al., 2010] try to learn and remove sequential effects from human computation data…
  • 159. Outline • Introduction: Human computation and crowdsourcing • Managing quality for simple tasks • Complex tasks using workflows • Task optimization • Incentivizing the crowd • Market design • Behavioral aspects and cognitive biases • Game design • Case studies
  • 160. Games with a Purpose [Luis von Ahn and Laura Dabbish, CACM 2008] Three generic game structures • Output agreement: – Type the same output • Input agreement: – Decide if you have the same input • Inversion problem: – P1 generates output from input – P2 looks at P1’s output and guesses P1’s input
  • 161. Output Agreement: ESP Game • Players look at a common input • Need to agree on the output
  • 162. Improvements • Game‐theoretic analysis indicates that players will converge to easy words [Jain and Parkes] • Solution 1: Add “taboo words” to prevent guessing easy words • Solution 2: KissKissBan, where a third player tries to guess (and block) the agreement
  • 163. Input Agreement: TagATune • Sometimes difficult to type identical output (e.g., “describe this song”) • Show the same or different inputs, let users describe them, and ask players if they have the same input
  • 164. Inversion Problem: Peekaboom • Non‐symmetric players • Input: Image with a word • Player 1 slowly reveals the picture • Player 2 tries to guess the word
  • 165.
  • 166. HINT
  • 167. HINT
  • 168. HINT
  • 169. HINT
  • 171. Protein folding • Protein folding: Proteins fold from long chains into small balls, each in a very specific shape • The shape is the lower‐energy setting, which is the most stable • The fold shape is very important to understand interactions with other molecules • Extremely expensive computationally! (too many degrees of freedom)
  • 172. FoldIt Game • Humans are very good at reducing the search space • Humans try to fold the protein into a minimal energy state • Can leave a protein unfinished and let others try from there…
  • 173. Outline • Introduction: Human computation and crowdsourcing • Managing quality for simple tasks • Complex tasks using workflows • Task optimization • Incentivizing the crowd • Market design • Behavioral aspects and cognitive biases • Game design • Case studies
  • 176.
  • 177.
  • 178.
  • 179.
  • 180. A few of the tasks in the past • Detect pages that discuss swine flu – A pharmaceutical firm had a drug “treating” (off‐label) swine flu – The FDA prohibited the pharmaceutical company from displaying the drug ad on pages about swine flu – Two days to build and go live • A big fast‐food chain does not want its ad to appear: – On pages that discuss the brand (99% negative sentiment) – On pages discussing obesity – Three days to build and go live
  • 181. Need to build models fast • Traditionally, modeling teams have invested substantial internal resources in data formulation, information extraction, cleaning, and other preprocessing. No time for such things… • However, now we can outsource preprocessing tasks, such as labeling, feature extraction, verifying information extraction, etc. – using Mechanical Turk, oDesk, etc. – quality may be lower than expert labeling (much?) – but low costs can allow massive scale
  • 182. AdSafe workflow • Find URLs for a given topic (hate speech, gambling, alcohol abuse, guns, bombs, celebrity gossip, etc.) http://url‐collector.appspot.com/allTopics.jsp • Classify URLs into appropriate categories http://url‐annotator.appspot.com/AdminFiles/Categories.jsp • Measure the quality of the labelers and remove spammers http://qmturk.appspot.com/ • Get humans to “beat” the classifier by providing cases where the classifier fails http://adsafe‐beatthemachine.appspot.com/
  • 184. Scaling Crowdsourcing: Use Machine Learning • Need to scale crowdsourcing • Basic idea: Build a machine learning model and use it instead of humans. [Diagram: existing data (through crowdsourcing) → automatic model (through machine learning) → automatic answer for each new case]
  • 185. Scaling Crowdsourcing: Iterative training • Triage: – machine when confident – humans when not confident • Retrain using the new human input → improve the model → reduce the need for human input. [Diagram: new case → automatic model → automatic answer; otherwise get human(s) to answer, adding to the data from existing crowdsourced answers]
  • 186. Scaling Crowdsourcing: Iterative training, with noise • Machine when confident, humans otherwise • Ask as many humans as necessary to ensure quality. [Diagram: new case → automatic model; if not confident about quality, get human(s) to answer until confident, feeding the data back for retraining]
  • 187. Scaling Crowdsourcing: Iterative training, with noise • Machine when confident, humans otherwise • Ask as many humans as necessary to ensure quality – or even get other machines… [Diagram: as above, with “get human(s) or other machines to answer”]
  • 188. Example: ReCAPTCHA + Google Books • Fixes errors of Optical Character Recognition (OCR: ~1% error rate, 20%–30% for 18th‐ and 19th‐century books, according to today’s NY Times article) • Further improves the OCR algorithm, reducing the error rate • “40 million ReCAPTCHAs a day” (2008), fixing 40,000 books a day – [Unofficial quote from Luis]: 400M/day (2010) – All books ever written: 100 million books (~12 yrs??)