Human computation, crowdsourcing 
and social: An industrial perspective 
Omar Alonso 
Microsoft 
12 November 2014
Disclaimer 
The views, opinions, positions, or strategies expressed in 
this talk are mine and do not necessarily reflect the 
official policy or position of Microsoft.
Introduction 
• Crowdsourcing is hot 
• Lots of interest in the research community 
– Articles showing good results 
– Journals special issues (IR, IEEE Internet Computing, etc.) 
– Workshops and tutorials (SIGIR, NAACL, WSDM, WWW, CHI, 
RecSys, VLDB, etc.) 
– HCOMP 
– CrowdConf 
• Large companies leveraging crowdsourcing 
• Big data 
• Start-ups 
• Venture capital investment
Crowdsourcing 
• Crowdsourcing is the act of taking a 
job traditionally performed by a 
designated agent (usually an 
employee) and outsourcing it to an 
undefined, generally large group of 
people in the form of an open call. 
• The application of Open Source 
principles to fields outside of 
software. 
• Most successful story: Wikipedia
HUMAN COMPUTATION
Human computation 
• Not a new idea 
• Computers before 
computers 
• You are a human 
computer
Some definitions 
• Human computation is a computation 
that is performed by a human 
• Human computation system is a system 
that organizes human efforts to carry 
out computation 
• Crowdsourcing is a tool that a human 
computation system can use to 
distribute tasks. 
Edith Law and Luis von Ahn. Human Computation. Morgan & Claypool Publishers, 2011.
More examples 
• ESP game 
• CAPTCHA: 200M solved every day 
• reCAPTCHA: 750M to date
Data is king 
• Massive free Web data 
changed how we train 
learning systems 
• Crowds provide new access 
to cheap & labeled big data 
• But quality also matters 
M. Banko and E. Brill. “Scaling to Very Very Large Corpora for Natural Language Disambiguation”, ACL 2001. 
A. Halevy, P. Norvig, and F. Pereira. “The Unreasonable Effectiveness of Data”, IEEE Intelligent Systems 2009.
Traditional Data Collection 
• Setup data collection software / harness 
• Recruit participants / annotators / assessors 
• Pay a flat fee for experiment or hourly wage 
• Characteristics 
– Slow 
– Expensive 
– Difficult and/or Tedious 
– Sample Bias…
Natural Language Processing 
• MTurk annotation for 5 NLP tasks 
• 22K labels for US $26 
• High agreement between consensus labels and 
gold-standard labels 
• Workers as good as experts 
R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng. “Cheap and Fast But is it Good? Evaluating Non-Expert 
Annotations for Natural Language Tasks”. EMNLP-2008.
Machine Translation 
• Manual evaluation 
on translation quality 
is slow and expensive 
• High agreement 
between non-experts 
and experts 
• $0.10 to translate a 
sentence 
C. Callison-Burch. “Fast, Cheap, and Creative: Evaluating Translation Quality 
Using Amazon’s Mechanical Turk”, EMNLP 2009.
Soylent 
M. Bernstein et al. “Soylent: A Word Processor with a Crowd Inside”, UIST 2010
Mechanical Turk 
• Amazon Mechanical Turk 
(AMT, MTurk, 
www.mturk.com) 
• Crowdsourcing platform 
• On-demand workforce 
• “Artificial artificial 
intelligence”: get humans to 
do hard part 
• Named after the faux chess-playing 
automaton of the 18th century
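For concreteness, a minimal sketch of posting a yes/no HIT programmatically, using the modern boto3 MTurk client against the sandbox endpoint. The task parameters (reward, durations, question wording) are illustrative assumptions, not values from the talk:

```python
# Hypothetical sketch: posting a simple yes/no HIT via boto3's MTurk client.
# All parameter values (reward, durations, question text) are illustrative.
import boto3

client = boto3.client(
    "mturk",
    region_name="us-east-1",
    # Sandbox endpoint, so prototyping doesn't spend real money.
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

question_xml = """<QuestionForm xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2005-10-01/QuestionForm.xsd">
  <Question>
    <QuestionIdentifier>relevant</QuestionIdentifier>
    <IsRequired>true</IsRequired>
    <QuestionContent><Text>Is this document relevant to the query?</Text></QuestionContent>
    <AnswerSpecification>
      <SelectionAnswer>
        <Selections>
          <Selection><SelectionIdentifier>yes</SelectionIdentifier><Text>Yes</Text></Selection>
          <Selection><SelectionIdentifier>no</SelectionIdentifier><Text>No</Text></Selection>
        </Selections>
      </SelectionAnswer>
    </AnswerSpecification>
  </Question>
</QuestionForm>"""

hit = client.create_hit(
    Title="Judge search result relevance",
    Description="Decide whether a document is relevant to a query.",
    Keywords="relevance, search, labeling",
    Reward="0.04",                    # dollars, passed as a string
    MaxAssignments=5,                 # five independent workers per item
    LifetimeInSeconds=24 * 3600,      # how long the HIT stays visible
    AssignmentDurationInSeconds=300,  # time a worker has per assignment
    Question=question_xml,
)
print(hit["HIT"]["HITId"])
```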
• Multiple Channels 
• Gold-based tests 
• Only pay for 
“trusted” judgments
HIT example
HIT example
{where to go on vacation} 
• MTurk: 50 answers, 
$1.80 
• Quora: 2 answers 
• Y! Answers: 2 
answers 
• FB: 1 answer 
• Tons of results 
• Read title + snippet + 
URL 
• Explore a few pages in 
detail
{where to go on vacation} 
[Charts: answers grouped by countries and by cities]
Flip a coin 
• Please flip a coin and report the results 
• Two questions 
1. Coin type? 
2. Heads or tails? 
• Results 

Outcome    Count 
head       57 
tail       43 
Total      100 

Coin type  Count 
Dollar     56 
Euro       11 
Other      30 
(blank)    3 
Total      100
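As a sanity check on results like these, one can ask whether 57 heads in 100 reported flips is consistent with fair coins. A minimal sketch of a two-sided exact binomial test, in pure Python:

```python
from math import comb

def binomial_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Two-sided exact binomial test: probability of an outcome at
    least as extreme as k successes out of n, under success rate p."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Sum the probabilities of all outcomes no more likely than the observed one.
    return sum(q for q in pmf if q <= observed + 1e-12)

# 57 heads out of 100 reported flips (the results above).
print(f"p-value = {binomial_two_sided_p(57, 100):.3f}")  # ~0.19: consistent with fair coins
```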
Why is this interesting? 
• Easy to prototype and test new experiments 
• Cheap and fast 
• No need to set up infrastructure 
• Introduce experimentation early in the cycle 
• For new ideas, this is very helpful
Caveats and clarifications 
• Trust and reliability 
• The wisdom of the crowd, revisited 
• Adjust expectations 
• Crowdsourcing is another data point for your 
analysis 
• Complementary to other experiments
Why now? 
• The Web 
• Use humans as processors in a distributed 
system 
• Address problems that computers aren’t good at 
• Scale 
• Reach
Who are 
the workers? 
• A. Baio. “The Faces of Mechanical Turk”, November 2008. 
• P. Ipeirotis. “The New Demographics of Mechanical Turk”, March 2010. 
• J. Ross et al. “Who are the Crowdworkers?”, CHI 2010.
Issues
ASSESSMENTS AND LABELS
Relevance assessments 
Is this document relevant to the query?
Careful with That Axe Data, Eugene 
• In the area of big data and machine learning: 
– labels -> features -> predictive model -> optimization 
• Labeling/experimentation perceived as boring 
• Don’t rush labeling 
– Human and machine 
• Label quality is very important 
– Don’t outsource it 
– Own it end to end 
– Large scale
More on label quality 
• Data gathering is not a free lunch 
• Labels for the machine != labels for humans 
• Emphasis on algorithms, 
models/optimizations and mining from labels 
• Not so much on algorithms for ensuring high 
quality labels 
• Training sets
The importance of labels – IR context
INFORMATION RETRIEVAL AND 
CROWDSOURCING
Motivating Example: Relevance Judging 
• Relevance of search results is difficult to judge 
– Highly subjective 
– Expensive to measure 
• Professional editors commonly used 
• Potential benefits of crowdsourcing 
– Scalability (time and cost) 
– Diversity of judgments
Started with a joke …
Results for {idiot} 
February 2011: 5/7 (R), 2/7 (NR) 
Relevant 
1. Most of the time those TV reality stars have absolutely no talent. They do whatever they 
can to make a quick dollar. Most of the time the reality tv stars don not have a mind of 
their own. R 
2. Most are just celebrity wannabees. Many have little or no talent, they just want fame. R 
3. Have you seen the knuckledraggers on reality television? They should be required to change 
their names to idiot after appearing on the show. You could put numbers after the word 
idiot so we can tell them apart. R 
4. Although I have not followed too many of these shows, those that I have encountered have 
for a great part a very common property. That property is that most of the participants 
involved exhibit a shallow self-serving personality that borders on social pathological 
behavior. To perform or act in such an abysmal way could only be an act of an idiot. R 
5. I can see this one going both ways. A particular sort of reality star comes to mind, 
though, one who was voted off Survivor because he chose not to use his immunity necklace. 
Sometimes the label fits, but sometimes it might be unfair. R 
Not Relevant 
1. Just because someone else thinks they are an "idiot", doesn't mean that is what the word 
means. I don't like to think that any one person's photo would be used to describe a 
certain term. NR 
2. While some reality-television stars are genuinely stupid (or cultivate an image of 
stupidity), that does not mean they can or should be classified as "idiots." Some simply 
act that way to increase their TV exposure and potential earnings. Other reality-television 
stars are really intelligent people, and may be considered as idiots by people who don't 
like them or agree with them. It is too subjective an issue to be a good result for a 
search engine. NR
You have a new idea 
• Novel IR technique 
• Don’t have access to click data 
• Can’t hire editors 
• How to test new ideas?
Crowdsourcing and relevance evaluation 
• Subject pool access: no need to come into the 
lab 
• Diversity 
• Low cost 
• Agile
Pedal to the metal 
• You read the papers 
• You tell your boss (or advisor) that 
crowdsourcing is the way to go 
• You now need to produce hundreds of 
thousands of labels per month 
• Easy, right?
Ask the right questions 
• Instructions are key 
• Workers are not IR experts, so don’t assume 
they share your terminology 
• Show examples 
• Hire a technical writer 
• Prepare to iterate
How not to do things 
• A lot of work for a few cents 
• Go here, go there, copy, enter, count …
UX design 
• Time to apply all those usability concepts 
• Need to grab attention 
• Generic tips 
– Experiment should be self-contained. 
– Keep it short and simple. 
– Be very clear with the task. 
– Engage with the worker. Avoid boring stuff. 
– Always ask for feedback (open-ended question) in an 
input box. 
• Localization
Payments 
• How much should a HIT pay? 
• Delicate balance 
– Too little, no interest 
– Too much, attract spammers 
• Heuristics 
– Start with something and wait to see if there is 
interest or feedback (“I’ll do this for X amount”) 
– Payment based on user effort. Example: $0.04 (2 cents 
to answer a yes/no question, 2 cents if you provide 
feedback that is not mandatory) 
• Bonus
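Budgeting follows directly from these heuristics. A small cost sketch; the 20% platform fee is an illustrative assumption, so check your platform's actual fee schedule:

```python
def experiment_cost(items: int, workers_per_item: int,
                    pay_per_judgment: float, platform_fee: float = 0.20) -> float:
    """Total cost of a labeling run: every item is judged by several
    workers, and the platform takes a percentage fee on top."""
    base = items * workers_per_item * pay_per_judgment
    return base * (1 + platform_fee)

# 1,000 items, 5 workers each, $0.04 per judgment (2c answer + 2c optional feedback).
print(f"${experiment_cost(1000, 5, 0.04):,.2f}")  # $240.00
```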
Managing crowds
Quality control 
• Extremely important part of the experiment 
• Approach it as “overall” quality – not just for 
workers 
• Bi-directional channel 
– You may think the worker is doing a bad job. 
– The same worker may think you are a lousy 
requester. 
• Test with a gold standard
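A minimal sketch of the gold-standard test: seed known-answer items into the task and compute each worker's accuracy on them. The data structures and the 70% threshold are assumptions for illustration:

```python
from collections import defaultdict

# gold[item_id] -> trusted label; judgments -> (worker_id, item_id, label)
gold = {"q1": "relevant", "q2": "not_relevant", "q3": "relevant"}
judgments = [
    ("w1", "q1", "relevant"), ("w1", "q2", "not_relevant"),
    ("w2", "q1", "not_relevant"), ("w2", "q3", "not_relevant"),
]

hits, seen = defaultdict(int), defaultdict(int)
for worker, item, label in judgments:
    if item in gold:                      # only score the seeded gold items
        seen[worker] += 1
        hits[worker] += (label == gold[item])

for worker in seen:
    accuracy = hits[worker] / seen[worker]
    # Flag workers below an (illustrative) 70% accuracy threshold.
    status = "ok" if accuracy >= 0.7 else "review/untrusted"
    print(worker, f"{accuracy:.0%}", status)
```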
When to assess work quality? 
• Beforehand (prior to main task activity) 
– How: “qualification tests” or similar mechanism 
– Purpose: screening, selection, recruiting, training 
• During 
– How: assess labels as worker produces them 
– Like random checks on a manufacturing line 
– Purpose: calibrate, reward/penalize, weight 
• After 
– How: compute accuracy metrics post-hoc 
– Purpose: filter, calibrate, weight, retain
How do we measure work quality? 
• Compare worker’s label vs. 
– Known (correct, trusted) label 
– Other workers’ labels 
– Model-based predictions of worker quality and true labels 
• Verify worker’s label 
– Yourself 
– Tiered approach (e.g. Find-Fix-Verify)
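Comparing a worker against other workers usually starts from an aggregate answer per item. A minimal majority-vote sketch; ties here fall to the lexicographically first label, which is an arbitrary choice:

```python
from collections import Counter

def majority_vote(labels: list[str]) -> str:
    """Consensus label for one item; ties broken deterministically."""
    counts = Counter(labels)
    top = max(counts.values())
    return sorted(l for l, c in counts.items() if c == top)[0]

# Per-item labels from five workers each.
item_labels = {
    "doc42": ["relevant", "relevant", "not_relevant", "relevant", "relevant"],
    "doc43": ["not_relevant", "relevant", "not_relevant", "not_relevant", "relevant"],
}
consensus = {item: majority_vote(ls) for item, ls in item_labels.items()}
print(consensus)  # {'doc42': 'relevant', 'doc43': 'not_relevant'}
```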
Methods for measuring agreement 
• Inter-rater agreement level 
– Agreement between judges 
– Agreement between judges and the gold set 
• Some statistics 
– Cohen’s kappa (2 raters) 
– Fleiss’ kappa (any number of raters) 
– Krippendorff’s alpha 
• Gray areas 
– 2 workers say “relevant” and 3 say “not relevant” 
– 2-tier system
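For two raters, Cohen's kappa corrects raw agreement for the agreement expected by chance. A small self-contained sketch with made-up labels (sklearn's cohen_kappa_score gives the same number if you prefer a library):

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters over the same items:
    kappa = (p_observed - p_expected) / (1 - p_expected)."""
    n = len(rater_a)
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    p_exp = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (p_obs - p_exp) / (1 - p_exp)

a = ["R", "R", "NR", "R", "NR", "R", "R", "NR"]
b = ["R", "NR", "NR", "R", "NR", "R", "R", "R"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # 0.43: moderate agreement
```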
Content quality 
• People prefer to work on content they find interesting 
• Content and judgments should reflect modern 
times 
– e.g., TREC airport security docs are pre-9/11 
• Document length 
• Randomize content 
• Avoid worker fatigue 
– Judging 100 documents on the same subject can be 
tiring, leading to decreasing quality
Was the task difficult? 
Ask workers to rate the difficulty of a search topic 
50 topics; 5 workers; $0.01 per task
So far … 
• One may say “this is all good but looks like a 
ton of work” 
• The original goal: data is king 
• Data quality and experimental designs are 
preconditions to make sure we get the right 
stuff 
• Don’t cut corners
Pause 
• Crowdsourcing works 
– Fast turnaround, easy to experiment, few dollars to test 
– But: you must design experiments carefully, manage 
quality, and work around platform limitations 
• Crowdsourcing in production 
– Large scale data sets 
– Continuous execution 
– Difficult to debug 
• How do you know the experiment is working? 
• Goal: a framework for ensuring the reliability of 
crowdsourcing tasks 
O. Alonso, C. Marshall and M. Najork. “Crowdsourcing a subjective labeling task: A human centered framework to ensure reliable 
results” http://research.microsoft.com/apps/pubs/default.aspx?id=219755.
Labeling tweets – an example of a task 
• Is this tweet interesting? 
• Subjective activity 
• Not focused on specific events 
• Findings 
– Difficult problem, low inter-rater agreement 
– Tested many designs, number of workers, platforms 
(MTurk and others) 
• Multiple contingent factors 
– Worker performance 
– Work 
– Task design 
O. Alonso, C. Marshall and M. Najork. “Are some tweets more interesting than others? #hardquestion”. HCIR 2013.
Designs that include in-task CAPTCHA 
• Idea borrowed from reCAPTCHA: use a 
control term 
• HIDDEN 
• Adapt your labeling task 
• 2 more questions as control 
– 1 algorithmic 
– 1 semantic
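A sketch of how such control questions can gate the main answer. The field names and the all-controls-correct rule are illustrative assumptions, not the paper's exact design:

```python
# Each response carries the main label plus two control answers:
# one algorithmically checkable, one semantic (known from the control term).
def is_trusted(response: dict, expected_algorithmic: str, expected_semantic: str) -> bool:
    """Keep the main judgment only if both in-task controls are correct."""
    return (response["control_algorithmic"] == expected_algorithmic
            and response["control_semantic"] == expected_semantic)

responses = [
    {"worker": "w1", "label": "interesting", "control_algorithmic": "7", "control_semantic": "cat"},
    {"worker": "w2", "label": "not_interesting", "control_algorithmic": "9", "control_semantic": "cat"},
]
trusted = [r for r in responses if is_trusted(r, expected_algorithmic="7", expected_semantic="cat")]
print([r["worker"] for r in trusted])  # ['w1']: w2 failed a control, judgment discarded
```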
Production example #1 
Q1 (kappa = 0.91, alpha = 0.91) 
Q2 (kappa = 0.771, alpha = 0.771) 
Q3 (kappa = 0.033, alpha = 0.035) 
[Slide annotations: tweet de-branded; in-task CAPTCHA; the main question]
Production example #2 
Q1 (kappa = 0.907, alpha = 0.907) 
Q2 (kappa = 0.728, alpha = 0.728) 
• Q3 Worthless (alpha = 0.033) 
• Q3 Trivial (alpha = 0.043) 
• Q3 Funny (alpha = -0.016) 
• Q3 Makes me curious (alpha = 0.026) 
• Q3 Contains useful info (alpha = 0.048) 
• Q3 Important news (alpha = 0.207) 
[Slide annotations: tweet de-branded; in-task CAPTCHA; Q3 broken down by category to get a better signal]
Once we get here 
• High quality labels 
• Data will later be used for rankers, ML 
models, evaluations, etc. 
• Training sets 
• Scalability and repeatability
CURRENT TRENDS
Algorithms 
• Bandit problems; explore-exploit 
• Optimizing the amount of work assigned to workers 
– Humans have limited throughput 
– Harder to scale than machines 
• Selecting the right crowds 
• Stopping rule
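As a concrete instance of the explore-exploit idea, a minimal epsilon-greedy sketch for choosing which crowd (or worker pool) gets the next task, based on observed accuracy on gold items. The crowd names and accuracy numbers are made up:

```python
import random

class EpsilonGreedyRouter:
    """Pick the crowd with the best observed accuracy most of the time,
    but keep exploring the others with probability epsilon."""
    def __init__(self, crowds: list[str], epsilon: float = 0.1):
        self.epsilon = epsilon
        self.stats = {c: {"trials": 0, "correct": 0} for c in crowds}

    def choose(self) -> str:
        if random.random() < self.epsilon:
            return random.choice(list(self.stats))          # explore
        return max(self.stats, key=lambda c:                # exploit
                   self.stats[c]["correct"] / max(1, self.stats[c]["trials"]))

    def update(self, crowd: str, correct: bool) -> None:
        self.stats[crowd]["trials"] += 1
        self.stats[crowd]["correct"] += correct

router = EpsilonGreedyRouter(["mturk", "internal", "community"])
for _ in range(1000):
    crowd = router.choose()
    # Simulated feedback; in practice, score answers against gold items.
    router.update(crowd, correct=random.random() < {"mturk": 0.7, "internal": 0.9, "community": 0.6}[crowd])
print(max(router.stats, key=lambda c: router.stats[c]["trials"]))  # usually 'internal'
```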
Humans in the loop 
• Computation loops that mix humans and 
machines 
• A form of active learning 
• Double goal: 
– Human checking on the machine 
– Machine checking on humans 
• Example: classifiers for social data
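A schematic of such a loop for a social-data classifier: the machine keeps the cases it is confident about and sends uncertain items to the crowd, whose answers become new training data (uncertainty sampling). The classifier, threshold, and crowd call are placeholders:

```python
# Hypothetical human-in-the-loop sketch: the model handles confident cases,
# humans handle uncertain ones, and human labels feed back into training.
def human_in_the_loop(items, model_predict, ask_crowd, confidence_threshold=0.8):
    auto_labeled, training_additions = [], []
    for item in items:
        label, confidence = model_predict(item)
        if confidence >= confidence_threshold:
            auto_labeled.append((item, label))          # machine checks out
        else:
            crowd_label = ask_crowd(item)               # human checks the machine
            training_additions.append((item, crowd_label))
    return auto_labeled, training_additions

# Toy stand-ins for a real classifier and a real crowdsourcing call.
model = lambda tweet: ("interesting", 0.9) if "breaking" in tweet else ("interesting", 0.5)
crowd = lambda tweet: "not_interesting"
auto, to_retrain = human_in_the_loop(["breaking: quake", "lunch pic"], model, crowd)
print(auto, to_retrain)
```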
Routing 
• Expertise detection and routing 
• Social load balancing 
• When to switch between machines and 
humans 
• CrowdSTAR 
B. Nushi, O. Alonso, M. Hentschel, V. Kandylas. “CrowdSTAR: A Social Task Routing 
Framework for Online Communities”, 2014. http://arxiv.org/abs/1407.6714
Social Task Routing 
[Diagram: a task is first routed across crowds, e.g., Crowd 1 (Twitter) vs. Crowd 2 (Quora), based on crowd summaries, and then routed within the chosen crowd]
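A toy sketch of this two-level routing in the spirit of CrowdSTAR (not the paper's actual algorithm): first pick a crowd by topical expertise, then pick a responder within it by current load. All names and scores are illustrative:

```python
# Toy two-level router, loosely inspired by the CrowdSTAR idea above;
# the scoring scheme is an illustrative assumption, not the paper's method.
crowds = {
    "twitter": {"expertise": {"news": 0.9, "programming": 0.4},
                "users": {"alice": 0.2, "bob": 0.7}},   # user -> current load
    "quora":   {"expertise": {"news": 0.5, "programming": 0.9},
                "users": {"carol": 0.3, "dave": 0.9}},
}

def route(topic: str) -> tuple[str, str]:
    # Level 1: route across crowds by expertise on the topic.
    crowd = max(crowds, key=lambda c: crowds[c]["expertise"].get(topic, 0.0))
    # Level 2: route within the crowd to the least-loaded user (social load balancing).
    user = min(crowds[crowd]["users"], key=crowds[crowd]["users"].get)
    return crowd, user

print(route("programming"))  # ('quora', 'carol')
```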
Question Posting – Twitter Examples
Conclusions 
• Crowdsourcing at scale works but requires a solid 
framework 
• Fast turnaround, easy to experiment, few dollars to test 
• But you have to design the experiments carefully 
• Usability considerations 
• Lots of opportunities to improve current platforms 
• Three aspects that need attention: workers, work and 
task design 
• Labeling social data is hard
Conclusions – II 
• Important to know your limitations and be 
ready to collaborate 
• Lots of different skills and expertise required 
– Social/behavioral science 
– Human factors 
– Algorithms 
– Economics 
– Distributed systems 
– Statistics
Thank you - @elunca
