Integrating Crowd & Cloud Resources for Big Data
Michael Franklin, Middleware 2012, Montreal

An overview of issues and early work on combining human computation and scalable computing to tackle big data analytics problems. Includes a survey of relevant projects underway at the UC Berkeley AMPLab.


    1. Integrating Crowd & Cloud Resources for Big Data. Michael Franklin. Middleware 2012, Montreal, December 6, 2012. UC Berkeley, Expeditions in Computing.
    2. CROWDSOURCING: WHAT IS IT?
    3. Citizen Science: NASA "Clickworkers", 2000
    4. Citizen Journalism / Participatory Sensing
    5. Communities & Expertise
    6. Data Collection & Curation, e.g., Freebase
    7. An Academic View. From Quinn & Bederson, "Human Computation: A Survey and Taxonomy of a Growing Field", CHI 2011.
    8. How Industry Looks At It
    9. Useful Taxonomies
       • Doan, Halevy, Ramakrishnan (Crowdsourcing), CACM 4/11:
         – nature of collaboration (implicit vs. explicit)
         – architecture (standalone vs. piggybacked)
         – must recruit users/workers? (yes or no)
         – what do users/workers do?
       • Bederson & Quinn (Human Computation), CHI '11:
         – motivation (pay, altruism, enjoyment, reputation)
         – quality control (many mechanisms)
         – aggregation (how are results combined?)
         – human skill (visual recognition, language, …)
         – …
    10. Types of Tasks
        Task granularity   Examples
        Complex tasks      Build a website; develop a software system; overthrow a government?
        Simple projects    Design a logo and visual identity; write a term paper
        Macro tasks        Write a restaurant review; test a new website feature; identify a galaxy
        Micro tasks        Label an image; verify an address; simple entity resolution
        Inspired by the report "Paid Crowdsourcing", Smartsheet.com, 9/15/2009.
    11. MICRO-TASK MARKETPLACES
    12. Amazon Mechanical Turk (AMT)
    13. Microtasking – Virtualized Humans
        • Current leader: Amazon Mechanical Turk
        • Requestors place Human Intelligence Tasks (HITs)
          – set price per "assignment" (usually cents)
          – specify # of replicas (assignments), expiration, …
          – user interface (for workers)
          – API-based: "createHit()", "getAssignments()", "approveAssignments()", "forceExpire()"
        • Requestors approve jobs and payment
        • Workers (a.k.a. "turkers") choose jobs, do them, get paid
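
The requester-side lifecycle above maps directly onto today's MTurk API. Below is a minimal Python sketch of that loop using the boto3 MTurk client (post a HIT with a price, replication factor, and expiration; poll for submitted assignments; approve them). The sandbox endpoint, title, reward, and externalized question XML are illustrative placeholders, not values from the talk.

```python
# Minimal requester-side HIT lifecycle, sketched with boto3's MTurk client.
import time
import boto3

mturk = boto3.client(
    "mturk",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",  # sandbox
)

# createHit(): post the task with price, replication, and expiration.
hit = mturk.create_hit(
    Title="Label an image",                          # placeholder task
    Description="Answer one simple question about an image",
    Reward="0.02",                                   # price per assignment, in dollars
    MaxAssignments=3,                                # number of replicas
    LifetimeInSeconds=3600,                          # expiration
    AssignmentDurationInSeconds=300,
    Question=open("question.xml").read(),            # HIT question form (omitted here)
)
hit_id = hit["HIT"]["HITId"]

# getAssignments() / approveAssignments(): poll for submitted work and pay for it.
approved = 0
while approved < 3:
    resp = mturk.list_assignments_for_hit(
        HITId=hit_id, AssignmentStatuses=["Submitted"]
    )
    for a in resp["Assignments"]:
        print("worker", a["WorkerId"], "answered:", a["Answer"])
        mturk.approve_assignment(AssignmentId=a["AssignmentId"])
        approved += 1
    if approved < 3:
        time.sleep(30)   # wait for more workers before polling again
```
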
    14. AMT Worker Interface
    15. Microtask Aggregators
    16. Crowdsourcing for Data Management
        • Relational: data cleaning, data entry, information extraction, schema matching, entity resolution, data spaces, building structured KBs, sorting, top-k, …
        • Beyond relational: graph search, classification, transcription, mobile image search, social media analysis, question answering, NLP, text summarization, sentiment analysis, semantic wikis, …
    17. TOWARDS HYBRID CROWD/CLOUD COMPUTING
    18. Not Exactly Crowdsourcing, but…
        "The hope is that, in not too many years, human brains and computing machines will be coupled together very tightly, and that the resulting partnership will think as no human brain has ever thought and process data in a way not approached by the information-handling machines we know today." (J. C. R. Licklider, "Man-Computer Symbiosis", 1960)
    19. AMP: Integrating Diverse Resources
        Algorithms: machine learning and analytics. People: crowdsourcing & human computation. Machines: cloud computing.
    20. The Berkeley AMPLab
        • Goal: data analytics stack integrating A, M & P
          – BDAS: released as BSD/Apache open source
        • 6-year duration: 2011-2017
        • 8 CS faculty; directors: Franklin (DB), Jordan (ML), Stoica (Sys)
        • Industrial support & collaboration
        • NSF Expedition and DARPA XData
    21. People in AMP
        • Long-term goal: make people an integrated part of the system!
          – leverage human activity
          – leverage human intelligence
        • Current AMP "People" projects:
          – Carat: collaborative energy debugging
          – CrowdDB: "The World's Dumbest Database System"
          – CrowdER: hybrid computation for entity resolution
          – CrowdQ: hybrid unstructured query answering
    22. Carat: Leveraging Human Activity. ~500,000 downloads to date. A. J. Oliner et al., Collaborative Energy Debugging for Mobile Devices, Workshop on Hot Topics in System Dependability (HotDep), 2012.
    23. Carat: How It Works. Collaborative detection of energy bugs.
    24. Leveraging Human Intelligence. First attempt: CrowdDB.
        [Architecture diagram: CrowdSQL in, results out; standard components (parser, optimizer, executor, metadata, statistics, files, access methods, disks) plus crowd components (turker relationship manager, UI form editor, UI template manager, HIT manager).]
        See also: Qurk (MIT), Deco (Stanford).
        CrowdDB: Answering Queries with Crowdsourcing, SIGMOD 2011; Query Processing with the VLDB Crowd, VLDB 2011.
    25. DB-hard Queries
        Company_Name               Address                    Market Cap
        Google                     Googleplex, Mtn. View CA   $210Bn
        Intl. Business Machines    Armonk, NY                 $200Bn
        Microsoft                  Redmond, WA                $250Bn

        SELECT Market_Cap FROM Companies WHERE Company_Name = "IBM"
        Number of rows: 0
        Problem: entity resolution. Zero rows is the "correct" answer under exact-match semantics, but it is not the answer a user asking about "IBM" expects.
    26. DB-hard Queries
        Company_Name               Address                    Market Cap
        Google                     Googleplex, Mtn. View CA   $210Bn
        Intl. Business Machines    Armonk, NY                 $200Bn
        Microsoft                  Redmond, WA                $250Bn

        SELECT Market_Cap FROM Companies WHERE Company_Name = "Apple"
        Number of rows: 0
        Problem: closed-world assumption.
    27. DB-hard Queries
        SELECT Image FROM Pictures WHERE Image contains "Good Looking Dog"
        Number of rows: 0
        Problem: subjective comparison.
    28. Leveraging Human Intelligence. First attempt: CrowdDB (same architecture as slide 24).
        Where to use the crowd:
        • cleaning and disambiguation
        • finding missing data
        • making subjective comparisons
        CrowdDB: Answering Queries with Crowdsourcing, SIGMOD 2011; Query Processing with the VLDB Crowd, VLDB 2011.
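
To make "finding missing data" concrete, here is a hypothetical sketch (not CrowdDB's actual implementation) of a crowd-probe step an executor could run: rows with a NULL in a crowdsourced column are turned into microtasks, and the collected answers are written back. The post_hit and collect_answer hooks are invented placeholders for the HIT-posting and answer-gathering machinery.

```python
# Hypothetical crowd-probe sketch: fill missing values in a crowdsourced
# column by asking the crowd (not CrowdDB's actual code).
def crowd_probe(rows, crowd_column, post_hit, collect_answer):
    """post_hit(row, column) -> hit_id posts a fill-in form for one missing value;
    collect_answer(hit_id) -> value blocks until a (quality-checked) answer arrives."""
    pending = {}
    for row in rows:
        if row.get(crowd_column) is None:
            pending[post_hit(row, crowd_column)] = row   # one HIT per missing value
    for hit_id, row in pending.items():
        row[crowd_column] = collect_answer(hit_id)       # write the answer back
    return rows

# Example with stubbed crowd functions:
rows = [{"name": "Google", "hq_address": None},
        {"name": "Microsoft", "hq_address": "Redmond, WA"}]
filled = crowd_probe(rows, "hq_address",
                     post_hit=lambda r, c: "hit-" + r["name"],
                     collect_answer=lambda h: "Googleplex, Mtn. View CA")
print(filled)
```
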
    29. CrowdDB - Worker Interface
    30. Mobile Platform
    31. CrowdSQL
        DDL extensions:

        Crowdsourced columns:
          CREATE TABLE company (
            name STRING PRIMARY KEY,
            hq_address CROWD STRING);

        Crowdsourced tables:
          CREATE CROWD TABLE department (
            university STRING,
            department STRING,
            phone_no STRING)
            PRIMARY KEY (university, department);

        DML extensions:

        CrowdEqual:
          SELECT *
          FROM companies
          WHERE Name ~= "Big Blue"

        CROWDORDER operators (currently UDFs):
          SELECT p FROM picture
          WHERE subject = "Golden Gate Bridge"
          ORDER BY CROWDORDER(p, "Which pic shows better %subject");
    32. CrowdDB Query: Picture Ordering
        Worker task: "Which picture visualizes 'Golden Gate Bridge' better?"
        Query:
          SELECT p FROM picture
          WHERE subject = "Golden Gate Bridge"
          ORDER BY CROWDORDER(p, "Which pic shows better %subject");
        Data size: 30 subject areas, with 8 pictures each
        Batching: 4 orderings per HIT
        Replication: 3 assignments per HIT
        Price: 1 cent per HIT
        (210 HITs total; the whole experiment took 68 minutes.)
        Results compare turker votes, turker ranking, and expert ranking.
    33. User Interface vs. Quality
        [Figure: three alternative worker-interface designs for crowdsourcing a professor/department join (p.dep = d.name, p.name = "carey"), built from MTProbe and MTJoin forms.]
        Error rates: "Department first" ≈10%, "Professor first" ≈10%, "De-normalized probe" ≈80%.
    34. Turker Affinity and Errors. [Figure: errors by turker rank.]
    35. A Bigger Underlying Issue: Closed-World vs. Open-World
    36. What Does This Query Mean?
        SELECT COUNT(*) FROM IceCreamFlavors
        Trushkowsky et al., Crowdsourcing Enumeration Queries, ICDE 2013 (to appear).
    37. Estimating Completeness
        SELECT COUNT(*) FROM US_States
        US states, using Mechanical Turk. Species-estimation techniques perform well on average:
        • uniform distribution under-predicts slightly; coefficient of variation = 0.5
        • decent estimate after 100 HITs
        [Plot: average number of unique answers vs. number of responses (HITs) for the US states query.]
    38. Estimating Completeness
        SELECT COUNT(*) FROM IceCreamFlavors
        • Ice cream flavors:
          – estimators don't converge
          – very highly skewed (CV = 5.8)
          – detect that the number of HITs is insufficient
        [Plot: few, short lists of ice cream flavors at the beginning of the curve (e.g., "alumni swirl, apple cobbler crunch, arboretum breeze, …" from the Penn State Creamery).]
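
The species-estimation idea on these two slides can be made concrete in a few lines: treat each worker response as an observation of a "species" and estimate how much remains unseen from the number of answers observed exactly once and exactly twice. The sketch below uses the classic Chao1 estimator and a simple coefficient-of-variation check as illustrative stand-ins; the slides do not name the exact estimator, so treat this as an illustration from the species-estimation literature, not the authors' method.

```python
# Illustrative sketch: estimate the total number of distinct answers
# (e.g., ice cream flavors) from crowdsourced lists, Chao1-style.
from collections import Counter
import statistics

def chao1_estimate(answers):
    """Lower-bound estimate of the total number of distinct answers.

    answers: iterable of (possibly repeated) answer strings from workers.
    """
    counts = Counter(a.strip().lower() for a in answers)
    s_obs = len(counts)                              # distinct answers seen so far
    f1 = sum(1 for c in counts.values() if c == 1)   # answers seen exactly once
    f2 = sum(1 for c in counts.values() if c == 2)   # answers seen exactly twice
    if f2 == 0:
        return s_obs + f1 * (f1 - 1) / 2             # bias-corrected variant
    return s_obs + (f1 * f1) / (2 * f2)

def coefficient_of_variation(answers):
    """High CV of answer frequencies signals skew, i.e. an unreliable estimate."""
    counts = list(Counter(a.strip().lower() for a in answers).values())
    return statistics.pstdev(counts) / statistics.mean(counts)

# Example with made-up worker responses:
responses = ["vanilla", "chocolate", "vanilla", "strawberry",
             "chocolate", "pistachio", "vanilla", "mint chip"]
print(chao1_estimate(responses), coefficient_of_variation(responses))
```
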
    39. Pay-As-You-Go
        • "I don't believe it is usually possible to estimate the number of species... but only an appropriate lower bound for that number. This is because there is nearly always a good chance that there are a very large number of extremely rare species." (Good, 1953)
        • So instead, we can ask: "What's the benefit of m additional HITs?"
        Ice cream flavors, after 1,500 HITs:
          m     Actual   Shen    Spline
          10    1        1.79    1.62
          50    7        8.91    8.22
          200   39       35.4    32.9
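
For the "benefit of m additional HITs" question, one widely cited predictor from the species-estimation literature is the Shen et al. (2003) formula, which extrapolates from the singleton and doubleton counts. The sketch below is only an assumption about what the "Shen" column refers to, included for intuition; the slide itself does not give the formula.

```python
# Hedged sketch of a Shen-style "expected new answers from m more responses"
# predictor; the exact estimator behind the slide's table is not shown there.
from collections import Counter

def expected_new_answers(answers, m):
    """Predict how many previously unseen answers m more responses would add."""
    counts = Counter(a.strip().lower() for a in answers)
    n = sum(counts.values())                         # responses collected so far
    f1 = sum(1 for c in counts.values() if c == 1)   # answers seen exactly once
    f2 = sum(1 for c in counts.values() if c == 2)   # answers seen exactly twice
    f0_hat = (f1 * f1) / (2 * f2) if f2 else f1 * (f1 - 1) / 2   # est. unseen answers
    if f0_hat == 0:
        return 0.0
    return f0_hat * (1 - (1 - f1 / (n * f0_hat + f1)) ** m)
```
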
    40. CrowdER - Entity Resolution
    41. Hybrid Entity Resolution
        Machine-computed similarity prunes the candidate pairs; only the remaining pairs are sent to the crowd for verification.
        Threshold = 0.2; #pairs = 8,315; #HITs = 508; cost = $38.10; time = 4.5 h (Time(QT) = 20 h).
        J. Wang et al., CrowdER: Crowdsourcing Entity Resolution, PVLDB 2012.
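
A hedged sketch of that hybrid pattern: a cheap machine-side similarity (Jaccard over word tokens below, chosen only as a stand-in) prunes pairs under the threshold, and the surviving pairs are routed to the crowd for yes/no verification. The ask_crowd hook is a hypothetical placeholder for posting a verification HIT, for example via the requester loop sketched earlier.

```python
# Hybrid entity resolution sketch: machine similarity prunes pairs,
# the crowd verifies only the survivors.
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Cheap machine-side similarity over word tokens (illustrative choice)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def hybrid_er(records, threshold=0.2, ask_crowd=None):
    """Return pairs judged to refer to the same entity.

    ask_crowd(r1, r2) -> bool is a placeholder for posting a verification HIT.
    """
    matches = []
    for r1, r2 in combinations(records, 2):
        if jaccard(r1, r2) < threshold:
            continue                     # pruned by the machine pass
        if ask_crowd is None or ask_crowd(r1, r2):
            matches.append((r1, r2))     # confirmed (or assumed) match
    return matches

# Example: pairs that clear the threshold reach the stubbed crowd,
# which here approves everything it is asked about.
records = ["ACM SIGMOD Conference 2011", "SIGMOD Conference 2011", "VLDB 2011"]
print(hybrid_er(records, ask_crowd=lambda a, b: True))
```
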
    42. CrowdQ – Query Generation
        • Help find answers to unstructured queries
          – approach: generate a structured query via templates
            • machines do parsing and ontology lookup
            • people do the rest: verification, entity extraction, etc.
        Demartini et al., CrowdQ: Crowdsourced Query Understanding, CIDR 2013 (to appear).
    43. SO, WHERE DOES MIDDLEWARE FIT IN?
    44. Generic Architecture
        "Middleware is the software that resides between applications and the underlying architecture. The goal of middleware is to facilitate the development of applications by providing higher-level abstractions for better programmability, performance, scalability, security, and a variety of essential features." (Middleware 2012 web page)
        [Diagram: applications running on top of a hybrid crowd/cloud platform.]
    45. The Challenge
        Some issues: incentives; latency & prediction; failure modes; work conditions; interface; task structuring; task routing; …
    46. Can you incentivize workers?
        http://waxy.org/2008/11/the_faces_of_mechanical_turk/
    47. Incentives
    48. Can you trust the crowd?
        On Wikipedia, "any user can change any entry, and if enough users agree with them, it becomes true."
        "The Elephant population in Africa has tripled over the past six months." [1]
        Wikiality: reality as decided on by majority rule. [2]
        [1] http://en.wikipedia.org/wiki/Cultural_impact_of_The_Colbert_Report
        [2] http://www.urbandictionary.com/define.php?term=wikiality
    49. Answer Quality Approaches
        • Some general techniques:
          – approval rate / demographic restrictions
          – qualification tests
          – gold sets / honey pots
          – redundancy and voting
          – statistical measures and bias reduction
          – verification / review
        • Query-specific techniques
        • Worker relationship management
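
Two of those general techniques, gold sets and redundancy with voting, are easy to show concretely. The sketch below is a generic illustration rather than any particular system's implementation: workers who fail known-answer gold questions are dropped, and the remaining redundant answers are combined per item by majority vote.

```python
# Generic quality-control sketch: filter workers via gold questions,
# then majority-vote over redundant answers per item.
from collections import Counter, defaultdict

def trusted_workers(gold_answers, worker_answers, min_accuracy=0.8):
    """Keep workers whose accuracy on gold (known-answer) items is high enough.

    gold_answers: {item_id: correct_answer}
    worker_answers: {worker_id: {item_id: answer}}
    """
    trusted = set()
    for worker, answers in worker_answers.items():
        graded = [(item, ans) for item, ans in answers.items() if item in gold_answers]
        if not graded:
            continue
        correct = sum(1 for item, ans in graded if ans == gold_answers[item])
        if correct / len(graded) >= min_accuracy:
            trusted.add(worker)
    return trusted

def majority_vote(worker_answers, trusted):
    """Combine redundant answers per item, counting only trusted workers."""
    votes = defaultdict(Counter)
    for worker, answers in worker_answers.items():
        if worker not in trusted:
            continue
        for item, ans in answers.items():
            votes[item][ans] += 1
    return {item: counter.most_common(1)[0][0] for item, counter in votes.items()}

# Tiny example: worker w3 fails the gold question and is ignored.
gold = {"g1": "yes"}
answers = {
    "w1": {"g1": "yes", "q1": "IBM"},
    "w2": {"g1": "yes", "q1": "IBM"},
    "w3": {"g1": "no",  "q1": "Apple"},
}
print(majority_vote(answers, trusted_workers(gold, answers)))   # {'g1': 'yes', 'q1': 'IBM'}
```
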
    50. Can you organize the crowd?
        Soylent, a prototype word processor with a crowd inside: independent agreement to identify patches; randomize the order of suggestions.
        [Bernstein et al., Soylent: A Word Processor with a Crowd Inside, UIST 2010]
    51. Can You Predict the Crowd? Streakers; list walking.
    52. Can you build a low-latency crowd?
        From: M. S. Bernstein, J. Brandt, R. C. Miller, D. R. Karger, "Crowds in Two Seconds: Enabling Realtime Crowdsourced Applications", UIST 2011.
    53. Can you help the crowd?
    54. For More Information
        Crowdsourcing tutorials:
        • P. Ipeirotis, Managing Crowdsourced Human Computation, WWW '11, March 2011.
        • O. Alonso, M. Lease, Crowdsourcing for Information Retrieval: Principles, Methods, and Applications, SIGIR, July 2011.
        • A. Doan, M. Franklin, D. Kossmann, T. Kraska, Crowdsourcing Applications and Platforms: A Data Management Perspective, VLDB 2011.
        AMPLab: amplab.cs.berkeley.edu
        • papers
        • project descriptions and pages
        • news updates and blogs
