Middleware 2012 Crowd

An overview of issues and early work on combining human computation and scalable computing to tackle big data analytics problems. Includes a survey of relevant projects underway at the UC Berkeley AMPLab.

Speaker notes
  • For the database administrator this is the correct answer, but for the CEO it is not really understandable.
  • Exact equality is not a good fit.
  • 210 HITs; it took 68 minutes to complete the whole experiment.
  • Lead off by saying that a heavily skewed distribution will be difficult to estimate; only a lower bound is possible (see the Good quote). Instead, reason about the cost vs. benefit tradeoff: when you ask a slightly different question, you can still make progress.
  • General techniques (non-DB techniques).
  • Transcript

    • 1. Integrating Crowd & Cloud Resources for Big Data. Michael Franklin. Middleware 2012, Montreal, December 6, 2012. UC Berkeley, NSF Expeditions in Computing.
    • 2. CROWDSOURCING: WHAT IS IT?
    • 3. Citizen Science: NASA "Clickworkers", 2000
    • 4. Citizen Journalism / Participatory Sensing
    • 5. Communities & Expertise
    • 6. Data Collection & Curation, e.g., Freebase
    • 7. An Academic View. From Quinn & Bederson, "Human Computation: A Survey and Taxonomy of a Growing Field", CHI 2011.
    • 8. How Industry Looks At It
    • 9. Useful Taxonomies
      – Doan, Halevy, Ramakrishnan (Crowdsourcing), CACM 4/11: nature of collaboration (implicit vs. explicit); architecture (standalone vs. piggybacked); must recruit users/workers? (yes or no); what do users/workers do?
      – Bederson & Quinn (Human Computation), CHI '11: motivation (pay, altruism, enjoyment, reputation); quality control (many mechanisms); aggregation (how are results combined?); human skill (visual recognition, language, ...); ...
    • 10. Types of Tasks (inspired by the report "Paid Crowdsourcing", Smartsheet.com, 9/15/2009)
      – Complex Tasks: build a website; develop a software system; overthrow a government?
      – Simple Projects: design a logo and visual identity; write a term paper
      – Macro Tasks: write a restaurant review; test a new website feature; identify a galaxy
      – Micro Tasks: label an image; verify an address; simple entity resolution
    • 11. MICRO-TASK MARKETPLACES
    • 12. Amazon Mechanical Turk (AMT)
    • 13. Microtasking – Virtualized Humans
      – Current leader: Amazon Mechanical Turk
      – Requesters place Human Intelligence Tasks (HITs): set a price per "assignment" (usually cents); specify the number of replicas (assignments), expiration, etc.; worker-facing user interface; API-based: createHit(), getAssignments(), approveAssignments(), forceExpire()
      – Requesters approve jobs and payment
      – Workers (a.k.a. "turkers") choose jobs, do them, get paid
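      To make the requester-side workflow concrete, here is a minimal sketch of the HIT lifecycle named on the slide (create a HIT, poll for submitted assignments, approve them) using the boto3 Mechanical Turk client against the sandbox endpoint. The question form, reward, counts, and polling loop are illustrative assumptions, not code from the talk.

```python
# Minimal, illustrative requester-side sketch of the HIT lifecycle (assumed
# setup; not code from the talk), using the boto3 MTurk client in the sandbox.
import time
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# A hypothetical free-text question asking for a missing attribute value.
QUESTION_XML = """<QuestionForm xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2005-10-01/QuestionForm.xsd">
  <Question>
    <QuestionIdentifier>hq_address</QuestionIdentifier>
    <QuestionContent><Text>What is the headquarters address of IBM?</Text></QuestionContent>
    <AnswerSpecification><FreeTextAnswer/></AnswerSpecification>
  </Question>
</QuestionForm>"""

# createHit(): price per assignment, number of replicas, expiration.
hit = mturk.create_hit(
    Title="Fill in a missing company address",
    Description="Provide the headquarters address for the named company.",
    Keywords="data entry, lookup",
    Reward="0.05",                   # dollars per assignment
    MaxAssignments=3,                # replicas
    LifetimeInSeconds=3600,          # expiration
    AssignmentDurationInSeconds=300,
    Question=QUESTION_XML,
)
hit_id = hit["HIT"]["HITId"]

# getAssignments() / approveAssignments(): poll for worker answers, then pay.
approved = 0
while approved < 3:
    result = mturk.list_assignments_for_hit(
        HITId=hit_id, AssignmentStatuses=["Submitted"]
    )
    for assignment in result["Assignments"]:
        print(assignment["WorkerId"], assignment["Answer"])  # raw answer XML
        mturk.approve_assignment(AssignmentId=assignment["AssignmentId"])
        approved += 1
    if approved < 3:
        time.sleep(30)
```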
    • 14. AMT Worker Interface
    • 15. Microtask Aggregators
    • 16. Crowdsourcing for Data Management
      – Relational: data cleaning, data entry, information extraction, schema matching, entity resolution, data spaces, building structured KBs, sorting, top-k, ...
      – Beyond relational: graph search, classification, transcription, mobile image search, social media analysis, question answering, NLP, text summarization, sentiment analysis, semantic wikis, ...
    • 17. TOWARDS HYBRID CROWD/CLOUD COMPUTING
    • 18. Not Exactly Crowdsourcing, but... "The hope is that, in not too many years, human brains and computing machines will be coupled together very tightly, and that the resulting partnership will think as no human brain has ever thought and process data in a way not approached by the information-handling machines we know today." (J. C. R. Licklider, "Man-Computer Symbiosis", 1960)
    • 19. AMP: Integrating Diverse Resources. Algorithms: machine learning and analytics. Machines: cloud computing. People: crowdsourcing and human computation.
    • 20. The Berkeley AMPLab
      – Goal: a data analytics stack integrating A, M & P; BDAS released as BSD/Apache open source
      – 6-year duration: 2011-2017
      – 8 CS faculty; directors: Franklin (DB), Jordan (ML), Stoica (Sys)
      – Industrial support and collaboration
      – NSF Expedition and DARPA XData
    • 21. People in AMP
      – Long-term goal: make people an integrated part of the system, leveraging both human activity and human intelligence alongside machines and algorithms
      – Current AMP people projects: Carat (collaborative energy debugging), CrowdDB ("the world's dumbest database system"), CrowdER (hybrid computation for entity resolution), CrowdQ (hybrid unstructured query answering)
    • 22. Carat: Leveraging Human Activity. ~500,000 downloads to date. A. J. Oliner et al., Collaborative Energy Debugging for Mobile Devices, Workshop on Hot Topics in System Dependability (HotDep), 2012.
    • 23. Carat: How It Works. Collaborative detection of energy bugs.
    • 24. Leveraging Human Intelligence. First attempt: CrowdDB (CrowdSQL in, results out). Architecture: standard DBMS components (parser, optimizer, executor, statistics, metadata manager, files and access methods) extended with crowd-specific components (turker relationship manager, UI form editor, UI template manager, HIT manager). See also: Qurk (MIT), Deco (Stanford). CrowdDB: Answering Queries with Crowdsourcing, SIGMOD 2011; Query Processing with the VLDB Crowd, VLDB 2011.
    • 25. DB-hard Queries
      Company_Name              Address                    Market_Cap
      Google                    Googleplex, Mtn. View CA   $210Bn
      Intl. Business Machines   Armonk, NY                 $200Bn
      Microsoft                 Redmond, WA                $250Bn
      SELECT Market_Cap FROM Companies WHERE Company_Name = "IBM"
      Number of rows: 0. Problem: entity resolution.
    • 26. DB-hard Queries (same Companies table)
      SELECT Market_Cap FROM Companies WHERE Company_Name = "Apple"
      Number of rows: 0. Problem: closed-world assumption.
    • 27. DB-hard Queries
      SELECT Image FROM Pictures WHERE Image contains "Good Looking Dog"
      Number of rows: 0. Problem: subjective comparison.
    • 28. Leveraging Human Intelligence. First attempt: CrowdDB. Where to use the crowd: cleaning and disambiguation; finding missing data; making subjective comparisons. CrowdDB: Answering Queries with Crowdsourcing, SIGMOD 2011; Query Processing with the VLDB Crowd, VLDB 2011.
    • 29. CrowdDB – Worker Interface
    • 30. Mobile Platform
    • 31. CrowdSQL
      DDL extensions:
        Crowdsourced columns:
          CREATE TABLE company (
            name STRING PRIMARY KEY,
            hq_address CROWD STRING);
        Crowdsourced tables:
          CREATE CROWD TABLE department (
            university STRING,
            department STRING,
            phone_no STRING,
            PRIMARY KEY (university, department));
      DML extensions:
        CrowdEqual:
          SELECT * FROM companies WHERE Name ~= "Big Blue"
        CROWDORDER operators (currently UDFs):
          SELECT p FROM picture
          WHERE subject = "Golden Gate Bridge"
          ORDER BY CROWDORDER(p, "Which pic shows better %subject");
    • 32. CrowdDB Query: Picture Ordering ("Which picture visualizes 'Golden Gate Bridge' better?")
      SELECT p FROM picture
      WHERE subject = "Golden Gate Bridge"
      ORDER BY CROWDORDER(p, "Which pic shows better %subject");
      Data size: 30 subject areas, with 8 pictures each. Batching: 4 orderings per HIT. Replication: 3 assignments per HIT. Price: 1 cent per HIT. Results chart compares turker votes, turker ranking, and expert ranking.
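      As an illustration of how the replicated comparison HITs from this experiment might be combined (my own sketch, not CrowdDB's implementation), the snippet below majority-votes each pair and then ranks pictures by their number of pairwise wins.

```python
# Illustrative sketch (not CrowdDB's code): aggregate redundant pairwise
# "which picture is better?" votes into a ranking by counting pairwise wins.
from collections import Counter

# Hypothetical raw votes: (picture_a, picture_b, winner), 3 assignments per pair.
votes = [
    ("p1", "p2", "p1"), ("p1", "p2", "p1"), ("p1", "p2", "p2"),
    ("p1", "p3", "p3"), ("p1", "p3", "p3"), ("p1", "p3", "p1"),
    ("p2", "p3", "p3"), ("p2", "p3", "p3"), ("p2", "p3", "p3"),
]

def rank_by_wins(votes):
    """Majority-vote each pair, then order pictures by number of pairwise wins."""
    pair_votes = {}
    for a, b, winner in votes:
        pair_votes.setdefault((a, b), Counter())[winner] += 1
    wins = Counter()
    for (a, b), counts in pair_votes.items():
        wins.setdefault(a, 0)
        wins.setdefault(b, 0)
        wins[counts.most_common(1)[0][0]] += 1   # pair's majority winner
    return [picture for picture, _ in wins.most_common()]

print(rank_by_wins(votes))   # -> ['p3', 'p1', 'p2']
```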
    • 33. User Interface vs. Quality. Three form designs for crowdsourcing professor and department data (a join of Professor and Department on p.department = d.name): collect the department record first, collect the professor record first, or probe for a single de-normalized record. Observed error rates: ≈10% (department first), ≈10% (professor first), ≈80% (de-normalized probe).
    • 34. Turker Affinity and Errors (chart: errors by turker rank)
    • 35. A Bigger Underlying Issue: Closed-World vs. Open-World
    • 36. What Does This Query Mean? SELECT COUNT(*) FROM IceCreamFlavors. Trushkowsky et al., Crowdsourcing Enumeration Queries, ICDE 2013 (to appear).
    • 37. Estimating Completeness: SELECT COUNT(*) FROM US_States, using Mechanical Turk. Species estimation techniques perform well on average: the uniform estimator under-predicts slightly (coefficient of variation = 0.5), and gives a decent estimate after about 100 HITs. Chart: average number of unique answers (out of 50 states) vs. number of responses (HITs).
    • 38. Estimating Completeness: SELECT COUNT(*) FROM IceCreamFlavors. For ice cream flavors the estimators don't converge: the answer distribution is very highly skewed (CV = 5.8), but we can detect that the number of HITs is insufficient. At the beginning of the curve there are a few short lists of flavors (e.g., "alumni swirl, apple cobbler crunch, arboretum breeze, ..." from the Penn State Creamery).
    • 39. Pay-as-you-go. "I don't believe it is usually possible to estimate the number of species... but only an appropriate lower bound for that number. This is because there is nearly always a good chance that there are a very large number of extremely rare species" (Good, 1953). So instead, we can ask: "What's the benefit of m additional HITs?" Ice cream flavors after 1,500 HITs (new unique answers from m more HITs):
      m     Actual   Shen   Spline
      10    1        1.79   1.62
      50    7        8.91   8.22
      200   39       35.4   32.9
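      The slides quote the Shen estimate but not its formula; as a rough illustration (my assumption, not necessarily the exact estimators used in the ICDE 2013 paper), the sketch below computes the classic Chao1 lower bound on the number of distinct answers and the Shen et al. (2003) prediction of how many new answers m additional HITs would uncover.

```python
# Rough illustration (assumed estimators, not necessarily the paper's exact ones):
# Chao1 lower bound on distinct items, and the Shen et al. (2003) prediction of
# how many NEW items would appear in m additional answers.
from collections import Counter

def chao1(answers):
    """Chao1 lower-bound estimate of the total number of distinct items."""
    freq = Counter(answers)                          # item -> times reported
    s_obs = len(freq)                                # distinct items seen
    f1 = sum(1 for c in freq.values() if c == 1)     # items seen exactly once
    f2 = sum(1 for c in freq.values() if c == 2)     # items seen exactly twice
    if f2 == 0:
        return s_obs + f1 * (f1 - 1) / 2.0           # bias-corrected variant
    return s_obs + f1 * f1 / (2.0 * f2)

def expected_new_items(answers, m):
    """Shen/Chao/Lin (2003) estimate of new distinct items in m more answers."""
    freq = Counter(answers)
    n = len(answers)
    f1 = sum(1 for c in freq.values() if c == 1)
    f0_hat = chao1(answers) - len(freq)              # estimated unseen items
    if f0_hat <= 0 or f1 == 0:
        return 0.0
    return f0_hat * (1.0 - (1.0 - f1 / (n * f0_hat + f1)) ** m)

# Toy usage with hypothetical ice-cream-flavor answers:
answers = ["vanilla"] * 40 + ["chocolate"] * 30 + ["alumni swirl", "arboretum breeze"]
print(chao1(answers), expected_new_items(answers, m=50))
```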
    • 40. CrowdER: Entity Resolution
    • 41. Hybrid Entity Resolution. With a similarity threshold of 0.2: #pairs = 8,315; #HITs = 508; cost = $38.10; time = 4.5 h; Time(QT) = 20 h. J. Wang et al., CrowdER: Crowdsourcing Entity Resolution, PVLDB 2012.
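      A minimal sketch of the hybrid machine-plus-crowd idea behind these numbers (the Jaccard similarity and product strings are my illustrative assumptions, not CrowdER's exact algorithm): the machine cheaply discards pairs whose similarity falls below the threshold, and only the surviving candidate pairs become yes/no HITs for the crowd.

```python
# Illustrative hybrid entity-resolution sketch (not CrowdER's exact algorithm):
# machine-side Jaccard similarity prunes clearly non-matching pairs; only pairs
# above the threshold (0.2 on the slide) are sent to the crowd for verification.
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between the token sets of two record strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def candidate_pairs(records, threshold=0.2):
    """Keep only pairs similar enough to be worth asking the crowd about."""
    return [
        (r1, r2)
        for r1, r2 in combinations(records, 2)
        if jaccard(r1, r2) >= threshold
    ]

records = [                                  # hypothetical product descriptions
    "iPad Two 16GB WiFi White",
    "iPad 2nd generation 16GB WiFi White",
    "Canon EOS 60D Digital SLR",
]
for r1, r2 in candidate_pairs(records):
    # Each surviving pair would become a yes/no HIT such as:
    # "Do these two descriptions refer to the same product?"
    print("ASK CROWD:", r1, "|", r2)
```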
    • 42. CrowdQ: Query Generation
      – Help find answers to unstructured queries. Approach: generate a structured query via templates
      – Machines do parsing and ontology lookup
      – People do the rest: verification, entity extraction, etc.
      Demartini et al., CrowdQ: Crowdsourced Query Understanding, CIDR 2013 (to appear).
    • 43. SO, WHERE DOES MIDDLEWARE FIT IN?
    • 44. Generic Architecture: applications running on top of a hybrid crowd/cloud platform. "Middleware is the software that resides between applications and the underlying architecture. The goal of middleware is to facilitate the development of applications by providing higher-level abstractions for better programmability, performance, scalability, security, and a variety of essential features." (Middleware 2012 web page)
    • 45. The Challenge. Some issues: incentives; latency and prediction; failure modes; work conditions; interface; task structuring; task routing; ...
    • 46. Can you incentivize workers? http://waxy.org/2008/11/the_faces_of_mechanical_turk/
    • 47. Incentives
    • 48. Can you trust the crowd? On Wikipedia, "any user can change any entry, and if enough users agree with them, it becomes true." "The Elephant population in Africa has tripled over the past six months." [1] Wikiality: reality as decided on by majority rule. [2]
      [1] http://en.wikipedia.org/wiki/Cultural_impact_of_The_Colbert_Report
      [2] http://www.urbandictionary.com/define.php?term=wikiality
    • 49. Answer Quality Approaches
      – Some general techniques: approval rate / demographic restrictions; qualification tests; gold sets / honey pots; redundancy and voting; statistical measures and bias reduction; verification/review
      – Query-specific techniques
      – Worker relationship management
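      To make the "gold sets" and "redundancy and voting" bullets concrete, here is a sketch of my own (not from the slides): workers whose accuracy on known-answer gold questions falls below a cutoff are dropped, and the remaining workers' answers are combined by majority vote.

```python
# Illustrative sketch (not from the slides) of two common quality controls:
# gold-set filtering of unreliable workers, then per-task majority voting.
from collections import Counter, defaultdict

GOLD = {"g1": "yes", "g2": "no"}     # hypothetical questions with known answers

def reliable_workers(answers, min_accuracy=0.75):
    """Keep workers who score well enough on the gold (known-answer) tasks."""
    scores = defaultdict(lambda: [0, 0])             # worker -> [correct, seen]
    for worker, task, answer in answers:
        if task in GOLD:
            scores[worker][1] += 1
            scores[worker][0] += int(answer == GOLD[task])
    return {w for w, (correct, seen) in scores.items()
            if correct / seen >= min_accuracy}

def majority_vote(answers, trusted):
    """Majority vote over trusted workers' answers for each non-gold task."""
    by_task = defaultdict(Counter)
    for worker, task, answer in answers:
        if worker in trusted and task not in GOLD:
            by_task[task][answer] += 1
    return {task: votes.most_common(1)[0][0] for task, votes in by_task.items()}

answers = [                                          # (worker, task, answer)
    ("w1", "g1", "yes"), ("w1", "g2", "no"), ("w1", "t1", "match"),
    ("w2", "g1", "no"),  ("w2", "g2", "no"), ("w2", "t1", "no match"),
    ("w3", "g1", "yes"), ("w3", "g2", "no"), ("w3", "t1", "match"),
]
trusted = reliable_workers(answers)
print(majority_vote(answers, trusted))               # {'t1': 'match'}
```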
    • 50. Can you organize the crowd? Soylent, a prototype word processor with a crowd inside: independent agreement to identify patches; randomized order of suggestions. [Bernstein et al., Soylent: A Word Processor with a Crowd Inside, UIST 2010]
    • 51. Can You Predict the Crowd? (Streakers, list walking)
    • 52. Can you build a low-latency crowd? From: M. S. Bernstein, J. Brandt, R. C. Miller, D. R. Karger, "Crowds in Two Seconds: Enabling Realtime Crowdsourced Applications", UIST 2011.
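      The cited paper's central idea is keeping workers "on retainer" so real-time tasks can be answered within seconds; below is a toy simulation of that retainer pattern under my own assumptions (queues standing in for pre-recruited workers), not the system described in the paper.

```python
# Toy simulation of the "retainer" pattern for low-latency crowdsourcing
# (my own assumptions, not the UIST 2011 system): workers are recruited ahead
# of time and wait on standby, so a real task can be dispatched immediately.
import queue
import threading
import time

retainer_pool = queue.Queue()          # workers currently waiting on retainer

def worker_on_retainer(worker_id):
    """Simulated worker who registers, waits, and answers as soon as pinged."""
    inbox = queue.Queue()
    retainer_pool.put((worker_id, inbox))
    task = inbox.get()                 # blocks until a task is dispatched
    print(f"{worker_id} answered {task!r} after {time.time() - task_posted:.2f}s")

def dispatch(task, replicas=2):
    """Send the task to workers already in the pool: no recruiting delay."""
    for _ in range(replicas):
        worker_id, inbox = retainer_pool.get()
        inbox.put(task)

# Recruit three workers ahead of time, then dispatch a task with near-zero latency.
for wid in ("w1", "w2", "w3"):
    threading.Thread(target=worker_on_retainer, args=(wid,), daemon=True).start()
time.sleep(0.1)                        # let the simulated workers register
task_posted = time.time()
dispatch("Is this photo blurry?")
time.sleep(0.5)                        # give workers time to print their answers
```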
    • 53. Can you help the crowd?
    • 54. For More Information
      Crowdsourcing tutorials:
      – P. Ipeirotis, Managing Crowdsourced Human Computation, WWW '11, March 2011.
      – O. Alonso, M. Lease, Crowdsourcing for Information Retrieval: Principles, Methods, and Applications, SIGIR, July 2011.
      – A. Doan, M. Franklin, D. Kossmann, T. Kraska, Crowdsourcing Applications and Platforms: A Data Management Perspective, VLDB 2011.
      AMPLab: amplab.cs.berkeley.edu (papers, project descriptions and pages, news updates and blogs)