Query Optimization over Crowdsourced Data
Transcript

  • 1. Query Optimization over Crowdsourced Data
    Hyunjung Park, Jennifer Widom
    Stanford University
  • 2. Deco: Declarative Crowdsourcing
    8/27/2013, Hyunjung Park
    User query: “Find the capitals of eight Spanish-speaking countries”
    Crowd questions:
      –  Give me a Spanish-speaking country.
      –  Give me a country.
      –  What language do they speak in country X?
      –  What is the capital of country X?
    Table shown in the diagram (labels: DBMS, Deco System):
      country | language | capital
      Italy   | Italian  | Rome
      Spain   | Spanish  | Madrid
      …       | …        | …
  • 3. Deco Query Optimization
    •  Crowd incurs monetary cost
    •  Some query plans are much cheaper than others
    •  Cost estimation is complicated by:
      –  Previously collected data
      –  Unknown database state
      –  Inconsistency of human answers
  • 4. Outline
    •  Motivating example
    •  Deco data model and queries
    •  Cost and cardinality estimation
    •  Experimental results
    Everything implemented in full prototype
  • 5. Motivating Example: Plan 1
    Query: “Find the capitals of eight Spanish-speaking countries”
    Crowd questions in the plan:
      –  Give me a country.
      –  What language do they speak in country X?
      –  What is the capital of country X?
    [Flowchart: checks labeled “unseen” and “Spanish”, each with T/F branches; loop labeled 8x]
  • 6. Motivating Example: Plan 2
    Query: “Find the capitals of eight Spanish-speaking countries”
    Crowd questions in the plan:
      –  Give me a Spanish-speaking country.
      –  What language do they speak in country X?
      –  What is the capital of country X?
    [Flowchart: checks labeled “unseen” and “Spanish”, each with T/F branches; loop labeled 8x]
  • 7. Preview of Experimental Results
    [Bar chart: actual costs spent on Mechanical Turk ($, axis 0 to 15) for Plan 1 and Plan 2, broken down by question: “What is the capital of country X?”, “What language do they speak in country X?”, “Give me a Spanish-speaking country.”, “Give me a country.”]
  • 8. Outline
    •  Motivating example
    •  Deco data model and queries
    •  Cost and cardinality estimation
    •  Experimental results
  • 9. Deco: Data Model (1/2)
    •  Conceptual Relation: visible to end-users
       Country (country, language, capital)
    •  Resolution Rules: cleanse raw data using UDFs
      –  country: dupElim
      –  language: majority(3)
      –  capital: majority(3)
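The resolution rules on this slide can be pictured as small user-defined functions over the raw crowd answers. The following is a minimal Python sketch, not the Deco implementation: the function names `dup_elim` and `majority`, the case-insensitive duplicate matching, and the reading of majority(3) as "resolved once two of at most three answers agree" are all assumptions made for illustration.

```python
from collections import Counter


def dup_elim(values):
    """Keep the first occurrence of each value.

    Assumption: duplicates are matched ignoring case and surrounding spaces.
    """
    seen = set()
    result = []
    for value in values:
        key = value.strip().lower()
        if key not in seen:
            seen.add(key)
            result.append(value.strip())
    return result


def majority(answers, n=3):
    """Return the value that holds a strict majority of n among the first n answers.

    Assumption: with n=3 a value is resolved as soon as two answers agree;
    otherwise the attribute stays unresolved and None is returned.
    """
    if not answers:
        return None
    needed = n // 2 + 1
    value, count = Counter(answers[:n]).most_common(1)[0]
    return value if count >= needed else None
```

Under this reading, a capital is resolved after two matching answers and needs a third answer only when the first two disagree.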
  • 10. Deco: Data Model (2/2)
    •  Fetch Rules: “access methods” for the crowd
      –  language => country: “Give me a {language}-speaking country.”
      –  Ø => country: “Give me a country.”
      –  country => language: “What language do they speak in {country}?”
      –  country => capital: “What is the capital of {country}?”
    Price labels on the slide: [$0.05] [$0.01] [$0.02] [$0.03]
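One way to picture a fetch rule is as a question template with the left-hand-side attributes as inputs. The Python sketch below is only an illustration under that assumption: the `FetchRule` class, its field names, and the `question` method are hypothetical and not part of Deco. Prices are left out because the slide text does not say which price belongs to which rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FetchRule:
    """Hypothetical representation of a fetch rule: given => asks."""
    given: tuple    # attributes that must already be known (empty for Ø)
    asks: str       # attribute the crowd is asked to supply
    template: str   # question text with {attribute} placeholders

    def question(self, **known):
        """Fill the template with known attribute values."""
        missing = [attr for attr in self.given if attr not in known]
        if missing:
            raise ValueError(f"missing attribute values: {missing}")
        return self.template.format(**known)


# The four fetch rules from the slide.
RULES = [
    FetchRule(("language",), "country", "Give me a {language}-speaking country."),
    FetchRule((), "country", "Give me a country."),
    FetchRule(("country",), "language", "What language do they speak in {country}?"),
    FetchRule(("country",), "capital", "What is the capital of {country}?"),
]
```

For example, `RULES[0].question(language="Spanish")` produces the question used in Plan 2 of the motivating example.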
  • 11. Deco: Queries
    •  Deco query: SQL query over conceptual relations
       SELECT country, capital FROM Country WHERE language='Spanish' MINTUPLES 8
    •  Query processor: access the crowd as needed to produce query result while:
      1.  Minimizing monetary cost
      2.  Reducing latency
    Component labels on the slide: query optimizer, query execution engine
  • 12. Query Optimization
    •  Find the best query plan in terms of estimated monetary cost
    •  As in a traditional query optimizer:
      1.  Cost and cardinality estimation
      2.  Search space
      3.  Plan enumeration algorithm
  • 13. Cost Estimation
    •  Total monetary cost = ∑_{Fetch F} F.price × F.cardinality
      –  Existing data is “free”
    •  Definition of Cardinality in Deco
      –  Total number of expected output tuples from operator until query execution terminates
    •  Cardinality estimation
      –  Final database state needs to be estimated simultaneously
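The cost formula on this slide sums price times cardinality over the Fetch operators of a plan. The Python sketch below illustrates that sum with prices in integer cents to avoid floating-point rounding. The helper `new_fetches`, which models "existing data is free" as subtracting already-stored answers from the number needed, is a simplifying assumption, and the numbers in the usage note are made up.

```python
def new_fetches(needed, already_stored):
    """Answers that still have to be bought from the crowd.

    Assumption: stored answers are reused one-for-one and cost nothing.
    """
    return max(0, needed - already_stored)


def total_cost_cents(fetch_ops):
    """Sum of price x cardinality over (price_cents, cardinality) pairs."""
    return sum(price_cents * cardinality for price_cents, cardinality in fetch_ops)
```

With a hypothetical rule priced at 5 cents, 20 answers needed, and 8 already stored, `total_cost_cents([(5, new_fetches(20, 8))])` charges only for the 12 new answers.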
  • 14. Cardinality Estimation: Setting
    •  $0.05 for all fetch rules
    •  No existing data
    •  Selectivity factors
      –  language='Spanish': 0.1
      –  dupElim: 0.8
      –  majority(3): 0.4 (=1/2.5)
  • 15. Cardinality Estimation: Plan 1
    Query: SELECT country, capital FROM Country WHERE language='Spanish' MINTUPLES 8
    Plan operators: MinTuples[8], Project[co,ca], DLOJoin[co], DLOJoin[co], Resolve[dupeli], Resolve[maj3], Resolve[maj3], Filter[la='Spanish'], Scan [CtryA], Fetch [Ø→co], Scan [CtryD2], Fetch [co→ca], Scan [CtryD1], Fetch [co→la]
    Operator numbers on the slide: 1, 2, 3, 4, 12, 5, 13, 9, 6, 7, 8, 10, 11, 14
    Fetch rules: Ø => country, country => language, country => capital
    Cardinality labels on the slide: 200, 20, 100
    Cost estimation: $0.05×(100+200+20) = $16.00
  • 16. Cardinality Estimation: Plan 2
    Query: SELECT country, capital FROM Country WHERE language='Spanish' MINTUPLES 8
    Plan operators: MinTuples[8], Project[co,ca], DLOJoin[co], DLOJoin[co], Resolve[dupeli], Resolve[maj3], Resolve[maj3], Filter[la='Spanish'], Scan [CtryA], Fetch [la→co], Scan [CtryD2], Fetch [co→ca], Scan [CtryD1], Fetch [co→la]
    Operator numbers on the slide: 1, 2, 3, 4, 12, 5, 13, 9, 6, 7, 8a, 10, 11, 14
    Fetch rules: language => country, country => language, country => capital
    Cardinality labels on the slide: 20, 10, 20
    Cost estimation: $0.05×(10+20+20) = $2.50
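The fetch cardinalities in both cost estimates can be reproduced by working backward from MINTUPLES 8 through the selectivity factors of the setting slide. The Python sketch below is one reading of how the slide's numbers arise, not the estimation algorithm from the paper; which cardinality belongs to which fetch rule is inferred from the order of the terms in the cost formulas, and all names are hypothetical.

```python
from fractions import Fraction

PRICE_CENTS = 5                   # $0.05 for all fetch rules
SEL_SPANISH = Fraction(1, 10)     # language='Spanish'
SEL_DUPELIM = Fraction(4, 5)      # dupElim
SEL_MAJ3 = Fraction(2, 5)         # majority(3), i.e. 2.5 answers per resolved value
MIN_TUPLES = 8


def plan1_fetches():
    """Plan 1: fetch arbitrary countries, then filter on language."""
    countries = MIN_TUPLES / SEL_SPANISH          # 80 distinct countries needed
    return {
        "Ø => country": countries / SEL_DUPELIM,        # 100 raw country answers
        "country => language": countries / SEL_MAJ3,    # 200 language answers
        "country => capital": MIN_TUPLES / SEL_MAJ3,    # 20 capital answers
    }


def plan2_fetches():
    """Plan 2: ask directly for Spanish-speaking countries.

    Assumption: every country returned by the crowd passes the filter.
    """
    return {
        "language => country": MIN_TUPLES / SEL_DUPELIM,  # 10 raw country answers
        "country => language": MIN_TUPLES / SEL_MAJ3,     # 20 language answers
        "country => capital": MIN_TUPLES / SEL_MAJ3,      # 20 capital answers
    }


def cost_cents(fetches):
    """Price times cardinality, summed over all fetch rules of a plan."""
    return PRICE_CENTS * sum(fetches.values())
```

Under these assumptions `cost_cents(plan1_fetches())` is 1600 cents and `cost_cents(plan2_fetches())` is 250 cents, matching the $16.00 and $2.50 estimates on the two slides.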
  • 17. Experimental Results
    [Bar charts: actual cost ($) for Plan 1 (axis 0 to 15) and Plan 2 (axis 0 to 3), broken down by fetch rule: country => capital, country => language, language => country, Ø => country]
  • 18. Experimental Results
    [Bar charts: actual vs. estimated cost ($) for Plan 1 (axis 0 to 15) and Plan 2 (axis 0 to 3), broken down by fetch rule: country => capital, country => language, language => country, Ø => country]
  • 19. Related Work
    •  Declarative approach for crowdsourcing
      –  Arnold, CrowdDB, CrowdSearcher, Jabberwocky, Qurk, ...
    •  Crowd-powered algorithms/operations
      –  Filter, sort, join, max, entity resolution, …
    •  Also:
      –  Traditional query optimization
      –  Heterogeneous or federated database systems
  • 20. Summary
    •  Cost estimation in Deco
      –  Distinguish between existing data and new data
      –  Estimate cardinality and final database state simultaneously
    •  In the paper:
      –  Full description of cost estimation and plan enumeration algorithms
      –  More experimental results
  • 21. Thank you!