Query Optimization over Crowdsourced Data
Hyunjung Park, Jennifer Widom
Stanford University
Presented in VLDB 2013.


  1. Query Optimization over Crowdsourced Data
     Hyunjung Park, Jennifer Widom
     Stanford University
  2. Deco: Declarative Crowdsourcing (8/27/2013, Hyunjung Park)
     Request: “Find the capitals of eight Spanish-speaking countries”
     Questions shown on the slide:
     •  Give me a Spanish-speaking country.
     •  Give me a country.
     •  What language do they speak in country X?
     •  What is the capital of country X?
     [Diagram: Deco System and DBMS, with an example table]
     country | language | capital
     Italy   | Italian  | Rome
     Spain   | Spanish  | Madrid
     …       | …        | …
  3. Deco Query Optimization
     •  Crowd incurs monetary cost
     •  Some query plans are much cheaper than others
     •  Cost estimation is complicated by:
        –  Previously collected data
        –  Unknown database state
        –  Inconsistency of human answers
  4. Outline
     •  Motivating example
     •  Deco data model and queries
     •  Cost and cardinality estimation
     •  Experimental results
     Everything implemented in full prototype
  5. Motivating Example: Plan 1
     “Find the capitals of eight Spanish-speaking countries”
     [Flowchart, repeated 8x, with decision nodes “unseen” (T/F) and “Spanish” (T/F)]
     •  Give me a country.
     •  What language do they speak in country X?
     •  What is the capital of country X?
  6. Motivating Example: Plan 2
     “Find the capitals of eight Spanish-speaking countries”
     [Flowchart, repeated 8x, with decision nodes “unseen” (T/F) and “Spanish” (T/F)]
     •  Give me a Spanish-speaking country.
     •  What language do they speak in country X?
     •  What is the capital of country X?
  7. Preview of Experimental Results
     [Bar chart: actual costs spent on Mechanical Turk ($, axis 0 to 15) for Plan 1 and Plan 2, broken down by question: “Give me a country.”, “Give me a Spanish-speaking country.”, “What language do they speak in country X?”, “What is the capital of country X?”]
  8. Outline
     •  Motivating example
     •  Deco data model and queries
     •  Cost and cardinality estimation
     •  Experimental results
  9. Deco: Data Model (1/2)
     •  Conceptual Relation: visible to end-users
        Country (country, language, capital)
     •  Resolution Rules: cleanse raw data using UDFs
        country: dupElim
        language: majority(3)
        capital: majority(3)
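
The two resolution functions named on this slide can be illustrated with a minimal sketch. The slide does not define their exact semantics, so the behaviour below is an assumption: `dup_elim` drops repeated raw values, and `majority` returns a value only when more than half of the first `k` answers agree. The function names and sample answers are hypothetical.

```python
from collections import Counter


def dup_elim(values):
    """Keep the first occurrence of each raw value, preserving order (assumed semantics)."""
    return list(dict.fromkeys(values))


def majority(values, k=3):
    """Return the value backed by a strict majority of the first k answers, else None.

    Assumed reading of majority(3); the slide only names the rule.
    """
    if len(values) < k:
        return None  # not enough answers collected yet
    winner, votes = Counter(values[:k]).most_common(1)[0]
    return winner if votes > k // 2 else None


# Hypothetical raw crowd answers.
raw_countries = ["Spain", "Peru", "Spain", "Chile"]
raw_capitals = ["Madrid", "Madrid", "Barcelona"]

print(dup_elim(raw_countries))
print(majority(raw_capitals))
```

Under this reading, one resolved value can consume several paid answers, which is why resolution rules matter for cost estimation.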
  10. Deco: Data Model (2/2)
     •  Fetch Rules: “access methods” for the crowd
        language => country   “Give me a {language}-speaking country.”
        Ø => country          “Give me a country.”
        country => language   “What language do they speak in {country}?”
        country => capital    “What is the capital of {country}?”
     Price tags on the slide: [$0.05] [$0.01] [$0.02] [$0.03]
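
One way to picture fetch rules as “access methods” is as question templates keyed by the attributes they need and the attribute they return. The sketch below is illustrative only: the `FetchRule` class, `RULES` list, and `rules_for` helper are hypothetical names, not Deco's API, and prices are left out because the transcript does not pair them with rules unambiguously.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FetchRule:
    given: tuple   # attributes whose values must be known (empty tuple for Ø)
    asks: str      # attribute the crowd is asked to supply
    template: str  # question text, taken from the slide

    def question(self, **known):
        """Fill the template with known attribute values."""
        missing = [a for a in self.given if a not in known]
        if missing:
            raise ValueError(f"missing values for {missing}")
        return self.template.format(**known)


RULES = [
    FetchRule(("language",), "country", "Give me a {language}-speaking country."),
    FetchRule((), "country", "Give me a country."),
    FetchRule(("country",), "language", "What language do they speak in {country}?"),
    FetchRule(("country",), "capital", "What is the capital of {country}?"),
]


def rules_for(attribute, known):
    """Return the rules that can produce `attribute` from the values in `known`."""
    return [r for r in RULES if r.asks == attribute and all(a in known for a in r.given)]


print(RULES[0].question(language="Spanish"))
print(len(rules_for("country", {"language": "Spanish"})))
```

With a language value in hand, two rules can produce a country, and that choice is what separates Plan 1 from Plan 2 in the motivating example.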
  11. Deco: Queries
     •  Deco query: SQL query over conceptual relations
        SELECT country, capital
        FROM Country
        WHERE language='Spanish'
        MINTUPLES 8
     •  Query processor: access the crowd as needed to produce query result while:
        1.  Minimizing monetary cost (query optimizer)
        2.  Reducing latency (query execution engine)
  12. Query Optimization
     •  Find the best query plan in terms of estimated monetary cost
     •  As in traditional query optimizer
        1.  Cost and cardinality estimation
        2.  Search space
        3.  Plan enumeration algorithm
  13. Cost Estimation
     •  Total monetary cost = $\sum_{\text{Fetch } F} F.\mathrm{price} \times F.\mathrm{cardinality}$
        –  Existing data is “free”
     •  Definition of Cardinality in Deco
        –  Total number of expected output tuples from operator until query execution terminates
     •  Cardinality estimation
        –  Final database state needs to be estimated simultaneously
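
The cost formula can be sketched as a sum over (price, cardinality) pairs, one pair per Fetch operator. The cardinalities below are the estimates the deck gives for its two example plans, with every fetch rule priced at $0.05. The function and variable names are hypothetical, and `Decimal` is used only to keep the dollar amounts exact.

```python
from decimal import Decimal


def total_cost(fetches):
    """Sum price x cardinality over (price, cardinality) pairs, one per Fetch operator."""
    return sum((price * cardinality for price, cardinality in fetches), Decimal("0"))


PRICE = Decimal("0.05")  # flat price per fetch, as in the deck's example setting

# Estimated Fetch cardinalities for the two example plans.
plan1 = [(PRICE, 100), (PRICE, 200), (PRICE, 20)]
plan2 = [(PRICE, 10), (PRICE, 20), (PRICE, 20)]

costs = {"Plan 1": total_cost(plan1), "Plan 2": total_cost(plan2)}
cheapest = min(costs, key=costs.get)  # the optimizer's objective: lowest estimated cost
print(costs, cheapest)
```

Only newly fetched answers appear in the sum; data already in the database adds nothing, matching the “existing data is free” bullet.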
  14. Cardinality Estimation: Setting
     •  $0.05 for all fetch rules
     •  No existing data
     •  Selectivity factors
        –  language='Spanish': 0.1
        –  dupElim: 0.8
        –  majority(3): 0.4 (=1/2.5)
  15. Cardinality Estimation: Plan 1
     SELECT country, capital
     FROM Country
     WHERE language='Spanish'
     MINTUPLES 8
     Plan operators (numbered 1 to 14 on the slide):
     MinTuples[8], Project[co,ca], DLOJoin[co], DLOJoin[co], Resolve[dupeli], Resolve[maj3], Resolve[maj3], Filter[la='Spanish'], Scan [CtryA], Fetch [Ø→co], Scan [CtryD2], Fetch [co→ca], Scan [CtryD1], Fetch [co→la]
     Estimated Fetch cardinalities:
     •  Ø => country: 100
     •  country => language: 200
     •  country => capital: 20
     Cost estimation: $0.05×(100+200+20) = $16.00
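
The three cardinalities for Plan 1 can be reproduced by working backwards from the 8 required result tuples with the selectivity factors from the setting slide. This derivation is a reconstruction that happens to match the slide's numbers, not the paper's algorithm; all names are hypothetical.

```python
# Values from the "Cardinality Estimation: Setting" slide.
SEL_SPANISH = 0.1   # fraction of countries passing language='Spanish'
SEL_DUPELIM = 0.8   # fraction of fetched countries that are new after dupElim
SEL_MAJ3 = 0.4      # resolved values per raw answer under majority(3), i.e. 1/2.5
PRICE = 0.05        # dollars per fetch
TARGET = 8          # MINTUPLES 8


def plan1_cardinalities():
    """Assumed derivation: work backwards from the 8 required output tuples."""
    distinct_countries = TARGET / SEL_SPANISH           # countries to examine
    fetch_country = distinct_countries / SEL_DUPELIM    # raw "Give me a country." answers
    fetch_language = distinct_countries / SEL_MAJ3      # language asked for every country
    fetch_capital = TARGET / SEL_MAJ3                   # capital asked only for survivors
    return round(fetch_country), round(fetch_language), round(fetch_capital)


cards = plan1_cardinalities()
cost = PRICE * sum(cards)
print(cards, f"${cost:.2f}")
```

Most of the cost comes from asking for the language of every candidate country, nearly all of which the filter then discards.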
  16. Cardinality Estimation: Plan 2
     SELECT country, capital
     FROM Country
     WHERE language='Spanish'
     MINTUPLES 8
     Plan operators (numbered as in Plan 1, with 8a in place of 8):
     MinTuples[8], Project[co,ca], DLOJoin[co], DLOJoin[co], Resolve[dupeli], Resolve[maj3], Resolve[maj3], Filter[la='Spanish'], Scan [CtryA], Fetch [la→co], Scan [CtryD2], Fetch [co→ca], Scan [CtryD1], Fetch [co→la]
     Estimated Fetch cardinalities:
     •  language => country: 10
     •  country => language: 20
     •  country => capital: 20
     Cost estimation: $0.05×(10+20+20) = $2.50
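
Plan 2's numbers follow from the same backward reasoning, under one additional assumption that is not stated on the slide: every country returned by “Give me a Spanish-speaking country.” is expected to pass the language filter. The sketch below is a reconstruction with hypothetical names.

```python
# Values from the "Cardinality Estimation: Setting" slide.
SEL_DUPELIM = 0.8   # fraction of fetched countries that are new after dupElim
SEL_MAJ3 = 0.4      # resolved values per raw answer under majority(3), i.e. 1/2.5
PRICE = 0.05        # dollars per fetch
TARGET = 8          # MINTUPLES 8

PLAN1_ESTIMATE = 16.00  # Plan 1's estimated cost, as given in the deck


def plan2_cardinalities():
    """Assumed derivation; the filter is taken to pass every fetched country."""
    fetch_country = TARGET / SEL_DUPELIM   # raw "Spanish-speaking country" answers
    fetch_language = TARGET / SEL_MAJ3     # language still verified for each country
    fetch_capital = TARGET / SEL_MAJ3      # capital asked for each country
    return round(fetch_country), round(fetch_language), round(fetch_capital)


cards = plan2_cardinalities()
cost = PRICE * sum(cards)
print(cards, f"${cost:.2f}", f"{PLAN1_ESTIMATE / cost:.1f}x cheaper than Plan 1")
```

The saving comes almost entirely from not fetching and resolving the language of countries that the filter would reject.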
  17. Experimental Results
     [Two bar charts of actual cost ($): Plan 1 (axis 0 to 15) and Plan 2 (axis 0 to 3), broken down by fetch rule: Ø => country, language => country, country => language, country => capital]
  18. Experimental Results
     [Two bar charts of actual vs. estimated cost ($): Plan 1 (axis 0 to 15) and Plan 2 (axis 0 to 3), broken down by fetch rule: Ø => country, language => country, country => language, country => capital]
  19. Related Work
     •  Declarative approach for crowdsourcing
        –  Arnold, CrowdDB, CrowdSearcher, Jabberwocky, Qurk, ...
     •  Crowd-powered algorithms/operations
        –  Filter, sort, join, max, entity resolution, …
     •  Also:
        –  Traditional query optimization
        –  Heterogeneous or federated database systems
  20. Summary
     •  Cost estimation in Deco
        –  Distinguish between existing data vs. new data
        –  Estimate cardinality and final database state simultaneously
     •  In the paper:
        –  Full description of cost estimation and plan enumeration algorithms
        –  More experimental results
  21. Thank you!
