Vldb14

Aggregate Estimation Over Dynamic
Hidden Web Databases
Presenter: Weimo Liu (The George Washington University)
Joint work with Saravanan Thirumuruganathan (University
of Texas at Arlington), Nan Zhang (The George Washington
University), and Gautam Das (University of Texas at Arlington)

Outline
 Background and Motivation
 REISSUE-ESTIMATOR
 RS-ESTIMATOR
 SYSTEM DESIGN
 Experimental Results
 Conclusion

Hidden Databases: Used Car Inventory
 Form-like interface
 Return top-k tuples

Search Queries vs Aggregate Queries
 Search Queries
 SELECT * FROM D WHERE ac1 = vc1 &···& acu = vcu
 e.g., List 2006 Ford F-150 with 4WD and 5.4L engine in Cargiant’s inventory
 Answered by hidden database with top-k restriction
 Aggregate Queries
 SELECT AGGR(*) FROM D WHERE ac1 = vc1 &···& acu = vcu
 e.g., How many vehicles in Cargiant’s inventory have MPG > 30?
 Cannot be answered through the public web interface
[Figure: search query, aggregate query, web interface, hidden database]
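To make the top-k restriction concrete, here is a minimal sketch of a form-like search interface. The toy inventory, the attribute names, the limit K = 2, and the function name search are all assumptions made for illustration; a real site would also rank the returned tuples with its own scoring function.

```python
from typing import Dict, List, Tuple

K = 2  # hypothetical top-k limit of the interface

# Hypothetical toy inventory; attribute names and values are made up.
INVENTORY: List[Dict[str, str]] = [
    {"make": "Ford", "model": "F-150", "drive": "4WD"},
    {"make": "Ford", "model": "F-150", "drive": "2WD"},
    {"make": "Ford", "model": "Focus", "drive": "2WD"},
    {"make": "Honda", "model": "Civic", "drive": "2WD"},
]


def search(predicates: Dict[str, str], k: int = K) -> Tuple[List[Dict[str, str]], bool]:
    """Conjunctive equality search: at most k tuples plus an overflow flag."""
    matches = [t for t in INVENTORY
               if all(t[attr] == value for attr, value in predicates.items())]
    # Only the first k matches are shown; the flag says whether more exist.
    return matches[:k], len(matches) > k


top, overflow = search({"make": "Ford"})
print(len(top), overflow)  # prints "2 True": three Fords match, two are shown
```

Because the interface never reveals how many tuples lie behind an overflowing query, a COUNT over the whole inventory cannot be read off directly and has to be estimated.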

Challenges
 Prior work assumes a static hidden database. The simple approach
to the dynamic case, repeatedly executing the existing “static”
algorithms at a certain time interval, has two problems:
 The interface limits the number of search queries per IP address per day
 Repeated executions waste many search queries

Outline of Technical Results
 Baseline
 Repeated executions of existing “static” algorithm [DJJ+10]
 Two Algorithms
 REISSUE-ESTIMATOR
 Infers whether and how the search query answers received in the
last round have changed in this round.
 RS-ESTIMATOR
 Automatically maintains a sample of a database according to how the
database changes.
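The baseline builds on random drill-downs over the query tree. The following is a simplified sketch in the spirit of [DJJ+10], not the authors' implementation: it omits that paper's variance-reduction techniques, and the toy Boolean database, the limit K = 2, and all function names are assumptions. A drill-down follows a random "signature" (one value per attribute) from the root and stops at the first query that does not overflow; dividing the number of returned tuples by the probability of reaching that query gives an unbiased size estimate.

```python
import random
from itertools import product

K = 2  # hypothetical top-k limit
# Hypothetical database: 3 Boolean attributes, 6 distinct tuples.
DB = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1)]
DOMAINS = [(0, 1), (0, 1), (0, 1)]


def matches(prefix):
    """Tuples whose leading attributes equal prefix (the query's Sel(q))."""
    return [t for t in DB if t[:len(prefix)] == prefix]


def drill_down(signature):
    """Follow signature from the root; stop at the first non-overflowing query.

    Returns (terminal depth, size estimate).
    """
    prob = 1.0  # probability that a random signature reaches the current node
    for depth in range(len(signature) + 1):
        found = matches(tuple(signature[:depth]))
        if len(found) <= K:  # valid or empty answer: all matches are visible
            return depth, len(found) / prob
        if depth < len(signature):
            prob /= len(DOMAINS[depth])  # next value is chosen uniformly
    raise ValueError("fully specified query overflows; tuples must be distinct")


def random_signature(rng):
    return tuple(rng.choice(domain) for domain in DOMAINS)


rng = random.Random(0)
estimates = [drill_down(random_signature(rng))[1] for _ in range(1000)]
print("sampled mean:", sum(estimates) / len(estimates))

# Averaging over every possible signature gives the exact expectation.
all_signatures = list(product(*DOMAINS))
print("exact mean:", sum(drill_down(s)[1] for s in all_signatures) / len(all_signatures))
```

Each tuple lies under exactly one terminal query, which is why the expectation over all signatures equals the true database size.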

Model of Dynamic Hidden Web Databases
 Hidden Web Database and Query Interface
 A hidden database D with m attributes A1, …, Am. Let Ui be the
domain of attribute Ai. For a tuple t ∈ D, we use t[Ai] ∈ Ui to
denote the value of Ai for t.
 SELECT * FROM D WHERE Ai1 = ui1 AND … AND Ais = uis
where i1, …, is ∈ [1, m] and uij ∈ Uij. Let Sel(q) ⊆ D be the
tuples matching query q.
 Dynamic Hidden Databases
 In most of the paper, we consider a round-update model
where modifications occur at the very beginning of each
round.

Objectives of Aggregate Estimation
 In this paper, we consider two types of aggregate
estimation tasks over a dynamic hidden database:
 Single-round aggregates
 In one round
 Average, Count, Sum
 Trans-round aggregates
 Span the current round and the previous round
 e.g., |D_i| − |D_{i-1}|
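Stated a little more formally, with D_i denoting the database in round i: the symbol Q for an arbitrary aggregate query is notation assumed here and does not appear on the slides.

```latex
\begin{align*}
\text{Single-round:} \quad & Q(D_i),
  && \text{e.g. } \mathrm{COUNT}(D_i) = |D_i| \\
\text{Trans-round:} \quad & \text{a function of } Q(D_i) \text{ and } Q(D_{i-1}),
  && \text{e.g. } |D_i| - |D_{i-1}|
\end{align*}
```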


Query Reissuing for Multiple Rounds

Key Question: Reissue or Restart?
 Example 1 (No change)
 The queries issued by REISSUE-ESTIMATOR are always a
subset of those issued by RESTART-ESTIMATOR

Key Question: Reissue or Restart?
 Example 2 (Total change)
 REISSUE-ESTIMATOR might end up performing worse
than RESTART-ESTIMATOR
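The two examples can be reproduced with a small sketch. The slides do not spell out the reissue rule, so the version below is an assumption: reissue last round's terminal query, drill deeper if it now overflows, and otherwise check ancestors until one overflows. The toy data, the limit K = 2, and the function names are hypothetical. Each function returns the terminal depth and the list of query depths it issued.

```python
K = 2  # hypothetical top-k limit


def overflows(db, prefix):
    """Simulated interface flag: more than K tuples match the prefix query."""
    return sum(1 for t in db if t[:len(prefix)] == prefix) > K


def restart(db, signature):
    """Drill down from the root; return (terminal depth, depths queried)."""
    issued = []
    for depth in range(len(signature) + 1):
        issued.append(depth)
        if not overflows(db, signature[:depth]):
            return depth, issued
    raise ValueError("fully specified query overflows")


def reissue(db, signature, old_depth):
    """Start from last round's terminal depth; return (new depth, depths queried)."""
    depth, issued = old_depth, [old_depth]
    if overflows(db, signature[:depth]):
        # Insertions made the old terminal query overflow: drill deeper.
        while depth < len(signature):
            depth += 1
            issued.append(depth)
            if not overflows(db, signature[:depth]):
                return depth, issued
        raise ValueError("fully specified query overflows")
    # The old query still fits; deletions may have made an ancestor fit too.
    while depth > 0:
        issued.append(depth - 1)
        if overflows(db, signature[:depth - 1]):
            break
        depth -= 1
    return depth, issued


SIG = (0, 1, 1)  # fixed drill-down signature, one value per attribute
ROUND1 = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1)]
ROUND2 = [(1, 1, 0)]  # "total change": almost everything was deleted

depth1, _ = restart(ROUND1, SIG)
print("no change:    restart", restart(ROUND1, SIG)[1], "reissue", reissue(ROUND1, SIG, depth1)[1])
print("total change: restart", restart(ROUND2, SIG)[1], "reissue", reissue(ROUND2, SIG, depth1)[1])
```

With no change, the reissued queries (depths 2 and 1) are a subset of the restart's (depths 0, 1, 2). After the total change, reissuing climbs through depths 2, 1, 0 while a restart needs only the root query.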


Problem of REISSUE-ESTIMATOR
 Example (No Change)
 One does not need to issue many queries to realize that the
database has changed little; the remaining query budget can then
be reallocated to initiate new drill downs
 Reservoir Sampling [V85]
 How much the maintained sample should change depends on
how much new data has been inserted into the database.
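For reference, a minimal sketch of basic reservoir sampling over a stream, the classic technique that [V85] builds on. This illustrates only the sampling idea, not RS-ESTIMATOR itself; the function name and the parameters are chosen here for illustration.

```python
import random


def reservoir_sample(stream, size, rng):
    """Keep a uniform random sample of `size` items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < size:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)   # uniform over 0..i, both ends inclusive
            if j < size:            # happens with probability size / (i + 1)
                reservoir[j] = item
    return reservoir


rng = random.Random(42)
print(reservoir_sample(range(1000), 5, rng))
```

The (i + 1)-th arrival displaces a sampled item with probability size / (i + 1), so the amount of change in the sample is driven by how many new items arrive.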


CONCLUSION AND FUTURE WORK
A study of estimating aggregates over
dynamic hidden web databases
 Query reissuing
 Bootstrapping-based query-plan adjustment
Future Work
 How metadata such as COUNT can be used to guide the design
of drill downs in future rounds
 How to minimize the total query cost of estimating all
aggregate queries in a given workload
 How to leverage both the keyword search and the form-like search
interfaces provided by many web databases to further improve
the performance of aggregate estimation

References
 [DJJ+10] A. Dasgupta, X. Jin, B. Jewell, N. Zhang, and G. Das.
Unbiased estimation of size and other aggregates over hidden web
databases. In SIGMOD, 2010.
 [V85] J. S. Vitter. Random sampling with a reservoir. ACM
Trans. Math. Softw., 11(1):37–57, Mar. 1985.