Query-Aware Database
Generators
Background and
Demands
How to simulate clients’ business?
● Ideal circumstance: Database Schema + Real Data + SQL
○ Sometimes Real Data is not available
● Alternatives: Database Schema + SQL + (Statistics?)
● Mock distributions/domains for an attribute
○ pk -> uniform distribution
○ pk + fk -> Integrity Constraint
○ Zipfian
○ Enum Type (very low cardinality)
● Some SQL queries are too complex to understand:
○ TPC… SSB…
○ Heavy Analytical Tasks
● We need a query-aware generator
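The per-attribute mocking rules above can be sketched as small value generators. A minimal sketch in Python; all function names here are illustrative, not from any library:

```python
import random

def uniform_pk(n):
    """Primary key column: a dense unique range (uniform)."""
    return list(range(1, n + 1))

def zipfian(domain, n, s=1.5, seed=0):
    """Skewed column: Zipf-like weights over a ranked domain."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** s) for rank in range(1, len(domain) + 1)]
    return rng.choices(domain, weights=weights, k=n)

def enum_col(values, n, seed=0):
    """Low-cardinality enum column: uniform picks from a small set."""
    rng = random.Random(seed)
    return [rng.choice(values) for _ in range(n)]

def fk_col(pks, n, seed=0):
    """Foreign keys drawn only from existing pks, so referential
    integrity holds by construction."""
    rng = random.Random(seed)
    return [rng.choice(pks) for _ in range(n)]
```

Such rules work per column in isolation; as the next slides argue, they cannot express cross-column correlation or query-specific cardinalities.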
Hassles of Making a Generator
● For a simple query, the solution is simple and direct
○ select [cols...] from t1 join t2 on t1.pk = t2.fk
● What if we need to simulate some correlation?
○ where a < ‘xx’ and b > ‘yy’
● Or process complex projections
○ select case when cond1 then ***, case when cond2
● And combinations of the above circumstances…
Two papers with two different methods
● QAGen
● DGL
QAGen: Generating query-aware test databases
Symbolic Execution
● Volcano-model iterator
● Data are symbols instead of real data
● Generate constraints during execution
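A minimal sketch of what "data as symbols" can look like (the names here are mine; the paper's actual representation is richer): each tuple holds fresh symbols instead of values, and operators append constraints over those symbols as they "execute".

```python
import itertools

_counter = itertools.count()

def fresh(attr):
    """A fresh symbol for one attribute of one tuple."""
    return f"{attr}#{next(_counter)}"

class SymbolicTable:
    def __init__(self, name, attrs, cardinality):
        self.name = name
        # Each row is a dict attr -> symbol; no concrete data yet.
        self.rows = [{a: fresh(a) for a in attrs} for _ in range(cardinality)]

constraints = []  # accumulated as operators consume symbolic tuples

t1 = SymbolicTable("t1", ["a", "b"], cardinality=3)
for row in t1.rows:
    constraints.append((row["a"], "<", "xx"))  # e.g. from WHERE a < 'xx'
```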
Agenda of QAGen
● Analyze the Query and Find the Available “Knobs”
○ A knob of an operator is the cardinality of its output,
○ or its value distribution,
○ or whether its input is pre-grouped
● Do the Symbolic Execution (and Get the Constraint of the Data)
● Instantiating the Data
○ Always expensive
Analyze the Query
Symbolic Execution - Selection
● GetNext for child, read tuple “t”
● [Positive Tuple Annotation] if the output has not yet reached cardinality c
○ for each symbol s in t, insert <s, p>
○ return t to parent
● [Negative Tuple Processing]
○ insert <s, !p>
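The selection algorithm above, as a rough Python sketch — the knob `c` and predicate `p` follow the slide; the class and method names are mine:

```python
class SymbolicSelect:
    """Selection with cardinality knob c: the first c input tuples are
    constrained to satisfy predicate p, the rest to violate it."""
    def __init__(self, child, p, c, constraints):
        self.child, self.p, self.c = child, p, c
        self.constraints = constraints   # shared constraint store
        self.emitted = 0

    def get_next(self):
        for t in self.child:                  # GetNext on the child
            if self.emitted < self.c:         # positive tuple annotation
                for s in t.values():
                    self.constraints.append((s, self.p))
                self.emitted += 1
                yield t                       # return t to the parent
            else:                             # negative tuple processing
                for s in t.values():
                    self.constraints.append((s, "NOT(" + self.p + ")"))

constraints = []
child = [{"a": f"a#{i}", "b": f"b#{i}"} for i in range(4)]
sel = SymbolicSelect(child, "a < 'xx'", c=2, constraints=constraints)
output = list(sel.get_next())   # 2 tuples reach the parent
```

Note that negative tuples still cost work: their symbols must be constrained to *fail* the predicate, or the instantiated table would change the output cardinality.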
Symbolic Execution - Join
● Generate Join Distribution
● GetNext for (outer) child, read tuple “t”
● [Positive Tuple Annotation] replace the fk symbol with the pk symbol
● [Negative Tuple Processing]
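A sketch of the fk-to-pk substitution step (a hypothetical helper; the real operator also handles negative tuples and pre-grouped inputs): the join result is fixed by construction, because each outer tuple's fk symbol becomes the pk symbol of the inner tuple chosen by the desired join distribution.

```python
import random

def symbolic_join(outer_rows, inner_rows, fk, pk, fanout_dist, seed=0):
    """Wire the join by symbol substitution, following the chosen
    join distribution (fanout_dist weights the inner tuples)."""
    rng = random.Random(seed)
    joined = []
    for t in outer_rows:
        partner = rng.choices(inner_rows, weights=fanout_dist, k=1)[0]
        t[fk] = partner[pk]              # replace the fk with the pk symbol
        joined.append({**t, **partner})
    return joined

inner = [{"pk": f"pk#{i}", "name": f"n#{i}"} for i in range(2)]
outer = [{"fk": f"fk#{i}", "qty": f"q#{i}"} for i in range(5)]
joined = symbolic_join(outer, inner, "fk", "pk", fanout_dist=[1, 3])
```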
Instantiating
● A typical process of model checking.
● Low efficiency
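To see why instantiation is the expensive step, here is a deliberately naive, model-checking-style sketch (illustrative only, not the paper's algorithm): each symbol is resolved by scanning the domain against all of its constraints.

```python
def instantiate(symbols, constraints, domain):
    """Assign a concrete value to every symbol by brute-force search.
    Cost grows with |symbols| * |domain| * |constraints| -- hence slow."""
    assignment = {}
    for s in symbols:
        preds = [p for (sym, p) in constraints if sym == s]
        for v in domain:
            if all(p(v) for p in preds):
                assignment[s] = v
                break
        else:
            raise ValueError(f"unsatisfiable symbol {s}")
    return assignment

# e.g. symbol a#0 must be < 50, symbol b#0 must be > 90
cs = [("a#0", lambda v: v < 50), ("b#0", lambda v: v > 90)]
print(instantiate(["a#0", "b#0"], cs, range(100)))  # {'a#0': 0, 'b#0': 91}
```

Real systems hand the constraints to a constraint solver instead, but the cost per symbol remains the dominant factor.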
Do we need a state-of-the-art generator?
● Pros:
○ without manual intervention, we are guaranteed to get valid data
○ works for complex queries that are hard to understand
○ perfectly guarantees constraints
■ Cardinality constraints
■ Integrity constraints
● Cons:
○ low efficiency
○ some cases still can’t be processed
● How about introducing an intermediate DSL?
DGL: Flexible Database
Generators
Another alternative
● What if we have understood the workload pretty well?
● Then use a more flexible DSL instead of coding for every workload?
● DGL (Data Generation Language):
○ Scalar
○ Rows
○ Iterator
■ Distribution
■ Real SQL (Query / Persist)
○ Tables
○ Expression / Functions
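A toy illustration of the flavor of such a DSL, embedded in Python — the combinator names are mine, not DGL's actual syntax: scalars and distribution iterators compose into table generators.

```python
import random

def scalar(value):
    """A constant column source."""
    while True:
        yield value

def uniform_int(lo, hi, seed=0):
    """An iterator over a uniform integer distribution."""
    rng = random.Random(seed)
    while True:
        yield rng.randint(lo, hi)

def sequence(start=1):
    """A dense key sequence."""
    i = start
    while True:
        yield i
        i += 1

def table(n, **columns):
    """Materialize n rows by pulling one value from each column iterator."""
    return [{name: next(it) for name, it in columns.items()} for _ in range(n)]

rows = table(5, id=sequence(), price=uniform_int(1, 100), status=scalar("OPEN"))
```

The point of the DSL is exactly this compositionality: once the workload is understood, a table spec is a short expression rather than a bespoke program.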
An example of TPC-H
Thanks
