QBIC is an experimental project that aims to speed up database queries using machine learning. It predicts data dependencies and pre-generates optimized query results ("pregens") so that data can be returned without sending the full query to the database. QBIC represents queries and their relationships as a lattice and applies algorithms such as the 0-1 knapsack problem to choose the optimal set of pregens to maintain, based on factors like space usage and query popularity. It also aims to predict metrics and cardinalities for queries, enabling techniques like "sharpening" that return approximated results without querying the database at all.
2. QBIC
• Cubic, a.k.a. Query Boost Intelligent Compiler
• Experimental project in Labs
• Recall math
• Find data dependencies
• Use machine learning
• ...to speed up queries to the database
3. Rationale
• Most BI tools have nothing to do with the letter "I"
• UI -> buttons -> query -> response -> chart
• A query could run forever
• There are ways to speed things up (partitions,
indexes, cache, cubes)
• ...
• Is there a way to get data without sending a query?
6. Set
• Not related to java.util.Set
• A set of elements, for example {apple, tuesday, 42}
• No duplicates: {1, 1, 2} is the same as {1, 2}
• No order: {1, 2} is the same as {2, 1}
• A set can be an element of another set: {1, 2, {3, 4}}
7. Set of all subsets
• Powerset
• A is a subset of B (A ⊆ B)
• Every element of A is in B
• {1, 2} is a subset of {1, 2, 3}
• Powerset of {1, 2, 3}:
{{}, {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}}
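The powerset of a small set can be generated in a few lines; here is a minimal Python sketch using the standard itertools recipe (an illustration, not code from the deck):

```python
from itertools import chain, combinations

def powerset(elements):
    """Return all subsets of the given collection, from {} up to the full set."""
    items = list(elements)
    return [set(combo) for combo in
            chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))]

# The powerset of a 3-element set has 2^3 = 8 subsets.
subsets = powerset({1, 2, 3})
```

A set of n elements always yields 2^n subsets, which is exactly why labeling every lattice node gets expensive for wide tables.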
8. Ordered set
• A set with a defined order relation (≤)
• ≤ is a binary relation (a ≤ b)
• The set {1, 9, 42} is ordered, with order relation
"less"
• A set of emoji is ordered, with order "more
funny emoji"
• ≤ holds for every pair of elements!
10. Partially Ordered Set
• Something is ordered, something is not
• Everything that can be ordered is ordered
• Some pairs of elements may be incomparable
• {1, 4, 7, Tuesday, 42}
11. Lattice
• A partially ordered set in which every pair of
elements has a least upper bound (supremum)
and a greatest lower bound (infimum)
• So even if elements A and B are not comparable,
there is an element C which is "greater" (or
"less") than both A and B
• Hard?
12. Values of a Senior Engineer
(sarcasm)
• House, car
• Build all possible combinations of those values
(powerset):
{{no house, no car},
{house, no car},
{no house, car},
{house, car}}
20. Typical Query
select city, gender, count(*) -- groups and metrics
from table -- table
where income > 1000 and state = 'P' -- filters
group by city, gender -- groups again
having count(*) > 5 -- more filters
order by count(*) desc -- sorting
limit 10 offset 0 -- limits
21. Typical Query: simplify [1]
select city, gender, count(*)
from table
group by city, gender

[1] Keep the attribute groups and only count(*) from the metrics; no filters, no sorting, no limits.
Later we will try to add them back.
22. Build lattice by groups
• Take all attribute columns from the table:
city, gender, state
• Build all possible group combinations (powerset):
{city, gender, state}
{city, gender}, {city, state}, {gender, state}
{city}, {gender}, {state}
{}
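The lattice above can be built mechanically: the nodes are the powerset of the attribute columns, ordered by the subset relation. A minimal sketch (illustrative, not QBIC's code):

```python
from itertools import chain, combinations

def group_lattice(columns):
    """Build all group-by combinations (the powerset of the attribute columns)
    together with the 'is a subset of' relation that orders the lattice."""
    cols = list(columns)
    nodes = [frozenset(c) for c in
             chain.from_iterable(combinations(cols, r) for r in range(len(cols) + 1))]
    # a <= b in the lattice when a's columns are a strict subset of b's:
    # a pregen for b can answer any query grouped by a.
    edges = [(a, b) for a in nodes for b in nodes if a < b]
    return nodes, edges

nodes, edges = group_lattice(["city", "gender", "state"])
```

The top node is the full column set, the bottom node is the empty group-by (the grand total).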
25. Pregen
• A pregen is a completed query result from the database,
saved somewhere (a.k.a. cache or view)
• Pros:
(1) getting data from the pregen is very fast
(2) getting data for all pregens below is fast too
• Cons:
(1) calculating (and maintaining) the pregen takes time
(2) storing the pregen takes disk space
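The "pregens below" point can be seen in a small sketch (hypothetical data, not from the deck): a pregen grouped by {city, gender} can answer a coarser {city} query by re-aggregating its own rows, without touching the database:

```python
from collections import defaultdict

# A pregen for "select city, gender, count(*) ... group by city, gender",
# stored as {(city, gender): count}.
pregen = {
    ("Oslo", "F"): 120, ("Oslo", "M"): 80,
    ("Bergen", "F"): 30, ("Bergen", "M"): 45,
}

def rollup(pregen, keep):
    """Answer a coarser group-by from a finer pregen by summing counts
    over the key positions we keep."""
    out = defaultdict(int)
    for key, count in pregen.items():
        out[tuple(key[i] for i in keep)] += count
    return dict(out)

# The coarser query "group by city" answered from the pregen: keep column 0.
by_city = rollup(pregen, keep=[0])
```

This works for count(*) and other additive metrics; non-additive metrics (e.g. distinct counts) would need more care.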
29. Light formalization
• There is a lattice
• If no pregen exists, user query time equals the
database query time, and no space for
pregens is needed
• If a pregen exists, user query time is very fast,
but the pregen occupies some space,
and one database query is needed to prepare
the pregen
Note: you can also calculate all pregens below
30. Pregen optimization flow
• Scan the table
- Read metadata
- Detect which columns can be groups
• Build the lattice by groups
• Fill every lattice node with props:
- pregen size (rows or MB)
- time to retrieve the data from the db
• Apply the optimization algorithm
33. Knapsack Problem
• 0-1 Knapsack Problem
• Decide which items to put into the
knapsack to maximize its
value without exceeding the
maximum available weight
• NP-hard
• Exact and approximate
solutions APD1
APD1
Appendix 1 - Review of algorithms for the Knapsack Problem
35. (Practical) Knapsack #1
• Dynamic programming with rounded values
• Power-of-10 (log) scale
• Examples:
0.2 is rounded to 1 (10^0)
1.2 is rounded to 1 (10^0)
8.6 is rounded to 10 (10^1)
123.6 is rounded to 100 (10^2)
...and so on
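The rounding above can be sketched as follows. The exact rounding policy is an assumption: rounding the base-10 logarithm to the nearest integer, clamped at 10^0 so that sub-1 weights stay usable as small DP integers (this matches the 0.2 → 1 example):

```java
// Round a weight to the nearest power of 10 on a log scale,
// clamping at 10^0 = 1 so DP can work with small integer weights.
public class Power10Rounding {
    public static double round(double value) {
        long exp = Math.round(Math.log10(value));
        if (exp < 0) exp = 0;   // clamp: everything below 1 becomes 1 (assumption)
        return Math.pow(10, exp);
    }
}
```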
36. (Practical) Knapsack #2
• Genetic algorithm with a greedy initial population
• The genetic result is never worse than the greedy one
• You can control the running time of the algorithm
• Tradeoff (time/accuracy)
• Open for modifications
38. Lattice: Weight
• Space occupied by the pregen (rows, or better,
megabytes)
• Time to prepare the pregen (time to query the db)
• Time to support the pregen (depends on update
frequency)
39. Lattice: Value
• If the pregen is taken, its value is 1, otherwise 0
• How many other pregens can be calculated from it
• Query popularity
40. Pregen Optimization
• At this point we can get the list of pregens
that are optimal to maintain
• If new properties appear in the system
(popularity, data updates) we can recalculate
a new optimal lattice
• Future: filters
41. Problem
• For large tables, labeling lattice nodes with
actual values is a very expensive process
• Basically, we would need to calculate all possible
group bys
• Since this is only an optimization, we don't need
exact values
• ...
• What if we can predict them?
45. Lattice Puzzle
• select city, count(*) -- 50
• select gender, count(*) -- 2
• ...
• How many rows will this
query return?
(magic: without sending the
query)
select city, gender, count(*)
48. Lattice Puzzle: Linear
Scaling
• From a sample you can get the group counts
CITY(30), GENDER(2), CITY/GENDER(56)
• Build the interval as
[max(city,gender)..max(city,gender)*min(city,gender)]
Interval = [30..60], L=30, R=60, V=56
• Calculate the ratio = (V - L) / (R - L) = 26 / 30 ≈ 0.87
• Apply the ratio to the real-data interval [50..100] to get the
estimate: 50 + 0.87 * (100 - 50) ≈ 93
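The scaling steps above can be sketched like this (class and parameter names are illustrative):

```java
// Linear-scaling estimate: map the combined group count's position
// inside the sample interval [L..R] onto the real-data interval.
public class LinearScaling {
    public static double estimate(long sampleA, long sampleB, long sampleCombined,
                                  long realA, long realB) {
        double l = Math.max(sampleA, sampleB);              // L
        double r = l * Math.min(sampleA, sampleB);          // R
        double ratio = (sampleCombined - l) / (r - l);      // position of V in [0..1]
        double realL = Math.max(realA, realB);
        double realR = realL * Math.min(realA, realB);
        return realL + ratio * (realR - realL);
    }
}
```

With the slide's numbers (sample: CITY=30, GENDER=2, combined=56; real: CITY=50, GENDER=2) this produces an estimate of about 93 rows.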
49. Lattice Puzzle: Linear
Scaling
• OK for small-cardinality use cases
• Bad for high-cardinality use cases, because the
estimate diverges very quickly
• Bad for small samples
• Bad for non-uniformly distributed samples
• Bad for non-linear functions
• Bad for other metrics
55. Anscombe Quartet
• Four different datasets
• Their simple statistical properties are nearly equal
(mean, variance, correlation, etc.)
• The linear regression line is the same
56. you can always trick an algorithm
if you know how it behaves
Be optimistic
58. Curve Fitting Flow
• Break the sample into input points (x, y) =
(SAMPLE_SIZE, GROUP_COUNT)
• Use the last point as the evaluation of your model
• Apply different curves
• The model that comes closest to the evaluation
point will be your model
• Use this model to predict the lattice values
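The flow above can be sketched with two candidate curve shapes, linear and logarithmic. The least-squares fit and the candidate set are illustrative choices, not QBIC's actual code:

```java
import java.util.function.DoubleUnaryOperator;

// Curve-fitting flow: fit each candidate shape f(x) on all points but
// the last, then pick the model closest to the held-out last point.
public class CurveFitting {
    // Fit y ≈ a * f(x) by least squares on all points except the last,
    // then return |prediction - actual| on the evaluation (last) point.
    static double holdoutError(double[] xs, double[] ys, DoubleUnaryOperator f) {
        double num = 0, den = 0;
        for (int i = 0; i < xs.length - 1; i++) {
            double fx = f.applyAsDouble(xs[i]);
            num += fx * ys[i];
            den += fx * fx;
        }
        double a = num / den;
        int last = xs.length - 1;
        return Math.abs(a * f.applyAsDouble(xs[last]) - ys[last]);
    }

    // Returns "linear" or "log", whichever fits the evaluation point better.
    public static String bestModel(double[] xs, double[] ys) {
        double linErr = holdoutError(xs, ys, x -> x);
        double logErr = holdoutError(xs, ys, Math::log);
        return linErr <= logErr ? "linear" : "log";
    }
}
```

The winning model is then evaluated at the full table size to predict the lattice value.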
59. Cardinality Levels
• SMALL (0..10)
Gender, Income Group, Status
• MEDIUM (10..5000)
City, County, Category
• HIGH (5000..)
UserID, SKU
60. Cardinality Levels
• SMALL x SMALL (log works well)
• SMALL x MEDIUM (log works well)
• MEDIUM x MEDIUM (log works well)
• Anything x HIGH (hard to predict)
• ...
• Maybe drop high-cardinality use cases?
68. Metric: COUNT(*)
• Additive
• Can easily be predicted from a sample because it
looks LINEAR
• Could be improved by better sampling techniques
• Could be improved by microqueries
• ...
• What about other metrics?
69. Metric: SUM
• Additive
• Linear as well
• Can be predicted the same way as count(*)
by using linear scale
• Accuracy can be improved by using numeric field
ranges
70. Metric: AVG
• Not additive
• But since avg = sum/count, it can be predicted
by combining the sum and count estimates from
linear scaling (not 100% accurate)
71. Metric: MIN/MAX
• Additive
• But not linear
• Do not apply linear scaling
• Min/max from the sample is not the worst idea, either
• If the first lattice level is calculated from the db,
the second can be predicted with a few tricks
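One such trick relies on MAX being additive: a coarser lattice node can be rolled up exactly from any finer node below it. A sketch under illustrative names (not QBIC code):

```java
import java.util.*;

// MAX is additive, so a coarser lattice node can be rolled up exactly
// from a finer one: max per city = max over the (city, gender) maxes.
public class MaxRollup {
    public static Map<String, Double> rollup(Map<List<String>, Double> fine) {
        Map<String, Double> coarse = new HashMap<>();
        for (Map.Entry<List<String>, Double> e : fine.entrySet()) {
            String city = e.getKey().get(0);          // drop the gender dimension
            coarse.merge(city, e.getValue(), Math::max);
        }
        return coarse;
    }
}
```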
74. The rest of the data can be filled in
through
• An extra query
• Additional data carried with the lattice
(along with the max value, save the whole row)
• Even without them, you have the structure and the values,
so you can start drawing the results while the query
is running
• Another form of sharpening
75. Metric: DISTINCT_COUNT
• Not additive
• Not linear
• We can get a rough estimate by using the min/max
scale,
or by applying curve fitting to a sample,
similar to what we did when predicting the
lattice
77. If we can predict the data
we can use it in
Sharpening™
78. (Better) Sharpening
• Instant
• No queries to the DB
• Not restricted to time fields only
• No requirement for index/partition support
• Can still be combined with microquerying!
79. Well, if you can return data
without sending the queries...
80. Offline Mode
• Explore the data without generating DB load
• Find the correct questions before asking the DB
• No need for a separate "RUN" button
• No need for cancellation support when you send the
wrong query
• You can (and should) always fall back to the
database
82. Data Dependencies
• Data is not just rows and columns
• There are hidden connections between cells,
rows, and columns
• We can identify them to make the software easier to use
• They are not 100% accurate, since there are
always exceptions, but with a clever approach...
83. DD1: High Cardinality
• High-cardinality fields have a huge number of distinct
values
• Define what "high" means for your application
• If the distinct count on a sample equals the sample size,
it is probably a high-cardinality field (some sort of id)
• High-cardinality fields have lower importance in group
bys and defaults than regular fields; you can also
drop them from the pregen and the lattice, since they
hold too much data.
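The sample-based heuristic above can be sketched as follows; the 0.95 threshold is an illustrative choice of what "almost equal to the sample size" means:

```java
import java.util.*;

// DD1 heuristic: if the distinct count on a sample is (almost) equal
// to the sample size, flag the field as high-cardinality.
public class HighCardinalityCheck {
    public static boolean isHighCardinality(List<String> sampleValues) {
        long distinct = new HashSet<>(sampleValues).size();
        // 0.95 threshold is an illustrative assumption, not from the talk
        return distinct >= 0.95 * sampleValues.size();
    }
}
```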
84. DD2: One-to-one fields
• One-to-one fields are pairs with related data
• STORE_ID and STORE_NAME
• Having one field, you can infer the other, and vice versa
• If two fields have the same distinct count on a sample,
they are probably one-to-one fields
• NAME has more importance than its ID counterpart for the UI
• You can also drop the ID part from the lattice and maintain
only NAME with a mapping.
85. DD3: Child-Parent fields
• Parent-child fields are similar to one-to-one fields,
but you can only go from child to parent, not vice versa
• CITY and COUNTRY (child-parent)
• CITY is more important than COUNTRY (having CITY you
can infer the COUNTRY)
• To find child-parent fields, explore the values in a
sample
• More optimizations for the lattice and for UI
recommendations
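The DD2 and DD3 checks can both be sketched as a functional-dependency test on a sample (illustrative code, not the QBIC implementation): field A determines field B if every A value maps to exactly one B value. Child-parent holds one way (CITY → COUNTRY); one-to-one fields determine each other in both directions.

```java
import java.util.*;

// Field A functionally determines field B on a sample if every
// A value maps to exactly one B value.
public class DependencyCheck {
    public static boolean determines(List<String> a, List<String> b) {
        Map<String, String> mapping = new HashMap<>();
        for (int i = 0; i < a.size(); i++) {
            // putIfAbsent returns the previously stored value, or null
            String prev = mapping.putIfAbsent(a.get(i), b.get(i));
            if (prev != null && !prev.equals(b.get(i))) return false;
        }
        return true;
    }
}
```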
86. DD4: Empty Columns
• Sounds easy, but currently empty columns are
treated the same way as the others
• We can easily drop them from the lattice to
reduce its size
• We can remove them from the UI
87. DD?: Define your own Data
Dependency
• Explain the dependency
• Give a rule to identify the data dependency
• Say where the knowledge can be applied
94. Bruteforce
• KSBruteforceSolver.java
• Label each object either 1 (take) or 0 (do not take)
• Build all possible configurations
• Select the best one
• Complexity O(2^n), where k is the number of columns
and n = 2^k is the lattice size
• For example, for k columns there are 2^(2^k)
possible combinations
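A sketch of the idea behind KSBruteforceSolver.java (not the actual source):

```java
// Try every 0/1 labeling of the items and keep the best feasible one.
// O(2^n) configurations for n items.
public class BruteforceKnapsack {
    public static int best(int[] weights, int[] values, int capacity) {
        int n = weights.length, best = 0;
        for (int mask = 0; mask < (1 << n); mask++) {
            int w = 0, v = 0;
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) { w += weights[i]; v += values[i]; }
            }
            if (w <= capacity) best = Math.max(best, v);
        }
        return best;
    }
}
```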
95. Branch & Bound
• KSBoundBranchSolver.java
• Explore the search space in a depth-first
manner, computing an optimistic estimate for each
node to be explored; this way you can prune
nodes whose optimistic estimate is worse than the best
solution found so far
• In most cases faster than bruteforce, but the
worst-case complexity is still O(2^n)
96. Dynamic Programming
• KSDynamicProgrammingSolver.java
• Exploits the classic DP idea: solve the smaller
problem first, then reuse the results in the bigger
problem.
• For some cases extremely fast (pseudopolynomial
O(n*W) complexity)
• A lot of constraints (weights must be integers,
weight limits should be relatively small)
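A sketch of the idea behind KSDynamicProgrammingSolver.java (not the actual source):

```java
// dp[c] holds the best value achievable with capacity c.
// Pseudopolynomial O(n * W); requires integer weights.
public class DpKnapsack {
    public static int best(int[] weights, int[] values, int capacity) {
        int[] dp = new int[capacity + 1];
        for (int i = 0; i < weights.length; i++) {
            // iterate capacities in reverse so each item is used at most once (0-1)
            for (int c = capacity; c >= weights[i]; c--) {
                dp[c] = Math.max(dp[c], dp[c - weights[i]] + values[i]);
            }
        }
        return dp[capacity];
    }
}
```

The reverse inner loop is what makes this the 0-1 variant; a forward loop would allow taking each pregen more than once.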
98. Greedy Algorithm
• KSGreedySolver.java
• Order all items by "value" (not to be confused with the
knapsack value) and take them until you reach the weight
constraint
• Strategies: VALUABLE, LIGHTER, DENSITY (value/weight)
• Complexity O(n log n); could be even faster
• A practical approach when the constraints are large values
(!)
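A sketch of the DENSITY strategy behind KSGreedySolver.java (not the actual source):

```java
import java.util.*;

// Greedy DENSITY strategy: sort items by value/weight ratio and take
// them while they fit. O(n log n); no optimality guarantee.
public class GreedyKnapsack {
    public static int best(int[] weights, int[] values, int capacity) {
        Integer[] idx = new Integer[weights.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // sort indices by descending value/weight density
        Arrays.sort(idx, (a, b) -> Double.compare(
                (double) values[b] / weights[b], (double) values[a] / weights[a]));
        int total = 0, value = 0;
        for (int i : idx) {
            if (total + weights[i] <= capacity) {
                total += weights[i];
                value += values[i];
            }
        }
        return value;
    }
}
```

Swapping the comparator gives the VALUABLE (by value) or LIGHTER (by weight) strategies.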
99. Genetic Algorithm
• KSGeneticSolver.java
• A general algorithm that can be applied to almost any problem:
Init: generate a random population (a set of solutions)
Selection: choose the best solutions
Crossover: produce new solutions by breeding the previous
best solutions
Mutation (optional): randomly change the new-born
solutions
Generation step: if the new population is better than the
previous one, replace it and go back to selection