QBIC
from Labs with ❤ by Misha Kozik
QBIC
• Cubic, a.k.a. Query Boost Intelligent Compiler
• Experimental project in Labs
• Recall Math
• Find Data Dependencies
• Use Machine Learning
• ...to speed up queries to db
Rationale
• Most BI tools have nothing to do with the letter I
• UI -> buttons -> query -> response -> chart
• Query could run forever
• There are ways to speed up (partitions, indexes,
cache, cubes)
• ...
• Is there a way to get data without sending query?
Computers are
smarter than
we think
A bit of
Math
Set
• Not related to java.util.Set
• Set of elements, for example {apple, tuesday,
42}
• No duplicates, {1, 1, 2} the same as {1, 2}
• No order, {1, 2} the same as {2, 1}
• Set could be element of another set {1, 2, {3,
4}}
Set of all subsets
• Powerset
• A is a subset of B ($A \subseteq B$)
• Every element of A is in B
• {1, 2} is a subset of {1, 2, 3}
• Powerset of {1, 2, 3}:
{{}, {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}}
Ordered set
• Set with a defined order relation ($\le$)
• $\le$ is a binary relation ($a \le b$)
• Set {1, 9, 42} is ordered, with order relation
"less"
• Set {[emoji], [emoji], [emoji]} is ordered, with order relation
"funnier emoji"
• $\le$ holds for every pair of elements!
Partially Ordered Set
• Something ordered, something not
• Everything that can be ordered is ordered
• Some pairs of elements may be incomparable
• {1, 4, 7, Tuesday, 42}
Lattice
• Partially ordered set in which every pair of
elements has a least upper bound (supremum)
and a greatest lower bound (infimum)
• So even if elements A and B are not comparable,
there is an element C that is "greater" than both
and an element D that is "less" than both
• Hard?
Values of Senior Engineer
(sarcasm)
• House, car
• Build all possible combinations of those values
(powerset)
{{no house, no car},
{house, no car},
{no house, car},
{house, car}}
Every bottom-up
path in a lattice
is an ordered set
How that's related
to Databases?
Typical Query
select city, gender, count(*) -- groups and metrics
from table -- table
where income > 1000 and state = 'P' -- filters
group by city, gender -- groups again
having count(*) > 5 -- more filters
order by count(*) desc -- sorting
limit 10 offset 0 -- limits
Typical Query: simplify
select city, gender, count(*)
from table
group by city, gender
Note: keep the attribute groups and only count(*) from the metrics; no filters, no sorting, no limits.
We will try to add them back later.
Build lattice by groups
• Take all attribute columns from table
city, gender, state
• Build all possible group combinations, powerset
{city, gender, state}
{city, gender}, {city, state}, {gender, state}
{city}, {gender}, {state}
{}
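The powerset above can be generated mechanically. The following is a minimal sketch, not the project's actual code: the class name `GroupLattice` and the level-by-level map layout are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: builds the group-by lattice for a list of columns.
final class GroupLattice {
    // Returns every subset of the columns, keyed by level (subset size).
    static Map<Integer, List<Set<String>>> build(List<String> columns) {
        Map<Integer, List<Set<String>>> levels = new TreeMap<>();
        int n = columns.size();
        // Each bitmask from 0 to 2^n - 1 selects one subset of the columns.
        for (int mask = 0; mask < (1 << n); mask++) {
            Set<String> node = new TreeSet<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) {
                    node.add(columns.get(i));
                }
            }
            levels.computeIfAbsent(node.size(), k -> new ArrayList<>()).add(node);
        }
        return levels;
    }
}
```

For `city, gender, state` this yields 1 + 3 + 3 + 1 = 8 nodes. The node count doubles with every extra column, which is why the later slides look for ways to shrink the lattice.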
Pregen
optimization
Pregen
• A pregen is the result of a completed database query,
saved somewhere (a.k.a. cache or view)
• Pros:
(1) very fast to get data from a pregen
(2) fast to get data for all pregens below it
• Cons:
(1) takes time to calculate (and maintain) a pregen
(2) takes disk space to store a pregen
Light formalization
• There is a lattice
• If no pregens exist, user query time equals
database query time, and no space for
pregens is needed
• If a pregen exists, user query time will be very short,
but the pregen will occupy some space,
and the database query time is needed once to prepare
the pregen
Note: from a pregen you can calculate all pregens below it
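One possible way to write these statements down is sketched here. All symbols are assumptions introduced for illustration, since the slide's own notation did not survive: for a lattice node $v$, $T_{db}(v)$ is the database query time, $T_{user}(v)$ the time the user waits, $T_{read}(v)$ the time to read a stored pregen, and $S(v)$ the space the pregen occupies.

```latex
\begin{aligned}
\text{no pregen for } v:&\quad T_{user}(v) = T_{db}(v), &&\text{space} = 0 \\
\text{pregen for } v:&\quad T_{user}(v) = T_{read}(v) \ll T_{db}(v), &&\text{space} = S(v), \quad \text{one-off cost} = T_{db}(v)
\end{aligned}
```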
Pregen optimization flow
• Scan table
Read metadata
Detect what columns can be groups
• Build lattice by groups
• Fill every lattice node with props
- pregen size (rows or MB)
- time to retrieve data from db
• Apply optimization algorithm
What algorithm for
optimization
should we apply?
Knapsack Problem
• 0-1 Knapsack Problem
• Decide which items to put into the
knapsack to maximize its
value without exceeding the
maximum available weight
• NP-hard
• Exact and approximate
solutions (see Appendix 1: Review of algorithms for Knapsack Problem)
(Practical) Knapsack #1
• Dynamic programming with rounded values
• Power10 (log) scale
• Examples:
0.2, could be rounded to 1 (10^0)
1.2, could be rounded to 1 (10^0)
8.6, could be rounded to 10 (10^1)
123.6, could be rounded to 100 (10^2)
..and so on
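The rounding rule can be sketched as follows. This is an illustration under assumptions: the class and method names are hypothetical, and values below 1 are clamped to 10^0 because the examples above round 0.2 to 1.

```java
// Hypothetical helper: rounds weights to powers of ten so that the
// dynamic-programming table stays small.
final class Pow10Rounding {
    // Rounds a positive value to the nearest power of ten on a log scale.
    // Values below 1 are clamped to 10^0, matching the slide's examples.
    static long roundToPowerOf10(double value) {
        if (value <= 0) {
            throw new IllegalArgumentException("value must be positive");
        }
        long exponent = Math.max(0L, Math.round(Math.log10(value)));
        long result = 1;
        for (long i = 0; i < exponent; i++) {
            result *= 10;
        }
        return result;
    }
}
```

Rounding trades accuracy for a much smaller set of distinct weights, which is acceptable here because the optimization only needs approximate values.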
(Practical) Knapsack #2
• Genetic algorithm with a Greedy initial population
• Genetic performs no worse than Greedy
• You can control the running time of algorithm
• Tradeoff (time/accuracy)
• Open for modifications
Pregen
Optimization
Problem
• Decide which pregens in the
lattice should be calculated to
maximize lattice value without
exceeding the maximum available
weight
Lattice: Weight
• Space occupied by pregen (rows, better
megabytes)
• Time to prepare pregen (time to query db )
• Time to support pregen (depends on update
frequency)
Lattice: Value
• If a pregen is taken, its value is 1, otherwise 0
• How many other pregens can be calculated from it
• Query popularity
Pregen Optimization
• At this point we can get a list of the pregens
that are optimal to maintain
• If new properties appear in the system
(popularity, data updates), we can recalculate
a new optimal lattice
• Future: filters
Problem
• For large tables, labeling lattice nodes with
actual values is a very expensive process
• Basically, we need to calculate all possible
group bys
• Since it's only an optimization, we don't need exact
values
• ...
• What if we can predict them?
Predict Lattice
Values: No Input
• Input: nothing
• We can't predict anything if
we have no input
Predict Lattice
Values: count(*)
• Input: select count(*)
• Again, we can't predict count
of groups
Predict Lattice
Values: Single
group by
• Input: all single group by
• Fill the first level of
lattice
Lattice Puzzle
• select city, count(*) -- 50
• select gender, count(*) -- 2
• ...
• How many rows will this query
return?
(magic: without sending the
query)
select city, gender, count(*)
Lattice Puzzle: Linear
Scaling
• select city, count(*) -- 50
• select gender, count(*) -- 2
• select city, gender, count(*) -- ???
• Cartesian product (min:50, max:50*2)
• [50..100]
Lattice Puzzle: Linear
Scaling
• From sample you can get group counts
CITY(30), GENDER(2), CITY/GENDER(56)
• Build Interval as
[max(city,gender)..max(city,gender)*min(city,gender)]
Interval = [30..60], L=30, R=60, V=56
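• Calculate ratio: $(V - L) / (R - L) = (56 - 30) / (60 - 30) \approx 0.87$
• Apply the ratio to the real data interval [50..100] to get the estimate: $50 + \frac{26}{30} \cdot (100 - 50) \approx 93$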
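The interval arithmetic on this slide can be sketched in a few lines. The class and method names are hypothetical, and the ratio is taken as the relative position of the sampled pair count inside the sample interval, which is an assumed reading of the slide's "Calculate" step.

```java
// Hypothetical helper: linear scaling of a two-column group count.
final class LinearScaling {
    // Position of the observed pair count inside the sample interval [L..R],
    // where L = max(a, b) and R = a * b for single-column group counts a and b.
    static double ratio(long sampleA, long sampleB, long samplePair) {
        long left = Math.max(sampleA, sampleB);
        long right = sampleA * sampleB;
        if (right == left) {
            return 0.0; // degenerate interval, e.g. one column has a single value
        }
        return (double) (samplePair - left) / (right - left);
    }

    // Applies the sample ratio to the interval built from the real group counts.
    static double estimate(long realA, long realB, double ratio) {
        long left = Math.max(realA, realB);
        long right = realA * realB;
        return left + ratio * (right - left);
    }
}
```

With the slide's numbers, `ratio(30, 2, 56)` is 26/30 and `estimate(50, 2, ratio)` lands inside the real interval [50..100].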
Lattice Puzzle: Linear
Scaling
• OK for small cardinality use cases
• Bad for high cardinality use cases, because it
diverges very quickly
• Bad for small samples
• Bad for non-uniformly distributed samples
• Bad for non-linear functions
• Bad for other metrics
Curve Fitting
Approach
Linear Regression
• $y = ax + b$
• $\sum_i (y_i - (a x_i + b))^2 \to \min$
• One more optimization problem
• Guess the coefficients a and b
• Apply the function to find the estimate
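For a straight line the coefficients do not need to be guessed iteratively; ordinary least squares has a standard closed form. The sketch below assumes the model $y = ax + b$ (the usual reading of "coefficients a and b") fitted to sample points $(x_i, y_i)$, $i = 1..n$.

```latex
a = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2},
\qquad
b = \frac{\sum_i y_i - a \sum_i x_i}{n}
```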
Actually,
not a good fit
* Anscombe Quartet
Anscombe Quartet
• Four different datasets
• Their summary statistics are nearly equal
(mean, variance, correlation, etc.)
• The linear regression line is the same
You can always trick an algorithm
if you know how it behaves
Be optimistic
Curve Fitting
• FitFunction.java
• If linear regression is bad, try other
functions
• LINEAR: $y = ax + b$
• POLY2: $y = ax^2 + bx + c$
• POWER: $y = ax^b$
• LOG: $y = a \ln x + b$
Curve Fitting Flow
• Break sample into input points (x,y)
SAMPLE_SIZE, GROUP_COUNT
• Use the last point as evaluation of your model
• Apply different curves
• The model closest to the evaluation
point becomes your model
• Use this model to predict the lattice values
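The flow above can be sketched as follows. This is an illustration, not the contents of FitFunction.java: the class name `CurveFitting` is hypothetical, only two candidate curves are tried (a line and a logarithm), and both are fitted by least squares after transforming x. At least two points with distinct x values are needed besides the held-out one.

```java
import java.util.function.DoubleUnaryOperator;

// Hypothetical sketch of "fit several curves, keep the best one".
final class CurveFitting {
    // Least-squares fit of y = a * t + b over the first n points; returns {a, b}.
    static double[] fitLine(double[] t, double[] y, int n) {
        double st = 0, sy = 0, stt = 0, sty = 0;
        for (int i = 0; i < n; i++) {
            st += t[i];
            sy += y[i];
            stt += t[i] * t[i];
            sty += t[i] * y[i];
        }
        double a = (n * sty - st * sy) / (n * stt - st * st);
        double b = (sy - a * st) / n;
        return new double[] {a, b};
    }

    // Fits y = a * f(x) + b on the first n points, f being the x transform.
    static DoubleUnaryOperator fit(double[] x, double[] y, int n, DoubleUnaryOperator f) {
        double[] t = new double[n];
        for (int i = 0; i < n; i++) {
            t[i] = f.applyAsDouble(x[i]);
        }
        double[] ab = fitLine(t, y, n);
        return v -> ab[0] * f.applyAsDouble(v) + ab[1];
    }

    // Holds out the last point, fits LINEAR and LOG on the rest, and returns
    // the model with the smaller absolute error at the held-out point.
    static DoubleUnaryOperator selectModel(double[] x, double[] y) {
        int n = x.length - 1;
        DoubleUnaryOperator linear = fit(x, y, n, v -> v);
        DoubleUnaryOperator log = fit(x, y, n, Math::log);
        double errLinear = Math.abs(linear.applyAsDouble(x[n]) - y[n]);
        double errLog = Math.abs(log.applyAsDouble(x[n]) - y[n]);
        return errLog < errLinear ? log : linear;
    }
}
```

Here x would be SAMPLE_SIZE and y GROUP_COUNT; the returned function is then evaluated at the full table size to fill a lattice node.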
Cardinality Levels
• SMALL (0..10)
Gender, Income Group, Status
• MEDIUM (10..5000)
City, County, Category
• HIGH (5000..)
UserID, SKU
Cardinality Levels
• SMALL x SMALL (log works well)
• SMALL x MEDIUM (log works well)
• MEDIUM x MEDIUM (log works well)
• Anything x HIGH (hard to predict)
• ...
• Maybe drop high-cardinality use cases?
Predicted Lattice
Hypothesis: If we
can predict number
of rows, we can
predict data
Predict Data
GroupBy & Count(*)
Metric: COUNT(*)
• Additive
• Can easily be predicted from a sample because it
looks LINEAR
• Could be improved by better sampling techniques
• Could be improved by microqueries
• ...
• What about other metrics?
Metric: SUM
• Additive
• Linear as well
• Can be predicted the same way as count(*)
by using linear scale
• Accuracy can be improved by using numeric field
ranges
Metric: AVG
• Not additive
• But if we assume avg = sum/count (not 100% true),
it can easily be predicted by linear
scale as well
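A minimal sketch of the linear scale for COUNT, SUM and AVG follows. It assumes the sample is a uniform fraction of the table; the class and method names are hypothetical.

```java
// Hypothetical helper: scales sample metrics up to the full table.
final class MetricScaling {
    // Scales an additive metric (COUNT or SUM) seen in a sample up to the
    // full table, assuming the sample is a uniform fraction of the rows.
    static double scaleAdditive(double sampleValue, long sampleRows, long totalRows) {
        return sampleValue * ((double) totalRows / sampleRows);
    }

    // AVG is not additive, so derive it from the scaled SUM and COUNT.
    static double estimateAvg(double sampleSum, double sampleCount,
                              long sampleRows, long totalRows) {
        double sum = scaleAdditive(sampleSum, sampleRows, totalRows);
        double count = scaleAdditive(sampleCount, sampleRows, totalRows);
        return sum / count;
    }
}
```

Note that the scale factor cancels in `estimateAvg`, so under this assumption the predicted AVG for a group is simply the sample AVG for that group.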
Metric: MIN/MAX
• Additive
• But not linear
• Do not apply linear scale
• Min/max from a sample is not the worst idea either
• If the first lattice level is calculated from the db,
the second can be predicted with a few tricks
Produce TOP5 CITY,CATEGORY by MAX(TOTAL) without sending the query?
At least 50% can be
calculated
The rest of the data can be filled in
through
• Extra query
• Additional data carried with the lattice
Along with the max value, save the row
• Even without them, you have the structure and values,
so you can start drawing the results while the query
is running
• Another form of sharpening
Metric: DISTINCT_COUNT
• Not additive
• Not linear
• We can get a rough estimate by using min/max
scale
or by applying curve fitting on a sample,
similar to what we did when predicting the
lattice
For other metrics,
or custom
calculations
we still can find
workarounds
If we can predict the data
we can use it in
Sharpening™
(Better) Sharpening
• Instantly fast
• No queries to DB
• Not limited to time fields
• No requirement for index/partition support
• Still can be combined with microquerying!
Well, if you can return data
without sending the queries...
Offline Mode
• Explore the data without generating DB load
• Find correct questions before asking DB
• No need for the separate "RUN" button
• No need for cancellation support if you sent
the wrong query
• You can (and should) always fall back to the
database
Data Dependencies
Data Dependencies
• Data is not just rows and columns
• There are hidden connections between cells,
rows, columns
• We can identify them to make the software easier to use
• They are not 100% accurate, since there are
always exceptions, but with a clever approach...
DD1: High Cardinality
• High-cardinality fields have a huge number of distinct
values
• Define what "high" means for your application
• If the distinct count on a sample equals the sample size,
it's probably a high-cardinality field (some sort of id)
• High-cardinality fields have lower importance in group
bys and defaults than regular fields; you can also
drop them from pregens and the lattice since they hold too
much data.
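The sample-based rule can be sketched like this. The class name and the `threshold` parameter are assumptions: a threshold of 1.0 reproduces the slide's rule (distinct count equals sample size), while a lower value is one way to "define what high means" for an application.

```java
import java.util.HashSet;
import java.util.List;

// Hypothetical detector for high-cardinality columns.
final class HighCardinality {
    // A column is flagged when the share of distinct values in the sample
    // reaches the given threshold (1.0 means every sampled value is unique).
    static <T> boolean isHighCardinality(List<T> sampleValues, double threshold) {
        if (sampleValues.isEmpty()) {
            return false;
        }
        int distinct = new HashSet<T>(sampleValues).size();
        return (double) distinct / sampleValues.size() >= threshold;
    }
}
```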
DD2: One-to-one fields
• One-to-one fields are pairs, with related data
• STORE_ID and STORE_NAME
• Having one field, you can infer the other and vice versa
• If two fields have the same distinct count on a sample,
they are probably one-to-one fields
• NAME is more important than its ID sibling for the UI
• You can also drop the ID part from the lattice and maintain
only NAME with a mapping.
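A sketch of the detection rule is shown below. The equal-distinct-count test comes from the slide; the extra check on distinct pairs is an addition made here to reject columns that merely happen to share a distinct count. The class name is hypothetical, and the columns are assumed to be parallel lists without nulls.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical detector for one-to-one column pairs such as STORE_ID / STORE_NAME.
final class OneToOne {
    // Columns a and b are parallel lists: a.get(i) and b.get(i) come from row i.
    static boolean looksOneToOne(List<String> a, List<String> b) {
        Set<String> distinctA = new HashSet<>(a);
        Set<String> distinctB = new HashSet<>(b);
        Set<List<String>> distinctPairs = new HashSet<>();
        for (int i = 0; i < a.size(); i++) {
            distinctPairs.add(List.of(a.get(i), b.get(i)));
        }
        // Slide heuristic: equal distinct counts. The pair check additionally
        // requires that no value on either side is paired with two partners.
        return distinctA.size() == distinctB.size()
                && distinctPairs.size() == distinctA.size();
    }
}
```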
DD3: Child-Parent fields
• Child-parent fields are similar to one-to-one fields,
but you can only go from child to parent, not vice versa
• CITY and COUNTRY (child-parent)
• CITY is more important than COUNTRY (having CITY you
can guess the COUNTRY)
• To find child-parent fields, explore the values in a
sample
• More optimizations for the lattice and for UI
recommendations
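"Explore the values in a sample" can be made concrete as a functional-dependency check: a field is a child of another when each child value maps to exactly one parent value, but not the other way round. The sketch below is an illustration with hypothetical names, assuming parallel lists without nulls.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical detector for child-parent column pairs such as CITY / COUNTRY.
final class ChildParent {
    // Returns true when every child value in the sample maps to exactly one
    // parent value, i.e. the parent can be inferred from the child.
    static boolean determines(List<String> child, List<String> parent) {
        Map<String, String> seen = new HashMap<>();
        for (int i = 0; i < child.size(); i++) {
            String previous = seen.putIfAbsent(child.get(i), parent.get(i));
            if (previous != null && !previous.equals(parent.get(i))) {
                return false;
            }
        }
        return true;
    }

    // Child-parent: the dependency holds in one direction only.
    static boolean isChildOf(List<String> child, List<String> parent) {
        return determines(child, parent) && !determines(parent, child);
    }
}
```

On a small sample the check can produce false positives, so, as the slides say about all data dependencies, the result is a hint rather than a guarantee.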
DD4: Empty Columns
• Sounds easy, but currently empty columns
are treated the same way as others
• We can easily drop them from the lattice to
reduce its size
• We can remove them from UI
DD?: Define your own Data
Dependency
• How do you explain the dependency?
• What rule identifies the data dependency?
• Where the knowledge can be applied?
What might the
future of Zoomdata
look like?
Zoomdata Applications
• Pregen management
• Sharpening with no limits
• Offline Mode
• Data Dependencies
• UX Recommendations
Hope for the best,
prepare for the
worst
— Quan Luu
Appendix 1:
Review of
algorithms for
Knapsack Problem
Exact methods
• Bruteforce
• Bound & Branch
• Dynamic Programming
Bruteforce
• KSBruteforceSolver.java
• Label each object either 1 (take) or 0 (do not take)
• Build all possible configurations
• Select the best one
• Complexity $O(2^m)$, where $n$ is the number of columns and $m = 2^n$ is the
lattice size
• For example, for $n = 3$ columns the lattice has $2^3 = 8$ nodes, so there are $2^8 = 256$
possible combinations
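The brute-force idea can be sketched as below. This is not the contents of KSBruteforceSolver.java; the class name, signature and bitmask encoding are assumptions made for illustration.

```java
// Hypothetical brute-force 0-1 knapsack: tries every take/skip labeling.
final class BruteforceKnapsack {
    // Returns the best total value whose total weight fits into the capacity.
    // Bit i of the mask is the 1 (take) / 0 (do not take) label of item i,
    // so this only works for fewer than 31 items and is practical for far fewer.
    static long solve(long[] values, long[] weights, long capacity) {
        int n = values.length;
        long best = 0;
        for (int mask = 0; mask < (1 << n); mask++) {
            long value = 0;
            long weight = 0;
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) {
                    value += values[i];
                    weight += weights[i];
                }
            }
            if (weight <= capacity && value > best) {
                best = value;
            }
        }
        return best;
    }
}
```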
Bound & Branch
• KSBoundBranchSolver.java
• Explores the search space depth-first,
computing an optimistic estimate for each
node to be explored; this lets you prune
nodes whose optimistic estimate is worse than the
best solution found so far
• In most cases faster than bruteforce, but worst-case
complexity is still exponential ($O(2^m)$ for $m$ items)
Dynamic Programming
• KSDynamicProgrammingSolver.java
• Exploits the classic DP idea: solve smaller
problems first, then reuse the results in bigger
problems.
• For some cases extra fast (pseudopolynomial
complexity)
• A lot of constraints (weights are integers,
weight limits should be relatively small)
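A minimal sketch of the textbook DP for 0-1 knapsack is shown below, under the constraints listed above (integer weights, modest capacity). It is not the contents of KSDynamicProgrammingSolver.java; the class name and signature are hypothetical.

```java
// Hypothetical DP solver for 0-1 knapsack with integer weights.
final class DynamicProgrammingKnapsack {
    // best[w] holds the best value achievable with total weight at most w.
    // Time is O(items * capacity), which is why the capacity must stay small.
    static long solve(long[] values, int[] weights, int capacity) {
        long[] best = new long[capacity + 1];
        for (int i = 0; i < values.length; i++) {
            // Iterate weights downwards so each item is used at most once.
            for (int w = capacity; w >= weights[i]; w--) {
                best[w] = Math.max(best[w], best[w - weights[i]] + values[i]);
            }
        }
        return best[capacity];
    }
}
```

Rounding weights to a coarse scale before running the DP shrinks `capacity` and therefore the table, at the cost of exactness.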
Approximate methods
• Greedy Algorithm
• Genetic Algorithm
Greedy Algorithm
• KSGreedySolver.java
• Order all items by "value" (not to be confused with
knapsack value) and take them until you reach the weight
constraint
• Strategies: VALUABLE, LIGHTER, DENSITY (value/weight)
• Complexity $O(m \log m)$ for $m$ items (dominated by sorting), could be even faster
• Practical approach when constraints are large values
(!)
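The DENSITY strategy can be sketched as follows. This is an illustration rather than the contents of KSGreedySolver.java: the names are hypothetical, weights are assumed to be positive, and items that do not fit are skipped instead of stopping at the first one that fails.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical greedy knapsack using the DENSITY (value/weight) strategy.
final class GreedyKnapsack {
    // Takes items in order of decreasing value/weight while they still fit.
    // Returns a take/skip flag per item; the result is approximate.
    static boolean[] solve(double[] values, double[] weights, double capacity) {
        Integer[] order = new Integer[values.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Negate the density so that the densest items come first.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -values[i] / weights[i]));
        boolean[] taken = new boolean[values.length];
        double remaining = capacity;
        for (int i : order) {
            if (weights[i] <= remaining) {
                taken[i] = true;
                remaining -= weights[i];
            }
        }
        return taken;
    }
}
```

For values {60, 100, 120}, weights {10, 20, 30} and capacity 50, greedy takes the first two items (total value 160), while the exact optimum is 220. That gap is the price of the speed.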
Genetic Algorithm
• KSGeneticSolver.java
• A general-purpose algorithm that can be applied to many optimization problems.
Init. Generate a random population (set of solutions).
Selection. Choose the best solutions.
Crossover. Produce new solutions by breeding the previous
best solutions.
Mutation (optional). Randomly change newborn
solutions.
Generation step. If the new population is better than the
previous one, replace it and move to selection again.
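The steps above can be sketched for the knapsack case as follows. This is a simplified illustration, not the contents of KSGeneticSolver.java: names and parameters are hypothetical, the generation step is replaced by always keeping the better half of the population, and the first individual is a caller-supplied seed (for example a greedy solution), so the result is never worse than the seed.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

// Hypothetical genetic solver for 0-1 knapsack (requires at least one item).
final class GeneticKnapsack {
    // Total value of a solution, or -1 when it exceeds the capacity.
    static long fitness(boolean[] genes, long[] values, long[] weights, long capacity) {
        long value = 0;
        long weight = 0;
        for (int i = 0; i < genes.length; i++) {
            if (genes[i]) {
                value += values[i];
                weight += weights[i];
            }
        }
        return weight <= capacity ? value : -1;
    }

    static boolean[] solve(long[] values, long[] weights, long capacity, boolean[] seed,
                           int populationSize, int generations, Random random) {
        int n = values.length;
        Comparator<boolean[]> byFitnessDesc = Comparator.comparingLong(
                (boolean[] g) -> -fitness(g, values, weights, capacity));
        // Init: the seed plus random individuals.
        boolean[][] population = new boolean[populationSize][];
        population[0] = seed.clone();
        for (int p = 1; p < populationSize; p++) {
            population[p] = new boolean[n];
            for (int i = 0; i < n; i++) {
                population[p][i] = random.nextBoolean();
            }
        }
        for (int g = 0; g < generations; g++) {
            // Selection: sort by fitness and keep the better half as parents.
            Arrays.sort(population, byFitnessDesc);
            int parents = Math.max(1, populationSize / 2);
            for (int p = parents; p < populationSize; p++) {
                boolean[] a = population[random.nextInt(parents)];
                boolean[] b = population[random.nextInt(parents)];
                // Crossover: a prefix from one parent, the rest from the other.
                int cut = random.nextInt(n + 1);
                boolean[] child = new boolean[n];
                for (int i = 0; i < n; i++) {
                    child[i] = i < cut ? a[i] : b[i];
                }
                // Mutation: flip one random gene.
                int flip = random.nextInt(n);
                child[flip] = !child[flip];
                population[p] = child;
            }
        }
        Arrays.sort(population, byFitnessDesc);
        return population[0];
    }
}
```

The `generations` parameter is the time/accuracy knob mentioned earlier: more generations cost more time but give the search more chances to improve on the seed.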
Thanks

Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
 
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi ArabiaTop 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 

QBIC

  • 1. QBIC from Labs with ❤ by Misha Kozik
  • 2. QBIC • Cubic a.k.a Query Boost Intelligent Compiler • Experimental project in Labs • Recall Math • Find Data Dependencies • Use Machine Learning • ...to speed up queries to db
  • 3. Rationale • Most BI tools have nothing to do with the letter I • UI -> buttons -> query -> response -> chart • A query could run forever • There are ways to speed it up (partitions, indexes, cache, cubes) • ... • Is there a way to get data without sending a query?
  • 6. Set • Not related to java.util.Set • A set of elements, for example {apple, tuesday, 42} • No duplicates: {1, 1, 2} is the same as {1, 2} • No order: {1, 2} is the same as {2, 1} • A set can be an element of another set: {1, 2, {3, 4}}
  • 7. Set of all subsets • Powerset • A is a subset of B (A ⊆ B) • Every element of A is in B • {1, 2} is a subset of {1, 2, 3} • Powerset of {1, 2, 3}: {{}, {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}}
  • 8. Ordered set • A set with a defined order relation (≤) • ≤ is a binary relation (a ≤ b) • The set {1, 9, 42} is ordered, with the order relation "less" • A set of three emoji is ordered, with the order "more funny emoji" • a ≤ b or b ≤ a holds for every pair of elements!
  • 10. Partially Ordered Set • Some things are ordered, some are not • Everything that can be ordered is ordered • Some pairs of elements may be incomparable • {1, 4, 7, Tuesday, 42}
  • 11. Lattice • A partially ordered set in which every pair of elements has a least upper bound (supremum) and a greatest lower bound (infimum) • So even if elements A and B are not comparable, there is an element that is "greater" than both A and B, and an element that is "less" than both • Hard?
  • 12. Values of Senior Engineer (sarcasm) • House, car • Build all possible combinations of those values (powerset) {{no house, no car}, {house, no car}, {no house, car}, {house, car}}
  • 17. Every bottom-up path in a lattice is an ordered set
  • 19. How is that related to databases?
  • 20. Typical Query
      select city, gender, count(*)        -- groups and metrics
      from table                           -- table
      where income > 1000 and state = 'P'  -- filters
      group by city, gender                -- groups again
      having count(*) > 5                  -- more filters
      order by count(*) desc               -- sorting
      limit 10 offset 0                    -- limits
  • 21. Typical Query: simplify [1]
      select city, gender, count(*)
      from table
      group by city, gender
      [1] Keep the attribute groups and only count(*) from the metrics: no filters, no sorting, no limits. Later we will try to add them back.
  • 22. Build lattice by groups • Take all attribute columns from the table: city, gender, state • Build all possible group combinations (the powerset):
      {city, gender, state}
      {city, gender}, {city, state}, {gender, state}
      {city}, {gender}, {state}
      {}
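The lattice of group-by combinations can be sketched in a few lines of Python. This is only an illustration, not the QBIC implementation: the function names `build_lattice` and `can_derive` are hypothetical, and the subset test assumes an additive metric such as count(*), where a coarser grouping can be rolled up from a finer one.

```python
from itertools import combinations


def build_lattice(columns):
    """Return the powerset of columns as levels, from all columns down to {}."""
    cols = sorted(columns)
    return [
        [frozenset(combo) for combo in combinations(cols, size)]
        for size in range(len(cols), -1, -1)
    ]


def can_derive(source, target):
    """Assumption: a node can be rolled up into any node that groups by a subset of its columns."""
    return target <= source


levels = build_lattice(["city", "gender", "state"])
for level in levels:
    print([sorted(node) for node in level])
```

For three columns this yields four levels and eight nodes in total, matching the listing on the slide.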
  • 25. Pregen • A pregen is a completed database query whose result is saved somewhere (a cache or a view) • Pros: (1) getting data from a pregen is very fast (2) getting data for all pregens below it is fast • Cons: (1) time is needed to calculate (and maintain) the pregen (2) disk space is needed to store the pregen
  • 29. Light formalization • There is a lattice • If no pregens exist, user query time equals database query time, and no space for pregens is needed • If a pregen exists, user query time is very fast, but the pregen occupies some space and one database query is needed to prepare it • Note: from a pregen you can calculate all pregens below it
  • 30. Pregen optimization flow • Scan the table: read metadata, detect which columns can be groups • Build the lattice by groups • Fill every lattice node with properties: pregen size (rows or MB), time to retrieve the data from the db • Apply the optimization algorithm
  • 33. Knapsack Problem • 0-1 Knapsack Problem • Decide which items to put into the knapsack to maximize its value without exceeding the maximum available weight • NP-hard • Exact and approximate solutions [APD1] • [APD1] Appendix 1: Review of algorithms for Knapsack Problem
  • 35. (Practical) Knapsack #1 • Dynamic programming with rounded values • Power-of-10 (log) scale • Examples: 0.2 could be rounded to 1 (10^0); 1.2 could be rounded to 1 (10^0); 8.6 could be rounded to 10 (10^1); 123.6 could be rounded to 100 (10^2); ...and so on
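A minimal sketch of the power-of-10 rounding, written to reproduce the four examples on the slide. The function name `round_to_power_of_10` is hypothetical, and clamping the exponent at 0 is an assumption inferred from the 0.2 -> 1 example; the actual solver may round differently.

```python
import math


def round_to_power_of_10(x):
    """Round a positive number to the nearest power of 10 on a log scale.

    Assumption: the exponent is clamped at 0, so values below 1 map to 1
    (this matches the slide's 0.2 -> 1 example).
    """
    if x <= 0:
        raise ValueError("x must be positive")
    exponent = max(0, round(math.log10(x)))
    return 10 ** exponent


for value in (0.2, 1.2, 8.6, 123.6):
    print(value, "->", round_to_power_of_10(value))
```

Coarse weights like these keep the dynamic programming table small, at the cost of accuracy.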
  • 36. (Practical) Knapsack #2 • Genetic algorithm with a Greedy initial population • Genetic performs no worse than Greedy • You can control the running time of the algorithm • Tradeoff (time/accuracy) • Open to modifications
  • 37. Pregen Optimization Problem • Decide which pregens in a lattice should be calculated to maximize the lattice value without exceeding the maximum available weight
  • 38. Lattice: Weight • Space occupied by the pregen (rows, or better, megabytes) • Time to prepare the pregen (time to query the db) • Time to maintain the pregen (depends on update frequency)
  • 39. Lattice: Value • If a pregen is taken, the value is 1, otherwise 0 • How many other pregens can be calculated from it • Query popularity
  • 40. Pregen Optimization • At this point we can get the list of pregens that are optimal to maintain • If new properties appear in the system (popularity, data updates), we can recalculate a new optimal lattice • Future: filters
  • 41. Problem • For large tables, labeling lattice nodes with actual values is a very expensive process • Basically, we need to calculate all possible group bys • Since it's only an optimization, we don't need exact values • ... • What if we could predict them?
  • 42. Predict Lattice Values: No Input • Input: nothing • We can't predict anything if we have no input
  • 43. Predict Lattice Values: count(*) • Input: select count(*) • Again, we can't predict count of groups
  • 44. Predict Lattice Values: Single group by • Input: all single-column group bys • Fill the first level of the lattice
  • 45. Lattice Puzzle • select city, count(*) -- 50 • select gender, count(*) -- 2 • ... • How many rows will this query return? (magic: without sending the query) select city, gender, count(*)
  • 46. Lattice Puzzle: Linear Scaling • select city, count(*) -- 50 • select gender, count(*) -- 2 • select city, gender, count(*) -- ??? • Cartesian product (min:50, max:50*2) • [50..100]
  • 48. Lattice Puzzle: Linear Scaling • From a sample you can get group counts: CITY(30), GENDER(2), CITY/GENDER(56) • Build the interval as [max(city,gender)..max(city,gender)*min(city,gender)]: Interval = [30..60], L=30, R=60, V=56 • Calculate the ratio (V - L) / (R - L) = 26/30 ≈ 0.87 • Apply the ratio to the real data interval [50..100] to get the estimate
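The linear scaling step can be sketched as follows, using the numbers from the slide. The slide's formula did not survive in the transcript, so the ratio (V - L) / (R - L) used here is an assumption about what it computed, and the function names are hypothetical.

```python
def interval(count_a, count_b):
    """Bounds for the number of (a, b) groups: [max(a, b) .. max(a, b) * min(a, b)]."""
    return max(count_a, count_b), count_a * count_b


def linear_scale_estimate(sample_a, sample_b, sample_ab, real_a, real_b):
    """Place the sample's observed group count inside its interval,
    then apply the same relative position to the real-data interval."""
    left, right = interval(sample_a, sample_b)
    # Assumed ratio: relative position of V between L and R.
    ratio = (sample_ab - left) / (right - left) if right > left else 0.0
    real_left, real_right = interval(real_a, real_b)
    return real_left + ratio * (real_right - real_left)


# Sample: CITY(30), GENDER(2), CITY/GENDER(56); real data: CITY(50), GENDER(2).
estimate = linear_scale_estimate(30, 2, 56, 50, 2)
print(round(estimate, 1))
```

With these inputs the ratio is 26/30 and the estimate lands about 87% of the way through the interval [50..100].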
  • 49. Lattice Puzzle: Linear Scaling • OK for low-cardinality use cases • Bad for high-cardinality use cases, because the estimate diverges quickly • Bad for small samples • Bad for non-uniformly distributed samples • Bad for non-linear functions • Bad for other metrics
  • 52. Linear Regression • y = a*x + b • Σ (y_i - (a*x_i + b))^2 -> min • One more optimization problem • Guess the coefficients a and b • Apply the function to find the estimate
  • 53. Actually, not a good fit (see Anscombe's quartet)
  • 55. Anscombe's Quartet • Four different datasets • Their summary statistics are nearly identical (mean, variance, correlation, etc.) • The linear regression line is the same
  • 56. You can always trick an algorithm if you know how it behaves • Be optimistic
  • 57. Curve Fitting • FitFunction.java • If linear regression is bad, try other functions • LINEAR • POLY2 • POWER • LOG
  • 58. Curve Fitting Flow • Break the sample into input points (x, y) = (SAMPLE_SIZE, GROUP_COUNT) • Hold out the last point to evaluate your models • Fit the different curves • The model that comes closest to the evaluation point becomes your model • Use this model to predict the lattice values
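The flow above can be sketched in Python. This is not FitFunction.java: the curve forms are the textbook ones (LINEAR y = a*x + b, LOG y = a*ln(x) + b, POWER y = a*x^b) and are assumptions, POLY2 is left out for brevity, all function names are hypothetical, and the input points are synthetic values generated from a log curve purely for illustration.

```python
import math


def least_squares(xs, ys):
    """Ordinary least squares for y = a*x + b (xs must not all be equal)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def fit_linear(xs, ys):
    a, b = least_squares(xs, ys)
    return lambda x: a * x + b


def fit_log(xs, ys):
    # Linear fit against ln(x); requires x > 0.
    a, b = least_squares([math.log(x) for x in xs], ys)
    return lambda x: a * math.log(x) + b


def fit_power(xs, ys):
    # Linear fit in log-log space; requires x > 0 and y > 0.
    b, ln_a = least_squares([math.log(x) for x in xs], [math.log(y) for y in ys])
    return lambda x: math.exp(ln_a) * x ** b


FITTERS = {"LINEAR": fit_linear, "LOG": fit_log, "POWER": fit_power}


def choose_model(points):
    """Fit every curve on all points but the last, keep the one closest to the last point."""
    *train, (x_eval, y_eval) = points
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    best = None
    for name, fitter in FITTERS.items():
        model = fitter(xs, ys)
        error = abs(model(x_eval) - y_eval)
        if best is None or error < best[0]:
            best = (error, name, model)
    return best[1], best[2]


# Synthetic (SAMPLE_SIZE, GROUP_COUNT) points that follow a log curve.
points = [(x, 10 * math.log(x)) for x in (100, 200, 400, 800, 1600)]
name, model = choose_model(points)
print(name, round(model(3200), 2))
```

Holding out a single point is a very small validation set, so in practice the choice between curves can be noisy.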
  • 59. Cardinality Levels • SMALL (0..10): Gender, Income Group, Status • MEDIUM (10..5000): City, County, Category • HIGH (5000..): UserID, SKU
  • 60. Cardinality Levels • SMALL x SMALL (log works well) • SMALL x MEDIUM (log works well) • MEDIUM x MEDIUM (log works well) • Anything x HIGH (hard to predict) • ... • Maybe drop high-cardinality use cases?
  • 63. Hypothesis: If we can predict number of rows, we can predict data
  • 68. Metric: COUNT(*) • Additive • Can easily be predicted from a sample because it looks LINEAR • Could be improved by better sampling techniques • Could be improved by microqueries • ... • What about other metrics?
  • 69. Metric: SUM • Additive • Linear as well • Can be predicted the same way as count(*) by using linear scale • Accuracy can be improved by using numeric field ranges
  • 70. Metric: AVG • Not additive • But if we assume avg = sum/count (not 100% true), it can easily be predicted by linear scale as well
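To illustrate why additivity matters, here is a small sketch of rolling a finer pregen up into a coarser one. The data layout, the `roll_up` name, and the numbers are invented for illustration; the point is that count and sum can simply be added across groups, while avg has to be recomputed as sum/count rather than averaged.

```python
from collections import defaultdict


def roll_up(pregen, keep):
    """Aggregate a pregen to a coarser grouping.

    pregen: dict mapping a group tuple to {"count": ..., "sum": ...}
    keep:   indices of the group columns to keep
    """
    totals = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for group, metrics in pregen.items():
        key = tuple(group[i] for i in keep)
        totals[key]["count"] += metrics["count"]  # additive
        totals[key]["sum"] += metrics["sum"]      # additive
    # avg is not additive: derive it from the rolled-up sum and count.
    return {
        key: {**m, "avg": m["sum"] / m["count"]}
        for key, m in totals.items()
    }


# Hypothetical {city, gender} pregen rolled up to {city}.
by_city_gender = {
    ("Kyiv", "F"): {"count": 2, "sum": 10.0},
    ("Kyiv", "M"): {"count": 3, "sum": 20.0},
    ("Lviv", "F"): {"count": 1, "sum": 5.0},
}
by_city = roll_up(by_city_gender, keep=[0])
print(by_city)
```

Averaging the two Kyiv averages directly would give a different number than sum/count, which is the sense in which AVG is not additive.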
  • 71. Metric: MIN/MAX • Additive • But not linear • Do not apply linear scale • Min/max from a sample is not the worst idea, either • If the first lattice level is calculated from the db, the second can be predicted with a few tricks
  • 72. Produce TOP5 CITY,CATEGORY by MAX(TOTAL) without sending the query?
  • 73. At least 50% can be calculated
  • 74. The remaining data can be filled in through • An extra query • Additional data carried with the lattice: along with the max value, save the row • Even without them, you have the structure and values, so you can start drawing the results while the query is running • Another form of sharpening
  • 75. Metric: DISTINCT_COUNT • Not additive • Not linear • We can get a rough estimate by using min/max scale or by applying curve fitting to samples, similar to what we did when predicting the lattice
  • 76. For other metrics or custom calculations, we can still find workarounds
  • 77. If we can predict the data we can use it in Sharpening™
  • 78. (Better) Sharpening • Instantly fast • No queries to the DB • Not limited to time fields • No requirement for index/partition support • Can still be combined with microquerying!
  • 79. Well, if you can return data without sending the queries...
  • 80. Offline Mode • Explore the data without generating DB load • Find the right questions before asking the DB • No need for a separate "RUN" button • No need for cancellation support if you sent the wrong query • You can (and should) always fall back to the database
  • 82. Data Dependencies • Data is not just rows and columns • There are hidden connections between cells, rows, and columns • We can identify them to make software easier • They are not 100% accurate, since there are always exceptions, but with a clever approach...
  • 83. DD1: High Cardinality • High-cardinality fields have a huge number of distinct values • Define what "high" means for your application • If the distinct count on a sample equals the sample size, it's probably a high-cardinality field (some sort of id) • High-cardinality fields have lower importance in group bys and defaults than regular fields; you can also drop them from pregens and the lattice, since they hold too much data
  • 84. DD2: One-to-one fields • One-to-one fields are pairs with related data • STORE_ID and STORE_NAME • Having one field, you can infer the other and vice versa • If two fields have the same distinct count on a sample, they are probably one-to-one fields • NAME has more importance for the UI than its ID brother • You can also drop the ID part from the lattice and maintain only NAME with a mapping
  • 85. DD3: Child-Parent fields • Child-parent fields are similar to one-to-one fields, but you can go only from child to parent, not vice versa • CITY and COUNTRY (child-parent) • CITY is more important than COUNTRY (having CITY you can guess the COUNTRY) • To find child-parent fields, explore the values in a sample • More optimization for the lattice and for UI recommendations
  • 86. DD4: Empty Columns • Sounds easy, but currently empty columns are treated the same way as others • We can easily drop them from the lattice to reduce its size • We can remove them from the UI
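The four data dependencies (DD1 to DD4) can be checked on a sample with a few small predicates. This is a sketch under assumptions: the function names and the sample rows are invented, and the one-to-one check verifies the actual value mapping in both directions, which is stricter than the distinct-count heuristic mentioned on the slide.

```python
def distinct(rows, col):
    """Distinct non-null values of a column in the sample."""
    return {row[col] for row in rows if row[col] is not None}


def is_empty(rows, col):
    """DD4: the column holds no values in the sample."""
    return not distinct(rows, col)


def is_high_cardinality(rows, col):
    """DD1: distinct count equals sample size (probably some sort of id)."""
    return len(rows) > 0 and len(distinct(rows, col)) == len(rows)


def determines(rows, source, target):
    """True if every source value maps to exactly one target value."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[source], row[target]) != row[target]:
            return False
    return True


def is_one_to_one(rows, a, b):
    """DD2: each field can be inferred from the other."""
    return determines(rows, a, b) and determines(rows, b, a)


def is_child_parent(rows, child, parent):
    """DD3: child determines parent, but not the other way around."""
    return determines(rows, child, parent) and not determines(rows, parent, child)


# Invented sample rows, for illustration only.
rows = [
    {"order_id": 101, "store_id": 1, "store_name": "North", "city": "Kyiv", "country": "UA", "notes": None},
    {"order_id": 102, "store_id": 2, "store_name": "South", "city": "Lviv", "country": "UA", "notes": None},
    {"order_id": 103, "store_id": 3, "store_name": "West", "city": "Reston", "country": "US", "notes": None},
    {"order_id": 104, "store_id": 1, "store_name": "North", "city": "Kyiv", "country": "UA", "notes": None},
]
print(is_high_cardinality(rows, "order_id"), is_one_to_one(rows, "store_id", "store_name"))
print(is_child_parent(rows, "city", "country"), is_empty(rows, "notes"))
```

As the slides note, such rules are heuristics on a sample and can be wrong on the full table.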
  • 87. DD?: Define your own Data Dependency • Explain the dependency? • Rule to identify the data dependency? • Where can the knowledge be applied?
  • 88. What might the future of Zoomdata look like?
  • 89. Zoomdata Applications • Pregen management • Sharpening with no limits • Offline Mode • Data Dependencies • UX Recommendations
  • 91. Hope for the best, prepare for the worst — Quan Luu
  • 92. Appendix 1: Review of algorithms for Knapsack Problem
  • 93. Exact methods • Bruteforce • Bound & Branch • Dynamic Programming
  • 94. Bruteforce • KSBruteforceSolver.java • Label each object either 1 (take) or 0 (do not take) • Build all possible configurations • Select the best one • Complexity O(2^n), where k is the number of columns and n = 2^k is the lattice size • For k columns there are therefore 2^(2^k) possible combinations
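A minimal brute-force sketch in Python, not a port of KSBruteforceSolver.java. Items are assumed to be (value, weight) pairs, and the function name and example numbers are illustrative.

```python
from itertools import product


def knapsack_bruteforce(items, capacity):
    """Try every 0/1 labelling of the items and keep the best feasible one.

    items: list of (value, weight) pairs.
    Returns (best_value, indices of the chosen items).
    """
    best_value, best_labels = 0, ()
    for labels in product((0, 1), repeat=len(items)):
        weight = sum(w for (_, w), take in zip(items, labels) if take)
        value = sum(v for (v, _), take in zip(items, labels) if take)
        if weight <= capacity and value > best_value:
            best_value, best_labels = value, labels
    return best_value, [i for i, take in enumerate(best_labels) if take]


items = [(60, 10), (100, 20), (120, 30)]
best_value, chosen = knapsack_bruteforce(items, capacity=50)
print(best_value, chosen)
```

The loop visits 2^n labellings for n items, which is why this only works for very small lattices.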
  • 95. Bound & Branch • KSBoundBranchSolver.java • Explore the search space depth-first, computing an optimistic estimate for each node to be explored; this lets you prune nodes whose optimistic estimate is worse than the best solution found so far • In most cases faster than bruteforce, but the worst-case complexity is still exponential
  • 96. Dynamic Programming • KSDynamicProgrammingSolver.java • Exploit the classic DP idea: solve smaller problems first, then reuse the results in bigger problems • For some cases extra fast (pseudo-polynomial complexity) • A lot of constraints (weights must be integers, the weight limit should be relatively small)
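The classic 0-1 knapsack dynamic program can be sketched like this. It is the textbook table-based version, not a port of KSDynamicProgrammingSolver.java; the function name and example numbers are illustrative, and weights and capacity are assumed to be non-negative integers.

```python
def knapsack_dp(items, capacity):
    """0-1 knapsack by dynamic programming.

    items: list of (value, weight) pairs with integer weights.
    Returns (best_value, indices of the chosen items).
    Time and memory are O(len(items) * capacity), i.e. pseudo-polynomial.
    """
    n = len(items)
    # best[i][cap] = best value using only the first i items within weight cap.
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i, (value, weight) in enumerate(items, start=1):
        for cap in range(capacity + 1):
            best[i][cap] = best[i - 1][cap]  # skip item i-1
            if weight <= cap:                # or take it
                best[i][cap] = max(best[i][cap], best[i - 1][cap - weight] + value)
    # Walk back through the table to recover which items were taken.
    chosen, cap = [], capacity
    for i in range(n, 0, -1):
        if best[i][cap] != best[i - 1][cap]:
            chosen.append(i - 1)
            cap -= items[i - 1][1]
    return best[n][capacity], sorted(chosen)


items = [(60, 10), (100, 20), (120, 30)]
best_value, chosen = knapsack_dp(items, capacity=50)
print(best_value, chosen)
```

The table has one column per unit of weight, which is why large or non-integer weights need to be rounded first.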
  • 97. Approximate methods • Greedy Algorithm • Genetic Algorithm
  • 98. Greedy Algorithm • KSGreedySolver.java • Order all items by "value" (not to be confused with the knapsack value) and take them until you reach the weight constraint • Strategies: VALUABLE, LIGHTER, DENSITY (value/weight) • Complexity O(n log n) for the sort, could be even faster • Practical approach when the constraints are large values (!)
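A sketch of the greedy approach with the three strategies named on the slide. It is not a port of KSGreedySolver.java: the ranking keys are assumptions about what each strategy name means, weights are assumed positive, and this version skips an item that does not fit and keeps scanning rather than stopping at the first miss.

```python
# Assumed meaning of each strategy: the key by which items are ranked (higher is better).
STRATEGIES = {
    "VALUABLE": lambda value, weight: value,          # most valuable first
    "LIGHTER": lambda value, weight: -weight,         # lightest first
    "DENSITY": lambda value, weight: value / weight,  # best value per unit of weight first
}


def knapsack_greedy(items, capacity, strategy="DENSITY"):
    """Rank items by the chosen strategy and take each one that still fits.

    items: list of (value, weight) pairs with positive weights.
    Returns (total_value, indices of the chosen items).
    """
    key = STRATEGIES[strategy]
    order = sorted(range(len(items)), key=lambda i: key(*items[i]), reverse=True)
    chosen, total_value, remaining = [], 0, capacity
    for i in order:
        value, weight = items[i]
        if weight <= remaining:
            chosen.append(i)
            total_value += value
            remaining -= weight
    return total_value, sorted(chosen)


items = [(60, 10), (100, 20), (120, 30)]
for name in STRATEGIES:
    print(name, knapsack_greedy(items, capacity=50, strategy=name))
```

On this tiny example the strategies disagree, which shows that greedy is fast but gives no optimality guarantee.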
  • 99. Genetic Algorithm • KSGeneticSolver.java • A general-purpose heuristic that can be applied to almost any optimization problem • Init: generate a random population (a set of solutions) • Selection: choose the best solutions • Crossover: produce new solutions by breeding the previous best solutions • Mutation (optional): randomly change the new-born solutions • Generation step: if the new population is better than the previous one, replace it and move to selection again
  • 100. Thanks