3. Factorized Databases by Example
Orders
customer day pizza
Mario Monday Capricciosa
Mario Friday Capricciosa
Pietro Friday Hawaii
Lucia Friday Hawaii
Pizzas
pizza item
Capricciosa base
Capricciosa ham
Capricciosa mushrooms
Hawaii base
Hawaii ham
Hawaii pineapple
Items
item price
base 6
ham 1
mushrooms 1
pineapple 2
Consider the natural join of the three relations above:
Orders ⋈ Pizzas ⋈ Items
customer day pizza item price
Mario Monday Capricciosa base 6
Mario Monday Capricciosa ham 1
Mario Monday Capricciosa mushrooms 1
Mario Friday Capricciosa base 6
Mario Friday Capricciosa ham 1
Mario Friday Capricciosa mushrooms 1
... ... ... ... ...
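The join result can be reproduced with a minimal sketch like the one below. The list-of-dicts encoding and the helper name `natural_join` are illustrative assumptions, not part of the original slides; the data is copied from the three relations above.

```python
# Relations as lists of dicts (illustrative encoding; data from the tables above).
orders = [
    {"customer": "Mario", "day": "Monday", "pizza": "Capricciosa"},
    {"customer": "Mario", "day": "Friday", "pizza": "Capricciosa"},
    {"customer": "Pietro", "day": "Friday", "pizza": "Hawaii"},
    {"customer": "Lucia", "day": "Friday", "pizza": "Hawaii"},
]
pizzas = [
    {"pizza": p, "item": i}
    for p, i in [
        ("Capricciosa", "base"), ("Capricciosa", "ham"), ("Capricciosa", "mushrooms"),
        ("Hawaii", "base"), ("Hawaii", "ham"), ("Hawaii", "pineapple"),
    ]
]
items = [
    {"item": i, "price": pr}
    for i, pr in [("base", 6), ("ham", 1), ("mushrooms", 1), ("pineapple", 2)]
]


def natural_join(r, s):
    """Nested-loop natural join: pair up tuples that agree on all shared attributes."""
    out = []
    for t in r:
        for u in s:
            shared = t.keys() & u.keys()
            if all(t[a] == u[a] for a in shared):
                out.append({**t, **u})
    return out


result = natural_join(natural_join(orders, pizzas), items)
print(len(result))       # 12 tuples
print(len(result) * 5)   # 60 values in the flat result
```

Each of the 4 orders pairs with 3 items of its pizza, so the flat result holds 12 tuples of 5 values each.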
4. Factorized Databases by Example
A flat relational algebra expression encoding the above query result is:
⟨Mario⟩ × ⟨Monday⟩ × ⟨Capricciosa⟩ × ⟨base⟩ × ⟨6⟩ ∪
⟨Mario⟩ × ⟨Monday⟩ × ⟨Capricciosa⟩ × ⟨ham⟩ × ⟨1⟩ ∪
⟨Mario⟩ × ⟨Monday⟩ × ⟨Capricciosa⟩ × ⟨mushrooms⟩ × ⟨1⟩ ∪
⟨Mario⟩ × ⟨Friday⟩ × ⟨Capricciosa⟩ × ⟨base⟩ × ⟨6⟩ ∪
⟨Mario⟩ × ⟨Friday⟩ × ⟨Capricciosa⟩ × ⟨ham⟩ × ⟨1⟩ ∪
⟨Mario⟩ × ⟨Friday⟩ × ⟨Capricciosa⟩ × ⟨mushrooms⟩ × ⟨1⟩ ∪ ...
It uses relational product (×), union (∪), and singleton relations (e.g., ⟨1⟩).
The attribute names are not shown to avoid clutter.
5. Factorized Databases by Example
The previous relational expression entails lots of redundancy due to the joins:
⟨Mario⟩ × ⟨Monday⟩ × ⟨Capricciosa⟩ × ⟨base⟩ × ⟨6⟩ ∪
⟨Mario⟩ × ⟨Monday⟩ × ⟨Capricciosa⟩ × ⟨ham⟩ × ⟨1⟩ ∪
⟨Mario⟩ × ⟨Monday⟩ × ⟨Capricciosa⟩ × ⟨mushrooms⟩ × ⟨1⟩ ∪
⟨Mario⟩ × ⟨Friday⟩ × ⟨Capricciosa⟩ × ⟨base⟩ × ⟨6⟩ ∪
⟨Mario⟩ × ⟨Friday⟩ × ⟨Capricciosa⟩ × ⟨ham⟩ × ⟨1⟩ ∪
⟨Mario⟩ × ⟨Friday⟩ × ⟨Capricciosa⟩ × ⟨mushrooms⟩ × ⟨1⟩ ∪ ...
We can factorize the expression following the join structure, e.g.:
⟨Capricciosa⟩ × (⟨Monday⟩ × ⟨Mario⟩ ∪ ⟨Friday⟩ × ⟨Mario⟩)
  × (⟨base⟩ × ⟨6⟩ ∪ ⟨ham⟩ × ⟨1⟩ ∪ ⟨mushrooms⟩ × ⟨1⟩)
∪ ⟨Hawaii⟩ × ⟨Friday⟩ × (⟨Lucia⟩ ∪ ⟨Pietro⟩)
  × (⟨base⟩ × ⟨6⟩ ∪ ⟨ham⟩ × ⟨1⟩ ∪ ⟨pineapple⟩ × ⟨2⟩)
[Figure: nesting structure of the factorization over the attributes pizza, day, customer, item, price]
There are several algebraically equivalent factorized representations defined by
distributivity of product over union and commutativity of product and union.
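To make the saving concrete, the factorized expression above can be encoded as a small expression tree and compared with its flat form. This is a minimal sketch: the tagged-tuple encoding and the helper names `val`, `prod`, `union`, `tuples`, and `size` are assumptions made for illustration, not the representation used by any particular engine. Tuples come out in the order (pizza, day, customer, item, price), following the factorization rather than the table header.

```python
from itertools import product


def val(v):
    return ("val", v)        # singleton relation, e.g. <Mario>


def prod(*xs):
    return ("prod", xs)      # relational product


def union(*xs):
    return ("union", xs)     # union


def tuples(e):
    """Enumerate the flat tuples encoded by expression e."""
    kind, body = e
    if kind == "val":
        return [(body,)]
    if kind == "union":
        return [t for child in body for t in tuples(child)]
    # product: concatenate one tuple from each factor
    return [sum(combo, ()) for combo in product(*(tuples(c) for c in body))]


def size(e):
    """Number of singletons (data values) stored in the expression."""
    kind, body = e
    return 1 if kind == "val" else sum(size(c) for c in body)


factorized = union(
    prod(val("Capricciosa"),
         union(prod(val("Monday"), val("Mario")), prod(val("Friday"), val("Mario"))),
         union(prod(val("base"), val(6)), prod(val("ham"), val(1)),
               prod(val("mushrooms"), val(1)))),
    prod(val("Hawaii"), val("Friday"),
         union(val("Lucia"), val("Pietro")),
         union(prod(val("base"), val(6)), prod(val("ham"), val(1)),
               prod(val("pineapple"), val(2)))),
)

# The flat representation: one product of singletons per tuple.
flat = union(*(prod(*(val(v) for v in t)) for t in tuples(factorized)))

print(len(tuples(factorized)), size(factorized), size(flat))  # 12 21 60
```

Both expressions encode the same 12 tuples, but the factorized one stores 21 values where the flat one stores 60.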
7. Key Properties of Factorized Representations
Factorized representations of results for queries with select, project, join,
aggregate, group-by, and order-by operators:
Very high compression rate
  - Can be exponentially more succinct than the relations they encode
  - Arbitrarily better than generic compression schemes, e.g., bzip2
  - Factorized representations whose sizes meet asymptotically tight bounds
    can be computed directly from the input database and query
Querying in the compressed domain
  - Factorizations are relational expressions and can be composed with queries
  - We developed the FDB in-memory query engine for this purpose
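The two properties above can be illustrated with a small sketch. A product of n two-element unions stores 2n values but encodes 2^n tuples, and the tuple count can be computed on the factorization itself without enumerating anything. The tagged-tuple encoding and the helpers `val`, `prod`, `union`, `size`, and `count` are illustrative assumptions, not the FDB engine's API; `count` additionally assumes that unions are over disjoint relations and products are over disjoint attribute sets.

```python
def val(v):
    return ("val", v)        # singleton relation


def prod(*xs):
    return ("prod", xs)      # relational product


def union(*xs):
    return ("union", xs)     # union


def size(e):
    """Number of singletons stored in the expression."""
    kind, body = e
    return 1 if kind == "val" else sum(size(c) for c in body)


def count(e):
    """Number of encoded tuples, computed without enumerating them.

    Assumes disjoint unions and products over disjoint attributes.
    """
    kind, body = e
    if kind == "val":
        return 1
    if kind == "union":
        return sum(count(c) for c in body)
    n = 1
    for c in body:
        n *= count(c)
    return n


n = 20
# (<0> ∪ <1>) × (<0> ∪ <1>) × ... × (<0> ∪ <1>), n factors
e = prod(*(union(val(0), val(1)) for _ in range(n)))

print(size(e))        # 40 values in the factorization
print(count(e))       # 1048576 tuples encoded
print(n * count(e))   # 20971520 values in the flat relation
```

The aggregate is evaluated in time linear in the size of the factorization, which is the sense in which queries run in the compressed domain.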
8. Current Focus
Reduce communication cost in distributed database systems
Factorization of temporary query results exchanged between nodes
Many systems already employ limited factorizations
Google MegaStore and F1, FoundationDB, Microsoft Cloud SQL Server
Google Faculty Research Award
Reduce space requirements of large-scale feature vectors in predictive modelling
Feature vectors = relations with high cardinality
Improvements of 10-100x on LogicBlox client data
11. Probabilistic Data is Commonplace
Facts of life:
Real-world data is often uncertain
Current probabilistic databases are on the order of billions of records
Generated from web data by NELL, Google Squared, Knowledge Vault
Curating before processing is a time and money black hole
We would like to query uncertain data ASAP!
MayBMS/SPROUT probabilistic database system
Open-source, built on top of PostgreSQL
3000+ downloads (as of Dec 2013)
The PDB most benchmarked against
SPROUT2 = SPROUT on Google Squared
Caught the interest of UK Defence Science and Technology Lab
16. Unified and declarative programming model for the enterprise tech stack
Can freely mix transactions, analytics, graph queries, mathematical
programming and optimization, probabilistic programming
Makes possible new classes of hybrid applications
Typical app in retail sector:
  - 50K Datalog++ LOC (vs. millions of C++ LOC)
  - One system (vs. tens)
17. Live Programming in the Database
Flexible spreadsheet backed by scalable full-fledged DBMS
Users can define formulas or change schema
  - Triggers addition/deletion of Datalog code on the DB server
[Figure: diagram with labels program, edbs, execution graph, revised execution graph, idbs, revised idbs, (meta-data), (actual data)]
19. Our Approach
Use declarative programming to improve
the implementation of declarative systems!
Internal library for declarative and incremental maintenance of program
state, using a small Datalog engine.
  - In the LogicBlox engine since May 2014