Design of Experiments on Federator Polystore Architecture

Luiz Henrique Zambom Santana

Agenda
● Introduction
● Federator Polystore
● Experiment Design
● Data and experiment
● Conclusions
● References

“No one size fits all”
● “A panoply of data models, and they typically operate on flexible storage
formats such as JSON” [1]
● “Increasingly, we see applications that deploy multiple engines, resulting in a
need to join data across systems.” [1]
● “Increasingly, desktop and mobile applications are using the cloud
infrastructure to take advantage of the high-availability and scalability
characteristics. In the past, these type of systems used local databases to
store information and application state. There are many new applications that
share some or all their data with applications running on other hosts or in the
cloud and use these data stores for persistence.” [2]

Federator Polystore
[Architecture diagram. Labels: Application 1, Application 2, Application N; API; Federator; Canonical Model; Rendezvous; NoSQL (two stores); RDF + SPARQL]
Must be ROBUST enough to deliver a response in <1 sec.

Very complex architecture
{
  "user": {
    "username": "luiz",
    "password": "luiz",
    "address": "lagoa"
  }
}
{
  "post": {
    "body": "hello",
    "author": "12345"
  }
}
Execution Flow
[Diagram. Component labels: Client; Federator; MongoDB; Cassandra; Insert Endpoint; Mappings; Planner; Plan; Access 1 .. N; Transaction management; Entity Manager; Cache (graph); Persistent HashMap; Dictionary; Reader]
[Diagram. Data labels: {"address":"lagoa"}; {"body":"hello"}; INSERT INTO user..; INSERT INTO post...; {12345}; {123456}; {12345} + Meta; {12345} + Entire document]

Experiment Design
● Measure the response time
● Factors and levels
○ Architecture
■ Local
■ Mixed
■ Fully distributed
○ Protocol
■ HTTP
■ HTTPS
■ Thrift
○ Mapping
■ Social Network (graph)
■ E-commerce (key/value)
■ Bank (columnar)
○ Cache
■ Bloom filter
■ Hash
■ None
Robust to the
mapping
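The factor list above can be enumerated as a full factorial design. The following is a minimal sketch, not the author's tooling: the level names are borrowed from the R design shown later in the deck, and the helper `full_factorial` is a hypothetical name introduced only for illustration.

```python
from itertools import product

# Factor levels as listed on the slide; spellings follow the level
# names used in the R design later in the deck.
FACTORS = {
    "architecture": ["local", "mixed", "distributed"],
    "protocol": ["http", "https", "thrifty"],
    "mapping": ["socialnetwork", "ecommerce", "bank"],
    "cache": ["bloom", "hash", "none"],
}


def full_factorial(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


runs = full_factorial(FACTORS)
print(len(runs))  # 3 * 3 * 3 * 3 = 81 treatment combinations
print(runs[0])
```

Four factors at three levels each give 3^4 = 81 treatment combinations, which is the number of experiments per block.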

Experiment Design
● 3^4 full factorial experiment
with 100 replications
○ Four factors
○ Three levels each
● Goals:
○ The architecture must be robust enough to deliver a response in <1 second in both
installations
○ The architecture must be robust to mapping variation, because one of the most important
results of my doctoral work is to have a real-time response.
● Data collection:
○ 100 accesses in each of 4 blocks (insert, update, query, and delete), with a 100 ms delay
between accesses
○ Each block has 81 experiments, i.e. 81 × 100 = 8100 queries
○ Data generated synthetically, based on LUBM
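The data collection procedure above could be scripted roughly as follows. This is a sketch under stated assumptions, not the harness in the author's repository: `measure_block`, `meets_goal`, and the `access` callable are hypothetical names, and reading the <1 second goal as a bound on the slowest access is an assumption (the slides do not say whether the goal applies to the mean or the maximum).

```python
import time


def measure_block(access, n_accesses=100, delay_s=0.1):
    """Call `access` n_accesses times, pausing delay_s between calls.

    Returns the list of response times in seconds.
    """
    timings = []
    for i in range(n_accesses):
        start = time.perf_counter()
        access()  # hypothetical stand-in for one insert/update/query/delete
        timings.append(time.perf_counter() - start)
        if i < n_accesses - 1:
            time.sleep(delay_s)  # 100 ms between accesses by default
    return timings


def meets_goal(timings, limit_s=1.0):
    """Assumption: the goal holds when the slowest access is under limit_s."""
    return max(timings) < limit_s


# Arithmetic from the slide: 81 experiments x 100 accesses per block.
queries_per_block = 81 * 100
print(queries_per_block)  # 8100

# Tiny demonstration with a no-op access and no delay.
demo = measure_block(lambda: None, n_accesses=3, delay_s=0.0)
print(len(demo), meets_goal(demo))
```

In a real run, `access` would issue one operation against the Federator endpoint for the treatment combination under test.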

Experiment execution
https://github.com/lhzsantana/federator

Hypotheses
● A, B, C and D
● H0: the response time is the same for all installations
● ...
● ABCD
○ H0: changing the factors has the same effect in all installations

Creating the design in R
● Linear model
○ > library(DoE.base)  # provides fac.design and add.response
○ > Design.1 <- fac.design(nfactors=4, replications=100, repeat.only=FALSE, blocks=3, randomize=TRUE,
seed=29063, nlevels=c(3,3,3,3), factor.names=list(architecture=c("local","mixed","distributed"),
protocol=c("http","https","thrifty"), mapping=c("socialnetwork","ecommerce","bank"), cache=c("bloom","hash","none")))
● One of the blocks was removed because of the error “Too few factors with even number of factor levels for this
number of blocks”; “update” was chosen for removal because it can be implemented with the other operations

Creating the design in R
● > Design.1 <- add.response(Design.1,data, replace=FALSE)
● > LinearModel.1 <- lm(response ~ Blocks + (architecture + protocol + mapping + cache)^2, data=Design.1)
● > anova(LinearModel.1)
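In an R formula, `(architecture + protocol + mapping + cache)^2` expands to the four main effects plus every two-factor interaction. The sketch below spells out that expansion; `expand_second_order` is a hypothetical helper written for illustration and is not part of R or of the author's code.

```python
from itertools import combinations


def expand_second_order(factors, block_term="Blocks"):
    """List the terms of: response ~ Blocks + (f1 + f2 + ...)^2."""
    main_effects = list(factors)
    # Every unordered pair of distinct factors gives one two-way interaction.
    interactions = [f"{a}:{b}" for a, b in combinations(factors, 2)]
    return [block_term] + main_effects + interactions


terms = expand_second_order(["architecture", "protocol", "mapping", "cache"])
for term in terms:
    print(term)
print(len(terms))  # 1 block term + 4 main effects + 6 interactions = 11
```

Three-way and four-way interactions are not part of this model, so their variation is absorbed into the residual term.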

Plot of means
● The more distributed the architecture, the better Thrift performs
● Hash is better for local, and the Bloom filter for distributed
● Bloom and Thrift seem to be the best mix
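A plot of means for two factors shows the average response in each cell of their cross-tabulation. The sketch below shows one way to compute those cell means; `cell_means` is a hypothetical helper, and the rows are placeholder values chosen only to exercise the function, not measurements from the experiment.

```python
from collections import defaultdict
from statistics import mean


def cell_means(rows, factor_a, factor_b, response="response"):
    """Average the response for each (factor_a level, factor_b level) pair."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row[factor_a], row[factor_b])].append(row[response])
    return {cell: mean(values) for cell, values in groups.items()}


# Placeholder rows, NOT data from the experiment.
rows = [
    {"architecture": "local", "cache": "hash", "response": 1.0},
    {"architecture": "local", "cache": "hash", "response": 3.0},
    {"architecture": "distributed", "cache": "bloom", "response": 2.0},
]

means = cell_means(rows, "architecture", "cache")
for cell, value in sorted(means.items()):
    print(cell, value)
```

Non-parallel lines in such a plot suggest an interaction between the two factors, which is what the statements above describe.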

Conclusions
● A fractional factorial design is not needed, because there is virtually no limit on the
number of replications
● There are still unmapped sources of variability (internal database caching,
network, testing machine)
● Delete is either not affected by the factors or not very well implemented
● The query average is very high, and its variability is unacceptable
● Local + Thrift + HashMap makes the architecture use too much
memory
● Somewhat frustrating...
● The only good news: the mapping was not a relevant factor

Future work
● There is still a lot of work :)
● Limitations of the test:
○ Mixed queries not tested yet (another block is necessary for real-world testing)
○ Fully distributed and parallel architecture is very expensive to test
● When I solve the “small” architecture variation, I will begin the more expensive
tests (fully distributed)
● Real-world example
● Finally, I will repeat the tests on the Semantic Layer and compare how the
architecture behaves

References
1. Duggan, Jennie, et al. "The BigDAWG Polystore System." ACM SIGMOD
Record 44.2 (2015): 11-16.
2. Dey, Akon, Alan Fekete, and Uwe Röhm. "Scalable transactions across
heterogeneous NoSQL key-value data stores." Proceedings of the VLDB
Endowment 6.12 (2013): 1434-1439.
3. Berners-Lee, Tim, James Hendler, and Ora Lassila. "The semantic web."
Scientific American 284.5 (2001): 28-37.
4. Montgomery, Douglas C. Design and analysis
of experiments. Vol. 7. New York: Wiley, 1984.
5. Barbetta, Pedro Alberto, Marcelo Menezes Reis, and Antonio Cezar Bornia.
Estatística: para cursos de engenharia e informática. Vol. 3. São Paulo: Atlas,
2004.
