Design of Experiments on
Federator Polystore Architecture
Luiz Henrique Zambom Santana
Agenda
● Introduction
● Federator Polystore
● Experiment Design
● Data and experiment
● Conclusions
● References
“No one size fits all”
● “A panoply of data models, and they typically operate on flexible storage
formats such as JSON” [1]
● “Increasingly, we see applications that deploy multiple engines, resulting in a
need to join data across systems.” [1]
● “Increasingly, desktop and mobile applications are using the cloud
infrastructure to take advantage of the high-availability and scalability
characteristics. In the past, these type of systems used local databases to
store information and application state. There are many new applications that
share some or all their data with applications running on other hosts or in the
cloud and use these data stores for persistence.” [2]
Federator Polystore
[Architecture diagram. Labels: Application 1, Application 2, ..., Application N; API; Federator; Canonical Model; Rendezvous; NoSQL; NoSQL; RDF + SPARQL]
Must be ROBUST to deliver a response in <1 sec.
Very complex architecture
{
  "user": {
    "username": "luiz",
    "password": "luiz",
    "address": "lagoa"
  }
}
{
  "post": {
    "body": "hello",
    "author": "12345"
  }
}
[Execution flow diagram. Labels: Client; Federator; MongoDB; Cassandra; Insert Endpoint; Mappings; Planner; Plan; Access 1 .. N; Transaction management; Entity Manager; Cache (graph); Persistent HashMap; Dictionary Reader]
[Data shown along the flow: {"address":"lagoa"} and {"body":"hello"}; INSERT INTO user.. and INSERT INTO post...; {12345} and {123456}; {12345} + Meta; {12345} + Entire document]
Execution Flow
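A minimal sketch of the insert path suggested by the diagram labels, assuming the planner splits each incoming document field by field according to the mappings and sends each fragment to its store. The mapping table, the `plan_insert` function, and the choice of which field goes to which store are hypothetical illustrations, not the Federator's actual code.

```python
from collections import defaultdict

# Hypothetical mapping: (entity, field) -> target store.
# The real mappings used by the Federator are not shown in the slides.
MAPPINGS = {
    ("user", "username"): "cassandra",
    ("user", "password"): "cassandra",
    ("user", "address"): "mongodb",
    ("post", "body"): "mongodb",
    ("post", "author"): "cassandra",
}


def plan_insert(document, mappings=MAPPINGS, default_store="mongodb"):
    """Split one {entity: {field: value}} document into per-store fragments."""
    plan = defaultdict(dict)
    for entity, fields in document.items():
        for field, value in fields.items():
            # Unmapped fields fall back to the default store (an assumption).
            store = mappings.get((entity, field), default_store)
            plan[store].setdefault(entity, {})[field] = value
    return dict(plan)


user = {"user": {"username": "luiz", "password": "luiz", "address": "lagoa"}}
for store, fragment in plan_insert(user).items():
    print(store, fragment)
```

Each fragment would then become one access in the plan (for example an INSERT for the Cassandra part and a document write for the MongoDB part), which is where transaction management across the stores is needed.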
Experiment Design
● Measure the response time
● Factors and levels
○ Architecture
■ Local
■ Mixed
■ Fully distributed
○ Protocol
■ HTTP
■ HTTPS
■ Thrifty
○ Mapping
■ Social Network (graph)
■ E-commerce (key/value)
■ Bank (columnar)
○ Cache
■ Bloom filter
■ Hash
■ None
Robust to the
mapping
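The Cache factor compares a Bloom filter, a hash map, and no cache. A minimal sketch of the Bloom filter level, assuming it is used to ask "might this key already be known?" before touching a store. How the Federator actually wires its cache is not shown in the slides; the class name, the sizes, and the hashing scheme are illustrative assumptions.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes positions by salting a SHA-256 digest with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(key))


seen = BloomFilter()
seen.add("12345")
print(seen.might_contain("12345"))  # True: an added key is always reported
```

Unlike a hash map, the filter stores no values and uses a fixed amount of memory, at the cost of occasional false positives.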
Experiment Design
● 3^4 experiment
with 100 replications
○ Four factors
○ Three levels each
● Goals:
○ The architecture must be robust enough to deliver a response in <1 second in both installations
○ The architecture must be robust to mapping variation, because one of the most important
results of my doctoral work is to have a real-time response.
● Data collection:
○ 100 accesses per experiment, in 4 blocks (insert, update, query, and delete), with a 100 ms delay
between accesses
○ Each block has 81 experiments (8100 queries)
○ Data generated synthetically, based on LUBM
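Four factors with three levels each give a full factorial of 3^4 = 81 factor combinations, and 100 replications of each give 8100 accesses per block. A minimal sketch of that enumeration, using the level names from the R design shown in the deck; the variable names are illustrative.

```python
from itertools import product

# Factor levels as named in the deck's R design.
FACTORS = {
    "architecture": ["local", "mixed", "distributed"],
    "protocol": ["http", "https", "thrifty"],
    "mapping": ["socialnetwork", "ecommerce", "bank"],
    "cache": ["bloom", "hash", "none"],
}
REPLICATIONS = 100

# Full factorial: every combination of one level per factor.
runs = [dict(zip(FACTORS, levels)) for levels in product(*FACTORS.values())]

print(len(runs))                 # 81 = 3**4 experiments
print(len(runs) * REPLICATIONS)  # 8100 accesses per block
print(runs[0])
```

In a real run the order of the 81 × 100 accesses would be randomized, as the `randomize` option does in the R design.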
Experiment execution
https://github.com/lhzsantana/federator
Hypotheses
● A, B, C and D
● H0: the response time is the same for all installations
● ...
● ABCD
○ H0: changing the factors has the same effect in all installations
Creating the Design in R
Creating the design in R
● Linear model
○ > library(DoE.base)  # provides fac.design() and add.response()
  > Design.1 <- fac.design(nfactors = 4, replications = 100, repeat.only = FALSE, blocks = 3,
  +     randomize = TRUE, seed = 29063, nlevels = c(3, 3, 3, 3),
  +     factor.names = list(architecture = c("local", "mixed", "distributed"),
  +         protocol = c("http", "https", "thrifty"),
  +         mapping = c("socialnetwork", "ecommerce", "bank"),
  +         cache = c("bloom", "hash", "none")))
● Removed one of the blocks because of the error “Too few factors with even number of factor levels for this number of
blocks”; chose to remove “update” because it can be implemented with the other operations
Data distribution
Creating the design in R
● > Design.1 <- add.response(Design.1,data, replace=FALSE)
● > LinearModel.1 <- lm(response ~ Blocks + (architecture + protocol + mapping + cache)^2, data=Design.1)
● > anova(LinearModel.1)
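In an R formula, `(architecture + protocol + mapping + cache)^2` expands to the four main effects plus all two-way interactions; three-way and four-way terms such as ABCD are not part of this model. A small sketch that lists the resulting terms and their degrees of freedom for three-level factors (2 per main effect, 2 × 2 = 4 per two-way interaction). The variable names are illustrative, and the Blocks term is listed without degrees of freedom since that depends on the number of blocks kept.

```python
from itertools import combinations

factors = ["architecture", "protocol", "mapping", "cache"]
levels = 3

# Main effects and all two-way interactions, as produced by (a + b + c + d)^2.
main_effects = list(factors)
interactions = [f"{a}:{b}" for a, b in combinations(factors, 2)]
terms = ["Blocks"] + main_effects + interactions

# Degrees of freedom for the factor terms only.
df = {name: levels - 1 for name in main_effects}
df.update({name: (levels - 1) ** 2 for name in interactions})

for term in terms:
    print(term, df.get(term, "(depends on the number of blocks)"))
print(len(interactions), sum(df.values()))
```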
Plot of means
● The more distributed the architecture, the better Thrifty performs
● Hash is better for local, and Bloom filter for distributed
● Bloom and Thrifty seem to be the best mix
Conclusions
● Not a fractional factorial problem, because there is virtually no limit on the
number of replications
● There are still unmapped sources of variability (internal database caching,
network, testing machine)
● Delete is not affected by the factors, or it is not very well implemented
● The average query time is very high, and its variability is unacceptable
● Local + Thrifty + HashMap makes the architecture use too much
memory
● Somewhat frustrating...
● The only good news: the mapping was not a relevant factor
Future work
● There is still a lot of work :)
● Limitations of the test:
○ Mixed queries not tested yet (another block is necessary for real-world testing)
○ Fully distributed and parallel architecture is very expensive to test
● When I solve the “small” architecture variation, I will begin the more expensive
tests (fully distributed)
● Real-world example
● Finally, I will repeat the tests on the Semantic Layer and compare how the
architecture behaves
References
1. Duggan, Jennie, et al. "The BigDAWG Polystore System." ACM SIGMOD
Record 44.2 (2015): 11-16.
2. Dey, Akon, Alan Fekete, and Uwe Röhm. "Scalable transactions across
heterogeneous NoSQL key-value data stores." Proceedings of the VLDB
Endowment 6.12 (2013): 1434-1439.
3. Berners-Lee, Tim, James Hendler, and Ora Lassila. "The semantic web."
Scientific American 284.5 (2001): 28-37.
4. Montgomery, Douglas C. Design and analysis
of experiments. Vol. 7. New York: Wiley, 1984.
5. Barbetta, Pedro Alberto, Marcelo Menezes Reis, and Antonio Cezar Bornia.
Estatística: para cursos de engenharia e informática. Vol. 3. São Paulo: Atlas,
2004.
