IncQuery-D: Incremental Queries in the Cloud

Budapest University of Technology and Economics
Department of Measurement and Information Systems
Fault Tolerant Systems Research Group
Gábor Szárnyas, Benedek Izsó,
István Ráth, Dániel Varró

Overview
 Introduction
 MDE scalability challenges for model queries
 Overview: scaling out in the cloud
 Evaluation: a feasibility study
 Conclusions and future work

Scalability challenges in MDE
 Complex instance models and queries
 Instance model complexity
o Size
o Structure
 Query complexity
o MDE workloads involve much more complex queries
than typical data-driven applications (e.g. model
validation, transformations, …)
 Scalability challenges arise due to their
combination

Model sizes
 Instance models with several million elements
o AUTOSAR models [1]
o Source code models
o Sensor data
Source: Markus Scheidgen, How Big are Models – An Estimation, 2012. [2]
application          model size
software models      0 – 10^9
sensor data          10^9
geo-spatial models   10^9 – 10^12
[1] http://wiki.eclipse.org/Auto_IWG_WP2
[2] http://hwl.hu-berlin.de/fileadmin/user_upload/documents/howbig_techreport.pdf

EMF-IncQuery
 State-of-the-art incremental graph query engine
 Open source Eclipse project by BUTE and others
 Typical use cases
o Validation
o Incremental model transformation
o Model synchronization, view maintenance

Single workstation limitations
 Most tools only work for models with fewer than 1M
elements, due to algorithmic complexity
 The best tools handle fewer than 10M model elements, due
to the JVM’s limitations
o A JVM cannot handle 15+ GB heap memory efficiently
o Long GC pauses
o Specialized JVMs (e.g. Azul Systems’ Zing)
• Commercial, experimental
• May require special hardware
 Proposed solution
o Scale out: distributed system

OVERVIEW OF THE
INCQUERY-D APPROACH

Architecture: EMF-IncQuery
[Figure: EMF-IncQuery architecture. Labels: transaction; in-memory EMF model (in-memory storage); indexer layer (indexing); Rete net.]
Production network
• Stores intermediate query results
• Propagates changes
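The two properties of the production network named above (it stores intermediate query results and propagates changes) can be illustrated with a minimal sketch of a single Rete-style join node. This is not EMF-IncQuery's API: the class `JoinNode`, its methods, and the example element names are hypothetical, and the node runs in one process with set semantics.

```python
from collections import defaultdict


class JoinNode:
    """Joins (a, b) tuples from the left input with (b, c) tuples from the right."""

    def __init__(self):
        self.left = defaultdict(set)   # memory of the left input: b -> {a}
        self.right = defaultdict(set)  # memory of the right input: b -> {c}
        self.results = set()           # stored matches (a, b, c)

    def update_left(self, a, b, added=True):
        """Apply one insertion or deletion on the left input; return the delta."""
        if added:
            self.left[b].add(a)
        else:
            self.left[b].discard(a)
        delta = {(a, b, c) for c in self.right[b]}
        self._apply(delta, added)
        return delta  # a real network would forward this to child nodes

    def update_right(self, b, c, added=True):
        """Apply one insertion or deletion on the right input; return the delta."""
        if added:
            self.right[b].add(c)
        else:
            self.right[b].discard(c)
        delta = {(a, b, c) for a in self.left[b]}
        self._apply(delta, added)
        return delta

    def _apply(self, delta, added):
        # Only the affected matches are touched; nothing is recomputed from scratch.
        if added:
            self.results |= delta
        else:
            self.results -= delta


# Hypothetical example data: a segment attached to a sensor, a sensor on a route.
node = JoinNode()
node.update_left("segment1", "sensor1")
node.update_right("sensor1", "route1")
matches_after_insert = set(node.results)
node.update_left("segment1", "sensor1", added=False)
matches_after_delete = set(node.results)
```

Each update costs time proportional to the number of affected matches rather than to the model size, which is the property the later slides rely on.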

Architecture: IncQuery-D
[Figure: IncQuery-D architecture shown next to EMF-IncQuery. Labels: transaction; DB shards 0–3 on Servers 0–3 (distributed persistent storage); IncQuery-D middleware with indexer layer (distributed indexing, notification); Rete net.]
Distributed production network
• Each intermediate node can be allocated
to a different host
• Remote internode communication

Rete net
 Asynchronous communication
 Consistency guaranteed by a termination protocol
[Figure: Rete net with four indexer nodes and a production node, on top of DB shards 0–3.]
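The slides do not describe the termination protocol itself, so the following is only a sketch of one simple scheme under stated assumptions: the network is acyclic, every node acknowledges a message only after forwarding its outputs, and the coordinator waits for all acknowledgements in topological order before reading results. Python's `asyncio` stands in for the asynchronous messaging layer; `node`, `run_network`, and the toy transformations are hypothetical.

```python
import asyncio


async def node(inbox, outbox, transform):
    """Consume change messages; forward the outputs, then acknowledge."""
    while True:
        msg = await inbox.get()
        for out in transform(msg):
            await outbox.put(out)  # forward before acknowledging
        inbox.task_done()          # acknowledge the processed message


async def run_network(changes):
    q_index, q_prod, q_out = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    workers = [
        # Toy "indexer": emits two tuples per change.
        asyncio.create_task(node(q_index, q_prod, lambda m: [m, m + 100])),
        # Toy "production" node: emits one result per tuple.
        asyncio.create_task(node(q_prod, q_out, lambda m: [m * 2])),
    ]
    for change in changes:
        await q_index.put(change)
    # Termination check: wait, in topological order, until every message
    # sent so far has been acknowledged. Only then read the result set.
    await q_index.join()
    await q_prod.join()
    for worker in workers:
        worker.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return sorted(q_out.get_nowait() for _ in range(q_out.qsize()))


results = asyncio.run(run_network([1, 2]))
```

Because a node acknowledges only after its outputs are enqueued downstream, an empty upstream queue implies that all resulting work is already counted in the next queue, so reading results after the last `join()` sees a consistent state.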

IncQuery-D
 Scaling out by…
o Sharding the data
o Sharding the pattern matcher network →
Avoid memory bottleneck
 Further advantages
o Agnostic to the representation of the graph
• Property graph, (EMF, RDF)
• Information from the metamodel is only used for indexing
o Query layer decoupled from the data storage
• Storage layer freely exchangeable
• Indexing is independent of storage features

Scalability considerations
 Construction process
1. Shard the data in the storage layer
2. Derive a Rete net layout from the query
3. Allocate the middleware indexers
4. Allocate the Rete nodes in the cloud
 Design aspects for scalability
o Local resource limitations
o Load balancing
o Minimize remote communication
• Given problem characteristics, global resource requirements can
be calculated
• Approach intrinsically supports dynamic scaling
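Steps 1 and 4 of the construction process can be sketched as follows. This is an illustration under assumptions, not the allocation algorithm used by IncQuery-D: sharding uses a stable hash of the element id, and Rete nodes are placed greedily (largest first, onto the host with the most free memory), which addresses local resource limits and load balancing but ignores remote communication cost. All names and memory figures are hypothetical.

```python
import zlib


def shard_of(element_id: str, shard_count: int) -> int:
    """Step 1: pick a DB shard by a stable hash of the element id."""
    return zlib.crc32(element_id.encode("utf-8")) % shard_count


def allocate(nodes: dict, hosts: dict) -> dict:
    """Step 4: place Rete nodes on hosts without exceeding any host's memory.

    nodes maps node name -> estimated memory need (GB);
    hosts maps host name -> available memory (GB).
    """
    free = dict(hosts)
    placement = {}
    # Largest nodes first, each onto the host with the most free memory.
    for name, need in sorted(nodes.items(), key=lambda kv: -kv[1]):
        host = max(free, key=free.get)
        if free[host] < need:
            raise MemoryError(f"no host can fit {name} ({need} GB)")
        free[host] -= need
        placement[name] = host
    return placement


# Hypothetical Rete layout and cluster.
demo_nodes = {"join_ab": 6, "join_bc": 4, "indexer_a": 3, "indexer_b": 2, "production": 1}
demo_hosts = {"server0": 8, "server1": 8}
placement = allocate(demo_nodes, demo_hosts)
```

Summing the `nodes` values gives the global memory requirement mentioned above; comparing it with the total host capacity shows whether more servers are needed before allocation is attempted.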

Evaluation of IncQuery-D
 Benchmark goal
o Evaluate the feasibility of the concept
o Measure the scalability characteristics
o Workload profile similar to real world model validation
 Scenarios
o Batch – “traditional” batch graph search
o Incremental – Rete network
 Operations
o Simulates a user’s interaction with a model
o Load and first validation; transformation; revalidation

Batch graph scenario
 Load and first validation: load the graph into the databases
and execute the query
 Transformation: query the graph and delete some
elements
 Revalidation: execute the query
[Figure: batch pipeline. Labels: GraphML; load and first validation; DB shards; transformation; revalidation; result set.]

Incremental scenario – IncQuery-D
 Load and first validation: load the graph into the databases,
initialize the Rete net, and retrieve the results
 Transformation: incrementally query the graph,
delete some elements, and propagate the changes
 Revalidation: retrieve the results from the Rete net
[Figure: incremental pipeline. Labels: GraphML; load and first validation; Rete net; transformation; revalidation.]
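The difference between the two scenarios can be sketched with a toy validation constraint. The constraint ("every switch must have a sensor"), the function and class names, and the data are hypothetical and chosen only for illustration; the sketch runs in a single process and does not model the databases or the Rete net.

```python
def batch_validate(switches, sensor_of):
    """Batch scenario: scan the whole model on every (re)validation."""
    return {s for s in switches if s not in sensor_of}


class IncrementalValidator:
    """Incremental scenario: keep the result set up to date on each change."""

    def __init__(self, switches, sensor_of):
        self.sensor_of = dict(sensor_of)
        # Load and first validation: one full evaluation to initialize the cache.
        self.violations = batch_validate(switches, self.sensor_of)

    def delete_sensor(self, switch):
        """Transformation: delete an element and propagate the change."""
        if self.sensor_of.pop(switch, None) is not None:
            self.violations.add(switch)

    def revalidate(self):
        """Revalidation: return the cached result set without scanning the model."""
        return set(self.violations)


switches = {"sw1", "sw2", "sw3"}
sensor_of = {"sw1": "se1", "sw2": "se2"}

validator = IncrementalValidator(switches, sensor_of)
first = validator.revalidate()
validator.delete_sensor("sw1")
incremental_result = validator.revalidate()

# The batch scenario reaches the same answer by re-running the query.
batch_result = batch_validate(switches, {"sw2": "se2"})
```

In the incremental variant the cost of revalidation depends on the size of the result set, while the batch variant pays for a full scan each time.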

Implementation
[Figure: IncQuery-D architecture annotated with the technologies used. Labels: transaction; in-memory EMF model; DB shards 0–3 on Servers 0–3; IncQuery-D middleware with indexer layer; Rete net.]
 4 Ubuntu Linux servers (16 GB RAM, 2×2.5 GHz Intel Xeon CPU)
 Storage: Neo4j, accessed with Cypher through REST
 Middleware and Rete net: Akka (asynchronous communication)
 Detailed benchmark description: http://incquery.net/publications/incquery-d

Load and first validation phase
[Figure: plot of time [s] (axis ticks 1 to 4096) against model size [million elements / file size in GB] (axis ticks 0.1 / 0.008 to 55.8 / 3.853). Series: Neo4j/Cypher (batch), IncQuery-D (incremental).]
 Parallel loading of the graph from a GraphML representation
 Small overhead for the Rete network’s construction
 50M+: approx. 30 minutes

Transformation phase
[Figure: plot of time [s] (axis ticks 1 to 4096) against model size (axis ticks 0.1 / 0.008 to 55.8 / 3.853).]
 Steps
1. Elementary model query
2. Model manipulation
 Neo4j/Cypher (batch)
o Both steps implemented with Cypher
o The query evaluation time dominates
 IncQuery-D (incremental)
o Query is supported by the Rete net
o Only the manipulation is implemented with Cypher
o Overhead due to change propagation is negligible
o About 1.5 orders of magnitude (OOM) faster
o Performs a transformation over a 55M model in one minute

Revalidation phase
[Figure: plot of time [s] (axis ticks 0.25 to 4096) against model size (axis ticks 0.1 / 0.008 to 55.8 / 3.853).]
 Near-instant response time for very large models
 Different characteristics: 4 orders of magnitude (OOM) for the largest model
 Revalidation time is independent of node size

Conclusions
 Novel approach for the distributed execution of
incremental graph queries
 Distributed Rete network
o Middleware for change propagation and indexing
o Incremental query layer decoupled from a sharded
graph database
 Results
o Working proof of concept
o Near-instantaneous query evaluation for models with 50M+
elements
o Improves scalability of transformations significantly

Future work
 Tooling and automation
o Evolve the prototype into a developer tool
 Explore optimization possibilities
o Allocation of Rete nodes
o Dynamic reallocation of Rete nodes
o Sharding strategy, resource usage, network
communication overhead
 Cloud readiness
 Experiment with distributed EMF model stores
o CDO, MongoEMF, Morsa, …
