Optimization of Incremental Queries CloudMDE2015

Budapest University of Technology and Economics
Department of Measurement and Information Systems
Optimization of Incremental Queries in
the Cloud
József Makai, Gábor Szárnyas, Ákos Horváth,
István Ráth, Dániel Varró
Budapest University of Technology and Economics
Fault Tolerant Systems Research Group

INCQUERY-D: DISTRIBUTED
INCREMENTAL MODEL QUERIES

Incremental Query Evaluation by RETE
 AUTOSAR well-formedness validation rule
Communication
channel
Logical signal Mapping Physical signal
Invalid model fragment
 Instance model
Valid model fragment

Fill the input nodesFill the worker nodesRead the result setModify the modelPropagate the changes
Read the changes in the
result set (deltas)
Incremental Query Evaluation by RETE
join
join
antijoin
Result set
Communication
channel
Logical signal Mapping Physical signal

Goals of IncQuery-D
 Objectives
o Distributed incremental pattern matching
o Adaptation of IncQuery tooling to graph DBs
o Executed over cloud infrastructure (COTS hardware)
 Achieve scalability by avoiding memory bottleneck
o Sharding separately
• Data
• Indexers
• Query network
o In memory:
• Index + Query
Assumptions
• All Rete nodes fit on a server node
• Indexers can be filled efficiently
• Modification size ≪ model size
• The application requires the complete result
set of the query (opposed to just one match)

Database
shard 0
INCQUERY-D Architecture
Server 1
Database
shard 1
Server 2
Database
shard 2
Server 3
Database
shard 3
Transaction
Server 0
Rete net
Indexer
layer
INCQUERY-D
Distributed query evaluation network
Distributed indexer Model access adapter
Distributed indexing,
notification
Distributed persistent
storage
Distributed production network
• Each intermediate node can be allocated
to a different host
• Remote internode communication

INCQUERY-D Architecture
Server 1
Database
shard 1
Server 2
Database
shard 2
Server 3
Database
shard 3
Transaction
In-memory
EMF model
Database
shard 0
Server 0
Indexer
layer
INCQUERY-D
Indexer Indexer Indexer Indexer
Join
Join
Antijoin
Akka
Triple store (4store),
Document DB (Mongo),
RDF over Column family
(Cumulus)

RETE Deployment Process
Query
Language
Query
Predicates
RETE
Structure
Platform
Description
Allocation /
Mapping
Deployment
Descriptor
pattern routeSensor(sensor: Sensor) = {
TrackElement.sensor(switch,sensor);
Switch(switch);
SwitchPosition. switch(sp, switch);
SwitchPosition(sp);
Route.switchPosition(route, sp);
Route(route);
neg find head(route, sensor);
}
pattern head(R, Sen) = {
Route.routeDefinition(R, Sen);
}
route: Route sp: SwitchPosition
Switch:Switchsensor:Sensor
switchPosition
switch
sensor
routeDefinition

 Construct language-
independent constraints
 Resolution of
o syntactic sugar
o type information
Query
Language
Query
Predicates
RETE
Structure
Platform
Description
Allocation /
Mapping
Deployment
Descriptor
Variables route sp switch
Parameter sensor
Constraints
Edge: SwitchPosition.switch
Edge: TrackElement.sensor
Edge: Route.switchPosition
Negation: head

 Construct RETE structure
(platform independently)
 Optimizations:
o Model statistics
o Expected usage profile
Query
Language
Query
Predicates
RETE
Structure
Platform
Description
Allocation /
Mapping
Deployment
Descriptor
join
join
join

 Architecture model
(Cloud infrastructure)
o Virtual Machines
• Memory limits
• CPU speed
• Storage capacity
o Communication Channels
• Bandwidth
 Specified by a textual DSL
(Xtext)
Query
Language
Query
Predicates
RETE
Structure
Platform
Description
Allocation /
Mapping
Deployment
Descriptor
1 2
3 4

Machine Allocated Nodes
1 In1, In2, Join2
2 In3
3 In4
4 Join1, Join3
Query
Language
Query
Predicates
RETE
Structure
Platform
Description
Allocation /
Mapping
Deployment
Descriptor
1 2
3 4
Join1
Join3
Join2
In1 In2 In3 In4
Allocation can be optimized for
query performance and other
beneficial system characteristics!

 Configuration scripts for
o Deployment
o Communication
middleware
 Derived by automated
code generation
o Using Eclipse technology:
EMF-IncQuery + Xtend
Query
Language
Query
Predicates
RETE
Structure
Platform
Description
Allocation /
Mapping
Deployment
Descriptor

ALLOCATION OPTIMIZATION IN
INCQUERY-D

Motivation for Allocation Optimization
 Considering data-intensive
systems
o Over usage of resources
o Cost of the system
o Overhead of network
communication
Job Job
t
Local job
execution time
t’
Data transmission time
is significant component
in global execution time
~
Job
Job
Job
Network links can have
different capacities
4000 MB
Process
2000 MB
Process
500 MB
Process
2400 MB
$$$
Poor utilization leads
to expensive system

The Allocation Problem
 Inputs
 Allocation constraints
 Output: Valid allocation
 Optimization targets
500 MB
3200 MB
2400 MB600 MB
Worker node
Input nodeInput node
Production node
1 2
3
4
5000 MB6000 MB
1 2
• Rete network for the query
organized to processes
• Resource consumption
Available infrastructure with
important resource parameters

Opt. Target: Communication Minimization
1 × 1,000,000
3 × 200,000 3 × 200,000
Communication = 2,200,000
6000 MB
5000 MB
1
2500 MB
3200 MB
2400 MB600 MB
Worker node
Production node
1,000,000200,000
200,000
1 2
3
4
3 × 1,000,000
1 × 200,000
1 × 200,000
Communication = 3,400,000
5000 MB
6000 MB
1
2
Largest volume of data is
sent through faster local link

Opt. Target: Cost Minimization
500 MB
3200 MB
2400 MB600 MB
Worker node
Production node
1 2
3
4
4000 MB
$5
4000 MB
$5
6500 MB
$7
1
2
3
Cost = 10
4000 MB
$5
4000 MB
$5
6500 MB
$7
1
2
3
Cost = 12

Heuristics in Optimization
Worker node
Production
node
Input node
Worker node
Worker node
Production
node
Production
node
Worker node
Model
database
Number of model
elements
?? MB
Input node
Memory consumption of
Rete nodes and processes
1 1 1
1 1 1
1
Memory usage of Input
nodes can be estimated
Communication
intensity of network
communication
channels2 2
2
2
2 2
3 3
3
3 3
4 4

Performance Impact of Optimization
61K 213K 867K 3M 13M
Model size (number of elements)
Time(sec)
First evaluation time of a complex query
28
45
72
114
182
290
463
739
Max. memory
Naive
optimization
Communication
optimization
739
616
194
144
2 minutes gain!
This approach
doesn’t work for
larger models!

Network Traffic Statistics
300
349 371
1020
248 280
347
875
14
2
74
90
24
20
190
234
0
200
400
600
800
1000
1200
vm0 vm1 vm2 total vm0 vm1 vm2 total
Network Traffic in Megabytes
Remote Local
Unoptimized Optimized
 Unoptimized:
o Remote Traffic:
1020
o Local Traffic: 90
o Total Traffic: 1110
 Optimized:
o Remote Traffic:
875
o Local Traffic: 234
o Total Traffic: 1109

Conclusion and Future Work
 Results
o Novel approach for application-specific resource allocation optimization for
distributed Rete
o CPLEX-based implementation for IncQuery-D
o Preliminary evaluation results
• Significant improvements for local resource management
• Performance gains especially over slow / inhomogeneous networks
• Efficient optimization execution (supported by runtime cutoff in CPLEX)
 Future work
o Hadoop / YARN support (new IncQuery-D developments)
• Support configuration optimization for other Hadoop-based cloud apps
o Static allocation  Dynamic reallocation
• Take existing configuration as a starting constraint set
• Optimize for changed workload conditions

New INCQUERY-D Architecture
Docker container 1
Database
shard 1
Docker container 2
Database
shard 2
Docker container 3
Database
shard 3
Transaction
In-memory
EMF model
Database
shard 0
Docker container 0
Indexer
layer
New INCQUERY-D: “Hadoop over Docker”
Indexer Indexer Indexer Indexer
Join
Join
Antijoin
• YARN resource
management
• ZooKeeper
monitoring
Akka actors
embedded into long-
running Hadoop jobs

Optimization of Incremental Queries CloudMDE2015

More Related Content

Similar to Optimization of Incremental Queries CloudMDE2015

Recently uploaded

Optimization of Incremental Queries CloudMDE2015

Editor's Notes