
Optimization of Incremental Queries CloudMDE2015



  1. Optimization of Incremental Queries in the Cloud
     József Makai, Gábor Szárnyas, Ákos Horváth, István Ráth, Dániel Varró
     Budapest University of Technology and Economics, Department of Measurement and Information Systems, Fault Tolerant Systems Research Group
  2. INCQUERY-D: DISTRIBUTED INCREMENTAL MODEL QUERIES
  3. Incremental Query Evaluation by RETE
     • AUTOSAR well-formedness validation rule
       [Figure labels: Communication channel, Logical signal, Mapping, Physical signal; Invalid model fragment]
     • Instance model
       [Figure label: Valid model fragment]
  4. Incremental Query Evaluation by RETE
     Steps:
       1. Fill the input nodes
       2. Fill the worker nodes
       3. Read the result set
       4. Modify the model
       5. Propagate the changes
       6. Read the changes in the result set (deltas)
     [Figure: Rete network; labels: Communication channel, Logical signal, Mapping, Physical signal; join, join, antijoin; Result set]
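The propagation of deltas through a join node can be illustrated with a small sketch. This is not IncQuery-D code: the class name `JoinNode`, the `update` method and the example tuples are hypothetical, and only a plain equi-join is shown (the antijoin from the slide is left out).

```python
from collections import defaultdict


class JoinNode:
    """Hypothetical incremental equi-join with one indexed memory per input side."""

    def __init__(self, left_key, right_key):
        # Functions that extract the join key from a tuple on each side.
        self.key_of = {"left": left_key, "right": right_key}
        # join key -> set of tuples seen so far on that side
        self.memory = {"left": defaultdict(set), "right": defaultdict(set)}

    def update(self, side, tup, sign=+1):
        """Apply an insertion (+1) or deletion (-1) arriving on one side.

        Returns the delta of the join result as a list of (sign, joined_tuple).
        """
        other_side = "right" if side == "left" else "left"
        key = self.key_of[side](tup)
        own = self.memory[side][key]
        if (sign > 0) == (tup in own):
            return []  # duplicate insertion or deletion of an unknown tuple
        if sign > 0:
            own.add(tup)
        else:
            own.remove(tup)
        partners = sorted(self.memory[other_side][key])
        if side == "left":
            return [(sign, tup + p) for p in partners]
        return [(sign, p + tup) for p in partners]


# Assumed example data: left tuples are (channel, signal), right tuples are (signal, mapping).
join = JoinNode(left_key=lambda t: t[1], right_key=lambda t: t[0])
d_first = join.update("left", ("ch1", "sig1"))            # no partner yet
d_insert = join.update("right", ("sig1", "map1"))         # produces one new match
d_delete = join.update("left", ("ch1", "sig1"), sign=-1)  # retracts that match
```

Only the tuples that share the join key of the changed tuple are touched, which is why the cost of an update depends on the modification size rather than on the model size.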
  5. Goals of IncQuery-D
     • Objectives
       o Distributed incremental pattern matching
       o Adaptation of IncQuery tooling to graph DBs
       o Executed over cloud infrastructure (COTS hardware)
     • Achieve scalability by avoiding memory bottleneck
       o Sharding separately
         • Data
         • Indexers
         • Query network
       o In memory:
         • Index + Query
     • Assumptions
       o All Rete nodes fit on a server node
       o Indexers can be filled efficiently
       o Modification size ≪ model size
       o The application requires the complete result set of the query (as opposed to just one match)
  6. INCQUERY-D Architecture
     [Figure: Servers 0 to 3, each with a database shard (Database shard 0 to 3); Transaction; Rete net; Indexer layer; INCQUERY-D]
     • Distributed query evaluation network
     • Distributed indexer
     • Model access adapter
     • Distributed indexing, notification
     • Distributed persistent storage
     • Distributed production network
       o Each intermediate node can be allocated to a different host
       o Remote internode communication
  7. INCQUERY-D Architecture
     [Figure: Servers 0 to 3, each with a database shard (Database shard 0 to 3); Transaction; In-memory EMF model; Indexer layer with four Indexers; Rete nodes Join, Join, Antijoin]
     • Akka
     • Triple store (4store), Document DB (Mongo), RDF over Column family (Cumulus)
  8. RETE Deployment Process
     Stages: Query Language, Query Predicates, RETE Structure, Platform Description, Allocation / Mapping, Deployment Descriptor

     pattern routeSensor(sensor: Sensor) = {
       TrackElement.sensor(switch, sensor);
       Switch(switch);
       SwitchPosition.switch(sp, switch);
       SwitchPosition(sp);
       Route.switchPosition(route, sp);
       Route(route);
       neg find head(route, sensor);
     }

     pattern head(R, Sen) = {
       Route.routeDefinition(R, Sen);
     }

     [Figure: pattern graph with nodes route: Route, sp: SwitchPosition, switch: Switch, sensor: Sensor and edges switchPosition, switch, sensor, routeDefinition]
  9. RETE Deployment Process
     • Construct language-independent constraints
     • Resolution of
       o syntactic sugar
       o type information
     Variables: route, sp, switch
     Parameter: sensor
     Constraints:
       • Edge: SwitchPosition.switch
       • Edge: TrackElement.sensor
       • Edge: Route.switchPosition
       • Negation: head
  10. RETE Deployment Process
      • Construct RETE structure (platform independently)
      • Optimizations:
        o Model statistics
        o Expected usage profile
      [Figure: Rete structure with three join nodes]
  11. RETE Deployment Process
      • Architecture model (Cloud infrastructure)
        o Virtual Machines
          • Memory limits
          • CPU speed
          • Storage capacity
        o Communication Channels
          • Bandwidth
      • Specified by a textual DSL (Xtext)
      [Figure: machines 1 to 4]
  12. RETE Deployment Process
      Machine   Allocated Nodes
      1         In1, In2, Join2
      2         In3
      3         In4
      4         Join1, Join3
      [Figure: machines 1 to 4 hosting the nodes Join1, Join3, Join2, In1, In2, In3, In4]
      Allocation can be optimized for query performance and other beneficial system characteristics!
  13. RETE Deployment Process
      • Configuration scripts for
        o Deployment
        o Communication middleware
      • Derived by automated code generation
        o Using Eclipse technology: EMF-IncQuery + Xtend
  14. ALLOCATION OPTIMIZATION IN INCQUERY-D
  15. Motivation for Allocation Optimization
      • Considering data-intensive systems
        o Overuse of resources
        o Cost of the system
        o Overhead of network communication
      [Figure: two jobs; t: local job execution time; t': data transmission time]
      Data transmission time is a significant component of the global execution time
      [Figure: three jobs connected by network links]
      Network links can have different capacities
      [Figure labels: 4000 MB; Process 2000 MB; Process 500 MB; Process 2400 MB; $$$]
      Poor utilization leads to an expensive system
  16. The Allocation Problem
      • Inputs
      • Allocation constraints
      • Output: Valid allocation
      • Optimization targets
      [Figure: Rete network organized into processes 1 to 4, containing a worker node, two input nodes and a production node, with memory labels 500 MB, 3200 MB, 2400 MB and 600 MB]
        • Rete network for the query, organized into processes
        • Resource consumption
      [Figure: machines 1 and 2 with 5000 MB and 6000 MB]
        Available infrastructure with important resource parameters
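One possible way to write the allocation problem down is as a 0-1 program. The slides do not give the actual formulation, so every symbol here is an assumption: $x_{p,m}$ says whether process $p$ runs on machine $m$, $\mathrm{mem}_p$ and $\mathrm{cap}_m$ are the memory demand and capacity, $v_{pq}$ is the data volume sent from process $p$ to $q$ over the Rete edge set $E$, and $w_{mn}$ is the per-tuple weight of the link between machines $m$ and $n$ (with $w_{mm}$ the local weight).

```latex
\begin{aligned}
\min_{x} \quad & \sum_{(p,q) \in E} \; \sum_{m} \sum_{n} w_{mn} \, v_{pq} \, x_{p,m} \, x_{q,n} \\
\text{s.t.} \quad & \sum_{m} x_{p,m} = 1 && \text{for every process } p \\
& \sum_{p} \mathrm{mem}_p \, x_{p,m} \le \mathrm{cap}_m && \text{for every machine } m \\
& x_{p,m} \in \{0, 1\}
\end{aligned}
```

The two constraint families describe a valid allocation (each process placed exactly once, no machine over its memory limit). The objective shown is the communication target; the product $x_{p,m} x_{q,n}$ would have to be linearized before a linear solver could handle it.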
  17. Opt. Target: Communication Minimization
      [Figure: Rete network with processes 1 to 4 (worker node, two input nodes, production node; 500 MB, 3200 MB, 2400 MB, 600 MB) and edge volumes 1,000,000, 200,000 and 200,000]
      • First allocation (machines 1 and 2 with 6000 MB and 5000 MB)
        1 × 1,000,000 + 3 × 200,000 + 3 × 200,000
        Communication = 2,200,000
      • Second allocation (machines 1 and 2 with 5000 MB and 6000 MB)
        3 × 1,000,000 + 1 × 200,000 + 1 × 200,000
        Communication = 3,400,000
      The largest volume of data is sent through the faster local link
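The two totals on this slide follow from weighting each edge volume by 1 when both endpoints share a machine and by 3 when the edge crosses machines. The sketch below reproduces that arithmetic; the weights are read off the slide's products, while the edge names `e1` to `e3` and the function name are hypothetical.

```python
LOCAL_WEIGHT = 1   # multiplier for an edge whose endpoints share a machine
REMOTE_WEIGHT = 3  # multiplier for an edge that crosses machines


def communication_cost(edge_volumes, remote_edges):
    """Sum of edge volumes, each weighted by whether the edge is local or remote."""
    return sum(
        (REMOTE_WEIGHT if edge in remote_edges else LOCAL_WEIGHT) * volume
        for edge, volume in edge_volumes.items()
    )


# Edge volumes from the slide; the edge names are placeholders.
edge_volumes = {"e1": 1_000_000, "e2": 200_000, "e3": 200_000}

# First allocation: the heavy edge stays local, the two light edges are remote.
first = communication_cost(edge_volumes, remote_edges={"e2", "e3"})
# Second allocation: the heavy edge is remote, the two light edges are local.
second = communication_cost(edge_volumes, remote_edges={"e1"})
```

Keeping the 1,000,000-tuple edge local saves 1,200,000 weighted units compared with the second allocation.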
  18. Opt. Target: Cost Minimization
      [Figure: Rete network with processes 1 to 4 (worker node, two input nodes, production node; 500 MB, 3200 MB, 2400 MB, 600 MB)]
      Machine   Memory    Price
      1         4000 MB   $5
      2         4000 MB   $5
      3         6500 MB   $7
      • First allocation: Cost = 10
      • Second allocation: Cost = 12
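With only four processes and three machines, the cost target can be checked by exhaustive search. This is a brute-force sketch rather than the CPLEX-based optimization mentioned in the conclusion, and it assumes that a machine is paid for as soon as it hosts at least one process; the function and variable names are hypothetical.

```python
from itertools import product

process_memory = [500, 3200, 2400, 600]       # MB, memory labels from the slide
machines = [(4000, 5), (4000, 5), (6500, 7)]  # (capacity in MB, price in $)


def allocation_cost(allocation):
    """Total price of the machines in use, or None if a capacity is exceeded.

    allocation[i] is the index of the machine that hosts process i.
    """
    load = [0] * len(machines)
    for memory, machine in zip(process_memory, allocation):
        load[machine] += memory
    if any(used > cap for used, (cap, _) in zip(load, machines)):
        return None
    return sum(price for used, (_, price) in zip(load, machines) if used > 0)


# Enumerate all 3^4 assignments and keep the feasible ones.
feasible = []
for allocation in product(range(len(machines)), repeat=len(process_memory)):
    cost = allocation_cost(allocation)
    if cost is not None:
        feasible.append((cost, allocation))

best_cost, best_allocation = min(feasible)
```

No single machine can hold all 6700 MB, so at least two machines are needed; the two $5 machines together are the cheapest pair, which matches the slide's Cost = 10, while pairing a $5 machine with the $7 machine gives Cost = 12.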
  19. Heuristics in Optimization
      • Memory consumption of Rete nodes and processes
        o Memory usage of Input nodes can be estimated
      • Communication intensity of network communication channels
      [Figure: Model database (number of model elements) feeding an Input node of unknown size (?? MB); Rete network of input, worker and production nodes, with nodes and channels numbered 1 to 4]
  20. Performance Impact of Optimization
      [Figure: First evaluation time of a complex query. X axis: model size (number of elements), 61K, 213K, 867K, 3M, 13M. Y axis: time (sec), ticks 28, 45, 72, 114, 182, 290, 463, 739. Series: Max. memory, Naive optimization, Communication optimization. Data labels: 739, 616, 194, 144]
      • 2 minutes gain!
      • This approach doesn’t work for larger models!
  21. Network Traffic Statistics
      Network traffic in megabytes
                          vm0    vm1    vm2    total
      Unoptimized, remote 300    349    371    1020
      Unoptimized, local   14      2     74      90
      Optimized, remote   248    280    347     875
      Optimized, local     24     20    190     234
      • Unoptimized:
        o Remote Traffic: 1020
        o Local Traffic: 90
        o Total Traffic: 1110
      • Optimized:
        o Remote Traffic: 875
        o Local Traffic: 234
        o Total Traffic: 1109
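The totals on this slide can be recomputed from the per-VM figures, which also gives the share of traffic that goes over remote links. The numbers are taken from the slide; the dictionary layout and names are just one way to organize them.

```python
# Per-VM traffic in megabytes (vm0, vm1, vm2), as read from the chart.
traffic = {
    "unoptimized": {"remote": [300, 349, 371], "local": [14, 2, 74]},
    "optimized": {"remote": [248, 280, 347], "local": [24, 20, 190]},
}

summary = {}
for name, per_vm in traffic.items():
    remote = sum(per_vm["remote"])
    local = sum(per_vm["local"])
    summary[name] = {
        "remote": remote,
        "local": local,
        "total": remote + local,
        "remote_share": remote / (remote + local),
    }
```

Total traffic is practically unchanged (1110 MB versus 1109 MB), but the remote share drops from roughly 92% to roughly 79%: the optimization moves traffic from remote links to local ones rather than removing it.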
  22. Conclusion and Future Work
      • Results
        o Novel approach for application-specific resource allocation optimization for distributed Rete
        o CPLEX-based implementation for IncQuery-D
        o Preliminary evaluation results
          • Significant improvements for local resource management
          • Performance gains especially over slow / inhomogeneous networks
          • Efficient optimization execution (supported by runtime cutoff in CPLEX)
      • Future work
        o Hadoop / YARN support (new IncQuery-D developments)
          • Support configuration optimization for other Hadoop-based cloud apps
        o Static allocation → Dynamic reallocation
          • Take existing configuration as a starting constraint set
          • Optimize for changed workload conditions
  23. New INCQUERY-D Architecture
      New INCQUERY-D: “Hadoop over Docker”
      [Figure: Docker containers 0 to 3, each with a database shard (Database shard 0 to 3); Transaction; In-memory EMF model; Indexer layer with four Indexers; Rete nodes Join, Join, Antijoin]
      • YARN resource management
      • ZooKeeper monitoring
      • Akka actors embedded into long-running Hadoop jobs

Editor's Notes

  • This introduces very well the concepts that we also work with at the end, so it would be useful to present it.
  • Key ideas:
    Overuse of resources must be avoided, but poor utilization leads to an expensive system
    Data transmission time is a significant component of the global execution time, and network links can also have different speeds → this is what we optimize for
  • Here we will have to explain what the numbers on the individual "edges" mean.
  • We use normalized tuples for the estimation. For a single node it works as follows:
    we look at how much data is expected on the input channels (starting at the input nodes, where we know this for certain)
    From that we approximate the memory consumption of the processes using linear regression
    We compute the amount of data on the outgoing channels as a function of the node type and the amount of input data (it is the same on every outgoing channel); for an input node we know this for certain as well, because it forwards everything
    We do this level by level, traversing the network breadth-first

    This should be summarized here
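The estimation procedure in the last note can be sketched as a level-by-level pass over the Rete network. Everything numeric here is a made-up placeholder: the per-type output factors, the regression coefficients and the example network are assumptions, and the function name `estimate` is hypothetical. The traversal is breadth-first in topological order, so a node is handled only after all of its inputs are known.

```python
from collections import deque


def estimate(network, node_type, input_sizes, output_factor, slope, intercept):
    """Estimate memory use and outgoing data volume for every Rete node.

    network: node -> list of successor nodes (every node appears as a key).
    input_sizes: number of tuples arriving at each input node.
    output_factor: node type -> assumed ratio of outgoing to incoming tuples.
    slope, intercept: assumed linear-regression coefficients for memory use.
    """
    indegree = {node: 0 for node in network}
    for successors in network.values():
        for succ in successors:
            indegree[succ] += 1

    incoming = {node: 0.0 for node in network}
    incoming.update(input_sizes)  # known exactly for input nodes

    queue = deque(node for node in network if indegree[node] == 0)
    memory, outgoing = {}, {}
    while queue:
        node = queue.popleft()
        memory[node] = slope * incoming[node] + intercept
        # The same volume is assumed on every outgoing channel of a node.
        outgoing[node] = output_factor[node_type[node]] * incoming[node]
        for succ in network[node]:
            incoming[succ] += outgoing[node]
            indegree[succ] -= 1
            if indegree[succ] == 0:
                queue.append(succ)
    return memory, outgoing


# Hypothetical network: two input nodes feed a join, which feeds a production node.
network = {"in1": ["join"], "in2": ["join"], "join": ["prod"], "prod": []}
node_type = {"in1": "input", "in2": "input", "join": "join", "prod": "production"}
memory, outgoing = estimate(
    network,
    node_type,
    input_sizes={"in1": 1000, "in2": 500},
    output_factor={"input": 1.0, "join": 0.5, "production": 1.0},
    slope=0.25,
    intercept=50.0,
)
```

Input nodes use a factor of 1.0 because they forward everything they receive; the factor for other node types would have to come from measurements.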
