
Next Generation Execution for Apache Storm



With its large install base in production, the Storm 1.x line has proven itself as a stable and reliable workhorse that scales well horizontally. Much has been learnt from evolving the 1.x line that we can now leverage to build the next generation execution engine. Under the STORM-2284 umbrella, we are working hard to bring you this new engine which is being redesigned at a fundamental level for Storm 2.0. The goal is to dramatically improve performance and enhance Storm's abilities without breaking compatibility.
The resulting improvement in vertical scaling will help meet the needs of the growing user base by delivering more performance with less hardware.

In this talk, we will take an in-depth look at the existing and proposed designs for Storm's threading model and the messaging subsystem. We will also do a quick run-down of the major proposed improvements and share some early results from the work in progress.

Speaker
Roshan Naik, Senior MTS, Hortonworks



  1. Next Generation Execution Engine for Apache Storm. Roshan Naik, Hortonworks. Dataworks Summit, Sept 20th 2017, San Jose. © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  2. Present: Storm 1.x
     • Has matured into a stable and reliable system
     • Widely deployed and holding up well in production
     • Scales well horizontally
     • Lots of new competition, differentiating on features, performance, ease of use, etc.
     Storm 2.x
     • High performance execution engine
     • All Java code (transitioning away from Clojure)
     • Improved backpressure and metrics subsystems
     • Lots more: Streams API, UI improvements, RAS scheduler improvements, …
  3. Execution Engine – Planned Enhancements
     • Umbrella Jira: STORM-2284 – https://issues.apache.org/jira/browse/STORM-2284
  4. Performance
  5. Use Cases – Latency centric
     • 100 ms+: Factory automation
     • 10 ms - 100 ms: Real-time gaming, scoring shopping carts to print coupons
     • 0 - 10 ms: Network threat detection
     • Java-based High Frequency Trading systems
       – fast: under 100 microseconds 90% of the time, no GC during trading hours
       – medium: under 1 ms 95% of the time, and rare minor GC
       – slow: under 10 ms 99 or 99.9% of the time, minor GC every few minutes
       – Cost of being slow: better to turn it off than lose money by leaving it running
  6. Performance in 2.0
     • How do we know if a streaming system is “fast”?
       – Faster than another system?
       – What about hardware potential? (more on this later)
     • Dimensions
       – Throughput
       – Latency
       – Resource utilization: CPU/network/memory/disk/power
  7. Areas critical to Performance
     • Messaging system
       – Need bounded concurrent queues that operate as fast as the hardware allows
       – Lock-based queues are not an option
       – Lock-free queues, or preferably wait-free queues
     • Threading & execution model
       – Avoid unnecessary threads; less synchronization
       – Dedicated threads for spouts and bolts instead of pooled threads
       – CPU pinning
       – Reduce inter-thread, inter-process and inter-host communication
     • Memory model
       – Lowering GC pressure: recycling objects in the critical path
       – Reducing CPU cache faults: control object layout (contiguous allocation), avoid false sharing
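The queue requirement above can be made concrete with a minimal single-producer/single-consumer ring buffer that avoids locks by using ordered stores (`lazySet`) on the indices. This is a sketch for intuition only, not Storm's code; the redesign actually adopts JCTools queues (STORM-2306), and the class and field names here are invented:

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal SPSC bounded queue sketch: one producer thread, one consumer
// thread, no locks. lazySet publishes the index with an ordered store,
// avoiding a full memory fence on every offer/poll.
final class SpscQueue<E> {
    private final Object[] buffer;
    private final int mask;                            // capacity must be a power of 2
    private final AtomicLong head = new AtomicLong();  // consumer index
    private final AtomicLong tail = new AtomicLong();  // producer index

    SpscQueue(int capacityPow2) {
        buffer = new Object[capacityPow2];
        mask = capacityPow2 - 1;
    }

    boolean offer(E e) {                       // called only by the producer
        long t = tail.get();
        if (t - head.get() == buffer.length) return false; // full
        buffer[(int) t & mask] = e;
        tail.lazySet(t + 1);                   // makes the element visible to the consumer
        return true;
    }

    @SuppressWarnings("unchecked")
    E poll() {                                 // called only by the consumer
        long h = head.get();
        if (h == tail.get()) return null;      // empty
        E e = (E) buffer[(int) h & mask];
        buffer[(int) h & mask] = null;         // let GC reclaim the slot
        head.lazySet(h + 1);
        return e;
    }
}
```

Production queues (Disruptor, JCTools) additionally pad the indices onto separate cache lines to avoid false sharing, which this sketch omits for brevity.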
  8. Messaging Subsystem (STORM-2307)
  9. Understanding “Fast”
     Streaming engines – per-component throughput (millions/sec):
       AKKA (90-100 threads): 50
       Flink (per core): 1.5
       Apex v3.0 (container local): 4.3
       Gear Pump (4 nodes): 18
       InfoSphere Streams v3.0
     Raw queues – throughput (millions/sec):
       ArrayDeque (not thread safe, 1 thread rd+wr): 1063
       ArrayBlockingQueue (lock based, 1 thread rd+wr): 30
       ArrayBlockingQueue (lock based, 1 producer 1 consumer): 4
       Disruptor 3.3.x (SleepingWaitStrategy, ProducerMode=MULTI, 1P 1C): 25
       FastQ (lazySet(), 1P 1C): 31
       JCTools MPSC: 74 (1P), 59 (2P), 43 (3P), 40 (4P), 56 (6P), 65 (8P), 66 (10P), 68 (15P), 68 (20P)
     Huge gap between the streaming engines and the raw queues!
  10. Messaging – Current Architecture
     [Diagram: worker process, high-level view. Tuples arrive from the network through a worker receive thread and receive queue into each bolt/spout executor's receive queue; the executor thread (user logic) emits into its send queue, which a send thread drains into the worker's send queue for a worker send thread to write to the network.]
  11. Bolt/Spout Executor – Detailed
     [Diagram: both the receive queue and the send queue are Disruptor queues fed through a batcher (an ArrayList holding the current batch plus a ConcurrentLinkedQueue for overflow, one batcher per publisher), with dedicated flusher threads pushing out partial batches. The worker-level send thread routes ArrayLists of messages by destination ID, either to a local executor's receive queue or to the worker's outbound queue for remote delivery.]
  12. New Architecture
     [Diagram: the detailed executor view from the previous slide repeated as the baseline for the redesign, with the local and remote delivery paths highlighted.]
  13. Messaging – New Architecture (STORM-2306)
     [Diagram: each executor now has a single JCTools receive queue fed by a batcher (ArrayList current batch). The bolt executor thread publishes directly, either to a local executor's receive queue or to the worker's outbound queue for remote delivery; the separate send queues, overflow queues, flusher threads and send threads are gone.]
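The batcher idea that both designs share can be sketched in a few lines: tuples accumulate in a plain ArrayList (the "current batch") and are handed downstream in one call once the batch fills, amortizing per-tuple queue overhead. This is a simplified illustration, not Storm's implementation; in the real engine a flusher thread (or the executor's idle loop) also calls flush() periodically so partial batches don't add latency. All names here are invented:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Accumulates tuples and forwards them downstream a batch at a time.
final class Batcher<T> {
    private final int batchSize;
    private final Consumer<List<T>> downstream;  // e.g. enqueue onto a receive queue
    private ArrayList<T> current = new ArrayList<>();

    Batcher(int batchSize, Consumer<List<T>> downstream) {
        this.batchSize = batchSize;
        this.downstream = downstream;
    }

    void publish(T tuple) {
        current.add(tuple);
        if (current.size() >= batchSize) flush();  // full batch: hand off in one call
    }

    void flush() {  // in the real engine, also invoked periodically by a flusher
        if (current.isEmpty()) return;
        downstream.accept(current);
        current = new ArrayList<>(batchSize);      // start a fresh batch
    }
}
```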
  14. Preliminary Numbers – Latency
     • 1 spout --> 1 bolt with 1 ACKer (all in same worker)
       – v1.0.1: 3.4 milliseconds
       – v2.0 master: ~7 milliseconds
       – v2.0 redesigned: under 100 microseconds (116x improvement)
  15. Preliminary Numbers – Throughput
     • 1 spout --> 1 bolt [w/o ACKing]
       – v1.0.1: ?
       – v2.0 master: ~4 million/sec
       – v2.0 redesigned: 7 - 8 million/sec (~2x, but can be much better)
     • 1 spout --> 1 bolt [with ACKing]
       – v1.0: 233 K/sec
       – v2.0 master: 900 K/sec
       – v2.0 redesigned: 1.5 million/sec (again, can be much better)
  16. Observations
     • Latency: dramatically improved.
     • Throughput: discovered multiple bottlenecks preventing significantly higher throughput.
       – Grouping: if the bottlenecks in LocalShuffle and FieldsGrouping are addressed along with some others, throughput can reach ~7 million/sec.
       – TupleImpl: if inefficiencies here are addressed, throughput can reach ~15 million/sec.
       – ACKing: the ACKer bolt currently maxes out at ~2.5 million ACKs/sec. This is a limitation of the implementation, not the concept; there is room for ACKer-specific fixes that can also substantially improve its throughput.
  17. CPU Pinning (STORM-2313)
  18. CPU cache access
     • Approximate access costs
       – L1 cache: 1x
       – L2 cache: 2.5x
       – Local L3 cache: 10-20x
       – Remote L3 cache: 25-75x
  19. CPU Affinity
     • For inter-thread communication, cache fault distance matters
       – Faster between cores on the same socket
       – ~20% latency hit when threads are pinned to different sockets
     • Pinning threads to CPUs
       – If done right, minimizes cache fault distance
       – Threads that migrate between cores need their caches refreshed
       – Unrelated threads running on the same core thrash each other's caches
     • Helps performance on NUMA machines
       – Pinning long-running tasks reduces NUMA effects
       – NUMA-aware allocator introduced in Java SE 6u2
  20. CPU Pinning Strategy
     • Pin executors to physical cores
       – Pin each executor to a separate physical core for high-throughput / very-low-latency topologies
       – Not economical for other topologies
     • Try to fit subsequent executor threads on the same socket
     • Logical cores, i.e. hyperthreading?
       – Avoid hyperthreading, so executors don't thrash each other's cache on the same physical core
       – Could be provided as an option in the future
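Java itself has no portable thread-pinning API (per-thread affinity needs JNI or a third-party library such as OpenHFT Affinity), so one practical route on Linux is to launch each worker JVM under `taskset`, binding the whole process to a chosen set of physical cores. Below is a hypothetical sketch of building such a launch command; the `Pinning` class, the core numbers, and `worker.jar` are all illustrative, not from the talk:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class Pinning {
    // Wraps a worker launch command in `taskset -c <core-list>` so the
    // kernel scheduler keeps the whole worker process on those cores.
    static List<String> pinnedCommand(List<String> workerCmd, int... cores) {
        StringBuilder cpuList = new StringBuilder();
        for (int i = 0; i < cores.length; i++) {
            if (i > 0) cpuList.append(',');
            cpuList.append(cores[i]);
        }
        List<String> cmd = new ArrayList<>(Arrays.asList("taskset", "-c", cpuList.toString()));
        cmd.addAll(workerCmd);
        return cmd;  // e.g. hand to ProcessBuilder
    }
}
```

Process-level pinning like this reduces cross-socket migration but cannot place individual executor threads on specific cores, which is why the slide treats per-executor pinning as future engine work.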
  21. Threading & Execution Model (STORM-2307)
  22. New Threading & Execution Model
     [Diagram: the worker process handles start/stop/monitoring of executors, metrics, topology reconfig, and heartbeats. Each spout/bolt task runs on its own executor thread with its queue, counters, and grouper; dedicated system tasks, each on its own executor thread, handle intra-host input, inter-host input, and outbound messages.]
  23. Memory Management
  24. Memory Management
     Can be decomposed into 2 key areas:
     • Object recycling in the critical path
       – Avoids dynamic allocation cost
       – Minimizes stop-the-world GC pauses
     • Contiguous allocation: arrays, data members
       – CPU likes it; pre-fetch friendly
       – Fewer cache faults per object
       – Natural in C++, very painful in Java
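The object-recycling idea can be sketched as a trivial free-list pool: instead of allocating a fresh object per tuple on the critical path, spent instances are returned and reused, so steady-state processing allocates nothing. A minimal single-threaded sketch with invented names (Storm's actual recycling, and any thread-safety it needs, is not shown here):

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

// Free-list object pool: acquire() reuses a released instance when one is
// available, falling back to the factory only when the pool is empty.
// Callers must reset recycled objects' state themselves before reuse.
final class ObjectPool<T> {
    private final ArrayDeque<T> free = new ArrayDeque<>();
    private final Supplier<T> factory;

    ObjectPool(Supplier<T> factory) { this.factory = factory; }

    T acquire() {
        T obj = free.poll();                       // reuse if possible
        return obj != null ? obj : factory.get();  // else allocate
    }

    void release(T obj) { free.push(obj); }        // recycle for the next acquire
}
```

Recycling trades a small bookkeeping cost for the elimination of allocation and, more importantly, of the GC pauses those allocations eventually trigger.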
  25. Thank You! Questions?
     Tomorrow: Data Guarantees And Fault Tolerance In Streaming Systems, 5:10 pm, Room: C4.5
     References: https://issues.apache.org/jira/browse/STORM-2284
