1. Rethinking the Storm 2.0 Worker
Roshan Naik
HORTONWORKS
April 2017, Storm & Kafka Meetup
Santa Clara, CA
2. Present : Storm 1.x
• Has matured into a stable and reliable system
• Widely deployed and holding up well in production
• Lots of new competition
• Differentiating on Features, Performance, Ease of Use etc.
3. Performance in 2.0
• How do we know if a streaming system is “fast”?
• Faster than another system ?
• What about Hardware potential ?
• See analysis in STORM-2284
• Dimensions
• Throughput
• Latency
• Resource utilization: CPU/Network/Memory/Disk/Power
• STORM-2284
• https://issues.apache.org/jira/browse/STORM-2284
4. Overview of Proposed Enhancements
• https://issues.apache.org/jira/browse/STORM-2284
5. Areas critical to Performance
• Messaging System
• Need Bounded Concurrent Queues that operate as fast as hardware allows
• Lock based queues not an option
• Lock free queues or preferably Wait-free queues
• Threading Model
• Fewer Threads. Less synchronization.
• Dedicated threads instead of pooled threads.
• CPU Pinning.
• Memory Model
• Lowering GC Pressure: Recycling Objects
• Reducing CPU cache faults: Controlling Object Layout (contiguous allocation)
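The messaging requirement above (a bounded concurrent queue without locks, served by dedicated threads) can be sketched in Java as follows. This is a minimal illustration only, not Storm's or JCTools' implementation: the class `SpscRingQueue`, its methods, and the `transferSum` demo are hypothetical names, and the sketch assumes exactly one producer thread and one consumer thread. Under that assumption each `offer` and `poll` finishes in a fixed number of steps, with no locks and no retry loops inside the queue itself.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

/** Hypothetical bounded single-producer/single-consumer ring buffer (illustration only). */
final class SpscRingQueue<E> {
    private final AtomicReferenceArray<E> slots;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next index to read; written only by the consumer
    private final AtomicLong tail = new AtomicLong(); // next index to write; written only by the producer

    SpscRingQueue(int capacity) {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a positive power of two");
        }
        slots = new AtomicReferenceArray<>(capacity);
        mask = capacity - 1;
    }

    /** Producer side: returns false instead of blocking when the queue is full. */
    boolean offer(E item) {
        long t = tail.get();
        if (t - head.get() == slots.length()) {
            return false;
        }
        slots.set((int) (t & mask), item);
        tail.set(t + 1); // volatile write publishes the slot to the consumer
        return true;
    }

    /** Consumer side: returns null instead of blocking when the queue is empty. */
    E poll() {
        long h = head.get();
        if (h == tail.get()) {
            return null;
        }
        int index = (int) (h & mask);
        E item = slots.get(index);
        slots.set(index, null); // drop the reference so the object can be collected
        head.set(h + 1);        // volatile write frees the slot for the producer
        return item;
    }

    /** Demo: one dedicated producer thread sends 1..count, one dedicated consumer thread sums them. */
    static long transferSum(int count) {
        SpscRingQueue<Integer> queue = new SpscRingQueue<>(1024);
        long[] sum = new long[1];
        Thread producer = new Thread(() -> {
            for (int i = 1; i <= count; i++) {
                while (!queue.offer(i)) {
                    Thread.yield(); // queue full: back off and retry
                }
            }
        });
        Thread consumer = new Thread(() -> {
            for (int received = 0; received < count; ) {
                Integer value = queue.poll();
                if (value == null) {
                    Thread.yield(); // queue empty: back off and retry
                } else {
                    sum[0] += value;
                    received++;
                }
            }
        });
        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum[0];
    }

    public static void main(String[] args) {
        System.out.println(transferSum(1_000_000)); // prints 500000500000
    }
}
```

The sketch boxes integers for brevity, which itself creates garbage. A production queue would also pad the head and tail counters to avoid false sharing and would reuse message objects, in line with the memory-model bullets above.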
7. Messaging - Current Architecture
[Diagram: message path inside a Storm 1.x worker]
• Bolt executor thread (user logic) publishes to its SEND Q
• SEND Q and RECEIVE Q each consist of a BATCHER (ArrayList: current batch; CLQ: overflow) feeding a Disruptor Q, with a flusher thread
• Send thread moves ArrayList batches from the SEND Q to a local executor's RECEIVE Q (local) or to the worker's outbound Q (remote)
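To make the batching path in the current architecture concrete, here is a simplified Java model of a batcher with an ArrayList current batch, a ConcurrentLinkedQueue (CLQ) overflow, and a flush operation that a flusher thread would call periodically. It is a sketch under stated assumptions, not Storm's code: `BatchingPublisher` and all of its methods are hypothetical, an `ArrayBlockingQueue` stands in for the Disruptor queue, and a single lock replaces whatever synchronization the real implementation uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical model of a batcher in front of a bounded queue (illustration only). */
final class BatchingPublisher<T> {
    private final int batchSize;
    private final BlockingQueue<List<T>> boundedQueue; // stand-in for the Disruptor queue
    private final ConcurrentLinkedQueue<List<T>> overflow = new ConcurrentLinkedQueue<>();
    private List<T> currentBatch; // guarded by "this"

    BatchingPublisher(int batchSize, int queueCapacity) {
        this.batchSize = batchSize;
        this.boundedQueue = new ArrayBlockingQueue<>(queueCapacity);
        this.currentBatch = new ArrayList<>(batchSize);
    }

    /** Called by the publishing thread for every message. */
    synchronized void publish(T item) {
        currentBatch.add(item);
        if (currentBatch.size() >= batchSize) {
            rollBatch();
        }
    }

    /** Called periodically by a flusher thread: pushes partial batches and retries overflow. */
    synchronized void flush() {
        if (!currentBatch.isEmpty()) {
            rollBatch();
        }
        drainOverflow();
    }

    /** Batches go through overflow first so that ordering is kept when the queue is full. */
    private void rollBatch() {
        overflow.add(currentBatch);
        currentBatch = new ArrayList<>(batchSize);
        drainOverflow();
    }

    /** Moves waiting batches into the bounded queue until it is full or overflow is empty. */
    private void drainOverflow() {
        List<T> next;
        while ((next = overflow.peek()) != null && boundedQueue.offer(next)) {
            overflow.poll();
        }
    }

    /** Consumer side: returns the next batch, or null if none is available. */
    List<T> takeBatch() {
        return boundedQueue.poll();
    }

    int overflowSize() {
        return overflow.size();
    }

    public static void main(String[] args) {
        BatchingPublisher<String> publisher = new BatchingPublisher<>(2, 1);
        publisher.publish("a");
        publisher.publish("b"); // first full batch reaches the bounded queue
        publisher.publish("c");
        publisher.publish("d"); // queue is full, so this batch waits in overflow
        System.out.println(publisher.overflowSize()); // prints 1
        System.out.println(publisher.takeBatch());    // prints [a, b]
        publisher.flush();                            // flusher moves the waiting batch
        System.out.println(publisher.takeBatch());    // prints [c, d]
    }
}
```

Even in this reduced form, a message can pass through three containers (current batch, overflow, bounded queue) and depends on a second thread to make progress when the queue is full. In the real worker this structure exists on both the send and the receive side.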
8. Messaging - New Architecture
[Diagram: the extracted labels are identical to those of the current-architecture diagram on slide 7]
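9. Messaging - New Architecture
[Diagram: message path in the proposed worker]
• Bolt executor thread (user logic) publishes directly to a local executor's RECEIVE Q (local) or to the worker's outbound Q (remote)
• RECEIVE Q consists of a BATCHER (ArrayList: current batch) feeding a JCTools Q
• No SEND Q, send thread, flusher thread, CLQ overflow, or Disruptor Q appears in this diagram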
12. Observations
• Latency: Dramatically improved.
• Throughput: Discovered multiple bottlenecks preventing significantly
higher throughput.
• Grouping: There are bottlenecks in LocalShuffle and FieldsGrouping. If these
are addressed along with some others, throughput can reach ~7 million/sec.
• TupleImpl: If inefficiencies here are addressed, throughput can reach 15
million/sec.
• ACK-ing: The ACKer bolt currently maxes out at ~2.5 million ACKs/sec.
This is a limitation of the implementation, not of the concept. I see room
for ACKer-specific fixes that can substantially improve its throughput.