Improved Reliable Streaming Processing: Apache Storm as example
  1. Improved Reliable Streaming Processing: Apache Storm as example
     Frank Zhao, EMC CTO Office; Fenghao Zhang*, Microsoft Bing; Yusong Lv*, Peking University
     Special thanks to EMC Ken Taylor, John Cardente and Lincourt Robert
     *Zhang and Lv contributed to the research when they worked at EMC China COE
     © Copyright 2016 EMC Corporation. All rights reserved.
  2. DISCLAIMER
     The technology concepts being discussed and demonstrated are the result of research conducted by the Advanced Research & Development (ARD) team from the EMC Office of the CTO. Any demonstrated capability is for research purposes only and is at a prototype phase; therefore: THERE ARE NO IMMEDIATE PLANS NOR INDICATION OF SUCH PLANS FOR PRODUCTIZATION OF THESE CAPABILITIES AT THE TIME OF PRESENTATION. THINGS MAY OR MAY NOT CHANGE IN THE FUTURE.
  3. Agenda
     • Distributed Streaming System
     • Reliable Processing
     • Apache Storm’s Solution, the Challenge
     • New Proposed Approaches
       – Fingerprint, and share-split
     • Prototyping with Apache Storm and Benchmark
     • Summary and Outlook
  4. Streaming processing
     • As a service, continuously process data (a.k.a. messages or tuples) in a scalable, reliable and high-performance way (millisecond level)
       – Open-source: Storm, Flink, Spark Streaming, Samza
  5. Streaming vs. batch processing
     Aspect          | Streaming processing (Storm, Spark Streaming)  | Batch processing (Hadoop MR)
     Type            | Continuous (never stops), real-time (ms level) | Batch/periodic
     Model           | DAG/graph                                      | MapReduce-like jobs
     Workload        | CPU/memory intensive                           | CPU/memory and I/O intensive
     State           | Stateless, may checkpoint periodically         | Stateful
     Cluster         | Master-slave w/ Zookeeper (Storm)              | Master-slave or job-task
     Fault-tolerance | Fault-tolerance/HA                             | Fault-tolerance/HA
  6. Storm, Flink, Spark Streaming*
     Aspect          | Storm                                         | Flink         | Spark Streaming
     Built since     | 2011 (Apache, Trident); 2016 (Twitter Heron)  | 2014 (Apache) | ~2013
     Streaming       | Native (micro-batch, Trident)                 | Native        | Micro-batch
     Guarantee       | At least once (exactly-once w/ Trident)       | Exactly-once  | Exactly-once
     Fault-tolerance | Ack per message                               | Checkpoint    | Checkpoint
     Latency         | 5                                             | 4             | 3
     Throughput      | 4                                             | 5             | 5
     Ecosystem       | 5                                             | 3             | 3
     *Personal observations for reference only
  7. Reliable processing
     • Every message shall be guaranteed to be processed
       – At-most once
       – At-least once
       – Exactly once
     [Figure: topology (DAG) with a data source, a spout, bolts (worker, task, op), nodes 0-9, root message R and messages B-M; results may be saved.]
  8. Apache Storm
     • Scalable
     • Fault-tolerant
     • Guaranteed message processing
       – At least once (default)
     • Fast: ms level
       – Pure in-memory computing, no checkpoint
     • Simple programming model
       – Topology, Spouts, Bolts
       – Clojure, Java, Ruby, Python …
  9. Storm: designs for fault-tolerance
     • Nimbus (master)
       – Deploy topology
       – Dispatch tasks
       – Monitor cluster
     • Zookeeper cluster
       – Coordination
       – States of Nimbus
       – State of supervisor
       – …
     • Workers: Supervisor, Executor, Task
     These fault-tolerance designs are about thread/task/job or node, NOT message
  10. Track processing status in DAG
      • Critical: message granularity (NOT thread/task/job/node)
      • Need an efficient method, considering
        – Every component may fail
        – Large topology, continuous flood of messages
        – Network temporarily unavailable, traffic out of order, …
        – Minimized resource usage (network, CPU, memory)
      [Figure: topology with data source, spout, bolts, nodes 0-9, root message R and messages B-M.]
  11. Tough problem and Storm’s answer!
      “History of Apache Storm and lessons learned”
      – Nathan Marz, creator of Storm
  12. Storm reliability track algorithm
      1) Each msg has an ID (8-byte random number)
      2) Each bolt runs XOR(inMsgID, outMsgID[]) per inMsg
      3) Each bolt sends the XOR result (per inMsg) to the Acker
      4) Acker XORs each received result into its status: always 8 bytes (regardless of topology size)
      5) Finally, at timeout, Acker.status shall be 0, which means OK; otherwise something failed (may raise a false alarm, but never misses)
      [Figure: nodes 0-4 and the Acker, whose entry (srcNodeID: R) starts with status R; messages A-F. The values sent to the Acker are R ⊕ A ⊕ B ⊕ C, A ⊕ D, B ⊕ E, C ⊕ F and D ⊕ E ⊕ F.]
      Status = R ⊕ (R ⊕ A ⊕ B ⊕ C) ⊕ (A ⊕ D) ⊕ (B ⊕ E) ⊕ (C ⊕ F) ⊕ (D ⊕ E ⊕ F) = 0
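The five steps above can be sketched in a few lines of Python. This is only an illustration of the XOR bookkeeping on the 5-node example, not Storm's actual Acker code; `run_topology`, `lost_acks` and the hard-wired `work` list are hypothetical names introduced here.

```python
import secrets
from functools import reduce
from operator import xor

def run_topology(ids, lost_acks=()):
    """Return the Acker's final status for the 5-node example.

    ids maps message names (R, A..F) to 8-byte IDs; lost_acks is a set of
    node numbers whose acks never reach the Acker (simulated failure).
    """
    # (node, input message, output messages), one entry per input message
    work = [
        (0, "R", ["A", "B", "C"]),
        (1, "A", ["D"]),
        (2, "B", ["E"]),
        (3, "C", ["F"]),
        (4, "D", []),
        (4, "E", []),
        (4, "F", []),
    ]
    status = ids["R"]  # the Acker entry starts with the root ID
    for node, in_msg, out_msgs in work:
        if node in lost_acks:
            continue
        # step 2: XOR of the input ID with all output IDs
        ack = reduce(xor, (ids[m] for m in out_msgs), ids[in_msg])
        # step 4: the Acker folds the ack into its 8-byte status
        status ^= ack
    return status

ids = {name: secrets.randbits(64) for name in "RABCDEF"}
print(run_topology(ids))  # 0: every ID was XORed exactly twice
```

Because XOR is commutative and associative, the order in which the loop applies the acks does not matter, which is why out-of-order ack traffic is harmless.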
  13. Ingenious!
      • RandomNum + XOR based: the key foundation of Storm, which has run for 5+ years
        – Smart, simple and pretty good!
        – Minimal memory footprint at the Acker, regardless of topology
        – Reliable*, regardless of Ack traffic order
        – XOR op: commutative law, associative law
          • Easy to handle any out-of-order traffic
      *In theory, random IDs may collide
  14. Limitations
      • Network traffic and CPU overhead → latency & throughput impact
        – Possibility of random number collision
      • Benchmark*: 25000 msg/sec (non-reliable processing) vs. 9300 msg/sec (reliable processing)
      *3rd-party benchmark in 2012; things may have changed since
  15. IS IT POSSIBLE?
      The current algorithm is fantastic; however: Ack only at leaf?
      [Figure: topology with data source, root message R, nodes 0-9 and messages B-M.]
  16. 2 new proposed approaches
      • Same-level guaranteed reliable processing
      • More scalable, efficient and fast
        – Much less Ack traffic; usually only at leaf nodes
        – Same memory footprint, less CPU usage
        – Eventually better latency/throughput
      Currently in research & quick validation phase
  17. Approach-1: fingerprint based
      • An evolution based on Random Num + XOR
        – Currently: each ID is XORed in a pair (send, recv), so the result is 0
        – Further: XOR in multiple pairs (2, 4, 6, …) is still 0
  18. Approach-1: fingerprint idea
      • Fingerprint (FP): a digest (e.g., 8 bytes) of {in msgs, out msgs and parent.fp} that encodes and represents the context and is then recursively passed down, so that each downstream node inherits genes from all its ancestors
        – Still uses XOR of IDs, redundant in a scalable way
        – 3 rules:
          • Embedded: as part of metadata
          • Recursively inherited: pass-down
          • Append-only update: via XOR (InMsgID XOR [outMsgIDs])
      [Figure: nodes Ni, Ni+1, Ni+2, Ni+3. Node Ni receives msg <Mj, FPj>, applies the append update and passes the FP down in its output messages <Mj+1, FPj:i>, <Mj+2, FPj:i>, …]
  19. Fingerprint example
      [Figure: nodes 0-4 and the Acker. Acker entry (srcNodeID: RootMsgID), Init: R. Node 0 receives R, calculates the FP and emits <A, FP0>, <B, FP0>, <C, FP0>; the next layer emits <D, FP1>, <E, FP2>, <F, FP3>; leaf node 4 sends FP4-D, FP4-E, FP4-F to the Acker (may batch).]
      FP0 = R ⊕ A ⊕ B ⊕ C
      FP1 = FP0 ⊕ A ⊕ D
      FP2 = FP0 ⊕ B ⊕ E
      FP3 = FP0 ⊕ C ⊕ F
      Leaf has 3 Ack messages:
      FP4-D = FP1 ⊕ D
      FP4-E = FP2 ⊕ E
      FP4-F = FP3 ⊕ F
      Acker.status = R ⊕ (FP0 ⊕ A ⊕ D) ⊕ D ⊕ (FP0 ⊕ B ⊕ E) ⊕ E ⊕ (FP0 ⊕ C ⊕ F) ⊕ F
                   = R ⊕ FP0 ⊕ A ⊕ B ⊕ C
                   = 0
  20. Approach-1: failure example
      [Figure: same topology as the fingerprint example. Acker entry (srcNodeID: RootMsgID), Init = R; messages <A, FP0>, <B, FP0>, <C, FP0>, <D, FP1>, <E, FP2>, <F, FP3>; leaf acks FP4-D, FP4-E, FP4-F.]
      If msg D fails, node 4 only Acks FP4-E and FP4-F, so finally
      Acker.status = R ⊕ FP4-E ⊕ FP4-F
                   = R ⊕ FP2 ⊕ E ⊕ FP3 ⊕ F
                   = R ⊕ (FP0 ⊕ B ⊕ E ⊕ E) ⊕ (FP0 ⊕ C ⊕ F ⊕ F)
                   = R ⊕ B ⊕ C != 0
      → Info about the A/D path is missing, due to the failure!
      Another example: if all messages fail, Acker.status is R != 0
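The fingerprint example and its failure case can be checked with a short Python sketch. It hard-wires the three paths A→D, B→E, C→F of the example; `fingerprint_status` and its `failed` parameter are hypothetical names used only for illustration, and small fixed IDs stand in for the 8-byte random numbers.

```python
def fingerprint_status(ids, failed=()):
    """Acker status for the 5-node fingerprint example.

    Only the leaf (node 4) acks. `failed` lists messages that were lost,
    so the leaf never sends an ack for the path they belong to.
    """
    fp0 = ids["R"] ^ ids["A"] ^ ids["B"] ^ ids["C"]  # computed at node 0
    status = ids["R"]                                  # Acker init value
    for first, second in (("A", "D"), ("B", "E"), ("C", "F")):
        if first in failed or second in failed:
            continue                                   # path broken: no leaf ack
        fp_mid = fp0 ^ ids[first] ^ ids[second]        # FP1, FP2 or FP3
        leaf_ack = fp_mid ^ ids[second]                # FP4-D, FP4-E or FP4-F
        status ^= leaf_ack
    return status

ids = {"R": 1, "A": 2, "B": 4, "C": 8, "D": 16, "E": 32, "F": 64}
print(fingerprint_status(ids))                # 0: all paths completed
print(fingerprint_status(ids, failed={"D"}))  # equals R ^ B ^ C, non-zero
```

With three downstream messages, FP0 is XORed in an odd number of times at the Acker and survives to cancel against R ⊕ A ⊕ B ⊕ C; a lost path leaves an even count and the status stays non-zero.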
  21. Approach-1: a complex example
      [Figure: nodes 1-8 and the Acker; root message R and messages A, B, C, D, E, F, G, H, I, X. FP5 and FP8 are sent to the Acker.]
      Initial: R
      FP1 = R ⊕ A ⊕ B ⊕ C
      FP2 = FP1 ⊕ A ⊕ D
      FP3 = FP1 ⊕ B ⊕ X
      FP4 = FP1 ⊕ C ⊕ E
      // bolt 5 sends FP5 to the Acker since it has an even number of downstreams (2)
      FP5 = FP2 ⊕ D ⊕ FP3 ⊕ X ⊕ FP4 ⊕ E ⊕ (F ⊕ G)
      FP6 = FP5 ⊕ F ⊕ H
      FP7 = FP5 ⊕ G ⊕ I
      // bolt 8 sends FP8 to the Acker
      FP8 = FP6 ⊕ H ⊕ FP7 ⊕ I
      Final Status = R ⊕ FP5 ⊕ FP8
                   = R ⊕ FP5 ⊕ (FP5 ⊕ F) ⊕ (FP5 ⊕ G)
                   = R ⊕ FP5 ⊕ (F ⊕ G)
                   = R ⊕ FP2 ⊕ D ⊕ FP3 ⊕ X ⊕ FP4 ⊕ E
                   = R ⊕ (FP1 ⊕ A ⊕ B ⊕ C)
                   = 0
      Limits and notes:
      1) The number of downstream msgs shall be odd (1, 3, 5, …); otherwise, the bolt must send the new FP to the Acker, which XORs it into its status
      2) To implement this approach, ideally a bolt needs to know the total number of downstream msgs so that it can generate the FP before emitting
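The derivation above can be replayed numerically. The Python sketch below follows the slide's formulas line by line; `complex_status` and the `ack_fp5` switch are hypothetical, added only to show what happens if bolt 5 skips the extra ack that note 1) requires.

```python
def complex_status(ids, ack_fp5=True):
    """Replay the 8-node example and return the Acker's final status."""
    R, A, B, C, D, E, F, G, H, I, X = (ids[k] for k in "RABCDEFGHIX")
    fp1 = R ^ A ^ B ^ C
    fp2 = fp1 ^ A ^ D
    fp3 = fp1 ^ B ^ X
    fp4 = fp1 ^ C ^ E
    # bolt 5 merges three inputs and emits two messages (even fan-out)
    fp5 = (fp2 ^ D) ^ (fp3 ^ X) ^ (fp4 ^ E) ^ (F ^ G)
    fp6 = fp5 ^ F ^ H
    fp7 = fp5 ^ G ^ I
    fp8 = (fp6 ^ H) ^ (fp7 ^ I)   # leaf bolt 8 merges two inputs
    status = R                    # Acker init value
    if ack_fp5:
        status ^= fp5             # extra ack required by the even fan-out
    status ^= fp8                 # leaf ack
    return status

# distinct one-bit IDs stand in for the 8-byte random numbers
ids = {name: 1 << i for i, name in enumerate("RABCDEFGHIX")}
print(complex_status(ids))                 # 0
print(complex_status(ids, ack_fp5=False))  # R ^ F ^ G, non-zero
```

Without the FP5 ack, the two copies of FP5 inherited by bolts 6 and 7 cancel each other at the leaf, so the context collected up to bolt 5 never reaches the Acker.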
  22. Approach-2: share split
      • For the input rootMsg, INIT a BIG SHARE (8 bytes), EMBED it as metadata, pass it down
      • Storm SPLITs the attached share at each bolt and EMBEDs the parts; repeat this until the leaf ...
      • Only the leaf ACKs to the Acker, reporting the share it has received
      • Acker REDO: decrease by the reported share; finally 0 means OK, otherwise failure
        – No random number (no collision), no XOR; inline embedded; the split is transparent to the App
        – +/- (mod) follow the commutative & associative laws, which resolves the out-of-order issue
      Like: IPO/stock shares (split, increase shares)
      [Figure: nodes 0-9 and the Acker, whose entry is (srcNodeID: rootMsgID, BIG-Share). Root message A; shares carried by messages: B 50, C 50; D 25, E 25; F 17, G 17, H 16; I 25, J 25; K 17, L 17; M 16. Acker records shown: (A, 1, 100), (A, 0, 16), (A, 0, 84).]
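A minimal Python sketch of the share-split idea, reproducing the numbers of the example (100 split into 50/50, then 25/25 and 17/17/16). The tree is keyed by message name rather than by bolt to keep it short; `split_share`, `leaf_shares` and `ShareAcker` are hypothetical names, and the equal-split policy is an assumption that mirrors the prototype's global equal split.

```python
def split_share(share, n):
    """Split `share` into n positive integer parts that sum to `share`."""
    if share < n:
        raise ValueError("insufficient share to split")
    base, extra = divmod(share, n)
    return [base + 1] * extra + [base] * (n - extra)

def leaf_shares(tree, msg, share):
    """Yield (leaf message, share) pairs for everything below `msg`."""
    children = tree.get(msg, [])
    if not children:
        yield msg, share          # leaf: this share is acked to the Acker
        return
    for child, part in zip(children, split_share(share, len(children))):
        yield from leaf_shares(tree, child, part)

class ShareAcker:
    """Tracks one root message: starts at the big share, counts down."""
    def __init__(self, total):
        self.remaining = total
    def ack(self, share):
        self.remaining -= share
    def ok(self):
        return self.remaining == 0

tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G", "H"],
        "D": ["I"], "E": ["J"], "F": ["K"], "G": ["L"], "H": ["M"]}
acker = ShareAcker(100)
for leaf, share in leaf_shares(tree, "A", 100):
    acker.ack(share)
print(acker.ok())  # True: 25 + 25 + 17 + 17 + 16 == 100
```

Since subtraction of the reported shares is order-independent, the leaf acks may arrive in any order or be batched (for example 84 and 16) without changing the outcome.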
  23. Approach-2: share split (cont’d)
      • Rare case: INCREASE the share if it is insufficient to split (and also sync up the Acker)
      • The Acker then ADDs the newly increased share (NOT decrease)
      If S - S1 - S2 - … = Sn, then S - S1 - S2 - … - Sn = 0 (Acks may be batched)
      [Figure: nodes 0-9 and the Acker, whose entry is (srcNodeID, RootMsgID, Share). Root message A with share 100; B 99, C 1; F 33, G 33, H 34. Acker records shown: (A, 100) and (A, +99): increase share, sync up Acker.]
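The increase-share case can be sketched the same way. In the example, C holds a share of 1 but must feed three downstream messages, so it tops its share up by 99 and tells the Acker to add 99. The Python below is an illustrative sketch; `split_or_increase` and the `top_up` parameter are hypothetical, and the source does not say how the increment is chosen.

```python
class ShareAcker:
    """Per-root-message balance: +share on increase, -share on leaf ack."""
    def __init__(self, initial):
        self.balance = initial
    def increase(self, extra):
        self.balance += extra     # sync-up from a bolt that topped up
    def ack(self, share):
        self.balance -= share     # leaf reports the share it received

def split_or_increase(share, n, acker, top_up=99):
    """Split `share` into n parts, topping it up first if it is too small."""
    if share < n:
        acker.increase(top_up)    # the Acker ADDs the new share
        share += top_up
    if share < n:
        raise ValueError("top_up too small for this fan-out")
    base, extra = divmod(share, n)
    return [base] * (n - extra) + [base + 1] * extra

acker = ShareAcker(100)                       # root message A, share 100
b_share, c_share = 99, 1                      # unequal split, as in the figure
parts = split_or_increase(c_share, 3, acker)  # C tops up 1 -> 100
acker.ack(b_share)                            # leaf below B
for p in parts:                               # leaves below F, G, H
    acker.ack(p)
print(parts, acker.balance)  # [33, 33, 34] 0
```

The balance ends at 0 because 100 + 99 was granted in total and 99 + 33 + 33 + 34 was acked back.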
  24. Prototyping
      • Implemented Approach-2 (share-split)
      • Integrated with Storm 1.0.1 (released in May 2016)
        – Storm core (~200 LOC in Clojure, a LISP-like language) and Java APIs (~200 LOC including some traces/tests)
      • Implementation notes:
        – Supports BasicBolt, removes randomNum, re-uses some existing structures/APIs, e.g., Anchors-to-ids (RootID:shareAttached), Ack sending
        – Global pre-defined split share at all bolts (equal split)
          • Next: configurable split approach per bolt
        – To split the share exactly, build a 1-step delayed emit
          • Pre-split the input share
          • Once a new tuple is generated, emit queues it internally until the next tuple comes out
          • Finally, explicitly call emitDone(), so that the last tuple takes over all the remaining share and is emitted
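The 1-step delayed emit can be pictured with the following Python sketch. It is not the prototype's Clojure/Java code: `DelayedEmitter`, its `send` callback and the fixed `per_tuple_share` are hypothetical stand-ins for the pre-split share and the internal queue described above; only the `emitDone()` idea comes from the source.

```python
class DelayedEmitter:
    """Holds each tuple back by one step so the last one can take the rest."""

    def __init__(self, input_share, per_tuple_share, send):
        self.left = input_share           # share not yet handed out
        self.per_tuple = per_tuple_share  # pre-split share per tuple
        self.send = send                  # callback: send(tuple, share)
        self.pending = None               # the one tuple held back

    def emit(self, tup):
        # A newer tuple exists, so the held one is not the last: release it.
        if self.pending is not None:
            if self.left - self.per_tuple < 1:
                raise RuntimeError("share exhausted; an increase-share step is needed")
            self._flush(self.per_tuple)
        self.pending = tup

    def emit_done(self):
        # The held tuple is the last one: it takes over all remaining share.
        if self.pending is not None:
            self._flush(self.left)
        return self.left                  # unassigned share (0 if any tuple was emitted)

    def _flush(self, share):
        self.send(self.pending, share)
        self.left -= share
        self.pending = None

sent = []
emitter = DelayedEmitter(50, 17, lambda tup, share: sent.append((tup, share)))
for tup in ("F", "G", "H"):
    emitter.emit(tup)
emitter.emit_done()
print(sent)  # [('F', 17), ('G', 17), ('H', 16)]
```

Holding one tuple back is what lets the bolt split exactly without knowing in advance how many tuples it will emit.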
  25. Benchmark
      • Function & performance
        – Network traffic, CPU, latency/throughput
      • Reference: IBM whitepaper (Storm vs. IBM InfoSphere): 7 layers
        – We use Wikipedia as the data source; word processing
      • Setup: 3 hosts on a 1000 Mbps network, each with Ubuntu 15.10 (4.2.0), Storm 1.0.1, E5-2643 @ 3.40GHz, 24 cores, 256GB DRAM
  26. Result: function & performance
      • Function: inject errors and validate reliability detection: Pass
        – Same-level reliability as the existing approach
      • Performance: same HW/SW config and processing logic
        – 16KB tuple, 100 pending, 48 parallelism per bolt
        – 4 workers & 12 Ackers per host
  27. Test1: 3 layers
      • 1/3 Ack traffic, 18% faster, 9% less CPU
      Metric                | Current | New
      Ack traffic (Mil)     | 3903    | 1301
      End-end latency (ms)  | 241     | 197
      CPU (per Java worker) | 350%    | 320%
  28. Test2: 7 layers
      • 1/5 Ack traffic, 23% faster, 14% less CPU
      Metric                | Current | New
      Ack traffic (Mil)     | 2685    | 537
      End-end latency (ms)  | 197     | 151
      CPU (per Java worker) | 250%    | 215%
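The headline figures of both tests follow directly from the measured values. The short Python check below recomputes them, reading "faster" and "less CPU" as relative reductions of latency and CPU usage; that reading is an assumption, and `reduction_pct` is a helper introduced here.

```python
def reduction_pct(current, new):
    """Relative reduction from `current` to `new`, in percent."""
    return 100 * (current - new) / current

# (current, new) pairs as reported for the two tests
results = {
    "Test1 (3 layers)": {"ack_mil": (3903, 1301), "latency_ms": (241, 197), "cpu_pct": (350, 320)},
    "Test2 (7 layers)": {"ack_mil": (2685, 537), "latency_ms": (197, 151), "cpu_pct": (250, 215)},
}

for name, r in results.items():
    ack_ratio = r["ack_mil"][0] / r["ack_mil"][1]
    print(name,
          f"ack traffic 1/{ack_ratio:.0f},",
          f"latency -{reduction_pct(*r['latency_ms']):.0f}%,",
          f"CPU -{reduction_pct(*r['cpu_pct']):.0f}%")
```

The Ack traffic ratios (3 and 5) line up with the observation that only leaf nodes ack in the new approach, so the saving grows with topology depth.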
  29. MORE
      • Larger topology? Quick test of 11 layers:
        – 1/9 traffic
      • Presumably, the larger the topology, the larger the gain
      • Next
        – Refine multi-Acker
        – Implement the “Increase Share” operation
        – Configurable split method per bolt
          • So the developer can specify the desired split rather than a fixed/global one
      • May integrate with Twitter Heron? Or apply to other areas?
        – e.g., function call graph? performance trace? (more…)
  30. End-end IoT landscape
      Continuous, scalable, real-time processing
  31. Unified data processing
      • Lambda architecture: fusing “historical” + “new” data
        – Proposed by Nathan Marz (about 5 years ago): batch + streaming
        – Widely adopted by many Internet companies
  32. SUMMARY
      • 2 innovative & inspiring streaming reliability algorithms
        – Guaranteed processing with minimized memory footprint
        – More scalable, efficient & fast, and even beautiful
      • Demonstrated in Storm
        – 1/N Ack traffic, only needed at leaf nodes
          • N is the topology depth. Usually there are only a few leaf nodes, for aggregation, DB saving, etc.
          • Meanwhile, 23% faster, 14% less CPU
        – Transparent to the App except for the final explicit emitDone() call
      • Applying to other interesting areas...
        – Distributed replication, tx, exact-state tracking, …
  33. THANK YOU!
      • Feedback or comments? Talk with us!
        – Any flaws, constraints, or room to improve?
        – We will then discuss with the Storm community; code can be shared if needed
      Junping.Zhao@emc.com
      ZhaoJP@gmail.com
