
Comparison of various streaming technologies

This meetup will take us through various streaming technologies such as Storm, Flink, InfoSphere Streams, and Spark Streaming.

Agenda

• Characteristics of streaming technologies

• Introduction to Apache Storm, Trident and Flink

• Examples of Code and API

• Deep-dive of Spark Streaming

• Comparison of Spark Streaming with other streaming technologies

• Benchmark of Spark Streaming (with code walkthrough)

We will supplement the theory with sufficient examples.


  1. © 2015 IBM Corporation. Power of data. Simplicity of design. Speed of innovation. IBM Spark. Comparison of various streaming analytics technologies. Mario Briggs, Sachin Aggarwal. March 12, 2016
  2. Agenda • Streaming Analytics System Architecture • Features needed in Streaming Analytics Applications (what and why?) • Apache Storm, Trident and Flink vis-à-vis streaming features • Spark Streaming • Summing it all up • How to benchmark Spark Streaming?
  3. What is a Streaming Analytics Application? (diagram: Event Stream → Collecting/Processing → Event Result)
  4. Streaming System Architecture • Continuous Operator Architecture – static scheduling (diagram: Source → Filter/Transform/Aggregate processing stages → Sink)
  5. Task Scheduler Architecture (diagram: Task Set → Task Scheduler → Task Executors, each running Thread + Data)
  6. Streaming System Architectures – Pros & Cons • Continuous Operator Architecture: no task-scheduling overheads • Task Scheduler Architecture: dynamic data partitions can add more parallelism
  7. Features Needed by Streaming Analytics Applications • Fault Tolerance • Message Processing Guarantees • Back Pressure • Stateful vs Stateless • Built-in Primitives vs Roll Your Own • Lambda Architecture • Better Resource Utilization • Tuple-Level Processing vs Micro-batch
  8. Fault Tolerance – Typical Architecture (diagram: Master Node, Cluster Co-ordination, Workers) What happens when a worker or node dies? What if the worker is a receiver? What happens when the master dies?
  9. Message Processing Guarantees • At-most-once processing: messages may be lost or not processed; no message is processed in duplicate. • At-least-once processing: no messages are lost, but messages can be processed in duplicate. • Exactly-once processing: no messages lost, no duplicate processing of messages. • End-to-end message processing guarantees depend on the guarantees of the 3 main elements: Source (requires reliable sources), Processing (the processing guarantee of the streaming system), Sink (requires a sink with atomic-write capability).
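The slide's point about sinks can be made concrete with a minimal plain-Python sketch (not tied to any of the frameworks discussed; all names are illustrative): end-to-end "exactly once" is commonly assembled from at-least-once delivery plus an idempotent sink that deduplicates by message id.

```python
def process(msg):
    # A stand-in for the streaming computation.
    return msg["value"] * 2

class IdempotentSink:
    """Sink with (simulated) atomic, deduplicating writes."""
    def __init__(self):
        self.seen = set()    # ids already written; a transactional store in practice
        self.results = []

    def write(self, msg_id, value):
        if msg_id in self.seen:   # duplicate redelivery: skip, preserving exactly-once output
            return
        self.seen.add(msg_id)
        self.results.append(value)

sink = IdempotentSink()
# At-least-once source: message 2 is redelivered after a simulated failure.
for msg in [{"id": 1, "value": 10}, {"id": 2, "value": 20},
            {"id": 2, "value": 20}, {"id": 3, "value": 30}]:
    sink.write(msg["id"], process(msg))
```

Although message 2 is processed twice (at least once), the sink records it once, so the observable output is exactly once.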
  10. Back Pressure • Processing conditions in a streaming application can change over time: unexpected slowdowns in the processing components (e.g. data stores being written to or looked up become slow), or an unexpected surge in the rate/load of input data. • Back pressure determines what happens when the above occurs: Nothing special (lose data, or the system becomes unstable and components start crashing)? Make upstream components keep repeating the same work? Or do all components upstream of the slowed-down component throttle themselves by some mechanism?
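The throttling option can be sketched with a toy bounded buffer in plain Python (a conceptual illustration, not any framework's mechanism; all names are ours): when the downstream buffer is full, the producer is signalled to slow down rather than data being lost.

```python
from collections import deque

class ThrottledPipeline:
    """Toy back-pressure: offers are rejected (upstream must retry/slow down)
    when the downstream buffer is full, instead of dropping data."""
    def __init__(self, capacity):
        self.buffer = deque()
        self.capacity = capacity
        self.throttled = 0        # how often upstream was told to back off

    def offer(self, item):
        if len(self.buffer) >= self.capacity:
            self.throttled += 1   # back-pressure signal to the producer
            return False
        self.buffer.append(item)
        return True

    def drain(self, n):
        # Downstream consumer frees capacity, releasing the pressure.
        return [self.buffer.popleft() for _ in range(min(n, len(self.buffer)))]

pipe = ThrottledPipeline(capacity=2)
accepted = [pipe.offer(i) for i in range(3)]   # third offer is rejected
drained = pipe.drain(2)                        # consumer catches up
```

Real systems propagate this signal transitively so every component upstream of the slow one throttles, which is the third option on the slide.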
  11. Stateful vs Stateless, Built-in Primitives vs Roll Your Own • Some of the most common logic you implement in a streaming application: calculate and maintain aggregates over time, by a key; join multiple input streams by a key; look up a master-data ‘table’; rolling counts over windows (by time or count); trigger when a threshold condition is breached. • Stateful vs Stateless is linked to fault tolerance: what happens when you maintain aggregates and the node holding them goes down? Do you have to replay from the start, or can you continue from the last saved/checkpointed state?
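The "counts over time windows, by a key" primitive mentioned above can be sketched in a few lines of plain Python (illustrative only; events are assumed to be `(timestamp_ms, key)` pairs):

```python
from collections import defaultdict

def tumbling_counts(events, window_ms):
    """Count events per key in fixed, non-overlapping (tumbling) time windows.

    events: iterable of (timestamp_ms, key) pairs.
    Returns {window_index: {key: count}}.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        windows[ts // window_ms][key] += 1   # window index = floor(ts / width)
    return {w: dict(counts) for w, counts in windows.items()}

events = [(100, "a"), (900, "a"), (1100, "b"), (1900, "a")]
counts = tumbling_counts(events, window_ms=1000)
```

A stateful engine keeps `windows` as checkpointed state so a restarted node resumes from the last checkpoint; a stateless one would have to replay all events to rebuild it.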
  12. Lambda Architecture • Many systems today need to do both real-time and historical/batch processing on the same data. • Can you share the ‘same’ implementation and logic across these two? • If you can’t, chances are you have 2 separate implementations of similar logic that are not in sync, and you will get different answers from the real-time vs the historical system.
  13. Better Resource Utilization & Tuple-Level vs Micro-batch • Can your streaming application share clusters with other jobs via a common resource manager? • Can you reserve resources at the required level of granularity?
  14. Apache Storm – Let’s understand the programming model
  15. Apache Storm – Process Model. Topology: a network of spouts and bolts
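The spout/bolt model can be illustrated with a toy word-count "topology" in plain Python (no Storm APIs; all class names are ours): a spout emits tuples, bolts transform or aggregate them, and the topology is simply the wiring between them.

```python
class SentenceSpout:
    """Source of the stream: emits raw sentences (tuples)."""
    def __init__(self, sentences):
        self.sentences = sentences
    def emit(self):
        yield from self.sentences

class SplitBolt:
    """Stateless bolt: splits each sentence into words."""
    def execute(self, sentence):
        yield from sentence.split()

class CountBolt:
    """Stateful bolt: maintains running word counts."""
    def __init__(self):
        self.counts = {}
    def execute(self, word):
        self.counts[word] = self.counts.get(word, 0) + 1

# "Topology": spout -> split bolt -> count bolt.
spout = SentenceSpout(["to be or not", "to be"])
split_bolt, count_bolt = SplitBolt(), CountBolt()
for sentence in spout.emit():
    for word in split_bolt.execute(sentence):
        count_bolt.execute(word)
```

In real Storm each spout/bolt runs as parallel tasks across workers and tuples flow over the network; the state in `CountBolt` is exactly the kind of state Storm itself does not protect, as the next slides discuss.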
  16. Apache Storm – Features • True streaming, i.e. message-level. • Stateless: cannot maintain state across messages; executor failures require replay from the start to rebuild state (fault tolerance of executors). • No built-in primitives: all user code (no built-in aggregates/joins/grouping). • Message processing guarantees: at most once – Yes (no tracking & replay); at least once – Yes (tracks every tuple and its children and replays if necessary); exactly once – No. • Run on shared clusters: No. The scheduler component (Nimbus) doesn’t work with YARN/Mesos; the Hortonworks distro allows Storm to run as a YARN application.
  17. Apache Storm – Features • Fault Tolerance (Master): when the Nimbus node dies, no reassignment of workers to other nodes happens if a worker node then dies; existing workers continue to run, and if a worker fails, the ‘Supervisor’ will restart it; no new jobs can be submitted. • Back Pressure: No. • Unified Programming Model: No.
  18. Apache Trident • Batching support for Apache Storm. • Stateful: can maintain state across messages/batches; on failure of an executor, state is recoverable from external stores rather than by replaying from the start.
  19. Apache Trident – What’s new in the programming model?
  20. Apache Trident • Built-in primitives – Yes: aggregates/grouping/functions/filters/joins. • Message processing guarantees: all three levels; exactly once (atomic txns to an external store, using a unique batchId & guaranteed ordering of updates among batches). • Run on shared clusters, Back Pressure: No (same as Storm). • Fault Tolerance (Master): same as Storm.
  21. Apache Storm and Trident drawbacks. Source of Storm’s problems – multiple topologies’ tasks run in a single JVM process
  22. Apache Storm and Trident drawbacks • One Storm ‘worker’ (a single JVM process) runs too many different components and multiple different tasks, each requiring different resources. • Each tuple needs to pass through 4 threads in a worker. • Single global queues and log files across all tasks in a worker: hard to debug and size correctly, so you have to oversize a lot, which is not efficient (e.g. oversized memory means a stack dump will cause missed heartbeats). • The Nimbus scheduler does not support resource reservation and isolation at the worker level. • Because of the above, Twitter runs each topology (streaming application) on a dedicated cluster. http://dl.acm.org/citation.cfm?id=2742788
  23. Heron Topology
  24. Apache Flink • True streaming. • Data-exchange buffer control. • Stateful – Yes: lightweight async checkpointing mechanism (barriers). • Built-in primitives – Yes: transformations, aggregations, windows, joins, connect, split; checkpointed local variables; tumbling windows, sliding windows; window triggers; time windows (event-time, ingestion-time, processing-time support); train models, update & predict. • Message processing guarantees: exactly once; exactly-once sink -> HDFS.
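The tumbling vs sliding windows mentioned in the slide differ in that sliding windows overlap: each event belongs to every window whose span covers its timestamp. A plain-Python sketch (illustrative only, not Flink's API; window starts are assumed to be multiples of the slide):

```python
def sliding_counts(events, window_ms, slide_ms):
    """Count events per key in overlapping (sliding) time windows.

    events: iterable of (timestamp_ms, key) pairs.
    Returns {window_start_ms: {key: count}}.
    """
    windows = {}
    for ts, key in events:
        # Earliest window start whose [start, start + window_ms) span covers ts.
        first = ((ts - window_ms) // slide_ms + 1) * slide_ms
        start = max(0, first)
        while start <= ts:                      # assign event to every covering window
            windows.setdefault(start, {}).setdefault(key, 0)
            windows[start][key] += 1
            start += slide_ms
    return windows

# 1 s windows sliding every 500 ms: the event at t=700 lands in two windows.
counts = sliding_counts([(100, "a"), (700, "a")], window_ms=1000, slide_ms=500)
```

With `slide_ms == window_ms` this degenerates into tumbling windows, which is why engines like Flink expose both through one windowing abstraction.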
  25. Apache Flink Process Model
  26. Apache Flink • Fault Tolerance of Master: Yes (since 0.10.0; only for YARN & standalone). • Back Pressure: Yes (simple watermark for transfer buffers). • Run on shared clusters – Yes. • Programming Model: overlapping programming model (batch does not have SQL support).
  27. Spark Streaming • Micro-batching. • Stateful: Yes (updateStateByKey, mapWithState functions). • Built-in primitives – Yes: aggregates/grouping/functions/filters/joins; sliding/tumbling windows; train models & predict in a streaming app; event time (slated for 2.0). • Message processing guarantees: exactly once. • Run on shared clusters: Yes.
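The semantics of updateStateByKey can be sketched in plain Python (this mimics the behaviour only, it is not the Spark API; `running_sum` is our illustrative update function): for every key seen in a micro-batch, the batch's new values are folded into the previously checkpointed state for that key.

```python
def update_state_by_key(state, batches, update_fn):
    """Fold each micro-batch of (key, value) pairs into per-key state.

    update_fn(new_values, old_state_or_None) -> new_state, mirroring the
    shape of Spark Streaming's update function.
    """
    for batch in batches:
        grouped = {}
        for key, value in batch:
            grouped.setdefault(key, []).append(value)
        for key, values in grouped.items():
            state[key] = update_fn(values, state.get(key))
    return state

# Running sum per key across batches.
running_sum = lambda new_values, old: (old or 0) + sum(new_values)
state = update_state_by_key({}, [[("a", 1), ("b", 2)], [("a", 3)]], running_sum)
```

In Spark the `state` dict is an RDD checkpointed to reliable storage, which is what lets a restarted driver continue from the last batch instead of replaying the whole stream.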
  28. Spark Streaming • Back Pressure: Yes. • Fault Tolerance of Master: Yes (DStream state saved to a checkpoint file, used to restart the master). • Programming Model: integrates wholly with Spark (MLlib, SparkSQL/DataFrames, RDDs); overlapping model for batch and streaming versions of an application.
  29. Summing it all up

  Feature | Storm | Trident | Flink | Spark Streaming
  Stateful | No | Yes (with external stores) | Yes | Yes
  Message Processing Guarantees | No exactly once | Exactly once (with external stores) | Exactly once | Exactly once
  Back Pressure | No | No | Yes | Yes
  Built-in Primitives | No | Yes | Yes | Yes
  Overlapping Programming Model | No | No | Yes | Yes
  Works with Resource Schedulers | No | No | Yes | Yes
  True Streaming vs Micro-batch | True streaming | Micro-batch | True streaming | Micro-batch
  30. How to benchmark Spark Streaming?
  31. Setup Information • load-data: pulls data from Twitter and stores it as a fixed dataset. • push-to-kafka: reads the fixed dataset and pushes it to Kafka at a specific rate. • spark-benchmarks: reads data from Kafka and executes the benchmark code. • flink-benchmarks (WIP): reads data from Kafka and executes the benchmark code.
  32. Listener Interface in Spark • StreamingListener – a listener interface for receiving information about an ongoing streaming computation. • Functions we need to override: – onReceiverStarted: called when a receiver has been started – onReceiverError: called when a receiver has reported an error – onReceiverStopped: called when a receiver has been stopped – onBatchSubmitted: called when a batch of jobs has been submitted for processing – onBatchStarted: called when processing of a batch of jobs has started – onBatchCompleted: called when processing of a batch of jobs has completed – onOutputOperationStarted: called when processing of a job of a batch has started – onOutputOperationCompleted: called when processing of a job of a batch has completed
  33. Implementation Details • On the first batch (flag check): record startTime. • For each batch: totalRecords += batchCompleted.batchInfo.numRecords; increment batchCount. • If totalRecords >= recordLimit: record endTime; avgLatency = totalDelay / totalRecords; recordThroughput = totalRecords / totalTime.
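The steps above can be sketched in plain Python (the function name and the `(submitted_ms, completed_ms, num_records)` tuple shape are our assumptions; in the real benchmark these values come from `batchCompleted.batchInfo` inside the listener callbacks):

```python
def benchmark_summary(batches, record_limit):
    """Accumulate per-batch listener data into latency/throughput metrics.

    batches: iterable of (submitted_ms, completed_ms, num_records) tuples,
    in completion order.
    """
    start = end = None
    total_records = total_delay = 0
    for submitted, completed, num_records in batches:
        if start is None:
            start = submitted                 # first batch: record startTime
        total_records += num_records
        total_delay += completed - submitted  # this batch's total delay
        if total_records >= record_limit:
            end = completed                   # record endTime and stop
            break
    total_time_s = (end - start) / 1000.0
    return {
        "avg_latency_ms": total_delay / total_records,
        "throughput_rps": total_records / total_time_s,
    }

# Two batches of 100 records each, stopping at a 200-record limit.
result = benchmark_summary([(0, 500, 100), (500, 1100, 100)], record_limit=200)
```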
  34. Observation: Execution with various batch intervals

  Batch interval (ms) | Batch count | Total records | Total consumed time (s) | Avg latency per batch interval (ms) | Avg delay per batch (ms) | Avg records/second
  1000 | 1,056 | 100,000,020 | 1,056 | 1,429 | 429 | 94,688.74
  900 | 1,175 | 100,000,020 | 1,058 | 1,362 | 462 | 94,541.30
  800 | 1,322 | 100,000,020 | 1,058 | 1,291 | 491 | 94,541.30
  700 | 1,487 | 100,000,020 | 1,041 | 1,184 | 484 | 96,034.56
  600 | 1,770 | 100,000,020 | 1,062 | 1,102 | 502 | 94,159.49
  500 | 1,681 | 100,000,020 | 1,042 | 1,104 | 604 | 95,943.16
  400 | 1,355 | 100,000,020 | 1,033 | 1,202 | 802 | 96,774.62
  300 | 979 | 100,000,020 | 1,062 | 1,712 | 1,412 | 94,190.36
  200 | 755 | 100,000,020 | 1,092 | 2,212 | 2,012 | 91,577.21
  100 | 718 | 100,000,020 | 1,076 | 2,397 | 2,297 | 92,963.52
  35. Observation: Execution with various input rates

  Batch interval (ms) | Input rate | Batch count | Timestamps ×4 (epoch ms) | Total records | Total time (s) | Avg latency (ms) | Records/second
  500 | 60 | 1144 | 1457733645809, 1457733645982, 1457734424894, 1457734424894 | 100,017,510 | 779.085 | 1,165.65 | 128,378.17
  500 | 54 | 1056 | 1457734624836, 1457734624971, 1457735504549, 1457735504549 | 108,014,596 | 879.713 | 1,378.85 | 122,783.90
  500 | 45 | 1552 | 1457735724327, 1457735724487, 1457736730486, 1457736730486 | 112,535,122 | 1,006.159 | 1,130.43 | 111,846.26
  500 | 36 | 2290 | 1457736944320, 1457736944450, 1457738209266, 1457738209266 | 120,010,057 | 1,264.946 | 1,007.91 | 94,873.66
  500 | 24 | 3600 | 1457738405811, 1457738405990, 1457740206047, 1457740206047 | 120,005,209 | 1,800.236 | 962.95 | 66,660.82
  500 | 12 | 6517 | 1457740349824, 1457740349984, 1457743608456, 1457743608456 | 120,000,935 | 3,258.632 | 875.29 | 36,825.56
  500 | 6 | 12956 | 1457743741277, 1457743741433, 1457750219374, 1457750219374 | 120,000,857 | 6,478.097 | 766.60 | 18,524.09
  37. Backup
