© 2015 IBM Corporation
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Comparison of various streaming analytics
technologies
Mario Briggs
Sachin Aggarwal
March 12, 2016
Agenda
• Streaming Analytics System Architecture
• Features needed in Streaming Analytics Applications (what and why?)
• Apache Storm, Trident and Flink vis-à-vis streaming features
• Spark Streaming
• Summing it all up
• How to benchmark Spark Streaming?
What is a Streaming Analytics Application?
[Diagram: Event Stream → Collecting/Processing → Event Result]
Streaming System Architecture
• Continuous Operator Architecture – Static Scheduling
[Diagram: Source → (Filter, Transform, Aggregate processing) → (Filter, Transform, Aggregate processing) → Sink]
Task Scheduler Architecture
[Diagram: a Task Scheduler assigns a Task Set to Task Executors, each running threads paired with data partitions]
Streaming System Architectures – Pros & Cons
• Continuous Operator Architecture
– No task scheduling overheads
• Task Scheduler Architecture
– Dynamic data partitions can add more parallelism
Features Needed by Streaming Analytics Applications
• Fault Tolerance
• Message Processing Guarantees
• Back Pressure
• Stateful vs Stateless
• Built-in Primitives vs Roll Your Own
• Lambda Architecture
• Better Resource Utilization
• Tuple-Level Processing vs Micro-batch
Fault Tolerance
Typical architecture: a Master Node handles cluster co-ordination; Workers do the processing.
[Diagram: Master Node + Cluster Co-ordination managing four Workers]
What happens when a worker or node dies? What if that worker is a receiver? What happens when the master dies?
Message Processing Guarantees
• At-most-once processing
– Messages can be lost / not processed. No message is processed in duplicate.
• At-least-once processing
– No messages are lost, but messages can be processed in duplicate.
• Exactly-once processing
– No messages lost. No duplicate processing of messages.
• End-to-end message processing guarantees depend on the guarantees of the 3 main elements
– Source: requires reliable sources.
– Processing: the processing guarantee of the streaming system.
– Sink: requires a sink with atomic-write capability.
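The sink-side requirement can be sketched in a few lines of plain Python (an illustrative stand-in, not any specific system's API): with at-least-once delivery upstream, a sink that atomically remembers which message ids it has already applied turns duplicate redeliveries into no-ops, yielding exactly-once results.

```python
# Sketch, assuming each message carries a unique id: an idempotent sink
# that absorbs duplicates produced by at-least-once replay.
class IdempotentSink:
    def __init__(self):
        self.applied = set()   # ids already written
        self.total = 0         # running aggregate

    def write(self, msg_id, value):
        if msg_id in self.applied:
            return             # duplicate from a replay: skip
        # in a real sink, the id check and the write must be one atomic txn
        self.applied.add(msg_id)
        self.total += value

sink = IdempotentSink()
# at-least-once delivery: message 2 arrives twice after a replay
for msg_id, value in [(1, 10), (2, 5), (2, 5), (3, 1)]:
    sink.write(msg_id, value)
print(sink.total)  # 16, not 21: the duplicate was absorbed
```

The same idea is what "atomic-write capability" buys you: the dedupe check and the state update commit together, so a crash between them cannot double-count.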
Back Pressure
• Processing conditions in a streaming application can change over time
– Unexpected slowdowns in the processing components (e.g. data stores being written to or looked up become slow)
– Unexpected surges in the rate/load of input data
• Back pressure determines what happens when the above occurs
– Nothing special: data is lost, or the system becomes unstable and components start crashing
– Upstream components keep repeating the same work?
– Or all components upstream of the slowed-down component throttle themselves by some mechanism
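The throttling option in the last bullet can be sketched with a bounded blocking queue (plain Python, not any particular framework's mechanism): a full buffer blocks the faster producer until the slower consumer catches up, so nothing is dropped and in-flight memory stays bounded.

```python
# Sketch: back pressure via a bounded queue between producer and consumer.
import queue
import threading
import time

buf = queue.Queue(maxsize=4)   # small in-flight budget -> back pressure point
consumed = []

def consumer():
    # the slow downstream component
    while True:
        item = buf.get()
        if item is None:
            break
        time.sleep(0.001)      # simulate slow processing
        consumed.append(item)

t = threading.Thread(target=consumer)
t.start()
for i in range(100):
    buf.put(i)                 # blocks whenever 4 items are in flight
buf.put(None)                  # shut down the consumer
t.join()
print(len(consumed))           # 100: nothing was dropped
```

The producer's rate is automatically clamped to the consumer's, which is exactly the behaviour a streaming system wants to propagate hop by hop upstream.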
Stateful vs Stateless, Built-in Primitives vs Roll Your Own
• Some of the most common logic you implement in a streaming application:
– Calculate and maintain aggregates over time, by a key
– Join multiple input streams by a key; look up a master-data ‘table’
– Rolling counts over windows (by time or count); trigger when a threshold is breached
• Stateful vs stateless is linked to fault tolerance
– What happens when you maintain aggregates and the node holding them goes down? Do you have to replay from the start, or can you continue from the last saved/checkpointed state?
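The checkpoint-versus-replay trade-off can be shown concretely (a minimal plain-Python sketch, with `pickle` standing in for durable checkpoint storage): after a failure we restore the last checkpoint and replay only the batches since then, not the whole stream.

```python
# Sketch: a keyed running count with a checkpoint and recovery.
import pickle

def process(events, state):
    # keyed aggregate: add each event's count to its key's running total
    for key, n in events:
        state[key] = state.get(key, 0) + n

state = {}
batch1 = [("a", 1), ("b", 2)]
batch2 = [("a", 3)]

process(batch1, state)
checkpoint = pickle.dumps(state)   # durable snapshot taken after batch 1

process(batch2, state)             # ... and then the node "dies"

state = pickle.loads(checkpoint)   # recover from the last checkpoint
process(batch2, state)             # replay only the lost batch
print(state)                       # {'a': 4, 'b': 2}
```

Without the checkpoint, recovery would mean replaying batch 1 as well — and in a real system, everything since the stream began.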
Lambda Architecture
• Many systems today need to do both real-time and historical/batch processing on the same data
• Can you share the ‘same’ implementation and logic across these two?
– If you can’t, chances are you have 2 separate implementations of similar logic that drift out of sync
– You will get different answers from the real-time vs the historical system
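The "shared logic" point can be illustrated in a few lines (a hypothetical `enrich_and_count` function, not from the slides): if the business logic is one pure function over records, the batch layer and the speed layer can both call it and cannot disagree.

```python
# Sketch: one implementation serving both the batch and real-time paths.
def enrich_and_count(records):
    # the shared business logic: filter clicks, count by user
    counts = {}
    for user, action in records:
        if action == "click":
            counts[user] = counts.get(user, 0) + 1
    return counts

history = [("u1", "click"), ("u2", "view"), ("u1", "click")]
live_stream = iter(history)        # a stream is just lazy records here

# batch answer and streaming answer come from the same code path
assert enrich_and_count(history) == enrich_and_count(live_stream)
```

This is the property an overlapping programming model (Flink's DataSet/DataStream, Spark's RDD/DStream) gives you at the framework level.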
Better Resource Utilization & Tuple-Level vs Micro-batch
• Can your streaming application share clusters with other jobs via a common resource manager?
• Can you reserve resources at the required granularity?
Apache Storm – Let’s understand the programming model
Apache Storm – Process Model
Topology: a network of spouts and bolts
Apache Storm – Features
• True streaming, i.e. message level
• Stateless
– Cannot maintain state across messages
– Executor failures require replay from the start to rebuild state (fault tolerance of the executor)
• No built-in primitives
– All user code (no built-in aggregates/joins/grouping)
• Message processing guarantees
– At most once: yes (no tracking & replay)
– At least once: yes (tracks every tuple and its children, replays if necessary)
– Exactly once: no
• Run on shared clusters
– No. The scheduler component (Nimbus) doesn’t work with YARN/Mesos
– The Hortonworks distro allows running it as a YARN application
Apache Storm – Features
• Fault Tolerance (Master)
– While the Nimbus node is down, if a worker node dies, no reassignment of its workers to other nodes happens
– Existing workers continue to run; if they fail, the ‘Supervisor’ will restart them
– No new jobs can be submitted
• Back Pressure
– No
• Unified Programming Model
– No
Apache Trident
• Batching support for Apache Storm
• Stateful
– Can maintain state across messages/batches
– On executor failure, state is recoverable from external stores rather than replayed from the start
Apache Trident – What’s new in the programming model?
Apache Trident
• Built-in primitives – yes
– Aggregates/grouping/functions/filters/joins
• Message processing guarantees
– All three levels
– Exactly once (atomic transactions to an external store, using a unique batchId & guaranteed ordering of updates among batches)
• Run on shared clusters, back pressure
– No (same as Storm)
• Fault Tolerance (Master)
– Same as Storm
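The batchId trick above can be sketched in plain Python (the dict is a stand-in for Trident's pluggable transactional state store, not its actual API): the store commits the value and the last applied batchId together, and because batches are applied in a guaranteed order, a replayed batch is recognized by its id and skipped.

```python
# Sketch: exactly-once updates via (value, last_batch_id) committed atomically.
store = {"count": 0, "last_batch": -1}

def apply_batch(store, batch_id, increment):
    if batch_id <= store["last_batch"]:
        return                     # replayed batch: already applied, skip
    # value and batch id must be committed together (one atomic txn)
    store["count"] += increment
    store["last_batch"] = batch_id

apply_batch(store, 0, 7)
apply_batch(store, 1, 5)
apply_batch(store, 1, 5)   # replay after a failure: ignored
print(store["count"])      # 12
```

The ordering guarantee matters: if batch 3 could commit before batch 2, the simple `<=` check would wrongly discard batch 2's update.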
Apache Storm and Trident drawbacks
The source of Storm’s problems: multiple topologies’ tasks run in a single JVM process
Apache Storm and Trident drawbacks
• One Storm ‘worker’ (a single JVM process) runs too many different components and multiple different tasks, each requiring different resources
• Each tuple needs to pass through 4 threads in a worker
• Single global queues and log files across all tasks in a worker
– Hard to debug and size correctly, so you have to oversize a lot; not efficient
– E.g. with oversized memory, taking a stack dump causes missed heartbeats
• The Nimbus scheduler does not support resource reservation and isolation at the worker level
• Because of the above, Twitter runs each topology (streaming application) on a dedicated cluster
http://dl.acm.org/citation.cfm?id=2742788
Heron Topology
Apache Flink
• True streaming
– Data-exchange buffer control
• Stateful – yes
– Light-weight async checkpointing mechanism (barriers)
• Built-in primitives – yes
– Transformations, aggregations, windows, joins, connect, split
– Checkpointed local variables
– Tumbling windows, sliding windows
– Window triggers
– Time windows (event-time, ingestion-time and processing-time support)
– Train models, update & predict
• Message processing guarantees
– Exactly once
– Exactly-once sinks (e.g. HDFS)
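The window primitives listed above can be made concrete with a small sketch (plain Python illustrating the semantics, not Flink's API): tumbling event-time windows of a fixed size, counting events per window by the timestamp the event carries rather than the time it arrives.

```python
# Sketch: tumbling event-time windows of `size` time units.
def tumbling_counts(events, size):
    # events: (event_time, value); an event at time t falls in window t // size
    windows = {}
    for t, _ in events:
        w = t // size
        windows[w] = windows.get(w, 0) + 1
    return windows

events = [(1, "a"), (4, "b"), (12, "c"), (15, "d"), (23, "e")]
print(tumbling_counts(events, 10))  # {0: 2, 1: 2, 2: 1}
```

A sliding window differs only in that each event lands in several overlapping windows; triggers then decide *when* a window's result is emitted, which is where event time vs processing time matters.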
Apache Flink Process Model
Apache Flink
• Fault Tolerance of Master
– Yes (since 0.10.0; YARN & standalone only)
• Back Pressure
– Yes (a simple watermark on transfer buffers)
• Run on shared clusters – yes
• Programming Model
– Overlapping programming model (batch does not have SQL support)
Spark Streaming
• Micro-batching
• Stateful
– Yes (updateStateByKey, mapWithState functions)
• Built-in primitives – yes
– Aggregates/grouping/functions/filters/joins
– Sliding / tumbling windows
– Train models & predict in a streaming app
– Event time (slated for 2.0)
• Message processing guarantees
– Exactly once
• Run on shared clusters
– Yes
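The semantics of `updateStateByKey` can be sketched in plain Python (a simulation of the contract, not the Spark API itself): per micro-batch, an update function combines each key's new values with that key's previous state.

```python
# Sketch: updateStateByKey-style keyed state across micro-batches.
def update_func(new_values, old_state):
    # the user-supplied function: running sum per key
    return (old_state or 0) + sum(new_values)

def update_state_by_key(batch, state, update_func):
    by_key = {}
    for k, v in batch:
        by_key.setdefault(k, []).append(v)
    for k, vals in by_key.items():
        state[k] = update_func(vals, state.get(k))
    return state

state = {}
for batch in [[("a", 1), ("a", 2)], [("b", 5)], [("a", 1)]]:
    state = update_state_by_key(batch, state, update_func)
print(state)  # {'a': 4, 'b': 5}
```

In Spark the resulting state is itself an RDD that is checkpointed, which is what makes the aggregate recoverable after a failure; `mapWithState` adds timeouts and touches only the keys present in the batch.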
Spark Streaming
• Back Pressure
– Yes
• Fault Tolerance of Master
– Yes (DStream state saved to a checkpoint file, used to restart the master)
• Programming Model
– Integrates wholly with Spark (MLlib, Spark SQL/DataFrames, RDDs)
– Overlapping model for batch and streaming versions of an application
Summing it all up

Feature                       | Storm           | Trident                             | Flink        | Spark Streaming
Stateful                      | No              | Yes (with external stores)          | Yes          | Yes
Message Processing Guarantees | No exactly once | Exactly once (with external stores) | Exactly once | Exactly once
Back Pressure                 | No              | No                                  | Yes          | Yes
Built-in Primitives           | No              | Yes                                 | Yes          | Yes
Overlapping Programming Model | No              | No                                  | Yes          | Yes
Work with Resource Schedulers | No              | No                                  | Yes          | Yes
True Streaming                | Yes             | Micro-batch                         | Yes          | Micro-batch
How to benchmark Spark Streaming?
Setup Information
• load-data: pulls data from Twitter and stores it as a fixed dataset
• push-to-kafka: reads the fixed dataset and pushes it to Kafka at a specified rate
• spark-benchmarks: reads data from Kafka and executes the benchmark code
• flink-benchmarks (WIP): reads data from Kafka and executes the benchmark code
Listener Interface in Spark
• StreamingListener – a listener interface for receiving information about an ongoing streaming computation.
• Functions we need to override:
– onReceiverStarted: called when a receiver has been started
– onReceiverError: called when a receiver has reported an error
– onReceiverStopped: called when a receiver has been stopped
– onBatchSubmitted: called when a batch of jobs has been submitted for processing
– onBatchStarted: called when processing of a batch of jobs has started
– onBatchCompleted: called when processing of a batch of jobs has completed
– onOutputOperationStarted: called when processing of a job of a batch has started
– onOutputOperationCompleted: called when processing of a job of a batch has completed
Implementation Details
On the first batch (flag check):
• record startTime
For each batch:
• totalRecords += batchCompleted.batchInfo.numRecords
• batchCount += 1
Once totalRecords >= recordLimit:
• record endTime
• avgLatency = totalDelay / totalRecords
• recordThroughput = totalRecords / totalTime
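The steps above can be sketched as a listener-style class in plain Python (mirroring the shape of `StreamingListener.onBatchCompleted`; the callback's parameters here are simplified stand-ins for the fields of Spark's BatchInfo):

```python
# Sketch: the benchmark's metric accumulation, one call per completed batch.
class BenchmarkListener:
    def __init__(self, record_limit):
        self.record_limit = record_limit
        self.total_records = 0
        self.batch_count = 0
        self.total_delay_ms = 0
        self.start_ms = None
        self.result = None

    def on_batch_completed(self, num_records, batch_delay_ms, now_ms):
        if self.start_ms is None:        # first batch: record startTime
            self.start_ms = now_ms
        self.total_records += num_records
        self.batch_count += 1
        self.total_delay_ms += batch_delay_ms
        if self.total_records >= self.record_limit and self.result is None:
            total_time_s = (now_ms - self.start_ms) / 1000.0
            self.result = (
                self.total_delay_ms / self.total_records,  # avgLatency
                self.total_records / total_time_s,         # recordThroughput
            )

listener = BenchmarkListener(record_limit=300)
now = 0
for _ in range(3):                       # 3 batches, 1 s apart
    now += 1000
    listener.on_batch_completed(num_records=100, batch_delay_ms=500, now_ms=now)
print(listener.result)                   # (5.0, 150.0)
```

In the real benchmark this logic sits inside a Scala `StreamingListener` registered on the StreamingContext, with `batchInfo.numRecords` and `batchInfo.totalDelay` supplying the per-batch numbers.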
Observation: Execution with various batch intervals
Batch interval (ms) | Batch count | Total records | Total consumed time (s) | Avg latency / batch interval (ms) | Avg delay per batch (ms) | Avg records/second
1000 | 1,056 | 100,000,020 | 1,056 | 1,429 | 429   | 94688.7392386
900  | 1,175 | 100,000,020 | 1,058 | 1,362 | 462   | 94541.2998858
800  | 1,322 | 100,000,020 | 1,058 | 1,291 | 491   | 94541.2998858
700  | 1,487 | 100,000,020 | 1,041 | 1,184 | 484   | 96034.5609108
600  | 1,770 | 100,000,020 | 1,062 | 1,102 | 502   | 94159.4948532
500  | 1,681 | 100,000,020 | 1,042 | 1,104 | 604   | 95943.1594458
400  | 1,355 | 100,000,020 | 1,033 | 1,202 | 802   | 96774.6187322
300  | 979   | 100,000,020 | 1,062 | 1,712 | 1,412 | 94190.3586768
200  | 755   | 100,000,020 | 1,092 | 2,212 | 2,012 | 91577.2064378
100  | 718   | 100,000,020 | 1,076 | 2,397 | 2,297 | 92963.5183339
Observation: Execution with various input rate
500 60 1144 1457733645809 1457733645982 1457734424894 1457734424894 100017510 779.085 1165.648763 128378.1744
500 54 1056 1457734624836 1457734624971 1457735504549 1457735504549 108014596 879.713 1378.846423 122783.9034
500 45 1552 1457735724327 1457735724487 1457736730486 1457736730486 112535122 1006.159 1130.425913 111846.2609
500 36 2290 1457736944320 1457736944450 1457738209266 1457738209266 120010057 1264.946 1007.914431 94873.66022
500 24 3600 1457738405811 1457738405990 1457740206047 1457740206047 120005209 1800.236 962.950761 66660.82058
500 12 6517 1457740349824 1457740349984 1457743608456 1457743608456 120000935 3258.632 875.2863643 36825.55594
500 6 12956 1457743741277 1457743741433 1457750219374 1457750219374 120000857 6478.097 766.6043024 18524.09079
Back up
Editor's Notes

  • #8 Add Dynamic Allocation.
  • #15 https://github.com/davidkiss/storm-twitter-word-count
  • #20 https://github.com/Blackmist/TwitterTrending
  • #25 https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/index.html#controlling-latency https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/config.html#configuring-taskmanager-processing-slots
  • #26 https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/config.html#configuring-taskmanager-processing-slots
  • #27 Flink effectively uses distributed blocking queues with bounded capacity. The output side never puts too much data on the wire, thanks to a simple watermark mechanism: if enough data is in flight, we wait before copying more data to the wire until it drops below a threshold. This guarantees that there is never too much data in flight. If new data is not consumed on the receiving side (because no buffer is available), this slows down the sender. http://data-artisans.com/how-flink-handles-backpressure/