© 2015 IBM Corporation
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Comparison of various streaming analytics
technologies
Mario Briggs
Sachin Aggarwal
March 12, 2016
Agenda
• Streaming Analytics System Architecture
• Features needed in Streaming Analytics Applications (what and why?)
• Apache Storm, Trident and Flink vis-à-vis streaming features
• Spark Streaming
• Summing it all up
• How to benchmark Spark Streaming?
What is a Streaming Analytics Application?
[Diagram: Event Stream → Collecting/Processing → Event Result]
Streaming System Architecture
• Continuous Operator Architecture – Static Scheduling
[Diagram: Source → (Filter, Transform, Aggregate processing) → (Filter, Transform, Aggregate processing) → Sink]
Task Scheduler Architecture
[Diagram: a Task Scheduler assigns a Task Set to Task Executors, each running threads paired with data partitions]
Streaming System Architectures – Pros & Cons
• Continuous Operator Architecture
– No task scheduling overheads
• Task Scheduler Architecture
– Dynamic data partitions can add more parallelism
Features Needed by Streaming Analytics Applications
• Fault Tolerance
• Message Processing Guarantees
• Back Pressure
• Stateful vs Stateless
• Built-in Primitives vs Roll Your Own
• Lambda Architecture
• Better Resource Utilization
• Tuple-Level Processing vs Micro-batch
Fault Tolerance
Typical architecture: a Master Node handles cluster co-ordination; Workers do the processing.
[Diagram: Master Node + Cluster Co-ordination managing four Workers]
What happens when a worker or node dies? What if that worker is a receiver? What happens when the master dies?
Message Processing Guarantees
• At-most-once processing
– Messages can be lost / not processed. No message is processed in duplicate.
• At-least-once processing
– No messages are lost, but messages can be processed in duplicate.
• Exactly-once processing
– No messages lost. No duplicate processing of messages.
• End-to-end message processing guarantees depend on the guarantees of the 3 main elements
– Source: requires reliable sources.
– Processing: the processing guarantee of the streaming system.
– Sink: requires a sink with atomic-write capability.
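The sink-side requirement can be sketched in a few lines of plain Python (an illustrative stand-in, not any specific system's API): with at-least-once delivery upstream, a sink that atomically remembers which message ids it has already applied turns duplicate redeliveries into no-ops, yielding exactly-once results.

```python
# Sketch, assuming each message carries a unique id: an idempotent sink
# that absorbs duplicates produced by at-least-once replay.
class IdempotentSink:
    def __init__(self):
        self.applied = set()   # ids already written
        self.total = 0         # running aggregate

    def write(self, msg_id, value):
        if msg_id in self.applied:
            return             # duplicate from a replay: skip
        # in a real sink, the id check and the write must be one atomic txn
        self.applied.add(msg_id)
        self.total += value

sink = IdempotentSink()
# at-least-once delivery: message 2 arrives twice after a replay
for msg_id, value in [(1, 10), (2, 5), (2, 5), (3, 1)]:
    sink.write(msg_id, value)
print(sink.total)  # 16, not 21: the duplicate was absorbed
```

The same idea is what "atomic-write capability" buys you: the dedupe check and the state update commit together, so a crash between them cannot double-count.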
Back Pressure
• Processing conditions in a streaming application can change over time
– Unexpected slowdowns in the processing components (e.g. data stores being written to or looked up become slow)
– Unexpected surges in the rate/load of input data
• Back pressure determines what happens when the above occurs
– Nothing special: data is lost, or the system becomes unstable and components start crashing
– Upstream components keep repeating the same work?
– Or all components upstream of the slowed-down component throttle themselves by some mechanism
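The throttling option in the last bullet can be sketched with a bounded blocking queue (plain Python, not any particular framework's mechanism): a full buffer blocks the faster producer until the slower consumer catches up, so nothing is dropped and in-flight memory stays bounded.

```python
# Sketch: back pressure via a bounded queue between producer and consumer.
import queue
import threading
import time

buf = queue.Queue(maxsize=4)   # small in-flight budget -> back pressure point
consumed = []

def consumer():
    # the slow downstream component
    while True:
        item = buf.get()
        if item is None:
            break
        time.sleep(0.001)      # simulate slow processing
        consumed.append(item)

t = threading.Thread(target=consumer)
t.start()
for i in range(100):
    buf.put(i)                 # blocks whenever 4 items are in flight
buf.put(None)                  # shut down the consumer
t.join()
print(len(consumed))           # 100: nothing was dropped
```

The producer's rate is automatically clamped to the consumer's, which is exactly the behaviour a streaming system wants to propagate hop by hop upstream.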
Stateful vs Stateless, Built-in Primitives vs Roll Your Own
• Some of the most common logic you implement in a streaming application:
– Calculate and maintain aggregates over time, by a key
– Join multiple input streams by a key; look up a master-data ‘table’
– Rolling counts over windows (by time or count); trigger when a threshold is breached
• Stateful vs stateless is linked to fault tolerance
– What happens when you maintain aggregates and the node holding them goes down? Do you have to replay from the start, or can you continue from the last saved/checkpointed state?
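The checkpoint-versus-replay trade-off can be shown concretely (a minimal plain-Python sketch, with `pickle` standing in for durable checkpoint storage): after a failure we restore the last checkpoint and replay only the batches since then, not the whole stream.

```python
# Sketch: a keyed running count with a checkpoint and recovery.
import pickle

def process(events, state):
    # keyed aggregate: add each event's count to its key's running total
    for key, n in events:
        state[key] = state.get(key, 0) + n

state = {}
batch1 = [("a", 1), ("b", 2)]
batch2 = [("a", 3)]

process(batch1, state)
checkpoint = pickle.dumps(state)   # durable snapshot taken after batch 1

process(batch2, state)             # ... and then the node "dies"

state = pickle.loads(checkpoint)   # recover from the last checkpoint
process(batch2, state)             # replay only the lost batch
print(state)                       # {'a': 4, 'b': 2}
```

Without the checkpoint, recovery would mean replaying batch 1 as well — and in a real system, everything since the stream began.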
Lambda Architecture
• Many systems today need to do both real-time and historical/batch processing on the same data
• Can you share the ‘same’ implementation and logic across these two?
– If you can’t, chances are you have 2 separate implementations of similar logic that drift out of sync
– You will get different answers from the real-time vs the historical system
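The "shared logic" point can be illustrated in a few lines (a hypothetical `enrich_and_count` function, not from the slides): if the business logic is one pure function over records, the batch layer and the speed layer can both call it and cannot disagree.

```python
# Sketch: one implementation serving both the batch and real-time paths.
def enrich_and_count(records):
    # the shared business logic: filter clicks, count by user
    counts = {}
    for user, action in records:
        if action == "click":
            counts[user] = counts.get(user, 0) + 1
    return counts

history = [("u1", "click"), ("u2", "view"), ("u1", "click")]
live_stream = iter(history)        # a stream is just lazy records here

# batch answer and streaming answer come from the same code path
assert enrich_and_count(history) == enrich_and_count(live_stream)
```

This is the property an overlapping programming model (Flink's DataSet/DataStream, Spark's RDD/DStream) gives you at the framework level.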
Better Resource Utilization & Tuple-Level vs Micro-batch
• Can your streaming application share clusters with other jobs via a common resource manager?
• Can you reserve resources at the required granularity?
Apache Storm – Let’s understand the programming model
Apache Storm – Process Model
Topology: a network of spouts and bolts
Apache Storm – Features
• True streaming, i.e. message level
• Stateless
– Cannot maintain state across messages
– Executor failures require replay from the start to rebuild state (fault tolerance of the executor)
• No built-in primitives
– All user code (no built-in aggregates/joins/grouping)
• Message processing guarantees
– At most once: yes (no tracking & replay)
– At least once: yes (tracks every tuple and its children, replays if necessary)
– Exactly once: no
• Run on shared clusters
– No. The scheduler component (Nimbus) doesn’t work with YARN/Mesos
– The Hortonworks distro allows running it as a YARN application
Apache Storm – Features
• Fault Tolerance (Master)
– While the Nimbus node is down, if a worker node dies, no reassignment of its workers to other nodes happens
– Existing workers continue to run; if they fail, the ‘Supervisor’ will restart them
– No new jobs can be submitted
• Back Pressure
– No
• Unified Programming Model
– No
Apache Trident
• Batching support for Apache Storm
• Stateful
– Can maintain state across messages/batches
– On executor failure, state is recoverable from external stores rather than replayed from the start
Apache Trident – What’s new in the programming model?
Apache Trident
• Built-in primitives – yes
– Aggregates/grouping/functions/filters/joins
• Message processing guarantees
– All three levels
– Exactly once (atomic transactions to an external store, using a unique batchId & guaranteed ordering of updates among batches)
• Run on shared clusters, back pressure
– No (same as Storm)
• Fault Tolerance (Master)
– Same as Storm
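The batchId trick above can be sketched in plain Python (the dict is a stand-in for Trident's pluggable transactional state store, not its actual API): the store commits the value and the last applied batchId together, and because batches are applied in a guaranteed order, a replayed batch is recognized by its id and skipped.

```python
# Sketch: exactly-once updates via (value, last_batch_id) committed atomically.
store = {"count": 0, "last_batch": -1}

def apply_batch(store, batch_id, increment):
    if batch_id <= store["last_batch"]:
        return                     # replayed batch: already applied, skip
    # value and batch id must be committed together (one atomic txn)
    store["count"] += increment
    store["last_batch"] = batch_id

apply_batch(store, 0, 7)
apply_batch(store, 1, 5)
apply_batch(store, 1, 5)   # replay after a failure: ignored
print(store["count"])      # 12
```

The ordering guarantee matters: if batch 3 could commit before batch 2, the simple `<=` check would wrongly discard batch 2's update.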
Apache Storm and Trident drawbacks
The source of Storm’s problems: multiple topologies’ tasks run in a single JVM process
Apache Storm and Trident drawbacks
• One Storm ‘worker’ (a single JVM process) runs too many different components and multiple different tasks, each requiring different resources
• Each tuple needs to pass through 4 threads in a worker
• Single global queues and log files across all tasks in a worker
– Hard to debug and size correctly, so you have to oversize a lot; not efficient
– E.g. with oversized memory, taking a stack dump causes missed heartbeats
• The Nimbus scheduler does not support resource reservation and isolation at the worker level
• Because of the above, Twitter runs each topology (streaming application) on a dedicated cluster
http://dl.acm.org/citation.cfm?id=2742788
Heron Topology
Apache Flink
• True streaming
– Data-exchange buffer control
• Stateful – yes
– Light-weight async checkpointing mechanism (barriers)
• Built-in primitives – yes
– Transformations, aggregations, windows, joins, connect, split
– Checkpointed local variables
– Tumbling windows, sliding windows
– Window triggers
– Time windows (event-time, ingestion-time and processing-time support)
– Train models, update & predict
• Message processing guarantees
– Exactly once
– Exactly-once sinks (e.g. HDFS)
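The window primitives listed above can be made concrete with a small sketch (plain Python illustrating the semantics, not Flink's API): tumbling event-time windows of a fixed size, counting events per window by the timestamp the event carries rather than the time it arrives.

```python
# Sketch: tumbling event-time windows of `size` time units.
def tumbling_counts(events, size):
    # events: (event_time, value); an event at time t falls in window t // size
    windows = {}
    for t, _ in events:
        w = t // size
        windows[w] = windows.get(w, 0) + 1
    return windows

events = [(1, "a"), (4, "b"), (12, "c"), (15, "d"), (23, "e")]
print(tumbling_counts(events, 10))  # {0: 2, 1: 2, 2: 1}
```

A sliding window differs only in that each event lands in several overlapping windows; triggers then decide *when* a window's result is emitted, which is where event time vs processing time matters.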
Apache Flink Process Model
Apache Flink
• Fault Tolerance of Master
– Yes (since 0.10.0; YARN & standalone only)
• Back Pressure
– Yes (a simple watermark on transfer buffers)
• Run on shared clusters – yes
• Programming Model
– Overlapping programming model (batch does not have SQL support)
Spark Streaming
• Micro-batching
• Stateful
– Yes (updateStateByKey, mapWithState functions)
• Built-in primitives – yes
– Aggregates/grouping/functions/filters/joins
– Sliding / tumbling windows
– Train models & predict in a streaming app
– Event time (slated for 2.0)
• Message processing guarantees
– Exactly once
• Run on shared clusters
– Yes
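The semantics of `updateStateByKey` can be sketched in plain Python (a simulation of the contract, not the Spark API itself): per micro-batch, an update function combines each key's new values with that key's previous state.

```python
# Sketch: updateStateByKey-style keyed state across micro-batches.
def update_func(new_values, old_state):
    # the user-supplied function: running sum per key
    return (old_state or 0) + sum(new_values)

def update_state_by_key(batch, state, update_func):
    by_key = {}
    for k, v in batch:
        by_key.setdefault(k, []).append(v)
    for k, vals in by_key.items():
        state[k] = update_func(vals, state.get(k))
    return state

state = {}
for batch in [[("a", 1), ("a", 2)], [("b", 5)], [("a", 1)]]:
    state = update_state_by_key(batch, state, update_func)
print(state)  # {'a': 4, 'b': 5}
```

In Spark the resulting state is itself an RDD that is checkpointed, which is what makes the aggregate recoverable after a failure; `mapWithState` adds timeouts and touches only the keys present in the batch.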
Spark Streaming
• Back Pressure
– Yes
• Fault Tolerance of Master
– Yes (DStream state saved to a checkpoint file, used to restart the master)
• Programming Model
– Integrates wholly with Spark (MLlib, Spark SQL/DataFrames, RDDs)
– Overlapping model for batch and streaming versions of an application
Summing it all up

Feature                       | Storm           | Trident                             | Flink        | Spark Streaming
Stateful                      | No              | Yes (with external stores)          | Yes          | Yes
Message Processing Guarantees | No exactly once | Exactly once (with external stores) | Exactly once | Exactly once
Back Pressure                 | No              | No                                  | Yes          | Yes
Built-in Primitives           | No              | Yes                                 | Yes          | Yes
Overlapping Programming Model | No              | No                                  | Yes          | Yes
Work with Resource Schedulers | No              | No                                  | Yes          | Yes
True Streaming                | Yes             | Micro-batch                         | Yes          | Micro-batch
How to benchmark Spark Streaming?
Setup Information
• load-data: pulls data from Twitter and stores it as a fixed dataset
• push-to-kafka: reads the fixed dataset and pushes it to Kafka at a specified rate
• spark-benchmarks: reads data from Kafka and executes the benchmark code
• flink-benchmarks (WIP): reads data from Kafka and executes the benchmark code
Listener Interface in Spark
• StreamingListener – a listener interface for receiving information about an ongoing streaming computation.
• Functions we need to override:
– onReceiverStarted: called when a receiver has been started
– onReceiverError: called when a receiver has reported an error
– onReceiverStopped: called when a receiver has been stopped
– onBatchSubmitted: called when a batch of jobs has been submitted for processing
– onBatchStarted: called when processing of a batch of jobs has started
– onBatchCompleted: called when processing of a batch of jobs has completed
– onOutputOperationStarted: called when processing of a job of a batch has started
– onOutputOperationCompleted: called when processing of a job of a batch has completed
Implementation Details
On the first batch (flag check):
• record startTime
For each batch:
• totalRecords += batchCompleted.batchInfo.numRecords
• batchCount += 1
Once totalRecords >= recordLimit:
• record endTime
• avgLatency = totalDelay / totalRecords
• recordThroughput = totalRecords / totalTime
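The steps above can be sketched as a listener-style class in plain Python (mirroring the shape of `StreamingListener.onBatchCompleted`; the callback's parameters here are simplified stand-ins for the fields of Spark's BatchInfo):

```python
# Sketch: the benchmark's metric accumulation, one call per completed batch.
class BenchmarkListener:
    def __init__(self, record_limit):
        self.record_limit = record_limit
        self.total_records = 0
        self.batch_count = 0
        self.total_delay_ms = 0
        self.start_ms = None
        self.result = None

    def on_batch_completed(self, num_records, batch_delay_ms, now_ms):
        if self.start_ms is None:        # first batch: record startTime
            self.start_ms = now_ms
        self.total_records += num_records
        self.batch_count += 1
        self.total_delay_ms += batch_delay_ms
        if self.total_records >= self.record_limit and self.result is None:
            total_time_s = (now_ms - self.start_ms) / 1000.0
            self.result = (
                self.total_delay_ms / self.total_records,  # avgLatency
                self.total_records / total_time_s,         # recordThroughput
            )

listener = BenchmarkListener(record_limit=300)
now = 0
for _ in range(3):                       # 3 batches, 1 s apart
    now += 1000
    listener.on_batch_completed(num_records=100, batch_delay_ms=500, now_ms=now)
print(listener.result)                   # (5.0, 150.0)
```

In the real benchmark this logic sits inside a Scala `StreamingListener` registered on the StreamingContext, with `batchInfo.numRecords` and `batchInfo.totalDelay` supplying the per-batch numbers.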
Observation: Execution with various batch intervals
Batch interval (ms) | Batch count | Total records | Total consumed time (s) | Avg latency / batch interval (ms) | Avg delay per batch (ms) | Avg records/second
1000 | 1,056 | 100,000,020 | 1,056 | 1,429 | 429   | 94688.7392386
900  | 1,175 | 100,000,020 | 1,058 | 1,362 | 462   | 94541.2998858
800  | 1,322 | 100,000,020 | 1,058 | 1,291 | 491   | 94541.2998858
700  | 1,487 | 100,000,020 | 1,041 | 1,184 | 484   | 96034.5609108
600  | 1,770 | 100,000,020 | 1,062 | 1,102 | 502   | 94159.4948532
500  | 1,681 | 100,000,020 | 1,042 | 1,104 | 604   | 95943.1594458
400  | 1,355 | 100,000,020 | 1,033 | 1,202 | 802   | 96774.6187322
300  | 979   | 100,000,020 | 1,062 | 1,712 | 1,412 | 94190.3586768
200  | 755   | 100,000,020 | 1,092 | 2,212 | 2,012 | 91577.2064378
100  | 718   | 100,000,020 | 1,076 | 2,397 | 2,297 | 92963.5183339
Observation: Execution with various input rate
500 60 1144 1457733645809 1457733645982 1457734424894 1457734424894 100017510 779.085 1165.648763 128378.1744
500 54 1056 1457734624836 1457734624971 1457735504549 1457735504549 108014596 879.713 1378.846423 122783.9034
500 45 1552 1457735724327 1457735724487 1457736730486 1457736730486 112535122 1006.159 1130.425913 111846.2609
500 36 2290 1457736944320 1457736944450 1457738209266 1457738209266 120010057 1264.946 1007.914431 94873.66022
500 24 3600 1457738405811 1457738405990 1457740206047 1457740206047 120005209 1800.236 962.950761 66660.82058
500 12 6517 1457740349824 1457740349984 1457743608456 1457743608456 120000935 3258.632 875.2863643 36825.55594
500 6 12956 1457743741277 1457743741433 1457750219374 1457750219374 120000857 6478.097 766.6043024 18524.09079
Back up
Editor's Notes

  • #8 Add Dynamic Allocation.
  • #15 https://github.com/davidkiss/storm-twitter-word-count
  • #20 https://github.com/Blackmist/TwitterTrending
  • #25 https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/index.html#controlling-latency https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/config.html#configuring-taskmanager-processing-slots
  • #26 https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/config.html#configuring-taskmanager-processing-slots
  • #27 Flink effectively uses distributed blocking queues with bounded capacity. The output side never puts too much data on the wire, thanks to a simple watermark mechanism: if enough data is in flight, we wait before copying more data to the wire until it drops below a threshold. This guarantees that there is never too much data in flight. If new data is not consumed on the receiving side (because no buffer is available), this slows down the sender. http://data-artisans.com/how-flink-handles-backpressure/