Stream Processing Frameworks

An overview of the most used stream processing frameworks in the industry today.

Published in: Software

  1. Stream Processing. DAVID OSTROVSKY | COUCHBASE
  2. Why Streaming?
  3. Streaming Data: Stream Processing; Stream Processing Engines; Complex Event Processing Engines
  4. Types of Data Processing (chart): throughput / sec (100s, 1000s, 100000s) against time frame (ms, sec, min, hr, day). Regions: Real-Time Processing (CEP, ESP); Interactive Query; DBMS; In-Memory Computing; Batch Processing (MapReduce)
  5. All Apache, all the Time
  6. No Love for Microsoft? Orleans
  7. Processing Model (diagrams). Continuous: Events flow from Operator to Operator. Micro-Batching: a Collector groups Events into Batches (Time Window), which then flow from Operator to Operator.
  8. Programming Model: Continuous | Micro-Batch | Micro-Batch | Continuous | Continuous* (* Has a batch abstraction on top of streaming)
  9. API and Expressiveness
       public class PrinterBolt extends BaseBasicBolt {
         public void execute(Tuple tuple, ...) {
           System.out.println(tuple);
         }
       }
       topology.setBolt("print", new PrinterBolt())
               .shuffleGrouping("twitter");

       val ssc = new StreamingContext(conf, Seconds(1))
       ssc.socketTextStream("localhost", 9999)
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .print()
     Compositional | Declarative
  10. API and Expressiveness: Compositional | Compositional | Declarative | Compositional | Declarative. Languages: JVM, Python, Ruby, JS, Perl | JVM | JVM, Python | JVM | JVM, Python* (* Only for the DataSet API (batch))
  11. Storm + Trident
      Topology:
      ◦ Spouts
      ◦ Bolts
      Stream Groupings:
      ◦ Shuffle
      ◦ Fields
      ◦ All
      ◦ …
      Nimbus (Master)
      ◦ Workers
  12. Spark Streaming
      Resilient Distributed Datasets (RDD)
      DStreams: sequences of RDDs
  13. Samza
      Uses Kafka for streaming
      ◦ Topics (streams)
      ◦ Partitioned across Brokers
      ◦ Producers
      ◦ Consumers
      Uses YARN for resource management
      ◦ ResourceManager
      ◦ NodeManager
      ◦ ApplicationMaster
  14. Flink
      Dataflows
      ◦ Streams
      ◦ Source(s)
      ◦ Sink(s)
      ◦ Transformations (operators)
  15. Orleans
      Virtual Actor System in .NET
      ◦ Grains (operators)
      ◦ Silos (containers)
      ◦ Streams
  16. Message Delivery Guarantees: At Most Once | At Least Once | Exactly Once
      Source: Sockets; Twitter Streaming API; Any non-repeatable; Files; Simple Queues; Any forward-only; Kafka, RabbitMQ; Collections; Stateful
      Sink: Data Stores; Sockets; Files; HDFS rolling sink
  17. Highest Possible Guarantee: At least once | Exactly once* | Exactly once** | At least once | Exactly once* (* Doesn’t apply to side-effects; ** Only at the batch level)
  18. Reliability and Fault Tolerance: ACK per tuple | RDD checkpoints | Partition offset checkpoints | Barrier checkpoints
  19. State Management: Manual | Dedicated state providers (memory, external) | RDD with per-key state | Local K/V store + changelog in Kafka | Stored with snapshots, configurable backends
  20. Performance
      Latency: Low | Medium | Medium-High* | Low | Low**
      Throughput: Medium | Medium | High | High | High
      * Depends on batching
      ** For streaming, not micro-batching
  21. Extended Ecosystem: SAMOA (ML) | Trident-ML | Spark SQL, MLlib, GraphX | SAMOA (ML) | CEP, Gelly*, FlinkML*, Table API (SQL)* (* DataSet API (batch); ** Currently v0.0.4)
  22. Production and Maturity: Mature, many users, 224 contributors | Relatively mature, many users, 957 contributors* | Newer, built on mature components, fewer users, 57 contributors | New, high momentum, few users, 219 contributors (* Spark, not just Spark Streaming; ** Contributor numbers as of 5/9/2016)
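
The difference between the continuous and micro-batching processing models on slide 7 can be sketched in plain Java. This is a toy illustration only, not how any of the frameworks above implement their runtimes: the `ProcessingModels` class, the `Event` type, the sample data, and the one-second window are all assumptions made for the example, and no framework API is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy sketch: contrasts per-event (continuous) and per-window (micro-batch)
// word counting. All names here are hypothetical.
final class ProcessingModels {

    // A timestamped event carrying a single word.
    static final class Event {
        final long timestampMs;
        final String word;

        Event(long timestampMs, String word) {
            this.timestampMs = timestampMs;
            this.word = word;
        }
    }

    // Continuous model: the operator updates its running state and emits
    // a result as soon as each event arrives.
    static List<String> continuous(List<Event> events) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        List<String> emitted = new ArrayList<>();
        for (Event e : events) {
            int n = counts.merge(e.word, 1, Integer::sum);
            emitted.add(e.word + "=" + n);
        }
        return emitted;
    }

    // Micro-batch model: a collector first assigns events to fixed time
    // windows; each window is then counted as one small batch.
    static Map<Long, Map<String, Integer>> microBatch(List<Event> events, long windowMs) {
        Map<Long, Map<String, Integer>> batches = new TreeMap<>();
        for (Event e : events) {
            long windowStart = (e.timestampMs / windowMs) * windowMs;
            batches.computeIfAbsent(windowStart, k -> new TreeMap<>())
                   .merge(e.word, 1, Integer::sum);
        }
        return batches;
    }

    // Made-up sample data spanning three one-second windows.
    static List<Event> sampleEvents() {
        return Arrays.asList(
            new Event(100, "a"),
            new Event(400, "b"),
            new Event(900, "a"),
            new Event(1200, "a"),
            new Event(2500, "b"));
    }

    public static void main(String[] args) {
        List<Event> events = sampleEvents();
        System.out.println("continuous:  " + continuous(events));
        System.out.println("micro-batch: " + microBatch(events, 1000));
    }
}
```

The continuous version produces one output per input event, while the micro-batch version produces one output per window, so its latency is bounded below by the window length. That is consistent with the "Depends on batching" footnote on the performance slide.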
