SlideShare a Scribd company logo
1 of 23
Stream Processing
DAVID OSTROVSKY | COUCHBASE
Why Streaming?
Streaming Data
Stream Processing
Stream
Processing
Engines
Complex Event
Processing
Engines
Types of Data Processing
Throughput / sec
Time frame
100s
1000s
100000s
daysec min hrms
Real-Time
Processing
(CEP, ESP)
Interactive
Query
DBMS
In-Memory
Computing
Batch
Processing
(MapReduce)
All Apache, all the Time
No Love for Microsoft?
Orleans
Processing Model
Operator
Events
OperatorOperator
Operator
Operator
Events
OperatorOperator
Operator
Collector
Batches
(Time Window)
Continuous Micro-Batching
Programming Model
Continuous Micro-Batch Micro-Batch Continuous Continuous*
* Has a batch abstraction on top of streaming
API and Expressiveness
public class PrinterBolt extends BaseBasicBolt
{
public void execute(Tuple tuple, ...) {
System.out.println(tuple);
}
}
topology.setBolt("print", new PrinterBolt())
.shuffleGrouping("twitter");
val ssc = new StreamingContext(conf, Seconds(1))
ssc.socketTextStream("localhost", 9999)
.flatMap(_.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
.print()
Compositional Declarative
API and Expressiveness
Compositional Compositional Declarative Compositional Declarative
JVM, Python,
Ruby, JS, Perl
JVM JVM, Python JVM JVM, Python*
* Only for the DataSet API (batch)
Storm + Trident
Topology:
◦ Spouts
◦ Bolts
Stream Groupings:
◦ Shuffle
◦ Fields
◦ All
◦ …
Nimbus (Master)
◦ Workers
Spark Streaming
Resilient Distributed Datasets (RDD)
DStreams – sequences of RDDs
Samza
Uses Kafka for streaming
◦ Topics (streams)
◦ Partitioned across Brokers
◦ Producers
◦ Consumers
Uses YARN for resource management
◦ ResourceManager
◦ NodeManager
◦ ApplicationMaster
Flink
Dataflows
◦ Streams
◦ Source(s)
◦ Sink(s)
◦ Transformations (operators)
Orleans
Virtual Actor System in .NET
◦ Grains (operators)
◦ Silos (containers)
◦ Streams
Message Delivery Guarantees
At Most Once At Least Once Exactly Once
Source
Sockets
Twitter Streaming API
Any non-repeatable
Files
Simple Queues
Any forward-only
Kafka, RabbitMQ
Collections
Stateful
Sink
Data Stores
Sockets
Files
HDFS rolling sink
Highest Possible Guarantee
At least once Exactly once* Exactly once** At least once Exactly once*
* Doesn’t apply to side-effects
** Only at the batch level
Reliability and Fault Tolerance
ACK per tuple RDD checkpoints
Partition offset
checkpoints
Barrier
checkpoints
State Management
Manual
Dedicated state
providers
(memory,
external)
RDD with per-key
state
Local K/V store
+ changelog in
Kafka
Stored with
snapshots,
configurable
backends
Performance
Latency Low Medium Medium-High* Low Low**
Throughput Medium Medium High High High
* Depends on batching
** For streaming, not micro-batching
Extended Ecosystem
SAMOA (ML) Trident-ML
Spark SQL,
MLlib
GraphX
SAMOA (ML)
CEP
Gelly*
FlinkML*
Table API (SQL)*
* DataSet API (batch)
** Currently v0.0.4
Production and Maturity
Mature,
many users,
224 contributors
Relatively mature,
many users
957 contributors*
Newer,
built on mature
components,
fewer users,
57 contributors
New,
high momentum,
few users,
219 contributors
* Spark, not just spark streaming
** Contributor numbers as of 5/9/2016
Stream Processing Frameworks

More Related Content

What's hot

Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Databricks
 
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
confluent
 

What's hot (20)

Apache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink MeetupApache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink Meetup
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
 
Monitoring Apache Kafka with Confluent Control Center
Monitoring Apache Kafka with Confluent Control Center   Monitoring Apache Kafka with Confluent Control Center
Monitoring Apache Kafka with Confluent Control Center
 
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 
Zookeeper vs Raft: Stateful distributed coordination with HA and Fault Tolerance
Zookeeper vs Raft: Stateful distributed coordination with HA and Fault ToleranceZookeeper vs Raft: Stateful distributed coordination with HA and Fault Tolerance
Zookeeper vs Raft: Stateful distributed coordination with HA and Fault Tolerance
 
Apache Kafka Streams + Machine Learning / Deep Learning
Apache Kafka Streams + Machine Learning / Deep LearningApache Kafka Streams + Machine Learning / Deep Learning
Apache Kafka Streams + Machine Learning / Deep Learning
 
Pinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberPinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ Uber
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Performance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State StoresPerformance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State Stores
 
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
 
Real-Time Streaming Data on AWS
Real-Time Streaming Data on AWSReal-Time Streaming Data on AWS
Real-Time Streaming Data on AWS
 
Kafka and Machine Learning in Banking and Insurance Industry
Kafka and Machine Learning in Banking and Insurance IndustryKafka and Machine Learning in Banking and Insurance Industry
Kafka and Machine Learning in Banking and Insurance Industry
 
Apache flink
Apache flinkApache flink
Apache flink
 

Similar to Stream Processing Frameworks

Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San Jose
Kostas Tzoumas
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine Parallelism
Sri Prasanna
 

Similar to Stream Processing Frameworks (20)

Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)
 
Dataservices: Processing (Big) Data the Microservice Way
Dataservices: Processing (Big) Data the Microservice WayDataservices: Processing (Big) Data the Microservice Way
Dataservices: Processing (Big) Data the Microservice Way
 
Apache Flink Stream Processing
Apache Flink Stream ProcessingApache Flink Stream Processing
Apache Flink Stream Processing
 
Intelligent Monitoring
Intelligent MonitoringIntelligent Monitoring
Intelligent Monitoring
 
Spark Summit - Stratio Streaming
Spark Summit - Stratio Streaming Spark Summit - Stratio Streaming
Spark Summit - Stratio Streaming
 
Server side JavaScript: going all the way
Server side JavaScript: going all the wayServer side JavaScript: going all the way
Server side JavaScript: going all the way
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San Jose
 
Real-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesReal-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpaces
 
Capacity Planning for Linux Systems
Capacity Planning for Linux SystemsCapacity Planning for Linux Systems
Capacity Planning for Linux Systems
 
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
 
Distributed Real-Time Stream Processing: Why and How 2.0
Distributed Real-Time Stream Processing:  Why and How 2.0Distributed Real-Time Stream Processing:  Why and How 2.0
Distributed Real-Time Stream Processing: Why and How 2.0
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...
 
Apache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and FriendsApache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and Friends
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
An Architect's guide to real time big data systems
An Architect's guide to real time big data systemsAn Architect's guide to real time big data systems
An Architect's guide to real time big data systems
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine Parallelism
 
Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmap
 
Real-time Stream Processing with Apache Flink @ Hadoop Summit
Real-time Stream Processing with Apache Flink @ Hadoop SummitReal-time Stream Processing with Apache Flink @ Hadoop Summit
Real-time Stream Processing with Apache Flink @ Hadoop Summit
 

Recently uploaded

The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
shinachiaurasa2
 

Recently uploaded (20)

%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
ManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide DeckManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide Deck
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 

Stream Processing Frameworks

Editor's Notes

  1. Talk about sources and use-cases of streaming data: web/social, fraud detection, log and machine data, real-time aggregation, etc 6k+ tweets p/s 50k+ google searches p/s 120k+ youtube videos viewed p/s 200+ MILLION emails per second (mostly spam) Not all data has value. Value of data decays over time, sometimes very fast. Newer data often supersedes older. It can be enough to process data without processing, especially since it’s often impractical to store so much data.
  2. Stream processing is not a new concept. Complex event processing engines have been around for a long time (early 90s), although they mostly derive their origins from stock market related use-cases. The main differences between CEP and ESP engines are that CEP engines tend to focus more on higher level querying of multiple data streams, such as with SQL, whereas ESP engines have been more geared towards running (ordered) events through a processing operator graph. This isn’t a clear distinction, and it’s coming more and more blurred as things like Spark SQL and Flink CEP come into play.
  3. Newer frameworks include: Apache Apex , Apache Beam (formerly part of Google Dataflow), Kafka Streams Source: http://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing-frameworks-part-1 Apache Storm was originally created by Nathan Marz and his team at BackType in 2010. Later it was acquired and open-sourced by Twitter and it became apache top-level project in 2014. Without any doubts, Storm was a pioneer in large scale stream processing and became de-facto industrial standard. Storm is a native streaming system and provides low-level API. Also, storm uses Thrift for topology definition and it also implements Storm multi-language protocol this basically allows to implement our solutions in large number of languages, which is pretty unique and Scala is of course of them. Trident is a higher level micro-batching system build atop Storm. It simplifies topology building process and also adds higher level operations like windowing, aggregations or state management which are not natively supported in Storm. In addition to Storm's at most once, Trident provides exactly once delivery, on the contrary of Storm’s at most once guarantee. Trident has Java, Clojure and Scala APIs. As we all know, Spark is very popular batch processing framework these days with a couple of built-in libraries like SparkSQL or MLlib and of course Spark Streaming. Spark’s runtime is build for batch processing and therefore spark streaming, as it was added a little bit later, does micro-batching. The stream of input data is ingested by receivers which create micro-batches and these micro-batches are processed in similar way as other Spark’s jobs. Spark streaming provides high-level declarative API in Scala, Java and Python. Samza was originally developed in LinkedIn as proprietary streaming solution and with Kafka, which is another great linkedIn contribution to our community, it became key part of their infrastructure. As you’re going to see a little bit later, Samza builds heavily on Kafka’s log based philosophy and both together integrates very well. Samza provides compositional api and of course Scala is supported. And the last but least, Flink. Flink is pretty old project, it has it’s origins in 2008, but right now is getting quite a lot of attention. Flink is native streaming system and provides a high level API. Flink also provides API for batch processing like Spark, but there is a fundamental distinction between those two. Flink handles batch as a special case of streaming. Everything is a stream and this is definitely better abstraction, because this is how the world really looks like. 
  4. Source: http://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing-frameworks-part-1 Apache Storm was originally created by Nathan Marz and his team at BackType in 2010. Later it was acquired and open-sourced by Twitter and it became apache top-level project in 2014. Without any doubts, Storm was a pioneer in large scale stream processing and became de-facto industrial standard. Storm is a native streaming system and provides low-level API. Also, storm uses Thrift for topology definition and it also implements Storm multi-language protocol this basically allows to implement our solutions in large number of languages, which is pretty unique and Scala is of course of them. Trident is a higher level micro-batching system build atop Storm. It simplifies topology building process and also adds higher level operations like windowing, aggregations or state management which are not natively supported in Storm. In addition to Storm's at most once, Trident provides exactly once delivery, on the contrary of Storm’s at most once guarantee. Trident has Java, Clojure and Scala APIs. As we all know, Spark is very popular batch processing framework these days with a couple of built-in libraries like SparkSQL or MLlib and of course Spark Streaming. Spark’s runtime is build for batch processing and therefore spark streaming, as it was added a little bit later, does micro-batching. The stream of input data is ingested by receivers which create micro-batches and these micro-batches are processed in similar way as other Spark’s jobs. Spark streaming provides high-level declarative API in Scala, Java and Python. Samza was originally developed in LinkedIn as proprietary streaming solution and with Kafka, which is another great linkedIn contribution to our community, it became key part of their infrastructure. As you’re going to see a little bit later, Samza builds heavily on Kafka’s log based philosophy and both together integrates very well. Samza provides compositional api and of course Scala is supported. And the last but least, Flink. Flink is pretty old project, it has it’s origins in 2008, but right now is getting quite a lot of attention. Flink is native streaming system and provides a high level API. Flink also provides API for batch processing like Spark, but there is a fundamental distinction between those two. Flink handles batch as a special case of streaming. Everything is a stream and this is definitely better abstraction, because this is how the world really looks like. 
  5. Continuous model generally provides lower latency processing, better expressiveness, and easier state management. On the other hand, it has lower throughput and expensive fault tolerance due to per-event overhead, and is harder to load-balance. Micro-batching provides higher throughput and simpler load balancing, but has higher latency (depending on the batch interval) and makes it harder to maintain state due to the fact that state updates aren’t per-event.
  6. Compositional approach provides basic building blocks like sources or operators and they must be tied together in order to create expected topology. New components can be usually defined by implementing some kind of interfaces. Provides low level control over execution and parallelism. On the contrary, operators in declarative API are defined as higher order functions. It allows us to write functional code with abstract types and all its fancy stuff and the system creates and optimizes topology itself. Also declarative APIs usually provides more advanced operations like windowing or state management out of the box. Less control over precise execution parameters, but usually has support for advanced abstractions, like windowing (batching), etc.
  7. Topology – a directed acyclic graph (DAG) of operators, each operator can have multiple instances which execute in parallel Spout – a source of streaming data (tuples), can be reliable or unreliable, that is can re-send data from a specified point or not. Bolt – a custom operator that consumes 1 or more streams and potentially emits new streams Stream groupings Part of defining a topology is specifying for each bolt which streams it should receive as input. A stream grouping defines how that stream should be partitioned among the bolt's tasks. There are eight built-in stream groupings in Storm, and you can implement a custom stream grouping by implementing the CustomStreamGroupinginterface: Shuffle grouping: Tuples are randomly distributed across the bolt's tasks in a way such that each bolt is guaranteed to get an equal number of tuples. Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task, but tuples with different "user-id"'s may go to different tasks. Partial Key grouping: The stream is partitioned by the fields specified in the grouping, like the Fields grouping, but are load balanced between two downstream bolts, which provides better utilization of resources when the incoming data is skewed. This paper provides a good explanation of how it works and the advantages it provides. All grouping: The stream is replicated across all the bolt's tasks. Use this grouping with care. Global grouping: The entire stream goes to a single one of the bolt's tasks. Specifically, it goes to the task with the lowest id. None grouping: This grouping specifies that you don't care how the stream is grouped. Currently, none groupings are equivalent to shuffle groupings. Eventually though, Storm will push down bolts with none groupings to execute in the same thread as the bolt or spout they subscribe from (when possible). Direct grouping: This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the [emitDirect](javadocs/org/apache/storm/task/OutputCollector.html#emitDirect(int, int, java.util.List) methods. A bolt can get the task ids of its consumers by either using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to). Local or shuffle grouping: If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping.
  8. A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all RDDs, such as map, filter, and persist. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data stream from sources such as Kafka and Flume, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
  9. Different colors == different machines YARN ResourceManager (RM) YARN NodeManager (NM) Samza ApplicationMaster (AM) The Samza client uses YARN to run a Samza job: YARN starts and supervises one or more SamzaContainers, and your processing code (using the StreamTask API) runs inside those containers. The input and output for the Samza StreamTasks come from Kafka brokers that are (usually) co-located on the same machines as the YARN NMs.
  10. At least once semantics message guarantee that every message will be processed (eventually), but some may be processed more than once due to various factors, such as timing, concurrency, or failures. At most once
  11. At least once semantics message guarantee that every message will be processed (eventually), but some may be processed more than once due to various factors, such as timing, concurrency, or failures. At most once
  12. Storm spouts keep a record of all in-flight tuples until every operator sends back an acknowledgement that it has processed the tuple successfully. The ACKs are handled by ACKer tasks. Each acker task holds a mapping from each spout tuple to an id and an ‘ack val’. The ack val is the XOR of all the spout and tuple ids anchored to the entire tuple tree derived from the source tuple, which have been emitted and/or acked. When the ack val becomes 0, that means every tuple id that was emitted has also been acked. If it doesn’t do so after a certain time, the spout tuple is replayed. Spark checkpointing is only relevant for stateful Dstreams. It persists each batch to HDFS (by default) every X seconds. Typically the checkpoint interval should be set to 5-10 times the sliding window interval. Samza uses Kafka’s partitioned, offset-based messaging system for fault tolerance. Each Samza job container has one or more stream tasks, which correspond to message partitions in the kafka topic. Each task periodically checkpoints the offset in each partition it’s processing and can then replay messages back from the last stored offset if needed. Flink splits streams into discrete segments, or snapshots, by injecting a barrier marker into streams at certain intervals. Each barrier carries the ID of the snapshot whose records are pushed in front of it. When an intermediate operator has received a barrier for a particular snapshot from ALL of its input streams, it emits a new barrier for that snapshot into all of its outgoing streams. Once a sink operator receives barrier N from all input streams, it acknowledges that snapshot N to the checkpoint coordinator. When all sinks do that, it’s considered completed. (Operators can align input streams, buffering some until all get to snapshot N.)
  13. Storm provides no built-in state mechanism, so it’s quite common to use an external state (aka. Database), particularly fast key-value stores. Trident adds a dedicated state operator, such as persistentAggregate, which can use one of several state providers, including MemoryState, which is replicated periodically, MemcachedState, and other custom providers, such as Kafka or Cassandra. Spark can attach state to keyed RDDs, which is then stored together with the checkpoints. Version 1.6 introduced a brand new mechanism, mapwithState, which has much higher performance than updateStateByKey. Samza uses a combination of local state (LevelDB) together with a compacted changelog stored as a kafka topic. The state locality improves performance, especially in memory, and the changelog can be used to restore the local state store on a new machine in the event of failure. Each task explicitly gets a reference to the state and uses it as a normal K/V store. Flink lets you register any instance field in an operator as a managed state by implementing an interface. It also has a built-in key/value API for tracking state. Local state is stored per-operator, while partitioned state is stored pet-key globally. Can use a MemoryStateBackend, which is replicated to the master, FsStateBackend which can write to file or HDFS, or RocksDBStateBackend
  14. Trident-ML currently supports : Linear classification (Perceptron, Passive-Aggressive, Winnow, AROW) Linear regression (Perceptron, Passive-Aggressive) Clustering (KMeans) Feature scaling (standardization, normalization) Text feature extraction Stream statistics (mean, variance) Pre-Trained Twitter sentiment classifier
  15. Storm is the de-factor standard streaming framework today. Interesting to see what Twitter’s Heron does if/when they opensource it. Spark is hugely popular and included in everything Hadoop related today. Samza is built on top of Kafka, which is a hugely popular and mature message queue. Flink is very promising, fixes a lot of pain points from older technologies like Storm, seems to have impressive performance.