Stream Processing
DAVID OSTROVSKY | COUCHBASE
Why Streaming?
Streaming Data
Stream Processing
◦ Stream Processing Engines
◦ Complex Event Processing Engines
Types of Data Processing
[Chart: throughput per second (100s to 100,000s) vs. time frame (ms, sec, min, hr, day), positioning Real-Time Processing (CEP, ESP), In-Memory Computing, DBMS, Interactive Query, and Batch Processing (MapReduce)]
All Apache, all the Time
No Love for Microsoft?
Orleans
Processing Model
[Diagram: Continuous: events flow one at a time through a graph of operators. Micro-Batching: a collector first groups events into batches (time windows), which then pass through the operators.]
Programming Model
◦ Storm: Continuous
◦ Trident: Micro-Batch
◦ Spark Streaming: Micro-Batch
◦ Samza: Continuous
◦ Flink: Continuous*
* Has a batch abstraction on top of streaming
API and Expressiveness
public class PrinterBolt extends BaseBasicBolt {
  public void execute(Tuple tuple, BasicOutputCollector collector) {
    System.out.println(tuple);
  }
}

topology.setBolt("print", new PrinterBolt())
        .shuffleGrouping("twitter");
val ssc = new StreamingContext(conf, Seconds(1))
ssc.socketTextStream("localhost", 9999)
   .flatMap(_.split(" "))
   .map(word => (word, 1))
   .reduceByKey(_ + _)
   .print()
Compositional (Storm) vs. Declarative (Spark Streaming)
API and Expressiveness
◦ Storm: Compositional; languages: JVM, Python, Ruby, JS, Perl
◦ Trident: Compositional; languages: JVM
◦ Spark Streaming: Declarative; languages: JVM, Python
◦ Samza: Compositional; languages: JVM
◦ Flink: Declarative; languages: JVM, Python*
* Only for the DataSet API (batch)
Storm + Trident
Topology:
◦ Spouts
◦ Bolts
Stream Groupings:
◦ Shuffle
◦ Fields
◦ All
◦ …
Nimbus (Master)
◦ Workers
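A minimal sketch of how these pieces fit together, written in Scala against the Storm 1.x Java API (older releases used the backtype.storm packages). TweetSpout and WordCountBolt are hypothetical components, the spout is assumed to emit tuples with a "word" field, and PrinterBolt is the bolt from the API slide above:

import org.apache.storm.{Config, LocalCluster}
import org.apache.storm.topology.TopologyBuilder
import org.apache.storm.tuple.Fields

// A topology wires spouts to bolts; the grouping chosen on each bolt
// controls how the incoming stream is partitioned among its parallel tasks.
val builder = new TopologyBuilder
builder.setSpout("twitter", new TweetSpout)            // hypothetical spout emitting a "word" field
builder.setBolt("count", new WordCountBolt, 4)         // hypothetical bolt, 4 parallel tasks
  .fieldsGrouping("twitter", new Fields("word"))       // same word always goes to the same task
builder.setBolt("print", new PrinterBolt)              // PrinterBolt from the earlier slide
  .shuffleGrouping("count")                            // tuples distributed randomly but evenly

new LocalCluster().submitTopology("word-count", new Config(), builder.createTopology())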
Spark Streaming
Resilient Distributed Datasets (RDD)
DStreams – sequences of RDDs
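As an illustration, a sketch of a windowed word count over a DStream, in the style of the earlier Spark snippet; the socket source, intervals, and app name are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Each 1-second micro-batch becomes one RDD in the DStream; the window
// operator then combines the last 30 seconds' worth of those RDDs.
val conf = new SparkConf().setAppName("windowed-counts").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1))

ssc.socketTextStream("localhost", 9999)
   .flatMap(_.split(" "))
   .map(word => (word, 1))
   .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))   // 30s window, sliding every 10s
   .print()

ssc.start()
ssc.awaitTermination()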
Samza
Uses Kafka for streaming
◦ Topics (streams)
◦ Partitioned across Brokers
◦ Producers
◦ Consumers
Uses YARN for resource management
◦ ResourceManager
◦ NodeManager
◦ ApplicationMaster
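For context, a sketch of Samza's low-level StreamTask API in Scala; the class name, output topic, and filter condition are illustrative, not from the deck:

import org.apache.samza.system.{IncomingMessageEnvelope, OutgoingMessageEnvelope, SystemStream}
import org.apache.samza.task.{MessageCollector, StreamTask, TaskCoordinator}

// One task instance is created per input partition; process() is invoked
// for every message consumed from that partition.
class HashtagFilterTask extends StreamTask {
  private val output = new SystemStream("kafka", "filtered-tweets")   // illustrative output topic

  override def process(envelope: IncomingMessageEnvelope,
                       collector: MessageCollector,
                       coordinator: TaskCoordinator): Unit = {
    val tweet = envelope.getMessage.asInstanceOf[String]
    if (tweet.contains("#streaming"))                                 // illustrative filter
      collector.send(new OutgoingMessageEnvelope(output, tweet))
  }
}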
Flink
Dataflows
◦ Streams
◦ Source(s)
◦ Sink(s)
◦ Transformations (operators)
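A small Scala sketch of such a dataflow (the socket source and job name are placeholders):

import org.apache.flink.streaming.api.scala._

// Source -> transformations -> sink, expressed with Flink's DataStream API.
val env = StreamExecutionEnvironment.getExecutionEnvironment

env.socketTextStream("localhost", 9999)    // source
   .flatMap(_.split(" "))                  // transformation operators
   .map(word => (word, 1))
   .keyBy(0)
   .sum(1)
   .print()                                // sink

env.execute("word-count")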
Orleans
Virtual Actor System in .NET
◦ Grains (operators)
◦ Silos (containers)
◦ Streams
Message Delivery Guarantees
Source:
◦ At Most Once: Sockets, Twitter Streaming API, any non-repeatable source
◦ At Least Once: Files, Simple Queues, any forward-only source
◦ Exactly Once: Kafka, RabbitMQ, Collections, stateful sources
Sink: Data Stores, Sockets, Files, HDFS rolling sink
Highest Possible Guarantee:
◦ Storm: At least once
◦ Trident: Exactly once*
◦ Spark Streaming: Exactly once**
◦ Samza: At least once
◦ Flink: Exactly once*
* Doesn't apply to side-effects
** Only at the batch level
Reliability and Fault Tolerance
◦ Storm / Trident: ACK per tuple
◦ Spark Streaming: RDD checkpoints
◦ Samza: Partition offset checkpoints
◦ Flink: Barrier checkpoints
State Management
◦ Storm: Manual
◦ Trident: Dedicated state providers (memory, external)
◦ Spark Streaming: RDD with per-key state
◦ Samza: Local K/V store + changelog in Kafka
◦ Flink: Stored with snapshots, configurable backends
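As one concrete example of the managed-state approach in the Flink column, a sketch against a recent Flink release (class and field names are illustrative); the function would be applied to a keyed stream, e.g. stream.keyBy(_._1).flatMap(new RunningCount):

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

// A running sum per key, kept in Flink-managed keyed state. The state is
// included in snapshots automatically; the backend (memory, filesystem,
// RocksDB) is chosen by configuration rather than in code.
class RunningCount extends RichFlatMapFunction[(String, Int), (String, Long)] {
  @transient private var total: ValueState[java.lang.Long] = _

  override def open(parameters: Configuration): Unit =
    total = getRuntimeContext.getState(
      new ValueStateDescriptor("total", classOf[java.lang.Long]))

  override def flatMap(in: (String, Int), out: Collector[(String, Long)]): Unit = {
    val next = (if (total.value() == null) 0L else total.value().longValue()) + in._2
    total.update(next)
    out.collect((in._1, next))
  }
}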
Performance
◦ Latency: Storm: Low | Trident: Medium | Spark Streaming: Medium-High* | Samza: Low | Flink: Low**
◦ Throughput: Storm: Medium | Trident: Medium | Spark Streaming: High | Samza: High | Flink: High
* Depends on batching
** For streaming, not micro-batching
Extended Ecosystem
◦ Storm: SAMOA (ML)
◦ Trident: Trident-ML**
◦ Spark Streaming: Spark SQL, MLlib, GraphX
◦ Samza: SAMOA (ML)
◦ Flink: CEP, Gelly*, FlinkML*, Table API (SQL)*
* DataSet API (batch)
** Currently v0.0.4
Production and Maturity
◦ Storm / Trident: Mature, many users, 224 contributors
◦ Spark Streaming: Relatively mature, many users, 957 contributors*
◦ Samza: Newer, built on mature components, fewer users, 57 contributors
◦ Flink: New, high momentum, few users, 219 contributors
* Spark as a whole, not just Spark Streaming
Contributor numbers as of 5/9/2016.
Editor's Notes

  1. Talk about sources and use-cases of streaming data: web/social, fraud detection, log and machine data, real-time aggregation, etc. Examples of scale: 6K+ tweets per second, 50K+ Google searches per second, 120K+ YouTube videos viewed per second, 200+ million emails per second (mostly spam). Not all data has value; the value of data decays over time, sometimes very fast, and newer data often supersedes older data. It can be enough to process data without storing it, especially since it's often impractical to store so much data.
  2. Stream processing is not a new concept. Complex event processing engines have been around for a long time (early 90s), although they mostly derive their origins from stock-market-related use-cases. The main difference between CEP and ESP engines is that CEP engines tend to focus more on higher-level querying of multiple data streams, such as with SQL, whereas ESP engines have been more geared towards running (ordered) events through a graph of processing operators. This isn't a clear distinction, and it's becoming more and more blurred as things like Spark SQL and Flink CEP come into play.
  3. Newer frameworks include Apache Apex, Apache Beam (formerly part of Google Dataflow), and Kafka Streams. Source: http://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing-frameworks-part-1 Apache Storm was originally created by Nathan Marz and his team at BackType in 2010. It was later acquired and open-sourced by Twitter, and it became an Apache top-level project in 2014. Without any doubt, Storm was a pioneer in large-scale stream processing and became the de-facto industry standard. Storm is a native streaming system and provides a low-level API. Storm uses Thrift for topology definition and implements the Storm multi-language protocol, which allows solutions to be implemented in a large number of languages, which is pretty unique; Scala is of course one of them. Trident is a higher-level micro-batching system built atop Storm. It simplifies the topology-building process and adds higher-level operations like windowing, aggregations, and state management, which are not natively supported in Storm. In contrast to Storm's at-most-once guarantee, Trident provides exactly-once delivery. Trident has Java, Clojure, and Scala APIs. Spark is a very popular batch processing framework these days, with built-in libraries like Spark SQL, MLlib, and of course Spark Streaming. Spark's runtime is built for batch processing, and therefore Spark Streaming, which was added a little later, does micro-batching: the stream of input data is ingested by receivers that create micro-batches, and these micro-batches are processed in a similar way to other Spark jobs. Spark Streaming provides a high-level declarative API in Scala, Java, and Python. Samza was originally developed at LinkedIn as a proprietary streaming solution, and together with Kafka, another great LinkedIn contribution to the community, it became a key part of their infrastructure. As you'll see a little later, Samza builds heavily on Kafka's log-based philosophy, and the two integrate very well. Samza provides a compositional API, and Scala is supported. And last but not least, Flink. Flink is a pretty old project, with origins in 2008, but it's now getting quite a lot of attention. Flink is a native streaming system and provides a high-level API. Like Spark, Flink also provides an API for batch processing, but there is a fundamental distinction between the two: Flink handles batch as a special case of streaming. Everything is a stream, and this is arguably the better abstraction, because that is how the world really looks.
  4. Source: http://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing-frameworks-part-1 (the same framework overview as in the previous note).
  5. The continuous model generally provides lower-latency processing, better expressiveness, and easier state management. On the other hand, it has lower throughput and expensive fault tolerance due to per-event overhead, and it is harder to load-balance. Micro-batching provides higher throughput and simpler load balancing, but has higher latency (depending on the batch interval) and makes it harder to maintain state, because state updates aren't per-event.
  6. The compositional approach provides basic building blocks, like sources and operators, which must be tied together to create the expected topology. New components can usually be defined by implementing some kind of interface. This gives low-level control over execution and parallelism. By contrast, operators in a declarative API are defined as higher-order functions. This lets us write functional code with abstract types, and the system creates and optimizes the topology itself. Declarative APIs also usually provide more advanced operations, like windowing or state management, out of the box. There is less control over precise execution parameters, but usually built-in support for advanced abstractions, like windowing (batching), etc.
  7. Topology: a directed acyclic graph (DAG) of operators; each operator can have multiple instances which execute in parallel. Spout: a source of streaming data (tuples); can be reliable or unreliable, that is, able to re-send data from a specified point or not. Bolt: a custom operator that consumes one or more streams and potentially emits new streams. Stream groupings: part of defining a topology is specifying for each bolt which streams it should receive as input. A stream grouping defines how that stream should be partitioned among the bolt's tasks. There are eight built-in stream groupings in Storm, and you can implement a custom stream grouping by implementing the CustomStreamGrouping interface (a sketch follows this note):
◦ Shuffle grouping: tuples are randomly distributed across the bolt's tasks in a way such that each bolt is guaranteed to get an equal number of tuples.
◦ Fields grouping: the stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task, but tuples with different "user-id"s may go to different tasks.
◦ Partial Key grouping: the stream is partitioned by the fields specified in the grouping, like the fields grouping, but tuples are load-balanced between two downstream bolts, which provides better utilization of resources when the incoming data is skewed. This paper provides a good explanation of how it works and the advantages it provides.
◦ All grouping: the stream is replicated across all the bolt's tasks. Use this grouping with care.
◦ Global grouping: the entire stream goes to a single one of the bolt's tasks. Specifically, it goes to the task with the lowest id.
◦ None grouping: this grouping specifies that you don't care how the stream is grouped. Currently, none groupings are equivalent to shuffle groupings. Eventually, though, Storm will push down bolts with none groupings to execute in the same thread as the bolt or spout they subscribe to (when possible).
◦ Direct grouping: this is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive it. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the emitDirect methods. A bolt can get the task ids of its consumers either by using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids the tuple was sent to).
◦ Local or shuffle grouping: if the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping.
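A minimal sketch of such a custom grouping, written in Scala against the Storm 1.x Java API; the class name and routing rule are illustrative, not from the deck:

import java.util.{Collections, List => JList}
import org.apache.storm.generated.GlobalStreamId
import org.apache.storm.grouping.CustomStreamGrouping
import org.apache.storm.task.WorkerTopologyContext

// Routes each tuple to a target task chosen by hashing its first field.
class HashFirstFieldGrouping extends CustomStreamGrouping {
  private var targets: JList[Integer] = _

  override def prepare(context: WorkerTopologyContext,
                       stream: GlobalStreamId,
                       targetTasks: JList[Integer]): Unit =
    targets = targetTasks

  override def chooseTasks(taskId: Int, values: JList[AnyRef]): JList[Integer] = {
    val idx = (values.get(0).hashCode & Integer.MAX_VALUE) % targets.size
    Collections.singletonList(targets.get(idx))
  }
}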
  8. A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark: an immutable, partitioned collection of elements that can be operated on in parallel. The RDD class contains the basic operations available on all RDDs, such as map, filter, and persist. Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka and Flume, or by applying high-level operations to other DStreams. Internally, a DStream is represented as a sequence of RDDs (see the sketch below).
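A small sketch of that last point, assuming a words: DStream[String] built as in the earlier Spark Streaming snippet; foreachRDD exposes the per-batch RDD directly:

import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Each batch interval of a DStream materializes as one RDD; foreachRDD
// hands that RDD (plus its batch time) to ordinary batch-style code.
def logBatchSizes(words: DStream[String]): Unit =
  words.foreachRDD { (rdd, time: Time) =>
    println(s"Batch at $time contains ${rdd.count()} words")
  }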
  9. Different colors = different machines: YARN ResourceManager (RM), YARN NodeManager (NM), Samza ApplicationMaster (AM). The Samza client uses YARN to run a Samza job: YARN starts and supervises one or more SamzaContainers, and your processing code (using the StreamTask API) runs inside those containers. The input and output for the Samza StreamTasks come from Kafka brokers that are (usually) co-located on the same machines as the YARN NMs.
  10. At-least-once semantics guarantee that every message will (eventually) be processed, but some messages may be processed more than once due to factors such as timing, concurrency, or failures. At-most-once semantics guarantee that no message is processed more than once, but messages may be lost.
  11. (Same delivery-guarantee definitions as in the previous note.)
  12. Storm spouts keep a record of all in-flight tuples until every operator sends back an acknowledgement that it has processed the tuple successfully. The ACKs are handled by acker tasks. Each acker task holds a mapping from each spout tuple to an id and an 'ack val'. The ack val is the XOR of all the spout and tuple ids anchored to the tuple tree derived from the source tuple that have been emitted and/or acked; when the ack val becomes 0, every tuple id that was emitted has also been acked. If that doesn't happen within a certain time, the spout tuple is replayed. Spark checkpointing is only relevant for stateful DStreams. It persists each batch to HDFS (by default) every X seconds; typically the checkpoint interval should be set to 5-10 times the sliding window interval. Samza uses Kafka's partitioned, offset-based messaging for fault tolerance. Each Samza job container has one or more stream tasks, which correspond to message partitions in the Kafka topic. Each task periodically checkpoints the offset in each partition it's processing and can then replay messages from the last stored offset if needed. Flink splits streams into discrete segments, or snapshots, by injecting barrier markers into the streams at certain intervals. Each barrier carries the ID of the snapshot whose records are pushed in front of it. When an intermediate operator has received a barrier for a particular snapshot from ALL of its input streams, it emits a new barrier for that snapshot into all of its outgoing streams. Once a sink operator receives barrier N from all input streams, it acknowledges snapshot N to the checkpoint coordinator; when all sinks have done so, the snapshot is considered complete. (Operators can align input streams, buffering some until all reach snapshot N.) A minimal Flink checkpointing sketch follows this note.
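A minimal sketch of enabling those barrier snapshots in Flink; the interval and checkpoint path are illustrative, and the exact backend classes vary by Flink version:

import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

// Barrier snapshots are configured per job: every 10 seconds a barrier is
// injected at the sources, and each operator checkpoints its state as the
// barrier passes through.
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(10000)                                         // interval in ms
env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"))   // illustrative path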
  13. Storm provides no built-in state mechanism, so it's quite common to use external state (i.e., a database), particularly fast key-value stores. Trident adds dedicated state operators, such as persistentAggregate, which can use one of several state providers, including MemoryState (replicated periodically), MemcachedState, and other custom providers such as Kafka or Cassandra. Spark can attach state to keyed RDDs, which is then stored together with the checkpoints. Version 1.6 introduced a new mechanism, mapWithState, which has much higher performance than updateStateByKey (see the sketch after this note). Samza uses a combination of local state (LevelDB) and a compacted changelog stored as a Kafka topic. The state locality improves performance, especially in memory, and the changelog can be used to restore the local state store on a new machine in the event of failure. Each task explicitly gets a reference to the state and uses it as a normal K/V store. Flink lets you register any instance field in an operator as managed state by implementing an interface. It also has a built-in key/value API for tracking state. Local state is stored per operator, while partitioned state is stored per key globally. Flink can use a MemoryStateBackend, which is replicated to the master; an FsStateBackend, which can write to a file or HDFS; or a RocksDBStateBackend.
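A sketch of the mapWithState approach (Spark 1.6+); it assumes a words: DStream[String] already exists and that a checkpoint directory has been set, since stateful DStreams require one:

import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

// Keeps a running count per word in Spark-managed per-key state.
def runningCounts(words: DStream[String]): DStream[(String, Long)] = {
  val update = (word: String, one: Option[Int], state: State[Long]) => {
    val count = state.getOption.getOrElse(0L) + one.getOrElse(0)
    state.update(count)          // new state value kept for the next batch
    (word, count)                // emitted downstream for this batch
  }
  words.map(word => (word, 1)).mapWithState(StateSpec.function(update))
}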
  14. Trident-ML currently supports: linear classification (Perceptron, Passive-Aggressive, Winnow, AROW), linear regression (Perceptron, Passive-Aggressive), clustering (KMeans), feature scaling (standardization, normalization), text feature extraction, stream statistics (mean, variance), and a pre-trained Twitter sentiment classifier.
  15. Storm is the de-facto standard streaming framework today; it will be interesting to see what happens with Twitter's Heron if/when they open-source it. Spark is hugely popular and included in everything Hadoop-related today. Samza is built on top of Kafka, which is a hugely popular and mature message queue. Flink is very promising: it fixes a lot of pain points of older technologies like Storm and seems to have impressive performance.