SlideShare a Scribd company logo
Data Stream Processing with
Apache Flink
Fabian Hueske
@fhueske
Apache Flink Meetup Madrid, 25.02.2016
What is Apache Flink?
Apache Flink is an open source platform for
scalable stream and batch processing.
2
• The core of Flink is a distributed
streaming dataflow engine.
• Executes dataflows in
parallel on clusters
• Provides a reliable backend
for various workloads
• DataStream and DataSet
programming abstractions are
the foundation for user programs
and higher layers
What is Apache Flink?
3
Streaming topologies
Long batch pipelines
Machine Learning at scale
A stream processor with many faces
Graph Analysis
 resource utilization
 iterative algorithms
 Mutable state
 low-latency processing
History & Community of Flink
From incubation until now
4
5
Apr ‘14 Jun ‘15Dec ‘14
0.70.60.5 0.9 0.10
Nov ‘15
Top level
0.8
Mar ‘15
1.0!
Growing and Vibrant Community
Flink is one of the largest and most active Apache big data projects:
• more than 150 contributors
• more than 600 forks
• more than 1000 Github stars (since yesterday)
6
Flink Meetups around the Globe
7
Flink Meetups around the Globe
8
✔ 
Organizations at Flink Forward
9
The streaming era
Coming soon…
10
What is Stream Processing?
11
 Today, most data is continuously produced
• user activity logs, web logs, sensors, database
transactions, …
 The common approach to analyze such data so far
• Record data stream to stable storage (DBMS, HDFS, …)
• Periodically analyze data with batch processing engine
(DBMS, MapReduce, ...)
 Streaming processing engines analyze data
while it arrives
Why do Stream Processing?
 Decreases the overall latency to obtain results
• No need to persist data in stable storage
• No periodic batch analysis jobs
 Simplifies the data infrastructure
• Fewer moving parts to be maintained and coordinated
 Makes time dimension of data explicit
• Each event has a timestamp
• Data can be processed based on timestamps
12
What are the Requirements?
 Low latency
• Results in millisecond
 High throughput
• Millions of events per second
 Exactly-once consistency
• Correct results in case of failures
 Out-of-order events
• Process events based on their associated time
 Intuitive APIs
13
OS Stream Processors so far
 Either low latency or high throughput
 Exactly-once guarantees only with high latency
 Lacking time semantics
• Processing by wall clock time only
• Events are processed in arrival order, not in the order they were
created
 Shortcomings lead to complicated system designs
• Lambda architecture
14
Stream Processing with Flink
15
Stream Processing with Flink
 Low latency
• Pipelined processing engine
 High throughput
• Controllable checkpointing overhead
 Exactly-once guarantees
• Distributed snapshots
 Support for out-of-order streams
• Processing semantics based on event-time
 Programmability
• APIs similar to those known from the batch world
16
Flink in Streaming Architectures
17
Flink
Flink Flink
Elasticsearch, Hbase,
Cassandra, …
HDFS
Kafka
Analytics on static data
Data ingestion
and ETL
Analytics on data
in motion
The DataStream API
Concise and easy-to-grasp code
18
The DataStream API
19
case class Event(location: Location, numVehicles: Long)
val stream: DataStream[Event] = …;
stream
.filter { evt => isIntersection(evt.location) }
The DataStream API
20
case class Event(location: Location, numVehicles: Long)
val stream: DataStream[Event] = …;
stream
.filter { evt => isIntersection(evt.location) }
.keyBy("location")
.timeWindow(Time.minutes(15), Time.minutes(5))
.sum("numVehicles")
The DataStream API
21
case class Event(location: Location, numVehicles: Long)
val stream: DataStream[Event] = …;
stream
.filter { evt => isIntersection(evt.location) }
.keyBy("location")
.timeWindow(Time.minutes(15), Time.minutes(5))
.sum("numVehicles")
.keyBy("location")
.mapWithState { (evt, state: Option[Model]) => {
val model = state.orElse(new Model())
(model.classify(evt), Some(model.update(evt)))
}}
Event-time processing
Consistent and sound results
22
Event-time Processing
 Most data streams consist of events
• log entries, sensor data, user actions, …
• Events have an associated timestamp
 Many analysis tasks are based on time
• “Average temperature every minute”
• “Count of processed parcels per hour”
• ...
 Events often arrive out-of-order at processor
• Distributed sources, network delays, non-synced clocks, …
 Stream processor must respect time of events for
consistent and sound results
• Most stream processors use wall clock time
23
Event Processing
24
Events occur on devices
Queue / Log
Events analyzed in a
stream processor
Stream Analysis
Events stored in a log
Event Processing
25
Event Processing
26
Event Processing
27
Event Processing
28
Out of order!!!
First burst of events
Second burst of events
Event Processing
29
Event time windows
Arrival time windows
Instant event-at-a-time
Flink supports out-of-order streams (event time) windows,
arrival time windows (and mixtures) plus low latency processing.
First burst of events
Second burst of events
Event-time Processing
 Event-time processing decouples job semantics
from processing speed
 Analyze events from static data store and
online stream using the same program
 Semantically sound and consistent results
 Details:
http://data-artisans.com/how-apache-flink-enables-new-
streaming-applications-part-1
30
Operational Features
Running Flink 24*7*52
31
Monitoring & Dashboard
 Many metrics exposed via REST interface
 Web dashboard
• Submit, stop, and cancel jobs
• Inspect running and completed jobs
• Analyze performance
• Check exceptions
• Inspect configuration
• …
32
Highly-available Cluster Setup
 Stream applications run for weeks, months, …
• Application must never fail!
• No single-point-of-failure component allowed
 Flink supports highly-available cluster setups
• Master failures are resolved using Apache Zookeeper
• Worker failures are resolved by master
 Stand-alone cluster setup
• Requires (manually started) stand-by masters and workers
 YARN cluster setup
• Masters and workers are automatically restarted
33
 A save point is a consistent snapshot of a job
• Includes source offsets and operator state
• Stop job
• Restart job from save point
 What can I use it for?
• Fix or update your job
• A/B testing
• Update Flink
• Migrate cluster
• …
 Details:
http://data-artisans.com/how-apache-flink-enables-new-
streaming-applications
Save Points
34
Performance: Summary
35
Continuous
streaming
Latency-bound
buffering
Distributed
Snapshots
High Throughput &
Low Latency
With configurable throughput/latency tradeoff
Details:
http://data-artisans.com/high-throughput-low-latency-
and-exactly-once-stream-processing-with-apache-flink
Integration (picture not complete)
36
POSIX Java/Scala
Collections
POSIX
Post v1.0 Roadmap
What’s coming next?
37
Stream SQL and Table API
 Structured queries over data streams
• LINQ-style Table API
• Stream SQL
 Based on Apache Calcite
• SQL Parser and optimizer
 “Compute every hour the number of orders and
number ordered units for each product.”
38
SELECT STREAM
productId,
TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime,
COUNT(*) AS cnt,
SUM(units) AS units
FROM
Orders
GROUP BY
TUMBLE(rowtime, INTERVAL '1' HOUR),
productId;
Complex Event Processing
 Identify complex patterns in event streams
• Correlations & sequences
 Many applications
• Network intrusion detection via access patterns
• Item tracking (parcels, devices, …)
• …
 CEP depends on low latency processing
• Most CEP system are not distributed
 CEP in Flink
• Easy-to-use API to define CEP patterns
• Integration with Table API for structured analytics
• Low-latency and high-throughput engine
39
Dynamic Job Parallelism
 Adjusting parallelism of tasks without (significantly)
interrupting the program
 Initial version based on save points
• Trigger save point
• Stop job
• Restart job with adjusted parallelism
 Later change parallelism while job is running
 Vision is automatic adaption based on throughput
40
Wrap up!
 Flink is a kick-ass stream processor…
• Low latency & high throughput
• Exactly-once consistency
• Event-time processing
• Support for out-of-order streams
• Intuitive API
 with lots of features in the pipeline…
 and a reliable batch processor as well!
41
I ♥ Squirrels, do you?
 More Information at
• http://flink.apache.org/
 Free Flink training at
• http://dataartisans.github.io/flink-training
 Sign up for user/dev mailing list
 Get involved and contribute
 Follow @ApacheFlink on Twitter
42
43

More Related Content

What's hot

Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
Kostas Tzoumas
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
Alluxio, Inc.
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
mxmxm
 
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
HostedbyConfluent
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
Alluxio, Inc.
 
Apache flink
Apache flinkApache flink
Apache flink
pranay kumar
 
The Rise Of Event Streaming – Why Apache Kafka Changes Everything
The Rise Of Event Streaming – Why Apache Kafka Changes EverythingThe Rise Of Event Streaming – Why Apache Kafka Changes Everything
The Rise Of Event Streaming – Why Apache Kafka Changes Everything
Kai Wähner
 
Unlocking the Power of Apache Flink: An Introduction in 4 Acts
Unlocking the Power of Apache Flink: An Introduction in 4 ActsUnlocking the Power of Apache Flink: An Introduction in 4 Acts
Unlocking the Power of Apache Flink: An Introduction in 4 Acts
HostedbyConfluent
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Databricks
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming Analytics
Slim Baltagi
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Databricks
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
Databricks
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
Aljoscha Krettek
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
HostedbyConfluent
 
Analyzing 1.2 Million Network Packets per Second in Real-time
Analyzing 1.2 Million Network Packets per Second in Real-timeAnalyzing 1.2 Million Network Packets per Second in Real-time
Analyzing 1.2 Million Network Packets per Second in Real-timeDataWorks Summit
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Flink Forward
 
Deep dive into flink interval join
Deep dive into flink interval joinDeep dive into flink interval join
Deep dive into flink interval join
yeomii
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
Flink Forward
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System Overview
Flink Forward
 

What's hot (20)

Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
 
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Apache flink
Apache flinkApache flink
Apache flink
 
The Rise Of Event Streaming – Why Apache Kafka Changes Everything
The Rise Of Event Streaming – Why Apache Kafka Changes EverythingThe Rise Of Event Streaming – Why Apache Kafka Changes Everything
The Rise Of Event Streaming – Why Apache Kafka Changes Everything
 
Unlocking the Power of Apache Flink: An Introduction in 4 Acts
Unlocking the Power of Apache Flink: An Introduction in 4 ActsUnlocking the Power of Apache Flink: An Introduction in 4 Acts
Unlocking the Power of Apache Flink: An Introduction in 4 Acts
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming Analytics
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
 
Analyzing 1.2 Million Network Packets per Second in Real-time
Analyzing 1.2 Million Network Packets per Second in Real-timeAnalyzing 1.2 Million Network Packets per Second in Real-time
Analyzing 1.2 Million Network Packets per Second in Real-time
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
 
Deep dive into flink interval join
Deep dive into flink interval joinDeep dive into flink interval join
Deep dive into flink interval join
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System Overview
 

Similar to Data Stream Processing with Apache Flink

Stream Processing with Apache Flink
Stream Processing with Apache FlinkStream Processing with Apache Flink
Stream Processing with Apache Flink
C4Media
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
DataWorks Summit/Hadoop Summit
 
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Apache Flink Taiwan User Group
 
QCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkQCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache Flink
Robert Metzger
 
GOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache FlinkGOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache Flink
Robert Metzger
 
Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016
Kostas Tzoumas
 
Counting Elements in Streams
Counting Elements in StreamsCounting Elements in Streams
Counting Elements in Streams
Jamie Grier
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Apache Apex
 
Apache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream ProcessorApache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream Processor
Aljoscha Krettek
 
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexHadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Apache Apex
 
Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex
DataWorks Summit/Hadoop Summit
 
data Artisans Product Announcement
data Artisans Product Announcementdata Artisans Product Announcement
data Artisans Product Announcement
Flink Forward
 
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Value Association
 
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Apex
 
Zurich Flink Meetup
Zurich Flink MeetupZurich Flink Meetup
Zurich Flink Meetup
Konstantinos Kloudas
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Robert Metzger
 
Debunking Six Common Myths in Stream Processing
Debunking Six Common Myths in Stream ProcessingDebunking Six Common Myths in Stream Processing
Debunking Six Common Myths in Stream Processing
Kostas Tzoumas
 
Ingestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache ApexIngestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache Apex
Apache Apex
 
Chicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureChicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architecture
Robert Metzger
 
Data Stream Processing - Concepts and Frameworks
Data Stream Processing - Concepts and FrameworksData Stream Processing - Concepts and Frameworks
Data Stream Processing - Concepts and Frameworks
Matthias Niehoff
 

Similar to Data Stream Processing with Apache Flink (20)

Stream Processing with Apache Flink
Stream Processing with Apache FlinkStream Processing with Apache Flink
Stream Processing with Apache Flink
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
 
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
 
QCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkQCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache Flink
 
GOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache FlinkGOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache Flink
 
Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016
 
Counting Elements in Streams
Counting Elements in StreamsCounting Elements in Streams
Counting Elements in Streams
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
 
Apache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream ProcessorApache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream Processor
 
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexHadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
 
Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex
 
data Artisans Product Announcement
data Artisans Product Announcementdata Artisans Product Announcement
data Artisans Product Announcement
 
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICS
 
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
 
Zurich Flink Meetup
Zurich Flink MeetupZurich Flink Meetup
Zurich Flink Meetup
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
 
Debunking Six Common Myths in Stream Processing
Debunking Six Common Myths in Stream ProcessingDebunking Six Common Myths in Stream Processing
Debunking Six Common Myths in Stream Processing
 
Ingestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache ApexIngestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache Apex
 
Chicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureChicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architecture
 
Data Stream Processing - Concepts and Frameworks
Data Stream Processing - Concepts and FrameworksData Stream Processing - Concepts and Frameworks
Data Stream Processing - Concepts and Frameworks
 

More from Fabian Hueske

Flink SQL in Action
Flink SQL in ActionFlink SQL in Action
Flink SQL in Action
Fabian Hueske
 
Flink's Journey from Academia to the ASF
Flink's Journey from Academia to the ASFFlink's Journey from Academia to the ASF
Flink's Journey from Academia to the ASF
Fabian Hueske
 
Why and how to leverage the power and simplicity of SQL on Apache Flink
Why and how to leverage the power and simplicity of SQL on Apache FlinkWhy and how to leverage the power and simplicity of SQL on Apache Flink
Why and how to leverage the power and simplicity of SQL on Apache Flink
Fabian Hueske
 
Streaming SQL to unify batch and stream processing: Theory and practice with ...
Streaming SQL to unify batch and stream processing: Theory and practice with ...Streaming SQL to unify batch and stream processing: Theory and practice with ...
Streaming SQL to unify batch and stream processing: Theory and practice with ...
Fabian Hueske
 
Stream Analytics with SQL on Apache Flink
 Stream Analytics with SQL on Apache Flink Stream Analytics with SQL on Apache Flink
Stream Analytics with SQL on Apache Flink
Fabian Hueske
 
Stream Analytics with SQL on Apache Flink
Stream Analytics with SQL on Apache FlinkStream Analytics with SQL on Apache Flink
Stream Analytics with SQL on Apache Flink
Fabian Hueske
 
Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.
Fabian Hueske
 
Juggling with Bits and Bytes - How Apache Flink operates on binary data
Juggling with Bits and Bytes - How Apache Flink operates on binary dataJuggling with Bits and Bytes - How Apache Flink operates on binary data
Juggling with Bits and Bytes - How Apache Flink operates on binary data
Fabian Hueske
 
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data ProcessingApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
Fabian Hueske
 
Apache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce CompatibilityApache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce Compatibility
Fabian Hueske
 
Apache Flink - A Sneek Preview on Language Integrated Queries
Apache Flink - A Sneek Preview on Language Integrated QueriesApache Flink - A Sneek Preview on Language Integrated Queries
Apache Flink - A Sneek Preview on Language Integrated Queries
Fabian Hueske
 
Apache Flink - Akka for the Win!
Apache Flink - Akka for the Win!Apache Flink - Akka for the Win!
Apache Flink - Akka for the Win!
Fabian Hueske
 
Apache Flink - Community Update January 2015
Apache Flink - Community Update January 2015Apache Flink - Community Update January 2015
Apache Flink - Community Update January 2015
Fabian Hueske
 

More from Fabian Hueske (13)

Flink SQL in Action
Flink SQL in ActionFlink SQL in Action
Flink SQL in Action
 
Flink's Journey from Academia to the ASF
Flink's Journey from Academia to the ASFFlink's Journey from Academia to the ASF
Flink's Journey from Academia to the ASF
 
Why and how to leverage the power and simplicity of SQL on Apache Flink
Why and how to leverage the power and simplicity of SQL on Apache FlinkWhy and how to leverage the power and simplicity of SQL on Apache Flink
Why and how to leverage the power and simplicity of SQL on Apache Flink
 
Streaming SQL to unify batch and stream processing: Theory and practice with ...
Streaming SQL to unify batch and stream processing: Theory and practice with ...Streaming SQL to unify batch and stream processing: Theory and practice with ...
Streaming SQL to unify batch and stream processing: Theory and practice with ...
 
Stream Analytics with SQL on Apache Flink
 Stream Analytics with SQL on Apache Flink Stream Analytics with SQL on Apache Flink
Stream Analytics with SQL on Apache Flink
 
Stream Analytics with SQL on Apache Flink
Stream Analytics with SQL on Apache FlinkStream Analytics with SQL on Apache Flink
Stream Analytics with SQL on Apache Flink
 
Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.
 
Juggling with Bits and Bytes - How Apache Flink operates on binary data
Juggling with Bits and Bytes - How Apache Flink operates on binary dataJuggling with Bits and Bytes - How Apache Flink operates on binary data
Juggling with Bits and Bytes - How Apache Flink operates on binary data
 
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data ProcessingApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
 
Apache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce CompatibilityApache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce Compatibility
 
Apache Flink - A Sneek Preview on Language Integrated Queries
Apache Flink - A Sneek Preview on Language Integrated QueriesApache Flink - A Sneek Preview on Language Integrated Queries
Apache Flink - A Sneek Preview on Language Integrated Queries
 
Apache Flink - Akka for the Win!
Apache Flink - Akka for the Win!Apache Flink - Akka for the Win!
Apache Flink - Akka for the Win!
 
Apache Flink - Community Update January 2015
Apache Flink - Community Update January 2015Apache Flink - Community Update January 2015
Apache Flink - Community Update January 2015
 

Recently uploaded

A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
kalichargn70th171
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
Adele Miller
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
Ortus Solutions, Corp
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Anthony Dahanne
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
IES VE
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
rickgrimesss22
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
Juraj Vysvader
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Natan Silnitsky
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
Cyanic lab
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
abdulrafaychaudhry
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
Google
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
WSO2
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
Globus
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
Ortus Solutions, Corp
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Globus
 

Recently uploaded (20)

A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 

Data Stream Processing with Apache Flink

  • 1. Data Stream Processing with Apache Flink Fabian Hueske @fhueske Apache Flink Meetup Madrid, 25.02.2016
  • 2. What is Apache Flink? Apache Flink is an open source platform for scalable stream and batch processing. 2 • The core of Flink is a distributed streaming dataflow engine. • Executes dataflows in parallel on clusters • Provides a reliable backend for various workloads • DataStream and DataSet programming abstractions are the foundation for user programs and higher layers
  • 3. What is Apache Flink? 3 Streaming topologies Long batch pipelines Machine Learning at scale A stream processor with many faces Graph Analysis  resource utilization  iterative algorithms  Mutable state  low-latency processing
  • 4. History & Community of Flink From incubation until now 4
  • 5. 5 Apr ‘14 Jun ‘15Dec ‘14 0.70.60.5 0.9 0.10 Nov ‘15 Top level 0.8 Mar ‘15 1.0!
  • 6. Growing and Vibrant Community Flink is one of the largest and most active Apache big data projects: • more than 150 contributors • more than 600 forks • more than 1000 Github stars (since yesterday) 6
  • 7. Flink Meetups around the Globe 7
  • 8. Flink Meetups around the Globe 8 ✔ 
  • 11. What is Stream Processing? 11  Today, most data is continuously produced • user activity logs, web logs, sensors, database transactions, …  The common approach to analyze such data so far • Record data stream to stable storage (DBMS, HDFS, …) • Periodically analyze data with batch processing engine (DBMS, MapReduce, ...)  Streaming processing engines analyze data while it arrives
  • 12. Why do Stream Processing?  Decreases the overall latency to obtain results • No need to persist data in stable storage • No periodic batch analysis jobs  Simplifies the data infrastructure • Fewer moving parts to be maintained and coordinated  Makes time dimension of data explicit • Each event has a timestamp • Data can be processed based on timestamps 12
  • 13. What are the Requirements?  Low latency • Results in millisecond  High throughput • Millions of events per second  Exactly-once consistency • Correct results in case of failures  Out-of-order events • Process events based on their associated time  Intuitive APIs 13
  • 14. OS Stream Processors so far  Either low latency or high throughput  Exactly-once guarantees only with high latency  Lacking time semantics • Processing by wall clock time only • Events are processed in arrival order, not in the order they were created  Shortcomings lead to complicated system designs • Lambda architecture 14
  • 16. Stream Processing with Flink  Low latency • Pipelined processing engine  High throughput • Controllable checkpointing overhead  Exactly-once guarantees • Distributed snapshots  Support for out-of-order streams • Processing semantics based on event-time  Programmability • APIs similar to those known from the batch world 16
  • 17. Flink in Streaming Architectures 17 Flink Flink Flink Elasticsearch, Hbase, Cassandra, … HDFS Kafka Analytics on static data Data ingestion and ETL Analytics on data in motion
  • 18. The DataStream API Concise and easy-to-grasp code 18
  • 19. The DataStream API 19 case class Event(location: Location, numVehicles: Long) val stream: DataStream[Event] = …; stream .filter { evt => isIntersection(evt.location) }
  • 20. The DataStream API 20 case class Event(location: Location, numVehicles: Long) val stream: DataStream[Event] = …; stream .filter { evt => isIntersection(evt.location) } .keyBy("location") .timeWindow(Time.minutes(15), Time.minutes(5)) .sum("numVehicles")
  • 21. The DataStream API 21 case class Event(location: Location, numVehicles: Long) val stream: DataStream[Event] = …; stream .filter { evt => isIntersection(evt.location) } .keyBy("location") .timeWindow(Time.minutes(15), Time.minutes(5)) .sum("numVehicles") .keyBy("location") .mapWithState { (evt, state: Option[Model]) => { val model = state.orElse(new Model()) (model.classify(evt), Some(model.update(evt))) }}
  • 23. Event-time Processing  Most data streams consist of events • log entries, sensor data, user actions, … • Events have an associated timestamp  Many analysis tasks are based on time • “Average temperature every minute” • “Count of processed parcels per hour” • ...  Events often arrive out-of-order at processor • Distributed sources, network delays, non-synced clocks, …  Stream processor must respect time of events for consistent and sound results • Most stream processors use wall clock time 23
  • 24. Event Processing 24 Events occur on devices Queue / Log Events analyzed in a stream processor Stream Analysis Events stored in a log
  • 28. Event Processing 28 Out of order!!! First burst of events Second burst of events
  • 29. Event Processing 29 Event time windows Arrival time windows Instant event-at-a-time Flink supports out-of-order streams (event time) windows, arrival time windows (and mixtures) plus low latency processing. First burst of events Second burst of events
  • 30. Event-time Processing  Event-time processing decouples job semantics from processing speed  Analyze events from static data store and online stream using the same program  Semantically sound and consistent results  Details: http://data-artisans.com/how-apache-flink-enables-new- streaming-applications-part-1 30
  • 32. Monitoring & Dashboard  Many metrics exposed via REST interface  Web dashboard • Submit, stop, and cancel jobs • Inspect running and completed jobs • Analyze performance • Check exceptions • Inspect configuration • … 32
  • 33. Highly-available Cluster Setup  Stream applications run for weeks, months, … • Application must never fail! • No single-point-of-failure component allowed  Flink supports highly-available cluster setups • Master failures are resolved using Apache Zookeeper • Worker failures are resolved by master  Stand-alone cluster setup • Requires (manually started) stand-by masters and workers  YARN cluster setup • Masters and workers are automatically restarted 33
  • 34.  A save point is a consistent snapshot of a job • Includes source offsets and operator state • Stop job • Restart job from save point  What can I use it for? • Fix or update your job • A/B testing • Update Flink • Migrate cluster • …  Details: http://data-artisans.com/how-apache-flink-enables-new- streaming-applications Save Points 34
  • 35. Performance: Summary 35 Continuous streaming Latency-bound buffering Distributed Snapshots High Throughput & Low Latency With configurable throughput/latency tradeoff Details: http://data-artisans.com/high-throughput-low-latency- and-exactly-once-stream-processing-with-apache-flink
  • 36. Integration (picture not complete) 36 POSIX Java/Scala Collections POSIX
  • 37. Post v1.0 Roadmap What’s coming next? 37
  • 38. Stream SQL and Table API  Structured queries over data streams • LINQ-style Table API • Stream SQL  Based on Apache Calcite • SQL Parser and optimizer  “Compute every hour the number of orders and number ordered units for each product.” 38 SELECT STREAM productId, TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime, COUNT(*) AS cnt, SUM(units) AS units FROM Orders GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId;
  • 39. Complex Event Processing  Identify complex patterns in event streams • Correlations & sequences  Many applications • Network intrusion detection via access patterns • Item tracking (parcels, devices, …) • …  CEP depends on low latency processing • Most CEP system are not distributed  CEP in Flink • Easy-to-use API to define CEP patterns • Integration with Table API for structured analytics • Low-latency and high-throughput engine 39
  • 40. Dynamic Job Parallelism  Adjusting parallelism of tasks without (significantly) interrupting the program  Initial version based on save points • Trigger save point • Stop job • Restart job with adjusted parallelism  Later change parallelism while job is running  Vision is automatic adaption based on throughput 40
  • 41. Wrap up!  Flink is a kick-ass stream processor… • Low latency & high throughput • Exactly-once consistency • Event-time processing • Support for out-of-order streams • Intuitive API  with lots of features in the pipeline…  and a reliable batch processor as well! 41
  • 42. I ♥ Squirrels, do you?  More Information at • http://flink.apache.org/  Free Flink training at • http://dataartisans.github.io/flink-training  Sign up for user/dev mailing list  Get involved and contribute  Follow @ApacheFlink on Twitter 42
  • 43. 43

Editor's Notes

  1. Flink is an analytical system streaming topology: real-time; low latency “native”: build-in support in the system, no working around, no black-box next slide: define native by some “non-native” examples
  2. People previously made the case that high throughput and low latency are mutually exclusive