Flink Pure
Streaming
Paco Guerrero
Big Data & Solutions Architect 9/21/16
Not for
Geeks
Life as Time
Anything as Time
Flink as Time
Streaming vs Batch
“Abstraction of reality used to facilitate information processing”
Batch
[Diagram: All Input → Batch Job → All Output]
Nothing about time
Timestamps used as trick to
keep real time fingerprint
Streaming
“Streaming is the next programming paradigm for
data applications, and you need to start thinking in
terms of streams”
“Continuous processing of data that is continuously produced”
Data Stream: Infinite sequence of data arriving in a
continuous fashion.
Stream processing is the backbone of the new data
infrastructure.
“The world beyond batch”
A high-level tour of modern data-processing concepts. By Tyler Akidau
August 5, 2015 https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
Streaming
Streaming Job
Real Life Time !!
Streaming is the biggest change in
data infrastructure since Hadoop
Streaming
The biggest change in moving from
batch to streaming is handling time explicitly
Streaming
Micro Batch
[Diagram: the input is cut into micro batches; Batch Job 1, Batch Job 2 and Batch Job 3 each take All Input of one micro batch and produce All Output]
Batch Frequency ?
Timestamps keep real time
fingerprint
Streaming Technologies
Batch | Micro Batch | Streaming
Stateless – Record acknowledgements
CPU bounded performance
Not expressive declarative
functional API – Low Level API
Not auto scaling
Low level programmatic topology
Poor Streaming Windows
functionalities
Not compatible with Hadoop APIs
Streams
Apache Flink
Flink
Open Source Stream Processing Framework.
Last available Release 1.1.1
Top Level Apache Project since Dec '14
Main Features
Native Stream
Low Latency
High throughput
Stateful
Exactly-once guarantees
Distributed
Expressive APIs
And more ….
Flink Integration
[Diagram: systems Flink integrates with, including YARN; more upcoming]
Flink Stack
Flink Runtime Engine
Distributed pipelined processing
Execute everything as Stream
Iterative ( cyclic ) dataflows
Mutable state in operations
Operate on managed memory (*)
Also works on batch !!
Workers ( Task Managers )
Job Manager
Client
Optimizer
Dataflow
Graph
Execution
Graph
Flink Runtime Engine
[Diagram: Stream Job, Batch Job, ML Job and Graph Job each pass through an optimizer into the Flink Runtime Engine]
Tasks scheduled and executed in workers ( slots )
Tasks as chain of operators
Run operator logic in a pipelined fashion
If you want to know one thing about Flink,
it is that you don't need to know
the internals of Flink
Building Blocks
Event Time & Windows
Fault Tolerance & Correctness
State Handling
Low Latency & High Throughput
API, Libraries, SQL
Building Blocks: Event Time & Windows
• Time references
• Out of order events
• Powerful Windowing
Event Times & Windowing
[Diagram: events carry their Event Time; the Flink Data Source assigns Ingestion Time; the Flink Window Operator works at Processing Time]
Explicit handling of time. 3 choices:
Event time: when data is generated
Ingestion time: when data is loaded from source
Processing time: when data is processed
Event time helps to process out-of-order events and replay elements as they occurred ( deterministic results )
Event time. Out of Order
[Figure: two sources emit events 1 2 3 5 7 and 4 6 8 9 10, so the merged stream is Out of Order. Panels: Ingestion Time Windows, Event Time Windows.]
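The difference between the two window panels can be sketched in a few lines of Python. This is only an illustration, not Flink code: the arrival order, the field names and the window size of 5 are assumptions chosen so that event 4 arrives after event 7.

```python
from collections import defaultdict

def tumbling_windows(events, time_field, size):
    """Group event ids into fixed, non-overlapping windows of `size` time units."""
    windows = defaultdict(list)
    for event in events:
        start = (event[time_field] // size) * size
        windows[start].append(event["id"])
    return {start: sorted(ids) for start, ids in sorted(windows.items())}

# Hypothetical arrival order: event 4 was generated early but arrives late.
arrival_order = [1, 2, 3, 5, 7, 4, 6, 8, 9, 10]
events = [
    # event_time comes from the event itself, ingestion_time from its arrival position
    {"id": i, "event_time": i - 1, "ingestion_time": position}
    for position, i in enumerate(arrival_order)
]

by_ingestion = tumbling_windows(events, "ingestion_time", size=5)
by_event = tumbling_windows(events, "event_time", size=5)

print(by_ingestion)  # {0: [1, 2, 3, 5, 7], 5: [4, 6, 8, 9, 10]}
print(by_event)      # {0: [1, 2, 3, 4, 5], 5: [6, 7, 8, 9, 10]}
```

Windows keyed on ingestion time depend on when events happen to arrive, while windows keyed on event time give the same result for any arrival order.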
Event time. Watermarks
[Figure: the same Out of Order stream (1 2 3 5 7 and 4 6 8 9 10). Panels: Ingestion Time Windows, Event Time Windows.]
Watermark 5: no event with event time before 5 will come. Late Time of 2.
Watermark 10: no event with event time before 10 will come. Late Time of 2.
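A rough Python sketch of how a watermark could drive event-time windows. It is a simplification under stated assumptions, not Flink's implementation: the watermark is taken as the highest event time seen minus a fixed `max_lateness`, a window fires once the watermark reaches its end, and events for an already fired window are dropped. The function name and arrival order are hypothetical.

```python
def event_time_windows(arrivals, size, max_lateness):
    """Fire each tumbling window once the watermark reaches the window end."""
    open_windows, fired, dropped = {}, [], []
    watermark = float("-inf")
    for t in arrivals:
        start = (t // size) * size
        if start + size <= watermark:
            dropped.append(t)  # its window has already fired: too late
            continue
        open_windows.setdefault(start, []).append(t)
        # Watermark: "no event with event time before this value will come".
        watermark = max(watermark, t - max_lateness)
        for s in sorted(open_windows):
            if s + size <= watermark:
                fired.append((s, sorted(open_windows.pop(s))))
    return fired, open_windows, dropped

arrivals = [1, 2, 3, 5, 7, 4, 6, 8, 9, 10]  # event 4 arrives after event 7

fired_2, _, dropped_2 = event_time_windows(arrivals, size=5, max_lateness=2)
fired_3, _, dropped_3 = event_time_windows(arrivals, size=5, max_lateness=3)

print(fired_2, dropped_2)  # [(0, [1, 2, 3])] [4]
print(fired_3, dropped_3)  # [(0, [1, 2, 3, 4])] []
```

With a lateness of 2 the first window fires before event 4 shows up; allowing a lateness of 3 keeps it open long enough, at the cost of a later result.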
Windowing
Windows: grouping of events according to time, session*, count
Powerful built-in windows:
Count: number of events to trigger the window. Process X last events each Y events.
Time:
• Tumbling: trigger every X time with received events
• Sliding: trigger every X time with received events in last Y time
Session: all events from session/user X until session time expired ( Gap )
High level API for user windows: Window Assigner, Trigger, Evictor
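The window types above can be illustrated with small assigner functions. This is a plain Python sketch with hypothetical function names, not the Flink Window Assigner API; windows are half-open intervals `[start, end)` and the session rule is a simplified version of gap-based merging.

```python
def tumbling(t, size):
    """The single fixed-size window that contains timestamp t."""
    start = (t // size) * size
    return [(start, start + size)]

def sliding(t, size, slide):
    """Every window of length `size`, starting at a multiple of `slide`, that contains t."""
    last_start = (t // slide) * slide
    return [(s, s + size) for s in range(last_start, t - size, -slide)]

def sessions(timestamps, gap):
    """Split timestamps into sessions separated by more than `gap` of inactivity."""
    result = []
    for t in sorted(timestamps):
        if result and t - result[-1][-1] <= gap:
            result[-1].append(t)  # still within the gap: same session
        else:
            result.append([t])    # gap expired: start a new session
    return result

print(tumbling(7, size=5))                        # [(5, 10)]
print(sliding(7, size=10, slide=5))               # [(5, 15), (0, 10)]
print(sessions([1, 2, 3, 10, 11, 30], gap=5))     # [[1, 2, 3], [10, 11], [30]]
```

A tumbling window assigns each event to exactly one window, a sliding window to `size / slide` overlapping windows, and a session window has no fixed length at all.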
Building Blocks: State Handling
• Managed operator state for backup/recovery
• Savepoints
Stateful Streaming
[Diagram: Stateless Stream Processing (Op) vs Stateful Stream Processing (Op with State)]
• Built-in internal state in each operator for exactly-once semantics
• User state can be declared in each operator to be saved locally in memory ( API, key/value pairs )
• Snapshots: periodically, local in-memory states are persisted in lightweight distributed snapshots. No global pause !!
• Checkpoint as a globally consistent point-in-time snapshot built from a set of distributed snapshots.
• Pluggable state backend for snapshots: JobManager, HDFS, RocksDB
• Savepoints: user-triggered retained checkpoint
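As a toy illustration of key/value operator state with snapshot and restore, the following Python class keeps a count per key. The class and method names are hypothetical and unrelated to Flink's state API; the snapshot is an in-memory copy standing in for a pluggable state backend.

```python
import copy

class CountingOperator:
    """Operator with local key/value state: one running count per key."""

    def __init__(self):
        self.state = {}

    def process(self, key):
        self.state[key] = self.state.get(key, 0) + 1
        return key, self.state[key]

    def snapshot(self):
        # A real backend would persist this copy outside the operator.
        return copy.deepcopy(self.state)

    def restore(self, snapshot):
        self.state = copy.deepcopy(snapshot)

op = CountingOperator()
for key in ["a", "b", "a"]:
    op.process(key)

checkpoint = op.snapshot()   # state is {"a": 2, "b": 1}
op.process("a")              # work done after the snapshot
op.restore(checkpoint)       # simulated recovery rolls the state back

print(op.state)  # {'a': 2, 'b': 1}
```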
Building Blocks: Fault Tolerance & Correctness
• Exactly-once semantics with managed operator state
• Distributed Snapshotting Algorithm
Handling Checkpoints
Periodically
Chandy-Lamport Snapshots
“The global-state-detection algorithm is to be superimposed
on the underlying computation:
it must run concurrently with, but not alter, this underlying
computation”
. Triggers snapshots asynchronously
. Embedded snapshots algorithm in stream of data ( barriers )
. No global pause, lightweight impact in performance
Handling Checkpoints
[Diagram: the Job Manager periodically pushes barriers for new state X+1 into the stream; each Task N takes its snapshot and sends an Ack for Snapshot state X to the Job Manager]
All Acks received: Register Checkpoint for restore in case of failure
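The barrier idea can be sketched with a single operator reading two input channels. This is a simplified Python simulation, not Flink's runtime: the `BARRIER` marker, the function name and the round-robin reading are assumptions, and both channels are assumed to carry the same number of barriers. The operator stops reading a channel once its barrier arrives and snapshots its state only when the barrier has arrived on every channel.

```python
BARRIER = "BARRIER"

def run_aligned_sum(channel_a, channel_b):
    """Sum records from two channels; snapshot the sum when both barriers have arrived."""
    queues = {"a": list(channel_a), "b": list(channel_b)}
    blocked = set()   # channels whose barrier has already arrived
    total = 0
    snapshots = []
    while any(queues.values()):
        progressed = False
        for name, queue in queues.items():
            if not queue or name in blocked:
                continue  # blocked channels wait for the other barrier
            item = queue.pop(0)
            progressed = True
            if item == BARRIER:
                blocked.add(name)
            else:
                total += item
        if blocked == set(queues):
            snapshots.append(total)  # consistent state: only pre-barrier records
            blocked.clear()
        elif not progressed:
            break  # unmatched barrier: nothing more can be read
    return total, snapshots

total, snapshots = run_aligned_sum([1, 2, BARRIER, 3], [10, BARRIER, 20, 30])
print(total, snapshots)  # 66 [13]
```

The snapshot holds 13 (1 + 2 + 10), exactly the records that precede the barrier on each channel, while processing continues afterwards without a global pause.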
Streaming Fault Tolerance
In case of failure, the last global checkpoint is recovered
( recovery from partial checkpoint / individual snapshots is coming )
A stateful source like Kafka is needed to ensure end-to-end exactly-once
semantics in case of failure.
Kafka sink doesn't guarantee end-to-end exactly-once ( multiple writes in
topic ) ( at-least-once )
Semantics in Flink:
At Least Once: never loses events, events might be reprocessed
Exactly once: neither reprocessed nor lost events.
Exactly once by default, with low impact in performance
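The two semantics can be contrasted with a toy replay loop. This Python sketch is an assumption-laden model, not Flink's recovery code: the source is a list that can be rewound to a saved offset, one failure is injected, and "at least once" is loosely modelled as replaying events without rolling the count back. It assumes `checkpoint_at` comes before `fail_at`.

```python
def count_with_failure(events, checkpoint_at, fail_at, restore_state):
    """Count events, fail once at `fail_at`, then replay from the checkpointed offset."""
    count, offset = 0, 0
    saved = None
    failed = False
    while offset < len(events):
        if offset == checkpoint_at and saved is None:
            saved = (offset, count)      # checkpoint: source offset + operator state
        if offset == fail_at and not failed:
            failed = True
            offset = saved[0]            # the replayable source rewinds
            if restore_state:
                count = saved[1]         # the state is rolled back with it
            continue
        count += 1
        offset += 1
    return count

events = list(range(10))
exactly_once = count_with_failure(events, checkpoint_at=4, fail_at=7, restore_state=True)
at_least_once = count_with_failure(events, checkpoint_at=4, fail_at=7, restore_state=False)

print(exactly_once)   # 10
print(at_least_once)  # 13: events 4, 5 and 6 are counted twice
```

No event is lost in either run; only restoring the state together with the source offset avoids counting replayed events twice.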
If you want to know one thing about Flink,
it is that you don't need to know
the internals of Flink
Building Blocks: Low Latency & High Throughput
• Pipelined runtime
• Latency vs throughput tuning
Performance Tuning
Exactly-once semantics with low impact in performance
Controllable checkpointing overhead
Higher throughput using processing time
Performance improvements thanks to:
. operator chaining during optimization phase
. own optimized serialization stack with code generation
Streaming Benchmark
Benchmark for “Streaming Computation” published by Yahoo. Dec 18, 2015
https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
Production use-case
• counting ad impressions grouped by campaign
• aggregations over a 10 second window
• save current aggregate value to Redis every second
Performance Tuning
[Figure: Throughput vs Latency Graph. Axes: Throughput ( 1000 events / sec ), 99 Percentile Latency ( ms )]
No operator combining in Storm: more complicated topology, more steps for events and more overhead
Apache Storm Without Trident
• At least once / Double counting after failure / Lost state after failures
• CPU bounded
Apache Spark
• Latency increases with throughput
Apache Flink
• Exactly once / No double counting / No state loss
• Limited by bandwidth between Kafka and Flink cluster ( 1 GigE ).
• Kafka brokers within Kafka Cluster ( 10 GigE )
• Achieved 15 million messages / sec ( before 3 million messages / sec ) with exactly-once semantics
[Figure: throughput with 1 GigE vs 10 GigE; axis ticks 10,000,000 and 20,000,000]
Building Blocks: API
• High Level API
• Wide range of basic and advanced operators
• Java, Scala. Python soon !!
API
Working on data streams ( bounded ? )
Stream Processing: Explicit Handling of Time
Java & Scala. Python coming.
Java: Bean type classes vs Tuples with position addresses.
Scala: case classes.
Operators:
Sources: Kafka, FileSystem, Cassandra …
Sinks: Kafka, HDFS, Cassandra ….
Transformations:
Basic: map, flatmap, filter, grouping, iterate, project, join, cross, …
Streaming: Windowing + Aggregations, Temporal Binary
Iterative Stream operators
DataStream<?> DataSet<?>
Core API
1 implementation*, 2 interfaces
Operators
[Diagram: a dataflow of operators: Source, Map, Filter, Reduce, Join, Sum, Sink; successive slides highlight Map, Filter, Reduce and Join]
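A dataflow of chained operators can be imitated with Python generators, where each record flows through the whole chain before the next one is read. This is an analogy only, not the DataStream API: the function names and the sample input are made up for the illustration.

```python
from collections import defaultdict

def source():
    """Source: emits lines (a finite sample standing in for a stream)."""
    yield from ["flink streams data", "batch is bounded streams data"]

def flat_map(lines):
    """FlatMap: one line in, many words out."""
    for line in lines:
        yield from line.split()

def keep_long_words(words):
    """Filter: drop words of two characters or fewer."""
    return (w for w in words if len(w) > 2)

def running_count(words):
    """Keyed reduce: emit the updated count for a word each time it is seen."""
    counts = defaultdict(int)
    for w in words:
        counts[w] += 1
        yield w, counts[w]

# Sink: collect the results into a list.
results = list(running_count(keep_long_words(flat_map(source()))))
print(results)
# [('flink', 1), ('streams', 1), ('data', 1), ('batch', 1),
#  ('bounded', 1), ('streams', 2), ('data', 2)]
```

Because generators are lazy, no operator waits for the previous one to finish, which loosely mirrors running operator logic in a pipelined fashion.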
Building Blocks: SQL
• Easy to use. SQL !!
• Based on Apache Calcite
Table API
API extension for DataSets and DataStreams
Based on relational Table abstraction
Table <=> Source / DataSet / DataStream
Operators like: where, select, as, groupBy, join, union, minus, distinct, orderBy, ...
SQL Support
Execute SQL-like statements on DataSets and DataStreams
Results returned as Table ( Table API ), convertible to DataStream or DataSets
SQL and Table API can be seamlessly mixed over DataStream/DataSets
Flink’s SQL support is not feature complete, yet.
Queries that include unsupported SQL will fail !!
Apache Calcite
Parsing and logical plans for Table operators and SQL are optimized using Apache Calcite
Only a subset of the comprehensive SQL standard is supported
Apache Calcite provides:
SQL Parsing
API for building expressions in relational algebra
Query planning engine
Provides SQL for Streaming Queries with window aggregations
SELECT STREAM TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime, productId, COUNT(*) AS c, SUM(units) AS units
FROM Orders
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId
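To make clear what such a windowed streaming query computes, here is a hedged Python equivalent over a handful of in-memory rows. It assumes hourly tumbling windows grouped by product; the tuple layout (rowtime in seconds, productId, units) and the sample orders are invented for the illustration and have nothing to do with how Calcite or Flink execute the query.

```python
from collections import defaultdict

HOUR = 3600  # seconds

def hourly_totals(orders):
    """Per hourly tumbling window and product: (window_end, productId, count, units)."""
    totals = defaultdict(lambda: [0, 0])
    for rowtime, product_id, units in orders:
        window_end = (rowtime // HOUR + 1) * HOUR   # end of the hour containing rowtime
        entry = totals[(window_end, product_id)]
        entry[0] += 1       # COUNT(*)
        entry[1] += units   # SUM(units)
    return [(end, pid, c, u) for (end, pid), (c, u) in sorted(totals.items())]

orders = [(10, "A", 2), (20, "B", 1), (3500, "A", 3), (3700, "A", 5)]
rows = hourly_totals(orders)
print(rows)  # [(3600, 'A', 2, 5), (3600, 'B', 1, 1), (7200, 'A', 1, 5)]
```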
SQL Statement → Apache Calcite: SQL to Logical Plan as Relational Algebra → Flink Optimizer: Logical Plan to Execution Plan
If you want to know
one thing about Flink,
it is that you don't need
to know the internals of Flink
So … Batch
Batch on Stream
Stream: Unbounded Data Stream
Batch: Bounded stream ( dataset ) on a stream processor
Global window over the entire dataset
Optimization in operators for joins and grouping,
with blocking data exchange if needed
Batch specific optimizations:
Cost-based optimizer: dataset size known beforehand
Managed memory on / off-heap for join, sort, …
Optimized serialization stack for user types
[Diagram: a Bounded Data Set as a finite section of an Unbounded Data Stream]
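The idea of batch as a bounded stream with one global window can be shown with two small Python generators. This is a conceptual sketch with hypothetical names: the streaming version emits a result per record and never has to end, while the batch version reads the same kind of input and emits once when the bounded input is exhausted.

```python
from itertools import count, islice

def running_sum(stream):
    """Streaming: emit an updated result for every record of a possibly endless stream."""
    total = 0
    for value in stream:
        total += value
        yield total

def global_window_sum(bounded_stream):
    """Batch: one global window over the whole input, one result at the end."""
    total = 0
    for value in bounded_stream:
        total += value
    yield total

unbounded = count(1)                                   # 1, 2, 3, ... never ends
first_results = list(islice(running_sum(unbounded), 3))
batch_result = list(global_window_sum([1, 2, 3]))      # bounded data set

print(first_results)  # [1, 3, 6]
print(batch_result)   # [6]
```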
Conclusions
Flink Pure streaming engine matches real life. No Abstraction
Batch on streaming
Flexible Windowing Semantics with Explicit Time handling
Competitive Performance, low latency and high throughput
Apache Beam, open sourced by Google, uses Flink as its first order runner for
Batch and Streaming processing in partnership with Data Artisans.
100% Compliance of data processing model “what, where, when, how “
Thank you!

 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 

Flink. Pure Streaming

  • 1. Flink Pure Streaming Paco Guerrero Big Data & Solutions Architect 9/21/16
  • 8. Batch: “Abstraction of reality used to facilitate information processing”
  • 10. Batch
  • 13. Batch [diagram: All Input, Batch Job, All Output]. The batch model says nothing about time; timestamps are used as a trick to keep a real-time fingerprint.
  • 14-18. Streaming: “Continuous processing of data that is continuously produced.” “Streaming is the next programming paradigm for data applications, and you need to start thinking in terms of streams.” Data stream: an infinite sequence of data arriving in a continuous fashion. Stream processing is the backbone of the new data infrastructure. Reference: “The world beyond batch”, a high-level tour of modern data-processing concepts, by Tyler Akidau, August 5, 2015. https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
  • 23. Streaming: streaming is the biggest change in data infrastructure since Hadoop.
  • 24. Streaming: the biggest change in moving from batch to streaming is handling time explicitly.
  • 28. Micro Batch [diagram: Batch Job 1, Batch Job 2, All Output]
  • 29. Micro Batch [diagram: All Input; Batch Job 1, Batch Job 2, Batch Job 3; All Output after each job]. Batch frequency? Timestamps keep the real-time fingerprint.
  • 30-33. Streaming Technologies [columns: Batch, Micro Batch, Streaming]. Streams: stateless, with record acknowledgements; CPU-bound performance; no expressive, declarative, functional API (low-level API only); no auto scaling; low-level programmatic topology; poor streaming window functionality; not compatible with Hadoop APIs.
  • 39-41. Flink: open source stream processing framework. Last available release: 1.1.1. Top-level Apache project since Dec '14. Main features: native streaming, low latency, high throughput, stateful, exactly-once guarantees, distributed, expressive APIs, and more.
  • 42. Flink
  • 43. Flink
  • 52-53. Flink Runtime Engine: distributed pipelined processing; executes everything as a stream; iterative (cyclic) dataflows; mutable state in operations; operates on managed memory. (*) Also works on batch!! [diagram: Client, Optimizer, Dataflow Graph, Job Manager, Execution Graph, Workers (Task Managers)]
  • 54-58. Flink Runtime Engine [diagram: Stream Job, Batch Job, ML Job and Graph Job, each with an optimizer, on top of the Flink Runtime Engine]. Tasks are scheduled and executed in workers (slots). A task is a chain of operators. Operator logic runs in a pipelined fashion.
  • 59. If you want to know one thing about Flink, it is that you don't need to know the internals of Flink.
  • 60-61. Building Blocks: Event Time & Windows; Fault Tolerance & Correctness; State Handling; Low Latency & High Throughput; API; Libraries; SQL. Event Time & Windows: time references; out-of-order events; powerful windowing.
  • 62-65. Event Times & Windowing [diagram: Event Time; Ingestion Time at the Flink Data Source; Processing Time at the Flink Window Operator]
  • 66. Event Times & Windowing: explicit handling of time, with 3 choices. Event time: when the data is generated. Ingestion time: when the data is loaded from the source. Processing time: when the data is processed. Event time helps to process out-of-order events and to replay elements as they occurred (deterministic results).
  • 67. Event Times & Windowing
  • 68-72. Event time. Out of Order [diagram: the out-of-order arrival sequence 1 2 3 5 7 4 6 8 9 10, shown grouped into Ingestion Time Windows and into Event Time Windows]
  • 74-75. Event time. Watermarks [diagram: the same arrival sequence, with Ingestion Time Windows and Event Time Windows]
  • 76. Event time. Watermarks [diagram as above, watermark at 5]. No event with an event time before 5 will come. Late time of 2.
  • 77. Event time. Watermarks [diagram as above, watermark at 10]. No event with an event time before 10 will come. Late time of 2.
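The watermark idea on slides 74-77 can be sketched in a few lines of plain Python. This is not Flink code and does not reproduce the numbers on the slide diagrams: the function name `event_time_windows`, the window size of 5, and the rule "watermark = highest event time seen minus an allowed out-of-orderness" are all assumptions made for illustration. A tumbling event-time window fires once the watermark reaches its end; anything that arrives for an already fired window is reported as dropped.

```python
from collections import defaultdict


def event_time_windows(events, window_size, max_out_of_orderness):
    """Group event times into tumbling event-time windows driven by a watermark.

    Returns (fired, dropped): fired is a list of (window_start, sorted events)
    in firing order, dropped holds events that arrived after their window fired.
    """
    open_windows = defaultdict(list)  # window start -> buffered event times
    closed = set()                    # starts of windows that already fired
    fired, dropped = [], []
    max_seen = None

    for t in events:
        start = (t // window_size) * window_size
        if start in closed:
            dropped.append(t)         # too late: the window is already closed
        else:
            open_windows[start].append(t)

        # Assumed watermark rule: trail the highest event time seen so far.
        max_seen = t if max_seen is None else max(max_seen, t)
        watermark = max_seen - max_out_of_orderness

        # Fire every open window whose end is covered by the watermark.
        for s in sorted(open_windows):
            if s + window_size <= watermark:
                fired.append((s, sorted(open_windows.pop(s))))
                closed.add(s)

    # Bounded input: flush whatever is still open at the end.
    for s in sorted(open_windows):
        fired.append((s, sorted(open_windows.pop(s))))
    return fired, dropped


arrivals = [1, 2, 3, 5, 7, 4, 6, 8, 9, 10]  # arrival order from the slides
print(event_time_windows(arrivals, window_size=5, max_out_of_orderness=3))
print(event_time_windows(arrivals, window_size=5, max_out_of_orderness=2))
```

With these assumptions, a tolerance of 3 is enough for the late event 4 to land in its window, while a tolerance of 2 closes the first window too early and drops it. That is the latency versus completeness trade-off a watermark encodes.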
  • 78-83. Windowing: windows group events according to time, session* or count. Powerful built-in windows. Count: the number of events that triggers the window; process the last X events every Y events. Time: Tumbling (trigger every X time units with the events received) and Sliding (trigger every X time units with the events received in the last Y time units). Session: all events from session/user X until the session time expires (gap). High-level API for user-defined windows: Window Assigner, Trigger, Evictor.
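A minimal sketch of how the three time-based window types above decide which window(s) an event belongs to. It is plain Python, not the Flink window assigner API; the function names, integer timestamps, and the rule that a session continues while the gap between consecutive events is strictly smaller than `gap` are assumptions for illustration.

```python
def tumbling_window(t, size):
    """Return the single [start, end) window of length `size` that contains t."""
    start = t - (t % size)
    return (start, start + size)


def sliding_windows(t, size, slide):
    """Return every [start, end) window of length `size`, started every `slide`,
    that contains t. With slide < size an event falls into several windows."""
    windows = []
    start = t - (t % slide)          # latest window start that is <= t
    while start > t - size:          # the window still covers t
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def session_windows(timestamps, gap):
    """Split timestamps into sessions: a new session starts whenever the
    distance to the previous event is at least `gap`."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] < gap:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


print(tumbling_window(7, size=5))
print(sliding_windows(7, size=10, slide=5))
print(session_windows([1, 2, 3, 10, 11, 30], gap=5))
```

In the slide's wording, `slide` plays the role of X (how often a sliding window triggers) and `size` the role of Y (how far back it looks).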
  • 84. Building Blocks: Event Time & Windows; Fault Tolerance & Correctness; State Handling; Low Latency & High Throughput; API; Libraries; SQL. State Handling: managed operator state for backup/recovery; savepoints.
  • 86. Stateful Streaming [diagram: stateless stream processing (Op) vs stateful stream processing (Op + State)]. Built-in internal state in each operator for exactly-once semantics. User state can be declared in each operator and is saved locally in memory (API, key/value pairs). Snapshots: periodically, the local in-memory states are persisted in lightweight distributed snapshots. No global pause!! Checkpoint: a globally consistent point-in-time snapshot built from a set of distributed snapshots. Pluggable state backend for snapshots: JobManager, HDFS, RocksDB. Savepoints: user-triggered retained checkpoints.
  • 87. Building Blocks: Event Time & Windows; Fault Tolerance & Correctness; State Handling; Low Latency & High Throughput; API; Libraries; SQL. Fault Tolerance & Correctness: exactly-once semantics with managed operator state; distributed snapshotting algorithm.
  • 88-90. Handling Checkpoints: periodic Chandy-Lamport snapshots. “The global-state-detection algorithm is to be superimposed on the underlying computation: it must run concurrently with, but not alter, this underlying computation.” Snapshots are triggered asynchronously. The snapshot algorithm is embedded in the stream of data (barriers). No global pause, and a lightweight impact on performance.
  • 91. Handling Checkpoints [diagram: snapshot]. The Job Manager periodically pushes barriers for a new state (new state X+1) and receives an ack for the snapshot of state X from Task N.
  • 94. Handling Checkpoints [diagram: snapshot, Job Manager]. Once all acks are received, the checkpoint is registered for restore in case of failure.
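The barrier-and-ack flow on slides 88-94 can be illustrated with a single-process Python sketch. It is a deliberate simplification and not Flink's implementation: the classes `Barrier`, `Task` and `Coordinator` are hypothetical names, the pipeline is a straight chain with one input per task (so no barrier alignment is needed), and snapshots are taken synchronously.

```python
class Barrier:
    """Marker injected into the record stream to delimit checkpoint N."""
    def __init__(self, checkpoint_id):
        self.checkpoint_id = checkpoint_id


class Coordinator:
    """Stands in for the Job Manager: collects acks and registers checkpoints."""
    def __init__(self, task_names):
        self.task_names = set(task_names)
        self.pending = {}    # checkpoint id -> {task name: snapshotted state}
        self.completed = {}  # checkpoint id -> full set of task states

    def ack(self, checkpoint_id, task_name, state):
        acks = self.pending.setdefault(checkpoint_id, {})
        acks[task_name] = state
        if set(acks) == self.task_names:
            # All acks received: the checkpoint becomes usable for restore.
            self.completed[checkpoint_id] = self.pending.pop(checkpoint_id)


class Task:
    """An operator with local state, updated by `update(state, record)`."""
    def __init__(self, name, update, coordinator):
        self.name, self.update, self.coordinator = name, update, coordinator
        self.state = 0

    def process(self, item):
        if isinstance(item, Barrier):
            # Snapshot the state as of this point in the stream, then ack.
            self.coordinator.ack(item.checkpoint_id, self.name, self.state)
        else:
            self.state = self.update(self.state, item)
        return item  # records and barriers both flow on downstream


def run_pipeline(stream, tasks):
    for item in stream:
        for task in tasks:
            item = task.process(item)


coordinator = Coordinator(["count", "sum"])
tasks = [
    Task("count", lambda state, record: state + 1, coordinator),
    Task("sum", lambda state, record: state + record, coordinator),
]
run_pipeline([3, 4, Barrier(1), 5, Barrier(2)], tasks)
print(coordinator.completed)
```

Because the barrier travels in-band with the records, each task's snapshot reflects exactly the records that preceded the barrier, and processing never stops globally.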
  • 95. Streaming Fault Tolerance: in case of failure, the last global checkpoint is recovered (recovery from partial checkpoints / individual snapshots is coming). A stateful source such as Kafka is needed to ensure end-to-end exactly-once semantics in case of failure. The Kafka sink doesn't guarantee end-to-end exactly-once (multiple writes to the topic); it is at-least-once. Semantics in Flink: At least once: events are never lost, but they might be reprocessed. Exactly once: events are neither reprocessed nor lost. Exactly once is the default, with low impact on performance.
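To make the difference between the two guarantees concrete, here is a small Python simulation of recovery from the last checkpoint with a replayable source. Everything in it is an assumption for illustration: the source is a list addressed by offset (standing in for a Kafka-like log), the operator state is a running sum, and the hypothetical `restore_state` flag controls whether the state is rolled back together with the source offset.

```python
def run(records, checkpoint_every, fail_at=None, restore_state=True):
    """Sum `records`, checkpointing (offset, total) every `checkpoint_every`
    records and simulating one crash just before reading offset `fail_at`."""
    checkpoint = (0, 0)          # (next offset to read, running total)
    offset, total = checkpoint
    failed = False

    while offset < len(records):
        if fail_at is not None and offset == fail_at and not failed:
            failed = True
            ckpt_offset, ckpt_total = checkpoint
            offset = ckpt_offset           # rewind the replayable source
            if restore_state:
                total = ckpt_total         # roll the state back as well
            continue

        total += records[offset]
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, total)   # consistent (offset, state) pair

    return total


records = [1, 2, 3, 4, 5, 6]
print(run(records, checkpoint_every=2))                               # no failure
print(run(records, checkpoint_every=2, fail_at=3))                    # state + offset restored
print(run(records, checkpoint_every=2, fail_at=3, restore_state=False))  # replay only
```

Restoring state and offset together gives the same total as the failure-free run. Replaying without rolling back the state counts the replayed record twice, which is the at-least-once behaviour described above.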
  • 96. If you want to know one thing about Flink, it is that you don't need to know the internals of Flink.
  • 97. Building Blocks: Event Time & Windows; Fault Tolerance & Correctness; State Handling; Low Latency & High Throughput; API; Libraries; SQL. Low Latency & High Throughput: pipelined runtime; latency vs throughput tuning.
  • 98. Performance Tuning: exactly-once semantics with low impact on performance; controllable checkpointing overhead; higher throughput using processing time. Performance improvements thanks to operator chaining during the optimization phase and Flink's own optimized serialization stack with code generation.
  • 99. Streaming Benchmark: benchmark for “Streaming Computation” published by Yahoo, Dec 18, 2015. https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at Production use case: counting ad impressions grouped by campaign; aggregations over a 10-second window; saving the current aggregate value to Redis every second.
  • 100. Throughput vs Latency [graph axes: Throughput (1000 events/sec) and 99th percentile latency (ms)]. No operator combining in Storm: a more complicated topology, more steps per event and more overhead.
  • 101. Performance Tuning. Apache Storm without Trident: at least once; double counting after failure; state lost after failures; CPU bound. Apache Spark: latency increases with throughput. Apache Flink: exactly once; no double counting; no state loss; limited by the bandwidth between Kafka and the Flink cluster (1 GigE). With Kafka brokers within the Kafka cluster (10 GigE): achieved 15 million messages/sec (3 million messages/sec before) with exactly-once semantics. [chart: ticks at 10,000,000 and 20,000,000; 1 GigE vs 10 GigE]
  • 102. Building Blocks: Event Time & Windows; Fault Tolerance & Correctness; State Handling; Low Latency & High Throughput; API; Libraries; SQL. API: high-level API; wide range of basic and advanced operators; Java, Scala. Python soon!!
  • 103-108. API: working on data streams (bounded?). Stream processing: explicit handling of time. Java & Scala; Python coming. Java: bean-type classes vs Tuples with positional addressing. Scala: case classes. Operators: Sources: Kafka, FileSystem, Cassandra, … Sinks: Kafka, HDFS, Cassandra, … Transformations: Basic: map, flatmap, filter, grouping, iterate, project, join, cross, … Streaming: windowing + aggregations, temporal binary operators, iterative stream operators. Core API: DataStream<?> and DataSet<?>; 1 implementation*, 2 interfaces.
  • 109-117. Operators [diagram: a dataflow of Source, Map, Filter, Reduce, Join, Sum and Sink operators; successive slides highlight Map, Filter, Reduce and Join in turn]
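As a rough illustration of a dataflow of chained operators running in a pipelined fashion, the sketch below wires a source, a filter, a map, a keyed running sum and a sink together with Python generators. It is not the Flink DataStream API; the operator function names are hypothetical, and generators are used only because they pass one record at a time through the whole chain instead of materializing intermediate results.

```python
def source(records):
    """Emit records one at a time."""
    for record in records:
        yield record


def filter_op(predicate, stream):
    """Forward only the records for which predicate(record) is true."""
    for record in stream:
        if predicate(record):
            yield record


def map_op(fn, stream):
    """Apply fn to every record."""
    for record in stream:
        yield fn(record)


def keyed_running_sum(stream):
    """Keep one running total per key and emit the updated total per record."""
    totals = {}
    for key, value in stream:
        totals[key] = totals.get(key, 0) + value
        yield key, totals[key]


def sink(stream):
    """Collect the results (a real sink would write to an external system)."""
    return list(stream)


words = ["a", "b", "a", "", "b", "a"]
result = sink(
    keyed_running_sum(
        map_op(lambda w: (w, 1),
               filter_op(lambda w: w != "", source(words)))))
print(result)
```

Each word flows through filter, map and sum before the next word is read, which is the record-at-a-time behaviour the runtime slides describe for chained operators.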
  • 118. Building Blocks: Event Time & Windows; Fault Tolerance & Correctness; State Handling; Low Latency & High Throughput; API; Libraries; SQL. SQL: easy to use. SQL!! Based on Apache Calcite.
  • 119. Table API: API extension for DataSets and DataStreams, based on a relational Table abstraction. Table <=> Source / DataSet / DataStream. Operators like: where, select, as, groupBy, join, union, minus, distinct, orderBy, ...
  • 120. SQL Support: execute SQL-like statements on DataSets and DataStreams. Results are returned as a Table (Table API), convertible to a DataStream or DataSet. SQL and the Table API can be seamlessly mixed over DataStreams/DataSets. Flink's SQL support is not feature complete yet; queries that include unsupported SQL will fail!!
  • 121. Apache Calcite: parsing and the logical plan for Table operators and SQL are optimized using Apache Calcite. Only a subset of the comprehensive SQL standard is supported. Apache Calcite provides: SQL parsing; an API for building expressions in relational algebra; a query planning engine. It also provides SQL for streaming queries with window aggregations: SELECT STREAM TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime, productId, COUNT(*) AS c, SUM(units) AS units FROM Orders. Flow: SQL statement; Apache Calcite (SQL to logical plan as relational algebra); Flink optimizer (logical plan to execution plan).
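What a tumbling hourly aggregation of this kind computes can be sketched in plain Python. The slide's query shows no grouping clause, so this sketch assumes the rows are grouped by the one-hour tumbling window and by productId; the function name `hourly_totals`, the tuple layout of an order, and rowtime expressed in seconds are likewise assumptions.

```python
from collections import defaultdict

HOUR = 3600  # seconds


def hourly_totals(orders):
    """orders: iterable of (rowtime_seconds, product_id, units).

    Returns rows (window_end, product_id, order_count, total_units), one per
    one-hour tumbling window and product, sorted by window end and product.
    """
    agg = defaultdict(lambda: [0, 0])  # (window_end, product) -> [count, units]
    for rowtime, product_id, units in orders:
        window_end = (rowtime // HOUR + 1) * HOUR  # end of the hour containing rowtime
        entry = agg[(window_end, product_id)]
        entry[0] += 1
        entry[1] += units
    return sorted((end, pid, c, u) for (end, pid), (c, u) in agg.items())


orders = [
    (10, "p1", 2),
    (3599, "p1", 3),
    (3600, "p1", 1),   # first second of the next hour
    (20, "p2", 5),
]
for row in hourly_totals(orders):
    print(row)
```

A streaming engine would emit each window's rows as soon as the window closes rather than after reading all input; this batch-style loop only shows the grouping arithmetic.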
  • 122. If you want to know one thing about Flink, it is that you don't need to know the internals of Flink. So … Batch
  • 124-127. Batch on Stream [diagram: Unbounded Data Stream, Bounded Data Set]. Stream: unbounded data stream. Batch: a bounded stream (dataset) on a stream processor; a global window over the entire dataset; optimization in operators for joins and grouping, with blocking data exchange if needed. Batch-specific optimizations: cost-based optimizer (dataset size known beforehand); managed memory on/off-heap for join, sort, …; optimized serialization stack for user types.
  • 130-134. Conclusions: Flink is a pure streaming engine that matches real life. No abstraction. Batch on streaming. Flexible windowing semantics with explicit time handling. Competitive performance, low latency and high throughput. Apache Beam, open sourced by Google, uses Flink as its first-order runner for batch and streaming processing, in partnership with Data Artisans. 100% compliance with the data processing model (“what, where, when, how”).