SlideShare a Scribd company logo
1 of 24
Gyula Fóra
gyfora@apache.org
Flink committer
Swedish ICT
Real-time data processing
with Apache Flink
What is Apache Flink?
2
What is Apache Flink
3
Distributed Data Flow Processing System
▪ Focused on large-scale data analytics
▪ Unified real-time stream and batch processing
▪ Easy and powerful APIs in Java / Scala (+ Python)
▪ Robust and fast execution backend
Reduce
Join
Filter
Reduce
Map
Iterate
Source
Sink
Source
Flink Stack (0.9.0)
4
Gelly
Table
ML
SAMOA
DataSet (Java/Scala/Python) DataStream (Java/Scala)
HadoopM/R
Local Remote Yarn Tez Embedded
Dataflow
Dataflow
MRQL
Table
Cascading(WiP)Streaming dataflow runtime
Stream processing
5
6
▪ Data stream: Infinite sequence of data arriving in a continuous
fashion.
▪ Stream processing: Analyzing and acting on real-time streaming
data, using continuous queries
Stream processing
3 Parts of a Streaming Infrastructure
7
Gathering Broker Analysis
Sensors
Transaction
logs …
Server Logs
8
Apache Storm
• True streaming over distributed dataflow
• Low level API (Bolts, Spouts) + Trident
Spark Streaming
• Stream processing emulated on top of batch system (non-native)
• Functional API (DStreams), restricted by batch runtime
Apache Samza
• True streaming built on top of Apache Kafka, state is first class citizen
• Slightly different stream notion, low level API
Apache Flink
• True streaming over stateful distributed dataflow
• Rich functional API exploiting streaming runtime; e.g. rich windowing
semantics
Streaming landscape
Flink Streaming
9
What is Flink Streaming
10
 Native, low-latency stream processor
 Expressive functional API
 Flexible operator state, stream windows
 Exactly-once processing semantics
Native vs non-native streaming
11
Stream
discretizer
Job Job Job Jobwhile (true) {
// get next few records
// issue batch computation
}
while (true) {
// process next record
}
Long-standing
operators
Non-native streaming
Native streaming
Computation vs Operator state
12
 Computation state
• Result of a continuous computation (stream)
• Typically produced by combining independent
results (update function)
• Transformations cannot interact with it!
 Operator state
• Lives inside long standing operators
• Transformations can interact with the state (Map
function dependent on internal state)
 Examples for stateful operators
• Pattern, anomaly detection
• Multipass machine learning algorithms
DataStream API
13
Overview of the API
 Data stream sources
• File system
• Message queue connectors
• Arbitrary source functionality
 Stream transformations
• Basic transformations: Map, Reduce, Filter, Aggregations…
• Binary stream transformations: CoMap, CoReduce…
• Windowing semantics: Policy based flexible windowing (Time, Count,
Delta…)
• Temporal binary stream operators: Joins, Crosses…
• Native support for iterations
 Data stream outputs
 For the details please refer to the programming guide:
• http://flink.apache.org/docs/latest/streaming_guide.html 14
Reduce
Merge
Filter
Sum
Map
Src
Sink
Src
Word count in Batch and Streaming
15
case class Word (word: String, frequency: Int)
val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap {line => line.split(" ")
.map(word => Word(word,1))}
.window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS))
.groupBy("word").sum("frequency")
.print()
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap {line => line.split(" ")
.map(word => Word(word,1))}
.groupBy("word").sum("frequency")
.print()
DataSet API (batch):
DataStream API (streaming):
Flexible windows
16
More at: http://flink.apache.org/news/2015/02/09/streaming-example.html
17
▪ Performance optimizations
• Effective serialization due to strongly typed topologies
• Operator chaining (thread sharing/no serialization)
• Different automatic query optimizations
▪ Competitive performance
• ~ 1.5m events / sec / core
• As a comparison Storm promises ~ 1m tuples / sec /
node
Performance
Fault tolerance
18
Overview
19
▪ Fault tolerance in other systems
• Message tracking/acks (Storm)
• RDD re-computation (Spark)
 Fault tolerance in Apache Flink
• Based on consistent global snapshots
• Algorithm inspired by Chandy-Lamport
• Low runtime overhead, stateful exactly-once
semantics
Checkpointing / Recovery
20
Asynchronous Barrier Snapshotting for globally consistent checkpoints
Pushes checkpoint barriers
through the data flow
Operator checkpoint
starting
Checkpoint done
Data Stream
barrier
Before barrier =
part of the snapshot
After barrier =
Not in snapshot
Checkpoint done
checkpoint in progress
(backup till next snapshot)
State management
21
 State declared in the operators is managed and
checkpointed by Flink
 Pluggable backends for storing persistent
snapshots
• Currently: JobManager, FileSystem (HDFS, Tachyon)
 State partitioning and flexible scaling in the
future
Closing
22
Streaming roadmap for 2015
 Improved state management
• New backends for state snapshotting
• Support for state partitioning and incremental
snapshots
• Master Failover
 Improved job monitoring
 Integration with other Apache projects
• SAMOA (PR ready), Zeppelin (PR ready), Ignite
 Streaming machine learning and other new
libraries 23
flink.apache.org
@ApacheFlink

More Related Content

What's hot

Tran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Tran Nam-Luc – Stale Synchronous Parallel Iterations on FlinkTran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Tran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Flink Forward
 

What's hot (20)

Flink Apachecon Presentation
Flink Apachecon PresentationFlink Apachecon Presentation
Flink Apachecon Presentation
 
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache Flink
 
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
 
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
 
Unified Stream and Batch Processing with Apache Flink
Unified Stream and Batch Processing with Apache FlinkUnified Stream and Batch Processing with Apache Flink
Unified Stream and Batch Processing with Apache Flink
 
Apache Flink: Streaming Done Right @ FOSDEM 2016
Apache Flink: Streaming Done Right @ FOSDEM 2016Apache Flink: Streaming Done Right @ FOSDEM 2016
Apache Flink: Streaming Done Right @ FOSDEM 2016
 
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupMachine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning Group
 
Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016
 
Flink internals web
Flink internals web Flink internals web
Flink internals web
 
Tech Talk @ Google on Flink Fault Tolerance and HA
Tech Talk @ Google on Flink Fault Tolerance and HATech Talk @ Google on Flink Fault Tolerance and HA
Tech Talk @ Google on Flink Fault Tolerance and HA
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Apache flink
Apache flinkApache flink
Apache flink
 
Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.
 
Tran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Tran Nam-Luc – Stale Synchronous Parallel Iterations on FlinkTran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Tran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
 
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
 
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data ProcessingApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
 
Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016
 
First Flink Bay Area meetup
First Flink Bay Area meetupFirst Flink Bay Area meetup
First Flink Bay Area meetup
 
Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)
Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)
Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)
 

Viewers also liked

Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...
Ververica
 

Viewers also liked (17)

Kostas Tzoumas_Stephan Ewen - Keynote -The maturing data streaming ecosystem ...
Kostas Tzoumas_Stephan Ewen - Keynote -The maturing data streaming ecosystem ...Kostas Tzoumas_Stephan Ewen - Keynote -The maturing data streaming ecosystem ...
Kostas Tzoumas_Stephan Ewen - Keynote -The maturing data streaming ecosystem ...
 
Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink
 
Flink vs. Spark
Flink vs. SparkFlink vs. Spark
Flink vs. Spark
 
Concord: Simple & Flexible Stream Processing on Apache Mesos: Data By The Bay...
Concord: Simple & Flexible Stream Processing on Apache Mesos: Data By The Bay...Concord: Simple & Flexible Stream Processing on Apache Mesos: Data By The Bay...
Concord: Simple & Flexible Stream Processing on Apache Mesos: Data By The Bay...
 
Flink 1.0-slides
Flink 1.0-slidesFlink 1.0-slides
Flink 1.0-slides
 
Interactive Data Analysis with Apache Flink @ Flink Meetup in Berlin
Interactive Data Analysis with Apache Flink @ Flink Meetup in BerlinInteractive Data Analysis with Apache Flink @ Flink Meetup in Berlin
Interactive Data Analysis with Apache Flink @ Flink Meetup in Berlin
 
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache FlinkGelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
 
Apache Flink Training Workshop @ HadoopCon2016 - #4 Advanced Stream Processing
Apache Flink Training Workshop @ HadoopCon2016 - #4 Advanced Stream ProcessingApache Flink Training Workshop @ HadoopCon2016 - #4 Advanced Stream Processing
Apache Flink Training Workshop @ HadoopCon2016 - #4 Advanced Stream Processing
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
 
Apache Fink 1.0: A New Era for Real-World Streaming Analytics
Apache Fink 1.0: A New Era  for Real-World Streaming AnalyticsApache Fink 1.0: A New Era  for Real-World Streaming Analytics
Apache Fink 1.0: A New Era for Real-World Streaming Analytics
 
Click-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer CheckpointingClick-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer Checkpointing
 
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuApache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
 
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...
 
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkOverview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
 

Similar to Flink Streaming @BudapestData

Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application  Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application
Apache Apex
 
From Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexFrom Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache Apex
DataWorks Summit
 

Similar to Flink Streaming @BudapestData (20)

Apache Flink: Past, Present and Future
Apache Flink: Past, Present and FutureApache Flink: Past, Present and Future
Apache Flink: Past, Present and Future
 
GOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache FlinkGOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache Flink
 
QCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkQCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache Flink
 
Flink history, roadmap and vision
Flink history, roadmap and visionFlink history, roadmap and vision
Flink history, roadmap and vision
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
 
Introduction to Apache Apex by Thomas Weise
Introduction to Apache Apex by Thomas WeiseIntroduction to Apache Apex by Thomas Weise
Introduction to Apache Apex by Thomas Weise
 
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex
 
Introduction to Apache Flink at Vienna Meet Up
Introduction to Apache Flink at Vienna Meet UpIntroduction to Apache Flink at Vienna Meet Up
Introduction to Apache Flink at Vienna Meet Up
 
Tale of two streaming frameworks- Apace Storm & Apache Flink
Tale of two streaming frameworks- Apace Storm & Apache FlinkTale of two streaming frameworks- Apace Storm & Apache Flink
Tale of two streaming frameworks- Apace Storm & Apache Flink
 
Tale of two streaming frameworks (Karthik D - Walmart)
Tale of two streaming frameworks (Karthik D - Walmart)Tale of two streaming frameworks (Karthik D - Walmart)
Tale of two streaming frameworks (Karthik D - Walmart)
 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and ApplicationsApache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications
 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications
 
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
 
Ingestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache ApexIngestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache Apex
 
Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application  Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
 
From Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexFrom Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache Apex
 

Recently uploaded

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

Flink Streaming @BudapestData

  • 1. Gyula Fóra gyfora@apache.org Flink committer Swedish ICT Real-time data processing with Apache Flink
  • 2. What is Apache Flink? 2
  • 3. What is Apache Flink 3 Distributed Data Flow Processing System ▪ Focused on large-scale data analytics ▪ Unified real-time stream and batch processing ▪ Easy and powerful APIs in Java / Scala (+ Python) ▪ Robust and fast execution backend Reduce Join Filter Reduce Map Iterate Source Sink Source
  • 4. Flink Stack (0.9.0) 4 Gelly Table ML SAMOA DataSet (Java/Scala/Python) DataStream (Java/Scala) HadoopM/R Local Remote Yarn Tez Embedded Dataflow Dataflow MRQL Table Cascading(WiP)Streaming dataflow runtime
  • 6. 6 ▪ Data stream: Infinite sequence of data arriving in a continuous fashion. ▪ Stream processing: Analyzing and acting on real-time streaming data, using continuous queries Stream processing
  • 7. 3 Parts of a Streaming Infrastructure 7 Gathering Broker Analysis Sensors Transaction logs … Server Logs
  • 8. 8 Apache Storm • True streaming over distributed dataflow • Low level API (Bolts, Spouts) + Trident Spark Streaming • Stream processing emulated on top of batch system (non-native) • Functional API (DStreams), restricted by batch runtime Apache Samza • True streaming built on top of Apache Kafka, state is first class citizen • Slightly different stream notion, low level API Apache Flink • True streaming over stateful distributed dataflow • Rich functional API exploiting streaming runtime; e.g. rich windowing semantics Streaming landscape
  • 10. What is Flink Streaming 10  Native, low-latency stream processor  Expressive functional API  Flexible operator state, stream windows  Exactly-once processing semantics
  • 11. Native vs non-native streaming 11 Stream discretizer Job Job Job Jobwhile (true) { // get next few records // issue batch computation } while (true) { // process next record } Long-standing operators Non-native streaming Native streaming
  • 12. Computation vs Operator state 12  Computation state • Result of a continuous computation (stream) • Typically produced by combining independent results (update function) • Transformations cannot interact with it!  Operator state • Lives inside long standing operators • Transformations can interact with the state (Map function dependent on internal state)  Examples for stateful operators • Pattern, anomaly detection • Multipass machine learning algorithms
  • 14. Overview of the API  Data stream sources • File system • Message queue connectors • Arbitrary source functionality  Stream transformations • Basic transformations: Map, Reduce, Filter, Aggregations… • Binary stream transformations: CoMap, CoReduce… • Windowing semantics: Policy based flexible windowing (Time, Count, Delta…) • Temporal binary stream operators: Joins, Crosses… • Native support for iterations  Data stream outputs  For the details please refer to the programming guide: • http://flink.apache.org/docs/latest/streaming_guide.html 14 Reduce Merge Filter Sum Map Src Sink Src
  • 15. Word count in Batch and Streaming 15 case class Word (word: String, frequency: Int) val lines: DataStream[String] = env.fromSocketStream(...) lines.flatMap {line => line.split(" ") .map(word => Word(word,1))} .window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS)) .groupBy("word").sum("frequency") .print() val lines: DataSet[String] = env.readTextFile(...) lines.flatMap {line => line.split(" ") .map(word => Word(word,1))} .groupBy("word").sum("frequency") .print() DataSet API (batch): DataStream API (streaming):
  • 16. Flexible windows 16 More at: http://flink.apache.org/news/2015/02/09/streaming-example.html
  • 17. 17 ▪ Performance optimizations • Effective serialization due to strongly typed topologies • Operator chaining (thread sharing/no serialization) • Different automatic query optimizations ▪ Competitive performance • ~ 1.5m events / sec / core • As a comparison Storm promises ~ 1m tuples / sec / node Performance
  • 19. Overview 19 ▪ Fault tolerance in other systems • Message tracking/acks (Storm) • RDD re-computation (Spark)  Fault tolerance in Apache Flink • Based on consistent global snapshots • Algorithm inspired by Chandy-Lamport • Low runtime overhead, stateful exactly-once semantics
  • 20. Checkpointing / Recovery 20 Asynchronous Barrier Snapshotting for globally consistent checkpoints Pushes checkpoint barriers through the data flow Operator checkpoint starting Checkpoint done Data Stream barrier Before barrier = part of the snapshot After barrier = Not in snapshot Checkpoint done checkpoint in progress (backup till next snapshot)
  • 21. State management 21  State declared in the operators is managed and checkpointed by Flink  Pluggable backends for storing persistent snapshots • Currently: JobManager, FileSystem (HDFS, Tachyon)  State partitioning and flexible scaling in the future
  • 23. Streaming roadmap for 2015  Improved state management • New backends for state snapshotting • Support for state partitioning and incremental snapshots • Master Failover  Improved job monitoring  Integration with other Apache projects • SAMOA (PR ready), Zeppelin (PR ready), Ignite  Streaming machine learning and other new libraries 23