Data Stream Processing with
Apache Flink
Fabian Hueske
@fhueske
Apache Flink Meetup Madrid, 25.02.2016
This talk is an introduction to stream processing with Apache Flink. I gave it at the Apache Flink Meetup in Madrid on February 25, 2016.

The talk discusses Flink's features, shows its DataStream API, and explains the benefits of event-time stream processing. It also gives an outlook on some features that will be added after the 1.0 release.

Published in: Software

  1. Data Stream Processing with Apache Flink
     Fabian Hueske, @fhueske
     Apache Flink Meetup Madrid, 25.02.2016
  2. What is Apache Flink?
     Apache Flink is an open source platform for scalable stream and batch processing.
     • The core of Flink is a distributed streaming dataflow engine.
     • Executes dataflows in parallel on clusters
     • Provides a reliable backend for various workloads
     • DataStream and DataSet programming abstractions are the foundation for user programs and higher layers
  3. What is Apache Flink?
     A stream processor with many faces: streaming topologies, long batch pipelines, machine learning at scale, graph analysis
     ▪ resource utilization
     ▪ iterative algorithms
     ▪ mutable state
     ▪ low-latency processing
  4. History & Community of Flink
     From incubation until now
  5. [Timeline figure. Release labels: 0.5, 0.6, 0.7, 0.8, 0.9, 0.10, 1.0! Date labels: Apr ‘14, Dec ‘14, Mar ‘15, Jun ‘15, Nov ‘15. Milestone label: Top level]
  6. Growing and Vibrant Community
     Flink is one of the largest and most active Apache big data projects:
     • more than 150 contributors
     • more than 600 forks
     • more than 1000 GitHub stars (since yesterday)
  7. Flink Meetups around the Globe
  8. Flink Meetups around the Globe
  9. Organizations at Flink Forward
  10. The streaming era
      Coming soon…
  11. What is Stream Processing?
      ▪ Today, most data is continuously produced
          • user activity logs, web logs, sensors, database transactions, …
      ▪ The common approach to analyzing such data so far
          • Record data stream to stable storage (DBMS, HDFS, …)
          • Periodically analyze data with a batch processing engine (DBMS, MapReduce, ...)
      ▪ Stream processing engines analyze data as it arrives
  12. Why do Stream Processing?
      ▪ Decreases the overall latency to obtain results
          • No need to persist data in stable storage
          • No periodic batch analysis jobs
      ▪ Simplifies the data infrastructure
          • Fewer moving parts to be maintained and coordinated
      ▪ Makes time dimension of data explicit
          • Each event has a timestamp
          • Data can be processed based on timestamps
  13. What are the Requirements?
      ▪ Low latency
          • Results in milliseconds
      ▪ High throughput
          • Millions of events per second
      ▪ Exactly-once consistency
          • Correct results in case of failures
      ▪ Out-of-order events
          • Process events based on their associated time
      ▪ Intuitive APIs
  14. Open Source Stream Processors so far
      ▪ Either low latency or high throughput
      ▪ Exactly-once guarantees only with high latency
      ▪ Lacking time semantics
          • Processing by wall clock time only
          • Events are processed in arrival order, not in the order they were created
      ▪ Shortcomings lead to complicated system designs
          • Lambda architecture
  15. Stream Processing with Flink
  16. Stream Processing with Flink
      ▪ Low latency
          • Pipelined processing engine
      ▪ High throughput
          • Controllable checkpointing overhead
      ▪ Exactly-once guarantees
          • Distributed snapshots
      ▪ Support for out-of-order streams
          • Processing semantics based on event-time
      ▪ Programmability
          • APIs similar to those known from the batch world
  17. Flink in Streaming Architectures
      [Architecture diagram: Flink is used for data ingestion and ETL, for analytics on data in motion, and for analytics on static data, alongside Kafka, HDFS, and Elasticsearch, HBase, Cassandra, …]
  18. The DataStream API
      Concise and easy-to-grasp code
  19. The DataStream API
      case class Event(location: Location, numVehicles: Long)

      val stream: DataStream[Event] = …;
      stream
        .filter { evt => isIntersection(evt.location) }
  20. The DataStream API
      case class Event(location: Location, numVehicles: Long)

      val stream: DataStream[Event] = …;
      stream
        .filter { evt => isIntersection(evt.location) }
        .keyBy("location")
        .timeWindow(Time.minutes(15), Time.minutes(5))
        .sum("numVehicles")
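To make the window semantics of the pipeline above concrete, here is a minimal sketch in plain Scala collections. It is a simplified model, not the Flink runtime: the `SlidingWindowSketch` object, the string-valued `location`, and the integer `minute` timestamp are assumptions made for illustration, and the `filter` step is left out. It shows that a 15-minute window sliding every 5 minutes assigns each event to three overlapping windows.

```scala
// Simplified model of keyBy("location").timeWindow(15 min, 5 min).sum("numVehicles").
// Timestamps are non-negative minutes; windows are half-open intervals [start, start + size).
object SlidingWindowSketch {
  final case class Event(location: String, minute: Long, numVehicles: Long)

  // Start times of all sliding windows that contain timestamp t.
  def windowStarts(t: Long, size: Long, slide: Long): Seq[Long] = {
    val lastStart = t - (t % slide) // latest window start that is <= t
    lastStart until (t - size) by -slide
  }

  // Sums numVehicles per (location, windowStart).
  def slidingSums(events: Seq[Event], size: Long, slide: Long): Map[(String, Long), Long] =
    events
      .flatMap(e => windowStarts(e.minute, size, slide).map(s => ((e.location, s), e.numVehicles)))
      .groupBy(_._1)
      .map { case (key, values) => key -> values.map(_._2).sum }
}
```

For example, an event at minute 7 falls into the windows starting at minutes 5, 0, and -5, so its vehicle count contributes to three separate sums.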
  21. The DataStream API
      case class Event(location: Location, numVehicles: Long)

      val stream: DataStream[Event] = …;
      stream
        .filter { evt => isIntersection(evt.location) }
        .keyBy("location")
        .timeWindow(Time.minutes(15), Time.minutes(5))
        .sum("numVehicles")
        .keyBy("location")
        // Per-key state: classify with the current model, then store the updated model.
        .mapWithState { (evt, state: Option[Model]) => {
          val model = state.getOrElse(new Model())
          (model.classify(evt), Some(model.update(evt)))
        }}
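The per-key state handling in the `mapWithState` step can be modeled without Flink. The sketch below is an illustration only: `KeyedStateSketch`, its local `mapWithState` helper, and the running-mean `Model` are hypothetical stand-ins (the slide does not show what `Model` does). The helper mirrors the shape of the function on the slide: it receives an event plus the optional state for that event's key, and returns an output plus the new state.

```scala
object KeyedStateSketch {
  final case class Event(location: String, numVehicles: Long)

  // Hypothetical model: tracks the running mean of numVehicles for one location.
  final case class Model(count: Long, total: Long) {
    def update(evt: Event): Model = Model(count + 1, total + evt.numVehicles)
    // "busy" if the event exceeds the mean of all earlier events for this key.
    def classify(evt: Event): String =
      if (count > 0 && evt.numVehicles * count > total) "busy" else "normal"
  }

  // Applies fun to each event in order, keeping one optional state value per key.
  def mapWithState[K, T, R, S](events: Seq[T], key: T => K)(
      fun: (T, Option[S]) => (R, Option[S])): Seq[R] = {
    val state = scala.collection.mutable.Map.empty[K, S]
    events.map { evt =>
      val k = key(evt)
      val (out, next) = fun(evt, state.get(k))
      next match {
        case Some(s) => state.update(k, s)
        case None    => state.remove(k)
      }
      out
    }
  }

  def classifyAll(events: Seq[Event]): Seq[String] =
    mapWithState[String, Event, String, Model](events, _.location) { (evt, state) =>
      val model = state.getOrElse(Model(0, 0))
      (model.classify(evt), Some(model.update(evt)))
    }
}
```

Each location gets its own model, so events for one key never influence the classification of another key.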
  22. Event-time processing
      Consistent and sound results
  23. Event-time Processing
      ▪ Most data streams consist of events
          • log entries, sensor data, user actions, …
          • Events have an associated timestamp
      ▪ Many analysis tasks are based on time
          • “Average temperature every minute”
          • “Count of processed parcels per hour”
          • ...
      ▪ Events often arrive out-of-order at processor
          • Distributed sources, network delays, non-synced clocks, …
      ▪ Stream processor must respect time of events for consistent and sound results
          • Most stream processors use wall clock time
  24. Event Processing
      [Diagram: events occur on devices, events are stored in a log (Queue / Log), events are analyzed in a stream processor (Stream Analysis)]
  25. Event Processing
  26. Event Processing
  27. Event Processing
  28. Event Processing
      Out of order!!!
      [Diagram labels: first burst of events, second burst of events]
  29. Event Processing
      [Diagram labels: event time windows, arrival time windows, instant event-at-a-time; first burst of events, second burst of events]
      Flink supports event-time windows over out-of-order streams, arrival-time windows (and mixtures of both), plus low-latency processing.
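The difference between event-time and arrival-time windows can be shown with a small plain-Scala sketch. The `EventTimeSketch` object and its field names are assumptions for illustration. The sketch sees all events at once, so it sidesteps the question of when a window is complete; in a running stream processor that decision needs an extra mechanism (Flink uses watermarks for it).

```scala
object EventTimeSketch {
  // eventTime: when the event happened; arrivalTime: when the processor saw it (seconds).
  final case class Event(eventTime: Long, arrivalTime: Long)

  // Counts events per tumbling window of `size` seconds, keyed by window start.
  // `timestamp` selects which notion of time defines the windows.
  def countPerWindow(events: Seq[Event], size: Long)(timestamp: Event => Long): Map[Long, Int] =
    events
      .groupBy(e => timestamp(e) / size * size)
      .map { case (start, es) => start -> es.size }
}
```

With delayed events, `countPerWindow(events, 60)(_.eventTime)` gives the same answer no matter how late the events arrive, while `countPerWindow(events, 60)(_.arrivalTime)` changes with network delays and processing speed.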
  30. Event-time Processing
      ▪ Event-time processing decouples job semantics from processing speed
      ▪ Analyze events from static data store and online stream using the same program
      ▪ Semantically sound and consistent results
      ▪ Details: http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1
  31. Operational Features
      Running Flink 24*7*52
  32. Monitoring & Dashboard
      ▪ Many metrics exposed via REST interface
      ▪ Web dashboard
          • Submit, stop, and cancel jobs
          • Inspect running and completed jobs
          • Analyze performance
          • Check exceptions
          • Inspect configuration
          • …
  33. Highly-available Cluster Setup
      ▪ Stream applications run for weeks, months, …
          • Application must never fail!
          • No single-point-of-failure component allowed
      ▪ Flink supports highly-available cluster setups
          • Master failures are resolved using Apache ZooKeeper
          • Worker failures are resolved by master
      ▪ Stand-alone cluster setup
          • Requires (manually started) stand-by masters and workers
      ▪ YARN cluster setup
          • Masters and workers are automatically restarted
  34. Save Points
      ▪ A save point is a consistent snapshot of a job
          • Includes source offsets and operator state
          • Stop job
          • Restart job from save point
      ▪ What can I use it for?
          • Fix or update your job
          • A/B testing
          • Update Flink
          • Migrate cluster
          • …
      ▪ Details: http://data-artisans.com/how-apache-flink-enables-new-streaming-applications
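The stop-and-restart cycle can be illustrated with a toy model in plain Scala. Everything here is an assumption for illustration: the `SavePointSketch` object, the in-memory `log` standing in for a replayable source, and the counting job. The point it shows is that a snapshot holding both the source offset and the operator state is enough to resume and reach the same result as an uninterrupted run.

```scala
object SavePointSketch {
  // Toy save point: how far the source was read, plus the operator state (counts per key).
  final case class SavePoint(sourceOffset: Int, counts: Map[String, Long])

  val empty: SavePoint = SavePoint(0, Map.empty)

  // Runs a counting job over `log` from the save point's offset up to `until` (exclusive)
  // and returns a new consistent snapshot of offset and state.
  def run(log: IndexedSeq[String], from: SavePoint, until: Int): SavePoint = {
    val counts = log.slice(from.sourceOffset, until).foldLeft(from.counts) { (acc, key) =>
      acc.updated(key, acc.getOrElse(key, 0L) + 1)
    }
    SavePoint(math.max(from.sourceOffset, math.min(until, log.length)), counts)
  }
}
```

Stopping after three records and resuming from the returned snapshot yields the same final state as processing the whole log in one go, which is what makes job updates and cluster migrations safe in this model.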
  35. Performance: Summary
      Continuous streaming + latency-bound buffering + distributed snapshots = high throughput & low latency
      With configurable throughput/latency tradeoff
      Details: http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink
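One way to picture latency-bound buffering is a buffer that is shipped either when it is full or when its oldest record has waited for a configured timeout. The plain-Scala sketch below follows that reading; it is a simplified model under that assumption, not Flink's network stack, and `BufferTimeoutSketch`, `capacity`, and `timeoutMs` are illustrative names.

```scala
object BufferTimeoutSketch {
  // Groups record arrival times (ms, ascending) into buffers. A buffer is shipped when it
  // holds `capacity` records or when its oldest record has waited `timeoutMs`, whichever
  // comes first. Returns (shipTime, recordsInBuffer) pairs in shipping order.
  def ship(arrivals: Seq[Long], capacity: Int, timeoutMs: Long): List[(Long, Int)] = {
    val shipped = scala.collection.mutable.ArrayBuffer.empty[(Long, Int)]
    var first = -1L // arrival time of the oldest buffered record
    var count = 0
    for (t <- arrivals) {
      if (count > 0 && t >= first + timeoutMs) { // timeout fired before this arrival
        shipped += ((first + timeoutMs, count))
        count = 0
      }
      if (count == 0) first = t
      count += 1
      if (count == capacity) { // buffer full: ship immediately
        shipped += ((t, count))
        count = 0
      }
    }
    if (count > 0) shipped += ((first + timeoutMs, count)) // remaining records leave on timeout
    shipped.toList
  }
}
```

A small timeout bounds the extra latency per record, while a large timeout lets buffers fill up, which amortizes per-buffer overhead and favors throughput.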
  36. Integration (picture not complete)
      [Figure labels: POSIX, Java/Scala Collections, POSIX]
  37. Post v1.0 Roadmap
      What’s coming next?
  38. Stream SQL and Table API
      ▪ Structured queries over data streams
          • LINQ-style Table API
          • Stream SQL
      ▪ Based on Apache Calcite
          • SQL parser and optimizer
      ▪ “Compute every hour the number of orders and the number of ordered units for each product.”

      SELECT STREAM
        productId,
        TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime,
        COUNT(*) AS cnt,
        SUM(units) AS units
      FROM Orders
      GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId;
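The result the hourly query above is meant to produce can be spelled out over a finite list of orders in plain Scala. This is a sketch of the query's semantics only, not of the Table API or Stream SQL: `TumbleSketch`, `Order`, and `Result` are names invented for illustration, and `rowtime` is assumed to be in milliseconds.

```scala
object TumbleSketch {
  final case class Order(rowtime: Long, productId: String, units: Int) // rowtime in ms
  final case class Result(productId: String, windowEnd: Long, cnt: Long, units: Long)

  val HourMs: Long = 60L * 60L * 1000L

  // One result row per (hourly tumbling window, productId), reported at the window end.
  def hourlyTotals(orders: Seq[Order]): Seq[Result] =
    orders
      .groupBy(o => (o.rowtime / HourMs * HourMs, o.productId))
      .map { case ((start, productId), os) =>
        Result(productId, start + HourMs, os.size.toLong, os.map(_.units.toLong).sum)
      }
      .toSeq
      .sortBy(r => (r.windowEnd, r.productId))
}
```

Each order belongs to exactly one hourly window, which is what distinguishes a tumbling window from the sliding windows used in the DataStream example.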
  39. Complex Event Processing
      ▪ Identify complex patterns in event streams
          • Correlations & sequences
      ▪ Many applications
          • Network intrusion detection via access patterns
          • Item tracking (parcels, devices, …)
          • …
      ▪ CEP depends on low latency processing
          • Most CEP systems are not distributed
      ▪ CEP in Flink
          • Easy-to-use API to define CEP patterns
          • Integration with Table API for structured analytics
          • Low-latency and high-throughput engine
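As an illustration of a sequence pattern for intrusion detection, the plain-Scala sketch below flags sources that produce several failed accesses directly followed by a success within a time bound. It does not use any Flink CEP API; `CepSketch`, the `Access` type, and the event kinds "failed" and "success" are assumptions made for this example.

```scala
object CepSketch {
  final case class Access(source: String, kind: String, time: Long) // time in seconds

  // True if `es` (time-ordered events of one source) contains `threshold` consecutive
  // "failed" events directly followed by a "success", all within `withinSec` seconds.
  def matches(es: Seq[Access], threshold: Int, withinSec: Long): Boolean =
    es.sliding(threshold + 1).exists { w =>
      w.size == threshold + 1 &&
      w.init.forall(_.kind == "failed") &&
      w.last.kind == "success" &&
      w.last.time - w.head.time <= withinSec
    }

  // Sources whose event sequence matches the pattern. Input must be time-ordered.
  def suspicious(events: Seq[Access], threshold: Int, withinSec: Long): Set[String] =
    events
      .groupBy(_.source)
      .filter { case (_, es) => matches(es, threshold, withinSec) }
      .keys
      .toSet
}
```

Grouping by source before matching corresponds to evaluating the pattern per key, which is also what allows such a check to be distributed across workers.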
  40. Dynamic Job Parallelism
      ▪ Adjusting parallelism of tasks without (significantly) interrupting the program
      ▪ Initial version based on save points
          • Trigger save point
          • Stop job
          • Restart job with adjusted parallelism
      ▪ Later: change parallelism while the job is running
      ▪ Vision is automatic adaptation based on throughput
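Restarting with a different parallelism means the keyed state captured in the save point has to be split across a new number of tasks. The sketch below shows one simple way to think about that step, using hash partitioning in plain Scala. It is an assumed model for illustration, not Flink's actual state redistribution mechanism; `RescaleSketch` and its function names are hypothetical.

```scala
object RescaleSketch {
  // Assigns a key to one of `parallelism` task instances by hashing.
  def taskFor(key: String, parallelism: Int): Int =
    java.lang.Math.floorMod(key.hashCode, parallelism)

  // Splits state captured in a save point (key -> value) across a new number of tasks.
  def redistribute(savedState: Map[String, Long], newParallelism: Int): Map[Int, Map[String, Long]] =
    savedState.groupBy { case (key, _) => taskFor(key, newParallelism) }
}
```

Because the same hash function routes both the saved state and the incoming events, each key's state ends up on the task that will process that key after the restart.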
  41. Wrap up!
      ▪ Flink is a kick-ass stream processor…
          • Low latency & high throughput
          • Exactly-once consistency
          • Event-time processing
          • Support for out-of-order streams
          • Intuitive API
      ▪ with lots of features in the pipeline…
      ▪ and a reliable batch processor as well!
  42. I ♥ Squirrels, do you?
      ▪ More Information at
          • http://flink.apache.org/
      ▪ Free Flink training at
          • http://dataartisans.github.io/flink-training
      ▪ Sign up for user/dev mailing list
      ▪ Get involved and contribute
      ▪ Follow @ApacheFlink on Twitter