Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink's Approach to It


Stream Processing is emerging as a popular paradigm for data processing architectures, because it handles the continuous nature of most data and computation and gets rid of artificial boundaries and delays.

Stream processing's rapid adoption is also driven by more powerful and maturing technology (much of it open source at the ASF) that has solved many of the hard technical challenges.

We discuss Apache Flink's approach to high-performance stream processing with state, strong consistency, low latency, and sophisticated handling of time. With such building blocks, Apache Flink can handle classes of problems previously considered out of reach for stream processing. We also take a sneak peek at the next steps for Flink.

Published in: Data & Analytics


  1. Stream Processing and Apache Flink®'s approach to it. @StephanEwen, Apache Flink PMC, CTO @ data Artisans
  2. About me: database systems (TU Berlin, IBM, Microsoft); co-bootstrapped the Stratosphere project's runtime; Apache Flink created from a (partial) Stratosphere fork; Apache Flink community; co-founded data Artisans; now Flink PMC and CTO at data Artisans
  3. Streaming technology is enabling the obvious: continuous processing on data that is continuously produced. Hint: you already have streaming data.
  4. Streaming Subsumes Batch (timeline diagram: a partitioned log of hourly events from 2016-3-1 12:00 am through 2016-3-12 3:00 am)
  5. Streaming Subsumes Batch (same timeline, annotated with a low-latency stream and a high-latency stream)
  6. Streaming Subsumes Batch (same timeline, annotated: a batch is a bounded stream, alongside the low- and high-latency streams)
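The "batch is a bounded stream" idea on these slides can be sketched in a few lines of plain Scala (no Flink dependency; the object and method names are illustrative): one incremental operator serves both a finite input and a live one.

```scala
// Plain-Scala sketch: a batch is just a stream that happens to end,
// so the same incremental operator handles both cases.
object BoundedStream {
  // A streaming operator: emit a running sum, one output per input record.
  def runningSum(events: Iterator[Long]): Iterator[Long] =
    events.scanLeft(0L)(_ + _).drop(1)
}
```

Feeding it a finite iterator yields a batch result; feeding it an unbounded iterator yields continuous updates from the same code.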
  7. Stream Processing Decouples (diagram: state managed centrally in a database shared by apps a, b, c vs. each application building its own state)
  8. Time Travel (diagram over a partitioned log): process a period of historic data; process the latest data with low latency (the tail of the log); reprocess the stream (historic data first, catching up with realtime data)
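With an append-only log, the "time travel" on this slide reduces to choosing a read offset. A minimal plain-Scala sketch (illustrative names; in practice the log is a durable system such as Kafka):

```scala
// A log is an append-only, offset-indexed sequence, so replay is just
// picking the position to read from:
//   offset 0          -> full reprocessing of historic data
//   a historic offset -> process a period, then catch up with realtime data
//   log.size          -> tail the log: only the latest data, low latency
object Replay {
  def readFrom(log: Vector[String], offset: Int): Iterator[String] =
    log.iterator.drop(offset)
}
```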
  9. (diagram: the three-way tradeoff between latency, volume/throughput, and state & accuracy)
  10. Latency, volume/throughput, state & accuracy: exactly-once semantics; event-time processing; tens of millions of events/sec for stateful applications; latency down to the milliseconds. Apache Flink was the first open-source system to eliminate these tradeoffs.
  11. Streaming Architecture Blueprint: collect → log → analyze → serve & store
  12. Flink's Approach (layer stack): Stateful Stream Processing as the building block; the fluent Core API with windows and event time; the Table API as declarative DSL; Stream SQL as high-level language
  13. Stateful Stream Processing: Source → Filter/Transform (with state read/write) → Sink
  14. Stateful Stream Processing: scalable embedded state, accessed at memory speed and scaling with the parallel operators
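The embedded-state idea above can be sketched in plain Scala (no Flink dependency; names are illustrative): keys are hash-partitioned across parallel instances, and each instance keeps its share of the state in local memory, so state access is a local read/write rather than a remote lookup.

```scala
// Sketch of operator-embedded keyed state: the state lives next to the
// computation that uses it, partitioned the same way as the input.
object EmbeddedState {
  final class Instance {
    private val counts = scala.collection.mutable.Map.empty[String, Long]
    // Count events per key; both the read and the write are local.
    def process(key: String): Long = {
      val n = counts.getOrElse(key, 0L) + 1L
      counts(key) = n
      n
    }
  }
  // Route a key to one of `parallelism` instances, as a keyBy would.
  def route(key: String, parallelism: Int): Int =
    math.floorMod(key.hashCode, parallelism)
}
```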
  15. Stateful Stream Processing: re-load state and reset positions in the input streams, i.e., roll back the computation for re-processing
  16. Stateful Stream Processing: restore state to different programs (bugfixes, upgrades, A/B testing, etc.)
  17. Versioning the state of applications (diagram: savepoints taken over time, with applications A, B, and C restored from them)
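A hedged sketch of what such a savepoint captures: operator state plus the input position at which it was taken. Any program, including a fixed or upgraded version, can resume from that pair (plain Scala, illustrative names):

```scala
// A savepoint pairs the state with the log offset where it was taken;
// resuming means continuing the fold from exactly that point.
object Savepoints {
  case class Savepoint(offset: Int, counts: Map[String, Long])

  // Run a simple per-key counting job from a savepoint to the end of
  // the log, producing the next savepoint.
  def run(log: Vector[String], from: Savepoint): Savepoint = {
    val counts = log.drop(from.offset).foldLeft(from.counts) { (st, k) =>
      st.updated(k, st.getOrElse(k, 0L) + 1L)
    }
    Savepoint(log.size, counts)
  }
}
```

Because the offset is part of the snapshot, restoring application B or C from application A's savepoint neither skips nor double-counts records.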
  18. Flink's Approach (layer stack, recapped): Stateful Stream Processing as the building block; the fluent Core API with windows and event time; the Table API as declarative DSL; Stream SQL as high-level language
  19. Event Time / Out-of-Order: the Star Wars episodes by processing time (release years 1977, 1980, 1983, 1999, 2002, 2005, 2015: Episodes IV, V, VI, I, II, III, VII) vs. by event time (story order)
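The point of the Star Wars analogy is that window membership depends on each record's own timestamp, not its arrival order. A plain-Scala sketch of tumbling event-time windows (illustrative; Flink additionally uses watermarks to decide when a window may close):

```scala
// Event-time windowing: each record carries its own timestamp, and the
// window it belongs to is determined by that timestamp alone, so
// out-of-order arrival does not change the result.
object EventTimeWindows {
  final case class Event(eventTime: Long, value: Int)

  // Tumbling windows of `size` time units, keyed by window start.
  def windowedSums(events: Seq[Event], size: Long): Map[Long, Int] =
    events
      .groupBy(e => e.eventTime / size * size)
      .map { case (start, es) => start -> es.map(_.value).sum }
}
```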
  20. (Stream) SQL & Table API

      Table API:
        // convert stream into Table
        val sensorTable: Table = sensorData
          .toTable(tableEnv, 'location, 'time, 'tempF)

        // define query on Table
        val avgTempCTable: Table = sensorTable
          .groupBy('location)
          .window(Tumble over 1.days on 'rowtime as 'w)
          .select('w.start as 'day,
                  'location,
                  (('tempF.avg - 32) * 0.556) as 'avgTempC)
          .where('location like "room%")

      SQL:
        sensorTable.sql("""
          SELECT day, location, avg((tempF - 32) * 0.556) AS avgTempC
          FROM sensorData
          WHERE location LIKE 'room%'
          GROUP BY day, location
        """)
  21. What can you do with that?
      • 10 billion events (2 TB) processed daily across multiple Flink jobs for a telco network control center
      • ad-hoc realtime queries: > 30 operators processing 30 billion events daily, maintaining 100s of GB of state inside Flink with exactly-once guarantees
      • jobs with > 20 operators running on > 5000 vCores in a 1000-node cluster, processing millions of events per second
  22. Flink's streams playing at batch: TeraSort, relational join, classic batch jobs, graph processing, linear algebra
  23. What can we expect next?
  24. Queryable State
  25. Streaming Architecture Blueprint (revisited): collect → log → analyze & serve & store, plus other services
  26. Full SQL on Streams: continuous queries with incremental results; windows, event time, and processing time; consistent with SQL on bounded data. https://docs.google.com/document/d/1qVVt_16kdaZQ8RTfA_f4konQPW4tnl8THw6rzGUdaqU
  27. Elastic Parallelism: maintaining exactly-once state consistency; no extra effort for the user; no need to carefully plan partitions
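One way to rescale keyed state without user effort, mirroring Flink's documented key-group scheme (simplified here; constants and names are illustrative): a key's group is fixed by a maximum parallelism, and only the group-to-operator mapping changes when the job is rescaled, so state moves in whole groups and stays consistent.

```scala
// Rescalable key partitioning via key groups: the key-to-group mapping
// never changes, so rescaling only reassigns whole groups to operators.
object KeyGroups {
  // Fixed for the lifetime of the job (the "max parallelism").
  def keyGroup(key: String, maxParallelism: Int): Int =
    math.floorMod(key.hashCode, maxParallelism)

  // Which operator instance owns a key group at the current parallelism.
  def operatorIndex(group: Int, maxParallelism: Int, parallelism: Int): Int =
    group * parallelism / maxParallelism
}
```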
  28. Very large state: terabytes of state inside the stream processor (e.g., long histories of windows, large join tables); maintaining fast checkpoints and recovery; state at local memory speed
  30. We are hiring! data-artisans.com/careers
