Apache Flink: Streaming Done Right @ FOSDEM 2016

The talk I gave at FOSDEM 2016 on the 31st of January.

The talk explains how to do stateful stream processing with Apache Flink, using the example of counting tweet impressions. It covers Flink's windowing semantics, stateful operators, fault tolerance, and performance numbers. The talk ends with an outlook on what is going to happen in the next couple of months.

Published in: Technology

  1. Apache Flink: Streaming Done Right. Till Rohrmann, trohrmann@apache.org, @stsffap
  2. What Is Apache Flink? Apache TLP since December 2014; parallel streaming data flow runtime; low latency and high throughput; exactly-once semantics; stateful operations.
  3. Why Stream Processing? Most problems have a streaming nature; stream processing gives lower latency; data volumes are more easily tamed; resource consumption is more predictable. [Diagram: event stream]
  4. Counting Tweet Impressions. [Diagram labels: Input stream, Input, Group, #tweets]
  5. How Would I Do It with Flink? We have to define a time frame for the aggregation.
        import org.apache.flink.streaming.api.scala._
        import org.apache.flink.streaming.api.windowing.time.Time

        case class Tweet(id: Long, timestamp: Long, count: Long)

        val env = StreamExecutionEnvironment.getExecutionEnvironment
        // MyTwitterSource is the talk's placeholder source; it is not defined on the slides
        val tweets: DataStream[Tweet] = env.addSource(new MyTwitterSource())
        // Sum the impression counts per tweet id over 10-minute windows
        val result: DataStream[Tweet] = tweets
          .keyBy("id")
          .timeWindow(Time.minutes(10))
          .sum("count")
        result.print()
        env.execute()
  6. Expressive Windows. Time and count based; event-time support; custom windows.
  7. Stateful Operators. What if a window grows too large? Solution: a stateful mapper with a counter. [Diagram: "Count tweet impressions" operator; the user function reads and writes operator state (#42, #13, #37)]
  8. Intuitive API.
        case class Tweet(id: Long, timestamp: Long, count: Long)

        val env = StreamExecutionEnvironment.getExecutionEnvironment
        val tweets: DataStream[Tweet] = env.addSource(new MyTwitterSource())
        // Keep one counter per tweet id as keyed operator state
        val result: DataStream[Tweet] = tweets
          .keyBy("id")
          .mapWithState { (tweet, state: Option[Long]) =>
            state match {
              case Some(counter) => (tweet.copy(count = counter + 1L), Some(counter + 1L))
              case None => (tweet, Some(1L))
            }
          }
  9. What If a Failure Occurs? Loss of data and incorrect count! [Diagram: "Count tweet impressions" operator; operator state after the failure is unknown (#??, #??)]
  10. We Need to Save Our State! Consistent snapshots of the distributed data stream and operator state.
  11. How Does It Work? Markers for checkpoints, injected into the data flow.
  12. Processing Guarantees. At most once: no guarantees at all. At least once: ensure that all operators see all events. Exactly once. Flink gives you all guarantees.
  13. Checkpointed Operators. State is automatically checkpointed; the state backend is configurable. [Diagram: barriers flowing into the "Count tweet impressions" operator; the user function reads and writes operator state (#42, #13, #37)]
  14. What About Performance? Continuous streaming; latency-bound buffering; distributed snapshots. High throughput and low latency, with a configurable throughput/latency tradeoff.
  15. What's coming next?
  16. Asynchronous Snapshots. Taking snapshots stalls the operator. Solution: out-of-core and asynchronous snapshots. [Diagram: async/incremental snapshots; spill to disk; barriers flowing into the "Count tweet impressions" operator; the user function reads and writes operator state (#42, #13, #37)]
  17. Streams with Varying Data Rate. With static resources: provision for the maximum rate, leaving idle capacity. [Plot: events/second over time]
  18. (1) Adjust Parallelism. Initial configuration; scale out (for load); scale in (to save resources).
  19. (2) Dynamic Worker Pool. [Diagram: JobManager; Resource Manager; pool of cluster resources (YARN/Mesos/…); allocate and release TaskManagers]
  20. Declarative Queries. StreamSQL; complex event processing. [Diagram: states Sober, Drunk, Car, Afoot; transitions drink, drink, enters car, walks; raise alert]
        val tabEnv = new TableEnvironment(env)
        tabEnv.registerStream(stream, "myStream", ("ID", "MEASURE", "COUNT"))
        val sqlQuery = tabEnv.sql(
          "SELECT ID, MEASURE FROM myStream WHERE COUNT > 17")
  21. Where to Find Us? www.flink.apache.org, https://github.com/apache/flink, @ApacheFlink
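
To make the windowed aggregation on slide 5 more concrete, here is a minimal plain-Scala sketch of what a keyed 10-minute tumbling-window sum computes. It does not use Flink at all. The object name, the helper functions, and the choice to assign windows by the tweet's own `timestamp` field (assumed to be non-negative epoch milliseconds) are assumptions made for illustration; the slide's job may well assign windows by processing time instead.

```scala
// Plain-Scala sketch of a keyed tumbling-window sum; no Flink dependency.
object WindowedCountSketch {
  // Same shape as the Tweet case class on slide 5.
  // Assumption: timestamp is non-negative epoch milliseconds.
  final case class Tweet(id: Long, timestamp: Long, count: Long)

  // Window size matching Time.minutes(10) on the slide, in milliseconds.
  val TenMinutesMs: Long = 10L * 60L * 1000L

  // Start of the tumbling window that contains the given timestamp.
  def windowStart(timestamp: Long, sizeMs: Long): Long =
    timestamp - (timestamp % sizeMs)

  // Groups tweets by (id, window start) and sums the counts in each group.
  def sumPerWindow(tweets: Seq[Tweet], sizeMs: Long): Map[(Long, Long), Long] =
    tweets
      .groupBy(t => (t.id, windowStart(t.timestamp, sizeMs)))
      .map { case (key, group) => key -> group.map(_.count).sum }
}
```

Unlike this batch-style sketch, a streaming runtime emits each window's result when the window closes rather than after the whole input has been read.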
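
The stateful mapper on slides 7 and 8 can be read as a function from (input, optional state) to (output, optional new state), applied once per element with one state slot per key. The sketch below reuses the slide's function body unchanged and adds a hypothetical driver, `run`, that stands in for the runtime by keeping the per-key state in an ordinary immutable `Map`. The driver and object name are assumptions for illustration, not Flink's implementation.

```scala
// Plain-Scala sketch of per-key state handling; no Flink dependency.
object StatefulCounterSketch {
  final case class Tweet(id: Long, timestamp: Long, count: Long)

  // Same logic as the function passed to mapWithState on slide 8.
  def countImpression(tweet: Tweet, state: Option[Long]): (Tweet, Option[Long]) =
    state match {
      case Some(counter) => (tweet.copy(count = counter + 1L), Some(counter + 1L))
      case None          => (tweet, Some(1L))
    }

  // Hypothetical driver: looks up the state for the tweet's key, applies the
  // function, and stores the returned state (or clears it on None).
  def run(tweets: Seq[Tweet]): (Vector[Tweet], Map[Long, Long]) =
    tweets.foldLeft((Vector.empty[Tweet], Map.empty[Long, Long])) {
      case ((out, states), tweet) =>
        val (result, newState) = countImpression(tweet, states.get(tweet.id))
        val updated = newState match {
          case Some(c) => states.updated(tweet.id, c)
          case None    => states - tweet.id
        }
        (out :+ result, updated)
    }
}
```

Because the state is a single counter per key rather than a buffer of all elements, its size does not grow with the number of impressions, which is the point slide 7 makes about windows that grow too large.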
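
Slides 9 to 13 describe checkpoint markers (barriers) that are injected into the data flow so that operator state can be snapshotted consistently and restored after a failure. The following is a heavily simplified, single-operator sketch of that idea in plain Scala. It assumes a replayable input indexed by offset and keeps the snapshot in memory; the names `Impression`, `Barrier`, `Snapshot`, `Operator`, `process`, and `recover` are all hypothetical. It does not model barrier alignment across multiple inputs or a configurable state backend.

```scala
// Plain-Scala sketch of barrier-triggered snapshots and recovery; no Flink dependency.
object CheckpointSketch {
  sealed trait Element
  final case class Impression(tweetId: Long) extends Element
  final case class Barrier(checkpointId: Long) extends Element

  // A snapshot pairs the operator state with the stream offset right after the barrier.
  final case class Snapshot(checkpointId: Long, counts: Map[Long, Long], nextOffset: Int)

  final case class Operator(counts: Map[Long, Long], lastSnapshot: Option[Snapshot])

  val empty: Operator = Operator(Map.empty, None)

  // An impression updates the counter for its tweet; a barrier snapshots the current state.
  def step(op: Operator, element: Element, offset: Int): Operator = element match {
    case Impression(id) =>
      op.copy(counts = op.counts.updated(id, op.counts.getOrElse(id, 0L) + 1L))
    case Barrier(cp) =>
      op.copy(lastSnapshot = Some(Snapshot(cp, op.counts, offset + 1)))
  }

  // Processes the stream elements at offsets start (inclusive) to end (exclusive).
  def process(op: Operator, stream: Vector[Element], start: Int, end: Int): Operator =
    (start until end).foldLeft(op)((acc, i) => step(acc, stream(i), i))

  // Recovery: drop the in-flight state, restore the last snapshot, and report
  // the offset from which the input must be replayed.
  def recover(failed: Operator): (Operator, Int) = failed.lastSnapshot match {
    case Some(s) => (Operator(s.counts, Some(s)), s.nextOffset)
    case None    => (empty, 0)
  }
}
```

In this sketch, restoring the snapshot and replaying from its offset means every impression is reflected in the counters exactly once, even though elements after the barrier are processed a second time.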
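
For the StreamSQL query on slide 20, it may help to see the equivalent filter and projection written by hand. The sketch below is plain Scala over an in-memory sequence and does not use any Flink or Table API. The `Row` type and its field types are assumptions, since the slide only gives the field names ID, MEASURE, and COUNT.

```scala
// Plain-Scala sketch of the slide's query as a filter plus projection.
object DeclarativeQuerySketch {
  // Hypothetical row type for the three registered fields; the types are assumed.
  final case class Row(id: Long, measure: Double, count: Long)

  // Hand-written equivalent of:
  //   SELECT ID, MEASURE FROM myStream WHERE COUNT > 17
  def query(rows: Seq[Row]): Seq[(Long, Double)] =
    rows
      .filter(_.count > 17L)        // WHERE COUNT > 17
      .map(r => (r.id, r.measure))  // SELECT ID, MEASURE
}
```

A declarative query states only the condition and the selected fields, and leaves the choice of operators like these to the system.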