
Flink Streaming @BudapestData

Presentation on real-time analytics using Apache Flink at the Budapest Data Forum


  1. Real-time data processing with Apache Flink. Gyula Fóra (gyfora@apache.org), Flink committer, Swedish ICT
  2. What is Apache Flink?
  3. What is Apache Flink? A distributed data flow processing system ▪ Focused on large-scale data analytics ▪ Unified real-time stream and batch processing ▪ Easy and powerful APIs in Java / Scala (+ Python) ▪ Robust and fast execution backend (Dataflow diagram: sources feeding map, filter, reduce, join, and iterate operators into a sink)
  4. Flink Stack (0.9.0): libraries (Gelly, Table, ML, SAMOA, MRQL, Cascading (WiP)) on top of the DataSet (Java/Scala/Python) and DataStream (Java/Scala) APIs, plus Hadoop M/R compatibility; all running on the streaming dataflow runtime, deployable in Local, Remote, YARN, Tez, or Embedded mode
  5. Stream processing
  6. Stream processing ▪ Data stream: infinite sequence of data arriving in a continuous fashion ▪ Stream processing: analyzing and acting on real-time streaming data, using continuous queries
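The "continuous query" idea above can be sketched in a few lines of plain Scala (no Flink involved): the result is maintained incrementally as each element of an unbounded stream arrives, here simulated with an `Iterator`. All names (`RunningAvg`, `continuousQuery`) are hypothetical illustrations, not Flink API.

```scala
// A running average maintained incrementally, one record at a time.
case class RunningAvg(count: Long, sum: Double) {
  def update(x: Double): RunningAvg = RunningAvg(count + 1, sum + x)
  def value: Double = if (count == 0) 0.0 else sum / count
}

// Fold the update function over the stream; a real continuous query
// never terminates, so we cap it with `take` for the sketch.
def continuousQuery(stream: Iterator[Double], take: Int): RunningAvg =
  stream.take(take).foldLeft(RunningAvg(0, 0.0))(_ update _)

// Simulated sensor stream; in a real system this is unbounded.
val sensor = Iterator.continually(10.0)
val state = continuousQuery(sensor, 5)
```

The point is that the query's state is updated per record rather than recomputed over the whole input, which is exactly what distinguishes stream processing from batch.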
  7. 3 parts of a streaming infrastructure: gathering → broker → analysis (Diagram: sources such as sensors, transaction logs, server logs, … feed the gathering layer)
  8. Streaming landscape ▪ Apache Storm: true streaming over distributed dataflow; low-level API (Bolts, Spouts) + Trident ▪ Spark Streaming: stream processing emulated on top of a batch system (non-native); functional API (DStreams), restricted by the batch runtime ▪ Apache Samza: true streaming built on top of Apache Kafka, state is a first-class citizen; slightly different stream notion, low-level API ▪ Apache Flink: true streaming over stateful distributed dataflow; rich functional API exploiting the streaming runtime, e.g. rich windowing semantics
  9. Flink Streaming
  10. What is Flink Streaming? ▪ Native, low-latency stream processor ▪ Expressive functional API ▪ Flexible operator state, stream windows ▪ Exactly-once processing semantics
  11. Native vs non-native streaming ▪ Non-native streaming: a stream discretizer feeds a sequence of batch jobs: while (true) { get next few records; issue batch computation } ▪ Native streaming: long-standing operators: while (true) { process next record }
  12. Computation vs Operator state ▪ Computation state • Result of a continuous computation (stream) • Typically produced by combining independent results (update function) • Transformations cannot interact with it! ▪ Operator state • Lives inside long-standing operators • Transformations can interact with the state (e.g. a Map function dependent on internal state) ▪ Examples of stateful operators • Pattern and anomaly detection • Multi-pass machine learning algorithms
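Operator state, as described above, can be illustrated with a plain-Scala sketch (not the Flink API): a long-standing map operator whose transformation both reads and updates internal state, here a per-key count used for naive anomaly flagging. `CountingOperator` and its threshold are hypothetical.

```scala
import scala.collection.mutable

class CountingOperator(threshold: Int) {
  // Operator state: lives inside the operator, across records.
  private val counts = mutable.Map.empty[String, Int].withDefaultValue(0)

  // A map function whose output depends on the internal state.
  def map(key: String): (String, Boolean) = {
    counts(key) += 1
    (key, counts(key) > threshold) // flag keys seen more than `threshold` times
  }
}

val op = new CountingOperator(threshold = 2)
val flags = Seq("a", "a", "b", "a").map(op.map)
```

Because the transformation interacts with the state directly, this is operator state in the slide's terminology; a pure "computation state" (e.g. a running sum produced by a reduce) would be invisible to the user function.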
  13. DataStream API
  14. Overview of the API ▪ Data stream sources • File system • Message queue connectors • Arbitrary source functionality ▪ Stream transformations • Basic transformations: Map, Reduce, Filter, Aggregations… • Binary stream transformations: CoMap, CoReduce… • Windowing semantics: policy-based flexible windowing (Time, Count, Delta…) • Temporal binary stream operators: Joins, Crosses… • Native support for iterations ▪ Data stream outputs ▪ For the details please refer to the programming guide: http://flink.apache.org/docs/latest/streaming_guide.html (Dataflow diagram: sources feeding map, filter, sum, merge, and reduce operators into a sink)
  15. Word count in Batch and Streaming

     case class Word(word: String, frequency: Int)

     DataSet API (batch):

     val lines: DataSet[String] = env.readTextFile(...)
     lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
       .groupBy("word").sum("frequency")
       .print()

     DataStream API (streaming):

     val lines: DataStream[String] = env.fromSocketStream(...)
     lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
       .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
       .groupBy("word").sum("frequency")
       .print()
  16. Flexible windows (more at: http://flink.apache.org/news/2015/02/09/streaming-example.html)
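The policy-based windowing mentioned on slide 14 (Time, Count, Delta policies) can be sketched in plain Scala, without the Flink API: a trigger policy decides when the current window buffer is emitted. `TriggerPolicy`, `CountPolicy`, `DeltaPolicy`, and `windowed` are hypothetical names, and the delta semantics here (compare against the window's first element) are a simplification of Flink's actual delta policy.

```scala
trait TriggerPolicy[T] { def fires(buffer: Seq[T], next: T): Boolean }

// Count policy: emit a window every `n` elements.
case class CountPolicy[T](n: Int) extends TriggerPolicy[T] {
  def fires(buffer: Seq[T], next: T): Boolean = buffer.size + 1 >= n
}

// Delta policy: emit when the next value drifts too far from the window start.
case class DeltaPolicy(maxDelta: Double) extends TriggerPolicy[Double] {
  def fires(buffer: Seq[Double], next: Double): Boolean =
    buffer.nonEmpty && math.abs(next - buffer.head) > maxDelta
}

// Cut a (finite, for the sketch) stream into windows driven by the policy.
def windowed[T](stream: Seq[T], policy: TriggerPolicy[T]): Seq[Seq[T]] = {
  val (windows, last) = stream.foldLeft((Vector.empty[Seq[T]], Vector.empty[T])) {
    case ((done, buf), x) =>
      if (policy.fires(buf, x)) (done :+ (buf :+ x), Vector.empty[T])
      else (done, buf :+ x)
  }
  if (last.nonEmpty) windows :+ last else windows
}
```

The design point the slide makes is that the *policy* is pluggable while the windowing machinery stays the same, which is what lets time, count, and delta windows share one API.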
  17. Performance ▪ Performance optimizations • Effective serialization due to strongly typed topologies • Operator chaining (thread sharing, no serialization) • Different automatic query optimizations ▪ Competitive performance • ~1.5m events / sec / core • As a comparison, Storm promises ~1m tuples / sec / node
  18. Fault tolerance
  19. Overview ▪ Fault tolerance in other systems • Message tracking/acks (Storm) • RDD re-computation (Spark) ▪ Fault tolerance in Apache Flink • Based on consistent global snapshots • Algorithm inspired by Chandy-Lamport • Low runtime overhead, stateful exactly-once semantics
  20. Checkpointing / Recovery ▪ Asynchronous barrier snapshotting for globally consistent checkpoints ▪ Checkpoint barriers are pushed through the data flow; records before a barrier are part of the snapshot, records after it are not ▪ While a checkpoint is in progress, state is backed up until the next snapshot (Diagram: a barrier travels with the data stream; each operator starts its checkpoint when the barrier arrives and reports when the checkpoint is done)
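The barrier idea can be sketched in plain Scala (a toy model, not Flink's implementation): barriers are injected into the record stream, and when an operator sees one it snapshots its current state. Records before the barrier are reflected in the snapshot; records after it are not. `SummingOperator` and the event types are hypothetical.

```scala
sealed trait Event
case class Record(value: Int) extends Event
case object Barrier extends Event

class SummingOperator {
  private var sum = 0
  private var snapshots = Vector.empty[Int]

  def process(e: Event): Unit = e match {
    case Record(v) => sum += v          // normal processing
    case Barrier   => snapshots :+= sum // checkpoint the state seen so far
  }

  def lastSnapshot: Option[Int] = snapshots.lastOption
  def currentSum: Int = sum
}

val op = new SummingOperator
// The snapshot will contain 1 + 2; Record(3) arrives after the barrier.
Seq(Record(1), Record(2), Barrier, Record(3)).foreach(op.process)
```

Because the barrier flows through the topology like any other element, all operators snapshot at a consistent logical point without pausing the whole job, which is the essence of the asynchronous barrier snapshotting described above.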
  21. State management ▪ State declared in the operators is managed and checkpointed by Flink ▪ Pluggable backends for storing persistent snapshots • Currently: JobManager, FileSystem (HDFS, Tachyon) ▪ State partitioning and flexible scaling in the future
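The snapshot/restore contract behind pluggable backends can be sketched in plain Scala (again hypothetical names, not the Flink interfaces): the operator exposes its state to a backend for persistence, and a fresh instance restores from that backend after a failure. The in-memory backend stands in for the JobManager or FileSystem backends named above.

```scala
trait StateBackend {
  def save(s: Map[String, Int]): Unit
  def load(): Map[String, Int]
}

// Toy backend; a real one would write to HDFS, Tachyon, or the JobManager.
class InMemoryBackend extends StateBackend {
  private var stored: Map[String, Int] = Map.empty
  def save(s: Map[String, Int]): Unit = stored = s
  def load(): Map[String, Int] = stored
}

class WordCountOperator {
  private var counts: Map[String, Int] = Map.empty
  def process(word: String): Unit =
    counts = counts.updated(word, counts.getOrElse(word, 0) + 1)
  def snapshot(backend: StateBackend): Unit = backend.save(counts)
  def restore(backend: StateBackend): Unit = counts = backend.load()
  def count(word: String): Int = counts.getOrElse(word, 0)
}

val backend = new InMemoryBackend
val op1 = new WordCountOperator
Seq("flink", "flink", "storm").foreach(op1.process)
op1.snapshot(backend)

// Simulated failure: a fresh operator instance restores from the backend.
val op2 = new WordCountOperator
op2.restore(backend)
```

Keeping the backend behind an interface is what makes it pluggable: the operator code is unchanged whether snapshots go to memory, a distributed file system, or elsewhere.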
  22. Closing
  23. Streaming roadmap for 2015 ▪ Improved state management • New backends for state snapshotting • Support for state partitioning and incremental snapshots • Master failover ▪ Improved job monitoring ▪ Integration with other Apache projects • SAMOA (PR ready), Zeppelin (PR ready), Ignite ▪ Streaming machine learning and other new libraries
  24. flink.apache.org @ApacheFlink
