
Introduction to Flink Streaming


Framework for modern streaming applications

Published in: Data & Analytics


  1. Introduction to Flink Streaming: Framework for modern streaming applications
  2. ● Madhukara Phatak ● Big data consultant and trainer at ● Consults in Hadoop, Spark and Scala
  3. Agenda ● Stream abstraction vs streaming applications ● Stream as an abstraction ● Challenges with modern streaming applications ● Why not Spark streaming? ● Introduction to Flink ● Introduction to Flink streaming ● Flink Streaming API ● References
  4. Use of streams in applications ● Streams are used both inside and outside big data to support two major use cases ○ Stream as an abstraction layer ○ Stream as unbounded data to support real-time analysis ● Abstraction and real time have different needs and expectations from streams ● Different platforms use streams with different meanings
  5. Stream as the abstraction ● A stream is a sequence of data elements made available over time. ● A stream can be thought of as items on a conveyor belt being processed one at a time rather than in large batches. ● Streams can be unbounded (message queues) or bounded (files) ● Streams are becoming the new abstraction for building data pipelines.
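The conveyor-belt view can be sketched in plain Python (a concept illustration, not Flink code): a generator yields elements one at a time, and the same one-at-a-time consumer works whether the stream is bounded, like a file, or unbounded, like a message queue.

```python
def bounded_stream(lines):
    # A bounded stream: a finite sequence of elements (like lines of a file).
    for line in lines:
        yield line

def running_length(stream):
    # Consume one element at a time, like items on a conveyor belt,
    # emitting a running total instead of waiting for a whole batch.
    total = 0
    for item in stream:
        total += len(item)
        yield total

print(list(running_length(bounded_stream(["ab", "cde"]))))  # [2, 5]
```

The consumer never asks whether the source is finite; that is what makes the stream usable as a common abstraction for both cases.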
  6. Streams as abstraction outside big data ● Streams have been used as an abstraction outside big data in the last few years ● Some of them are ○ Reactive streams like akka-streams, akka-http ○ Java 8 streams ○ RxJava etc ● These uses of streams don't care about real-time analysis
  7. Streams for real-time analysis ● In this use case, a stream is viewed as unbounded data which has low latency and is available as soon as it arrives in the system ● The stream can be processed using a non-stream abstraction at run time ● So the focus in these scenarios is only on modelling the API around streams, not the implementation ● Ex: Spark streaming
  8. Stream abstraction in big data ● Stream is the new abstraction layer people are exploring in big data ● With the right implementation, streams can support both streaming and batch applications much more effectively than existing abstractions ● Batch on streaming is a new way of looking at processing, rather than treating streaming as a special case of batch ● Batch can be faster on streaming than on dedicated batch processing
  9. Frameworks with stream as abstraction
  10. Apache Flink ● Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams ● Flink provides ○ DataSet API - for bounded streams ○ DataStream API - for unbounded streams ● Flink embraces the stream as the abstraction to implement its dataflow.
  11. Flink stack
  12. Flink history ● The Stratosphere project started at the Technical University of Berlin in 2009 ● Entered the Apache incubator in March 2014 ● Became a top-level project in Dec 2014 ● Started as a stream engine for batch processing ● Started to support streaming a few versions ago ● Data Artisans is a company founded by the core Flink team
  13. Flink streaming ● Flink Streaming is an extension of the core Flink API for high-throughput, low-latency data stream processing ● Supports many data sources like Flume, Twitter, ZeroMQ and also any user-defined data source ● Data streams can be transformed and modified using high-level functions similar to the ones provided by the batch processing API ● Sounds much like what Spark streaming promises!
  14. Streaming is not fast batch processing ● Most streaming frameworks focus too much on latency when they develop streaming extensions ● Both Storm and Spark streaming view streaming as a low-latency batch processing system ● Though latency plays an important role in real-time applications, the needs and challenges go beyond it ● Addressing the complex needs of modern streaming systems requires a fresh view on streaming APIs
  15. Streaming in Lambda architecture ● In the Lambda architecture, streaming is viewed as a limited, approximate, low-latency computing system compared to a batch system ● So we usually run a streaming system to get low-latency approximate results, and a batch system to get high-latency but accurate results ● These limitations of streaming stem from conventional thinking and implementations ● The new idea is: why not make streaming a low-latency, accurate system itself?
  16. Google Dataflow ● Google articulated the first modern streaming framework for low-latency, exactly-once, accurate stream applications in their Dataflow paper ● It describes a single system which can replace the need for separate streaming and batch processing systems ● Known as the Kappa architecture ● Modern stream frameworks embrace this over the Lambda architecture ● Google Dataflow is open sourced under the name Apache Beam
  17. Google Dataflow and Flink streaming ● Flink adopted Dataflow ideas for its streaming API ● The Flink streaming API went through a big overhaul in the 1.0 version to embrace these ideas ● It was relatively easy to adapt the ideas as both Google Dataflow and Flink use streaming as the abstraction ● Spark 2.0 may add some of these ideas in its structured stream processing effort
  18. Needs of modern real-time applications ● Ability to handle out-of-order events in unbounded data ● Ability to correlate events with different dimensions of time ● Ability to correlate events using custom application-specific characteristics like sessions ● Ability to do both micro-batch and event-at-a-time processing on the same framework ● Support for complex stream processing libraries
  19. Mandatory wordcount ● Streams are represented using DataStream in Flink streaming ● DataStream supports both RDD- and Dataset-like APIs for manipulation ● In this example, ○ Read from a socket to create a DataStream ○ Use map, keyBy and sum operations for aggregation ● com.madhukaraphatak.flink.streaming.examples.StreamingWordCount
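The same map / keyBy / sum pipeline can be sketched in plain Python (the real example is Scala against Flink's DataStream API; the function below is only an illustration of the dataflow): each line is mapped to (word, 1) pairs, and a per-key running sum emits an updated count for every event, as an un-discretized stream would.

```python
from collections import defaultdict

def streaming_word_count(lines):
    # map: split each line into words; keyBy + sum: keep a running
    # count per word and emit the updated count for every single event,
    # rather than once per batch.
    counts = defaultdict(int)
    for line in lines:
        for word in line.lower().split():
            counts[word] += 1
            yield (word, counts[word])

out = list(streaming_word_count(["to be", "or not to be"]))
print(out[-1])  # ('be', 2)
```

Note that state (the per-word count) lives for the whole lifetime of the stream; slide 20 contrasts this operator-level statefulness with Spark's default.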
  20. Flink streaming vs Spark streaming ● Spark Streaming: streams are represented using DStreams / Flink Streaming: streams are represented using DataStreams ● Spark: stream is discretized into mini batches / Flink: stream is not discretized ● Spark: supports an RDD DSL / Flink: supports a Dataset-like DSL ● Spark: stateless by default / Flink: stateful by default at the operator level ● Spark: runs a mini batch for each interval / Flink: runs pipelined operators for each event that comes in ● Spark: near real time / Flink: real time
  21. Discretizing the stream ● Flink by default doesn't need any discretization of the stream to work ● But using the window API, we can create a discretized stream similar to Spark ● This time the state is discarded as and when each batch is computed ● This way you can mimic Spark micro batches in Flink ● com.madhukaraphatak.flink.streaming.examples.WindowedStreamingWordCount
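The discretization idea can be sketched in stdlib Python (not Flink's window API): events are grouped into tumbling windows by timestamp, and the per-window counts are discarded as soon as the window is emitted, which is exactly the micro-batch behaviour being mimicked.

```python
def tumbling_window_counts(events, window_size):
    # events: (timestamp, word) pairs in timestamp order.
    # State lives only for the current window and is discarded on emit,
    # mimicking a micro-batch.
    current_window, counts = None, {}
    for ts, word in events:
        window = ts // window_size
        if current_window is not None and window != current_window:
            yield (current_window, counts)  # emit, then drop the state
            counts = {}
        current_window = window
        counts[word] = counts.get(word, 0) + 1
    if counts:
        yield (current_window, counts)

events = [(0, "a"), (1, "a"), (5, "b"), (6, "a")]
print(list(tumbling_window_counts(events, 5)))
# [(0, {'a': 2}), (1, {'b': 1, 'a': 1})]
```

Compare with the un-windowed word count: there the counts accumulate forever; here each window starts from empty state.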
  22. Understanding the dataflow of Flink ● All programs in Flink, both batch and streaming, are represented using a dataflow ● This dataflow signifies the stream abstraction provided by the Flink runtime ● The dataflow treats all data as streams and processes them using a long-running operator model ● This is quite different from the RDD model of Spark ● The Flink UI allows us to understand the dataflow of a given Flink program
  23. Running in local mode ● bin/ ● bin/flink run -c com.madhukaraphatak.flink.streaming.examples.StreamingWordCount /home/madhu/Dev/mybuild/flink-examples/target/scala-2.10/flink-examples_2.10-1.0.jar
  24. Dataflow for wordcount example
  25. Operator fusing ● The Flink optimiser fuses operators for efficiency ● All the fused operators run in the same thread, which saves the serialization and deserialization cost between the operators ● For the fused operators, Flink generates a nested function which comprises all the code from the operators ● This is much more efficient than RDD optimization ● The Dataset API is planning to support this functionality ● You can disable this with env.disableOperatorChaining()
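The fusing idea can be sketched in plain Python (a toy illustration, not Flink's optimiser): instead of handing each element between separate operator stages, with serialization cost at every hand-off, chained element-wise operators are composed into one nested function applied in a single call.

```python
def fuse(*operators):
    # Compose a chain of element-wise operators into one function,
    # so each element crosses the whole chain in a single call
    # with no hand-off between stages.
    def fused(x):
        for op in operators:
            x = op(x)
        return x
    return fused

# Three logical operators fused into one: strip -> lower -> len.
pipeline = fuse(str.strip, str.lower, len)
print([pipeline(s) for s in ["  Hello ", "FLINK  "]])  # [5, 5]
```

In Flink the same effect comes from code generation and thread co-location rather than simple function composition, but the saving is the same: no intermediate materialization between fused operators.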
  26. Dataflow without operator fusing
  27. Flink streaming vs Spark streaming ● Spark Streaming: uses the RDD distribution model for processing / Flink Streaming: uses a pipelined stream processing paradigm ● Spark: parallelism is done at the batch level / Flink: parallelism is controlled at the operator level ● Spark: uses RDD immutability for fault recovery / Flink: uses asynchronous barriers for fault recovery ● Spark: RDD-level optimization for stream optimization / Flink: operator fusing for stream optimization
  28. Window API ● A powerful API to track and do custom stateful analysis ● Types of windows ○ Time window ■ Tumbling window ■ Sliding window ○ Non-time-based window ■ Count window ● Ex: WindowExample.scala
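The non-time-based case can be sketched in stdlib Python (a concept illustration, not Flink's count window API): the window fires every N elements, and time plays no role at all.

```python
def count_windows(stream, size):
    # Emit a window every `size` elements; timestamps are irrelevant.
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) == size:
            yield list(buf)
            buf.clear()

print(list(count_windows([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4]]
```

Note the trailing element 5 is never emitted: a count window only fires when full, which is the behaviour the trigger component (next slides) makes configurable.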
  29. Anatomy of the Window API ● The window API is made of 3 different components ● The three components of a window are ○ Window assigner ○ Trigger ○ Evictor ● These three components make up all the window APIs in Flink
  30. Window assigner ● A function which determines, for a given element, which window it should belong to ● Responsible for the creation of windows and assigning elements to a window ● Two types of window assigners ○ Time-based window assigner ○ GlobalWindow assigner ● Users can write their own custom window assigner too
  31. Trigger ● A trigger is a function responsible for determining when a given window is evaluated ● In a time-based window, this function will wait till the time is up to trigger ● But in a non-time-based window, it can use custom logic to determine when to evaluate a given window ● In our example, the number of records in a given window is used to determine whether to trigger or not ● WindowAnatomy.scala
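The assigner/trigger split can be sketched in plain Python (hypothetical names, not Flink's classes): the assigner maps each element to a window key, and the trigger independently decides when that window's contents are evaluated.

```python
def windowed(events, assigner, trigger):
    # assigner(event) -> window key; trigger(contents) -> fire or not.
    # The two concerns are independent: any assigner composes with
    # any trigger, which is the point of the anatomy on this slide.
    windows = {}
    for event in events:
        key = assigner(event)
        windows.setdefault(key, []).append(event)
        if trigger(windows[key]):
            yield (key, windows.pop(key))

# Assigner buckets events by first letter; a count-based trigger
# fires a window once it holds 2 elements.
fired = list(windowed(["ant", "ape", "bee"],
                      assigner=lambda e: e[0],
                      trigger=lambda w: len(w) >= 2))
print(fired)  # [('a', ['ant', 'ape'])]
```

The "b" window never fires because its trigger condition was never met; an evictor (the third component) would additionally prune elements from a window before evaluation.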
  32. Building a custom session window ● We want to track the session of a user ● Each session is identified using a sessionID ● We get an event when the session is started ● We evaluate the session when we get the end-of-session event ● For this, we implement our own custom window trigger which tracks the end of the session ● Ex: SessionWindowExample.scala
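The custom session trigger idea, sketched in plain Python (names are illustrative, not taken from the Scala example): events carry a session id, and a session's events are evaluated only when its end-of-session event arrives.

```python
def session_windows(events):
    # events: (session_id, payload) pairs; payload "END" closes a session.
    sessions = {}
    for sid, payload in events:
        if payload == "END":
            # The trigger: evaluate and discard the session on its end event.
            yield (sid, sessions.pop(sid, []))
        else:
            sessions.setdefault(sid, []).append(payload)

events = [("s1", "click"), ("s2", "view"), ("s1", "scroll"), ("s1", "END")]
print(list(session_windows(events)))  # [('s1', ['click', 'scroll'])]
```

Session s2 stays open because no end event arrived; this is an application-defined trigger condition, not a time-based one.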
  33. Concept of time in Flink streaming ● Time in a streaming application plays an important role ● So having the ability to express time in a flexible way is a very important feature of a modern streaming application ● Flink supports three kinds of time ○ Processing time ○ Event time ○ Ingestion time ● Event time is one of the important features of Flink which complements the custom window API
  34. Understanding event time ● Time in Flink needs to address the following two questions ○ When did the event occur? ○ How much time has passed since the event occurred? ● The first question can be answered by assigning timestamps ● The second question is answered by understanding the concept of watermarks ● Ex: EventTimeExample.scala
  35. Watermarks in event time ● A watermark is a special signal which signifies the flow of time in Flink ● In the diagram on the slide, w(20) signifies that 20 units of time have passed at the source ● Watermarks allow Flink to support different time abstractions
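The watermark idea can be sketched in stdlib Python (a toy, not Flink's mechanism): out-of-order events are buffered per event-time window, and a window is emitted only once a watermark guarantees that no earlier event can still arrive.

```python
def event_time_windows(stream, window_size):
    # stream mixes events ("ev", timestamp, value) and watermarks ("wm", t).
    # A window [start*size, (start+1)*size) is emitted once a watermark
    # passes its end, so out-of-order events within the window are safe.
    windows = {}
    for item in stream:
        if item[0] == "ev":
            _, ts, value = item
            windows.setdefault(ts // window_size, []).append(value)
        else:  # watermark: event time has advanced to item[1] at the source
            _, wm = item
            for start in sorted(k for k in windows
                                if (k + 1) * window_size <= wm):
                yield (start, windows.pop(start))

stream = [("ev", 3, "a"), ("ev", 1, "b"), ("wm", 5),
          ("ev", 7, "c"), ("wm", 20)]
print(list(event_time_windows(stream, 5)))
# [(0, ['a', 'b']), (1, ['c'])]
```

The event at timestamp 1 arrives after the event at timestamp 3, yet still lands in the correct window, which is exactly what processing-time windows cannot guarantee.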
  36. References ● streaming/ ● https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at ● comparative-performance-evaluation-of-flink