
Flink Forward SF 2017: Konstantinos Kloudas - Extending Flink’s Streaming APIs


As more and more organizations and individual users turn to Apache Flink for their streaming workloads, there is a growing demand for additional functionality out of the box. On one hand, users ask for more low-level APIs that allow for more control; on the other, they ask for more high-level additions that make the common cases easier to express. This talk presents the new concepts added to the DataStream API in Flink 1.2 and in the upcoming Flink 1.3 release that try to address both of these goals. We will talk, among other things, about the ProcessFunction, a new low-level stream processing primitive that gives the user full control over how each event is processed and can register and react to timers; changes in the windowing logic that allow for more flexible windowing strategies; side outputs; and new features concerning the Flink connectors.

Published in: Data & Analytics

Flink Forward SF 2017: Konstantinos Kloudas - Extending Flink’s Streaming APIs

  1. 1. 1 Kostas Kloudas @KLOUBEN_K Flink Forward San Francisco April 11, 2017 Extending Flink’s Streaming APIs
  2. 2. 2 Original creators of Apache Flink® Providers of the dA Platform, a supported Flink distribution
  3. 3. Extensions to the DataStream API 3
  4. 4. Extensions to the DataStream API 4  ProcessFunction for Low-level Operations  Support for Asynchronous I/O
  5. 5. ProcessFunction 5
  6. 6. Stream Processing 6 Computation Computations on never-ending “streams” of events
  7. 7. Distributed Stream Processing 7 Computation Computation spread across many machines Computation Computation
  8. 8. Stateful Stream Processing 8 Computation State Result depends on history of stream
  9. 9. Stream Processing Engines  Time: • handle infinite streams • with out-of-order events  State: • guarantee fault-tolerance (distributed) • guarantee consistency (infinite streams) 9
  10. 10.  Gives access to all basic building blocks: • Events • Fault-tolerant, Consistent State • Timers (event- and processing-time) • Side Outputs 10 ProcessFunction
  11. 11. Common Use-Case Skeleton A  On each incoming element: • update some state • register a callback for a moment in the future  When that moment comes: • check a condition and perform a certain action, e.g. emit an element 11
  12. 12.  Use built-in windowing: • +Expressive • +A lot of functionality out-of-the-box • - Not always intuitive • - An overkill for simple cases  Write your own operator: • - Too many things to account for 12 Before the ProcessFunction
  13. 13.  Simple yet powerful API: 13 /** * Process one element from the input stream. */ void processElement(I value, Context ctx, Collector<O> out) throws Exception; /** * Called when a timer set using {@link TimerService} fires. */ void onTimer(long timestamp, OnTimerContext ctx, Collector<O> out) throws Exception; ProcessFunction
  14. 14.  Simple yet powerful API: 14 /** * Process one element from the input stream. */ void processElement(I value, Context ctx, Collector<O> out) throws Exception; /** * Called when a timer set using {@link TimerService} fires. */ void onTimer(long timestamp, OnTimerContext ctx, Collector<O> out) throws Exception; A collector to emit result values ProcessFunction
  15. 15.  Simple yet powerful API: 15 /** * Process one element from the input stream. */ void processElement(I value, Context ctx, Collector<O> out) throws Exception; /** * Called when a timer set using {@link TimerService} fires. */ void onTimer(long timestamp, OnTimerContext ctx, Collector<O> out) throws Exception; 1. Get the timestamp of the element 2. Register and use side outputs 3. Interact with the TimerService to: • query the current time • register timers 1. Do the above 2. Query if we are on Event or Processing time ProcessFunction
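The slide above lists what the Context and OnTimerContext expose without showing the calls behind the bullets. As a rough illustration only (not from the talk: the class name ContextTour, the side-output tag, and the 100 ms timer offset are made up, and ctx.output(...) relies on the side-output support that arrives with Flink 1.3), those capabilities map to calls roughly like this:

```java
// Flink imports omitted for brevity, as on the slides.
// Illustrative only: names and values below are placeholders, not from the talk.
public class ContextTour extends ProcessFunction<String, String> {

    // anonymous subclass so Flink can capture the side output's type information
    private static final OutputTag<String> SIDE = new OutputTag<String>("side"){};

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        Long ts = ctx.timestamp();                               // 1. timestamp of the element (may be null)
        ctx.output(SIDE, value);                                 // 2. emit to a side output (Flink 1.3+)
        long now = ctx.timerService().currentProcessingTime();   // 3a. query the current time ...
        ctx.timerService().registerEventTimeTimer(ts + 100);     // 3b. ... and register timers
        out.collect(value);                                      // main output
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        // OnTimerContext additionally tells us whether the timer fired in event or processing time
        boolean firedInEventTime = ctx.timeDomain() == TimeDomain.EVENT_TIME;
    }
}
```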
  16. 16.  Requirements: • maintain counts per incoming key, and • emit the key/count pair if no element came for the key in the last 100 ms (in event time) 16 ProcessFunction: example
  17. 17. 17  Implementation sketch: • Store the count, key and last mod timestamp in a ValueState (scoped by key) • For each record: • update the counter and the last mod timestamp • register a timer 100ms from “now” (in event time) • When the timer fires: • check the timer’s timestamp against the last mod time for that key and • emit the key/count pair if they differ by 100ms ProcessFunction: example
  18. 18. 18 public class MyProcessFunction extends ProcessFunction<Tuple2<String, String>, Tuple2<String, Long>> { // define your state descriptors @Override public void processElement(Tuple2<String, String> value, Context ctx, Collector<Tuple2<String, Long>> out) throws Exception { // update our state and register a timer } @Override public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple2<String, Long>> out) throws Exception { // check the state for the key and emit a result if needed } } ProcessFunction: example
  19. 19. 19 public class MyProcessFunction extends ProcessFunction<Tuple2<String, String>, Tuple2<String, Long>> { // define your state descriptors private final ValueStateDescriptor<CounterWithTS> stateDesc = new ValueStateDescriptor<>("myState", CounterWithTS.class); } ProcessFunction: example
  20. 20. 20 public class MyProcessFunction extends ProcessFunction<Tuple2<String, String>, Tuple2<String, Long>> { @Override public void processElement(Tuple2<String, String> value, Context ctx, Collector<Tuple2<String, Long>> out) throws Exception { ValueState<CounterWithTS> state = getRuntimeContext().getState(stateDesc); CounterWithTS current = state.value(); if (current == null) { current = new CounterWithTS(); current.key = value.f0; } current.count++; current.lastModified = ctx.timestamp(); state.update(current); ctx.timerService().registerEventTimeTimer(current.lastModified + 100); } } ProcessFunction: example
  21. 21. 21 public class MyProcessFunction extends ProcessFunction<Tuple2<String, String>, Tuple2<String, Long>> { @Override public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple2<String, Long>> out) throws Exception { CounterWithTS result = getRuntimeContext().getState(stateDesc).value(); if (timestamp == result.lastModified + 100) { out.collect(new Tuple2<String, Long>(result.key, result.count)); } } } ProcessFunction: example
  22. 22. 22 stream.keyBy("key") .process(new MyProcessFunction()) ProcessFunction: example
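The example above assumes a small state class, CounterWithTS, whose definition is never shown on the slides. A minimal sketch of what it presumably holds, based on the three fields the example actually reads and writes, is:

```java
// Sketch of the state class stored in the example's ValueState; the exact
// definition is not on the slides, only these three fields are used.
public class CounterWithTS {
    public String key;          // the key this count belongs to
    public long count;          // number of elements seen for the key so far
    public long lastModified;   // event-time timestamp of the last element for the key
}
```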
  23. 23. ProcessFunction: Side Outputs  Additional (to the main) output streams  No type limitations • each side output can have its own type 23
  24. 24.  Requirements: • maintain counts per incoming key, and • emit the key/count pair if no element came for the key in the last 100 ms (in event time) • otherwise, if the count > 10, send the key to a side output named gt10 24 ProcessFunction: example+
  25. 25. 25 final OutputTag<String> outputTag = new OutputTag<String>("gt10"){}; SingleOutputStreamOperator<Tuple2<String, Long>> mainStream = input.process( new ProcessFunction<Tuple2<String, String>, Tuple2<String, Long>>() { @Override public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple2<String, Long>> out) throws Exception { CounterWithTS result = getRuntimeContext().getState(stateDesc).value(); if (timestamp == result.lastModified + 100) { out.collect(new Tuple2<String, Long>(result.key, result.count)); } else if (result.count > 10) { ctx.output(outputTag, result.key); } } }); DataStream<String> sideOutputStream = mainStream.getSideOutput(outputTag); ProcessFunction: example+
  27. 27. 27  Applicable to Keyed streams  For Non-Keyed streams:  group on a dummy key if you need the timers  BEWARE: parallelism of 1  Use it directly without the timers  CoProcessFunction for low-level joins: • Applied on two input streams ProcessFunction
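The slide only names CoProcessFunction. For orientation, a bare skeleton of such a low-level two-input join could look like the sketch below; the class name, the element types, and the buffering strategy in the comments are illustrative, not from the talk.

```java
// Flink imports omitted for brevity, as on the slides.
// Skeleton of a low-level join over two keyed inputs; the actual join logic
// (what to keep in state, when to emit or clean up) is left as comments.
public class MyLowLevelJoin extends CoProcessFunction<String, Integer, Tuple2<String, Integer>> {

    @Override
    public void processElement1(String value, Context ctx,
                                Collector<Tuple2<String, Integer>> out) throws Exception {
        // called for every element of the first input:
        // e.g. buffer it in keyed state and/or register a cleanup timer
    }

    @Override
    public void processElement2(Integer value, Context ctx,
                                Collector<Tuple2<String, Integer>> out) throws Exception {
        // called for every element of the second input:
        // e.g. look up the buffered element for the same key and emit a joined pair
    }
}
```

It is applied with something along the lines of first.connect(second).keyBy(...).process(new MyLowLevelJoin()), so both inputs share the same keyed state and timers.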
  28. 28. Asynchronous I/O 28
  29. 29. Common Use-Case Skeleton B 29  On each incoming element: • extract some info from the element (e.g. key) • query an external storage system (DB or KV-store) for additional info • emit an enriched version of the input element
  31. 31.  Write a MapFunction that queries the DB: • +Simple • - Slow (synchronous access) or/and • - Requires high parallelism (more tasks)  Write your own operator: • - Too many things to account for 31 Before the AsyncIO support
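To make the trade-off concrete, the straightforward synchronous variant described above would look roughly like the sketch below; DatabaseClient and its blocking query(...) are hypothetical stand-ins for whatever client library is actually used, not Flink classes.

```java
// Flink imports omitted for brevity, as on the slides.
// Hypothetical synchronous enrichment: every map() call blocks on the external
// lookup, so throughput is capped by the round-trip latency unless parallelism is raised.
public class SyncEnrichFunction extends RichMapFunction<String, Tuple2<String, String>> {

    private transient DatabaseClient client;   // hypothetical blocking client

    @Override
    public void open(Configuration parameters) throws Exception {
        client = new DatabaseClient();         // one connection per parallel task instance
    }

    @Override
    public Tuple2<String, String> map(String key) throws Exception {
        String info = client.query(key);       // synchronous call: the task thread waits here
        return new Tuple2<>(key, info);
    }
}
```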
  32. 32. 32 Synchronous Access
  33. 33. 33 Communication delay can dominate application throughput and latency Synchronous Access
  34. 34. 34 Asynchronous Access
  35. 35.  Requirement: • a client that supports asynchronous requests  Flink handles the rest: • integration of async IO with DataStream API • fault-tolerance • order of emitted elements • correct time semantics (event/processing time) 35 AsyncFunction
  36. 36.  Simple API: /** * Trigger async operation for each stream input. */ void asyncInvoke(IN input, AsyncCollector<OUT> collector) throws Exception;  API call: /** * Example async function call. */ DataStream<...> result = AsyncDataStream.(un)orderedWait(stream, new MyAsyncFunction(), 1000, TimeUnit.MILLISECONDS, 100); 36 AsyncFunction
  37. 37. 37 Emitter P2P3 P1P4 AsyncWaitOperator E5 AsyncWaitOperator: • a queue of “Promises” • a separate thread (Emitter) AsyncFunction
  38. 38. 38 Emitter P2P3 P1P4 AsyncWaitOperator • Wrap E5 in a “promise” P5 • Put P5 in the queue • Call asyncInvoke(E5, P5) E5 P5 asyncInvoke(E5, P5)P5 AsyncFunction
  39. 39. 39 Emitter P2P3 P1P4 AsyncWaitOperator E5 P5 asyncInvoke(E5, P5)P5 asyncInvoke(value, asyncCollector): • a user-defined function • value : the input element • asyncCollector : the collector of the result (when the query returns) AsyncFunction
  40. 40. 40 Emitter P2P3 P1P4 AsyncWaitOperator E5 P5 asyncInvoke(E5, P5)P5 asyncInvoke(value, asyncCollector): • a user-defined function • value : the input element • asyncCollector : the collector of the result (when the query returns) Future<String> future = client.query(E5); future.thenAccept((String result) -> { P5.collect( Collections.singleton( new Tuple2<>(E5, result))); }); AsyncFunction
  41. 41. 41 Emitter P2P3 P1P4 AsyncWaitOperator E5 P5 asyncInvoke(E5, P5)P5 asyncInvoke(value, asyncCollector): • a user-defined function • value : the input element • asyncCollector : the collector of the result (when the query returns) Future<String> future = client.query(E5); future.thenAccept((String result) -> { P5.collect( Collections.singleton( new Tuple2<>(E5, result))); }); AsyncFunction
  42. 42. 42 Emitter P2P3 P1P4 AsyncWaitOperator E5 P5 asyncInvoke(E5, P5)P5 Emitter: • separate thread • polls queue for completed promises (blocking) • emits elements downstream AsyncFunction
  43. 43. 43 DataStream<Tuple2<String, String>> result = AsyncDataStream.(un)orderedWait(stream, new MyAsyncFunction(), 1000, TimeUnit.MILLISECONDS, 100);  our asyncFunction  a timeout: max time until considered failed  capacity: max number of in-flight requests AsyncFunction
  44. 44. 44 DataStream<Tuple2<String, String>> result = AsyncDataStream.(un)orderedWait(stream, new MyAsyncFunction(), 1000, TimeUnit.MILLISECONDS, 100); AsyncFunction
  45. 45. 45 DataStream<Tuple2<String, String>> result = AsyncDataStream.(un)orderedWait(stream, new MyAsyncFunction(), 1000, TimeUnit.MILLISECONDS, 100); P2P3 P1P4E2E3 E1E4 Ideally... Emitter AsyncFunction
  46. 46. 46 DataStream<Tuple2<String, String>> result = AsyncDataStream.unorderedWait(stream, new MyAsyncFunction(), 1000, TimeUnit.MILLISECONDS, 100); P2P3 P1P4E2E3 E1E4 Realistically... Emitter ...output ordered based on which request finished first AsyncFunction
  47. 47. 47 P2P3 P1P4E2E3 E1E4 Emitter  unorderedWait: emit results in order of completion  orderedWait: emit results in order of arrival  Always: watermarks never overpass elements and vice versa AsyncFunction
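Putting the pieces above together, the enrichment use case from Skeleton B might be wired up roughly as below, using the Flink 1.2-style AsyncFunction/AsyncCollector API shown on the slides. DatabaseClient and its CompletableFuture-based query(...) are hypothetical placeholders for a real asynchronous client; the rich variant is used only so the client can be created in open(), and the plain AsyncFunction interface from the slide works the same way.

```java
// Flink and JDK imports omitted for brevity, as on the slides.
// DatabaseClient is a hypothetical async client; only the callback pattern matters here.
public class MyAsyncFunction extends RichAsyncFunction<String, Tuple2<String, String>> {

    private transient DatabaseClient client;

    @Override
    public void open(Configuration parameters) throws Exception {
        client = new DatabaseClient();   // the client itself must support asynchronous requests
    }

    @Override
    public void asyncInvoke(String key, AsyncCollector<Tuple2<String, String>> collector) throws Exception {
        // fire the request without blocking the operator thread ...
        CompletableFuture<String> future = client.query(key);
        // ... and complete the pending "promise" once the result arrives
        future.thenAccept(result ->
                collector.collect(Collections.singleton(new Tuple2<>(key, result))));
    }
}

// wiring it in: 1 s timeout per request, at most 100 requests in flight
DataStream<Tuple2<String, String>> result =
        AsyncDataStream.unorderedWait(stream, new MyAsyncFunction(),
                1000, TimeUnit.MILLISECONDS, 100);
```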
  48. 48. Documentation  ProcessFunction: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/process_function.html https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/process_function.html  AsyncIO: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/asyncio.html 48
  49. 49. 49 Thank you! @KLOUBEN_K @ApacheFlink @dataArtisans
  50. 50. 50 Stream Processing and Apache Flink®'s approach to it @StephanEwen Apache Flink PMC CTO @ data Artisans FLINK FORWARD IS COMING BACK TO BERLIN SEPTEMBER 11-13, 2017 BERLIN.FLINK-FORWARD.ORG
  51. 51. We are hiring! data-artisans.com/careers
