
Kostas Kloudas - Extending Flink's Streaming APIs

As more and more organizations and individual users turn to Apache Flink for their streaming workloads, there is growing demand for additional out-of-the-box functionality. On the one hand, there is demand for lower-level APIs that allow for more control, while on the other, users ask for more high-level additions that make the common cases easier to express. This talk presents the new concepts added to the DataStream API in Flink 1.2 and in the upcoming Flink 1.3 release that try to reconcile these goals. We will talk, among other things, about the ProcessFunction, a new low-level stream processing primitive that gives the user full control over how each event is processed and can register and react to timers; changes in the windowing logic that allow for more flexible windowing strategies; side outputs; and new features concerning the Flink connectors.

Published in: Data & Analytics

Kostas Kloudas - Extending Flink's Streaming APIs

  1. Kostas Kloudas @KLOUBEN_K meetup@ResearchGate February 16, 2017 Extending Flink’s Streaming APIs
  2. Original creators of Apache Flink® Providers of the dA Platform, a supported Flink distribution
  3. Additions in Flink 1.2
  4. Additions in Flink 1.2 • Re-scalable State • Low-level Stream Operations • Asynchronous I/O • Table API and SQL • Externalized Checkpoints • Queryable State • Mesos Integration • …and, of course, Documentation
  6. Low-level Stream Operations
  7. Common Use-Case Skeleton A • On each incoming element: update some state, and register a callback for a moment in the future • When that moment comes: check a condition and perform a certain action, e.g. emit an element
  8. The Flink 1.1 way • Use built-in windowing: (+) expressive, (+) a lot of functionality out-of-the-box, (−) not always intuitive, (−) overkill for simple cases • Write your own operator: (−) too many things to account for in Flink 1.1
  9. The Flink 1.2 way: ProcessFunction • Gives access to all basic building blocks: events; fault-tolerant, consistent state; timers (event- and processing-time)
  10. The Flink 1.2 way: ProcessFunction • Simple yet powerful API: /** Process one element from the input stream. */ void processElement(I value, Context ctx, Collector<O> out) throws Exception; /** Called when a timer set using {@link TimerService} fires. */ void onTimer(long timestamp, OnTimerContext ctx, Collector<O> out) throws Exception;
  11. The Flink 1.2 way: ProcessFunction • The Collector<O> out argument is a collector to emit result values
  12. The Flink 1.2 way: ProcessFunction • The Context ctx lets you: get the timestamp of the element, and interact with the TimerService to query the current time and register timers • The OnTimerContext additionally lets you query whether the firing timer is an event-time or a processing-time timer
  13. ProcessFunction: example • Requirements: maintain counts per incoming key, and emit the key/count pair if no element came for the key in the last 100 ms (in event time)
  14. ProcessFunction: example • Implementation sketch: store the count, key, and last-modification timestamp in a ValueState (scoped by key) • For each record: update the counter and the last-modification timestamp, and register a timer 100 ms from “now” (in event time) • When the timer fires: check the callback’s timestamp against the last modification time for the key, and emit the key/count pair if they match
  15. ProcessFunction: example public class MyProcessFunction extends RichProcessFunction<Tuple2<String, Long>, Tuple2<String, Long>> { @Override public void open(Configuration parameters) throws Exception { // register our state with the state backend } @Override public void processElement(Tuple2<String, Long> value, Context ctx, Collector<Tuple2<String, Long>> out) throws Exception { // update our state and register a timer } @Override public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple2<String, Long>> out) throws Exception { // check the state for the key and emit a result if needed } }
  16. ProcessFunction: example public class MyProcessFunction extends RichProcessFunction<Tuple2<String, Long>, Tuple2<String, Long>> { private ValueState<CountWithTimestamp> state; @Override public void open(Configuration parameters) throws Exception { state = getRuntimeContext().getState( new ValueStateDescriptor<>("myState", CountWithTimestamp.class)); } }
  17. ProcessFunction: example public class MyProcessFunction extends RichProcessFunction<Tuple2<String, Long>, Tuple2<String, Long>> { @Override public void processElement(Tuple2<String, Long> value, Context ctx, Collector<Tuple2<String, Long>> out) throws Exception { CountWithTimestamp current = state.value(); if (current == null) { current = new CountWithTimestamp(); current.key = value.f0; } current.count++; current.lastModified = ctx.timestamp(); state.update(current); ctx.timerService().registerEventTimeTimer(current.lastModified + 100); } }
  18. ProcessFunction: example public class MyProcessFunction extends RichProcessFunction<Tuple2<String, Long>, Tuple2<String, Long>> { @Override public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple2<String, Long>> out) throws Exception { CountWithTimestamp result = state.value(); if (timestamp == result.lastModified + 100) { out.collect(new Tuple2<String, Long>(result.key, result.count)); } } }
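The semantics of the example in slides 15–18 can be checked outside Flink with a few lines of plain Java. The sketch below uses made-up names (not the Flink API): stale timers are recognized by comparing their firing time against the key's last modification, exactly as the `onTimer` check above does.

```java
import java.util.*;

// Plain-Java sketch of the count-with-timeout semantics (illustrative, no Flink):
// per-key counts and last-modification timestamps, plus timers that fire once
// the watermark passes them.
public class CountWithTimeoutSim {
    static final long TIMEOUT = 100L;
    final Map<String, Long> count = new HashMap<>();
    final Map<String, Long> lastModified = new HashMap<>();
    // registered timers as (fireAt, key), ordered by firing time
    final PriorityQueue<Map.Entry<Long, String>> timers =
            new PriorityQueue<>(Map.Entry.comparingByKey());
    final List<String> output = new ArrayList<>();

    // processElement: update state and register a timer TIMEOUT past the event time
    void processElement(String key, long timestamp) {
        count.merge(key, 1L, Long::sum);
        lastModified.put(key, timestamp);
        timers.add(Map.entry(timestamp + TIMEOUT, key));
    }

    // onTimer, driven by the advancing watermark: emit only if the timer still
    // matches the key's last modification (i.e. no newer element arrived)
    void advanceWatermark(long watermark) {
        while (!timers.isEmpty() && timers.peek().getKey() <= watermark) {
            Map.Entry<Long, String> timer = timers.poll();
            String key = timer.getValue();
            if (timer.getKey() == lastModified.get(key) + TIMEOUT) {
                output.add(key + "=" + count.get(key));
            }
        }
    }

    public static void main(String[] args) {
        CountWithTimeoutSim sim = new CountWithTimeoutSim();
        sim.processElement("a", 0);
        sim.processElement("a", 50); // supersedes the timer registered for t=100
        sim.processElement("b", 60);
        sim.advanceWatermark(200);
        System.out.println(sim.output); // [a=2, b=1]
    }
}
```

The timer registered at t=100 for key "a" is skipped because a newer element (t=50) moved the key's deadline to t=150; only the freshest timer per key emits.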
  19. ProcessFunction: example • If your stream is not keyed, you can always group on a dummy (constant) key • BEWARE: this gives a parallelism of 1 stream.keyBy("id").process(new MyProcessFunction())
  20. ProcessFunction: miscellaneous • CoProcessFunction for low-level joins: applied on two input streams; has two processElement() methods, one for each input stream • Upcoming releases may further enhance the ProcessFunction/CoProcessFunction • Planning to transform all CEP operators into ProcessFunctions
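The two-input shape of such a low-level join can be sketched in plain Java (hypothetical names, not the Flink API): one processElement method per input stream, joining through shared keyed state.

```java
import java.util.*;

// Plain-Java sketch of a CoProcessFunction-style low-level join (illustrative,
// not Flink code): the two processElement methods share per-key state.
public class LowLevelJoinSim {
    final Map<String, String> leftByKey = new HashMap<>(); // "state" scoped by key
    final List<String> output = new ArrayList<>();

    // called for each element of the first input stream: buffer it
    void processElement1(String key, String value) {
        leftByKey.put(key, value);
    }

    // called for each element of the second input stream: join if a match exists
    void processElement2(String key, String value) {
        String left = leftByKey.get(key);
        if (left != null) {
            output.add(key + ":(" + left + "," + value + ")");
        }
    }

    public static void main(String[] args) {
        LowLevelJoinSim join = new LowLevelJoinSim();
        join.processElement1("k1", "A");
        join.processElement2("k1", "B"); // match -> joined pair
        join.processElement2("k2", "C"); // no buffered left side -> nothing
        System.out.println(join.output); // [k1:(A,B)]
    }
}
```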
  21. Asynchronous I/O
  22. Common Use-Case Skeleton B • On each incoming element: extract some info from the element (e.g. the key); query an external storage system (DB or KV-store) for additional info; emit an enriched version of the input element
  23. The Flink 1.1 way • Write a MapFunction that queries the DB: (+) simple, (−) slow (synchronous access) and/or (−) requires high parallelism (more tasks) • Write your own operator: (−) too many things to account for in Flink 1.1
  25. Synchronous Access
  26. Synchronous Access: communication delay can dominate application throughput and latency
  27. Asynchronous Access
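The difference between the two access patterns can be reproduced with JDK futures alone (no Flink; `lookup` and its 50 ms latency are made up for illustration): issuing all queries before joining lets the waits overlap, so n in-flight requests cost roughly one round-trip instead of n.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.*;

// Sync vs. async access to a simulated external store, JDK only. `lookup` is a
// stand-in for a remote DB/KV query with noticeable round-trip latency.
public class AsyncAccessDemo {
    static final ExecutorService pool = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r);
        t.setDaemon(true); // let the JVM exit without an explicit shutdown
        return t;
    });

    static CompletableFuture<String> lookup(String key) {
        return CompletableFuture.supplyAsync(() -> {
            try { Thread.sleep(50); } catch (InterruptedException e) { throw new RuntimeException(e); }
            return key + "-enriched";
        }, pool);
    }

    // Synchronous: each element waits for the previous round-trip (~n * 50 ms)
    static List<String> enrichSync(List<String> keys) {
        return keys.stream().map(k -> lookup(k).join()).collect(Collectors.toList());
    }

    // Asynchronous: all requests in flight at once (~1 * 50 ms in total)
    static List<String> enrichAsync(List<String> keys) {
        List<CompletableFuture<String>> inFlight =
                keys.stream().map(AsyncAccessDemo::lookup).collect(Collectors.toList());
        return inFlight.stream().map(CompletableFuture::join).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> keys = List.of("a", "b", "c");
        System.out.println(enrichSync(keys));  // [a-enriched, b-enriched, c-enriched]
        System.out.println(enrichAsync(keys)); // same result, one round-trip
    }
}
```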
  28. The Flink 1.2 way: AsyncFunction • Requirement: a client that supports asynchronous requests • Flink handles the rest: integration of async I/O with the DataStream API; fault tolerance; order of emitted elements; correct time semantics (event/processing time)
  29. The Flink 1.2 way: AsyncFunction • Simple API: /** Trigger async operation for each stream input. */ void asyncInvoke(IN input, AsyncCollector<OUT> collector) throws Exception; • API call: /** Example async function call. */ DataStream<...> result = AsyncDataStream.(un)orderedWait(stream, new MyAsyncFunction(), 1000, TimeUnit.MILLISECONDS, 100);
  30. The Flink 1.2 way: AsyncFunction [diagram: AsyncWaitOperator holding a queue of promises P1–P4, with a separate Emitter thread; element E5 arriving] • AsyncWaitOperator: a queue of “Promises” and a separate thread (the Emitter)
  31. The Flink 1.2 way: AsyncFunction [diagram: E5 entering the AsyncWaitOperator] • Wrap E5 in a “promise” P5 • Put P5 in the queue • Call asyncInvoke(E5, P5)
  32. The Flink 1.2 way: AsyncFunction • asyncInvoke(value, asyncCollector): a user-defined function; value is the input element; asyncCollector is the collector of the result (invoked when the query returns)
  33. The Flink 1.2 way: AsyncFunction • Inside asyncInvoke: Future<String> future = client.query(E5); future.thenAccept((String result) -> { P5.collect( Collections.singleton( new Tuple2<>(E5, result))); });
  35. The Flink 1.2 way: AsyncFunction • Emitter: a separate thread that polls the queue for completed promises (blocking) and emits elements downstream
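Mechanically, the operator in the diagrams boils down to a FIFO queue of promises in arrival order plus an emitter thread blocking on the head of the queue. A compact JDK-only sketch (illustrative names, not Flink code; the poison-pill shutdown is an assumption of this sketch) that reproduces orderedWait-style emission:

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the AsyncWaitOperator mechanics with JDK futures (illustrative, not
// Flink code): processElement enqueues a promise in arrival order and fires the
// request; a separate emitter thread blocks on the head of the queue.
public class AsyncWaitOperatorSim {
    static final CompletableFuture<String> POISON =
            CompletableFuture.completedFuture("__poison__"); // stop marker for the sketch

    final BlockingQueue<CompletableFuture<String>> queue = new LinkedBlockingQueue<>();
    final List<String> output = new ArrayList<>();
    final Thread emitter = new Thread(() -> {
        try {
            while (true) {
                CompletableFuture<String> head = queue.take();
                if (head == POISON) break;
                output.add(head.join()); // block until the head promise completes
            }
        } catch (InterruptedException ignored) { }
    });

    AsyncWaitOperatorSim() { emitter.start(); }

    // "asyncInvoke": wrap the element in a promise, enqueue it, start the request
    void processElement(String element, long latencyMs) {
        CompletableFuture<String> promise = new CompletableFuture<>();
        queue.add(promise);
        CompletableFuture.runAsync(() -> {
            try { Thread.sleep(latencyMs); } catch (InterruptedException ignored) { }
            promise.complete(element + "-done");
        });
    }

    void close() {
        queue.add(POISON); // drained only after all earlier promises were emitted
        try { emitter.join(); } catch (InterruptedException ignored) { }
    }

    public static void main(String[] args) {
        AsyncWaitOperatorSim op = new AsyncWaitOperatorSim();
        op.processElement("E1", 80); // slow request, arrives first
        op.processElement("E2", 10); // fast request, finishes first
        op.close();
        System.out.println(op.output); // [E1-done, E2-done] -- arrival order kept
    }
}
```

Even though E2's request finishes first, the emitter is still blocked on E1's promise at the head of the queue, so arrival order is preserved.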
  36. The Flink 1.2 way: AsyncFunction DataStream<Tuple2<String, String>> result = AsyncDataStream.(un)orderedWait(stream, new MyAsyncFunction(), 1000, TimeUnit.MILLISECONDS, 100); • our AsyncFunction • a timeout: max time until a request is considered failed • capacity: max number of in-flight requests
  38. The Flink 1.2 way: AsyncFunction DataStream<Tuple2<String, String>> result = AsyncDataStream.(un)orderedWait(stream, new MyAsyncFunction(), 1000, TimeUnit.MILLISECONDS, 100); [diagram: elements E1–E4 with promises P1–P4 and the Emitter] Ideally...
  39. The Flink 1.2 way: AsyncFunction DataStream<Tuple2<String, String>> result = AsyncDataStream.unorderedWait(stream, new MyAsyncFunction(), 1000, TimeUnit.MILLISECONDS, 100); [diagram: elements E1–E4 with promises P1–P4 and the Emitter] Realistically... output is ordered based on which request finished first
  40. The Flink 1.2 way: AsyncFunction • unorderedWait: emit results in order of completion • orderedWait: emit results in order of arrival • Always: watermarks never overtake elements, and vice versa
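The two modes differ only in when a completed promise may leave the queue. With manually completed JDK promises the contrast is easy to see (illustrative sketch, not the Flink API):

```java
import java.util.*;
import java.util.concurrent.*;

// orderedWait vs. unorderedWait emission, shown with manually completed JDK
// promises (illustrative sketch, not Flink code).
public class EmissionOrderDemo {
    // unorderedWait-style: emit each result the moment its promise completes
    static List<String> unorderedEmission() {
        CompletableFuture<String> p1 = new CompletableFuture<>(); // E1, arrives first
        CompletableFuture<String> p2 = new CompletableFuture<>(); // E2, arrives second
        List<String> out = new ArrayList<>();
        p1.thenAccept(out::add);
        p2.thenAccept(out::add);
        p2.complete("E2"); // the second request finishes first...
        p1.complete("E1");
        return out;        // ...so it is also emitted first
    }

    // orderedWait-style: join promises in arrival order, whatever the completion order
    static List<String> orderedEmission() {
        CompletableFuture<String> p1 = new CompletableFuture<>();
        CompletableFuture<String> p2 = new CompletableFuture<>();
        p2.complete("E2");
        p1.complete("E1");
        return List.of(p1.join(), p2.join());
    }

    public static void main(String[] args) {
        System.out.println(unorderedEmission()); // [E2, E1]
        System.out.println(orderedEmission());   // [E1, E2]
    }
}
```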
  41. Documentation • ProcessFunction: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/process_function.html • AsyncIO: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/asyncio.html
  42. Thank you! @KLOUBEN_K @ApacheFlink @dataArtisans
  43. One day of hands-on Flink training, one day of conference • Tickets are on sale • Call for Papers is already open • Please visit our website: http://sf.flink-forward.org • Follow us on Twitter: @FlinkForward
  44. We are hiring! data-artisans.com/careers
