
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015



A talk I gave at the first Apache Flink Meetup in Paris on 29 October 2015.

It gives an introduction to Apache Flink's streaming and batch APIs, explains how Flink jobs are deployed, and presents Flink's checkpointing mechanism, which provides exactly-once processing guarantees.



  1. Streaming Data Flow with Apache Flink. Till Rohrmann (trohrmann@apache.org, @stsffap)
  2. Recent History. Project incubation from April '14 (releases v0.5, v0.6, v0.7); top-level project since December '14 (v0.8); v0.9 around April '15. Currently moving towards the 0.10 and 1.0 releases.
  3. What is Flink? Deployment: local (single JVM) or cluster (standalone, YARN). APIs: DataStream API for unbounded data, DataSet API for bounded data. Runtime: a distributed streaming data flow. Libraries: machine learning, graph processing, SQL-like API.
  4. What is Flink? Use cases: streaming topologies (streams, time windows, counts) with low latency; long batch pipelines with high resource utilization; machine learning with iterative algorithms (e.g. factorizing a rating matrix into user and item matrices); graph analysis with mutable state.
  5. Stream Processing. Real-world data is unbounded and is pushed to systems.
  6. Stream Platform Architecture. Sources (server logs, transaction logs, sensor logs) feed Kafka, which gathers, backs up, and offers streams. Flink analyzes and correlates those streams and creates derived streams for downstream systems.
  7. Cornerstones of Flink. Low latency for fast results. High throughput to handle many events per second. Exactly-once guarantees for correct results. Expressive APIs for productivity.
  8. DataStream API (example topology: keyBy into time windows, then sum per window).
  13. DataStream API

      StreamExecutionEnvironment env =
          StreamExecutionEnvironment.getExecutionEnvironment();

      DataStream<String> data = env.fromElements(
          "O Romeo, Romeo! wherefore art thou Romeo?", ...);

      // DataStream windowed WordCount
      DataStream<Tuple2<String, Integer>> counts = data
          .flatMap(new SplitByWhitespace())            // (word, 1)
          .keyBy(0)                                    // [word, [1, 1, ...]] per window
          .timeWindow(Time.of(10, TimeUnit.SECONDS))
          .sum(1);                                     // sum per word per 10-second window

      counts.print();
      env.execute();
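Flink itself is not needed to follow the windowed WordCount semantics: events are keyed by word and bucketed into 10-second tumbling windows, and counts are summed per (window, word). The sketch below illustrates exactly that in plain Java; the class and method names are invented for illustration and are not Flink API.

```java
import java.util.*;

// Plain-Java sketch of tumbling-window word counting (illustrative only).
class WindowedWordCount {
    static final long WINDOW_MILLIS = 10_000;

    // Each event is {timestampMillis (Long), word (String)}. Every event is
    // assigned to the tumbling window containing its timestamp, and counts
    // are summed per word within each window.
    static Map<Long, Map<String, Integer>> count(List<Object[]> events) {
        Map<Long, Map<String, Integer>> result = new TreeMap<>();
        for (Object[] e : events) {
            long ts = (Long) e[0];
            String word = (String) e[1];
            long windowStart = (ts / WINDOW_MILLIS) * WINDOW_MILLIS;
            result.computeIfAbsent(windowStart, w -> new HashMap<>())
                  .merge(word, 1, Integer::sum);
        }
        return result;
    }
}
```

Note how the same word produces separate counts in separate windows, which is the key difference from the batch WordCount later in the deck.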
  21. DataStream API

      public static class SplitByWhitespace
              implements FlatMapFunction<String, Tuple2<String, Integer>> {

          @Override
          public void flatMap(String value,
                              Collector<Tuple2<String, Integer>> out) {
              String[] tokens = value.toLowerCase().split("\\W+");

              for (String token : tokens) {
                  if (token.length() > 0) {
                      out.collect(new Tuple2<>(token, 1));
                  }
              }
          }
      }
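The tokenization above can be checked on the sample sentence in isolation: lowercase the line, split on runs of non-word characters (`\W+`), and drop empty tokens. The helper class below is a standalone re-statement of that logic, not Flink code.

```java
import java.util.*;

// Standalone version of the SplitByWhitespace tokenization logic:
// lowercase, split on runs of non-word characters, drop empty tokens.
class Tokenize {
    static List<String> tokens(String value) {
        List<String> out = new ArrayList<>();
        for (String token : value.toLowerCase().split("\\W+")) {
            if (token.length() > 0) out.add(token);
        }
        return out;
    }
}
```

On "O Romeo, Romeo! wherefore art thou Romeo?" this yields seven tokens, with "romeo" appearing three times.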
  27. Pipelining

      DataStream<String> data = env.fromElements(
          "O Romeo, Romeo! wherefore art thou Romeo?", ...);

      // DataStream WordCount
      DataStream<Tuple2<String, Integer>> counts = data
          .flatMap(new SplitByWhitespace())   // (word, 1)
          .keyBy(0)                           // split stream by word
          .sum(1);                            // sum per word as they arrive

      Topology: Source → Map → Reduce
  28. Pipelining. The complete pipeline (Source → Map → Reduce, with parallel instances S1/S2, M1/M2, R1/R2) is online concurrently. Source and Map can be chained into a single task (S1 · M1), and the data exchange to the Reduce tasks is a pipelined shuffle.
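Task chaining (S1 · M1 on the slides) can be pictured as plain function composition: the map function runs inside the source task, so records pass between them in the same thread with no serialization or network exchange. This is a conceptual sketch with invented names, not Flink's scheduler.

```java
import java.util.*;
import java.util.function.Function;

// Conceptual sketch of task chaining: the map function is applied inside
// the source task's loop instead of in a separate task across an exchange.
class ChainedTask {
    static <A, B> List<B> runChained(List<A> source, Function<A, B> map) {
        List<B> out = new ArrayList<>();
        for (A record : source) {
            out.add(map.apply(record)); // same thread, no exchange in between
        }
        return out;
    }
}
```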
  32. Pipelining. Tasks are deployed across workers; the complete pipeline is online concurrently.
  37. Streaming Fault Tolerance. At most once: no guarantees at all. At least once: ensure that all operators see all events. Exactly once: ensure that all operators see all events and do not perform duplicate updates to operator state. Flink gives you all of these guarantees.
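The difference between the guarantees can be made concrete with a replay scenario: after a failure, at-least-once simply replays unacknowledged events and can double-count, while exactly-once first restores the operator state from the last snapshot and then replays from that point. The sketch below is illustrative only, with invented names; it is not Flink code.

```java
// Illustrative sketch: counting n events with a failure after `failAt`
// events, where `checkpointedAt` events were covered by the last snapshot.
class Replay {
    // At-least-once: replay the events after the checkpoint, but keep the
    // state accumulated before the failure, so some events count twice.
    static long atLeastOnce(int n, int checkpointedAt, int failAt) {
        long count = 0;
        for (int i = 0; i < failAt; i++) count++;          // first attempt
        for (int i = checkpointedAt; i < n; i++) count++;  // replay, state NOT reset
        return count;
    }

    // Exactly-once: restore the counter from the snapshot before replaying,
    // so every event contributes to the state exactly once.
    static long exactlyOnce(int n, int checkpointedAt, int failAt) {
        long count = 0;
        for (int i = 0; i < failAt; i++) count++;          // first attempt
        count = checkpointedAt;                            // restore snapshot
        for (int i = checkpointedAt; i < n; i++) count++;  // replay from checkpoint
        return count;
    }
}
```

With 10 events, a snapshot after 4, and a failure after 7, at-least-once counts 13 (events 4..6 twice) while exactly-once counts 10.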
  38. Distributed Snapshots. Barriers flow through the topology in line with the data; everything that arrives before a barrier is part of the snapshot. Flink guarantees exactly-once processing.
  39. Distributed Snapshots (checkpoint walkthrough). The JobManager (master) sends a "start checkpoint" message to the sources. Each source emits a barrier into its output stream and acknowledges the checkpoint with its current position (offsets 6791, 7252, 5589, 6843 in the example). Once a stateful operator has received the barrier on each of its inputs, it writes a snapshot of its state (s1, s2) to the state backend and acknowledges with a pointer to that state (PTR1, PTR2). When the sinks have received the barrier on each input, they acknowledge the checkpoint (ACK), and the checkpoint is complete. Flink guarantees exactly-once processing.
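The barrier-alignment rule behind this walkthrough can be sketched in plain Java: an operator snapshots its state only once it has received the checkpoint barrier on every input channel. This is an illustrative sketch with invented names, not Flink API, and it simplifies the real protocol (Flink also buffers records that arrive on already-aligned inputs until alignment completes).

```java
import java.util.*;

// Sketch of barrier alignment for one operator: snapshot the state only
// when the barrier has arrived on ALL inputs (illustrative only).
class BarrierAligner {
    private final Set<Integer> pendingInputs;       // inputs still awaiting a barrier
    private final List<Long> snapshots = new ArrayList<>();
    private long state = 0;                         // e.g. a running sum

    BarrierAligner(int numInputs) {
        pendingInputs = new HashSet<>();
        for (int i = 0; i < numInputs; i++) pendingInputs.add(i);
    }

    void onRecord(int input, long value) {
        state += value;                             // normal processing
    }

    // Returns true if this barrier completed the alignment, i.e. the
    // operator wrote a snapshot (and would then ACK with a state pointer).
    boolean onBarrier(int input) {
        pendingInputs.remove(input);
        if (pendingInputs.isEmpty()) {
            snapshots.add(state);                   // "write snapshot of its state"
            return true;
        }
        return false;
    }

    List<Long> snapshots() { return snapshots; }
}
```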
  47. Operator State

      Stateless operators:
          ds.filter(_ != 0)

      System state:
          ds.keyBy(0).window(TumblingTimeWindows.of(5, TimeUnit.SECONDS))

      User-defined state:
          public class CounterSum extends RichReduceFunction<Long> {
              private OperatorState<Long> counter;

              @Override
              public Long reduce(Long v1, Long v2) throws Exception {
                  counter.update(counter.value() + 1);
                  return v1 + v2;
              }

              @Override
              public void open(Configuration config) {
                  counter = getRuntimeContext()
                      .getOperatorState("counter", 0L, false);
              }
          }
  48. Batch on Streaming. The DataStream API (unbounded data) and the DataSet API (bounded data) share one runtime: a distributed streaming data flow. On top sit libraries for machine learning, graph processing, and a SQL-like API.
  49. Batch on Streaming. Run a bounded stream (a data set) on a stream processor.
  50. Batch on Streaming. Infinite streams: stream windows, pipelined data exchange. Finite streams: a global view, pipelined or blocking data exchange. Run a bounded stream (a data set) on a stream processor.
  51. Batch Pipelines. Data exchange is mostly streamed; some operators block (e.g. sort, hash table).
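The streamed-versus-blocking distinction can be sketched directly: a filter can emit each result as the corresponding input record arrives, while a sort must consume its entire input before anything can flow downstream. Illustrative sketch with invented names, not Flink operators:

```java
import java.util.*;

// Sketch: a streaming operator (filter) emits per record; a blocking
// operator (sort) must buffer all input before emitting (illustrative).
class Operators {
    static List<Integer> streamingFilter(Iterator<Integer> in) {
        List<Integer> out = new ArrayList<>();
        while (in.hasNext()) {
            int v = in.next();
            if (v % 2 == 0) out.add(v);   // could be emitted immediately
        }
        return out;
    }

    static List<Integer> blockingSort(Iterator<Integer> in) {
        List<Integer> buffer = new ArrayList<>();
        while (in.hasNext()) buffer.add(in.next());  // must see everything first
        Collections.sort(buffer);                    // only then can results flow
        return buffer;
    }
}
```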
  52. DataSet API

      ExecutionEnvironment env =
          ExecutionEnvironment.getExecutionEnvironment();

      DataSet<String> data = env.fromElements(
          "O Romeo, Romeo! wherefore art thou Romeo?", ...);

      // DataSet WordCount
      DataSet<Tuple2<String, Integer>> counts = data
          .flatMap(new SplitByWhitespace())   // (word, 1)
          .groupBy(0)                         // [word, [1, 1, ...]]
          .sum(1);                            // sum per word over all occurrences

      counts.print();
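Unlike the windowed streaming version, the DataSet WordCount produces one global count per word over the whole bounded input. The same semantics in plain Java, reusing the tokenization rule from slide 21 (the class below is illustrative, not Flink API):

```java
import java.util.*;

// Plain-Java equivalent of the DataSet WordCount semantics: tokenize each
// line, group by word, and sum counts over the entire bounded input.
class BatchWordCount {
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String token : line.toLowerCase().split("\\W+")) {
                if (token.length() > 0) counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```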
  59. Batch-specific optimizations. Cost-based optimizer: the program adapts to changing data sizes. Managed memory: on- and off-heap memory, internal operators (e.g. join or sort) with out-of-core support, and a serialization stack for user types.
  60. Demo Time
  61. Getting Started. Project page: http://flink.apache.org · Quickstarts: Java & Scala API · Docs: programming guides · Get involved: mailing lists, Stack Overflow, IRC, …
  65. Blogs: http://flink.apache.org/blog and http://data-artisans.com/blog · Twitter: @ApacheFlink · Mailing lists: (news|user|dev)@flink.apache.org
  66. Thank You!
