Streaming Dataflow with Apache Flink

1. Streaming Data Flow with Apache Flink · Ufuk Celebi (uce@apache.org) · HUG London · October 15, 2015

2. Recent History Timeline: April '14 (Project Incubation), December '14 (Top Level Project), April '15. Releases: v0.5, v0.6, v0.7, v0.8, v0.9. Currently moving towards the 0.10 and 1.0 releases.

3. What is Flink? [Figure: four use cases] Streaming Topologies: low latency (stream, time window, count) · Long Batch Pipelines: resource utilization · Machine Learning: iterative algorithms (rating matrix = user matrix × item matrix) · Graph Analysis: mutable state

4. Overview Deployment: Local (Single JVM) · Cluster (Standalone, YARN). Runtime: Distributed Streaming Data Flow. APIs: DataStream API (Unbounded Data) · DataSet API (Bounded Data). Libraries: Machine Learning · Graph Processing · SQL-like API

5. Stream Processing Real-world data is unbounded and is pushed to systems. Batch · Streaming

6. Stream Platform Architecture Sources: Server Logs · Trxn Logs · Sensor Logs. Kafka: gather and back up streams, offer streams. Flink: analyze and correlate streams, create derived streams. Sinks: Downstream Systems.

7. Cornerstones of Flink Low Latency for fast results. High Throughput to handle many events per second. Exactly-once guarantees for correct results. Intuitive APIs for productivity.

8. DataStream API

    StreamExecutionEnvironment env = StreamExecutionEnvironment
        .getExecutionEnvironment();

    DataStream<String> data = env.fromElements(
        "O Romeo, Romeo! wherefore art thou Romeo?", ...);

    // DataStream Windowed WordCount
    DataStream<Tuple2<String, Integer>> counts = data
        .flatMap(new SplitByWhitespace())            // (word, 1)
        .keyBy(0)                                    // [word, [1, 1, …]] for 10 seconds
        .timeWindow(Time.of(10, TimeUnit.SECONDS))
        .sum(1);                                     // sum per word per 10 second window

    counts.print();
    env.execute();
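To make the window semantics on this slide concrete, here is a minimal plain-Java sketch of what a 10 second tumbling-window word count computes. It deliberately does not use the Flink API: the class `TumblingWordCount`, the helper type `TimedLine`, and the explicit timestamps are assumptions made for illustration only, and the slide's `SplitByWhitespace` is approximated by a whitespace split.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of a tumbling-window word count (hypothetical, not the Flink API). */
final class TumblingWordCount {

    /** A line of text plus the time (in milliseconds) at which it arrived. */
    static final class TimedLine {
        final long timestampMillis;
        final String text;

        TimedLine(long timestampMillis, String text) {
            this.timestampMillis = timestampMillis;
            this.text = text;
        }
    }

    /** Returns window start -> (word -> count) for fixed, non-overlapping windows. */
    static Map<Long, Map<String, Integer>> count(List<TimedLine> lines, long windowMillis) {
        Map<Long, Map<String, Integer>> windows = new TreeMap<>();
        for (TimedLine line : lines) {
            // Assign the line to the window that contains its timestamp.
            long windowStart = line.timestampMillis
                    - Math.floorMod(line.timestampMillis, windowMillis);
            Map<String, Integer> counts =
                    windows.computeIfAbsent(windowStart, k -> new TreeMap<>());
            // flatMap step: split by whitespace, one (word, 1) per token.
            // keyBy + sum step: add that 1 to the running count of the word.
            for (String word : line.text.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return windows;
    }

    public static void main(String[] args) {
        List<TimedLine> input = Arrays.asList(
                new TimedLine(1_000L, "O Romeo, Romeo!"),
                new TimedLine(9_000L, "wherefore art thou Romeo?"),
                new TimedLine(12_000L, "Romeo?"));
        // The first two lines fall into the window starting at 0 ms,
        // the third into the window starting at 10000 ms.
        System.out.println(count(input, 10_000L));
    }
}
```

Because the split is on whitespace only, punctuation stays attached to the tokens, so "Romeo," and "Romeo!" are counted as different words.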

16. Pipelining Tasks: s1, t1, w1 · s2, t2, w2 (Source, Tokenizer, Window Count). Complete pipeline online concurrently.

17. Pipelining Tasks: s1, t1, w1 · s2, t2, w2 (Source, Tokenizer, Window Count). Complete pipeline online concurrently. Chained tasks.

18. Pipelining Tasks: s1, t1, w1 · s2, t2, w2 (Source, Tokenizer, Window Count). Complete pipeline online concurrently. Chained tasks. Pipelined shuffle.
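The pipelining slides can be sketched in plain Java threads. This is a toy model under stated assumptions, not Flink's runtime: the source and tokenizer are "chained" by running in one thread with direct calls, and the "pipelined shuffle" is modeled as small bounded queues that route each word by hash to one of several counting tasks, all of which are online at the same time. The class name `PipelinedWordCount` and the end-of-stream marker are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Toy model of chained tasks plus a pipelined shuffle (hypothetical, not Flink's runtime). */
final class PipelinedWordCount {
    // End-of-stream marker, compared by identity so it cannot collide with a real word.
    private static final String END = new String("<end>");

    static Map<String, Integer> run(List<String> lines, int parallelism) {
        List<BlockingQueue<String>> channels = new ArrayList<>();
        List<Map<String, Integer>> partialCounts = new ArrayList<>();
        List<Thread> tasks = new ArrayList<>();

        // Counting tasks (w1, w2, ...): each consumes its own channel as records arrive.
        for (int i = 0; i < parallelism; i++) {
            BlockingQueue<String> channel = new ArrayBlockingQueue<>(4);
            Map<String, Integer> counts = new TreeMap<>();
            channels.add(channel);
            partialCounts.add(counts);
            tasks.add(new Thread(() -> {
                try {
                    for (String w = channel.take(); w != END; w = channel.take()) {
                        counts.merge(w, 1, Integer::sum);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
        }

        // Chained source + tokenizer: same thread, no queue between them.
        tasks.add(new Thread(() -> {
            try {
                for (String line : lines) {
                    for (String word : line.trim().split("\\s+")) {
                        if (word.isEmpty()) continue;
                        // Shuffle: the same word always goes to the same counting task.
                        int target = Math.floorMod(word.hashCode(), parallelism);
                        channels.get(target).put(word); // blocks if the consumer is slow
                    }
                }
                for (BlockingQueue<String> channel : channels) channel.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));

        try {
            for (Thread t : tasks) t.start(); // the complete pipeline is online concurrently
            for (Thread t : tasks) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running the pipeline", e);
        }

        // Key sets of the partial results are disjoint because of the hash routing.
        Map<String, Integer> result = new TreeMap<>();
        for (Map<String, Integer> counts : partialCounts) result.putAll(counts);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(java.util.Arrays.asList("to be or", "not to be"), 2));
    }
}
```

The bounded queues mean a record is handed downstream as soon as it is produced, and a slow counting task slows the producer down rather than letting data pile up.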

20. Streaming Fault Tolerance
At Least Once
• Ensure that all operators see all events.
Exactly Once
• Ensure that all operators see all events.
• Do not perform duplicate updates to operator state.

21. Streaming Fault Tolerance
At Least Once
• Ensure that all operators see all events.
Exactly Once
• Ensure that all operators see all events.
• Do not perform duplicate updates to operator state.
Flink guarantees exactly once processing.
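The difference between the two guarantees shows up on recovery. The sketch below is a simplified model, assuming a replayable source and an operator whose state is a running sum; the class `ReplaySketch` and its parameters are hypothetical. In both cases every event is seen, but when the restored state already contains events that lie after the restored source offset, those events are applied twice.

```java
/** Simplified recovery model for a summing operator (hypothetical names). */
final class ReplaySketch {

    /** Operator state after processing events[0..upTo). */
    static long sumOfPrefix(long[] events, int upTo) {
        long sum = 0;
        for (int i = 0; i < upTo; i++) {
            sum += events[i];
        }
        return sum;
    }

    /** Restores the operator state and replays the source from the given offset. */
    static long restoreAndReplay(long[] events, int sourceOffset, long restoredState) {
        long state = restoredState;
        for (int i = sourceOffset; i < events.length; i++) {
            state += events[i];
        }
        return state;
    }

    /** Exactly once: the restored state reflects precisely the events before the offset. */
    static long exactlyOnce(long[] events, int sourceOffset) {
        return restoreAndReplay(events, sourceOffset, sumOfPrefix(events, sourceOffset));
    }

    /**
     * At least once: the restored state already includes `extra` events past the
     * offset, so replaying from the offset applies those events a second time.
     */
    static long atLeastOnce(long[] events, int sourceOffset, int extra) {
        return restoreAndReplay(events, sourceOffset,
                sumOfPrefix(events, sourceOffset + extra));
    }

    public static void main(String[] args) {
        long[] events = {1, 2, 3, 4, 5};
        System.out.println("exactly once:  " + exactlyOnce(events, 2));
        System.out.println("at least once: " + atLeastOnce(events, 2, 1));
    }
}
```

With the events 1 to 5 and a restore at offset 2, the exactly-once path yields the true total of 15, while the at-least-once path with one extra event in the state counts the value 3 twice and yields 18.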

22. Distributed Snapshots Barriers flow through the topology in line with data. Flink guarantees exactly once processing. Part of snapshot

32. Distributed Snapshots Flink guarantees exactly once processing. JobManager Master State Backend Checkpoint Data Source 1: State 1: Source 2: State 2: Source 3: Sink 1: Source 4: Sink 2: Offset: 6791 Offset: 7252 Offset: 5589 Offset: 6843

33. Distributed Snapshots Flink guarantees exactly once processing. JobManager Master State Backend Checkpoint Data Source 1: State 1: Source 2: State 2: Source 3: Sink 1: Source 4: Sink 2: Offset: 6791 Offset: 7252 Offset: 5589 Offset: 6843 Start Checkpoint Message

34. Distributed Snapshots Flink guarantees exactly once processing. JobManager Master State Backend Checkpoint Data Source 1: 6791 State 1: Source 2: 7252 State 2: Source 3: 5589 Sink 1: Source 4: 6843 Sink 2: Emit Barriers Acknowledge with Position

35. Distributed Snapshots Flink guarantees exactly once processing. JobManager Master State Backend Checkpoint Data Source 1: 6791 State 1: Source 2: 7252 State 2: Source 3: 5589 Sink 1: Source 4: 6843 Sink 2: Received barrier at each input

36. Distributed Snapshots Flink guarantees exactly once processing. JobManager Master State Backend Checkpoint Data Source 1: 6791 State 1: Source 2: 7252 State 2: Source 3: 5589 Sink 1: Source 4: 6843 Sink 2: s1 Write snapshot of its state Received barrier at each input

37. Distributed Snapshots Flink guarantees exactly once processing. JobManager Master State Backend Checkpoint Data Source 1: 6791 State 1: PTR1 Source 2: 7252 State 2: PTR2 Source 3: 5589 Sink 1: Source 4: 6843 Sink 2: s1 Acknowledge with pointer to state s2

38. Distributed Snapshots Flink guarantees exactly once processing. JobManager Master State Backend Checkpoint Data Source 1: 6791 State 1: PTR1 Source 2: 7252 State 2: PTR2 Source 3: 5589 Sink 1: ACK Source 4: 6843 Sink 2: ACK s1 s2 Acknowledge Checkpoint Received barrier at each input

39. Distributed Snapshots Flink guarantees exactly once processing. JobManager Master State Backend Checkpoint Data Source 1: 6791 State 1: PTR1 Source 2: 7252 State 2: PTR2 Source 3: 5589 Sink 1: ACK Source 4: 6843 Sink 2: ACK s1 s2
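The "received barrier at each input, then write snapshot of its state" step of the snapshot slides can be sketched for a single operator. This is a minimal model and not Flink's implementation: the class `AligningSumOperator` is hypothetical, its state is a running sum, and the snapshot is kept in a list rather than written to a state backend. It assumes that records arriving on an input whose barrier has already been seen are held back until the barrier has arrived on every input, so the snapshot contains only pre-barrier records.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Minimal model of barrier alignment for one operator (hypothetical, not Flink code). */
final class AligningSumOperator {
    private final boolean[] barrierSeen;                      // one flag per input channel
    private final Deque<Long> heldBack = new ArrayDeque<>();  // post-barrier records
    private final List<Long> snapshots = new ArrayList<>();   // stand-in for a state backend
    private long sum;                                         // the operator state

    AligningSumOperator(int numInputs) {
        this.barrierSeen = new boolean[numInputs];
    }

    /** A data record arrives on the given input channel. */
    void onRecord(int input, long value) {
        if (barrierSeen[input]) {
            // This record follows the barrier, so it belongs to the next checkpoint.
            heldBack.add(value);
        } else {
            sum += value;
        }
    }

    /** A checkpoint barrier arrives on the given input channel. */
    void onBarrier(int input) {
        barrierSeen[input] = true;
        for (boolean seen : barrierSeen) {
            if (!seen) {
                return; // still waiting for the barrier on another input
            }
        }
        // Barrier received at each input: snapshot the state, then resume.
        snapshots.add(sum);
        Arrays.fill(barrierSeen, false);
        while (!heldBack.isEmpty()) {
            sum += heldBack.poll();
        }
    }

    long sum() {
        return sum;
    }

    List<Long> snapshots() {
        return snapshots;
    }

    public static void main(String[] args) {
        AligningSumOperator op = new AligningSumOperator(2);
        op.onRecord(0, 1);
        op.onBarrier(0);
        op.onRecord(0, 2);   // held back until input 1 delivers its barrier
        op.onRecord(1, 10);
        op.onBarrier(1);
        System.out.println("snapshots=" + op.snapshots() + " sum=" + op.sum());
    }
}
```

In the real system the operator would also forward the barrier downstream and acknowledge the checkpoint to the JobManager with a pointer to the stored state, as the slides show; those steps are omitted here.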

40. Operator State
User-defined state
• Flink's transformations are long-running operators
• Feel free to keep objects around
• Hooks to include them in the system's checkpoints
Windowed streams
• Time, count, and data-driven windows
• Managed by the system
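A rough sketch of the "keep objects around, with hooks into the checkpoint" idea follows. The interface `SnapshotHooks` and the operator `CountPerKey` are hypothetical names invented for this example and are not Flink's actual interfaces; the point is only that a long-running operator can hold an ordinary object as state and expose two hooks, one to copy the state out and one to put it back.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical checkpoint hooks; Flink's real interfaces differ. */
interface SnapshotHooks<S> {
    /** Called when a checkpoint is taken: return a copy of the current state. */
    S snapshotState();

    /** Called on recovery: replace the current state with a checkpointed copy. */
    void restoreState(S state);
}

/** A long-running operator that keeps a plain map around as user-defined state. */
final class CountPerKey implements SnapshotHooks<Map<String, Long>> {
    private Map<String, Long> counts = new HashMap<>();

    /** Processes one event by incrementing the count of its key. */
    void process(String key) {
        counts.merge(key, 1L, Long::sum);
    }

    long get(String key) {
        return counts.getOrDefault(key, 0L);
    }

    @Override
    public Map<String, Long> snapshotState() {
        // Defensive copy, so later updates do not leak into the snapshot.
        return new HashMap<>(counts);
    }

    @Override
    public void restoreState(Map<String, Long> state) {
        counts = new HashMap<>(state);
    }
}
```

A possible use: call `snapshotState()` when a checkpoint is taken, keep processing, and after a failure call `restoreState(...)` with the saved copy so that the counts return to their values at the checkpoint.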

41. Batch on Streaming Runtime: Distributed Streaming Data Flow. APIs: DataStream API (Unbounded Data) · DataSet API (Bounded Data). Libraries: Machine Learning · Graph Processing · SQL-like API

42. Batch on Streaming Run a bounded stream (data set) on a stream processor. Bounded data set · Unbounded data stream

43. Batch on Streaming Run a bounded stream (data set) on a stream processor.
Infinite Streams: Stream Windows · Pipelined Data Exchange
Finite Streams: Global View · Pipelined or Blocking Data Exchange

44. Batch Pipelines

45. Batch Pipelines Data exchange is mostly streamed

46. Batch Pipelines Data exchange is mostly streamed. Some operators block (e.g. sort, hash table).
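Why some operators stream while others block can be shown with plain iterators. In this sketch, which uses hypothetical names and is not Flink code, a map-style operator produces each output from a single input element, whereas a sort has to pull its whole input before it can emit its first result. The `CountingSource` wrapper exists only to make that difference observable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

/** Pipelined versus blocking operators on iterators (illustrative names only). */
final class BlockingVsPipelined {

    /** Wraps an iterator and counts how many elements have been pulled from it. */
    static final class CountingSource implements Iterator<Integer> {
        private final Iterator<Integer> delegate;
        int pulled;

        CountingSource(Iterator<Integer> delegate) {
            this.delegate = delegate;
        }

        @Override
        public boolean hasNext() {
            return delegate.hasNext();
        }

        @Override
        public Integer next() {
            pulled++;
            return delegate.next();
        }
    }

    /** Pipelined operator: each output needs exactly one input element. */
    static Iterator<Integer> doubled(Iterator<Integer> input) {
        return new Iterator<Integer>() {
            @Override
            public boolean hasNext() {
                return input.hasNext();
            }

            @Override
            public Integer next() {
                return input.next() * 2;
            }
        };
    }

    /** Blocking operator: consumes the whole input before emitting anything. */
    static Iterator<Integer> sorted(Iterator<Integer> input) {
        List<Integer> all = new ArrayList<>();
        while (input.hasNext()) {
            all.add(input.next());
        }
        Collections.sort(all);
        return all.iterator();
    }

    public static void main(String[] args) {
        CountingSource a = new CountingSource(Arrays.asList(3, 1, 2).iterator());
        Iterator<Integer> mapped = doubled(a);
        System.out.println(mapped.next() + " after pulling " + a.pulled + " element(s)");

        CountingSource b = new CountingSource(Arrays.asList(3, 1, 2).iterator());
        Iterator<Integer> ordered = sorted(b);
        System.out.println(ordered.next() + " after pulling " + b.pulled + " element(s)");
    }
}
```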

47. DataSet API

    ExecutionEnvironment env = ExecutionEnvironment
        .getExecutionEnvironment();

    DataSet<String> data = env.fromElements(
        "O Romeo, Romeo! wherefore art thou Romeo?", ...);

    // DataSet WordCount
    DataSet<Tuple2<String, Integer>> counts = data
        .flatMap(new SplitByWhitespace())   // (word, 1)
        .groupBy(0)                         // [word, [1, 1, …]]
        .sum(1);                            // sum per word for all occurrences

    counts.print();

48. DataStream API

    ExecutionEnvironment env = ExecutionEnvironment
        .getExecutionEnvironment();

    DataSet<String> data = env.fromElements(
        "O Romeo, Romeo! wherefore art thou Romeo?", ...);

    // DataSet WordCount
    DataSet<Tuple2<String, Integer>> counts = data
        .flatMap(new SplitByWhitespace())   // (word, 1)
        .groupBy(0)                         // [word, [1, 1, …]]
        .sum(1);                            // sum per word for all occurrences

    counts.print();

54. Batch-specific optimizations
Managed memory
• On- and off-heap memory
• Internal operators (e.g. join or sort) with out-of-core support
• Serialization stack for user-types
Cost-based optimizer
• Program adapts to changing data size
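The managed-memory bullets can be illustrated with a small sketch. It is not Flink's memory manager: the class `OffHeapRecords` and its fixed (key, value) record layout are assumptions chosen for brevity. Records are serialized into a fixed-size off-heap buffer instead of being kept as Java objects, and the caller is told when the budget is exhausted, which is the point where an out-of-core operator would spill to disk.

```java
import java.nio.ByteBuffer;

/** Fixed-budget, off-heap store of (int key, int value) records (illustrative only). */
final class OffHeapRecords {
    private static final int RECORD_BYTES = 2 * Integer.BYTES;

    private final ByteBuffer memory;
    private int count;

    OffHeapRecords(int capacityInRecords) {
        // allocateDirect reserves memory outside the Java heap.
        this.memory = ByteBuffer.allocateDirect(capacityInRecords * RECORD_BYTES);
    }

    /** Serializes one record; returns false when the memory budget is used up. */
    boolean add(int key, int value) {
        if (memory.remaining() < RECORD_BYTES) {
            return false; // an out-of-core operator would spill to disk here
        }
        memory.putInt(key).putInt(value);
        count++;
        return true;
    }

    /** Scans the serialized records without creating an object per record. */
    long sumForKey(int key) {
        long sum = 0;
        for (int i = 0; i < count; i++) {
            int offset = i * RECORD_BYTES;
            if (memory.getInt(offset) == key) {
                sum += memory.getInt(offset + Integer.BYTES);
            }
        }
        return sum;
    }

    int size() {
        return count;
    }

    public static void main(String[] args) {
        OffHeapRecords records = new OffHeapRecords(2);
        System.out.println(records.add(1, 10)); // fits
        System.out.println(records.add(2, 5));  // fits
        System.out.println(records.add(1, 7));  // budget exhausted
        System.out.println(records.sumForKey(1));
    }
}
```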

55. Getting Started Project Page: http://flink.apache.org

56. Getting Started Project Page: http://flink.apache.org Quickstarts: Java & Scala API

57. Getting Started Project Page: http://flink.apache.org Docs: Programming Guides

58. Getting Started Project Page: http://flink.apache.org Get Involved: Mailing Lists, Stack Overflow, IRC, …

59. Apache Flink Blogs: http://flink.apache.org/blog · http://data-artisans.com/blog Twitter: @ApacheFlink Mailing lists: (news|user|dev)@flink.apache.org
