Introduction to Kafka Streams


Kafka Streams is a new stream processing library natively integrated with Kafka. It has a very low barrier to entry, is easy to operationalize, and offers a natural DSL for writing stream processing applications. As such it is the most convenient yet scalable option for analyzing, transforming, or otherwise processing data that is backed by Kafka. We will provide the audience with an overview of Kafka Streams, including its design and API, typical use cases, code examples, and an outlook on its upcoming roadmap. We will also compare Kafka Streams' lightweight library approach with heavier, framework-based tools such as Spark Streaming or Storm, which require you to understand and operate a whole different infrastructure for processing real-time data in Kafka.


Speaker notes
  • Thank you.
  • Well, stream processing has become widely popular today. Unlike Hadoop- or Spark-style batch processing, which takes a bounded data set and starts processing only once the data set is complete (often as an ETL step, possibly much later than the data was originally generated), stream processing is a real-time, continuous process over unbounded data streams, in which each step usually handles a small set of records, or even one record at a time. And today, a common place to store these data streams is Kafka.
  • Stream processing is a fundamental complement to capturing streams of data.
  • This kind of run-as-a-service operational pattern comes from the Hadoop community.
  • We think there should be an even better solution.
  • No extra dependency, no enforced operational cost.

    In addition, it should support
  • Again, in the implementation, such changelog streams should be compactable.
  • Take all the organization's data and put it into a central place for real-time subscription.

    Data integration, replication, real-time stream processing.
  • WAL
  • Streaming on Message Pipes
  • Batching: wait for all the data to be available.

    Reasoning about time is essential for dealing with unbounded, unordered data of varying event-time skew.

    Not all use cases care about event times (and if yours doesn't, hooray! Your life is easier), but many do: billing, monitoring, anomaly detection.
  • Talk about stream synchronization
  • Introduction to Kafka Streams

    1. Kafka Streams: Stream Processing Made Simple with Kafka. Guozhang Wang, Hadoop Summit, June 28, 2016
    2. What is NOT Stream Processing?
    3. Stream Processing isn't (necessarily) • Transient, approximate, lossy… • … something for which you must have batch processing as a safety net
    8. Stream Processing • A different programming paradigm • … that brings computation to unbounded data • … with tradeoffs between latency / cost / correctness
    9. Why Kafka in Stream Processing?
    10. Kafka: Real-time Platforms • Persistent Buffering • Logical Ordering • Scalable “source-of-truth”
    11. Stream Processing with Kafka
    12. Stream Processing with Kafka • Option I: Do It Yourself!
    13. Stream Processing with Kafka • Option I: Do It Yourself!

        while (isRunning) {
          // read some messages from Kafka
          inputMessages = consumer.poll();
          // do some processing…
          // send output messages back to Kafka
          producer.send(outputMessages);
        }
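    For reference, a fleshed-out version of this loop, as a minimal sketch against the plain Java consumer and producer clients; the topic names, group id, and the upper-casing step are illustrative, not from the deck:

        import java.util.Collections;
        import java.util.Properties;
        import org.apache.kafka.clients.consumer.ConsumerRecord;
        import org.apache.kafka.clients.consumer.KafkaConsumer;
        import org.apache.kafka.clients.producer.KafkaProducer;
        import org.apache.kafka.clients.producer.ProducerRecord;

        public class DiyStreamProcessor {
          public static void main(String[] args) {
            Properties cons = new Properties();
            cons.put("bootstrap.servers", "broker1:9092");   // illustrative
            cons.put("group.id", "diy-processor");           // illustrative
            cons.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            cons.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            Properties prod = new Properties();
            prod.put("bootstrap.servers", "broker1:9092");
            prod.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            prod.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cons);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(prod)) {
              consumer.subscribe(Collections.singletonList("input-topic"));
              while (true) {
                // read some messages from Kafka
                for (ConsumerRecord<String, String> record : consumer.poll(100)) {
                  // "do some processing": here, just upper-case the value
                  String result = record.value().toUpperCase();
                  // send output messages back to Kafka
                  producer.send(new ProducerRecord<>("output-topic", record.key(), result));
                }
              }
            }
          }
        }

    Note what even this sketch glosses over: offset commits are left to auto-commit, a rebalance mid-batch can duplicate output, and there is no state, windowing, or reprocessing story. That is the point of slide 15.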
    15. DIY Stream Processing is Hard • Ordering • Partitioning & Scalability • Fault tolerance • State Management • Time, Window & Out-of-order Data • Re-processing
    16. Stream Processing with Kafka • Option I: Do It Yourself! • Option II: full-fledged stream processing system • Storm, Spark, Flink, Samza, …
    17. MapReduce Heritage? • Config Management • Resource Management • Configuration • etc.
    18. MapReduce Heritage? • Config Management • Resource Management • Deployment • etc.
    19. MapReduce Heritage? • Config Management • Resource Management • Deployment • etc. Can I just use my own?!
    20. Stream Processing with Kafka • Option I: Do It Yourself! • Option II: full-fledged stream processing system • Option III: lightweight stream processing library
    21. Kafka Streams • In Apache Kafka since v0.10, May 2016 • Powerful yet easy-to-use stream processing library • Event-at-a-time, Stateful • Windowing with out-of-order handling • Highly scalable, distributed, fault tolerant • and more…
    22. Anywhere, anytime. Ok. Ok. Ok. Ok.
    23. Anywhere, anytime:

        <dependency>
          <groupId>org.apache.kafka</groupId>
          <artifactId>kafka-streams</artifactId>
          <version>0.10.0.0</version>
        </dependency>

    24. Anywhere, anytime: War File / Rsync / Puppet/Chef / YARN / Mesos / Docker / Kubernetes. Very Uncool vs. Very Cool
    25. Simple is Beautiful
    26. Kafka Streams DSL

        public static void main(String[] args) {
          // specify the processing topology by first reading in a stream from a topic
          KStream<String, String> words = builder.stream("topic1");
          // count the words in this stream as an aggregated table
          KTable<String, Long> counts = words.countByKey("Counts");
          // write the result table to a new topic
          counts.to("topic2");
          // create a stream processing instance and start running it
          KafkaStreams streams = new KafkaStreams(builder, config);
          streams.start();
        }
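    For reference, a self-contained version of this word-count example; a minimal sketch against the 0.10.0.0 API shown in the deck, filling in the builder and config construction that the slide omits (application id and broker address are illustrative):

        import java.util.Properties;
        import org.apache.kafka.common.serialization.Serdes;
        import org.apache.kafka.streams.KafkaStreams;
        import org.apache.kafka.streams.StreamsConfig;
        import org.apache.kafka.streams.kstream.KStream;
        import org.apache.kafka.streams.kstream.KStreamBuilder;
        import org.apache.kafka.streams.kstream.KTable;

        public class WordCountApp {
          public static void main(String[] args) {
            Properties cfg = new Properties();
            cfg.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-app");    // illustrative
            cfg.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");  // illustrative
            // default serdes for the String-keyed, String-valued input
            cfg.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
            cfg.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

            KStreamBuilder builder = new KStreamBuilder();
            // read a stream from a topic
            KStream<String, String> words = builder.stream("topic1");
            // count the words in this stream as an aggregated table
            KTable<String, Long> counts = words.countByKey("Counts");
            // write the result table to a new topic, with an explicit Long value serde
            counts.to(Serdes.String(), Serdes.Long(), "topic2");

            KafkaStreams streams = new KafkaStreams(builder, new StreamsConfig(cfg));
            streams.start();
          }
        }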
    32. Native Kafka Integration

        Properties cfg = new Properties();
        cfg.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
        cfg.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        cfg.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        cfg.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");
        cfg.put(KafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "registry:8081");
        StreamsConfig config = new StreamsConfig(cfg);
        …
        KafkaStreams streams = new KafkaStreams(builder, config);
    34. API, coding | “Full stack” evaluation | Operations, debugging, …
    35. API, coding | “Full stack” evaluation | Operations, debugging, … Simple is Beautiful
    36. Key Idea: Outsource hard problems to Kafka!
    37. Kafka Concepts: the Log (figure: a log of messages with increasing offsets; a producer writes at the tail while Consumer1 reads at offset 7 and Consumer2 reads at offset 10)
    38. Kafka Concepts: the Log (figure: Topic 1 and Topic 2 split into partitions across brokers, with producers writing and consumers reading)
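    The figure's point, that each consumer tracks its own position in the log independently, can be reproduced with the plain consumer API; a sketch with illustrative topic and group names:

        import java.util.Arrays;
        import java.util.Properties;
        import org.apache.kafka.clients.consumer.KafkaConsumer;
        import org.apache.kafka.common.TopicPartition;

        public class OffsetDemo {
          static KafkaConsumer<String, String> newConsumer(String groupId) {
            Properties cfg = new Properties();
            cfg.put("bootstrap.servers", "broker1:9092");
            cfg.put("group.id", groupId);
            cfg.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            cfg.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            return new KafkaConsumer<>(cfg);
          }

          public static void main(String[] args) {
            TopicPartition p0 = new TopicPartition("topic1", 0);

            // two consumers read the same partition at independent offsets
            KafkaConsumer<String, String> c1 = newConsumer("group1");
            c1.assign(Arrays.asList(p0));
            c1.seek(p0, 7);    // Consumer1 reads from offset 7

            KafkaConsumer<String, String> c2 = newConsumer("group2");
            c2.assign(Arrays.asList(p0));
            c2.seek(p0, 10);   // Consumer2 reads from offset 10

            System.out.println(c1.poll(100).count() + " / " + c2.poll(100).count());
          }
        }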
    39. Kafka Streams: Key Concepts
    40. Streams and Records (figure: a stream is a sequence of key-value records)
    41. Processor Topology (figure: a stream)
    42. Processor Topology (figure: stream processors connected by streams)
    43. Processor Topology

        KStream<..> stream1 = builder.stream("topic1");
        KStream<..> stream2 = builder.stream("topic2");
        KStream<..> joined = stream1.leftJoin(stream2, ...);
        KTable<..> aggregated = joined.aggregateByKey(...);
        aggregated.to("topic3");
    48. Processor Topology: Source Processors (the builder.stream(…) calls) and Sink Processors (the aggregated.to(…) call)
    49. Processor Topology

        KStream<..> stream1 = builder.stream("topic1");
        KStream<..> stream2 = builder.stream("topic2");
        KStream<..> joined = stream1.leftJoin(stream2, ...);
        KTable<..> aggregated = joined.aggregateByKey(...);
        aggregated.to("topic3");

        builder.addSource("Source1", "topic1")
               .addSource("Source2", "topic2")
               .addProcessor("Join", MyJoin::new, "Source1", "Source2")
               .addProcessor("Aggregate", MyAggregate::new, "Join")
               .addStateStore(Stores.persistent().build(), "Aggregate")
               .addSink("Sink", "topic3", "Aggregate")

    50. Processor Topology

        builder.addSource("Source1", "topic1")
               .addSource("Source2", "topic2")
               .addProcessor("Join", MyJoin::new, "Source1", "Source2")
               .addProcessor("Aggregate", MyAggregate::new, "Join")
               .addStateStore(Stores.persistent().build(), "Aggregate")
               .addSink("Sink", "topic3", "Aggregate")
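    The MyJoin::new and MyAggregate::new arguments above are processor suppliers. Here is a minimal sketch of what a processor such as MyAggregate could look like against the low-level Processor API of this release; the store name "Counts" and the per-key counting logic are illustrative, and the store must be registered under that name via addStateStore:

        import org.apache.kafka.streams.processor.Processor;
        import org.apache.kafka.streams.processor.ProcessorContext;
        import org.apache.kafka.streams.state.KeyValueStore;

        public class MyAggregate implements Processor<String, String> {
          private ProcessorContext context;
          private KeyValueStore<String, Long> store;

          @Override
          @SuppressWarnings("unchecked")
          public void init(ProcessorContext context) {
            this.context = context;
            // look up the state store attached to this processor
            this.store = (KeyValueStore<String, Long>) context.getStateStore("Counts");
          }

          @Override
          public void process(String key, String value) {
            // count records per key; the store is backed by a changelog topic in Kafka
            Long count = store.get(key);
            store.put(key, count == null ? 1L : count + 1);
            context.forward(key, value);   // pass the record downstream
          }

          @Override
          public void punctuate(long timestamp) { }   // no scheduled work

          @Override
          public void close() { }
        }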
    53. Processor Topology (figure: the topology running inside a Kafka Streams instance, connected to Kafka)
    54. Processor Topology

        …
        sink1.to("topic1");
        source1 = builder.table("topic1");
        source2 = sink1.through("topic2");
        …
    58. Processor Topology (same code; figure: the topology splits into Sub-Topologies at the intermediate topics)
    59.–62. Processor Topology (figures: the topology mapped onto Kafka Streams instances and Kafka topics)
    63.–67. Stream Partitions and Tasks (figures: Kafka Topic A and Kafka Topic B each have partitions P1 and P2; the processor topology is instantiated as Task1 and Task2, one task per group of co-partitioned input partitions)
    68.–77. Stream Threads (figures: Task1 and Task2 first run inside a single instance MyApp.1; as MyApp.2 and MyApp.3 start, tasks migrate to the new instances; finally Thread1 and Thread2 run Task1–Task4, two tasks per thread)
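    Scaling in this model is just configuration plus more instances; a minimal sketch, assuming the num.stream.threads setting (application id and broker address are illustrative):

        import java.util.Properties;
        import org.apache.kafka.streams.StreamsConfig;

        public class ScalingConfig {
          public static Properties config() {
            Properties cfg = new Properties();
            cfg.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
            cfg.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
            // two stream threads in this JVM; starting the same application on
            // another machine adds more threads, and tasks rebalance automatically
            cfg.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, "2");
            return cfg;
          }
        }

    No scheduler or resource manager is involved: instances find each other through the shared application.id, which is the "anywhere, anytime" claim from earlier in the deck.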
    78. Stream Processing Hard Parts • Ordering • Partitioning & Scalability • Fault tolerance • State Management • Time, Window & Out-of-order Data • Re-processing
    79. States in Stream Processing • Stateless: filter, map • Stateful: join, aggregate
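    In DSL terms the split looks like this; a small sketch with illustrative topic and store names:

        import org.apache.kafka.streams.kstream.KStream;
        import org.apache.kafka.streams.kstream.KStreamBuilder;
        import org.apache.kafka.streams.kstream.KTable;

        public class StatelessVsStateful {
          public static void main(String[] args) {
            KStreamBuilder builder = new KStreamBuilder();
            KStream<String, String> words = builder.stream("topic1");

            // stateless: filter and map look at one record at a time, keeping no state
            KStream<String, String> longWords =
                words.filter((key, word) -> word.length() > 3)
                     .mapValues(word -> word.toLowerCase());

            // stateful: counting must remember what it has seen, so it is backed
            // by a state store (and that store's changelog topic)
            KTable<String, Long> counts = longWords.countByKey("LongWordCounts");
          }
        }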
    81. States in Stream Processing

        KStream<..> stream1 = builder.stream("topic1");
        KStream<..> stream2 = builder.stream("topic2");
        KStream<..> joined = stream1.leftJoin(stream2, ...);
        KTable<..> aggregated = joined.aggregateByKey(...);   // State
        aggregated.to("topic3");

    82. States in Stream Processing

        builder.addSource("Source1", "topic1")
               .addSource("Source2", "topic2")
               .addProcessor("Join", MyJoin::new, "Source1", "Source2")
               .addProcessor("Aggregate", MyAggregate::new, "Join")
               .addStateStore(Stores.persistent().build(), "Aggregate")   // State
               .addSink("Sink", "topic3", "Aggregate")

    83. States in Stream Processing (figure: Task1 and Task2, each with its own local State)
    84. It’s all about Time • Event-time (when an event is created) • Processing-time (when an event is processed)
    85. Out-of-Order (figure: Star Wars episodes in event-time order 1 2 3 4 5 6 7 vs. processing-time order 1999 2002 2005 1997 1980 1983 2015: The Phantom Menace, Attack of the Clones, Revenge of the Sith, A New Hope, The Empire Strikes Back, Return of the Jedi, The Force Awakens)
    86.–89. Timestamp Extractor

        // processing-time
        public long extract(ConsumerRecord<Object, Object> record) {
          return System.currentTimeMillis();
        }

        // event-time
        public long extract(ConsumerRecord<Object, Object> record) {
          return record.timestamp();
        }

        // event-time, embedded in the message payload
        public long extract(ConsumerRecord<Object, Object> record) {
          return ((JsonNode) record.value()).get("timestamp").longValue();
        }
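    Wiring an extractor in is one config entry; a sketch, assuming the TimestampExtractor interface and config key of this release:

        import org.apache.kafka.clients.consumer.ConsumerRecord;
        import org.apache.kafka.streams.processor.TimestampExtractor;

        // event-time extractor: use the timestamp the producer embedded in the record
        public class RecordTimestampExtractor implements TimestampExtractor {
          @Override
          public long extract(ConsumerRecord<Object, Object> record) {
            return record.timestamp();
          }
        }

    and, when building the config:

        cfg.put(StreamsConfig.TIMESTAMP_EXTRACTOR_CLASS_CONFIG,
                RecordTimestampExtractor.class.getName());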
    90.–96. Windowing (figures: windows advancing over time t)
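    In this release's DSL, a windowed count looks roughly like the following; the window name and one-minute size are illustrative:

        import org.apache.kafka.common.serialization.Serdes;
        import org.apache.kafka.streams.kstream.KStream;
        import org.apache.kafka.streams.kstream.KStreamBuilder;
        import org.apache.kafka.streams.kstream.KTable;
        import org.apache.kafka.streams.kstream.TimeWindows;
        import org.apache.kafka.streams.kstream.Windowed;

        public class WindowedCount {
          public static void main(String[] args) {
            KStreamBuilder builder = new KStreamBuilder();
            KStream<String, String> words = builder.stream("topic1");

            // count per key over tumbling 1-minute windows; a late-arriving record
            // updates the window it belongs to by event-time
            KTable<Windowed<String>, Long> counts =
                words.countByKey(TimeWindows.of("Counts1m", 60 * 1000L), Serdes.String());
          }
        }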
    97. Stream Processing Hard Parts • Ordering • Partitioning & Scalability • Fault tolerance • State Management • Time, Window & Out-of-order Data • Re-processing
    98. Stream vs. Table?

        KStream<..> stream1 = builder.stream("topic1");
        KStream<..> stream2 = builder.stream("topic2");
        KStream<..> joined = stream1.leftJoin(stream2, ...);
        KTable<..> aggregated = joined.aggregateByKey(...);   // State
        aggregated.to("topic3");

    99. Tables ≈ Streams
    103. The Stream-Table Duality • A stream is a changelog of a table • A table is a materialized view of a stream at a point in time • Example: change data capture (CDC) of databases
    104. KStream = interprets data as a record stream (think: “append-only”). KTable = interprets data as a changelog stream (a continuously updated materialized view)
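    The same Kafka topic can be read either way; a sketch with illustrative topic names:

        import org.apache.kafka.streams.kstream.KStream;
        import org.apache.kafka.streams.kstream.KStreamBuilder;
        import org.apache.kafka.streams.kstream.KTable;

        public class StreamOrTable {
          public static void main(String[] args) {
            KStreamBuilder builder = new KStreamBuilder();

            // record stream: every record is an independent fact ("Alice bought eggs")
            KStream<String, String> purchases = builder.stream("purchases");

            // changelog stream: a new record for a key replaces the previous one
            // ("Alice is now at Microsoft" supersedes "Alice is now at LinkedIn")
            KTable<String, String> profiles = builder.table("profiles");
          }
        }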
    105.–107. KStream vs. KTable (figures: a KStream of user purchase history with records alice:eggs, bob:lettuce, alice:milk, and a KTable of user employment profiles with records alice:lnkd, bob:googl, alice:msft; as records arrive over time the stream means “Alice bought eggs and milk.”, while in the table “Alice is now at Microsoft.” replaces “Alice is now at LinkedIn.”)
    108.–109. Aggregation semantics (figures: on the records alice:2, bob:10, alice:3, KStream.aggregate() treats the two Alice records as independent facts, yielding (key: Alice, value: 2+3), while KTable.aggregate() treats the 3 as an update replacing the 2, yielding (key: Alice, value: 3))
    110. Operators (figure: KStream and KTable both support map(), filter(), join(), …; reduce() and aggregate() produce KTables; toStream() turns a KTable back into a KStream)
    111.–114. Updates Propagation in KTable (figures: updates flow from KStream stream1 and stream2 through the joined KStream into the aggregated KTable and its State)
    115. Stream Processing Hard Parts • Ordering • Partitioning & Scalability • Fault tolerance • State Management • Time, Window & Out-of-order Data • Re-processing
    116. Remember?
    117.–119. Fault Tolerance (figures: each Process keeps local State backed by a Changelog topic in Kafka; on failure, the rebalance protocol moves the task to another Process, which restores its State from the changelog)
    124. Stream Processing Hard Parts • Ordering • Partitioning & Scalability • Fault tolerance • State Management • Time, Window & Out-of-order Data • Re-processing
    125. Stream Processing Hard Parts • Ordering • Partitioning & Scalability • Fault tolerance • State Management • Time, Window & Out-of-order Data • Re-processing. Simple is Beautiful
    126. Ongoing Work (0.10+) • Beyond Java APIs: SQL support, Python client, etc. • End-to-End Semantics (exactly-once) • Queryable States • … and more
    127. Queryable States (figure: real-time analytics querying the local State directly)

        select Count(*), Sum(*) from “MyAgg” where windowId > now() - 10;

    128. But how to get data in / out of Kafka?
    133. Take-aways • Stream Processing: a new programming paradigm
    134. Take-aways • Stream Processing: a new programming paradigm • Kafka Streams: stream processing made easy
    135. Take-aways • Stream Processing: a new programming paradigm • Kafka Streams: stream processing made easy. THANKS! Guozhang Wang | guozhang@confluent.io | @guozhangwang. Visit Confluent at the Syncsort Booth (#1303), live demos @ 29th. Download Kafka Streams: www.confluent.io/product
    136. We are Hiring!
