
Stream Processing made simple with Kafka

  1. Kafka Streams: Stream Processing Made Simple with Kafka. Guozhang Wang, Hadoop Summit, June 28, 2016
  2. What is NOT Stream Processing?
  3. Stream Processing isn’t (necessarily) • transient, approximate, or lossy • something that must have batch processing as a safety net
  4.–7. (image-only slides)
  8. Stream Processing • a different programming paradigm • that brings computation to unbounded data • with tradeoffs between latency / cost / correctness
  9. Why Kafka in Stream Processing?
  10. Kafka: Real-time Platforms • Persistent Buffering • Logical Ordering • Highly Scalable “source-of-truth”
  11. Stream Processing with Kafka
  12. Stream Processing with Kafka • Option I: Do It Yourself!
  13. Stream Processing with Kafka • Option I: Do It Yourself!

      while (isRunning) {
        // read some messages from Kafka
        inputMessages = consumer.poll();
        // do some processing...
        // send output messages back to Kafka
        producer.send(outputMessages);
      }
  14. DIY Stream Processing is Hard • Ordering • Partitioning & Scalability • Fault tolerance • State Management • Time, Window & Out-of-order Data • Re-processing
  15. Stream Processing with Kafka • Option I: Do It Yourself! • Option II: full-fledged stream processing system • Storm, Spark, Flink, Samza, ..
  16.–18. MapReduce Heritage? • Config Management • Resource Management • Deployment • etc.. Can I just use my own?!
  19. Stream Processing with Kafka • Option I: Do It Yourself! • Option II: full-fledged stream processing system • Option III: lightweight stream processing library
  20. Kafka Streams • In Apache Kafka since v0.10, May 2016 • Powerful yet easy-to-use stream processing library • Event-at-a-time, stateful • Windowing with out-of-order handling • Highly scalable, distributed, fault tolerant • and more..
  21. Anywhere, anytime
  22. Anywhere, anytime

      <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-streams</artifactId>
        <version>0.10.0.0</version>
      </dependency>
  23. Anywhere, anytime • War File • Rsync • Puppet/Chef • YARN • Mesos • Docker • Kubernetes (Very Uncool → Very Cool)
  24. Simple is Beautiful
  25.–30. Kafka Streams DSL

      public static void main(String[] args) {
        // specify the processing topology by first reading in a stream from a topic
        KStream<String, String> words = builder.stream("topic1");
        // count the words in this stream as an aggregated table
        KTable<String, Long> counts = words.countByKey("Counts");
        // write the result table to a new topic
        counts.to("topic2");
        // create a stream processing instance and start running it
        KafkaStreams streams = new KafkaStreams(builder, config);
        streams.start();
      }
  31.–32. Native Kafka Integration

      Properties cfg = new Properties();
      cfg.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
      cfg.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
      cfg.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
      cfg.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");
      cfg.put(KafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "registry:8081");
      StreamsConfig config = new StreamsConfig(cfg);
      ...
      KafkaStreams streams = new KafkaStreams(builder, config);
  33.–34. (figures: the full stack you must evaluate: API & coding, “full stack” evaluation, operations & debugging. Simple is Beautiful)
  35. Key Idea: Outsource hard problems to Kafka!
  36. Kafka Concepts: the Log (figure: a producer appends messages to the log; Consumer1 reads at offset 7, Consumer2 reads at offset 10)
  37. Kafka Concepts: the Log (figure: producers and consumers connected through brokers hosting the partitions of Topic 1 and Topic 2)
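The log abstraction in the figures above can be sketched in plain Java (no Kafka dependency; all names here are illustrative): an append-only sequence where reads do not consume, and each consumer only tracks its own offset.

```java
import java.util.ArrayList;
import java.util.List;

// A toy append-only log: producers append, consumers read at their own offsets.
class ToyLog {
    private final List<String> messages = new ArrayList<>();

    // Appending returns the offset the record was written at.
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // A read does not consume: many consumers can read the same offset.
    String read(long offset) {
        return messages.get((int) offset);
    }

    public static void main(String[] args) {
        ToyLog log = new ToyLog();
        for (String m : new String[] {"m0", "m1", "m2", "m3"}) {
            log.append(m);
        }
        long consumer1Offset = 1;  // each consumer owns only its position
        long consumer2Offset = 3;
        System.out.println(log.read(consumer1Offset)); // m1
        System.out.println(log.read(consumer2Offset)); // m3
    }
}
```

Because consumption is just an offset, the log can serve as a persistent buffer and a replayable source of truth at the same time.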
  38. Kafka Streams: Key Concepts
  39. Stream and Records (figure: a stream is an ordered sequence of key-value records)
  40.–41. Processor Topology (figures: streams flowing through stream processors)
  42.–46. Processor Topology

      KStream<..> stream1 = builder.stream("topic1");
      KStream<..> stream2 = builder.stream("topic2");
      KStream<..> joined = stream1.leftJoin(stream2, ...);
      KTable<..> aggregated = joined.aggregateByKey(...);
      aggregated.to("topic3");
  47. Processor Topology (figure: source processors are created by builder.stream(...); sink processors by aggregated.to(...))
  48. Processor Topology (figure: the topology runs inside Kafka Streams, reading from and writing back to Kafka)
  49. Data Parallelism (figure: partitions of Kafka Topic A assigned to Task1 and Task2, running inside instances MyApp.1 and MyApp.2 and writing to Kafka Topic B)
  50. Stream Processing Hard Parts • Ordering • Partitioning & Scalability • Fault tolerance • State Management • Time, Window & Out-of-order Data • Re-processing
  51. States in Stream Processing • Stateless: filter, map • Stateful: join, aggregate
  52. (image-only slide)
  53. States in Stream Processing

      KStream<..> stream1 = builder.stream("topic1");
      KStream<..> stream2 = builder.stream("topic2");
      KStream<..> joined = stream1.leftJoin(stream2, ...);
      KTable<..> aggregated = joined.aggregateByKey(...);   // State
      aggregated.to("topic2");
  54. States in Stream Processing (figure: Task1 and Task2, each with local State, reading from Kafka Topic A and writing to Kafka Topic B)
  55. It’s all about Time • Event-time (when an event is created) • Processing-time (when an event is processed)
  56. (figure: the Star Wars films as an out-of-order stream: event-time is the episode order I–VII, processing-time is the release year: The Phantom Menace 1999, Attack of the Clones 2002, Revenge of the Sith 2005, A New Hope 1997, The Empire Strikes Back 1980, Return of the Jedi 1983, The Force Awakens 2015. Out-of-Order!)
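The Star Wars example can be replayed in plain Java (illustrative, no Kafka dependency): records arrive in processing order, each carrying its own event-time, and sorting by that embedded timestamp recovers the logical order.

```java
import java.util.Arrays;
import java.util.Comparator;

// Records arrive in processing order; each carries its own event-time.
class EventTimeDemo {
    record Event(String name, int eventTime) {}

    public static void main(String[] args) {
        // Processing order = release order, which is out of event-time order.
        Event[] arrived = {
            new Event("Episode IV", 4), new Event("Episode V", 5),
            new Event("Episode VI", 6), new Event("Episode I", 1),
            new Event("Episode II", 2), new Event("Episode III", 3),
            new Event("Episode VII", 7),
        };
        // Reorder by event-time to recover the logical order.
        Arrays.sort(arrived, Comparator.comparingInt(Event::eventTime));
        for (Event e : arrived) System.out.println(e.name());
    }
}
```

A real stream processor cannot simply sort an unbounded stream, which is why windowing and out-of-order handling (next slides) exist; this sketch only shows why the two notions of time diverge.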
  57.–59. Timestamp Extractor

      // processing-time
      public long extract(ConsumerRecord<Object, Object> record) {
        return System.currentTimeMillis();
      }

      // event-time
      public long extract(ConsumerRecord<Object, Object> record) {
        return record.timestamp();
      }
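To plug a custom extractor in (in the 0.10.0 API), the class is set through the streams configuration; `MyEventTimeExtractor` is a hypothetical class implementing the `TimestampExtractor` interface:

```java
cfg.put(StreamsConfig.TIMESTAMP_EXTRACTOR_CLASS_CONFIG,
        MyEventTimeExtractor.class.getName());
```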
  60.–66. Windowing (figures: windows accumulating records along the time axis t)
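Tumbling windows, the simplest windowing scheme, can be sketched without Kafka: each record's event-time maps it to exactly one fixed-size bucket, and an aggregate is kept per bucket. A minimal sketch with illustrative names:

```java
import java.util.Map;
import java.util.TreeMap;

// Count records per 10-second tumbling window, keyed by window start time.
class TumblingWindowCount {
    public static void main(String[] args) {
        long windowSizeMs = 10_000L;
        long[] eventTimes = {1_000, 4_000, 12_000, 15_000, 21_000};
        Map<Long, Integer> countsPerWindow = new TreeMap<>();
        for (long t : eventTimes) {
            long windowStart = (t / windowSizeMs) * windowSizeMs; // bucket start
            countsPerWindow.merge(windowStart, 1, Integer::sum);
        }
        System.out.println(countsPerWindow); // {0=2, 10000=2, 20000=1}
    }
}
```

Because a late, out-of-order record still maps to its own window start, it simply updates that window's aggregate when it finally arrives, which is how windowed results can be revised rather than dropped.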
  67. Stream Processing Hard Parts • Ordering • Partitioning & Scalability • Fault tolerance • State Management • Time, Window & Out-of-order Data • Re-processing
  68. Stream vs. Table?

      KStream<..> stream1 = builder.stream("topic1");
      KStream<..> stream2 = builder.stream("topic2");
      KStream<..> joined = stream1.leftJoin(stream2, ...);
      KTable<..> aggregated = joined.aggregateByKey(...);   // State
      aggregated.to("topic2");
  69. Tables ≈ Streams
  70.–72. (image-only slides)
  73. The Stream-Table Duality • A stream is a changelog of a table • A table is a materialized view of a stream at a point in time • Example: change data capture (CDC) of databases
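The duality can be demonstrated in a few lines of plain Java (illustrative, no Kafka dependency): replaying a changelog stream reproduces the table, because each update overwrites the previous value for its key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A table is the fold of its changelog; a changelog is the table's update history.
class StreamTableDuality {
    public static void main(String[] args) {
        // Changelog stream: (key, new-value) updates in arrival order.
        List<SimpleEntry<String, String>> changelog = List.of(
            new SimpleEntry<>("alice", "lnkd"),
            new SimpleEntry<>("bob", "googl"),
            new SimpleEntry<>("alice", "msft"));  // later update wins

        // Materialize the table by replaying the stream.
        Map<String, String> table = new HashMap<>();
        for (var update : changelog) table.put(update.getKey(), update.getValue());

        System.out.println(table.get("alice")); // msft
        System.out.println(table.get("bob"));   // googl
    }
}
```

Running the fold in the other direction also holds: recording every `put` into a list reconstructs exactly the changelog, which is the CDC view of a database table.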
  74. KStream = interprets data as a record stream (think: “append-only”) • KTable = interprets data as a changelog stream (a continuously updated materialized view)
  75.–77. (figures: user purchase history as a KStream (alice: eggs, bob: lettuce, alice: milk — “Alice bought eggs and milk.”) vs. user employment profile as a KTable (alice: lnkd, bob: googl, alice: msft — “Alice is now at Microsoft.”))
  78.–79. (figures: records alice: 2, bob: 10, alice: 3 arriving over time. KStream.aggregate() treats each record as an independent event, so Alice’s aggregate becomes 2+3. KTable.aggregate() treats the later record as an update that replaces the earlier one, so Alice’s value becomes 3.)
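The two aggregation semantics above can be sketched in plain Java (illustrative): a KStream-style aggregate folds every record into the running value, while a KTable-style aggregate keeps only the latest value per key.

```java
import java.util.HashMap;
import java.util.Map;

// Same input, two aggregation semantics over (key, value) records.
class AggregateSemantics {
    public static void main(String[] args) {
        String[][] records = {{"alice", "2"}, {"bob", "10"}, {"alice", "3"}};

        Map<String, Integer> streamAgg = new HashMap<>(); // sum of all records
        Map<String, Integer> tableAgg = new HashMap<>();  // latest value only
        for (String[] r : records) {
            int v = Integer.parseInt(r[1]);
            streamAgg.merge(r[0], v, Integer::sum); // KStream: 2 + 3
            tableAgg.put(r[0], v);                  // KTable: 3 replaces 2
        }
        System.out.println(streamAgg.get("alice")); // 5
        System.out.println(tableAgg.get("alice"));  // 3
    }
}
```

Which semantics is correct depends on what the topic means: purchases are facts to accumulate, an employment profile is state to overwrite.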
  80. (figure: reduce() and aggregate() turn a KStream into a KTable; toStream() turns a KTable back into a KStream; both sides support map(), filter(), join(), …)
  81.–84. Updates Propagation in KTable (figures: an update to the KTable state propagates downstream through the joined KStream and the aggregated KTable)
  85. Stream Processing Hard Parts • Ordering • Partitioning & Scalability • Fault tolerance • State Management • Time, Window & Out-of-order Data • Re-processing
  86. Remember?
  87. Fault Tolerance (figure: each Process keeps local State, backed by a changelog topic in Kafka)
  88.–89. Fault Tolerance (figures: when a process fails, the rebalance protocol reassigns its tasks and a new Process restores its State from the Kafka changelog)
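The restore path can be simulated in plain Java (illustrative): local state dies with the failed process, but because every update was also appended to a changelog, a replacement task rebuilds the state by replaying it before resuming.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Every state update is mirrored to a changelog; restore = replay the changelog.
class ChangelogRestore {
    public static void main(String[] args) {
        List<String[]> changelog = new ArrayList<>();
        Map<String, Integer> state = new HashMap<>();

        // Normal processing: update local state AND append to the changelog.
        String[][] updates = {{"alice", "2"}, {"bob", "10"}, {"alice", "5"}};
        for (String[] upd : updates) {
            state.put(upd[0], Integer.parseInt(upd[1]));
            changelog.add(upd);
        }

        state = null; // the process crashes: local state is gone

        // A replacement task restores the state by replaying the changelog.
        Map<String, Integer> restored = new HashMap<>();
        for (String[] upd : changelog) restored.put(upd[0], Integer.parseInt(upd[1]));
        System.out.println(restored.get("alice")); // 5
        System.out.println(restored.get("bob"));   // 10
    }
}
```

This is the "outsource hard problems to Kafka" idea from slide 35: durability of the state comes from the log, not from the process.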
  90.–93. (image-only slides)
  94. Stream Processing Hard Parts • Ordering • Partitioning & Scalability • Fault tolerance • State Management • Time, Window & Out-of-order Data • Re-processing
  95. Stream Processing Hard Parts • Ordering • Partitioning & Scalability • Fault tolerance • State Management • Time, Window & Out-of-order Data • Re-processing. Simple is Beautiful
  96. But how to get data in / out of Kafka?
  97.–100. (image-only slides)
  103. Take-aways • Stream Processing: a new programming paradigm • Kafka Streams: stream processing made easy. THANKS! Guozhang Wang | guozhang@confluent.io | @guozhangwang • Visit Confluent at the Syncsort Booth (#1303), live demos @ 29th • Download Kafka Streams: www.confluent.io/product
  104. We are Hiring!
