
Apache Kafka and the Rise of Stream Processing

For a long time, much of the data processing that companies did ran as big batch jobs. But businesses operate in real time, and the software they run is catching up. Today, processing data as streams is becoming increasingly popular in many companies, displacing the more "traditional" approach of batch-processing big data sets that are available as a whole.



  1. Apache Kafka and the Rise of Stream Processing. Guozhang Wang, DataEngConf, Nov 3, 2016
  2. What is Kafka, really?
  3. What is Kafka, really? A scalable pub-sub messaging system [NetDB 2011]
  4. What is Kafka, really? A scalable pub-sub messaging system [NetDB 2011]; a real-time data pipeline [Hadoop Summit 2013]
  5. Example: Centralized Data Pipeline. Apache Kafka sits between sources (KV-store, doc-store, RDBMS, tracking, logs / metrics, ...) and consumers (Hadoop / DW, monitoring, recommendation engine, social graph, searching, security, ...)
  6. What is Kafka, really? A scalable pub-sub messaging system [NetDB 2011]; a real-time data pipeline [Hadoop Summit 2013]; a distributed and replicated log [VLDB 2015]
  7. Example: Data Store Geo-Replication. In Region 1, user apps write to local stores and the writes are appended to the local Kafka log; log mirroring carries the log to Region 2, where it is applied to the local stores and read by user apps
  8. What is Kafka, really? A scalable pub-sub messaging system [NetDB 2011]; a real-time data pipeline [Hadoop Summit 2013]; a distributed and replicated log [VLDB 2015]; a unified data integration stack [CIDR 2015]
  9. Example: Asynchronous Micro-Services
  10-11. Which of the following is true? A scalable pub-sub messaging system [NetDB 2011]? A real-time data pipeline [Hadoop Summit 2013]? A distributed and replicated log [VLDB 2015]? A unified data integration stack [CIDR 2015]? All of them!
  12-14. Kafka: Streaming Platform • Publish / Subscribe: move data around as online streams • Store: "source-of-truth" continuous data • Process: react to / process data in real time. Producers publish into Kafka, Consumers subscribe out of it, and Connect bridges external systems on both sides
  16. Connectors: 40+ since the first release this February (0.9+); 13 from Confluent and partners
  17-18. Kafka: Streaming Platform • Publish / Subscribe • Store • Process. Alongside Producers, Consumers, and Connect sits Streams, the processing layer and the subject of today's talk. A minimal publish / subscribe round trip is sketched below
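     To make the publish / subscribe layer concrete, here is a minimal sketch with the standard Java clients; the broker address, topic name "topic1", and group id "demo" are illustrative assumptions, not from the slides:

      import java.util.Collections;
      import java.util.Properties;
      import org.apache.kafka.clients.consumer.ConsumerRecord;
      import org.apache.kafka.clients.consumer.KafkaConsumer;
      import org.apache.kafka.clients.producer.KafkaProducer;
      import org.apache.kafka.clients.producer.ProducerRecord;

      public class PubSubSketch {
          public static void main(String[] args) {
              Properties p = new Properties();
              p.put("bootstrap.servers", "localhost:9092");   // assumed local broker
              p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
              p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
              try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
                  // publish: append one record to the topic
                  producer.send(new ProducerRecord<>("topic1", "alice", "eggs"));
              }

              Properties c = new Properties();
              c.put("bootstrap.servers", "localhost:9092");
              c.put("group.id", "demo");
              c.put("auto.offset.reset", "earliest");         // read from the start of the topic
              c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
              c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
              try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
                  // subscribe: join the group and poll records as they arrive
                  consumer.subscribe(Collections.singletonList("topic1"));
                  for (ConsumerRecord<String, String> r : consumer.poll(1000))
                      System.out.println(r.key() + " -> " + r.value());
              }
          }
      }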
  19. Stream Processing: a different programming paradigm that brings computation to unbounded data, with tradeoffs between latency, cost, and correctness
  20. Example: Online Machine Learning. A training / feedback stream drives model updates; the model serves predictions, turning an input stream (parameters, etc.) into an output stream (scores, categories, etc.)
  21. Example: Asynchronous Micro-Services. Sales and shipment services exchange order events through a message queue
  22. Example: Real-time Analytics. Change-data-capture streams from online DBs (DB1, DB2) and log file streams feed an OLAP engine, which serves query results
  23. Kafka Streams (0.10+) • New client library besides producer and consumer • Powerful yet easy to use • Event-at-a-time, stateful • Windowing with out-of-order handling • Highly scalable, distributed, fault tolerant • and more
  24. Anywhere, anytime
  25. Anywhere, anytime
      <dependency>
          <groupId>org.apache.kafka</groupId>
          <artifactId>kafka-streams</artifactId>
          <version>0.10.0.0</version>
      </dependency>
  26. Anywhere, anytime: from war file, rsync, and Puppet/Chef (very uncool) to YARN, Mesos, Docker, and Kubernetes (very cool)
  27. Simple is Beautiful
  28-32. Kafka Streams DSL (the same code over five slides, highlighting one step at a time; a self-contained, runnable version follows below)
      public static void main(String[] args) {
          // specify the processing topology by first reading in a stream from a topic
          KStream<String, String> words = builder.stream("topic1");
          // count the words in this stream as an aggregated table
          KTable<String, Long> counts = words.countByKey("Counts");
          // write the result table to a new topic
          counts.to("topic2");
          // create a stream processing instance and start running it
          KafkaStreams streams = new KafkaStreams(builder, config);
          streams.start();
      }
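     As written on the slide, builder and config are left undeclared. A self-contained version compiling against the 0.10.0 API might look like the sketch below; the application id, broker, and ZooKeeper addresses are assumptions, and explicit serdes are passed to to() because the counts are Longs while the default serde here is String:

      import java.util.Properties;
      import org.apache.kafka.common.serialization.Serdes;
      import org.apache.kafka.streams.KafkaStreams;
      import org.apache.kafka.streams.StreamsConfig;
      import org.apache.kafka.streams.kstream.KStream;
      import org.apache.kafka.streams.kstream.KStreamBuilder;
      import org.apache.kafka.streams.kstream.KTable;

      public class WordCountApp {
          public static void main(String[] args) {
              Properties props = new Properties();
              // unique id: prefixes consumer groups, internal topics, and state dirs
              props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-example");
              props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
              // 0.10.0 still manages its internal topics through ZooKeeper
              props.put(StreamsConfig.ZOOKEEPER_CONNECT_CONFIG, "localhost:2181");
              // default serdes picked up by builder.stream()
              props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
              props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

              KStreamBuilder builder = new KStreamBuilder();
              KStream<String, String> words = builder.stream("topic1");
              KTable<String, Long> counts = words.countByKey("Counts");
              // counts are Longs, so pass explicit serdes instead of the String defaults
              counts.to(Serdes.String(), Serdes.Long(), "topic2");

              KafkaStreams streams = new KafkaStreams(builder, props);
              streams.start();
          }
      }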
  33-34. API and coding are only part of the picture: "full stack" evaluation also covers operations, debugging, ... Simple is Beautiful
  35. Kafka Streams: Key Concepts
  36. Streams and Records: a stream is an unbounded sequence of key-value records
  37. Processor Topology
      KStream<..> stream1 = builder.stream("topic1");
      KStream<..> stream2 = builder.stream("topic2");
      KStream<..> joined = stream1.leftJoin(stream2, ...);
      KTable<..> aggregated = joined.aggregateByKey(...);
      aggregated.to("topic3");
  38-39. Processor Topology: the DSL code translates into a graph whose edges are streams and whose nodes are stream processors
  40-44. Processor Topology: the same code repeated, highlighting one operator at a time as it is added to the topology
  45. Processor Topology: Source Processors bring records in from Kafka topics (builder.stream(...)); Sink Processors write results back out (aggregated.to(...))
  46. Processor Topology: the whole topology runs inside your Kafka Streams application, reading from and writing to Kafka (a hand-built equivalent with the lower-level Processor API is sketched below)
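     The DSL is one way to build such a topology; the lower-level Processor API wires the same source / processor / sink shape by hand. A sketch against the 0.10.0 TopologyBuilder API; the node names and the pass-through processor are made up for illustration:

      import org.apache.kafka.streams.processor.AbstractProcessor;
      import org.apache.kafka.streams.processor.TopologyBuilder;

      TopologyBuilder topology = new TopologyBuilder();
      topology.addSource("Source", "topic1")            // source processor: consumes topic1
              .addProcessor("Pass", () -> new AbstractProcessor<String, String>() {
                  @Override
                  public void process(String key, String value) {
                      context().forward(key, value);    // forward each record downstream unchanged
                  }
              }, "Source")
              .addSink("Sink", "topic3", "Pass");       // sink processor: produces to topic3
      // the topology can then be run with: new KafkaStreams(topology, props).start()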
  47-49. Stream Partitions and Tasks: the partitions of the input topics (P1, P2 of Kafka topics A and B) are grouped into stream tasks; each task owns its partitions and runs its own copy of the processor topology over them
  50-53. Stream Threads: tasks (Task1, Task2, Task3) are distributed over the stream threads of the running application instances (MyApp.1, MyApp.2); starting another instance (MyApp.3) rebalances tasks onto it, and when an instance goes away its tasks migrate back to the survivors (thread-count config sketched below)
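     Parallelism inside one instance is just configuration; a small fragment for the app's setup code, assuming 0.10-era StreamsConfig keys (the thread count and application id are illustrative):

      import java.util.Properties;
      import org.apache.kafka.streams.StreamsConfig;

      Properties props = new Properties();
      // instances sharing this application.id form one group and split the tasks
      props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");
      // run four stream threads in this instance; each thread owns a subset of the tasks
      props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);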
  54. States in Stream Processing: stateless operators (filter, map, ...) vs. stateful operators (join, aggregate, ...)
  56. States in Stream Processing
      KStream<..> stream1 = builder.stream("topic1");
      KStream<..> stream2 = builder.stream("topic2");
      KStream<..> joined = stream1.leftJoin(stream2, ...);
      KTable<..> aggregated = joined.aggregateByKey(...);   // backed by a local state store
      aggregated.to("topic3");
  57. States in Stream Processing: each task (Task1, Task2) keeps its own local state store for its share of Kafka topics A and B
  58-59. Interactive Queries on States: traditionally, your Streams app pushes its results into an external query engine, which other apps (App1 ... AppN) then query. Complexity: lots of moving parts
  60. Interactive Queries on States (0.10.1+): other apps query the Streams app's local state directly, with Kafka as the only other moving part (query sketch below)
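     A sketch of the 0.10.1 interactive-queries API, reusing the "Counts" store from the DSL example; exposing the lookup to other apps (e.g. over REST) is left to the application:

      import org.apache.kafka.streams.state.QueryableStoreTypes;
      import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

      // once streams.start() has run, local stores can be queried in-process
      ReadOnlyKeyValueStore<String, Long> counts =
          streams.store("Counts", QueryableStoreTypes.<String, Long>keyValueStore());
      Long count = counts.get("alice");   // point lookup against this instance's state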
  61. Stream vs. Table?
      KStream<..> stream1 = builder.stream("topic1");
      KStream<..> stream2 = builder.stream("topic2");
      KStream<..> joined = stream1.leftJoin(stream2, ...);
      KTable<..> aggregated = joined.aggregateByKey(...);   // backed by a local state store
      aggregated.to("topic3");
  62. Tables ≈ Streams
  66. The Stream-Table Duality • A stream is a changelog of a table • A table is a materialized view, at a point in time, of a stream • Example: change data capture (CDC) of databases
  67. KStream = interprets data as a record stream (think: "append-only"); KTable = interprets data as a changelog stream (think: continuously updated materialized view). See the sketch below
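     In code, the difference is just how a topic is read; a sketch against the 0.10.0 builder (later releases also ask for a store name on table()). The topic names echo the purchase-history / employment-profile example on the next slides:

      // every record is an independent fact: append-only
      KStream<String, String> purchases = builder.stream("purchases");
      // the latest record per key wins: a continuously updated view
      KTable<String, String> profiles = builder.table("profiles");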
  68-70. Example: read as a KStream (user purchase history), the records (alice, eggs), (bob, lettuce), (alice, milk) mean "Alice bought eggs and milk"; read as a KTable (user employment profile), the records (alice, lnkd), (bob, googl), (alice, msft) mean "Alice is now at Microsoft, no longer at LinkedIn"
  71-72. Example: given the records (alice, 2), (bob, 10), (alice, 3), both KStream.aggregate() and KTable.aggregate() yield (key: Alice, value: 2) after the first record; after (alice, 3), KStream.aggregate() yields (key: Alice, value: 2+3), since stream records are independent facts, while KTable.aggregate() yields (key: Alice, value: 3), since a changelog record replaces the previous value for its key (code contrast below)
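     The same contrast in code; a sketch with 0.10.0-era methods, where the topic "scores" and store name "Sums" are made up for the example:

      // hypothetical topic "scores" holding (alice, 2), (bob, 10), (alice, 3)
      KStream<String, Long> scores = builder.stream(Serdes.String(), Serdes.Long(), "scores");
      // stream aggregation: values add up, so alice ends at 2 + 3 = 5
      KTable<String, Long> sums = scores.reduceByKey((v1, v2) -> v1 + v2, "Sums");
      // table interpretation of the same topic: later records replace earlier ones, so alice ends at 3
      KTable<String, Long> latest = builder.table(Serdes.String(), Serdes.Long(), "scores");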
  73. KStream and KTable: reduce(), aggregate(), ... turn a KStream into a KTable; toStream() turns a KTable back into a KStream; map(), filter(), join(), ... exist on both
  74-77. Updates Propagation in KTable: in the stream1 / stream2 join topology, each update to the KTable's state propagates downstream as a new changelog record (animation over four slides)
  78. What about Fault Tolerance?
  79. Remember?
  80-82. Fault Tolerance: each process's local state is continuously backed up to a changelog topic in Kafka; when a process fails, the group protocol moves its tasks to another Kafka Streams instance, which restores the state by replaying the changelog before resuming processing (standby-replica config sketched below)
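     Restoration time can be shortened by keeping warm replicas of state on other instances; a config fragment (the value is illustrative; num.standby.replicas defaults to 0):

      // keep one warm copy of each task's state on another instance,
      // so failover replays only the tail of the changelog
      props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);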
  87. It's all about Time • Event-time (when an event is created) • Processing-time (when an event is processed)
  88. Out-of-Order example: the Star Wars saga in event-time (episode order 1-7: The Phantom Menace, Attack of the Clones, Revenge of the Sith, A New Hope, The Empire Strikes Back, Return of the Jedi, The Force Awakens) vs. processing-time (release years: 1999, 2002, 2005, 1977, 1980, 1983, 2015)
  89-92. Timestamp Extractor (wired in via config, as sketched below)
      // processing-time
      public long extract(ConsumerRecord<Object, Object> record) {
          return System.currentTimeMillis();
      }
      // event-time, from the timestamp embedded in the record (0.10+ message format)
      public long extract(ConsumerRecord<Object, Object> record) {
          return record.timestamp();
      }
      // event-time, from a field inside the payload
      public long extract(ConsumerRecord<Object, Object> record) {
          return ((JsonNode) record.value()).get("timestamp").longValue();
      }
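     These extract methods implement the TimestampExtractor interface, which is registered through configuration; a sketch (the class name is mine):

      import org.apache.kafka.clients.consumer.ConsumerRecord;
      import org.apache.kafka.streams.StreamsConfig;
      import org.apache.kafka.streams.processor.TimestampExtractor;

      public class RecordTimestampExtractor implements TimestampExtractor {
          @Override
          public long extract(ConsumerRecord<Object, Object> record) {
              // use the producer/broker timestamp carried by the record
              return record.timestamp();
          }
      }

      // in the app's setup:
      props.put(StreamsConfig.TIMESTAMP_EXTRACTOR_CLASS_CONFIG, RecordTimestampExtractor.class);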
  93-99. Windowing: (a sequence of diagram-only slides animating windows advancing along the time axis t; a windowed count example follows below)
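     A windowed aggregation in code; a sketch using the 0.10.0 API, where TimeWindows takes a name for its backing store (the window size and names are illustrative; later releases moved to groupByKey().count(...)):

      import java.util.concurrent.TimeUnit;
      import org.apache.kafka.common.serialization.Serdes;
      import org.apache.kafka.streams.kstream.KTable;
      import org.apache.kafka.streams.kstream.TimeWindows;
      import org.apache.kafka.streams.kstream.Windowed;

      // count per key over tumbling 1-minute windows; late-arriving (out-of-order)
      // records still update the window they belong to by event-time
      KTable<Windowed<String>, Long> windowedCounts = words.countByKey(
          TimeWindows.of("CountsWindow", TimeUnit.MINUTES.toMillis(1)),
          Serdes.String());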
  100-101. Stream Processing Hard Parts • Ordering • Partitioning & scalability • Fault tolerance • State management • Time, windows & out-of-order data • Re-processing. Simple is Beautiful. For more details: http://docs.confluent.io/current
  102. Ongoing Work (0.10.1+) • Beyond Java APIs: SQL support, Python client, etc. • End-to-end semantics (exactly-once) • ... and more
  103-105. Take-aways • Apache Kafka: a centralized streaming platform • Kafka Streams: stream processing made easy. THANKS! Guozhang Wang | guozhang@confluent.io | @guozhangwang. CFP Kafka Summit 2017 @ NYC & SF. Confluent Webinar: http://www.confluent.io/resources
