
Building Stream Processing Applications with Apache Kafka's Exactly-Once Processing Guarantees


This talk was given at the "Big Data Applications" Meetup group (https://www.meetup.com/BigDataApps/).

Abstract:
Kafka 0.11 added a new feature called "exactly-once guarantees". In this talk, we explain what "exactly-once" means in the context of Kafka and data stream processing, and how it affects application development. The talk goes into some of the details of exactly-once, namely the new idempotent producer and transactions, and shows how both can be exploited to simplify application code: for example, you do not need complex deduplication code in your input path, because you can rely on Kafka to deduplicate messages when data is produced by an upstream application. Transactions can be used to write multiple messages into different topics and/or partitions and to commit all writes atomically (or abort all writes, so that none of them will be read by a downstream consumer in read-committed mode). Thus, transactions allow for applications with strong consistency guarantees, as required in the financial sector, for example (e.g., send both a withdrawal and a deposit message to transfer money, or neither). Finally, we talk about Kafka's Streams API, which makes exactly-once stream processing as simple as it can get.



  1. Building Stream Processing Applications with Apache Kafka's Exactly-Once Processing Guarantees
     Matthias J. Sax | Software Engineer | matthias@confluent.io | @MatthiasJSax
  2. Apache Kafka
     • A distributed streaming platform (producers, consumers, connectors, processing)
  3. Confluent
     • Founded by the original creators of Apache Kafka
     • Headquartered in Palo Alto, CA
     • KSQL: Streaming SQL for Apache Kafka, Developer Preview (https://github.com/confluentinc/ksql)
  4. How to Build Applications with Apache Kafka
     • Streams API
       • Client library (it's actually much more, but you use it like one)
     • DIY using Consumer/Producer API

     <dependency>
       <groupId>org.apache.kafka</groupId>
       <artifactId>kafka-streams</artifactId>
       <version>0.11.0.1</version>
     </dependency>

     <dependency>
       <groupId>org.apache.kafka</groupId>
       <artifactId>kafka-clients</artifactId>
       <version>0.11.0.1</version>
     </dependency>
  5. Apache Kafka is a Streaming Platform
     • The Streams API does NOT run inside the Kafka brokers!
  6. Deploy as you wish
  7. Streams API: The easiest way to use exactly-once semantics!
     • Easy-to-use and powerful DSL, plus a low-level Processor API
     • Filters, aggregations, windows, joins, tables, punctuations, …
     • Rich time semantics (event time, ingestion time, processing time)
     • Elastic, scalable, fault-tolerant (including state)
     • S, M, L, XL, … use cases
     • No need to change any code to use exactly-once!
       • Config parameter processing.guarantee = "exactly_once"
  8. Apache Kafka's Exactly-Once Guarantees
     • Avoid duplicates on writes
     • Free application code from deduplication
     • Simplify application development
     • Enable new use cases with strong consistency guarantees
       • Stock market, financial industry, billing, etc.
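Conceptually, the idempotent producer avoids duplicates on writes because the broker tracks, per producer and partition, the last sequence number it accepted, and silently drops resends it has already seen. The following is a minimal, self-contained sketch of that idea in plain Java; the class and method names are illustrative and not Kafka's actual internals.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch: broker-side deduplication by producer id + sequence number.
public class IdempotentLogSketch {
    private final Map<Long, Integer> lastSequencePerProducer = new HashMap<>();
    private final List<String> log = new ArrayList<>();

    // Append only if this (producerId, sequence) pair was not seen before.
    public boolean append(long producerId, int sequence, String message) {
        int last = lastSequencePerProducer.getOrDefault(producerId, -1);
        if (sequence <= last) {
            return false; // duplicate resend (e.g., a producer retry): dropped
        }
        lastSequencePerProducer.put(producerId, sequence);
        log.add(message);
        return true;
    }

    public int size() {
        return log.size();
    }

    public static void main(String[] args) {
        IdempotentLogSketch broker = new IdempotentLogSketch();
        broker.append(42L, 0, "withdrawal");
        broker.append(42L, 1, "deposit");
        broker.append(42L, 1, "deposit"); // retry after a timeout: deduplicated
        System.out.println(broker.size()); // 2, not 3
    }
}
```

This is why the application no longer needs deduplication code in its input path: retries by an upstream producer never become duplicate log entries.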
  9. Core Concepts in Streams API
  10. Topics, Streams, and Tables
  11. Processing Streams

      KStreamBuilder builder = new KStreamBuilder();
      KStream<Long, String> inputStream = builder.stream("input-topic");
      KStream<Long, String> outputStream =
          inputStream.mapValues(value -> value.toLowerCase());
      outputStream.to("output-topic");

  12. Using Tables

      KTable<Long, String> inputTable = builder.table("changelog-topic");

  13. Using Tables

      KStream<Long, String> enrichedStream = inputStream.join(inputTable, …);

  14. Aggregating Streams

      KTable<Long, Long> countPerKey = enrichedStream.groupByKey().count();
  15. End-To-End Application and Exactly-Once
      • read – process – write
      • track input offsets – track state updates – write output
  16. Application Failure Scenarios: Tracking Offsets

      read(k,v1) -> process -> output -> commit offsets
      read(k,v2) -> process -> output -> CRASH
      read(k,v2) -> process -> output -> commit offsets

      Duplicate reads result in duplicate writes.
  17. Application Failure Scenarios: State Update

      read(k,v1) -> process/state -> output -> commit offsets
      read(k,v2) -> process/state -> CRASH
      read(k,v2) -> process/state -> output -> commit offsets

      Duplicate reads result in corrupted state and thus wrong results (e.g., over-counting).
  18. Application Failure Scenarios: Error on Write

      Producer application: write(k,v1), write(k,v2), write(k,v2) (retry after error)
      Consumer application: read(k,v1), read(k,v2), read(k,v2)

      Duplicate writes lead to wrong results and to duplicate downstream reads.
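The failure scenarios above share one root cause: the output write and the input-offset commit are two separate steps, so a crash between them replays input after restart. That at-least-once failure mode can be sketched with a small, self-contained simulation (the class and method names are illustrative, not part of Kafka):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the failure mode from slides 16-18: output is written before the
// input offset is committed, so a crash in between causes a replayed read and
// a duplicate write after restart.
public class ReplaySketch {
    public static List<String> runPipeline(List<String> input, boolean crashBeforeCommit) {
        List<String> output = new ArrayList<>();
        int committedOffset = 0;

        // First run: read -> process -> write output -> commit offset.
        for (int offset = committedOffset; offset < input.size(); offset++) {
            output.add(input.get(offset).toLowerCase()); // write output
            if (crashBeforeCommit && offset == input.size() - 1) {
                break; // CRASH: output already written, offset NOT committed
            }
            committedOffset = offset + 1; // commit offset
        }

        // Restart: resume from the last committed offset.
        for (int offset = committedOffset; offset < input.size(); offset++) {
            output.add(input.get(offset).toLowerCase()); // duplicate write
            committedOffset = offset + 1;
        }
        return output;
    }
}
```

With no crash, `runPipeline(["A","B"], false)` yields `[a, b]`; with a crash before the last commit, it yields `[a, b, b]`. Making the write and the offset commit one atomic step is exactly what Kafka's transactions provide.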
  19. Error Propagation: Application -> Application -> Application
  20. Exactly-Once in Kafka Streams API
      Kafka's Streams API provides exactly-once processing guarantees via an atomic read-process-write pattern. This allows for deep processing pipelines with exactly-once guarantees.
  21. Exactly-Once in Apache Kafka Streams API
  22. Exactly-Once in Kafka Streams API (since v0.11.0)
      • Builds on top of KafkaProducer and KafkaConsumer
      • In v0.11.0, KafkaProducer adds:
        • Idempotent writes
        • Transactional API
          • Includes offset commits in a producer transaction
          • No offset commits via KafkaConsumer
      • In v0.11.0, KafkaConsumer adds:
        • read_committed mode (vs. read_uncommitted)
  23. How to use exactly-once capabilities:
      • Streams API (the easiest way to use exactly-once semantics)
        • Config parameter processing.guarantee = "exactly_once"
      • Idempotent producer
        • Config parameter enable.idempotence = true
      • Transactional producer
        • Config parameter transactional.id = "my-unique-tid"
        • And the transactional API (hard to use correctly, even if it looks simple on the surface)
      • Transactional consumer
        • Config parameter isolation.level = "read_committed" (default: "read_uncommitted")
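The knobs above are ordinary client configuration properties. As a sketch, they can be grouped per client using plain `java.util.Properties` (the class and helper names here are illustrative; in a real application you would also set `bootstrap.servers`, serializers, etc.):

```java
import java.util.Properties;

// Sketch: the exactly-once configuration parameters from slide 23, grouped by
// which client they belong to.
public class ExactlyOnceConfigs {
    public static Properties streamsConfig() {
        Properties props = new Properties();
        // Kafka Streams: a single switch enables end-to-end exactly-once.
        props.put("processing.guarantee", "exactly_once");
        return props;
    }

    public static Properties producerConfig() {
        Properties props = new Properties();
        props.put("enable.idempotence", "true");        // idempotent writes
        props.put("transactional.id", "my-unique-tid"); // enables the transactional API
        return props;
    }

    public static Properties consumerConfig() {
        Properties props = new Properties();
        props.put("isolation.level", "read_committed"); // default: read_uncommitted
        return props;
    }
}
```

Note that setting `transactional.id` implies idempotence, and the `transactional.id` must be unique per producer instance so that a restarted instance can fence its zombie predecessor.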
  24. Transactional API

      producer.initTransactions();
      try {
          producer.beginTransaction();
          producer.send(message1);
          producer.send(message2);
          producer.sendOffsetsToTransaction(…);
          producer.commitTransaction();
      } catch (ProducerFencedException e) {
          // another producer with the same transactional.id took over: stop
          producer.close();
      } catch (KafkaException e) {
          // recoverable error: abort, then retry the transaction
          producer.abortTransaction();
      }
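What `commitTransaction()` versus `abortTransaction()` means for a downstream consumer in read_committed mode can be sketched as follows: aborted messages remain in the log but are filtered out, while committed messages become visible as a unit. This is a self-contained, illustrative simulation, not Kafka's actual log format:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of transactional visibility: a consumer in read_committed mode only
// sees messages from transactions that ended with a commit marker; aborted
// transactions are filtered out as a unit (all-or-nothing).
public class TxnVisibilitySketch {
    static class Txn {
        final boolean committed;
        final List<String> messages;

        Txn(boolean committed, String... messages) {
            this.committed = committed;
            this.messages = Arrays.asList(messages);
        }
    }

    static List<String> readCommitted(List<Txn> log) {
        List<String> visible = new ArrayList<>();
        for (Txn txn : log) {
            if (txn.committed) {
                visible.addAll(txn.messages); // all-or-nothing visibility
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<Txn> log = Arrays.asList(
            new Txn(true, "withdrawal", "deposit"), // committed money transfer
            new Txn(false, "withdrawal-2"));        // aborted: no partial transfer
        System.out.println(readCommitted(log));     // [withdrawal, deposit]
    }
}
```

This mirrors the money-transfer example from the abstract: either both the withdrawal and the deposit message become visible, or neither does.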
  25. Summary
      • Streams API is the easiest way to build applications with Apache Kafka
        • It's a library that enriches your application
        • No compute cluster required
      • It provides end-to-end exactly-once processing guarantees
      • Kafka's exactly-once guarantees provide strong semantics that simplify your application code
  26. Material
      • Download Confluent Open Source: https://www.confluent.io/product/confluent-open-source/
      • Check out the docs: https://docs.confluent.io/
      • Check our blog:
        • Exactly-Once: https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/
        • Micro-Service Blog Series: https://www.confluent.io/blog/data-dichotomy-rethinking-the-way-we-treat-data-and-services/
      • Kafka Summit talks:
        • Exactly-Once (NY): https://www.confluent.io/kafka-summit-nyc17/resource/#exactly-once-semantics_slide
        • Exactly-Once with Streams API (SF): https://www.confluent.io/kafka-summit-sf17/resource/#Exactly-once-Stream-Processing-with-Kafka-Streams_slide
        • Micro-Services (SF): https://www.confluent.io/kafka-summit-sf17/resource/#building-event-driven-services-stateful-streams_slide
  27. Thank You
      We are hiring!
