1
Exactly-once Data Processing
with Kafka Streams
Guozhang Wang
Kafka Meetup SF, July 27, 2017
2
Outline
• What is exactly-once for stream processing?
• How to achieve exactly-once with Kafka?
• Kafka Streams: exactly-once made easy
3
4
Stream Processing
Source Process Sink
State
Source Sink
5
Stream Processing with Kafka
Process
State
Kafka Topic A
Kafka Topic B
Kafka Topic C
Kafka Topic D
6
Stream Processing with Kafka
Process
State
Kafka Topic A
Kafka Topic B
Kafka Topic C
Kafka Topic D
ack
ack
commit
7
Exactly-Once
• An application property for stream processing,
• .. where each received record
• .. is processed exactly once,
• .. even under failures
8
Stream Processing with Kafka
Process
State
Kafka Topic A
Kafka Topic B
Kafka Topic C
Kafka Topic D
ack
ack
commit
9
Error Scenario #1: Duplicate Write
Process
State
Kafka Topic A
Kafka Topic B
Kafka Topic C
Kafka Topic D
ack
10
Error Scenario #1: Duplicate Write
Process
State
Kafka Topic A
Kafka Topic B
Kafka Topic C
Kafka Topic D
ack
11
Error Scenario #2: Re-process
Process
State
Kafka Topic A
Kafka Topic B
Kafka Topic C
Kafka Topic D
commit
ack
ack
12
Error Scenario #2: Re-process
Process
State
Kafka Topic A
Kafka Topic B
Kafka Topic C
Kafka Topic D
13
Error Scenario #2: Re-process
Process
State
Kafka Topic A
Kafka Topic B
Kafka Topic C
Kafka Topic D
14
Error Scenario #3: Data loss
Process
State
Kafka Topic A
Kafka Topic B
Kafka Topic C
Kafka Topic D
ack
ack
commit
15
Error Scenario #3: Data loss
State
Process
Kafka Topic A
Kafka Topic B
Kafka Topic C
Kafka Topic D
ack
ack
commit
16
Error Scenario #3: Data loss
Process
State
Kafka Topic A
Kafka Topic B
Kafka Topic C
Kafka Topic D
ack
17
Error Scenario #3: Data loss
Process
State
Kafka Topic A
Kafka Topic B
Kafka Topic C
Kafka Topic D
ack
18
Exactly-Once does NOT mean..
• Two Generals problem can now be solved
• .. or the FLP impossibility result is proven wrong
• .. or TCP at the transport level is “perfect”
• .. or you can get distributed consensus in arbitrary settings
19
What can cause incorrect results?
• Unbounded network partition (algorithmic proof)
• A long GC or hard crash
• A bad config in your system
• A human operator error
• A bug in your code
20
What can cause incorrect results?
• Unbounded network partition (algorithmic proof)
• A long GC or hard crash
• A bad config in your system
• A human operator error
• A bug in your code
99.9%
0.01%
21
What can cause incorrect results?
• Unbounded network partition (algorithmic proof)
• A long GC or hard crash
• A bad config in your system
• A human operator error
• A bug in your code
99.9%
0.01%
Can we do better for the 99.99%?
22
So how to achieve Exactly-Once?
23
Option #1: “Just give up”
Streaming
Source Sink
Batch
State
State
24
Option #2: At-least-once + Dedup
Process
State
Kafka Topic A
Kafka Topic B
Kafka Topic C
Kafka Topic D
ack
ack
commit
25
Option #2: At-least-once + Dedup
Process
State
Kafka Topic A
Kafka Topic B
Kafka Topic C
Kafka Topic D
26
Option #2: At-least-once + Dedup
Process
State
Kafka Topic A
Kafka Topic B
Kafka Topic C
Kafka Topic D
ack
ack
commit
27
Option #2: At-least-once + Dedup
2 2 3 3 4 4
Dedup
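The dedup step on this slide can be sketched as a filter that remembers the record IDs it has already emitted. The class and record IDs below are hypothetical, and a real deduplicator would need a fault-tolerant, bounded (e.g. windowed) store rather than an in-memory set:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Dedup {
    // Records already emitted downstream. In production this set must
    // itself survive failures and be bounded in size.
    private final Set<Long> seen = new HashSet<>();

    /** Returns true the first time a record id is observed, false for duplicates. */
    public boolean isNew(long recordId) {
        return seen.add(recordId); // Set.add() returns false if already present
    }

    public static void main(String[] args) {
        Dedup dedup = new Dedup();
        // At-least-once delivery may redeliver each record, as on the slide:
        long[] redelivered = {2, 2, 3, 3, 4, 4};
        List<Long> out = new ArrayList<>();
        for (long id : redelivered) {
            if (dedup.isNew(id)) out.add(id);
        }
        System.out.println(out); // [2, 3, 4]
    }
}
```

Note the catch this option glosses over: the seen-set is state too, so it must be updated atomically with the output, which is exactly the problem exactly-once is trying to solve.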
28
Option #3: The Kafka Way! (0.11+)
• Idempotent producer: exactly-once sends per partition
• Transactional messaging: multiple sends committed atomically
30
Idempotent Producer
Producer
Kafka Topic C
ack
pid = 1, seq = 28
config: enable.idempotence = true
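The mechanism behind this config: each producer gets a producer id (pid) and attaches a per-partition sequence number to every batch, and the broker discards retries of batches it has already appended. The sketch below illustrates that idea in plain Java; it is not Kafka's actual broker code (the real broker also rejects sequence gaps with an out-of-order-sequence error rather than silently accepting them):

```java
import java.util.HashMap;
import java.util.Map;

// Illustration of the broker-side check behind enable.idempotence = true:
// remember the last sequence number written per (pid, partition) and
// drop retries that were already appended.
public class SequenceDedup {
    private final Map<String, Integer> lastSeq = new HashMap<>();

    /** Returns true if the batch is appended, false if it is a duplicate retry. */
    public boolean tryAppend(long pid, int partition, int seq) {
        String key = pid + "-" + partition;
        Integer last = lastSeq.get(key);
        if (last != null && seq <= last) {
            return false; // already appended; the retried send is discarded
        }
        lastSeq.put(key, seq);
        return true;
    }

    public static void main(String[] args) {
        SequenceDedup broker = new SequenceDedup();
        System.out.println(broker.tryAppend(1, 0, 28)); // true: first write
        System.out.println(broker.tryAppend(1, 0, 28)); // false: retry after lost ack
        System.out.println(broker.tryAppend(1, 0, 29)); // true: next batch
    }
}
```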
32
Atomic Multi-Sends (a.k.a. “transactions”)
Producer
Kafka Topic A
Kafka Topic C
Kafka Topic D
Atomic Commit
try {
  producer.beginTxn();
  producer.send(rec1); // topic C
  producer.send(rec2); // topic D
  producer.sendOffsetsToTxn(A, 10);
  producer.commitTxn();
} catch (KafkaException e) {
  producer.abortTxn();
}
33
Atomic Multi-Sends (a.k.a. “transactions”)
Consumer
Kafka Topic C
Kafka Topic D
Read Committed
consumer.subscribe(C, D);
recs = consumer.poll();
for (Record rec : recs) {
  // process ..
}
config: isolation.level = read_committed (default = read_uncommitted)
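What read_committed buys you can be illustrated without a broker: records written inside an aborted transaction are filtered out before the application sees them, while read_uncommitted returns everything. The class and record type below are hypothetical stand-ins for the consumer's filtering step:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of isolation.level semantics: under read_committed, records
// from aborted transactions never reach the application.
public class ReadCommitted {
    record Rec(String value, boolean committed) {}

    static List<String> poll(List<Rec> log, boolean readCommitted) {
        List<String> out = new ArrayList<>();
        for (Rec r : log) {
            if (!readCommitted || r.committed) out.add(r.value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> log = List.of(
            new Rec("a", true),   // committed transaction
            new Rec("b", false),  // aborted transaction
            new Rec("c", true));  // committed transaction
        System.out.println(poll(log, true));  // [a, c]
        System.out.println(poll(log, false)); // [a, b, c]
    }
}
```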
34
Exactly-Once Processing with Kafka
Process
State
Kafka Topic C
Kafka Topic D
ack
ack
Kafka Topic A
Kafka Topic B
commit
35
Exactly-Once Processing with Kafka
• Offset commit for source topics
• Value update on processor state
• Acked produce to sink topics
All or Nothing
36
Kafka Streams (0.10+)
• New client library besides producer and consumer
• Powerful yet easy-to-use
• Event-at-a-time, Stateful
• Windowing with out-of-order handling
• Highly scalable, distributed, fault tolerant
• and more..
37
Anywhere, anytime
38
Anywhere, anytime
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-streams</artifactId>
<version>0.10.0.0</version>
</dependency>
39
Kafka Streams DSL
public static void main(String[] args) {
  // specify the processing topology by first reading in a stream from a topic
  KStream<String, String> words = builder.stream("topic1");
  // count the words in this stream as an aggregated table
  KTable<String, Long> counts = words.countByKey("Counts");
  // write the result table to a new topic
  counts.to("topic2");
  // create a stream processing instance and start running it
  KafkaStreams streams = new KafkaStreams(builder, config);
  streams.start();
}
44
Processor Topology
KStream<..> stream1 = builder.stream("topic3");
KStream<..> stream2 = builder.stream("topic3");
KStream<..> joined = stream1.leftJoin(stream2, ...);
KTable<..> aggregated = joined.aggregateByKey(...);
aggregated.to("topic3");
49
Processor Topology
KStream<..> stream1 = builder.stream("topic3");
KStream<..> stream2 = builder.stream("topic3");
KStream<..> joined = stream1.leftJoin(stream2, ...);
KTable<..> aggregated = joined.aggregateByKey(...);
aggregated.to("topic3");
State
50
Processing in Kafka Streams
Kafka Topic B Kafka Topic A
51
Processing in Kafka Streams
Kafka Topic B Kafka Topic A
Processor Topology
P1
P2
P1
P2
52
Processing in Kafka Streams
Kafka Topic A Kafka Topic B
53
Processing in Kafka Streams
Kafka Topic B Kafka Topic A
MyApp.1 MyApp.2
Task2Task1
54
States in Stream Processing
MyApp.2MyApp.1
Kafka Topic B
Task2Task1
Kafka Topic A
State State
55
Fault Tolerance in Streams
StateProcess
StateProcess
StateProcess
Kafka
Kafka Streams
Kafka
Kafka Changelog
56
• All or Nothing for the following:
• Offset commit for source topics
• Value update on processor state
• Acked produce to sink topics
57
Exactly-Once with Kafka Streams (0.11+)
• Process data in transactions of:
• A batch of input records from source topics
• A batch of output records to changelog topics
• A batch of output records to sink topics
config: processing.guarantee = exactly_once (default = at_least_once)
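Enabling this in an application is a single Streams config entry. A minimal sketch of such a config, where the application id and bootstrap servers are placeholder values for your own deployment:

```java
import java.util.Properties;

public class EosConfig {
    public static Properties build() {
        Properties props = new Properties();
        // placeholder values: set these for your own application and cluster
        props.put("application.id", "my-eos-app");
        props.put("bootstrap.servers", "localhost:9092");
        // the 0.11 Streams config key/value for exactly-once processing
        props.put("processing.guarantee", "exactly_once");
        return props;
    }

    public static void main(String[] args) {
        // These Properties would be passed to the KafkaStreams constructor.
        System.out.println(build().getProperty("processing.guarantee")); // exactly_once
    }
}
```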
58
Exactly-Once with Failures
State
Process
StateProcess
State
Process
Kafka
Kafka Streams
Kafka Changelog
Kafka
59
Exactly-Once with Failures
State
Process
StateProcess
State
Process
Kafka
Kafka Streams
Kafka Changelog
Kafka
60
Exactly-Once with Failures
StateProcess
StateProcess
StateProcess
Kafka Streams
Kafka
Kafka Changelog
Kafka
61
Exactly-Once with Failures
StateProcess
StateProcess
StateProcess
Kafka Streams
Kafka
Kafka Changelog
Kafka
62
Exactly-Once life is goooood~
63
What if not all my data is in Kafka?
64
65
Connectors
• 40+ since first release this Feb (0.9+)
• 13 from & partners
66
End-to-End Exactly-Once
67
Take-aways
• Exactly-once: important property for stream processing
• Kafka Streams: exactly-once made easy
Join Kafka Summit 2017 SF (discount code available!)
Additional resources:
http://www.confluent.io/resources
Guozhang Wang | guozhang@confluent.io | @guozhangwang
68
Thank You!
Guozhang Wang
Kafka Meetup SF, July 27, 2017
More Related Content

What's hot

Introducing Kafka's Streams API
Introducing Kafka's Streams APIIntroducing Kafka's Streams API
Introducing Kafka's Streams API
confluent
 
Improving Logging Ingestion Quality At Pinterest: Fighting Data Corruption An...
Improving Logging Ingestion Quality At Pinterest: Fighting Data Corruption An...Improving Logging Ingestion Quality At Pinterest: Fighting Data Corruption An...
Improving Logging Ingestion Quality At Pinterest: Fighting Data Corruption An...
HostedbyConfluent
 

What's hot (20)

Consistency and Completeness: Rethinking Distributed Stream Processing in Apa...
Consistency and Completeness: Rethinking Distributed Stream Processing in Apa...Consistency and Completeness: Rethinking Distributed Stream Processing in Apa...
Consistency and Completeness: Rethinking Distributed Stream Processing in Apa...
 
What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019
What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019
What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019
 
Production Ready Kafka on Kubernetes (Devandra Tagare, Lyft) Kafka Summit SF ...
Production Ready Kafka on Kubernetes (Devandra Tagare, Lyft) Kafka Summit SF ...Production Ready Kafka on Kubernetes (Devandra Tagare, Lyft) Kafka Summit SF ...
Production Ready Kafka on Kubernetes (Devandra Tagare, Lyft) Kafka Summit SF ...
 
Steps to Building a Streaming ETL Pipeline with Apache Kafka® and KSQL
Steps to Building a Streaming ETL Pipeline with Apache Kafka® and KSQLSteps to Building a Streaming ETL Pipeline with Apache Kafka® and KSQL
Steps to Building a Streaming ETL Pipeline with Apache Kafka® and KSQL
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream ProcessingCapital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream Processing
 
Streaming ETL - from RDBMS to Dashboard with KSQL
Streaming ETL - from RDBMS to Dashboard with KSQLStreaming ETL - from RDBMS to Dashboard with KSQL
Streaming ETL - from RDBMS to Dashboard with KSQL
 
Actors or Not: Async Event Architectures
Actors or Not: Async Event ArchitecturesActors or Not: Async Event Architectures
Actors or Not: Async Event Architectures
 
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020
 
Real Time Streaming Data with Kafka and TensorFlow (Yong Tang, MobileIron) Ka...
Real Time Streaming Data with Kafka and TensorFlow (Yong Tang, MobileIron) Ka...Real Time Streaming Data with Kafka and TensorFlow (Yong Tang, MobileIron) Ka...
Real Time Streaming Data with Kafka and TensorFlow (Yong Tang, MobileIron) Ka...
 
Exactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka StreamsExactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka Streams
 
Introducing Kafka's Streams API
Introducing Kafka's Streams APIIntroducing Kafka's Streams API
Introducing Kafka's Streams API
 
Improving Logging Ingestion Quality At Pinterest: Fighting Data Corruption An...
Improving Logging Ingestion Quality At Pinterest: Fighting Data Corruption An...Improving Logging Ingestion Quality At Pinterest: Fighting Data Corruption An...
Improving Logging Ingestion Quality At Pinterest: Fighting Data Corruption An...
 
A Tour of Apache Kafka
A Tour of Apache KafkaA Tour of Apache Kafka
A Tour of Apache Kafka
 
Kafka Summit SF 2017 - Real-Time Document Rankings with Kafka Streams
Kafka Summit SF 2017 - Real-Time Document Rankings with Kafka StreamsKafka Summit SF 2017 - Real-Time Document Rankings with Kafka Streams
Kafka Summit SF 2017 - Real-Time Document Rankings with Kafka Streams
 
Apache Kafka: New Features That You Might Not Know About
Apache Kafka: New Features That You Might Not Know AboutApache Kafka: New Features That You Might Not Know About
Apache Kafka: New Features That You Might Not Know About
 
Kafka Summit NYC 2017 Hanging Out with Your Past Self in VR
Kafka Summit NYC 2017 Hanging Out with Your Past Self in VRKafka Summit NYC 2017 Hanging Out with Your Past Self in VR
Kafka Summit NYC 2017 Hanging Out with Your Past Self in VR
 
Real-world Streaming Architectures
Real-world Streaming ArchitecturesReal-world Streaming Architectures
Real-world Streaming Architectures
 
Intro to AsyncAPI
Intro to AsyncAPIIntro to AsyncAPI
Intro to AsyncAPI
 
Kafka Summit NYC 2017 - The Best Thing Since Partitioned Bread
Kafka Summit NYC 2017 - The Best Thing Since Partitioned Bread Kafka Summit NYC 2017 - The Best Thing Since Partitioned Bread
Kafka Summit NYC 2017 - The Best Thing Since Partitioned Bread
 
Reliability Guarantees for Apache Kafka
Reliability Guarantees for Apache KafkaReliability Guarantees for Apache Kafka
Reliability Guarantees for Apache Kafka
 

Similar to Exactly-once Data Processing with Kafka Streams - July 27, 2017

Performance Analysis and Optimizations for Kafka Streams Applications (Guozha...
Performance Analysis and Optimizations for Kafka Streams Applications (Guozha...Performance Analysis and Optimizations for Kafka Streams Applications (Guozha...
Performance Analysis and Optimizations for Kafka Streams Applications (Guozha...
confluent
 

Similar to Exactly-once Data Processing with Kafka Streams - July 27, 2017 (20)

Stream Processing made simple with Kafka
Stream Processing made simple with KafkaStream Processing made simple with Kafka
Stream Processing made simple with Kafka
 
Apache Kafka, and the Rise of Stream Processing
Apache Kafka, and the Rise of Stream ProcessingApache Kafka, and the Rise of Stream Processing
Apache Kafka, and the Rise of Stream Processing
 
I can't believe it's not a queue: Kafka and Spring
I can't believe it's not a queue: Kafka and SpringI can't believe it's not a queue: Kafka and Spring
I can't believe it's not a queue: Kafka and Spring
 
Streams Don't Fail Me Now - Robustness Features in Kafka Streams
Streams Don't Fail Me Now - Robustness Features in Kafka StreamsStreams Don't Fail Me Now - Robustness Features in Kafka Streams
Streams Don't Fail Me Now - Robustness Features in Kafka Streams
 
Westpac Bank Tech Talk 1: Dive into Apache Kafka
Westpac Bank Tech Talk 1: Dive into Apache KafkaWestpac Bank Tech Talk 1: Dive into Apache Kafka
Westpac Bank Tech Talk 1: Dive into Apache Kafka
 
Performance Analysis and Optimizations for Kafka Streams Applications (Guozha...
Performance Analysis and Optimizations for Kafka Streams Applications (Guozha...Performance Analysis and Optimizations for Kafka Streams Applications (Guozha...
Performance Analysis and Optimizations for Kafka Streams Applications (Guozha...
 
Performance Analysis and Optimizations for Kafka Streams Applications
Performance Analysis and Optimizations for Kafka Streams ApplicationsPerformance Analysis and Optimizations for Kafka Streams Applications
Performance Analysis and Optimizations for Kafka Streams Applications
 
Containerizing Distributed Pipes
Containerizing Distributed PipesContainerizing Distributed Pipes
Containerizing Distributed Pipes
 
Improving Streams Scalability with Transactional StateStores (KIP-892)
Improving Streams Scalability with Transactional StateStores (KIP-892)Improving Streams Scalability with Transactional StateStores (KIP-892)
Improving Streams Scalability with Transactional StateStores (KIP-892)
 
How to Build an Apache Kafka® Connector
How to Build an Apache Kafka® ConnectorHow to Build an Apache Kafka® Connector
How to Build an Apache Kafka® Connector
 
Chicago Kafka Meetup
Chicago Kafka MeetupChicago Kafka Meetup
Chicago Kafka Meetup
 
Building an Interactive Query Service in Kafka Streams With Bill Bejeck | Cur...
Building an Interactive Query Service in Kafka Streams With Bill Bejeck | Cur...Building an Interactive Query Service in Kafka Streams With Bill Bejeck | Cur...
Building an Interactive Query Service in Kafka Streams With Bill Bejeck | Cur...
 
Deploying Kafka Streams Applications with Docker and Kubernetes
Deploying Kafka Streams Applications with Docker and KubernetesDeploying Kafka Streams Applications with Docker and Kubernetes
Deploying Kafka Streams Applications with Docker and Kubernetes
 
What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
From a Kafkaesque Story to The Promised Land at LivePerson
From a Kafkaesque Story to The Promised Land at LivePersonFrom a Kafkaesque Story to The Promised Land at LivePerson
From a Kafkaesque Story to The Promised Land at LivePerson
 
Connect, Test, Optimize: The Ultimate Kafka Connector Benchmarking Toolkit
Connect, Test, Optimize: The Ultimate Kafka Connector Benchmarking ToolkitConnect, Test, Optimize: The Ultimate Kafka Connector Benchmarking Toolkit
Connect, Test, Optimize: The Ultimate Kafka Connector Benchmarking Toolkit
 
Apache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & Partitioning
Apache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & PartitioningApache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & Partitioning
Apache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & Partitioning
 
Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!
 
Concurrency at the Database Layer
Concurrency at the Database Layer Concurrency at the Database Layer
Concurrency at the Database Layer
 

More from confluent

More from confluent (20)

Evolving Data Governance for the Real-time Streaming and AI Era
Evolving Data Governance for the Real-time Streaming and AI EraEvolving Data Governance for the Real-time Streaming and AI Era
Evolving Data Governance for the Real-time Streaming and AI Era
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Santander Stream Processing with Apache Flink
Santander Stream Processing with Apache FlinkSantander Stream Processing with Apache Flink
Santander Stream Processing with Apache Flink
 
Unlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsUnlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insights
 
Workshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con FlinkWorkshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con Flink
 
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
 
AWS Immersion Day Mapfre - Confluent
AWS Immersion Day Mapfre   -   ConfluentAWS Immersion Day Mapfre   -   Confluent
AWS Immersion Day Mapfre - Confluent
 
Eventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalkEventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalk
 
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent CloudQ&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
 
Citi TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Dive
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluent
 
Q&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service MeshQ&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service Mesh
 
Citi Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka MicroservicesCiti Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka Microservices
 
Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3
 
Citi Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging ModernizationCiti Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging Modernization
 
Citi Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataCiti Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time data
 
Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023
 
Confluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with SynthesisConfluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with Synthesis
 
The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023
 

Recently uploaded

Jax, FL Admin Community Group 05.14.2024 Combined Deck
Jax, FL Admin Community Group 05.14.2024 Combined DeckJax, FL Admin Community Group 05.14.2024 Combined Deck
Jax, FL Admin Community Group 05.14.2024 Combined Deck
Marc Lester
 

Recently uploaded (20)

Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
 
Abortion Pill Prices Jane Furse ](+27832195400*)[ 🏥 Women's Abortion Clinic i...
Abortion Pill Prices Jane Furse ](+27832195400*)[ 🏥 Women's Abortion Clinic i...Abortion Pill Prices Jane Furse ](+27832195400*)[ 🏥 Women's Abortion Clinic i...
Abortion Pill Prices Jane Furse ](+27832195400*)[ 🏥 Women's Abortion Clinic i...
 
Software Engineering - Introduction + Process Models + Requirements Engineering
Software Engineering - Introduction + Process Models + Requirements EngineeringSoftware Engineering - Introduction + Process Models + Requirements Engineering
Software Engineering - Introduction + Process Models + Requirements Engineering
 
Microsoft365_Dev_Security_2024_05_16.pdf
Microsoft365_Dev_Security_2024_05_16.pdfMicrosoft365_Dev_Security_2024_05_16.pdf
Microsoft365_Dev_Security_2024_05_16.pdf
 
OpenChain Webinar: AboutCode and Beyond - End-to-End SCA
OpenChain Webinar: AboutCode and Beyond - End-to-End SCAOpenChain Webinar: AboutCode and Beyond - End-to-End SCA
OpenChain Webinar: AboutCode and Beyond - End-to-End SCA
 
[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse
[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse
[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse
 
Automate your OpenSIPS config tests - OpenSIPS Summit 2024
Automate your OpenSIPS config tests - OpenSIPS Summit 2024Automate your OpenSIPS config tests - OpenSIPS Summit 2024
Automate your OpenSIPS config tests - OpenSIPS Summit 2024
 
UNI DI NAPOLI FEDERICO II - Il ruolo dei grafi nell'AI Conversazionale Ibrida
UNI DI NAPOLI FEDERICO II - Il ruolo dei grafi nell'AI Conversazionale IbridaUNI DI NAPOLI FEDERICO II - Il ruolo dei grafi nell'AI Conversazionale Ibrida
UNI DI NAPOLI FEDERICO II - Il ruolo dei grafi nell'AI Conversazionale Ibrida
 
architecting-ai-in-the-enterprise-apis-and-applications.pdf
architecting-ai-in-the-enterprise-apis-and-applications.pdfarchitecting-ai-in-the-enterprise-apis-and-applications.pdf
architecting-ai-in-the-enterprise-apis-and-applications.pdf
 
Jax, FL Admin Community Group 05.14.2024 Combined Deck
Jax, FL Admin Community Group 05.14.2024 Combined DeckJax, FL Admin Community Group 05.14.2024 Combined Deck
Jax, FL Admin Community Group 05.14.2024 Combined Deck
 
Effective Strategies for Wix's Scaling challenges - GeeCon
Effective Strategies for Wix's Scaling challenges - GeeConEffective Strategies for Wix's Scaling challenges - GeeCon
Effective Strategies for Wix's Scaling challenges - GeeCon
 
A Deep Dive into Secure Product Development Frameworks.pdf
A Deep Dive into Secure Product Development Frameworks.pdfA Deep Dive into Secure Product Development Frameworks.pdf
A Deep Dive into Secure Product Development Frameworks.pdf
 
Community is Just as Important as Code by Andrea Goulet
Community is Just as Important as Code by Andrea GouletCommunity is Just as Important as Code by Andrea Goulet
Community is Just as Important as Code by Andrea Goulet
 
Workshop - Architecting Innovative Graph Applications- GraphSummit Milan
Workshop -  Architecting Innovative Graph Applications- GraphSummit MilanWorkshop -  Architecting Innovative Graph Applications- GraphSummit Milan
Workshop - Architecting Innovative Graph Applications- GraphSummit Milan
 
Lessons Learned from Building a Serverless Notifications System.pdf
Lessons Learned from Building a Serverless Notifications System.pdfLessons Learned from Building a Serverless Notifications System.pdf
Lessons Learned from Building a Serverless Notifications System.pdf
 
Modern binary build systems - PyCon 2024
Modern binary build systems - PyCon 2024Modern binary build systems - PyCon 2024
Modern binary build systems - PyCon 2024
 
Test Automation Design Patterns_ A Comprehensive Guide.pdf
Test Automation Design Patterns_ A Comprehensive Guide.pdfTest Automation Design Patterns_ A Comprehensive Guide.pdf
Test Automation Design Patterns_ A Comprehensive Guide.pdf
 
Wired_2.0_CREATE YOUR ULTIMATE LEARNING ENVIRONMENT_JCON_16052024
Wired_2.0_CREATE YOUR ULTIMATE LEARNING ENVIRONMENT_JCON_16052024Wired_2.0_CREATE YOUR ULTIMATE LEARNING ENVIRONMENT_JCON_16052024
Wired_2.0_CREATE YOUR ULTIMATE LEARNING ENVIRONMENT_JCON_16052024
 
Navigation in flutter – how to add stack, tab, and drawer navigators to your ...
Navigation in flutter – how to add stack, tab, and drawer navigators to your ...Navigation in flutter – how to add stack, tab, and drawer navigators to your ...
Navigation in flutter – how to add stack, tab, and drawer navigators to your ...
 
Prompt Engineering - an Art, a Science, or your next Job Title?
Prompt Engineering - an Art, a Science, or your next Job Title?Prompt Engineering - an Art, a Science, or your next Job Title?
Prompt Engineering - an Art, a Science, or your next Job Title?
 

Exactly-once Data Processing with Kafka Streams - July 27, 2017

  • 1. 1 Exactly-once Data Processing with Kafka Streams Guozhang Wang Kafka Meetup SF, July 27, 2017
  • 2. 2 Outline • What is exactly-once for stream processing? • How to achieve exactly-once with Kafka? • Kafka Streams: exactly-once made easy
  • 3. 3
  • 5. 5 Stream Processing with Kafka Process State KafkaTopic A Kafka Topic B Kafka Topic C Kafka Topic D
  • 6. 6 Stream Processing with Kafka Process State KafkaTopic A Kafka Topic B Kafka Topic C Kafka Topic D ack ack commit
  • 7. 7 Exactly-Once • An application property for stream processing, • .. that for each received record, • .. it will be processed exactly once, • .. even under failures
  • 8. 8 Stream Processing with Kafka Process State Kafka Topic A Kafka Topic B Kafka Topic C Kafka Topic D ack ack commit
  • 9. 9 Error Scenario #1: Duplicate Write Process State Kafka Topic A Kafka Topic B Kafka Topic C Kafka Topic D ack
  • 11. 11 Error Scenario #2: Re-process Process State Kafka Topic A Kafka Topic B Kafka Topic C Kafka Topic D commit ack ack
  • 12. 12 Error Scenario #2: Re-process Process State Kafka Topic A Kafka Topic B Kafka Topic C Kafka Topic D
  • 14. 14 Error Scenario #3: Data loss Process State Kafka Topic A Kafka Topic B Kafka Topic C Kafka Topic D ack ack commit
  • 15. 15 Error Scenario #3: Data loss Process State Kafka Topic A Kafka Topic B Kafka Topic C Kafka Topic D ack ack commit
  • 16. 16 Error Scenario #3: Data loss Process State Kafka Topic A Kafka Topic B Kafka Topic C Kafka Topic D ack
  • 18. 18 Exactly-Once does NOT mean.. • Two Generals problem can now be solved • .. or FLP result is proved wrong • .. or TCP at transport level is “perfect” • .. or you can get distributed consensus in any settings
  • 19. 19 What can cause incorrect results? • Unbounded network partition (algorithmic proof) • A long GC or hard crash • A bad config in your system • A human operator error • A bug in your code
  • 20. 20 What can cause incorrect results? • Unbounded network partition (algorithmic proof) • A long GC or hard crash • A bad config in your system • A human operator error • A bug in your code 99.9% 0.01%
  • 21. 21 What can cause incorrect results? • Unbounded network partition (algorithmic proof) • A long GC or hard crash • A bad config in your system • A human operator error • A bug in your code 99.9% 0.01% Can we do better for the 99.99%?
  • 22. 22 So how to achieve Exactly-Once?
  • 23. 23 Option #1: “Just give up” Streaming Source Sink Batch State State
  • 24. 24 Option #2: At-least-once + Dedup Process State Kafka Topic A Kafka Topic B Kafka Topic C Kafka Topic D ack ack commit
  • 25. 25 Option #2: At-least-once + Dedup Process State Kafka Topic A Kafka Topic B Kafka Topic C Kafka Topic D
  • 26. 26 Option #2: At-least-once + Dedup Process State Kafka Topic A Kafka Topic B Kafka Topic C Kafka Topic D ack ack commit
  • 27. 27 Option #2: At-least-once + Dedup (records 2, 3, 4 each delivered twice; duplicates removed by a Dedup step)
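Option #2 above can be sketched as consumer-side logic: each record carries a producer id and a per-partition sequence number, and a record is processed only the first time its (pid, seq) pair is seen. This is a minimal, hypothetical sketch; the `Dedup` class and its method names are illustration only, not a Kafka API.

```java
import java.util.HashSet;
import java.util.Set;

// Minimal at-least-once + dedup sketch: drop any record whose
// (producerId, sequence) pair has already been processed.
public class Dedup {
    private final Set<String> seen = new HashSet<>();

    // Returns true if the record is new and should be processed,
    // false if it is a duplicate delivery to be dropped.
    public boolean shouldProcess(long pid, long seq) {
        // Set.add returns false when the key is already present.
        return seen.add(pid + ":" + seq);
    }

    public static void main(String[] args) {
        Dedup dedup = new Dedup();
        System.out.println(dedup.shouldProcess(1, 28)); // first delivery: process
        System.out.println(dedup.shouldProcess(1, 28)); // retried duplicate: drop
        System.out.println(dedup.shouldProcess(1, 29)); // next record: process
    }
}
```

Note the catch the slides lead up to: this only works if the `seen` state itself survives failures together with the processing results, which is exactly what option #3 addresses.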
  • 28. 28 Option #3: The Kafka Way! (0.11+) • Idempotent producer: exactly-once sends per partition • Transactional messaging: multiple sends committed atomically
  • 29. 29 Idempotent Producer Producer Kafka Topic C ack pid = 1, seq = 28
  • 30. 30 Idempotent Producer Producer Kafka Topic C ack pid = 1, seq = 28 config: enable.idempotence = true
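The one-line config on the slide can be set programmatically. A minimal sketch, using literal config-key strings so the snippet is self-contained (the real client also exposes these as constants on `org.apache.kafka.clients.producer.ProducerConfig`); the bootstrap address is an assumption:

```java
import java.util.Properties;

// Builds configuration for the idempotent producer (Kafka 0.11+).
public class IdempotentProducerConfig {
    public static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        // Broker de-duplicates retried sends per partition via (pid, seq).
        props.put("enable.idempotence", "true");
        // Idempotence requires acks=all, so all in-sync replicas see each write.
        props.put("acks", "all");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build("localhost:9092"));
    }
}
```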
  • 31. 31 Atomic Multi-Sends (aka. “transactions”) Producer Kafka Topic A Kafka Topic C Kafka Topic D Atomic Commit: try { producer.beginTxn(); producer.send(rec1); // topic C producer.send(rec2); // topic D producer.sendOffsetsToTxn(A, 10); producer.commitTxn(); } catch (KafkaException e) { }
  • 32. 32 Atomic Multi-Sends (aka. “transactions”) Producer Kafka Topic A Kafka Topic C Kafka Topic D Atomic Commit: try { producer.beginTxn(); producer.send(rec1); // topic C producer.send(rec2); // topic D producer.sendOffsetsToTxn(A, 10); producer.commitTxn(); } catch (KafkaException e) { producer.abortTxn(); }
  • 33. 33 Atomic Multi-Sends (aka. “transactions”) Consumer Kafka Topic C Kafka Topic D Read Committed consumer.subscribe(C, D); recs = consumer.poll(); for (Record rec <- recs) { // process .. } config: isolation.level = read_committed (default = read_uncommitted)
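On the consuming side, the slide's `isolation.level` config can likewise be built up front. A minimal sketch with literal config-key strings (the real client exposes them as constants on `ConsumerConfig`); the bootstrap address and group id are assumptions:

```java
import java.util.Properties;

// Builds configuration for a consumer that only reads committed
// transactional data (Kafka 0.11+).
public class ReadCommittedConsumerConfig {
    public static Properties build(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("group.id", groupId);
        // Only return records from committed transactions;
        // the default is read_uncommitted.
        props.put("isolation.level", "read_committed");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build("localhost:9092", "my-group"));
    }
}
```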
  • 34. 34 Exactly-Once Processing with Kafka Process State Kafka Topic C Kafka Topic D ack ack Kafka Topic A Kafka Topic B commit
  • 35. 35 Exactly-Once Processing with Kafka • Offset commit for source topics • Value update on processor state • Acked produce to sink topics All or Nothing
  • 36. 36 Kafka Streams (0.10+) • New client library besides producer and consumer • Powerful yet easy-to-use • Event-at-a-time, Stateful • Windowing with out-of-order handling • Highly scalable, distributed, fault tolerant • and more..
  • 39. 39 Kafka Streams DSL public static void main(String[] args) { // specify the processing topology by first reading in a stream from a topic KStream<String, String> words = builder.stream("topic1"); // count the words in this stream as an aggregated table KTable<String, Long> counts = words.countByKey("Counts"); // write the result table to a new topic counts.to("topic2"); // create a stream processing instance and start running it KafkaStreams streams = new KafkaStreams(builder, config); streams.start(); }
  • 44. 44 Processor Topology KStream<..> stream1 = builder.stream("topic3"); KStream<..> stream2 = builder.stream("topic3"); KStream<..> joined = stream1.leftJoin(stream2, ...); KTable<..> aggregated = joined.aggregateByKey(...); aggregated.to("topic3");
  • 49. 49 Processor Topology KStream<..> stream1 = builder.stream("topic3"); KStream<..> stream2 = builder.stream("topic3"); KStream<..> joined = stream1.leftJoin(stream2, ...); KTable<..> aggregated = joined.aggregateByKey(...); aggregated.to("topic3"); State
  • 50. 50 Processing in Kafka Streams Kafka Topic B Kafka Topic A
  • 51. 51 Processing in Kafka Streams Kafka Topic B Kafka Topic A Processor Topology P1 P2 P1 P2
  • 52. 52 Processing in Kafka Streams Kafka Topic A Kafka Topic B
  • 53. 53 Processing in Kafka Streams Kafka Topic B Kafka Topic A MyApp.1 MyApp.2 Task1 Task2
  • 54. 54 States in Stream Processing MyApp.1 MyApp.2 Kafka Topic B Task1 Task2 Kafka Topic A State State
  • 55. 55 Fault Tolerance in Streams Process State Process State Process State Kafka Kafka Streams Kafka Kafka Changelog
  • 56. 56 • All or Nothing for the following: • Offset commit for source topics • Value update on processor state • Acked produce to sink topics
  • 57. 57 Exactly-Once with Kafka Streams (0.11+) • Process data in transactions of: • A batch of input records from source topics • A batch of output records to changelog topics • A batch of output records to sink topics config: processing.guarantee = exactly_once (default = at_least_once)
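Turning on exactly-once in Kafka Streams is a single config switch. A minimal sketch, using the literal config-key strings of the released 0.11 API (`StreamsConfig` exposes them as constants such as `PROCESSING_GUARANTEE_CONFIG`); the application id and bootstrap address are assumptions:

```java
import java.util.Properties;

// Builds configuration for an exactly-once Kafka Streams application
// (Kafka 0.11+).
public class ExactlyOnceStreamsConfig {
    public static Properties build(String appId, String bootstrapServers) {
        Properties props = new Properties();
        props.put("application.id", appId);
        props.put("bootstrap.servers", bootstrapServers);
        // Wrap source offsets, changelog writes, and sink writes in one
        // transaction per commit; the default is at_least_once.
        props.put("processing.guarantee", "exactly_once");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build("wordcount-app", "localhost:9092"));
    }
}
```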
  • 63. 63 What if not all my data is in Kafka?
  • 64. 64
  • 65. 65 Connectors • 40+ since first release this Feb (0.9+) • 13 from Confluent & partners
  • 67. 67 Take-aways • Exactly-once: important property for stream processing • Kafka Streams: exactly-once made easy Join Kafka Summit 2017 SF (discount code available!) Additional resources: http://www.confluent.io/resources Guozhang Wang | guozhang@confluent.io | @guozhangwang
  • 68. 68 Thank You! Guozhang Wang Kafka Meetup SF, July 27, 2017