1
When it absolutely, positively,
has to be there
Reliability Guarantees in Apache Kafka
@jeffholoman Cloudera
@gwenshap Confluent
2
Apache Kafka
High Throughput
Low Latency
Scalable
Centralized
Real-time
3
Streaming Platform
Producer Consumer
Streaming Applications
Connectors Connectors
Apache Kafka
4
Versions of Apache Kafka
• 0.7.0 <- Please don’t
• 0.8.0 <- Replication exists, it will continue evolving with
every release
• 0.8.2 <- New producer, offset commits to Kafka
• 0.9.0 <- New consumer, Connect APIs
• 0.10.0 <- New consumer improvements, Streams APIs
• 0.11.0 <- Idempotent producer, transactional semantics,
Exactly once.
5
Kafka Components
• Broker
• Java clients:
• Producer
• Consumers
• Kafka Streams
• Kafka Connect
• Non-Java:
• Librdkafka
• Librdkafka based – Python, Go, NodeJS, C#...
• Others
6
If Kafka is a critical piece of our pipeline
• Can we be 100% sure that our data will get there?
• Can we lose messages?
• How do we verify?
• Whose fault is it?
7
Distributed Systems
• Things Fail
• Systems are designed to tolerate failure
• We must expect failures and design our code and configure our systems to handle them
8
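Data Flow - Producer
[Diagram: on the client machine, data passes from the application thread through the Kafka client, the O/S socket buffer, and the NIC. It crosses the network to the broker machine's NIC, O/S socket buffer, broker, page cache, and disk, and is then replicated. An ack or exception flows back to the client's callback. Every hop is marked ✗ as a possible point of failure.]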
10
Kafka is super reliable.
Stores data, on disk. Replicated.
… if you know how to configure it that way.
11
Replication is your friend
• Kafka protects against failures by replicating data
• The unit of replication is the partition
• One replica is designated as the Leader
• Follower replicas fetch data from the leader
• The leader holds the list of “in-sync” replicas
12
Replication and ISRs
Topic: my_topic    Partitions: 3    Replicas: 3
[Diagram: a producer and brokers 100, 101, and 102; each broker holds a replica of partitions 0, 1, and 2.]

Partition   Leader   ISR
0           100      101,102
1           101      100,102
2           102      101,100
13
ISR
Two things make a replica in-sync:
• Lag behind the leader
  • replica.lag.time.max.ms – catches a replica that didn’t fetch or has fallen behind
  • replica.lag.max.messages – removed in 0.9
• Connection to ZooKeeper
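The two conditions above can be sketched as a single predicate. This is a simplified illustration, not the broker's actual code: the class name, method name, and parameters are hypothetical, and the 10-second limit is an arbitrary example value.

```java
final class IsrSketch {
    /**
     * Hypothetical helper: a follower counts as in-sync only if its
     * ZooKeeper session is alive AND it caught up with the leader
     * within the last replicaLagTimeMaxMs milliseconds.
     */
    public static boolean isInSync(long nowMs, long lastCaughtUpMs,
                                   long replicaLagTimeMaxMs, boolean zkSessionAlive) {
        boolean withinLag = (nowMs - lastCaughtUpMs) <= replicaLagTimeMaxMs;
        return zkSessionAlive && withinLag;
    }

    public static void main(String[] args) {
        long maxLagMs = 10_000L; // illustrative value for replica.lag.time.max.ms
        System.out.println(isInSync(25_000L, 20_000L, maxLagMs, true));  // true: 5 s behind
        System.out.println(isInSync(35_000L, 20_000L, maxLagMs, true));  // false: 15 s behind
        System.out.println(isInSync(25_000L, 20_000L, maxLagMs, false)); // false: session lost
    }
}
```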
14
Terminology
Acked
• Producers will not retry sending.
• Depends on producer setting.
Committed
• Only once the message has reached every in-sync replica (so any future leader has it).
• Consumers can read it.
• replica.lag.time.max.ms controls how long a dead replica can prevent consumers from reading.
Committed Offsets
• The consumer told Kafka the latest offsets it read. By default, the consumer will not see these events again.
15
Replication
Acks = all
• Waits for all in-sync replicas to reply.
[Diagram: messages on each replica over time]
Replica 3: 100
Replica 2: 100
Replica 1: 100
16
Replication
Replica 3 stopped replicating for some reason
[Diagram: messages on each replica over time]
Replica 3: 100
Replica 2: 100, 101
Replica 1: 100, 101
• 100: acked with acks = all, “committed”
• 101: acked with acks = 1, but not “committed”
17
Replication
[Diagram: messages on each replica over time]
Replica 3: 100
Replica 2: 100, 101
Replica 1: 100, 101
One replica drops out of ISR, or goes offline
All messages are now acked and committed
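A minimal sketch of why shrinking the ISR changes what counts as committed, assuming "committed" means "present on every in-sync replica". The class and method names are hypothetical, and the numbers are the message ids from the diagrams.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class CommittedSketch {
    /**
     * Returns the highest message id that every in-sync replica has,
     * or -1 if the ISR is empty. lastReplicated maps a replica name
     * to the last message id it holds.
     */
    public static long committedUpTo(Map<String, Long> lastReplicated, Set<String> isr) {
        if (isr.isEmpty()) {
            return -1L;
        }
        long committed = Long.MAX_VALUE;
        for (String replica : isr) {
            committed = Math.min(committed, lastReplicated.getOrDefault(replica, -1L));
        }
        return committed;
    }

    public static void main(String[] args) {
        Map<String, Long> lastReplicated = new HashMap<>();
        lastReplicated.put("replica1", 101L);
        lastReplicated.put("replica2", 101L);
        lastReplicated.put("replica3", 100L); // stopped replicating

        Set<String> allThree = new HashSet<>(Arrays.asList("replica1", "replica2", "replica3"));
        Set<String> afterShrink = new HashSet<>(Arrays.asList("replica1", "replica2"));

        System.out.println(committedUpTo(lastReplicated, allThree));    // 100
        System.out.println(committedUpTo(lastReplicated, afterShrink)); // 101
    }
}
```

While Replica 3 is still in the ISR, message 101 is held back; once it drops out, 101 becomes committed without any new data being copied.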
18
Replication
2nd replica drops out, or is offline
[Diagram: messages on each replica over time]
Replica 3: 100
Replica 2: 100, 101
Replica 1: 100, 101, 102, 103, 104
19
Replication
[Diagram: messages on each replica over time]
Replica 3: 100
Replica 2: 100, 101
Replica 1: 100, 101, 102, 103, 104
Now we’re in trouble ✗
20
Replication
If Replica 2 or 3 comes back online before the leader, you will lose data.
[Diagram: messages on each replica over time]
Replica 3: 100
Replica 2: 100, 101
Replica 1: 100, 101, 102, 103, 104
All of those are “acked” and “committed”.
21
So what to do?
Disable Unclean Leader Election
• unclean.leader.election.enable = false
• Default from 0.11.0
Set replication factor
• default.replication.factor = 3
Set minimum ISRs
• min.insync.replicas = 2
22
Warning!
min.insync.replicas is applied at the topic level.
If the topic was created before the server-level change, you must alter the topic configuration manually.
On versions before 0.9.0, you must alter the topic manually (KAFKA-2114).
23
Replication
Replication = 3
Min ISR = 2
[Diagram: messages on each replica over time]
Replica 3: 100
Replica 2: 100
Replica 1: 100
24
Replication
[Diagram: messages on each replica over time]
Replica 3: 100
Replica 2: 100, 101
Replica 1: 100, 101
One replica drops out of ISR, or goes offline
25
Replication
2nd replica fails out, or is out of sync
[Diagram: messages on each replica over time]
Replica 3: 100
Replica 2: 100, 101
Replica 1: 100, 101
Messages 102, 103, 104: buffered in the producer
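The behaviour in this sequence can be sketched as a simple rule, assuming the producer uses acks = all. The class, enum, and method names are hypothetical. A real broker signals the rejection with NotEnoughReplicasException, and the producer keeps the messages buffered while it retries.

```java
final class MinIsrSketch {
    enum Outcome { ACKED, REJECTED_NOT_ENOUGH_REPLICAS }

    /**
     * Hypothetical model of the acks = all write path: the write is only
     * accepted when the current ISR is at least min.insync.replicas.
     */
    public static Outcome produceWithAcksAll(int isrSize, int minInsyncReplicas) {
        return isrSize >= minInsyncReplicas
                ? Outcome.ACKED
                : Outcome.REJECTED_NOT_ENOUGH_REPLICAS;
    }

    public static void main(String[] args) {
        int minIsr = 2; // min.insync.replicas = 2, replication factor = 3
        System.out.println(produceWithAcksAll(3, minIsr)); // ACKED
        System.out.println(produceWithAcksAll(2, minIsr)); // ACKED: one replica down
        System.out.println(produceWithAcksAll(1, minIsr)); // REJECTED_NOT_ENOUGH_REPLICAS
    }
}
```

With only the leader left in the ISR, nothing new is acked, so nothing that was acked can be lost if the leader then fails.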
26
27
Producer Internals
Producer sends batches of messages to a buffer
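[Diagram: application threads call send(); messages (M0 to M3) are appended to batches (Batch 1 to 3) in the buffer and drained to the broker. When a response arrives, a failed batch is retried if possible; otherwise the Future is updated and the callback fires with metadata or an exception.]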
28
Basics
Durability can be configured with the producer configuration acks (request.required.acks in the old producer)
• 0: The message is written to the network (buffer)
• 1: The message is written to the leader
• all: The producer gets an ack after all ISRs receive the data; the message is committed
Make sure the producer doesn’t just throw messages away!
• For clients < 0.9.0, block.on.buffer.full = true
• For newer clients, max.block.ms = Long.MAX_VALUE
• Or handle the BufferExhaustedException / TimeoutException yourself
29
Producer
All calls are non-blocking and asynchronous
Two options for checking for failures:
• Immediately block for the response: send().get()
• Do follow-up work in a Callback; close the producer after an error threshold
• Be careful about buffering these failures. Future work? KAFKA-1955
• Don’t forget to close the producer! producer.close() will block until in-flight requests complete
retries (producer config) defaults to 0
In-flight requests could lead to message reordering (max.in.flight.requests.per.connection)
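The producer settings discussed so far can be collected in one place. This is a sketch that uses only java.util.Properties, so it runs without the Kafka client on the classpath. The class name, the broker address, and the choice of String serializers are assumptions; the resulting Properties object is what would be passed to new KafkaProducer<>(props).

```java
import java.util.Properties;

final class ReliableProducerConfig {
    /** Builds producer settings that favour durability over throughput. */
    public static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Serializer choice is an assumption for this example.
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Ack only after all in-sync replicas have the message.
        props.setProperty("acks", "all");
        // Keep retrying instead of dropping messages on transient errors.
        props.setProperty("retries", Integer.toString(Integer.MAX_VALUE));
        // One in-flight request per connection avoids reordering on retry.
        props.setProperty("max.in.flight.requests.per.connection", "1");
        // Block in send() rather than fail when the buffer is full.
        props.setProperty("max.block.ms", Long.toString(Long.MAX_VALUE));
        return props;
    }

    public static void main(String[] args) {
        build("localhost:9092").forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Limiting in-flight requests to 1 trades throughput for ordering, which is why the setting is often treated as optional.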
30
31
Consumer
Three choices for Consumer API
•Simple Consumer
•High Level Consumer (ZookeeperConsumer)
•New KafkaConsumer
32
New Consumer – auto commit
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "10000"); // commit automatically every 10 seconds
KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        processAndUpdateDB(record); // what if we crash after 8 seconds?
    }
}
33
New Consumer – manual commit
props.put("enable.auto.commit", "false");
KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records)
        processAndUpdateDB(record);
    consumer.commitSync(); // commit the entire batch outside the for loop!
}
34
Minimize Duplicates for At Least Once Consuming
1. Commit your own offsets: set enable.auto.commit = false
2. Use a rebalance listener to limit duplicates
3. Make sure you commit only what you are done processing
4. Note: the new consumer is single-threaded; use one consumer per thread.
35
Exactly Once Semantics
At most once is easy
At least once is not bad either – commit only after you are 100% sure the data is safe
Exactly once is tricky
• Commit data and offsets in one transaction
• Idempotent producer
Kafka Connect – many connectors (especially Confluent’s) are exactly once: they write events and store offsets in an external database in one transaction
Kafka Streams – starting at 0.11.0, exactly once is easy to configure (processing.guarantee=exactly_once).
Other stream processing systems – have their own thing.
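"Commit data and offsets in one transaction" can be sketched with an in-memory stand-in for the external database. Everything here is hypothetical: the class name, the single-partition assumption, and the synchronized method that plays the role of a database transaction.

```java
import java.util.HashMap;
import java.util.Map;

final class TransactionalSinkSketch {
    private final Map<String, String> rows = new HashMap<>(); // the "data"
    private long lastAppliedOffset = -1L;                     // stored next to the data

    /**
     * Writes the record and advances the stored offset in one atomic step.
     * Returns false if the record was already applied (a redelivery).
     */
    public synchronized boolean apply(long offset, String key, String value) {
        if (offset <= lastAppliedOffset) {
            return false; // seen before the crash or rebalance: skip it
        }
        rows.put(key, value);
        lastAppliedOffset = offset;
        return true;
    }

    /** Offset the consumer should resume from after a restart. */
    public synchronized long resumeFrom() {
        return lastAppliedOffset + 1;
    }

    public synchronized int rowCount() {
        return rows.size();
    }

    public static void main(String[] args) {
        TransactionalSinkSketch sink = new TransactionalSinkSketch();
        System.out.println(sink.apply(0L, "a", "1")); // true
        System.out.println(sink.apply(1L, "b", "2")); // true
        System.out.println(sink.apply(1L, "b", "2")); // false: duplicate delivery
        System.out.println(sink.resumeFrom());        // 2
    }
}
```

Because the offset lives in the same store as the data, a restarted consumer can seek to resumeFrom() instead of trusting the offsets committed to Kafka.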
36
How do we test Kafka?
Replication Tests:
These tests verify that replication provides simple durability guarantees by checking that data acked by brokers is still available for consumption in the face of various failure scenarios.
Setup:
• 1 ZooKeeper node
• 3 Kafka nodes
• 1 topic with partitions=3, replication-factor=3, and min.insync.replicas=2
Procedure:
• Produce messages in the background
• Consume messages in the background
• Initiate broker failures (shutdown, or bounce repeatedly with kill -15 or kill -9)
• When done driving failures, stop producing and finish consuming
• Validate that every acked message was consumed
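The final validation step can be sketched as a set difference. This assumes the test harness records an id for every acked message and every consumed message; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.SortedSet;
import java.util.TreeSet;

final class AckedMessageValidator {
    /**
     * Returns the ids that were acked by the brokers but never consumed.
     * An empty result means no acked message was lost. Duplicates and
     * extra consumed ids are ignored (at-least-once delivery allows them).
     */
    public static SortedSet<Long> missing(Collection<Long> acked, Collection<Long> consumed) {
        SortedSet<Long> missing = new TreeSet<>(acked);
        missing.removeAll(new HashSet<>(consumed));
        return missing;
    }

    public static void main(String[] args) {
        Collection<Long> acked = Arrays.asList(1L, 2L, 3L, 4L);
        Collection<Long> consumed = Arrays.asList(1L, 2L, 4L, 4L, 9L);
        System.out.println(missing(acked, consumed)); // [3]
    }
}
```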
37
Monitoring for Data Loss
• Monitor for producer errors – watch the retry numbers
• Monitor consumer lag – MaxLag or via offsets
• Standard schema:
• Each message should contain timestamp and originating service and host
• Each producer can report message counts and offsets to a special topic
• “Monitoring consumer” reports message counts to another special topic
• “Important consumers” also report message counts
• Reconcile the results
38
Be Safe, Not Sorry
acks = all
max.block.ms = Long.MAX_VALUE
retries = MAX_INT
( max.in.flight.requests.per.connection = 1 )
producer.close()
Replication factor >= 3
min.insync.replicas = 2
unclean.leader.election.enable = false
enable.auto.commit = false
Commit after processing
Monitor!
39
Thank You!

Apache Kafka Reliability

  • 1. 1 When it absolutely, positively, has to be there Reliability Guarantees in Apache Kafka @jeffholoman Cloudera @gwenshap Confluent
  • 2. 2 Apache Kafka High Throughput Low Latency Scalable Centralized Real-time
  • 3. 3 Streaming Platform Producer Consumer Streaming Applications Connectors Connectors Apache Kafka
  • 4. 4 Versions of Apache Kafka • 0.7.0 <- Please don’t • 0.8.0 <- Replication exists, it will continue evolving with every release • 0.8.2 <- New producer, offset commits to Kafka • 0.9.0 <- New consumer, Connect APIs • 0.10.0 <- New consumer improvements, Streams APIs • 0.11.0 <- Idempotent producer, transactional semantics, Exactly once.
  • 5. 5 Kafka Components • Broker • Java clients: • Producer • Consumers • Kafka Streams • Kafka Connect • Non-Java: • Librdkafka • Librdkafka based – Python, Go, NodeJS, C#... • Others
  • 6. 6 If Kafka is a critical piece of our pipeline  Can we be 100% sure that our data will get there?  Can we lose messages?  How do we verify?  Who’s fault is it?
  • 7. 7 Distributed Systems  Things Fail  Systems are designed to tolerate failure  We must expect failures and design our code and configure our systems to handle them
  • 8. 8 Network Broker MachineClient Machine Data Flow - Producer Kafka Client Broker O/S Socket Buffer NIC NIC Page Cache Disk Application Thread O/S Socket Buffercallbac k ✗ ✗ ✗ ✗ ✗ ✗ ✗✗ data ack / exception Replication
  • 9. 10 Kafka is super reliable. Stores data, on disk. Replicated. … if you know how to configure it that way.
  • 10. 11 Replication is your friend • Kafka protects against failures by replicating data • The unit of replication is the partition • One replica is designated as the Leader • Follower replicas fetch data from the leader • The leader holds the list of “in-sync” replicas
  • 12. 13 ISR: two things make a replica in-sync • Lag behind leader: replica.lag.time.max.ms – replica that didn’t fetch or is behind; replica.lag.max.messages – has gone away in 0.9 • Connection to Zookeeper
  • 13. 14 Terminology Acked • Producers will not retry sending. • Depends on producer setting. Committed • Only when message got to all ISR (future leaders have it). • Consumers can read. • replica.lag.time.max.ms controls: how long can a dead replica prevent consumers from reading? Committed Offsets • Consumer told Kafka the latest offsets it read. By default the consumer will not see these events again.
  • 14. 15 Replication: Acks = all • Waits for all in-sync replicas to reply. [Timeline: Replica 1: 100 | Replica 2: 100 | Replica 3: 100]
  • 15. 16 Replication: Replica 3 stopped replicating for some reason. [Timeline: Replica 1: 100, 101 | Replica 2: 100, 101 | Replica 3: 100] Message 100: acked in acks = all, “committed”. Message 101: acked in acks = 1 but not “committed”.
  • 16. 17 Replication: One replica drops out of ISR, or goes offline. All messages are now acked and committed. [Timeline: Replica 1: 100, 101 | Replica 2: 100, 101 | Replica 3: 100]
  • 17. 18 Replication: 2nd replica drops out, or is offline. [Timeline: Replica 1: 100, 101, 102, 103, 104 | Replica 2: 100, 101 | Replica 3: 100]
  • 18. 19 Replication: Now we’re in trouble ✗ [Timeline: Replica 1: 100, 101, 102, 103, 104 | Replica 2: 100, 101 | Replica 3: 100]
  • 19. 20 Replication: If Replica 2 or 3 comes back online before the leader, you will lose data. [Timeline: Replica 1: 100, 101, 102, 103, 104 | Replica 2: 100, 101 | Replica 3: 100] All those are “acked” and “committed”.
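The failure sequence in the replication slides above can be replayed with a small in-memory model. This is only a sketch and not Kafka code: the class `IsrModel` and its methods are hypothetical, and it assumes the simplified rule from the Terminology slide that a message is committed once every in-sync replica has it.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model of one partition. IsrModel is illustrative only, not a Kafka API.
class IsrModel {
    // Last offset stored on each replica, keyed by replica id.
    private final Map<Integer, Long> logEnd = new TreeMap<>();
    // Replica ids currently considered in-sync (the leader is always one of them).
    private final Set<Integer> isr = new TreeSet<>();

    IsrModel(long initialOffset, int... replicaIds) {
        for (int id : replicaIds) {
            logEnd.put(id, initialOffset);
            isr.add(id);
        }
    }

    // A write reaches the leader and every replica that is still in-sync.
    void append(long offset) {
        for (int id : isr) {
            logEnd.put(id, offset);
        }
    }

    void dropFromIsr(int id) {
        isr.remove(id);
    }

    // Committed = highest offset present on all in-sync replicas.
    // (With an empty ISR this model returns Long.MAX_VALUE; not handled here.)
    long committed() {
        long min = Long.MAX_VALUE;
        for (int id : isr) {
            min = Math.min(min, logEnd.get(id));
        }
        return min;
    }

    // Committed messages that vanish if an out-of-sync replica becomes leader.
    long lostIfElected(int newLeader) {
        return Math.max(0, committed() - logEnd.get(newLeader));
    }

    public static void main(String[] args) {
        IsrModel p = new IsrModel(100, 1, 2, 3);
        p.dropFromIsr(3);           // Replica 3 stops replicating
        p.append(101);
        p.dropFromIsr(2);           // 2nd replica drops out
        for (long o = 102; o <= 104; o++) {
            p.append(o);            // only Replica 1 has these
        }
        System.out.println("committed offset: " + p.committed());
        System.out.println("lost if Replica 2 leads: " + p.lostIfElected(2));
    }
}
```

In this model, every message up to 104 counts as committed because the ISR has shrunk to the leader alone, which is why an unclean election of Replica 2 or 3 discards data that producers were told was safe.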
  • 20. 21 So what to do? Disable Unclean Leader Election •unclean.leader.election.enable = false •Default from 0.11.0 Set replication factor •default.replication.factor = 3 Set minimum ISRs •min.insync.replicas = 2
  • 21. 22 Warning! min.insync.replicas is applied at the topic-level. Must alter the topic configuration manually if created before the server level change Must manually alter the topic < 0.9.0 (KAFKA-2114)
  • 22. 23 Replication: Replication = 3, Min ISR = 2. [Timeline: Replica 1: 100 | Replica 2: 100 | Replica 3: 100]
  • 23. 24 Replication: One replica drops out of ISR, or goes offline. [Timeline: Replica 1: 100, 101 | Replica 2: 100, 101 | Replica 3: 100]
  • 24. 25 Replication: 2nd replica fails out, or is out of sync. [Timeline: Replica 1: 100, 101 | Replica 2: 100, 101 | Replica 3: 100] Messages 102, 103, 104: buffers in producer.
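The effect of min.insync.replicas = 2 in the slides above can be sketched the same way. The class `MinIsrGate` below is a hypothetical stand-in for the leader's check under acks = all, not Kafka's implementation: once the ISR is smaller than the minimum, the write is refused and the message stays with the producer instead of landing on a single replica.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative model of the min.insync.replicas check; not a Kafka API.
class MinIsrGate {
    private final int minInsyncReplicas;
    private final Set<Integer> isr = new TreeSet<>();
    private final List<Long> log = new ArrayList<>();

    MinIsrGate(int minInsyncReplicas, int... replicaIds) {
        this.minInsyncReplicas = minInsyncReplicas;
        for (int id : replicaIds) {
            isr.add(id);
        }
    }

    void dropFromIsr(int id) {
        isr.remove(id);
    }

    // Returns true if the write is accepted, false if the ISR is too small.
    // A rejected message is expected to stay in the producer's buffer for retry.
    boolean tryAppend(long offset) {
        if (isr.size() < minInsyncReplicas) {
            return false;
        }
        log.add(offset);
        return true;
    }

    int logSize() {
        return log.size();
    }

    public static void main(String[] args) {
        MinIsrGate gate = new MinIsrGate(2, 1, 2, 3);
        System.out.println("100 accepted: " + gate.tryAppend(100));
        gate.dropFromIsr(3);
        System.out.println("101 accepted: " + gate.tryAppend(101));
        gate.dropFromIsr(2);
        System.out.println("102 accepted: " + gate.tryAppend(102));
    }
}
```

The trade-off is availability: the partition stops accepting acks = all writes until a second replica rejoins the ISR, in exchange for never acknowledging a message that exists on only one broker.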
  • 25. 26
  • 26. 27 Producer Internals: Producer sends batches of messages to a buffer. [Diagram: application threads call send(); messages M0, M1, M2, M3 are collected into Batch 1, Batch 2, Batch 3; batches drain to the broker; on response, Fail? leads to retry, otherwise update Future / callback with Metadata or Exception.]
  • 27. 28 Basics: Durability can be configured with the producer configuration request.required.acks (acks in the new Java producer) •0 The message is written to the network (buffer) •1 The message is written to the leader •all The producer gets an ack after all ISRs receive the data; the message is committed. Make sure producer doesn’t just throw messages away! •For clients < 0.9.0, block.on.buffer.full = true; for newer clients, max.block.ms = Long.MAX_VALUE •Or handle the BufferExhaustedException / TimeoutException yourself
  • 28. 29 Producer: All calls are non-blocking async. 2 options for checking for failures: • Immediately block for response: send().get() • Do follow-up work in Callback, close producer after error threshold • Be careful about buffering these failures. Future work? KAFKA-1955 • Don’t forget to close the producer! producer.close() will block until in-flight txns complete. retries (producer config) defaults to 0. In-flight requests could lead to message re-ordering (max.in.flight.requests.per.connection)
  • 29. 30
  • 30. 31 Consumer Three choices for Consumer API •Simple Consumer •High Level Consumer (ZookeeperConsumer) •New KafkaConsumer
  • 31. 32 New Consumer – auto commit props.put("enable.auto.commit", "true"); props.put("auto.commit.interval.ms", "10000"); KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props); consumer.subscribe(Arrays.asList("foo", "bar")); while (true) { ConsumerRecords<String, String> records = consumer.poll(100); for (ConsumerRecord<String, String> record : records) { processAndUpdateDB(record); } } What if we crash after 8 seconds? Commit automatically every 10 seconds
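The "crash after 8 seconds" question on the slide above comes down to simple arithmetic, sketched here. The class `AutoCommitCrash` is hypothetical, and the model is deliberately simplified: it assumes one record is processed per second and that offsets are committed only at exact multiples of the commit interval.

```java
// Back-of-the-envelope model of auto-commit duplicates; not Kafka code.
class AutoCommitCrash {
    // Records that will be processed a second time after a restart,
    // assuming one record per second and commits at multiples of the interval.
    static long duplicatesAfterCrash(long crashAtSeconds, long commitIntervalSeconds) {
        long processed = crashAtSeconds;
        long committed = (crashAtSeconds / commitIntervalSeconds) * commitIntervalSeconds;
        return processed - committed;
    }

    public static void main(String[] args) {
        // Slide scenario: auto.commit.interval.ms = 10000, crash after 8 seconds.
        System.out.println("duplicates: " + duplicatesAfterCrash(8, 10));
    }
}
```

Under these assumptions nothing was committed before the crash, so all 8 records already written to the database are delivered and processed again on restart.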
  • 32. 33 New Consumer – manual commit props.put("enable.auto.commit", "false"); KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props); consumer.subscribe(Arrays.asList("foo", "bar")); while (true) { ConsumerRecords<String, String> records = consumer.poll(100); for (ConsumerRecord<String, String> record : records) processAndUpdateDB(record); consumer.commitSync(); } Commit entire batch outside the loop!
  • 33. 34 Minimize Duplicates for At Least Once Consuming 1. Commit your own offsets - Set enable.auto.commit = false 2. Use Rebalance Listener to limit duplicates 3. Make sure you commit only what you are done processing 4. Note: New consumer is single threaded – one consumer per thread.
  • 34. 35 Exactly Once Semantics: At most once is easy. At least once is not bad either – commit after 100% sure data is safe. Exactly once is tricky • Commit data and offsets in one transaction • Idempotent producer. Kafka Connect – many connectors (especially Confluent’s) are exactly once by using an external database to write events and store offsets in one transaction. Kafka Streams – starting at 0.11.0, exactly once is easy to configure (processing.guarantee=exactly_once). Other stream processing systems – have their own thing.
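"Commit data and offsets in one transaction" can be illustrated without a real database. In the sketch below, the hypothetical class `TransactionalSink` stands in for an external store: one synchronized method plays the role of the transaction, so the row and the offset are always updated together, and a redelivered record is recognised by its offset and skipped.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory stand-in for a database that stores data and offsets atomically.
// TransactionalSink is illustrative only; a real sink would use a DB transaction.
class TransactionalSink {
    private final List<String> rows = new ArrayList<>();
    private long nextOffset = 0; // offset of the next record the sink expects

    // Stores the row and advances the offset as a single atomic step.
    // Returns false when the record was already applied (duplicate delivery).
    synchronized boolean apply(long offset, String value) {
        if (offset < nextOffset) {
            return false;
        }
        rows.add(value);
        nextOffset = offset + 1;
        return true;
    }

    // After a restart, the consumer would resume reading from this offset.
    synchronized long nextOffset() {
        return nextOffset;
    }

    synchronized int rowCount() {
        return rows.size();
    }

    public static void main(String[] args) {
        TransactionalSink sink = new TransactionalSink();
        sink.apply(0, "a");
        sink.apply(1, "b");
        boolean applied = sink.apply(1, "b"); // redelivered after a crash
        System.out.println("redelivery applied: " + applied);
        System.out.println("rows stored: " + sink.rowCount());
    }
}
```

Because the offset lives next to the data, the consumer never relies on Kafka's committed offsets for this sink; on restart it would seek to `nextOffset()` instead.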
  • 35. 36 How do we test Kafka? Replication Tests: These tests verify that replication provides simple durability guarantees by checking that data acked by brokers is still available for consumption in the face of various failure scenarios. Setup: • 1 Zookeeper Node • 3 Kafka Nodes • 1 Topic with partitions=3, replication-factor=3 and min.insync.replicas=2 Procedure: • Produce messages in the background • Consume messages in the background • Initiate broker failures (shutdown, or bounce repeatedly with kill -15 or kill -9) • When done driving failures, stop producing and finish consuming • Validate that every acked message was consumed
  • 36. 37 Monitoring for Data Loss • Monitor for producer errors – watch the retry numbers • Monitor consumer lag – MaxLag or via offsets • Standard schema: • Each message should contain timestamp and originating service and host • Each producer can report message counts and offsets to a special topic • “Monitoring consumer” reports message counts to another special topic • “Important consumers” also report message counts • Reconcile the results
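The "reconcile the results" step above can be sketched as a comparison of per-producer counts. The class `CountReconciler` and the `service@host` keys are hypothetical; the sketch assumes the counts have already been read from the special monitoring topics into plain maps.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Compares counts reported by producers with counts reported by a consumer.
// CountReconciler is illustrative only; it is not part of Kafka.
class CountReconciler {
    // Returns, per producer key, produced minus consumed where they differ.
    // A positive gap suggests loss (or lag); a negative gap suggests duplicates.
    static Map<String, Long> gaps(Map<String, Long> produced, Map<String, Long> consumed) {
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, Long> e : produced.entrySet()) {
            long seen = consumed.getOrDefault(e.getKey(), 0L);
            long gap = e.getValue() - seen;
            if (gap != 0) {
                result.put(e.getKey(), gap);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> produced = new HashMap<>();
        produced.put("orders@host1", 100L);
        produced.put("billing@host2", 50L);
        Map<String, Long> consumed = new HashMap<>();
        consumed.put("orders@host1", 100L);
        consumed.put("billing@host2", 47L);
        System.out.println("gaps: " + gaps(produced, consumed));
    }
}
```

A non-zero gap is only a signal: the consumer may simply be lagging, so in practice the comparison would be made over a window that has had time to drain.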
  • 37. 38 Be Safe, Not Sorry • acks = all • max.block.ms = Long.MAX_VALUE • retries = MAX_INT • (max.in.flight.requests.per.connection = 1) • producer.close() • replication-factor >= 3 • min.insync.replicas = 2 • unclean.leader.election.enable = false • enable.auto.commit = false • Commit after processing • Monitor!
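The client-side part of this checklist can be collected into `java.util.Properties` objects, as in the sketch below. The class `SafeClientConfigs` is hypothetical; the keys are the Java client configuration names as I understand them (`acks`, `retries`, `max.block.ms`, `max.in.flight.requests.per.connection`, `enable.auto.commit`). Connection settings and serializers are omitted, so these properties alone are not enough to construct a client.

```java
import java.util.Properties;

// Collects the reliability-related client settings from the checklist.
// SafeClientConfigs is an illustrative helper, not part of the Kafka client.
class SafeClientConfigs {
    static Properties producer() {
        Properties p = new Properties();
        p.put("acks", "all");                                   // wait for all in-sync replicas
        p.put("retries", String.valueOf(Integer.MAX_VALUE));    // keep retrying on failure
        p.put("max.block.ms", String.valueOf(Long.MAX_VALUE));  // block rather than drop when the buffer is full
        p.put("max.in.flight.requests.per.connection", "1");    // avoid re-ordering on retry
        return p;
    }

    static Properties consumer() {
        Properties c = new Properties();
        c.put("enable.auto.commit", "false");                   // commit manually, after processing
        return c;
    }

    public static void main(String[] args) {
        System.out.println("producer: " + producer());
        System.out.println("consumer: " + consumer());
    }
}
```

The broker and topic settings on the slide (replication factor, min.insync.replicas, unclean leader election) are configured on the cluster side and are not covered by this sketch.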

Editor's Notes

  1. Apache Kafka is no longer just pub-sub messaging. Because of its persistence and reliability, it makes a great place to manage general streams of events and to drive streaming applications.
  2. We are going to start by discussing reliability guarantees as implemented by the broker’s replication protocol. We then discuss how to configure the clients for better reliability. We’ll use Java clients as an example. For non-Java clients: the C client (librdkafka) works pretty much the same way, so the same configurations and guarantees will work. The same goes for clients in other languages based on librdkafka. For other clients… it’s hard to make generalizations. Some are very different and the advice in this talk will not work for them.
  3. Low-level diagram: not talking about producer / consumer design yet… maybe this is too low-level though. Show diagram of network send -> O/S socket -> NIC -> ---- NIC -> O/S socket buffer -> socket -> internal message flow / socket server -> response back to client -> how writes get persisted to disk, including O/S buffers, async write, etc. Then overlay places where things can fail.
  4. Highlight boxes with different color
  5. When Replica 3 is back, it will catch up
  6. Commit every 10 seconds, but we don’t really have any control over what’s processed, and this can lead to duplicates