
Kafka Reliability - When it absolutely, positively has to be there

Best practices for making sure messages don't get lost - ever.

  1. When it absolutely, positively, has to be there: Reliability Guarantees in Apache Kafka. Gwen Shapira, Confluent (@gwenshap) and Jeff Holoman, Cloudera (@jeffholoman)
  2. Kafka
     - High throughput
     - Low latency
     - Scalable
     - Centralized
     - Real-time
  3. “If data is the lifeblood of high technology, Apache Kafka is the circulatory system” - Todd Palino, Kafka SRE @ LinkedIn
  4. If Kafka is a critical piece of our pipeline
     - Can we be 100% sure that our data will get there?
     - Can we lose messages?
     - How do we verify?
     - Whose fault is it?
  5. Distributed Systems
     - Things fail
     - Systems are designed to tolerate failure
     - We must expect failures, and design our code and configure our systems to handle them
  6. Data Flow [diagram: on the client machine, an application thread hands data to the Kafka client, which passes it through the O/S socket buffer and NIC, across the network to the broker's NIC, socket buffer, page cache, and disk; an ack or exception returns via async callback; every hop is a potential failure point (✗)]
  7. Data Flow [diagram: the consume path in reverse: broker disk and page cache (with possible cache miss) through socket buffer and NIC, across the network to the Kafka client and application thread; every hop is a potential failure point (✗), and consumed offsets are tracked in ZooKeeper or Kafka]
  8. Replication is your friend
     - Kafka protects against failures by replicating data
     - The unit of replication is the partition
     - One replica is designated as the Leader
     - Follower replicas fetch data from the leader
     - The leader holds the list of “in-sync” replicas
  9. Replication and ISRs [diagram: topic my_topic with 3 partitions and 3 replicas spread across brokers 100, 101, and 102; partition 0: leader 100, ISR 101,102; partition 1: leader 101, ISR 100,102; partition 2: leader 102, ISR 101,100]
  10. ISR - 2 things make a replica in-sync:
     - Lag behind the leader
       - replica.lag.time.max.ms - replica that didn't fetch or is behind
       - replica.lag.max.messages - will go away in 0.9
     - Connection to ZooKeeper
  11. Terminology
     - Acked - producers will not retry sending; depends on the producer setting
     - Committed - consumers can read it; only once the message got to all in-sync replicas
     - replica.lag.time.max.ms - how long can a dead replica prevent consumers from reading?
  12. Replication - acks = all only waits for the in-sync replicas to reply [diagram: replicas 1, 2, and 3 all have message 100]
  13. Replication [diagram: replica 3 stopped replicating for some reason; replicas 1 and 2 have messages 100-101. Message 100 was acked under acks = all and is “committed”; message 101 is acked under acks = 1 but not “committed”]
  14. Replication [diagram: one replica drops out of the ISR, or goes offline; with only replicas 1 and 2 in the ISR, all messages (100-101) are now acked and committed]
  15. Replication [diagram: the 2nd replica drops out, or is offline; replica 1 alone keeps accepting messages 100-104]
  16. Replication [diagram: the last replica fails too (✗) - now we're in trouble]
  17. Replication [diagram: replica 3 (up to message 100) and replica 2 (up to message 101) come back; whichever becomes leader, messages 102-104 are gone, even though all of them were “acked” and “committed”]
  18. So what to do
     - Disable unclean leader election - unclean.leader.election.enable = false
     - Set replication factor - default.replication.factor = 3
     - Set minimum ISRs - min.insync.replicas = 2
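To make these settings concrete, here is a minimal sketch of creating a topic with them baked in. It uses the Java AdminClient, which arrived in Kafka 0.11, after this deck; in the 0.8/0.9 era the same settings went into server.properties or were applied with the kafka-topics tool. The broker address, topic name, and counts are example values.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.*;

public class SafeTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // example host

        try (AdminClient admin = AdminClient.create(props)) {
            // Topic-level overrides for the safety settings from slide 18.
            Map<String, String> configs = new HashMap<>();
            configs.put("min.insync.replicas", "2");
            configs.put("unclean.leader.election.enable", "false");

            // 3 partitions, replication factor 3, as in the earlier diagrams.
            NewTopic topic = new NewTopic("my_topic", 3, (short) 3).configs(configs);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```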
  19. Warning
     - min.insync.replicas is applied at the topic level
     - Topics created before the server-level change must have their configuration altered manually
     - Before 0.9.0 the topic must be altered manually in any case (KAFKA-2114)
  20. Replication - replication factor = 3, min ISR = 2 [diagram: replicas 1, 2, and 3 all have message 100]
  21. Replication [diagram: one replica drops out of the ISR, or goes offline; replicas 1 and 2 still satisfy the min ISR and accept messages 100-101]
  22. Replication [diagram: the 2nd replica fails, or falls out of sync; with only replica 1 left, the ISR is below min.insync.replicas, so messages 102-104 buffer in the producer instead of being written]
  23. Producer Internals - the producer sends batches of messages to a buffer [diagram: application threads call send(); messages M0-M3 accumulate into batches, which are drained to the broker; on failure the producer retries, and the response updates the Future / invokes the callback with metadata or an exception]
  24. Basics
     - Durability can be configured with the producer configuration request.required.acks
       - 0: the message is written to the network (buffer)
       - 1: the message is written to the leader
       - all: the producer gets an ack after all ISRs receive the data; the message is committed
     - Make sure the producer doesn't just throw messages away! block.on.buffer.full = true
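A minimal sketch of a durability-first producer configuration, assuming the 0.9-era Java producer (request.required.acks is the older Scala producer's name for what the Java producer calls acks, and block.on.buffer.full was later replaced by max.block.ms). The broker address and the SafeProducerConfig name are examples, not from the deck.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import java.util.Properties;

public class SafeProducerConfig {
    public static Producer<String, String> create() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // example host
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");                  // committed on all in-sync replicas
        props.put("retries", Integer.MAX_VALUE);   // keep retrying, don't drop
        props.put("block.on.buffer.full", "true"); // block send() instead of throwing
        props.put("max.in.flight.requests.per.connection", "1"); // no reordering on retry
        return new KafkaProducer<>(props);
    }
}
```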
  25. Producer API
     - All calls are non-blocking and async
     - 2 options for checking for failures (see the sketch below):
       - Immediately block for the response: send().get()
       - Do follow-up work in a Callback, and close the producer after an error threshold; be careful about buffering these failures (future work? KAFKA-1955)
     - Don't forget to close the producer! producer.close() will block until in-flight sends complete
     - retries (producer config) defaults to 0
     - message.send.max.retries (old producer config) defaults to 3
     - In-flight requests could lead to message re-ordering
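A sketch of both failure-checking styles, reusing the SafeProducerConfig helper invented in the previous snippet:

```java
import org.apache.kafka.clients.producer.*;

public class SendPatterns {
    public static void main(String[] args) throws Exception {
        Producer<String, String> producer = SafeProducerConfig.create();
        ProducerRecord<String, String> record =
            new ProducerRecord<>("my_topic", "key", "value");

        // Option 1: block immediately for the broker's response.
        RecordMetadata md = producer.send(record).get(); // throws on failure
        System.out.println("acked at offset " + md.offset());

        // Option 2: continue asynchronously, inspect failures in the callback.
        producer.send(record, new Callback() {
            @Override
            public void onCompletion(RecordMetadata metadata, Exception e) {
                if (e != null) {
                    // Count/log the error; close the producer past a threshold.
                    e.printStackTrace();
                }
            }
        });

        producer.close(); // blocks until in-flight sends complete
    }
}
```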
  26. Consumer - two choices for the Consumer API:
     - Simple Consumer
     - High-Level Consumer
  27. Consumer Offsets [diagram: a single consumer with four threads reading partitions P0-P6]
  28. Consumer Offsets [diagram: the threads make progress; commit the offsets now?]
  29. Consumer Offsets [diagram: the threads progress at different rates; commit now?]
  30. Consumer Offsets [diagram: auto-commit enabled; offsets are committed (✗) while some messages are still being processed]
  31. Consumer Offsets [diagram: auto-commit enabled; a thread fails (✗) after its offsets were committed]
  32. Consumer Offsets [diagram: auto-commit enabled; on restart the consumer picks up at the committed offset, skipping the messages that were never processed]
  33. Consumer Offsets [diagram: committing manually after processing]
  34. Consumer Offsets [diagram: a commit applies to the offsets of all threads at once]
  35. Consumer Offsets [diagram: auto-commit DISABLED; consumers 1-4 each own their partitions and commit their own offsets]
  36. Consumer Recommendations (see the sketch after the next slide)
     - Set autocommit.enable = false
     - Manually commit offsets after the message data is processed / persisted: consumer.commitOffsets()
     - Run each consumer in its own thread
  37. New Consumer!
     - No ZooKeeper! At all!
     - Rebalance listener
     - Commit variants: commit, commit async, commit(offset)
     - seek(offset)
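Putting slides 36 and 37 together, a minimal sketch of a 0.9 new-consumer loop with auto-commit disabled, a rebalance listener that flushes offsets before partitions are revoked, and a synchronous commit only after the batch is processed. The broker, group, and topic names are examples.

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;
import java.util.*;

public class SafeConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // example host
        props.put("group.id", "safe-group");
        props.put("enable.auto.commit", "false");       // commit manually
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my_topic"),
                new ConsumerRebalanceListener() {
                    @Override
                    public void onPartitionsRevoked(Collection<TopicPartition> parts) {
                        consumer.commitSync(); // flush offsets before losing partitions
                    }
                    @Override
                    public void onPartitionsAssigned(Collection<TopicPartition> parts) {
                        // could consumer.seek(...) to externally stored offsets here
                    }
                });
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // persist before committing
                }
                consumer.commitSync(); // only after the batch is safely processed
            }
        }
    }

    private static void process(ConsumerRecord<String, String> r) {
        System.out.printf("%s-%d@%d: %s%n",
            r.topic(), r.partition(), r.offset(), r.value());
    }
}
```

The commitSync() in onPartitionsRevoked keeps another consumer in the group from re-reading (or skipping) what this one already processed when a rebalance happens mid-batch.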
  38. Exactly Once Semantics
     - At most once is easy
     - At least once is not bad either - commit after you are 100% sure the data is safe
     - Exactly once is tricky (see the sketch below):
       - Commit data and offsets in one transaction
       - Idempotent producer
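A sketch of the “commit data and offsets in one transaction” idea, assuming consumed records land in a relational store. The JDBC URL, table, and column names are hypothetical, and a real implementation would batch commits and upsert the offsets row.

```java
import org.apache.kafka.clients.consumer.*;
import java.sql.*;
import java.util.*;

public class ExactlyOnceSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("group.id", "exactly-once-demo");
        props.put("enable.auto.commit", "false"); // offsets live in the DB instead
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/demo");
        db.setAutoCommit(false);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my_topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> r : records) {
                    // Write the data and its offset in the SAME transaction.
                    try (PreparedStatement data = db.prepareStatement(
                             "INSERT INTO events(payload) VALUES (?)");
                         PreparedStatement off = db.prepareStatement(
                             "UPDATE offsets SET off = ? WHERE topic = ? AND part = ?")) {
                        data.setString(1, r.value());
                        data.executeUpdate();
                        off.setLong(1, r.offset() + 1); // next offset to read
                        off.setString(2, r.topic());
                        off.setInt(3, r.partition());
                        off.executeUpdate();
                    }
                    db.commit(); // both rows or neither
                }
            }
        }
    }
}
```

On restart, read the stored offsets back from the database and consumer.seek() to them, rather than trusting Kafka's committed offsets.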
  39. Monitoring for Data Loss
     - Monitor for producer errors - watch the retry numbers
     - Monitor consumer lag - MaxLag or via offsets
     - Standard schema: each message should contain a timestamp and the originating service and host
     - Each producer can report message counts and offsets to a special topic
     - A “monitoring consumer” reports message counts to another special topic
     - “Important consumers” also report message counts
     - Reconcile the results
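One way a producer might report its counts to a special topic, as the slide suggests. The topic name message_counts, the JSON layout, and the reporting cadence are all assumptions, not from the deck.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;
import java.util.concurrent.atomic.AtomicLong;

public class CountingProducer {
    private final KafkaProducer<String, String> producer;
    private final AtomicLong acked = new AtomicLong();

    public CountingProducer(Properties props) {
        this.producer = new KafkaProducer<>(props); // serializers etc. assumed configured
    }

    public void send(String topic, String key, String value) {
        producer.send(new ProducerRecord<>(topic, key, value),
            (metadata, exception) -> {
                if (exception == null) acked.incrementAndGet(); // count successes only
            });
    }

    // Called periodically (e.g., once a minute) by the application.
    public void reportCounts(String host, String service) {
        String payload = String.format(
            "{\"host\":\"%s\",\"service\":\"%s\",\"ts\":%d,\"count\":%d}",
            host, service, System.currentTimeMillis(), acked.get());
        producer.send(new ProducerRecord<>("message_counts", host, payload));
    }
}
```

A “monitoring consumer” can then tally what actually arrived and a reconciliation job compares the two counts.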
  40. Be Safe, Not Sorry
     - acks = all
     - block.on.buffer.full = true
     - retries = MAX_INT
     - (max.in.flight.requests.per.connection = 1)
     - producer.close()
     - Replication factor >= 3
     - min.insync.replicas = 2
     - unclean.leader.election.enable = false
     - enable.auto.commit = false
     - Commit offsets after processing
     - Monitor!
