
Kafkaesque days at LinkedIn in 2015


Presented at the inaugural Kafka Summit (2016), hosted by Confluent in San Francisco

Abstract:

Kafka is a backbone for various data pipelines and asynchronous messaging at LinkedIn and beyond. 2015 was an exciting year at LinkedIn in that we hit a new level of scale with Kafka: we now process more than 1 trillion published messages per day across nearly 1,300 brokers. We ran into some interesting production issues at this scale, and I will dive into some of the most critical incidents that we encountered at LinkedIn in the past year:

Data loss: We have extremely stringent SLAs on latency and completeness that were violated on a few occasions. Some of these incidents were due to subtle configuration problems or even missing features.

Offset resets: As of early 2015, Kafka-based offset management was still a relatively new feature and we occasionally hit offset resets. Troubleshooting these incidents turned out to be extremely tricky and resulted in various fixes in offset management/log compaction as well as our monitoring.

Cluster unavailability due to high request/response latencies: Such incidents demonstrate how even subtle performance regressions and monitoring gaps can lead to an eventual cluster meltdown.

Power failures! What happens when an entire data center goes down? We experienced this first hand and it was not so pretty.

and more…

This talk will go over how we detected, investigated and remediated each of these issues and summarize some of the features in Kafka that we are working on that will help eliminate or mitigate such incidents in the future.



  1. 1. Kafkaesque days at LinkedIn in 2015 Joel Koshy Kafka Summit 2016
  2. 2. Kafkaesque adjective Kaf·ka·esque ˌkäf-kə-ˈesk, ˌkaf- : of, relating to, or suggestive of Franz Kafka or his writings; especially : having a nightmarishly complex, bizarre, or illogical quality Merriam-Webster
  3. 3. Kafka @ LinkedIn
  4. 4. What @bonkoif said: More clusters More use-cases More problems … Kafka @ LinkedIn
  5. 5. Incidents that we will cover ● Offset rewinds ● Data loss ● Cluster unavailability ● (In)compatibility ● Blackout
  6. 6. Offset rewinds
  7. 7. What are offset rewinds? [log diagram: purged messages (invalid offsets) | valid offsets | yet-to-arrive messages (invalid offsets)]
  8. 8. What are offset rewinds? If a consumer gets an OffsetOutOfRangeException: auto.offset.reset ← earliest resets to the earliest valid offset; auto.offset.reset ← latest resets to the latest offset. [log diagram as before, with earliest pointing at the start of the valid range and latest at the end]
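As a concrete illustration (a minimal sketch, not taken from the deck; the broker address, group id, and topic name are placeholders), this is where the behavior above is configured on the Java consumer:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class OffsetResetExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
            props.put("group.id", "mirror-maker");               // placeholder group
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            // On OffsetOutOfRangeException:
            //   "earliest" -> jump back to the log start (re-consume; duplicates possible)
            //   "latest"   -> jump to the log end (skip ahead; messages in between are not consumed)
            props.put("auto.offset.reset", "latest");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("PageViewEvent"));  // placeholder topic
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s-%d @ %d%n", record.topic(), record.partition(), record.offset());
                }
            }
        }
    }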
  9. 9. What are offset rewinds… and why do they matter? HADOOP Kafka (CORP) Push job Kafka (PROD) Stork Mirror Maker Email campaigns
  10. 10. What are offset rewinds… and why do they matter? HADOOP Kafka Push job Kafka (PROD) Stork Mirror Maker Email campaigns Real-life incident courtesy of xkcd offset rewind
  11. 11. Deployment: Deployed Multiproduct kafka-mirror-maker 0.1.13 to DCX by jkoshy CRT Notifications <crt-notifications-noreply@linkedin.com> Fri, Jul 10, 2015 at 8:27 PM Multiproduct 0.1.13 of kafka-mirror-maker has been Deployed to DCX by jkoshy Offset rewinds: the first incident
  12. 12. Deployment: Deployed Multiproduct kafka-mirror-maker 0.1.13 to DCX by jkoshy CRT Notifications <crt-notifications-noreply@linkedin.com> Fri, Jul 10, 2015 at 8:27 PM Multiproduct 0.1.13 of kafka-mirror-maker has been Deployed to DCX by jkoshy on Wednesday, Jul 8, 2015 at 10:14 AM Offset rewinds: the first incident
  13. 13. What are offset rewinds… and why do they matter? HADOOP Kafka (CORP) Push job Kafka (PROD) Stork Mirror Maker Email campaigns Good practice to have some filtering logic here
  14. 14. Offset rewinds: detection
  15. 15. Offset rewinds: detection
  16. 16. Offset rewinds: detection - just use this
  17. 17. Offset rewinds: a typical cause
  18. 18. Offset rewinds: a typical cause [log diagram: invalid offsets | valid offsets | invalid offsets, with the consumer position marked]
  19. 19. Offset rewinds: a typical cause Unclean leader election truncates the log
  20. 20. Offset rewinds: a typical cause Unclean leader election truncates the log … and consumer’s offset goes out of range
  21. 21. But there were no ULEs when this happened
  22. 22. But there were no ULEs when this happened … and we set auto.offset.reset to latest
  23. 23. Offset management - a quick overview [diagram: a consumer group of consumers issues fetch requests (consume) against the brokers]
  24. 24. Offset management - a quick overview [diagram: consumers in the group send periodic OffsetCommitRequests to the offset manager (a broker)]
  25. 25. Offset management - a quick overview [diagram: consumers send an OffsetFetchRequest to the offset manager (a broker) after a rebalance]
  26. 26. Offset management - a quick overview mirror-maker PageViewEvent-0 240 mirror-maker LoginEvent-8 456 mirror-maker LoginEvent-8 512 mirror-maker PageViewEvent-0 321 __consumer_offsets topic
  27. 27. Offset management - a quick overview mirror-maker PageViewEvent-0 240 mirror-maker LoginEvent-8 456 mirror-maker LoginEvent-8 512 mirror-maker PageViewEvent-0 321 __consumer_offsets topic New offset commits append to the topic
  28. 28. Offset management - a quick overview mirror-maker PageViewEvent-0 240 mirror-maker LoginEvent-8 456 mirror-maker LoginEvent-8 512 mirror-maker PageViewEvent-0 321 __consumer_offsets topic New offset commits append to the topic mirror-maker PageViewEvent-0 321 mirror-maker LoginEvent-8 512 … … Maintain offset cache to serve offset fetch requests quickly
  29. 29. Offset management - a quick overview mirror-maker PageViewEvent-0 240 mirror-maker LoginEvent-8 456 mirror-maker LoginEvent-8 512 mirror-maker PageViewEvent-0 321 __consumer_offsets topic New offset commits append to the topic mirror-maker PageViewEvent-0 321 mirror-maker LoginEvent-8 512 … … Purge old offsets via log compaction Maintain offset cache to serve offset fetch requests quickly
  30. 30. Offset management - a quick overview mirror-maker PageViewEvent-0 240 mirror-maker LoginEvent-8 456 mirror-maker LoginEvent-8 512 mirror-maker PageViewEvent-0 321 __consumer_offsets topic When a new broker becomes the leader (i.e., offset manager) it loads offsets into its cache
  31. 31. Offset management - a quick overview mirror-maker PageViewEvent-0 240 mirror-maker LoginEvent-8 456 mirror-maker LoginEvent-8 512 mirror-maker PageViewEvent-0 321 __consumer_offsets topic mirror-maker PageViewEvent-0 321 mirror-maker LoginEvent-8 512 … … See this deck for more details
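For orientation, a minimal consumer-side sketch of the same protocol using the Java client (group, topic, and offsets are made up; the incidents below involved the older consumer, but the commit/fetch requests work the same way):

    import java.util.Collections;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class OffsetCommitFetchSketch {
        // Commit the consumer's position for one partition; the broker-side offset manager
        // appends this as a message keyed by (group, topic, partition) to __consumer_offsets.
        static void commitPosition(KafkaConsumer<?, ?> consumer, TopicPartition tp, long offset) {
            Map<TopicPartition, OffsetAndMetadata> commit =
                    Collections.singletonMap(tp, new OffsetAndMetadata(offset));
            consumer.commitSync(commit);   // OffsetCommitRequest under the hood
        }

        // After a rebalance, ask the offset manager for the last committed offset
        // (served from its in-memory offset cache).
        static long fetchCommitted(KafkaConsumer<?, ?> consumer, TopicPartition tp) {
            OffsetAndMetadata committed = consumer.committed(tp);  // OffsetFetchRequest under the hood
            return committed == null ? -1L : committed.offset();   // -1 ~ "no offset found"
        }
    }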
  32. 32. Back to the incident… 2015/07/10 02:32:16.174 [ConsumerFetcherThread] [ConsumerFetcherThread-mirror-maker-9de01f48-0-287], Current offset 6811737 for partition [some-log_event,13] out of range; reset offset to 9581225
  33. 33. Back to the incident… ... <rebalance> 2015/07/10 02:08:14.252 [some-log_event,13], initOffset 9581205 ... <rebalance> 2015/07/10 02:24:11.965 [some-log_event,13], initOffset 9581223 ... <rebalance> 2015/07/10 02:32:16.131 [some-log_event,13], initOffset 6811737 ... 2015/07/10 02:32:16.174 [ConsumerFetcherThread] [ConsumerFetcherThread-mirror-maker-9de01f48-0-287], Current offset 6811737 for partition [some-log_event,13] out of range; reset offset to 9581225
  34. 34. While debugging offset rewinds, do this first! ./bin/kafka-console-consumer.sh --topic __consumer_offsets --zookeeper <zookeeperConnect> --formatter "kafka.coordinator.GroupMetadataManager$OffsetsMessageFormatter" --consumer.config config/consumer.properties (must set exclude.internal.topics=false in consumer.properties)
  35. 35. ... … [mirror-maker,metrics_event,1]::OffsetAndMetadata[83511737,NO_METADATA,1433178005711] [mirror-maker,some-log_event,13]::OffsetAndMetadata[6811737,NO_METADATA,1433178005711] ... ... [mirror-maker,some-log_event,13]::OffsetAndMetadata[9581223,NO_METADATA,1436495051231] ... Inside the __consumer_offsets topic Jul 10 (today) Jun 1 !!
  36. 36. So why did the offset manager return a stale offset? Offset manager logs: 2015/07/10 02:31:57.941 ERROR [OffsetManager] [kafka-scheduler-1] [kafka-server] [] [Offset Manager on Broker 191]: Error in loading offsets from [__consumer_offsets,63] java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.String at kafka.server.OffsetManager$.kafka$server$OffsetManager$$readMessageValue(OffsetManager.scala:576)
  37. 37. So why did the offset manager return a stale offset? Offset manager logs: 2015/07/10 02:31:57.941 ERROR [OffsetManager] [kafka-scheduler-1] [kafka-server] [] [Offset Manager on Broker 191]: Error in loading offsets from [__consumer_offsets,63] java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.String at kafka.server.OffsetManager$.kafka$server$OffsetManager$$readMessageValue(OffsetManager.scala:576) ... ... mirror-maker some-log_event, 13 6811737 ... ... Leader moved and new offset manager hit KAFKA-2117 while loading offsets old offsets recent offsets
  38. 38. … caused a ton of offset resets 2015/07/10 02:08:14.252 [some-log_event,13], initOffset 9581205 ... 2015/07/10 02:24:11.965 [some-log_event,13], initOffset 9581223 ... 2015/07/10 02:32:16.131 [some-log_event,13], initOffset 6811737 ... 2015/07/10 02:32:16.174 [ConsumerFetcherThread] [ConsumerFetcherThread-mirror-maker-9de01f48-0-287], Current offset 6811737 for partition [some-log_event,13] out of range; reset offset to 9581225 [some-log_event, 13] 846232 9581225 purged
  39. 39. … but why the duplicate email? Deployment: Deployed Multiproduct kafka-mirror-maker 0.1.13 to DCX by jkoshy CRT Notifications <crt-notifications-noreply@linkedin.com> Fri, Jul 10, 2015 at 8:27 PM Multiproduct 0.1.13 of kafka-mirror-maker has been Deployed to DCX by jkoshy
  40. 40. … but why the duplicate email? 2015/07/10 02:08:15.524 [crt-event,12], initOffset 11464 ... 2015/07/10 02:31:40.827 [crt-event,12], initOffset 11464 ... 2015/07/10 02:32:17.739 [crt-event,12], initOffset 9539 ... Also from Jun 1
  41. 41. … but why the duplicate email? 2015/07/10 02:08:15.524 [crt-event,12], initOffset 11464 ... 2015/07/10 02:31:40.827 [crt-event,12], initOffset 11464 ... 2015/07/10 02:32:17.739 [crt-event,12], initOffset 9539 ... [crt-event, 12] 0 11464 … but still valid!
  42. 42. Time-based retention does not work well for low-volume topics Addressed by KIP-32/KIP-33
  43. 43. Offset rewinds: the second incident mirror makers got wedged → restarted → sent duplicate emails to a few members
  44. 44. Offset rewinds: the second incident Consumer logs 2015/04/29 17:22:48.952 <rebalance started> ... 2015/04/29 17:36:37.790 <rebalance ended> initOffset -1 (for various partitions)
  45. 45. Offset rewinds: the second incident Consumer logs 2015/04/29 17:22:48.952 <rebalance started> ... 2015/04/29 17:36:37.790 <rebalance ended> initOffset -1 (for various partitions) Broker (offset manager) logs 2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84] ... 2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)
  46. 46. Offset rewinds: the second incident Consumer logs 2015/04/29 17:22:48.952 <rebalance started> ... 2015/04/29 17:36:37.790 <rebalance ended> initOffset -1 (for various partitions) Broker (offset manager) logs 2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84] ... 2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!) ⇒ log cleaner had failed a while ago… but why did offset fetch return -1?
  47. 47. Offset management - a quick overview How are stale offsets (for dead consumers) cleaned up? dead-group PageViewEvent-0 321 timestamp older than a week active-group LoginEvent-8 512 recent timestamp … … __consumer_offsets Offset cache cleanup task
  48. 48. Offset management - a quick overview How are stale offsets (for dead consumers) cleaned up? dead-group PageViewEvent-0 321 timestamp older than a week active-group LoginEvent-8 512 recent timestamp … … __consumer_offsets Offset cache cleanup task Append tombstones for dead-group and delete entry in offset cache
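Roughly, the cleanup task does something like the following sketch (illustrative only; names and structures are not the broker's actual code): drop stale entries from the offset cache and append a null-value "tombstone" for the same key so log compaction eventually purges the old commits.

    import java.util.Iterator;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class OffsetCacheCleanupSketch {
        // key = (group, topic, partition) flattened to a string for brevity
        static final Map<String, Long> offsetCache = new ConcurrentHashMap<>();
        static final Map<String, Long> commitTimestamps = new ConcurrentHashMap<>();

        static final long RETENTION_MS = 7L * 24 * 60 * 60 * 1000;  // ~one week, as on the slide

        static void cleanupStaleOffsets(long now) {
            Iterator<Map.Entry<String, Long>> it = commitTimestamps.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Long> entry = it.next();
                if (now - entry.getValue() > RETENTION_MS) {
                    String key = entry.getKey();
                    offsetCache.remove(key);   // delete the entry from the in-memory cache
                    appendTombstone(key);      // null value => log compaction drops older commits for the key
                    it.remove();
                }
            }
        }

        static void appendTombstone(String key) {
            // In the broker this is an append of (key -> null) to __consumer_offsets.
            // In the incident described next, this append raced with an in-progress
            // offset load and clobbered offsets for live groups (fixed in KAFKA-2163).
            System.out.println("tombstone for " + key);
        }
    }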
  49. 49. Back to the incident... 2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84] ... 2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!) mirror-maker PageViewEvent-0 45 very old timestamp mirror-maker LoginEvent-8 12 very old timestamp ... ... ... old offsets recent offsets load offsets
  50. 50. Back to the incident... 2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84] ... 2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!) mirror-maker PageViewEvent-0 45 very old timestamp mirror-maker LoginEvent-8 12 very old timestamp ... ... ... old offsets recent offsets load offsets Cleanup task happened to run during the load
  51. 51. Back to the incident... 2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84] ... 2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!) ... ... ... old offsets recent offsets load offsets
  52. 52. Back to the incident... 2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84] ... 2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!) mirror-maker PageViewEvent-0 321 recent timestamp mirror-maker LoginEvent-8 512 recent timestamp ... ... ... old offsets recent offsets load offsets
  53. 53. Back to the incident... 2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84] ... 2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!) ... ... ... old offsets recent offsets load offsets
  54. 54. Root cause of this rewind ● Log cleaner had failed (separate bug) ○ ⇒ offsets topic grew big ○ ⇒ offset load on leader movement took a while ● Cache cleanup ran during the load ○ which appended tombstones ○ and overrode the most recent offsets ● (Fixed in KAFKA-2163)
  55. 55. Offset rewinds: wrapping it up ● Monitor log cleaner health ● If you suspect a rewind: ○ Check for unclean leader elections ○ Check for offset manager movement (i.e., __consumer_offsets partitions had leader changes) ○ Take a dump of the offsets topic ○ … stare long and hard at the logs (both consumer and offset manager) ● auto.offset.reset ← closest ? ● Better lag monitoring via Burrow
  56. 56. Critical data loss
  57. 57. Data loss: the first incident [pipeline diagram: Producers → Kafka local → Kafka aggregate in PROD-A and PROD-B; Kafka aggregate → Hadoop in CORP-X and CORP-Y]
  58. 58. Audit trail [same pipeline diagram for PROD-A/PROD-B and CORP-X/CORP-Y, showing data and audit flows]
  59. 59. Data loss: detection (example 1) [pipeline diagram]
  60. 60. Data loss: detection (example 1) [pipeline diagram]
  61. 61. Data loss: detection (example 2) [pipeline diagram]
  62. 62. Data loss? (The actual incident) [pipeline diagram]
  63. 63. Data loss or audit issue? (The actual incident) Sporadic discrepancies in Kafka-aggregate-CORP-X counts for several topics; however, the Hadoop-X tier is complete [pipeline diagram with complete tiers checked]
  64. 64. Verified actual data completeness by recounting events in a few low-volume topics … so definitely an audit-only issue Likely caused by dropping audit events
  65. 65. Verified actual data completeness by recounting events in a few low-volume topics … so definitely an audit-only issue Possible sources of discrepancy: ● Cluster auditor ● Cluster itself (i.e., data loss in audit topic) ● Audit front-end Likely caused by dropping audit events
  66. 66. Possible causes [diagram: CORP-X Kafka aggregate → Hadoop; cluster auditor consumes all topics and emits audit counts] Cluster auditor ● Counting incorrectly ○ but same version of auditor everywhere and only CORP-X has issues ● Not consuming all data for audit or failing to send all audit events ○ but no errors in auditor logs ● … and auditor bounces did not help
  67. 67. Possible causes Data loss in audit topic ● … but no unclean leader elections ● … and no data loss in sampled topics (counted manually) [same diagram]
  68. 68. Possible causes Audit front-end fails to insert audit events into DB ● … but other tiers (e.g., CORP-Y) are correct ● … and no errors in logs [diagram: CORP-X Kafka aggregate → audit front-end consumes audit → inserts into audit DB, which also receives inserts from CORP-Y]
  69. 69. Attempt to reproduce ● Emit counts to a new test tier [diagram: cluster auditor on the CORP-X Kafka aggregate consumes all topics and emits to tier CORP-X and tier test]
  70. 70. Attempt to reproduce … fortunately worked: ● Emit counts to a new test tier ● test tier counts were also sporadically off [same diagram]
  71. 71. … and debug ● Enabled select TRACE logs to log audit events before sending ● Audit counts were correct ● … and successfully emitted [same diagram]
  72. 72. … and debug ● Enabled select TRACE logs to log audit events before sending ● Audit counts were correct ● … and successfully emitted ● Verified from broker public access logs that the audit event was sent [same diagram]
  73. 73. … and debug ● Enabled select TRACE logs to log audit events before sending ● Audit counts were correct ● … and successfully emitted ● Verified from broker public access logs that the audit event was sent ● … but on closer look realized it was not the leader for that partition of the audit topic [same diagram]
  74. 74. … and debug ● Enabled select TRACE logs to log audit events before sending ● Audit counts were correct ● … and successfully emitted ● Verified from broker public access logs that the audit event was sent ● … but on closer look realized it was not the leader for that partition of the audit topic ● So why did it not return NotLeaderForPartition? [same diagram]
  75. 75. That broker was part of another cluster! [diagram: test-tier audit events siphoned to some other Kafka cluster alongside the CORP-X Kafka aggregate]
  76. 76. … and we had a VIP misconfiguration [diagram: the VIP in front of the CORP-X Kafka aggregate contained a stray broker entry from some other Kafka cluster]
  77. 77. So audit events leaked into the other cluster ● Auditor still uses the old producer ● Periodically refreshes metadata (via VIP) for the audit topic ● ⇒ sometimes fetches metadata from the other cluster [diagram: audit-topic metadata request through the VIP answered by the other cluster]
  78. 78. So audit events leaked into the other cluster ● Auditor still uses the old producer ● Periodically refreshes metadata (via VIP) for the audit topic ● ⇒ sometimes fetches metadata from the other cluster ● and leaks audit events to that cluster until at least the next metadata refresh [diagram: audit counts emitted to the other cluster]
  79. 79. Some takeaways ● Could have been worse if mirror-makers to CORP-X had been bounced ○ (Since mirror makers could have started siphoning actual data to the other cluster) ● Consider using round-robin DNS instead of VIPs ○ … which is also necessary for using per-IP connection limits
  80. 80. Data loss: the second incident Prolonged period of data loss from our Kafka REST proxy
  81. 81. Data loss: the second incident Prolonged period of data loss from our Kafka REST proxy ● Alerts fire that a broker in the tracking cluster had gone offline ● NOC engages SYSOPS to investigate ● NOC engages Feed SREs and Kafka SREs to investigate a drop (not loss) in a subset of page views ● On investigation, Kafka SRE finds no problems with Kafka (excluding the down broker), but notes an overall drop in tracking messages starting shortly after the broker failure ● NOC engages Traffic SRE to investigate why their tracking events had stopped ● Traffic SRE say they don’t see errors on their side, and add that they use the Kafka REST proxy ● Kafka SRE finds no immediate errors in Kafka REST logs but bounces the service as a precautionary measure ● Tracking events return to normal (expected) counts after the bounce
  82. 82. Reproducing the issue Broker A Producer performance Broker B
  83. 83. Reproducing the issue Broker A Producer performance Broker B Isolate the broker (iptables)
  84. 84. Sender Accumulator Reproducing the issue Broker A Broker B Partition 1 Partition 2 Partition n send Leader for partition 1 in-flight requests
  85. 85. Sender Accumulator Reproducing the issue Broker A Broker B Partition 1 Partition 2 Partition n send New leader for partition 1 in-flight requests Old leader for partition 1
  86. 86. Sender Accumulator Reproducing the issue Broker A Broker B Partition 1 Partition 2 Partition n send New leader for partition 1 in-flight requests New producer did not implement a request timeout Old leader for partition 1
  87. 87. Sender Accumulator Reproducing the issue Broker A Broker B Partition 1 Partition 2 Partition n send in-flight requests New producer did not implement a request timeout ⇒ awaiting response ⇒ unaware of leader change until next metadata refresh New leader for partition 1 Old leader for partition 1
  88. 88. Sender Accumulator Reproducing the issue Broker A Broker B Partition 1 Partition 2 Partition n send in-flight requests So client continues to send to partition 1 New leader for partition 1 Old leader for partition 1
  89. 89. Sender Accumulator Reproducing the issue Broker A Broker B Partition 2 Partition n send batches pile up in partition 1 and eat up accumulator memory in-flight requests New leader for partition 1 Old leader for partition 1
  90. 90. Sender Accumulator Reproducing the issue Broker B Partition 2 Partition n send in-flight requests subsequent sends drop/block per block.on.buffer.full config New leader for partition 1 Old leader for partition 1 Broker A
  91. 91. Reproducing the issue ● netstat tcp 0 0 ::ffff:127.0.0.1:35938 ::ffff:127.0.0.1:9092 ESTABLISHED 3704/java ● Producer metrics ○ zero retry/error rate ● Thread dump java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(long, TimeUnit) org.apache.kafka.clients.producer.internals.BufferPool.allocate(int) org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) ● Resolved by KAFKA-2120 (KIP-19)
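For reference, a hedged sketch of the producer settings that close this gap once KIP-19 / KAFKA-2120 landed (broker address and topic are placeholders, and these configs only exist in client versions that include the fix):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class BoundedSendProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
            props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            // KIP-19 (KAFKA-2120): fail an in-flight request if no response arrives in time,
            // instead of waiting forever on a dead or partitioned leader.
            props.put("request.timeout.ms", "30000");
            // Bound how long send() may block when the accumulator is full
            // (supersedes the old block.on.buffer.full behavior in later client versions).
            props.put("max.block.ms", "10000");

            try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("PageViewEvent", "hello".getBytes()),
                              (metadata, exception) -> {
                                  if (exception != null) {
                                      // Surfaces a timeout rather than silently piling up batches.
                                      exception.printStackTrace();
                                  }
                              });
                producer.flush();
            }
        }
    }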
  92. 92. Cluster unavailability (This is an abridged version of my earlier talk.)
  93. 93. The incident Occurred a few days after upgrading to pick up quotas and SSL [upgrade timeline: April 5, June 3, August 18, October 13 — x25 → x38, picking up multi-port (KAFKA-1809, KAFKA-1928), various quota patches, and SSL (KAFKA-1690)]
  94. 94. The incident Broker (which happened to be controller) failed in our queuing Kafka cluster
  95. 95. The incident Multiple applications begin to report “issues”: socket timeouts to Kafka cluster Posts search was one such impacted application
  96. 96. The incident Two brokers report high request and response queue sizes
  97. 97. The incident Two brokers report high request queue size and request latencies
  98. 98. The incident ● Other observations ○ High CPU load on those brokers ○ Throughput degrades to ~ half the normal throughput ○ Tons of broken pipe exceptions in server logs ○ Application owners report socket timeouts in their logs
  99. 99. Remediation Shifted site traffic to another data center. Kafka outage ⇒ member impact; multi-colo is critical!
  100. 100. Remediation ● Controller moves did not help ● Firewall the affected brokers ● The above helped, but cluster fell over again after dropping the rules ● Suspect misbehaving clients on broker failure ○ … but x25 never exhibited this issue sudo iptables -A INPUT -p tcp --dport <broker-port> -s <other-broker> -j ACCEPT sudo iptables -A INPUT -p tcp --dport <broker-port> -j DROP
  101. 101. Remediation Friday night ⇒ roll-back to x25 and debug later … but SREs had to babysit the rollback x38 x38 x38 x38 Rolling downgrade
  102. 102. Remediation Friday night ⇒ roll-back to x25 and debug later … but SREs had to babysit the rollback x38 x38 x38 x38 Rolling downgrade Move leaders
  103. 103. Remediation Friday night ⇒ roll-back to x25 and debug later … but SREs had to babysit the rollback x38 x38 x38 x38 Rolling downgrade Firewall
  104. 104. Remediation Friday night ⇒ roll-back to x25 and debug later … but SREs had to babysit the rollback x38 x38 x38 Rolling downgrade Firewall x25
  105. 105. Remediation Friday night ⇒ roll-back to x25 and debug later … but SREs had to babysit the rollback x38 x38 x38 Rolling downgrade x25 Move leaders
  106. 106. ● Test cluster ○ Tried killing controller ○ Multiple rolling bounces ○ Could not reproduce ● Upgraded the queuing cluster to x38 again ○ Could not reproduce ● So nothing… Attempts at reproducing the issue
  107. 107. Unraveling queue backups…
  108. 108. Life-cycle of a Kafka request [diagram: network layer (client connections, acceptor, processors, request queue, response queues) and API layer (API handlers, purgatory, quota manager)] Total time = queue-time (await handling) + local-time (handle request) + remote-time (long-poll requests) + quota-time (hold if quota violated) + response-queue-time (await processor) + response-send-time (write response)
  109. 109. Investigating high request times ● First look for high local time ○ then high response send time ■ then high remote (purgatory) time → generally non-issue (but caveats described later) ● High request queue/response queue times are effects, not causes
  110. 110. High local times during incident (e.g., fetch)
  111. 111. How are fetch requests handled? ● Get physical offsets to be read from local log during response ● If fetch from follower (i.e., replica fetch): ○ If follower was out of ISR and just caught-up then expand ISR (ZooKeeper write) ○ Maybe satisfy eligible delayed produce requests (with acks -1) ● Else (i.e., consumer fetch): ○ Record/update byte-rate of this client ○ Throttle the request on quota violation
  112. 112. Could these cause high local times? ● Get physical offsets to be read from local log during response ● If fetch from follower (i.e., replica fetch): ○ If follower was out of ISR and just caught-up then expand ISR (ZooKeeper write) ○ Maybe satisfy eligible delayed produce requests (with acks -1) ● Else (i.e., consumer fetch): ○ Record/update byte-rate of this client ○ Throttle the request on quota violation Not using acks -1 Should be fast Should be fast Delayed outside API thread Test this…
  113. 113. Maintains byte-rate metrics on a per-client-id basis 2015/10/10 03:20:08.393 [] [] [] [logger] Completed request:Name: FetchRequest; Version: 0; CorrelationId: 0; ClientId: 2c27cc8b_ccb7_42ae_98b6_51ea4b4dccf2; ReplicaId: -1; MaxWait: 0 ms; MinBytes: 0 bytes from connection <clientIP>:<brokerPort>-<localAddr>;totalTime:6589, requestQueueTime:6589,localTime:0,remoteTime:0,responseQueueTime:0,sendTime:0, securityProtocol:PLAINTEXT,principal:ANONYMOUS Quota metrics ??!
  114. 114. Quota metrics - a quick benchmark for (clientId ← 0 until N) { timer.time { quotaMetrics.recordAndMaybeThrottle(clientId, 0, DefaultCallBack) } }
  115. 115. Quota metrics - a quick benchmark
  116. 116. Quota metrics - a quick benchmark Fixed in KAFKA-2664
  117. 117. meanwhile in our queuing cluster… due to climbing client-id counts
  118. 118. Rolling bounce of cluster forced the issue to recur on brokers that had high client-id metric counts ○ Used jmxterm to check per-client-id metric counts before experiment ○ Hooked up profiler to verify during incident ■ Generally avoid profiling/heapdumps in production due to interference ○ Did not see in earlier rolling bounce due to only a few client-id metrics at the time
  119. 119. How to fix high local times ● Optimize the request’s handling, e.g.: ○ cached topic metadata as opposed to ZooKeeper reads (see KAFKA-901) ○ and KAFKA-1356 ● Make it asynchronous ○ E.g., we will do this for StopReplica in KAFKA-1911 ● Put it in a purgatory (usually if response depends on some condition); but be aware of the caveats: ○ Higher memory pressure if request purgatory size grows ○ Expired requests are handled in purgatory expiration thread (which is good) ○ but satisfied requests are handled in API thread of satisfying request ⇒ if a request satisfies several delayed requests then local time can increase for the satisfying request
  120. 120. ● Request queue size ● Response queue sizes ● Request latencies: ○ Total time ○ Local time ○ Response send time ○ Remote time ● Request handler pool idle ratio Monitor these closely!
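One way to watch these from outside the broker is plain JMX polling; the sketch below shows the idea (the JMX port is a placeholder, and the mbean names are the commonly documented ones for brokers of this era — verify them against your version):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class BrokerRequestMetricsProbe {
        public static void main(String[] args) throws Exception {
            // Hypothetical JMX endpoint of one broker.
            JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();

                // Request queue size.
                ObjectName reqQueue = new ObjectName("kafka.network:type=RequestChannel,name=RequestQueueSize");
                System.out.println("RequestQueueSize = " + mbs.getAttribute(reqQueue, "Value"));

                // Per-request-type latency breakdown, e.g. consumer fetch total time.
                ObjectName fetchTotalTime = new ObjectName(
                        "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer");
                System.out.println("Fetch TotalTimeMs (mean) = " + mbs.getAttribute(fetchTotalTime, "Mean"));

                // Request handler pool idle ratio.
                ObjectName idleRatio = new ObjectName(
                        "kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent");
                System.out.println("RequestHandlerAvgIdlePercent (1m rate) = "
                        + mbs.getAttribute(idleRatio, "OneMinuteRate"));
            }
        }
    }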
  121. 121. Breaking compatibility
  122. 122. The first incident: new clients old clusters Test cluster (old version) Certification cluster (old version) Metrics cluster (old version) metric events metric events
  123. 123. The first incident: new clients old clusters Test cluster (new version) Certification cluster (old version) Metrics cluster (old version) metric events metric events org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'throttle_time_ms': java.nio. BufferUnderflowException at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:73) at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:397) ...
  124. 124. New clients old clusters: remediation Test cluster (new version) Certification cluster (new version) Metrics cluster (old version) metric events metric events Set acks to zero
  125. 125. New clients old clusters: remediation Test cluster (new version) Certification cluster (new version) Metrics cluster (new version) metric events metric events Reset acks to 1
  126. 126. New clients old clusters: remediation (BTW this just hit us again with the protocol changes in KIP-31/KIP-32) KIP-35 would help a ton!
  127. 127. The second incident: new endpoints — ZooKeeper registration. Older broker versions (x14): { "version":1, "jmx_port":9999, "timestamp":2233345666, "host":"localhost", "port":9092 }. Newer brokers: { "version":2, "jmx_port":9999, "timestamp":2233345666, "host":"localhost", "port":9092, "endpoints": ["plaintext://localhost:9092"] }. x14 client (old client): ignores endpoints; v2 ⇒ use endpoints
  128. 128. The second incident: new endpoints — x36 brokers with SSL enabled register { "version":2, "jmx_port":9999, "timestamp":2233345666, "host":"localhost", "port":9092, "endpoints": ["plaintext://localhost:9092", "ssl://localhost:9093"] }. The x14 client fails with: java.lang.IllegalArgumentException: No enum constant org.apache.kafka.common.protocol.SecurityProtocol.SSL at java.lang.Enum.valueOf(Enum.java:238) at org.apache.kafka.common.protocol.SecurityProtocol.valueOf(SecurityProtocol.java:24)
  129. 129. New endpoints: remediation — register with "version":1 (instead of 2) while keeping the endpoints array, so old clients see v1 ⇒ ignore endpoints
  130. 130. New endpoints: remediation — x36 clients treat the v1 registration specially: v1 ⇒ use endpoints if present
  131. 131. New endpoints: remediation ● Fix in KAFKA-2584 ● Also related: KAFKA-3100
  132. 132. Power outage
  133. 133. Widespread FS corruption after power outage ● Mount settings at the time ○ type ext4 (rw,noatime,data=writeback,commit=120) ● Restarts were successful but brokers subsequently hit corruption ● Subsequent restarts also hit corruption in index files
  134. 134. Summary
  135. 135. ● Monitoring beyond per-broker/controller metrics ○ Validate SLAs ○ Continuously test admin functionality (in test clusters) ● Automate release validation ● https://github.com/linkedin/streaming Kafka monitor Kafka cluster producer Monitor instance ackLatencyMs e2eLatencyMs duplicateRate retryRate failureRate lossRate consumer Availability %
  136. 136. ● Monitoring beyond per-broker/controller metrics ○ Validate SLAs ○ Continuously test admin functionality (in test clusters) ● Automate release validation ● https://github.com/linkedin/streaming Kafka monitor Kafka cluster producer Monitor instance ackLatencyMs e2eLatencyMs duplicateRate retryRate failureRate lossRate consumer Monitor instance Admin Utils Monitor instance checkReassign checkPLE
  137. 137. Q&A
  138. 138. Software developers and Site Reliability Engineers at all levels Streams infrastructure @ LinkedIn ● Kafka pub-sub ecosystem ● Stream processing platform built on Apache Samza ● Next Gen Change capture technology (incubating) Contact Kartik Paramasivam Where LinkedIn campus 2061 Stierlin Ct., Mountain View, CA When May 11 at 6.30 PM Register http://bit.ly/1Sv8ach We are hiring! LinkedIn Data Infrastructure meetup
