Troubleshooting Kafka’s socket server
from incident to resolution
Joel Koshy
LinkedIn
LinkedIn’s Kafka deployment is nearing 1300 brokers that move close to 1.3 trillion messages a day. While operating Kafka smoothly even at this scale is a testament to both Kafka’s scalability and the operational expertise of LinkedIn SREs, we occasionally run into some very interesting bugs. In this talk I will dive into a production issue that we recently encountered as an example of how even a subtle bug can suddenly manifest at scale and cause a near meltdown of the cluster. We will go over how we detected and responded to the situation, how we investigated it after the fact, and summarize some lessons learned and best practices from this incident.

  1. Troubleshooting Kafka’s socket server: from incident to resolution (Joel Koshy, LinkedIn)
  2. The incident
     Occurred a few days after upgrading to pick up quotas and SSL
     ● Multi-port: KAFKA-1809, KAFKA-1928
     ● SSL: KAFKA-1690
     [Timeline graphic: x25, various quota patches, x38 deployments; dates April 5, June 3, August 18, October 13]
  3. The incident
     Broker (which happened to be the controller) failed in our queuing Kafka cluster
  4. The incident
     ● Alerts fire; NOC engages SYSOPS/Kafka-SRE
     ● Kafka-SRE restarts the broker
     ● Broker failure does not generally cause prolonged application impact
       ○ but in this incident…
  5. The incident
     Multiple applications begin to report “issues”: socket timeouts to Kafka cluster
     Posts search was one such impacted application
  6. The incident
     Two brokers report high request and response queue sizes
  7. The incident
     Two brokers report high request queue size and request latencies
  8. The incident
     ● Other observations
       ○ High CPU load on those brokers
       ○ Throughput degrades to ~ half the normal throughput
       ○ Tons of broken pipe exceptions in server logs
       ○ Application owners report socket timeouts in their logs
  9. Remediation
     Shifted site traffic to another data center
     “Kafka outage ⇒ member impact
     Multi-colo is critical!
  10. Remediation
     ● Controller moves did not help
     ● Firewall the affected brokers
       sudo iptables -A INPUT -p tcp --dport <broker-port> -s <other-broker> -j ACCEPT
       sudo iptables -A INPUT -p tcp --dport <broker-port> -j DROP
     ● The above helped, but the cluster fell over again after dropping the rules
     ● Suspect misbehaving clients on broker failure
       ○ … but x25 never exhibited this issue
  11. “Good dashboards/alerts
     Skilled operators
     Clear communication
     Audit/log all operations
     CRM principles apply to operations
  12.-16. Remediation
     Friday night ⇒ roll back to x25 and debug later
     … but SREs had to babysit the rollback
     [Diagram, built up over slides 12-16: rolling downgrade of the x38 brokers, one at a time: move leaders off the broker, firewall it, downgrade it to x25, move leaders, repeat]
  17. … oh and BTW
     be careful when saving a lot of public-access/server logs:
     ● Can cause long GC pauses
       [Times: user=0.39 sys=0.01, real=8.01 secs]
     ● Use ionice or rsync --bwlimit
     [Chart callouts: low, low, high]
  18. The investigation
  19. Attempts at reproducing the issue
     ● Test cluster
       ○ Tried killing the controller
       ○ Multiple rolling bounces
       ○ Could not reproduce
     ● Upgraded the queuing cluster to x38 again
       ○ Could not reproduce
     ● So nothing…
  20. Understanding queue backups…
  21.-30. Life-cycle of a Kafka request
     [Diagram, built up over slides 21-30]
     Network layer: Acceptor, Processors, request queue, response queues (one per processor), client connections
     API layer: API handlers, Purgatory, Quota manager
     ● Acceptor hands new connections to a Processor
     ● Processor reads the request and then turns off read interest from that connection (for ordering)
     ● Request awaits handling in the request queue → queue-time
     ● API handler handles the request → local-time
     ● Long-poll requests sit in purgatory → remote-time
     ● Response is held if a quota is violated → quota-time
     ● Response awaits its processor in the response queue → response-queue-time
     ● Processor writes the response → response-send-time
     ● then turns read interest back on for that connection
     Total time = queue-time + local-time + remote-time + quota-time + response-queue-time + response-send-time
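To keep the time components straight, here is a minimal Scala sketch (my own model, not Kafka's classes; the field names simply mirror the slide terminology) of the breakdown the slides build up:

    // Minimal sketch: models the per-request time breakdown described above.
    case class RequestTimeBreakdown(
      requestQueueTimeMs: Long,  // waiting in the request queue for an API handler
      localTimeMs: Long,         // time spent in the API handler itself
      remoteTimeMs: Long,        // time parked in purgatory (e.g., long-poll fetches)
      quotaTimeMs: Long,         // time held back because a quota was violated
      responseQueueTimeMs: Long, // waiting for the processor to pick up the response
      responseSendTimeMs: Long   // writing the response back to the client
    ) {
      def totalTimeMs: Long =
        requestQueueTimeMs + localTimeMs + remoteTimeMs +
          quotaTimeMs + responseQueueTimeMs + responseSendTimeMs
    }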
  31.-33. Investigating high request times
     ● Total time is useful for monitoring
     ● but high total time is not necessarily bad
     [Charts over slides 31-33: one cluster with low total time, one with high but “normal” total time (requests parked in purgatory)]
  34. Investigating high request times
     ● First look for high local time
       ○ then high response send time
         ■ then high remote (purgatory) time → generally a non-issue (but caveats described later)
     ● High request queue/response queue times are effects, not causes
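The triage order above can be written down as a tiny helper. This is an illustrative sketch with an arbitrary "high" threshold, not anything from the talk:

    // Encodes the suggested order: local time first, then response send, then remote time.
    def triage(localMs: Long, sendMs: Long, remoteMs: Long,
               requestQueueMs: Long, responseQueueMs: Long,
               highMs: Long = 1000): String = {
      if (localMs >= highMs) "High local time: look at what the API handler is doing"
      else if (sendMs >= highMs) "High response send time: slow client/network or very large responses"
      else if (remoteMs >= highMs) "High remote (purgatory) time: usually benign long-polling, but check the caveats"
      else if (requestQueueMs >= highMs || responseQueueMs >= highMs)
        "High queue times: an effect of one of the above, not a root cause"
      else "Nothing obviously wrong"
    }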
  35. High local times during incident (e.g., fetch)
  36. How are fetch requests handled?
     ● Get physical offsets to be read from local log during response
     ● If fetch from follower (i.e., replica fetch):
       ○ If follower was out of ISR and just caught up, then expand ISR (ZooKeeper write)
       ○ Maybe satisfy eligible delayed produce requests (with acks -1)
     ● Else (i.e., consumer fetch):
       ○ Record/update byte-rate of this client
       ○ Throttle the request on quota violation
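A simplified stand-in for the branching described on this slide (illustrative names only, not broker code):

    // Returns the list of actions a fetch would trigger, per the slide above.
    object FetchHandlingSketch {
      sealed trait FetchKind
      case object ReplicaFetch extends FetchKind   // fetch from a follower broker
      case object ConsumerFetch extends FetchKind  // fetch from a client consumer

      def fetchActions(kind: FetchKind, followerJustCaughtUp: Boolean, quotaViolated: Boolean): List[String] = {
        val base = List("resolve physical offsets from the local log")
        kind match {
          case ReplicaFetch =>
            base ++
              (if (followerJustCaughtUp) List("expand ISR (ZooKeeper write)") else Nil) ++
              List("maybe satisfy delayed produce requests waiting for acks = -1")
          case ConsumerFetch =>
            base ++
              List("record/update byte-rate for this client-id") ++
              (if (quotaViolated) List("throttle the request") else Nil)
        }
      }
    }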
  37. Could these cause high local times?
     ● Get physical offsets to be read from local log during response
     ● If fetch from follower (i.e., replica fetch):
       ○ If follower was out of ISR and just caught up, then expand ISR (ZooKeeper write)
       ○ Satisfy eligible delayed produce requests (with acks -1)
     ● Else (i.e., consumer fetch):
       ○ Record/update byte-rate of this client
       ○ Throttle the request on quota violation
     [Slide callouts: Not using acks -1 / Should be fast / Maybe / Should be fast / Delayed outside API thread]
  38. ISR churn?
  39. ISR churn? … unlikely
     ● Low ZooKeeper write latencies
     ● Churn in this incident: effect of some other root cause
     ● Long request queues can cause churn
       ○ ⇒ follower fetches time out
         ■ ⇒ fall out of ISR (ISR shrink happens asynchronously in a separate thread)
       ○ Outstanding fetch catches up and ISR expands
  40. High local times during incident (e.g., fetch)
     Besides, fetch-consumer (not just follower) has high local time
  41. Could these cause high local times?
     (same list as slide 37)
     [Slide callouts: Not using acks -1 / Should be fast / Should be fast / Delayed outside API thread / Test this…]
  42. Maintains byte-rate metrics on a per-client-id basis
     2015/10/10 03:20:08.393 [] [] [] [logger] Completed request:Name: FetchRequest; Version: 0; CorrelationId: 0; ClientId: 2c27cc8b_ccb7_42ae_98b6_51ea4b4dccf2; ReplicaId: -1; MaxWait: 0 ms; MinBytes: 0 bytes from connection <clientIP>:<brokerPort>-<localAddr>;totalTime:6589, requestQueueTime:6589,localTime:0,remoteTime:0,responseQueueTime:0,sendTime:0, securityProtocol:PLAINTEXT,principal:ANONYMOUS
     Quota metrics ??!
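The request log line above already carries the full time breakdown. A quick parsing sketch in Scala (the field names are taken from the logged record; the helper itself is mine):

    object RequestLogTimes {
      val timeFields = List("totalTime", "requestQueueTime", "localTime",
                            "remoteTime", "responseQueueTime", "sendTime")

      // Pull "<field>:<millis>" pairs out of a request log line like the one above.
      def parseTimes(logLine: String): Map[String, Long] =
        timeFields.flatMap { field =>
          s"$field:(\\d+)".r.findFirstMatchIn(logLine).map(m => field -> m.group(1).toLong)
        }.toMap
    }

    // For the line shown above this yields Map("totalTime" -> 6589, "requestQueueTime" -> 6589,
    // "localTime" -> 0, ...), i.e. a request that spent all its time waiting in the request queue.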
  43. Quota metrics - a quick benchmark
     for (clientId ← 0 until N) {
       timer.time {
         quotaMetrics.recordAndMaybeThrottle(sensorId, 0, DefaultCallBack)
       }
     }
  44. Quota metrics - a quick benchmark
  45. Quota metrics - a quick benchmark
     Fixed in KAFKA-2664
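The snippet on slide 43 calls Kafka's internal quotaMetrics. As a self-contained stand-in, here is a sketch that times a per-client-id recordBytes call against a deliberately naive registry, to show how per-request cost can climb as distinct client-ids accumulate. The registry, names, and sizes are all illustrative, not the actual KAFKA-2664 code path:

    import scala.collection.mutable

    object QuotaMetricsBenchmarkSketch extends App {
      final case class Sensor(clientId: String, var bytes: Long = 0L)
      val sensors = mutable.ListBuffer.empty[Sensor]   // deliberately O(n) lookup

      // Look up (or create) the sensor for a client-id and record a value.
      def recordBytes(clientId: String, value: Long): Unit = {
        val sensor = sensors.find(_.clientId == clientId).getOrElse {
          val s = Sensor(clientId); sensors += s; s
        }
        sensor.bytes += value
      }

      for (n <- Seq(1000, 5000, 20000)) {
        sensors.clear()
        val start = System.nanoTime()
        for (clientId <- 0 until n) recordBytes(s"client-$clientId", 0L)
        val elapsedMs = (System.nanoTime() - start) / 1e6
        println(f"$n%6d client-ids: $elapsedMs%.1f ms total")
      }
    }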
  46. meanwhile in our queuing cluster…
     due to climbing client-id counts
  47. Rolling bounce of cluster forced the issue to recur on brokers that had high client-id metric counts
     ○ Used jmxterm to check per-client-id metric counts before the experiment (see the JMX sketch below)
     ○ Hooked up a profiler to verify during the incident
       ■ Generally avoid profiling/heapdumps in production due to interference
     ○ Did not see it in the earlier rolling bounce due to only a few client-id metrics at the time
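The same per-client-id metric count check can be done programmatically over JMX (the talk used jmxterm). In this sketch the JMX URL, port, and MBean ObjectName pattern are assumptions; adjust them to whatever your broker actually exposes:

    import javax.management.ObjectName
    import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}
    import scala.jdk.CollectionConverters._

    object ClientIdMetricCount extends App {
      // Assumed JMX endpoint for the broker.
      val url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi")
      val connector = JMXConnectorFactory.connect(url)
      try {
        val mbs = connector.getMBeanServerConnection
        // Assumed pattern: one MBean per client-id for the fetch quota sensors.
        val pattern = new ObjectName("kafka.server:type=Fetch,client-id=*")
        val names = mbs.queryNames(pattern, null).asScala
        println(s"per-client-id MBeans: ${names.size}")
      } finally connector.close()
    }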
  48.-49. Troubleshooting: macro vs micro
     MACRO (generally more effective)
     ● Observe week-over-week trends
     ● Formulate theories
     ● Test theory (micro-experiments)
     ● Deploy fix and validate
     MICRO (live debugging; sometimes warranted, but invasive and tedious)
     ● Instrumentation
     ● Attach profilers
     ● Take heapdumps
     ● Trace-level logs, tcpdump, etc.
  50. How to fix high local times
     ● Optimize the request’s handling, e.g.:
       ○ cached topic metadata as opposed to ZooKeeper reads (see KAFKA-901)
       ○ and KAFKA-1356
     ● Make it asynchronous
       ○ E.g., we will do this for StopReplica in KAFKA-1911 (see the sketch below)
     ● Put it in a purgatory (usually if the response depends on some condition); but be aware of the caveats:
       ○ Higher memory pressure if the request purgatory size grows
       ○ Expired requests are handled in the purgatory expiration thread (which is good)
       ○ but satisfied requests are handled in the API thread of the satisfying request ⇒ if a request satisfies several delayed requests, then local time can increase for the satisfying request
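A generic sketch of the "make it asynchronous" option: push the slow part of the handler onto a separate pool so the API thread is released quickly. This is a pattern illustration under assumed names, not the actual KAFKA-1911 change:

    import java.util.concurrent.Executors
    import scala.concurrent.{ExecutionContext, Future}

    object AsyncHandlingSketch {
      // Dedicated pool so slow work cannot starve the request handler threads.
      private val slowWorkPool =
        ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(2))

      def handleStopReplica(deletePartition: Boolean): Unit = {
        if (deletePartition) {
          Future {
            // long-running log deletion would happen here, off the API thread
          }(slowWorkPool)
        }
        // respond immediately; the API handler's local time stays low
      }
    }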
  51. as for rogue clients…
     2015/10/10 03:20:08.393 [] [] [] [logger] Completed request:Name: FetchRequest; Version: 0; CorrelationId: 0; ClientId: 2c27cc8b_ccb7_42ae_98b6_51ea4b4dccf2; ReplicaId: -1; MaxWait: 0 ms; MinBytes: 0 bytes from connection <clientIP>:<brokerPort>-<localAddr>;totalTime:6589, requestQueueTime:6589,localTime:0,remoteTime:0,responseQueueTime:0,sendTime:0, securityProtocol:PLAINTEXT,principal:ANONYMOUS
     “Get apps to use wrapper libraries that implement good client behavior, shield from API changes and so on…
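One concrete thing such a wrapper can enforce is a stable, human-readable client.id (rather than a fresh UUID per instance, which is what inflated the per-client-id metric count earlier). A minimal sketch using standard Kafka consumer config keys; the wrapper itself is illustrative and assumes the kafka-clients consumer:

    import java.util.Properties
    import org.apache.kafka.clients.consumer.KafkaConsumer

    object WrappedConsumer {
      // Factory that pins a stable client.id and sane serializers for every app.
      def create(appName: String, bootstrapServers: String, groupId: String): KafkaConsumer[Array[Byte], Array[Byte]] = {
        val props = new Properties()
        props.put("bootstrap.servers", bootstrapServers)
        props.put("group.id", groupId)
        props.put("client.id", appName)  // stable, human-readable client id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
        new KafkaConsumer[Array[Byte], Array[Byte]](props)
      }
    }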
  52. not done yet! this lesson needs repeating...
  53. After deploying the metrics fix to some clusters…
  54. After deploying the metrics fix to some clusters…
     [Graph annotations: Deployment; persistent URP (under-replicated partitions)]
  55. After deploying the metrics fix to some clusters…
     Applications also begin to report higher than usual consumer lag
  56. Root cause: zero-copy broke for plaintext
     [Graph annotations: Upgraded cluster; Rolled back; With fix (KAFKA-2517)]
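For context, "zero-copy" here refers to serving log data with sendfile, i.e. FileChannel.transferTo, so bytes move from the page cache to the socket without a round trip through user space. A standalone illustration of that call (not broker code):

    import java.io.RandomAccessFile
    import java.net.InetSocketAddress
    import java.nio.channels.SocketChannel

    object ZeroCopySendSketch {
      def sendFile(path: String, host: String, port: Int): Unit = {
        val file = new RandomAccessFile(path, "r")
        val socket = SocketChannel.open(new InetSocketAddress(host, port))
        try {
          val channel = file.getChannel
          var position = 0L
          while (position < channel.size()) {
            // transferTo may send fewer bytes than requested, so loop until done
            position += channel.transferTo(position, channel.size() - position, socket)
          }
        } finally { socket.close(); file.close() }
      }
    }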
  57. The lesson...
  58. Monitor these closely!
     ● Request queue size
     ● Response queue sizes
     ● Request latencies:
       ○ Total time
       ○ Local time
       ○ Response send time
       ○ Remote time
     ● Request handler pool idle ratio
  59. Continuous validation on trunk
  60. Any other high latency requests?
     Image courtesy of ©Nevit Dilmen
  61. Local times
     ConsumerMetadata, OffsetFetch, ControlledShutdown, Offsets (by time), Fetch, Produce, LeaderAndIsr, StopReplica (for delete=true), TopicMetadata, UpdateMetadata, OffsetCommit
     These are (typically 1:N) broker-to-broker requests
  62.-65. Broker-to-broker request latencies - less critical
     [Diagram, built up over slides 62-65: Network layer (Acceptor, Processor, request queue, response queue) and API layer (API handler, Purgatory)]
     ● Read bit is off, so a slow request ties up at most one API handler
     ● But if the requesting broker times out and retries… the retry ties up another processor and API handler
     ● Configure socket timeouts >> MAX(latency)
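The timeout guidance can be reduced to a simple sanity check; the 10x margin below is an arbitrary illustrative choice:

    // True if the configured inter-broker socket timeout comfortably exceeds the
    // worst request latency actually observed on the receiving broker.
    def socketTimeoutIsSafe(configuredTimeoutMs: Long, observedMaxLatencyMs: Long, margin: Int = 10): Boolean =
      configuredTimeoutMs >= observedMaxLatencyMs * margin

    // e.g. socketTimeoutIsSafe(configuredTimeoutMs = 30000, observedMaxLatencyMs = 1200) == true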
  66. Broker-to-broker request latency improvements
     ● Broker to controller
       ○ ControlledShutdown (KAFKA-1342)
     ● Controller to broker
       ○ StopReplica[delete = true] should be asynchronous (KAFKA-1911)
       ○ LeaderAndIsr: batch request - maybe worth optimizing or putting in a purgatory? Haven’t looked closely yet…
  67. The end
