Troubleshooting Kafka's socket server: from incident to resolution
LinkedIn’s Kafka deployment is nearing 1300 brokers that move close to 1.3 trillion messages a day. While operating Kafka smoothly at this scale is a testament to both Kafka’s scalability and the operational expertise of LinkedIn SREs, we occasionally run into some very interesting bugs. In this talk I will dive into a production issue that we recently encountered, as an example of how even a subtle bug can suddenly manifest at scale and cause a near meltdown of the cluster. We will go over how we detected and responded to the situation, how we investigated it after the fact, and summarize some lessons learned and best practices from this incident.
1.
Troubleshooting Kafka’s socket server
from incident to resolution
Joel Koshy
LinkedIn
2.
The incident
Occurred a few days after upgrading to pick up quotas and SSL
● Multi-port: KAFKA-1809, KAFKA-1928
● SSL: KAFKA-1690
● Various quota patches
[Timeline: x25 → x38; patch/upgrade dates April 5, June 3, August 18, October 13]
3.
The incident
A broker (which happened to be the controller) failed in our queuing Kafka cluster
4.
The incident
● Alerts fire; NOC engages SYSOPS/Kafka-SRE
● Kafka-SRE restarts the broker
● Broker failure does not generally cause prolonged application impact
○ but in this incident…
5.
The incident
Multiple applications begin to report “issues”: socket timeouts to Kafka cluster
Posts search was one such impacted application
6.
The incident
Two brokers report high request and response queue sizes
7.
The incident
Two brokers report high request queue size and request latencies
8.
The incident
● Other observations
○ High CPU load on those brokers
○ Throughput degrades to roughly half of normal
○ Tons of broken pipe exceptions in server logs
○ Application owners report socket timeouts in their logs
9.
Remediation
Shifted site traffic to another data center
“Kafka outage ⇒ member impact
Multi-colo is critical!
10.
Remediation
● Controller moves did not help
● Firewall the affected brokers
● The above helped, but cluster fell over again after dropping the rules
● Suspect misbehaving clients on broker failure
○ … but x25 never exhibited this issue
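# first, accept traffic from the other brokers on the broker port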
sudo iptables -A INPUT -p tcp --dport <broker-port> -s <other-broker> -j ACCEPT
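# then drop everything else (i.e., client traffic) to that port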
sudo iptables -A INPUT -p tcp --dport <broker-port> -j DROP
11.
“Good dashboards/alerts
Skilled operators
Clear communication
Audit/log all operations
CRM (crew resource management) principles apply to operations
12.
Remediation
Friday night ⇒ roll back to x25 and debug later
… but SREs had to babysit the rollback
[Diagram: rolling downgrade across the x38 brokers]
13.
Remediation
[Rolling downgrade, per broker: move leaders off the broker]
14.
Remediation
[Rolling downgrade, per broker: firewall the broker]
15.
Remediation
[Rolling downgrade, per broker: downgrade the firewalled broker to x25]
16.
Remediation
[Rolling downgrade, per broker: broker now on x25; move leaders back]
17.
… oh and BTW
be careful when saving/copying a lot of public-access/server logs:
● Can cause GC pauses (note real >> user + sys, i.e., the JVM was stalled on I/O):
[Times: user=0.39 sys=0.01, real=8.01 secs]
● Use ionice or rsync --bwlimit to limit the I/O impact
19.
Attempts at reproducing the issue
● Test cluster
○ Tried killing the controller
○ Multiple rolling bounces
○ Could not reproduce
● Upgraded the queuing cluster to x38 again
○ Could not reproduce
● So nothing…
21.
Life-cycle of a Kafka request
[Diagram: client connections feed the network layer (Acceptor, Processors, a shared request queue, per-processor response queues), which hands requests to the API layer (API handlers, purgatory, quota manager)]
22.
Life-cycle of a Kafka request
New connections are accepted by the Acceptor and handed to the processors
23.
Life-cycle of a Kafka request
Processor reads the request, and then turns off read interest from that connection (for ordering)
24.
Life-cycle of a Kafka request
Await handling (request sits in the request queue). Total time = queue-time
25.
Life-cycle of a Kafka request
Handle request. Total time = queue-time + local-time
26.
Life-cycle of a Kafka request
Long-poll requests wait in purgatory. Total time = queue-time + local-time + remote-time
27.
Life-cycle of a Kafka request
Hold the response if a quota is violated. Total time = queue-time + local-time + remote-time + quota-time
28.
Life-cycle of a Kafka request
Await processor (response sits in the response queue). Total time = queue-time + local-time + remote-time + quota-time + response-queue-time
29.
Life-cycle of a Kafka request
Write response. Total time = queue-time + local-time + remote-time + quota-time + response-queue-time + response-send-time
30.
Life-cycle of a Kafka request
Turn read interest back on for that connection
31.
Investigating high request times
● Total time is useful for monitoring
● but high total time is not necessarily bad
32.
Investigating high request times
● Total time is useful for monitoring
● but high total time is not necessarily bad
[Chart: total times are low]
33.
Investigating high request times
● Total time is useful for monitoring
● but high total time is not necessarily bad
[Chart: low total times vs. one that is high but “normal” (time spent in purgatory)]
34.
Investigating high request times
● First look for high local time
○ then high response send time
■ then high remote (purgatory) time → generally non-issue (but caveats described later)
● High request queue/response queue times are effects, not causes
35.
High local times during incident (e.g., fetch)
36.
How are fetch requests handled?
● Get physical offsets to be read from local log during response
● If fetch from follower (i.e., replica fetch):
○ If follower was out of ISR and just caught-up then expand ISR (ZooKeeper write)
○ Maybe satisfy eligible delayed produce requests (with acks -1)
● Else (i.e., consumer fetch):
○ Record/update byte-rate of this client
○ Throttle the request on quota violation
37.
Could these cause high local times?
● Get physical offsets to be read from local log during response → should be fast
● If fetch from follower (i.e., replica fetch):
○ If follower was out of ISR and just caught up, then expand ISR (ZooKeeper write) → maybe
○ Satisfy eligible delayed produce requests (with acks -1) → not using acks -1
● Else (i.e., consumer fetch):
○ Record/update byte-rate of this client → should be fast
○ Throttle the request on quota violation → delayed outside API thread
39.
ISR churn? … unlikely
● Low ZooKeeper write latencies
● Churn in this incident: an effect of some other root cause
● Long request queues can cause churn
○ ⇒ follower fetches time out
■ ⇒ fall out of ISR (ISR shrink happens asynchronously in a separate thread)
○ Outstanding fetch catches up and the ISR expands again
40.
High local times during incident (e.g., fetch)
Besides, fetch-consumer (not just follower) has high local time
41.
Could these cause high local times?
● Get physical offsets to be read from local log during response → should be fast
● If fetch from follower (i.e., replica fetch):
○ If follower was out of ISR and just caught up, then expand ISR (ZooKeeper write) → should be fast
○ Satisfy eligible delayed produce requests (with acks -1) → not using acks -1
● Else (i.e., consumer fetch):
○ Record/update byte-rate of this client → test this…
○ Throttle the request on quota violation → delayed outside API thread
45.
Quota metrics - a quick benchmark
Fixed in KAFKA-2664
46.
meanwhile in our queuing cluster…
due to climbing client-id counts
47.
Rolling bounce of the cluster forced the issue to recur on brokers that had high client-id metric counts
○ Used jmxterm to check per-client-id metric counts before the experiment (see the sketch below)
○ Hooked up a profiler to verify during the incident
■ Generally avoid profiling/heapdumps in production due to interference
○ Did not see this in the earlier rolling bounce because there were only a few client-id metrics at the time
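A minimal sketch of that kind of check done programmatically over JMX (we used jmxterm interactively; the JMX port, the kafka.server domain filter, and the "client-id" key are assumptions, and metric naming varies across Kafka versions):

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public final class ClientIdMetricCount {
    public static void main(String[] args) throws Exception {
        // args[0] is the broker's JMX endpoint, e.g. "broker1:9999" (illustrative).
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + args[0] + "/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Count kafka.server MBeans that carry a client-id key property,
            // a rough proxy for per-client-id metric growth.
            Set<ObjectName> beans = conn.queryNames(new ObjectName("kafka.server:*"), null);
            long clientIdBeans = beans.stream()
                    .filter(name -> name.getKeyProperty("client-id") != null)
                    .count();
            System.out.println("per-client-id MBeans: " + clientIdBeans);
        }
    }
}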
48.
Troubleshooting: macro vs micro
MACRO
● Observe week-over-week trends
● Formulate theories
● Test theory (micro-experiments)
● Deploy fix and validate
MICRO (live debugging)
● Instrumentation
● Attach profilers
● Take heapdumps
● Trace-level logs, tcpdump, etc.
49.
Troubleshooting: macro vs micro
MACRO (generally more effective)
● Observe week-over-week trends
● Formulate theories
● Test theory (micro-experiments)
● Deploy fix and validate
MICRO (live debugging; sometimes warranted, but invasive and tedious)
● Instrumentation
● Attach profilers
● Take heapdumps
● Trace-level logs, tcpdump, etc.
50.
How to fix high local times
● Optimize the request’s handling, e.g.:
○ cached topic metadata as opposed to ZooKeeper reads (see KAFKA-901 and KAFKA-1356)
● Make it asynchronous
○ E.g., we will do this for StopReplica in KAFKA-1911
● Put it in a purgatory (usually if the response depends on some condition); but be aware of the caveats:
○ Higher memory pressure if the request purgatory size grows
○ Expired requests are handled in the purgatory expiration thread (which is good)
○ but satisfied requests are handled in the API thread of the satisfying request ⇒ if a request satisfies several delayed requests, local time can increase for the satisfying request
51.
as for rogue clients…
2015/10/10 03:20:08.393 [] [] [] [logger] Completed request:Name: FetchRequest; Version: 0;
CorrelationId: 0; ClientId: 2c27cc8b_ccb7_42ae_98b6_51ea4b4dccf2; ReplicaId: -1; MaxWait: 0
ms; MinBytes: 0 bytes from connection <clientIP>:<brokerPort>-<localAddr>;totalTime:6589,
requestQueueTime:6589,localTime:0,remoteTime:0,responseQueueTime:0,sendTime:0,
securityProtocol:PLAINTEXT,principal:ANONYMOUS
“Get apps to use wrapper libraries that implement good client behavior, shield from API changes, and so on…
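As a sketch of what such a wrapper might enforce (a hypothetical factory class using the Java consumer’s standard fetch.min.bytes / fetch.max.wait.ms settings, so a client cannot issue MinBytes: 0, MaxWait: 0 fetches in a tight loop):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Hypothetical wrapper: callers supply bootstrap.servers, group.id, deserializers,
// etc.; the wrapper pins sane fetch settings so the broker is not hammered.
public final class SafeConsumerFactory {
    public static <K, V> KafkaConsumer<K, V> create(Properties userProps) {
        Properties props = new Properties();
        props.putAll(userProps);
        // Broker waits for at least 1 byte or up to 500 ms before responding.
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "1");
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "500");
        return new KafkaConsumer<>(props);
    }
}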
58.
● Request queue size
● Response queue sizes
● Request latencies:
○ Total time
○ Local time
○ Response send time
○ Remote time
● Request handler pool idle ratio
Monitor these closely!
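These map to standard Kafka JMX metrics (a hedged list; exact MBean names vary a bit across Kafka versions):
kafka.network:type=RequestChannel,name=RequestQueueSize
kafka.network:type=RequestChannel,name=ResponseQueueSize
kafka.network:type=RequestMetrics,name=TotalTimeMs,request=(Produce|FetchConsumer|FetchFollower|…)
kafka.network:type=RequestMetrics,name=LocalTimeMs,request=…
kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=…
kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=…
kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent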
61.
Local times
[Chart: local times by request type: ConsumerMetadata, ControlledShutdown, Fetch, LeaderAndIsr, OffsetCommit, OffsetFetch, Offsets (by time), Produce, StopReplica (for delete=true), TopicMetadata, UpdateMetadata]
ControlledShutdown, LeaderAndIsr, StopReplica, and UpdateMetadata are (typically 1:N) broker-to-broker requests
62.
Broker-to-broker request latencies - less critical
[Diagram: network layer (Acceptor, Processor, request queue, response queue) and API layer (API handler, purgatory)]
Read bit is off, so a slow broker-to-broker request ties up at most one API handler
63.
Broker-to-broker request latencies - less critical
But if the requesting broker times out and retries…
64.
Broker-to-broker request latencies - less critical
…the retried request ties up a second processor and API handler
65.
Broker-to-broker request latencies - less critical
Configure socket timeouts >> MAX(latency)
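For example, a hedged server.properties sketch (values are illustrative defaults, not recommendations; set them well above the worst-case request latency you observe):
# follower → leader fetch socket timeout
replica.socket.timeout.ms=30000
# controller → broker socket timeout (LeaderAndIsr, UpdateMetadata, StopReplica)
controller.socket.timeout.ms=30000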
66.
● Broker to controller
○ ControlledShutdown (KAFKA-1342)
● Controller to broker
○ StopReplica[delete = true] should be asynchronous (KAFKA-1911)
○ LeaderAndIsr: batch request - maybe worth optimizing or putting in a purgatory? Haven’t looked closely yet…
Broker-to-broker request latency improvements