Troubleshooting Kafka's socket server: from incident to resolution

Troubleshooting Kafka’s socket server
from incident to resolution
Joel Koshy
LinkedIn

The incident
Occurred a few days after upgrading to pick up quotas and SSL
Multi-port
KAFKA-1809
KAFKA-1928
SSL
KAFKA-1690
x25 x38
October 13
Various quota patches
June 3April 5 August 18

The incident
Broker (which happened to be controller) failed in our queuing Kafka cluster

The incident
● Alerts fire; NOC engages SYSOPS/Kafka-SRE
● Kafka-SRE restarts the broker
● Broker failure does not generally cause prolonged application impact
○ but in this incident…

The incident
Multiple applications begin to report “issues”: socket timeouts to Kafka cluster
Posts search was one such
impacted application

The incident
Two brokers report high request and response queue sizes

The incident
Two brokers report high request queue size and request latencies

The incident
● Other observations
○ High CPU load on those brokers
○ Throughput degrades to ~ half the normal throughput
○ Tons of broken pipe exceptions in server logs
○ Application owners report socket timeouts in their logs

Remediation
Shifted site traffic to another data center
“Kafka outage ⇒ member impact
Multi-colo is critical!

Remediation
● Controller moves did not help
● Firewall the affected brokers
● The above helped, but cluster fell over again after dropping the rules
● Suspect misbehaving clients on broker failure
○ … but x25 never exhibited this issue
sudo iptables -A INPUT -p tcp --dport <broker-port> -s <other-broker> -j ACCEPT
sudo iptables -A INPUT -p tcp --dport <broker-port> -j DROP

“Good dashboards/alerts
Skilled operators
Clear communication
Audit/log all operations
CRM principles apply to operations

Remediation
Friday night ⇒ roll-back to x25 and debug later
… but SREs had to babysit the rollback
x38 x38 x38 x38
Rolling downgrade

Remediation
x38 x38 x38 x38
Rolling downgrade
Move leaders

Remediation
x38 x38 x38 x38
Rolling downgrade
Firewall

Remediation
x38 x38 x38
Rolling downgrade
Firewall
x25

Remediation
x38 x38 x38
Rolling downgrade
x25
Move leaders

… oh and BTW
be careful when saving a lot of public-access/server logs:
● Can cause GC
[Times: user=0.39 sys=0.01, real=8.01 secs]
● Use ionice or rsync --bwlimit
low
low
high

● Test cluster
○ Tried killing controller
○ Multiple rolling bounces
○ Could not reproduce
● Upgraded the queuing cluster to x38 again
○ Could not reproduce
● So nothing…
Attempts at reproducing the issue

Understanding queue backups…

API layer
Life-cycle of a Kafka request
Network layer
Acceptor
Processor
Processor
Processor
Processor
Response queue
Response queue
Response queue
Response queue
Request queue
API handler
API handler
API handler
Clientconnections
Purgatory
Quota manager

API layer
Network layer
Acceptor
Processor
Processor
Processor
Processor
Response queue
Response queue
Response queue
Response queue
Request queue
API handler
API handler
API handler
Clientconnections
Purgatory
Quota manager
New
connections

API layer
Network layer
Acceptor
Processor
Processor
Processor
Processor
Response queue
Response queue
Response queue
Response queue
Request queue
API handler
API handler
API handler
Clientconnections
Purgatory
New
connections
Read
request
Quota manager
and then turn off
read interest from
that connection
(for ordering)

API layer
Network layer
Acceptor
Processor
Processor
Processor
Processor
Response queue
Response queue
Response queue
Response queue
Request queue
API handler
API handler
API handler
Clientconnections
Purgatory
New
connections
Read
request
Quota manager
Await handling
Total time = queue-time

API layer
Network layer
Acceptor
Processor
Processor
Processor
Processor
Response queue
Response queue
Response queue
Response queue
Request queue
API handler
API handler
API handler
Clientconnections
Purgatory
New
connections
Read
request
Quota manager
Await handling
Handle request
+ local-time

API layer
Network layer
Acceptor
Processor
Processor
Processor
Processor
Response queue
Response queue
Response queue
Response queue
Request queue
API handler
API handler
API handler
Clientconnections
Purgatory
New
connections
Read
request
Quota manager
Await handling
Handle request
+ local-time + remote-time
long-poll requests

API layer
Network layer
Acceptor
Processor
Processor
Processor
Processor
Response queue
Response queue
Response queue
Response queue
Request queue
API handler
API handler
API handler
Clientconnections
Purgatory
New
connections
Read
request
Quota manager
Await handling
Handle request
long-poll requests
Hold if quota
violated
+ quota-time

API layer
Network layer
Acceptor
Processor
Processor
Processor
Processor
Response queue
Response queue
Response queue
Response queue
Request queue
API handler
API handler
API handler
Clientconnections
Purgatory
New
connections
Read
request
Quota manager
Await handling
Handle request
long-poll requests
Hold if quota
violated
+ quota-time
Await processor
+ response-queue-time

API layer
Network layer
Acceptor
Processor
Processor
Processor
Processor
Response queue
Response queue
Response queue
Response queue
Request queue
API handler
API handler
API handler
Clientconnections
Purgatory
New
connections
Read
request
Quota manager
Await handling
Handle request
long-poll requests
Hold if quota
violated
+ quota-time
Await processor
Write
response
+ response-send-time

API layer
Network layer
Acceptor
Processor
Processor
Processor
Processor
Response queue
Response queue
Response queue
Response queue
Request queue
API handler
API handler
API handler
Clientconnections
Purgatory
New
connections
Read
request
Quota manager
Await handling
Handle request
long-poll requests
Hold if quota
violated
+ quota-time
Await processor
Write
response
+ response-send-time
Turn read interest
back on from that
connection

Investigating high request times
● Total time is useful for monitoring
● but high total time is not necessarily bad

Low

Low
High
but “normal”
(purgatory)

● First look for high local time
○ then high response send time
■ then high remote (purgatory) time → generally non-issue (but caveats described later)
● High request queue/response queue times are effects, not causes

High local times during incident (e.g., fetch)

How are fetch requests handled?
● Get physical offsets to be read from local log during response
● If fetch from follower (i.e., replica fetch):
○ If follower was out of ISR and just caught-up then expand ISR (ZooKeeper write)
○ Maybe satisfy eligible delayed produce requests (with acks -1)
● Else (i.e., consumer fetch):
○ Record/update byte-rate of this client
○ Throttle the request on quota violation

Could these cause high local times?
○ Satisfy eligible delayed produce requests (with acks -1)
Not using acks -1
Should be fast
Maybe
Should be fast
Delayed outside
API thread

● Low ZooKeeper write latencies
● Churn in this incident: effect of some other root cause
● Long request queues can cause churn
○ ⇒ follower fetches timeout
■ ⇒ fall out of ISR (ISR shrink happens asynchronously in separate thread)
○ Outstanding fetch catches up and ISR expands
ISR churn? … unlikely

High local times during incident (e.g., fetch)
Besides, fetch-consumer (not
just follower) has high local time

Could these cause high local times?
○ Satisfy eligible delayed produce requests (with acks -1)
Not using acks -1
Should be fast
Should be fast
Delayed outside
API thread
Test this…

Maintains byte-rate metrics on a per-client-id basis
2015/10/10 03:20:08.393 [] [] [] [logger] Completed request:Name: FetchRequest; Version: 0;
CorrelationId: 0; ClientId: 2c27cc8b_ccb7_42ae_98b6_51ea4b4dccf2; ReplicaId: -1; MaxWait: 0
ms; MinBytes: 0 bytes from connection <clientIP>:<brokerPort>-<localAddr>;totalTime:6589,
requestQueueTime:6589,localTime:0,remoteTime:0,responseQueueTime:0,sendTime:0,
securityProtocol:PLAINTEXT,principal:ANONYMOUS
Quota metrics
??!

Quota metrics - a quick benchmark
for (clientId ← 0 until N) {
timer.time {
quotaMetrics.recordAndMaybeThrottle(sensorId, 0, DefaultCallBack)
}
}

Fixed in KAFKA-2664

meanwhile in our queuing cluster…
due to climbing
client-id counts

Rolling bounce of cluster forced the issue to recur on brokers that had high client-
id metric counts
○ Used jmxterm to check per-client-id metric counts before experiment
○ Hooked up profiler to verify during incident
■ Generally avoid profiling/heapdumps in production due to interference
○ Did not see in earlier rolling bounce due to only a few client-id metrics at the time

MACRO
● Observe week-over-week trends
● Formulate theories
● Test theory (micro-experiments)
● Deploy fix and validate
MICRO (live debugging)
● Instrumentation
● Attach profilers
● Take heapdumps
● Trace-level logs, tcpdump, etc.
Troubleshooting: macro vs micro

MACRO
● Observe week-over-week trends
● Formulate theories
● Test theory (micro-experiments)
● Deploy fix and validate
MICRO (live debugging)
● Instrumentation
● Attach profilers
● Take heapdumps
● Trace-level logs, tcpdump, etc.
Troubleshooting: macro vs micro
Generally more effective Sometimes warranted,
but invasive and tedious

How to fix high local times
● Optimize the request’s handling. For e.g.,:
○ cached topic metadata as opposed to ZooKeeper reads (see KAFKA-901)
○ and KAFKA-1356
● Make it asynchronous
○ E.g., we will do this for StopReplica in KAFKA-1911
● Put it in a purgatory (usually if response depends on some condition); but be
aware of the caveats:
○ Higher memory pressure if request purgatory size grows
○ Expired requests are handled in purgatory expiration thread (which is good)
○ but satisfied requests are handled in API thread of satisfying request ⇒ if a request satisfies
several delayed requests then local time can increase for the satisfying request

as for rogue clients…
2015/10/10 03:20:08.393 [] [] [] [logger] Completed request:Name: FetchRequest; Version: 0;
CorrelationId: 0; ClientId: 2c27cc8b_ccb7_42ae_98b6_51ea4b4dccf2; ReplicaId: -1; MaxWait: 0
ms; MinBytes: 0 bytes from connection <clientIP>:<brokerPort>-<localAddr>;totalTime:6589,
requestQueueTime:6589,localTime:0,remoteTime:0,responseQueueTime:0,sendTime:0,
securityProtocol:PLAINTEXT,principal:ANONYMOUS
“Get apps to use wrapper libraries that
implement good client behavior,
shield from API changes and so on…

not done yet!
this lesson needs repeating...

After deploying the metrics fix to some clusters…

Deployment
Persistent
URP

Applications also begin to report
higher than usual consumer lag

Root cause: zero-copy broke for plaintext
Upgraded
cluster
Rolled
back
With fix
(KAFKA-2517)

● Request queue size
● Response queue sizes
● Request latencies:
○ Total time
○ Local time
○ Response send time
○ Remote time
● Request handler pool idle ratio
Monitor these closely!

Continuous validation on trunk

Local times
ConsumerMetadata OffsetFetch
ControlledShutdown Offsets (by time)
Fetch Produce
LeaderAndIsr StopReplica (for delete=true)
TopicMetadata UpdateMetadata
OffsetCommit
These are (typically
1:N) broker-to-broker
requests

API layerNetwork layer
Broker-to-broker request latencies - less critical
Processor Response queue
API handler
Purgatory
Acceptor
Read bit off so ties
up at most one API
handler
Request queue

API handler
Purgatory
Acceptor
Request queueBut if requesting broker
times out and retries…

API handler
Purgatory
Acceptor
Request queueBut if requesting broker
times out and retries…
API handler

API handler
Purgatory
Acceptor
Request queue
API handler
Configure socket timeouts
>> MAX(latency)

● Broker to controller
○ ControlledShutdown (KAFKA-1342)
● Controller to broker
○ StopReplica[delete = true] should be asynchronous (KAFKA-1911)
○ LeaderAndIsr: batch request - maybe worth optimizing or putting in a purgatory? Haven’t
looked closely yet…
Broker-to-broker request latency improvements

Troubleshooting Kafka's socket server: from incident to resolution

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Troubleshooting Kafka's socket server: from incident to resolution

Similar to Troubleshooting Kafka's socket server: from incident to resolution (20)

Recently uploaded

Recently uploaded (20)

Troubleshooting Kafka's socket server: from incident to resolution