Troubleshooting Kafka’s socket server
from incident to resolution
Joel Koshy
LinkedIn
The incident
Occurred a few days after upgrading from x25 to x38 to pick up quotas and SSL
[Timeline: April 5, June 3, August 18, October 13; between the x25 and x38 releases, multi-port support (KAFKA-1809, KAFKA-1928), SSL (KAFKA-1690), and various quota patches went in]
The incident
Broker (which happened to be the controller) failed in our queuing Kafka cluster
The incident
● Alerts fire; NOC engages SYSOPS/Kafka-SRE
● Kafka-SRE restarts the broker
● Broker failure does not generally cause prolonged application impact
○ but in this incident…
The incident
Multiple applications begin to report “issues”: socket timeouts to Kafka cluster
Posts search was one such impacted application
The incident
Two brokers report high request and response queue sizes
The incident
Two brokers report high request queue size and request latencies
The incident
● Other observations
○ High CPU load on those brokers
○ Throughput degrades to roughly half of normal
○ Tons of broken pipe exceptions in server logs
○ Application owners report socket timeouts in their logs
Remediation
Shifted site traffic to another data center
Kafka outage ⇒ member impact
Multi-colo is critical!
Remediation
● Controller moves did not help
● Firewall the affected brokers
● The above helped, but cluster fell over again after dropping the rules
● Suspect misbehaving clients on broker failure
○ … but x25 never exhibited this issue
# Allow traffic from other brokers on the broker port; drop everything else (i.e., clients)
sudo iptables -A INPUT -p tcp --dport <broker-port> -s <other-broker> -j ACCEPT
sudo iptables -A INPUT -p tcp --dport <broker-port> -j DROP
CRM principles apply to operations:
● Good dashboards/alerts
● Skilled operators
● Clear communication
● Audit/log all operations
Remediation
Friday night ⇒ roll-back to x25 and debug later
… but SREs had to babysit the rollback: a rolling downgrade in which each x38 broker in turn had its leaders moved away and was firewalled off before being downgraded to x25
… oh and BTW
be careful when saving a lot of public-access/server logs off the brokers:
● Can cause GC pauses, e.g. [Times: user=0.39 sys=0.01, real=8.01 secs]
○ user and sys time are low but real time is high, i.e., the pause was dominated by disk I/O
● Use ionice or rsync --bwlimit to limit the I/O impact
The investigation
Attempts at reproducing the issue
● Test cluster
○ Tried killing the controller
○ Multiple rolling bounces
○ Could not reproduce
● Upgraded the queuing cluster to x38 again
○ Could not reproduce
● So nothing…
Understanding queue backups…
Life-cycle of a Kafka request
[Diagram: a network layer (Acceptor, several Processors with per-processor response queues, client connections) in front of an API layer (request queue, API handlers, purgatory, quota manager). The build-up slides annotate each stage and the time it contributes:]
● The Acceptor accepts new client connections and assigns them to Processors
● A Processor reads a request off a connection, and then turns off read interest from that connection (for ordering)
● The request awaits handling in the request queue → queue-time
● An API handler handles the request → local-time
● Long-poll requests may be parked in purgatory until they can be satisfied → remote-time
● The response is held by the quota manager if a quota is violated → quota-time
● The response awaits its Processor in that Processor's response queue → response-queue-time
● The Processor writes the response back → response-send-time, then turns read interest back on from that connection
Total time = queue-time + local-time + remote-time + quota-time + response-queue-time + response-send-time
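To keep the bookkeeping above straight, here is a minimal illustrative sketch (not Kafka code; the class and field names are invented for this summary) of the per-request time components and how they add up to the total time reported in the request log:

case class RequestTimes(               // hypothetical helper, for illustration only
  requestQueueTime: Long,              // waiting in the request queue for an API handler
  localTime: Long,                     // time spent being handled by the API handler
  remoteTime: Long,                    // time parked in purgatory (e.g., long-poll fetches)
  quotaTime: Long,                     // time the response is held back on quota violation
  responseQueueTime: Long,             // waiting in the owning processor's response queue
  responseSendTime: Long) {            // time to write the response back to the client
  def totalTime: Long =
    requestQueueTime + localTime + remoteTime + quotaTime +
      responseQueueTime + responseSendTime
}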
Investigating high request times
● Total time is useful for monitoring
● but high total time is not necessarily bad
[Charts: one request type shows low total time; another shows high total time that is nonetheless “normal” because it is dominated by purgatory time]
Investigating high request times
● First look for high local time
○ then high response send time
■ then high remote (purgatory) time → generally a non-issue (but caveats described later)
● High request-queue/response-queue times are effects, not causes
High local times during incident (e.g., fetch)
How are fetch requests handled?
● Get physical offsets to be read from the local log during the response
● If the fetch is from a follower (i.e., a replica fetch):
○ If the follower was out of the ISR and just caught up, then expand the ISR (a ZooKeeper write)
○ Maybe satisfy eligible delayed produce requests (with acks -1)
● Else (i.e., a consumer fetch):
○ Record/update the byte-rate of this client
○ Throttle the request on quota violation
Could these cause high local times?
● Get physical offsets to be read from the local log during the response → should be fast
● If the fetch is from a follower (i.e., a replica fetch):
○ If the follower was out of the ISR and just caught up, then expand the ISR (ZooKeeper write) → maybe
○ Satisfy eligible delayed produce requests (with acks -1) → not applicable: not using acks -1
● Else (i.e., a consumer fetch):
○ Record/update the byte-rate of this client → should be fast
○ Throttle the request on quota violation → delayed outside the API thread
ISR churn? … unlikely
● Low ZooKeeper write latencies
● Churn in this incident: effect of some other root cause
● Long request queues can cause churn
○ ⇒ follower fetches time out
■ ⇒ fall out of the ISR (ISR shrink happens asynchronously in a separate thread)
○ Outstanding fetch catches up and the ISR expands
High local times during incident (e.g., fetch)
Besides, fetch-consumer (not just follower) has high local time
Could these cause high local times?
● Get physical offsets to be read from the local log during the response → should be fast
● If the fetch is from a follower (i.e., a replica fetch):
○ If the follower was out of the ISR and just caught up, then expand the ISR (ZooKeeper write) → should be fast (ZooKeeper write latencies were low)
○ Satisfy eligible delayed produce requests (with acks -1) → not applicable: not using acks -1
● Else (i.e., a consumer fetch):
○ Record/update the byte-rate of this client → test this…
○ Throttle the request on quota violation → delayed outside the API thread
Quota metrics
Maintains byte-rate metrics on a per-client-id basis
2015/10/10 03:20:08.393 [] [] [] [logger] Completed request:Name: FetchRequest; Version: 0; CorrelationId: 0; ClientId: 2c27cc8b_ccb7_42ae_98b6_51ea4b4dccf2; ReplicaId: -1; MaxWait: 0 ms; MinBytes: 0 bytes from connection <clientIP>:<brokerPort>-<localAddr>; totalTime:6589, requestQueueTime:6589, localTime:0, remoteTime:0, responseQueueTime:0, sendTime:0, securityProtocol:PLAINTEXT, principal:ANONYMOUS
??! (note the UUID-like client id, and that the entire 6589 ms total time is request-queue time)
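As a triage aid, here is a small illustrative sketch (not part of Kafka; the object name and regex are assumptions) for pulling the per-component times out of such "Completed request" log lines, since the breakdown is what points at the culprit:

object RequestLogTimes {
  // Matches fields such as "totalTime:6589" or "requestQueueTime:6589" in the request log line
  private val Field =
    """(totalTime|requestQueueTime|localTime|remoteTime|responseQueueTime|sendTime):(\d+)""".r

  // Returns e.g. Map("totalTime" -> 6589, "requestQueueTime" -> 6589, "localTime" -> 0, ...)
  def parse(line: String): Map[String, Long] =
    Field.findAllMatchIn(line).map(m => m.group(1) -> m.group(2).toLong).toMap
}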
Quota metrics - a quick benchmark
// Benchmark: time recordAndMaybeThrottle() while iterating over N client ids
for (clientId <- 0 until N) {
  timer.time {
    quotaMetrics.recordAndMaybeThrottle(sensorId, 0, DefaultCallBack)
  }
}
Quota metrics - a quick benchmark
[Benchmark result charts omitted from the transcript; the cost climbs with the number of client-id metrics]
Fixed in KAFKA-2664
meanwhile in our queuing cluster…
due to climbing client-id counts
● Rolling bounce of the cluster forced the issue to recur on brokers that had high client-id metric counts
○ Used jmxterm to check per-client-id metric counts before the experiment (see the note below)
○ Hooked up a profiler to verify during the incident
■ Generally avoid profiling/heapdumps in production due to interference
○ Did not see it in the earlier rolling bounce because there were only a few client-id metrics at the time
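As a rough illustration (tool usage from memory, so treat it as an assumption): jmxterm can be pointed at a broker's JMX port and its beans command used to list and count the per-client-id mbeans; a count far above the number of legitimate clients is the red flag being looked for here.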
Troubleshooting: macro vs micro
MACRO (generally more effective)
● Observe week-over-week trends
● Formulate theories
● Test theory (micro-experiments)
● Deploy fix and validate
MICRO (live debugging; sometimes warranted, but invasive and tedious)
● Instrumentation
● Attach profilers
● Take heapdumps
● Trace-level logs, tcpdump, etc.
How to fix high local times
● Optimize the request's handling, e.g.:
○ cached topic metadata as opposed to ZooKeeper reads (see KAFKA-901 and KAFKA-1356)
● Make it asynchronous
○ E.g., we will do this for StopReplica in KAFKA-1911
● Put it in a purgatory (usually if the response depends on some condition); but be aware of the caveats:
○ Higher memory pressure if the request purgatory size grows
○ Expired requests are handled in the purgatory expiration thread (which is good)
○ but satisfied requests are handled in the API thread of the satisfying request ⇒ if a request satisfies several delayed requests then local time can increase for the satisfying request
as for rogue clients…
2015/10/10 03:20:08.393 [] [] [] [logger] Completed request:Name: FetchRequest; Version: 0; CorrelationId: 0; ClientId: 2c27cc8b_ccb7_42ae_98b6_51ea4b4dccf2; ReplicaId: -1; MaxWait: 0 ms; MinBytes: 0 bytes from connection <clientIP>:<brokerPort>-<localAddr>; totalTime:6589, requestQueueTime:6589, localTime:0, remoteTime:0, responseQueueTime:0, sendTime:0, securityProtocol:PLAINTEXT, principal:ANONYMOUS
Get apps to use wrapper libraries that implement good client behavior, shield them from API changes, and so on…
not done yet!
this lesson needs repeating...
After deploying the metrics fix to some clusters…
● Persistent URPs (under-replicated partitions) appeared after the deployment
● Applications also began to report higher than usual consumer lag
Root cause: zero-copy broke for plaintext
[Chart annotated: upgraded cluster → rolled back → with fix (KAFKA-2517)]
The lesson...
● Request queue size
● Response queue sizes
● Request latencies:
○ Total time
○ Local time
○ Response send time
○ Remote time
● Request handler pool idle ratio
Monitor these closely!
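For reference, these map (approximately; exact mbean names vary a little across Kafka versions, so verify against your broker) to JMX metrics such as:
● kafka.network:type=RequestChannel,name=RequestQueueSize (and the corresponding ResponseQueueSize)
● kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce|FetchConsumer|FetchFollower|… plus the matching LocalTimeMs, RemoteTimeMs, RequestQueueTimeMs, ResponseQueueTimeMs and ResponseSendTimeMs
● kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent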
Continuous validation on trunk
Any other high latency requests?
Image courtesy of ©Nevit Dilmen
Local times
[Chart of local times by request type: ConsumerMetadata, ControlledShutdown, Fetch, LeaderAndIsr, TopicMetadata, OffsetCommit, OffsetFetch, Offsets (by time), Produce, StopReplica (for delete=true), UpdateMetadata]
The control requests (ControlledShutdown, LeaderAndIsr, StopReplica, UpdateMetadata) are (typically 1:N) broker-to-broker requests
Broker-to-broker request latencies - less critical
[Same network-layer/API-layer diagram as before: Acceptor, Processor, request queue, API handler, purgatory, response queue]
● The read bit is turned off while the request is in flight, so a slow broker-to-broker request ties up at most one API handler
● But if the requesting broker times out and retries, each retry ties up another API handler
● So configure socket timeouts >> MAX(latency)
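As a concrete example (the config name is from the brokers of that era; treat the guidance as an assumption rather than a tuned recommendation): controller.socket.timeout.ms (default 30000 ms) is the relevant timeout for the controller-related channels, and it should sit well above the slowest ControlledShutdown / LeaderAndIsr handling time observed.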
Broker-to-broker request latency improvements
● Broker to controller
○ ControlledShutdown (KAFKA-1342)
● Controller to broker
○ StopReplica[delete = true] should be asynchronous (KAFKA-1911)
○ LeaderAndIsr: batch request - maybe worth optimizing or putting in a purgatory? Haven't looked closely yet…
The end