1
A year supporting Kafka
Dustin Cote (Customer Operations, Confluent)
Ryan Pridgeon (Customer Operations, Confluent)
2
Prerequisites
● Moderate experience with Kafka
● Cursory knowledge of
• Configuring Kafka
• Replication
• Request lifecycles
● Interest in Kafka Ops
● Don’t have these things?
• Kafka: The Definitive Guide
• http://kafka.apache.org/documentation.html
3
Agenda
● Quick Flyover of Concepts
● Discussion on some techniques we use to generically troubleshoot
● Three things we’ve seen trouble with
● For each one
○ What happened
○ Why it happened
○ What we’re doing to make it not happen again
● Wrap up and questions
4
Background
How did we get here?
● Supporting Kafka!
● Subscription customers, mailing list subscribers, our own sweat/blood/tears
Why does it matter?
● Avoid the mistakes of others
● Reduce time to a stable production
● Help improve Kafka!
OK, but who should really care?
● Admins (what should I look out for/why should I upgrade?)
● Developers (how can I be a good citizen/why does my admin look at me like that?)
● Architects (what are good deployment strategies/how have we addressed problem use cases?)
5
Concept Overview: How Requests Flow in Kafka
● Replication: copying messages to other brokers for durability
● ISR: “In-sync Replica” -- is this replica up to date? (see the sketch below)
● Brokers act as both servers and clients
● Coherence matters
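To make the replica/ISR distinction concrete, here is a minimal sketch (not from the talk) that lists each partition's replicas and its ISR using the Kafka AdminClient. The bootstrap address and topic name are placeholders, and the AdminClient API shown is from newer Kafka client versions than the 0.9/0.10 brokers discussed later.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class IsrCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singleton("my-topic"))
                    .all().get().get("my-topic"); // "my-topic" is a placeholder
            desc.partitions().forEach(p ->
                    // If isr() is smaller than replicas(), some followers have fallen behind.
                    System.out.printf("partition %d leader=%s replicas=%s isr=%s%n",
                            p.partition(), p.leader(), p.replicas(), p.isr()));
        }
    }
}
```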
6
Troubleshooting (JMX)
● Why JMX?
○ Lightweight for the broker, lightweight for your storage
○ Designed for historical information and pattern recognition
○ Easily shared (could even publish them to Kafka!) and shipped off-box to another system
● Critical metrics (http://kafka.apache.org/documentation.html#monitoring)
○ Alert on these
○ Alert != restart
● How hard is it to set up?
○ Plenty of solutions of varying detail and price
○ Find what works for your org (a minimal polling sketch follows below)
● But what do all of these metrics mean??
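Before getting into what the individual metrics mean, here is a minimal polling sketch (not from the talk) that reads two of the critical broker metrics over JMX. It assumes the broker exposes JMX (e.g. started with JMX_PORT=9999); the host name is a placeholder, and the MBean/attribute names should be checked against the monitoring docs for your version.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerJmxPoller {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi"); // placeholder host/port
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();

            // Gauge: partitions on this broker whose ISR is smaller than the replica set.
            Object underReplicated = mbsc.getAttribute(new ObjectName(
                    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"),
                    "Value");

            // Meter: average fraction of time the request handler threads are idle.
            Object handlerIdle = mbsc.getAttribute(new ObjectName(
                    "kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent"),
                    "OneMinuteRate");

            System.out.println("UnderReplicatedPartitions = " + underReplicated);
            System.out.println("RequestHandlerAvgIdlePercent (1m) = " + handlerIdle);
        } finally {
            connector.close();
        }
    }
}
```

In practice you would ship these values to whatever monitoring system your org already uses rather than printing them, per the "find what works for your org" point above.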
7
Troubleshooting (JMX) - Key Broker Resources
8
Example 1 -- ISR Shrink/Expand
● Initial problem description
○ Under-replicated partitions are growing
● Scenario
○ Issue self heals
○ NetworkProcessorAvgIdlePercent stabilizes at 60%
○ Brokers are 0.10.0 with some 0.9.x clients
○ kafkacat -L (metadata) requests time out occasionally
● Cause
○ 0.9.x clients were slow to receive responses
○ A blocking call was used to send down-converted messages to older clients
○ This tied up network processor threads
9
Example 1 -- ISR Shrink/Expand
10
Example 1 -- ISR Shrink/Expand
● Prevention
○ Warn on ISR Shrinks/Expands (see the alerting sketch after this list)
○ Warn on high Network and Request handler utilization/saturation
○ Be mindful of increasing request latency
● Solution
○ Upgrade to 0.10.0.1, which contains the permanent fix
○ Alternatively, upgrade the clients
● Moral
○ Treat each issue as a new one, making no assumptions about the cause. Use the available metrics to limit the scope of your investigation.
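As a rough illustration of the "warn on ISR Shrinks/Expands and on handler saturation" advice above, the following sketch (not from the talk) polls the relevant meter and gauge over JMX and prints a warning when illustrative thresholds are crossed. The broker address and the thresholds are assumptions you would tune for your own cluster.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class IsrShrinkAlert {
    public static void main(String[] args) throws Exception {
        JMXConnector connector = JMXConnectorFactory.connect(
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi")); // placeholder
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();

            // Meter: rate of ISR shrink events on this broker.
            double isrShrinkRate = ((Number) mbsc.getAttribute(new ObjectName(
                    "kafka.server:type=ReplicaManager,name=IsrShrinksPerSec"),
                    "OneMinuteRate")).doubleValue();

            // Gauge: average fraction of time the network processor threads are idle (0.0 - 1.0).
            double networkIdle = ((Number) mbsc.getAttribute(new ObjectName(
                    "kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent"),
                    "Value")).doubleValue();

            // Illustrative thresholds: sustained shrink activity, or network processors
            // idle less than ~30% of the time, deserves a human look before any restart.
            if (isrShrinkRate > 0.0) {
                System.out.println("WARN: ISR shrinks observed at " + isrShrinkRate + "/sec");
            }
            if (networkIdle < 0.3) {
                System.out.println("WARN: network processors only " + (networkIdle * 100) + "% idle");
            }
        } finally {
            connector.close();
        }
    }
}
```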
11
Example 2 -- Failed automation
● Initial problem description
○ One broker goes “down” repeatedly
○ Full cluster restart; stabilization takes > 1 hour
○ After whole cluster is up, some partitions are permanently under-replicated
● Scenario
○ Environment: Cloud, Docker
○ For any failure, destroy/rebuild containers
○ Failure = ELB to broker connection failure
● Cause
○ Single broker lost connectivity with the ELB
○ Full cluster restart crushed the controller upon startup (8000+ partitions across 5 brokers).
○ Repeated automatic restarts during stabilization exacerbated the problem
12
Example 2 - Methodology
13
Example 2 -- Failed automation
● Prevention
○ Go to the source of truth for broker liveness: ZooKeeper (see the sketch after this list)
○ Alert and analyze upon “broker down” instead of triggering a container rebuild
○ Avoid “system reset” as a debugging tool
● Solution
○ Near term: disable controlled shutdown to avoid exposure
○ Long term: reduce the number of partitions and take the preventative measures above
● Moral
○ Implement monitoring with JMX and rely on it
○ If you aren’t sure what action to take automatically, tell a human
○ Distributed systems and blind restarts do not mix
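A minimal sketch (not from the talk) of checking the source of truth directly: Kafka registers each live broker as an ephemeral znode under /brokers/ids in ZooKeeper, so a liveness check can simply list that path. The connect string is a placeholder, and a production check would wait for the ZooKeeper session to reach the connected state before querying.

```java
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class LiveBrokers {
    public static void main(String[] args) throws Exception {
        // One-off check with a 10s session timeout and a no-op watcher; a production
        // check would wait for the SyncConnected event before issuing requests.
        ZooKeeper zk = new ZooKeeper("zookeeper1:2181", 10_000, event -> { }); // placeholder connect string
        try {
            // Kafka registers each live broker as an ephemeral znode under /brokers/ids.
            List<String> brokerIds = zk.getChildren("/brokers/ids", false);
            // Compare against the expected broker set -- and alert a human rather than
            // automatically rebuilding containers when something is missing.
            System.out.println("Registered broker ids: " + brokerIds);
        } finally {
            zk.close();
        }
    }
}
```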
14
Example 3 -- Reassignment Storm
● Initial problem description
○ Poor performance: producing is slow, consuming is slow, ISRs are shrinking
● Scenario
○ Adding a new broker
○ Partition reassignment done manually
○ Reassignment tool requires some knowledge of how replication works
● Cause
○ A cluster-wide partition reassignment was started
○ Brokers’ network processors overwhelmed
○ Crushed network processors == everything slows down
○ Prior to 0.10.1, the reassignment process cannot be throttled
15
Example 3 - Methodology
16
Example 3 -- Reassignment Storm
● Prevention
○ Take into account the number of partitions being moved
● Solution
○ Move a small number of partitions at a time (see the batching sketch after this list)
○ Upgrade to 0.10.1 or higher to take advantage of replica throttling (http://kafka.apache.org/documentation.html#rep-throttle)
○ Confluent Rebalancer
● Moral
○ Monitor the cluster with JMX to understand loading
○ Anytime you change how data flows, test in a staging environment first if possible
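A sketch (not from the talk) of the "small batches" approach: split a planned reassignment into small JSON files in the format consumed by kafka-reassign-partitions.sh and run them one at a time (adding --throttle on 0.10.1+). The topics and target replica lists below are made-up placeholders.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class BatchedReassignment {
    // One planned move: a topic-partition and the broker ids it should end up on.
    record Move(String topic, int partition, List<Integer> replicas) { }

    public static void main(String[] args) throws IOException {
        // Placeholder plan: in reality this would come from your own capacity planning.
        List<Move> plan = List.of(
                new Move("orders", 0, List.of(1, 2, 6)),
                new Move("orders", 1, List.of(2, 3, 6)),
                new Move("clicks", 0, List.of(3, 4, 6)),
                new Move("clicks", 1, List.of(4, 5, 6)));

        int batchSize = 2;  // keep each reassignment small so replication traffic stays bounded
        for (int i = 0; i < plan.size(); i += batchSize) {
            List<Move> batch = plan.subList(i, Math.min(i + batchSize, plan.size()));
            String json = "{\"version\":1,\"partitions\":[" + batch.stream()
                    .map(m -> String.format("{\"topic\":\"%s\",\"partition\":%d,\"replicas\":%s}",
                            m.topic(), m.partition(), m.replicas()))
                    .collect(Collectors.joining(",")) + "]}";
            // Feed each file to kafka-reassign-partitions.sh --execute (adding --throttle on
            // 0.10.1+) and wait for it to finish before starting the next batch.
            Files.writeString(Path.of("reassignment-batch-" + (i / batchSize) + ".json"), json);
        }
    }
}
```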
17
What did we learn...
● Implement monitoring with JMX and rely on it
● If you aren’t sure what action to take automatically, tell a human
● Stateful distributed systems and blind restarts do not mix
● Monitor the cluster with JMX to understand loading
● Anytime you change how data flows, test in a staging environment first if possible
● Not all problems have a single solution; use metrics to tease out the root cause before acting
18
Troubleshooting (JMX) - Utilization/Saturation
[Diagram: resource utilization/saturation metrics, keyed by broker component]
Key: Replica Manager, Request Handler Pool, Network Processor Threads
Metrics shown: UnderReplicatedPartitions, RequestHandlerAvgIdlePercent, NetworkProcessorAvgIdlePercent, ResponseQueueSize, IdlePercent, RequestsPerSec, ResponseSendTimeMs, RequestQueueSize, RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs
19
Troubleshooting (Logging/Errors)
● Should not drive investigation
● Supplements observed metrics
● Provides context to the observed metrics for further investigation
● Exception stack traces are useful for spotting bugs
20
Troubleshooting (Methodology) - USE
Summary:
Check Utilization, Saturation, and Errors for each resource (a Kafka-flavored sketch follows below)
Definitions:
● Utilization: how much work is being performed
● Saturation: no additional work can be performed
● Errors: ERROR (and possibly WARN) level messages in the logs
Reasoning:
● Avoid needless work
● Expedite time to resolution (TTR)
● Accurate root cause analyses (RCAs)
Acknowledgments:
“Systems Performance: Enterprise and the Cloud”, Brendan Gregg
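One way to operationalize USE for the broker resources discussed earlier is to keep an explicit checklist of which metric answers which question. The mapping below is an assumption based on the standard Kafka JMX metrics, not an official list from the talk.

```java
import java.util.List;
import java.util.Map;

public class UseChecklist {
    public static void main(String[] args) {
        // Resource -> (USE dimension -> signals to check); an assumed, illustrative mapping.
        Map<String, Map<String, List<String>>> checklist = Map.of(
                "Request handler pool", Map.of(
                        "Utilization", List.of("RequestHandlerAvgIdlePercent", "RequestsPerSec"),
                        "Saturation", List.of("RequestQueueSize", "RequestQueueTimeMs"),
                        "Errors", List.of("ERROR/WARN entries in server.log")),
                "Network processors", Map.of(
                        "Utilization", List.of("NetworkProcessorAvgIdlePercent"),
                        "Saturation", List.of("ResponseQueueSize", "ResponseSendTimeMs"),
                        "Errors", List.of("connection/IO exceptions in the logs")),
                "Replica manager", Map.of(
                        "Utilization", List.of("BytesInPerSec", "BytesOutPerSec"),
                        "Saturation", List.of("UnderReplicatedPartitions"),
                        "Errors", List.of("IsrShrinksPerSec", "replica fetcher errors in the logs")));

        // Walk the checklist in order: utilization, then saturation, then errors.
        checklist.forEach((resource, use) ->
                use.forEach((dimension, signals) ->
                        System.out.println(resource + " / " + dimension + ": " + signals)));
    }
}
```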
21
In Summary...
● Get those JMX metrics monitoring systems in place!
● Understand what your metrics are telling you before taking action
● Only restart if you have a reason to believe it will fix the problem
● When adding clients or brokers, test in a staging environment
● Looking for more Kafka?
• Stream me up, Scotty: Transitioning to the cloud using a streaming data platform -- Gwen Shapira/Bob Lehmann, Today, 2:40PM 230A
• Ask Me Anything -- Gwen Shapira, Tomorrow 4:20PM 212 A-B
• Kafka Summit → https://kafka-summit.org/ (5/8 NYC, 8/28 SF)
• Confluent University Training → https://www.confluent.io/training/
• Docs → http://docs.confluent.io/current
• Confluent Enterprise (built on Kafka) → https://www.confluent.io/product/
22
Thank You!
Dustin Cote | dustin@confluent.io | @TrudgeDMC
Ryan Pridgeon | ryan@confluent.io
Also check out:
Stream me up, Scotty: Transitioning to the cloud using a streaming data platform -- Gwen Shapira/Bob Lehmann, Today, 2:40PM 230A
Ask Me Anything -- Gwen Shapira, Tomorrow 4:20PM 212 A-B