Running Thousands of Kafka Clusters on AWS With Mehari Beyene and Tom Schutte | Current 2022
© 2022, Amazon Web Services, Inc. or its affiliates.
LESSONS LEARNED FROM RUNNING THOUSANDS OF KAFKA CLUSTERS ON AWS
Kafka on AWS:
Best Practices
Lessons learned from operating
thousands of clusters
Tuesday, October 4
Mehari Beyene (he/him)
Sr. Software Dev Engineer
AWS
Tom Schutte (he/him)
Software Dev Engineer
AWS
Speakers
Tom Schutte
Software Engineer
Amazon MSK
Mehari Beyene
Senior Software Engineer
Amazon MSK
Data is everything - everything is data
• 2.5 Million Terabytes of data is generated every day
• Thousands of Terabytes streamed each day
• Latest data insights are critical
• Used by over 75% of Fortune 100 companies
• Hundreds of data streaming use cases
• Data streaming is still in its early days…
Amazon Managed
Streaming for Apache
Kafka (MSK)
Amazon Managed Streaming for Apache Kafka (MSK)
• Offers open source Apache Kafka as
a service to customers
• Customers can create, scale, and
upgrade Kafka clusters
• The MSK team monitors the health
of clusters and mitigates cluster
health problems
• The MSK team periodically updates
software, applies patches, and makes
sure that clusters are healthy and
secure
Monitoring Kafka
Clusters at scale
Cluster Health Metrics to Monitor
Kafka & ZooKeeper Metrics
• JMX metrics emitted by Kafka
& ZooKeeper
Host Level Metrics
• CPU
• Memory
• Disk Usage
• Network Connectivity
Metrics from Agents
• Agent heartbeat
• Healthy/Unhealthy
Challenges of monitoring at scale
• Flexibility of alarming
• Aggregate system health
• Prevent large issues from obscuring
• Cluster and Node level monitoring
• … Automate!
Amazon MSK’s Monitoring Architecture
• Stream metrics from each node
• Ingest records into a Flink application
• Filter metrics of interest
• Tune the sensitivity of each alarm
• Record health state information
• Take action!
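The slide does not show how the filtering and sensitivity tuning are implemented, so the following is only a minimal sketch of the idea in plain Python rather than Flink. It assumes an "M breaching datapoints out of the last N" rule as the sensitivity knob; `AlarmRule`, `HealthEvaluator`, and the metric name `disk_used_pct` are hypothetical names introduced for illustration.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmRule:
    metric: str       # metric of interest (hypothetical name)
    threshold: float  # a datapoint breaches when value > threshold
    breaches: int     # M: breaching datapoints required to alarm
    window: int       # N: number of most recent datapoints considered


class HealthEvaluator:
    """Filters incoming metrics and tracks per-node health state."""

    def __init__(self, rules):
        self.rules = {rule.metric: rule for rule in rules}
        self.history = defaultdict(deque)  # (node, metric) -> recent breach flags
        self.state = {}                    # (node, metric) -> "OK" or "ALARM"

    def ingest(self, node, metric, value):
        """Returns the new state on a transition, otherwise None."""
        rule = self.rules.get(metric)
        if rule is None:
            return None  # filter: not a metric of interest
        key = (node, metric)
        flags = self.history[key]
        flags.append(value > rule.threshold)
        if len(flags) > rule.window:
            flags.popleft()
        new_state = "ALARM" if sum(flags) >= rule.breaches else "OK"
        changed = self.state.get(key, "OK") != new_state
        self.state[key] = new_state  # record health state
        return new_state if changed else None


# Usage: alarm when 2 of the last 3 datapoints exceed 85% disk usage.
evaluator = HealthEvaluator([AlarmRule("disk_used_pct", 85.0, breaches=2, window=3)])
samples = [80.0, 90.0, 70.0, 91.0, 92.0, 60.0, 50.0]
transitions = [evaluator.ingest("broker-1", "disk_used_pct", v) for v in samples]
```

Raising `breaches` or `window` makes an alarm less sensitive to a single noisy datapoint; a state transition is the point at which an automated action could be triggered.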
Automated Mitigation
at scale
Failure Modes
Compute
• Degraded Hardware
• High Memory usage
• Overloaded CPU
Storage
• Disk Full
• Slow or Stuck disk
• Corrupted disk
Networking
• Inaccessible Network
interfaces
• Slow DNS Propagation
• Data Center Outages
Challenges of Automated Mitigation
• Heterogeneity of Fleet
• Node types
• Kafka Versions
• Customized configurations and features
• Recovery from large scale events
Automated Mitigations
• Terminate and Replace Nodes
• Restart Nodes
• Detach/Attach Volumes
• Replace Volumes
• Restart/Update Software
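The slides list the mitigations but not how one is chosen for a given failure. One plausible approach, shown here purely as an illustration, is an escalation ladder: try the least disruptive mitigation first and escalate only if the node is still unhealthy. The mapping from failure domain to mitigation order below is an assumption, not something stated in the talk, and `apply` and `is_healthy` are hypothetical callbacks.

```python
from enum import Enum


class Mitigation(Enum):
    RESTART_SOFTWARE = "restart/update software"
    RESTART_NODE = "restart node"
    DETACH_ATTACH_VOLUME = "detach/attach volume"
    REPLACE_VOLUME = "replace volume"
    REPLACE_NODE = "terminate and replace node"


# Assumed ordering, least disruptive first (not taken from the talk).
ESCALATION = {
    "compute": [Mitigation.RESTART_SOFTWARE, Mitigation.RESTART_NODE,
                Mitigation.REPLACE_NODE],
    "storage": [Mitigation.DETACH_ATTACH_VOLUME, Mitigation.REPLACE_VOLUME,
                Mitigation.REPLACE_NODE],
    "networking": [Mitigation.RESTART_NODE, Mitigation.REPLACE_NODE],
}


def mitigate(domain, apply, is_healthy):
    """Applies mitigations in order until the health check passes.

    Returns (attempted mitigations, whether the node recovered).
    """
    attempted = []
    for action in ESCALATION[domain]:
        apply(action)
        attempted.append(action)
        if is_healthy():
            return attempted, True
    return attempted, False  # ladder exhausted: escalate to an operator


# Usage with stand-in callbacks: the node recovers once its volume is replaced.
applied = []
attempted, recovered = mitigate(
    "storage",
    apply=applied.append,
    is_healthy=lambda: Mitigation.REPLACE_VOLUME in applied,
)
```

Keeping the ladder as data rather than code makes it easier to vary per node type or Kafka version, which matters in a heterogeneous fleet.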
Patching
Regularly Patch
• Operating System
• Kafka/ZooKeeper Software
• Agents
Challenges
• Cluster availability
• Heterogeneous Fleet
• Zero Day Vulnerabilities
Amazon MSK Patching Tenets
• Update all software running on Clusters
• No impact to Cluster availability
• Should be done regularly
• Fast enough to patch an entire fleet and
slow enough not to disrupt Cluster
availability
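The "fast enough, slow enough" tenet can be read as a scheduling constraint. The sketch below is one way to express it, under two assumptions that are not from the talk: at most one broker per cluster is patched at a time (to protect availability), and a fleet-wide cap limits how many nodes are patched in one wave (to bound blast radius). `plan_patch_waves` and the cluster and broker names are hypothetical.

```python
from collections import deque


def plan_patch_waves(clusters, max_per_wave):
    """Groups brokers into patch waves.

    Each wave holds at most one broker per cluster and at most
    max_per_wave brokers in total. Waves are meant to run one after
    another; brokers inside a wave can be patched in parallel.
    """
    if max_per_wave < 1:
        raise ValueError("max_per_wave must be at least 1")
    pending = {name: deque(brokers) for name, brokers in clusters.items() if brokers}
    waves = []
    while pending:
        wave = []
        for name in list(pending):
            if len(wave) == max_per_wave:
                break
            wave.append((name, pending[name].popleft()))
            if not pending[name]:
                del pending[name]  # cluster fully patched
        waves.append(wave)
    return waves


# Usage: two clusters, at most two brokers patched per wave.
waves = plan_patch_waves(
    {"cluster-a": ["a1", "a2"], "cluster-b": ["b1"]},
    max_per_wave=2,
)
```

A larger `max_per_wave` patches the fleet faster; the one-broker-per-cluster rule is what keeps each individual cluster available regardless of that setting.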
On Demand Updates
Update Dimensions
Compute
• Node Type
• Number of Brokers
Storage
• Increase disk size
• Auto Scaling
• Provisioned throughput
Connectivity
• Authentication and Encryption
• Public endpoints
Amazon MSK Update Tenets
• Guardrails for stable updates
• Safe and controlled – rolling restart, monitoring, automated
mitigation
• Speed matters
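A rolling restart with a guardrail might look like the sketch below. It assumes the guardrail is "no under-replicated partitions before moving to the next broker" (Kafka exposes an UnderReplicatedPartitions JMX metric); the talk does not say which signal MSK uses. `restart` and `under_replicated` are hypothetical callbacks supplied by the caller.

```python
import time


def rolling_restart(brokers, restart, under_replicated,
                    timeout_s=600.0, poll_s=5.0, sleep=time.sleep):
    """Restarts brokers one at a time, gated on replication health.

    Returns (brokers restarted so far, whether the update completed).
    """
    done = []
    for broker in brokers:
        if under_replicated() > 0:
            return done, False  # guardrail: never start on an unhealthy cluster
        restart(broker)
        done.append(broker)
        deadline = time.monotonic() + timeout_s
        while under_replicated() > 0:  # wait for replicas to catch up
            if time.monotonic() >= deadline:
                return done, False  # stop and hand over to mitigation
            sleep(poll_s)
    return done, True


# Usage with stand-in callbacks: replication dips once after the first restart.
readings = iter([0, 3, 0, 0, 0])
restarted = []
done, completed = rolling_restart(
    ["broker-1", "broker-2"],
    restart=restarted.append,
    under_replicated=lambda: next(readings),
    sleep=lambda seconds: None,  # skip real waiting in this example
)
```

The timeout is where safety and speed trade off: a short timeout aborts quickly on a struggling cluster, while a long one tolerates slow replica catch-up at the cost of a longer update.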
• Scalable monitoring and alarming system
• Automated detection and mitigation
• Regular and continuous patching
• Controlled mutation of clusters
Thank you!
Mehari Beyene
mehbey@amazon.com
Tom Schutte
tomschu@amazon.com