© 2022, Amazon Web Services, Inc. or its affiliates.
LESSONS LEARNED FROM RUNNING THOUSANDS OF KAFKA CLUSTERS ON AWS
Kafka on AWS:
Best Practices
Lessons learned from operating
thousands of clusters
Mehari Beyene (he/him)
Tuesday, October 4
Sr. Software Dev Engineer
AWS
Tom Schutte (he/him)
Software Dev Engineer
AWS
Speakers
Tom Schutte
Software Engineer
Amazon MSK
Mehari Beyene
Senior Software Engineer
Amazon MSK
Data is everything - everything is data
• 2.5 million terabytes of data are generated every day
• Thousands of Terabytes streamed each day
• Latest data insights are critical
• Used by over 75% of Fortune 100 companies
• Hundreds of data streaming use cases
• Data streaming is still early days…
Amazon Managed
Streaming for Apache
Kafka (MSK)
Amazon Managed Streaming for Apache Kafka (MSK)
• Offers open source Apache Kafka as a service to customers
• Customers can create, scale, and upgrade Kafka clusters
• The MSK team monitors the health of clusters and mitigates cluster health problems
• The MSK team periodically updates software, applies patches, and makes sure that clusters are healthy and secure
Monitoring Kafka
Clusters at scale
Cluster Health Metrics to Monitor
Kafka & ZooKeeper Metrics
• JMX metrics emitted by Kafka & ZooKeeper
Host Level Metrics
• CPU
• Memory
• Disk Usage
• Network Connectivity
Metrics from Agents
• Agent heartbeat
• Healthy/Unhealthy
Challenges of monitoring at scale
• Flexibility of alarming
• Aggregate system health
• Prevent large issues from obscuring
• Cluster and Node level monitoring
• … Automate!
Amazon MSK’s Monitoring Architecture
• Stream metrics from each node
• Ingest records into a Flink application
• Filter metrics of interest
• Tune the sensitivity of each alarm
• Record health state information
• Take action!
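
The filter, sensitivity, and health-state steps above can be sketched in a few lines of plain Python. This is only an illustration, not MSK's Flink application: the `AlarmRule` and `HealthEvaluator` names, the metric name `disk_used_pct`, and the "M breaches out of the last N samples" sensitivity model are all assumptions made for the example.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmRule:
    """Hypothetical rule: alarm when at least `breaches_to_alarm` of the
    last `window` samples of `metric` exceed `threshold`."""
    metric: str
    threshold: float
    window: int
    breaches_to_alarm: int


class HealthEvaluator:
    """Keeps a per-(node, metric) health state and reports state changes."""

    def __init__(self, rules):
        self.rules = {rule.metric: rule for rule in rules}
        self.history = defaultdict(deque)  # (node, metric) -> recent breach flags
        self.state = {}                    # (node, metric) -> "HEALTHY" / "UNHEALTHY"

    def ingest(self, node, metric, value):
        """Process one sample; return the new state if it changed, else None."""
        rule = self.rules.get(metric)
        if rule is None:
            return None  # filter: no rule means the metric is not of interest
        key = (node, metric)
        flags = self.history[key]
        flags.append(value > rule.threshold)
        if len(flags) > rule.window:
            flags.popleft()
        breached = sum(flags) >= rule.breaches_to_alarm
        new_state = "UNHEALTHY" if breached else "HEALTHY"
        changed = self.state.get(key, "HEALTHY") != new_state
        self.state[key] = new_state  # record health state information
        return new_state if changed else None
```

Raising `breaches_to_alarm` or widening `window` makes an alarm less sensitive to short spikes; a returned state change is the point where a mitigation would be triggered.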
Automated Mitigation
at scale
Failure Modes
Compute
• Degraded Hardware
• High Memory usage
• Overloaded CPU
Storage
• Disk Full
• Slow or Stuck disk
• Corrupted disk
Networking
• Inaccessible Network Interfaces
• Slow DNS Propagation
• Data Center Outages
Challenges of Automated Mitigation
• Heterogeneity of Fleet
• Node types
• Kafka Versions
• Customized configurations and features
• Recovery from large scale events
Automated Mitigations
• Terminate and Replace Nodes
• Restart Nodes
• Detach/Attach Volumes
• Replace Volumes
• Restart/Update Software
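
One way to connect the failure modes to the mitigations listed above is a playbook that tries the least disruptive action first and escalates only if the problem persists. The sketch below is an assumption for illustration: the slides name the failure modes and the mitigations but do not say which mitigation applies to which failure or in what order, so the `PLAYBOOK` mapping and the `next_mitigation` helper are hypothetical.

```python
from enum import Enum


class Failure(Enum):
    HIGH_MEMORY = "high memory usage"
    STUCK_DISK = "slow or stuck disk"
    CORRUPTED_DISK = "corrupted disk"
    DEGRADED_HARDWARE = "degraded hardware"
    DISK_FULL = "disk full"


class Mitigation(Enum):
    RESTART_SOFTWARE = "restart/update software"
    RESTART_NODE = "restart node"
    DETACH_ATTACH_VOLUME = "detach/attach volume"
    REPLACE_VOLUME = "replace volume"
    REPLACE_NODE = "terminate and replace node"


# Hypothetical playbook: each list is ordered from least to most disruptive.
# The pairing of failures to actions is an assumption, not taken from the talk.
PLAYBOOK = {
    Failure.HIGH_MEMORY: [Mitigation.RESTART_SOFTWARE, Mitigation.RESTART_NODE],
    Failure.STUCK_DISK: [Mitigation.DETACH_ATTACH_VOLUME, Mitigation.REPLACE_VOLUME],
    Failure.CORRUPTED_DISK: [Mitigation.REPLACE_VOLUME],
    Failure.DEGRADED_HARDWARE: [Mitigation.RESTART_NODE, Mitigation.REPLACE_NODE],
}


def next_mitigation(failure, already_tried):
    """Return the next untried mitigation for `failure`, or None when the
    playbook is exhausted (or has no entry) and an operator should be paged."""
    for step in PLAYBOOK.get(failure, []):
        if step not in already_tried:
            return step
    return None
```

Keeping the playbook as data rather than code makes it easier to vary per node type or Kafka version, which matters in a heterogeneous fleet.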
Patching
Regularly Patch
• Operating System
• Kafka/ZooKeeper Software
• Agents
Challenges
• Cluster availability
• Heterogeneous Fleet
• Zero Day Vulnerabilities
Amazon MSK Patching Tenets
• Update all software running on Clusters
• No impact to Cluster availability
• Should be done regularly
• Fast enough to patch an entire fleet and slow enough not to disrupt Cluster availability
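
The "fast enough, slow enough" tenet can be read as a scheduling problem: patch many clusters in parallel, but never more than one broker of the same cluster at once. The sketch below shows one possible wave planner under that reading. The `plan_waves` function, the `max_parallel` limit, and the one-broker-per-cluster rule are assumptions for illustration, not a description of how MSK paces its patching.

```python
def plan_waves(clusters, max_parallel):
    """Group brokers into patch waves.

    clusters:     dict mapping cluster name -> list of broker ids
    max_parallel: maximum number of clusters touched in one wave

    Each wave contains at most one broker per cluster, so a cluster never
    has two brokers down for patching at the same time.
    """
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")

    # Copy the broker lists so the caller's data is not modified.
    pending = {name: list(brokers) for name, brokers in clusters.items() if brokers}
    waves = []
    while pending:
        wave = []
        for name in sorted(pending)[:max_parallel]:
            wave.append((name, pending[name].pop(0)))
            if not pending[name]:
                del pending[name]  # cluster fully patched
        waves.append(wave)
    return waves
```

Raising `max_parallel` shortens the time to cover the fleet, while the per-cluster limit bounds the availability impact on any single customer.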
On Demand Updates
Update Dimensions
Compute
• Node Type
• Number of Brokers
Storage
• Increase disk size
• Auto Scaling
• Provisioned throughput
Connectivity
• Authentication and Encryption
• Public endpoints
Amazon MSK Update Tenets
• Guardrails for stable updates
• Safe and controlled – rolling restart, monitoring, automated mitigation
• Speed matters
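
A rolling restart with guardrails can be sketched as a loop that touches one broker at a time and stops as soon as a broker fails to come back healthy. The `rolling_restart` function and its `restart` and `is_healthy` callbacks are hypothetical names introduced for this example; a real system would also wait between health checks and hand a failed broker to automated mitigation.

```python
def rolling_restart(brokers, restart, is_healthy, max_checks=3):
    """Restart brokers one at a time, halting at the first problem.

    restart(broker)    -> performs the restart (hypothetical callback)
    is_healthy(broker) -> True when the broker is serving again
    Returns (completed_brokers, failed_broker_or_None).
    """
    completed = []
    for broker in brokers:
        # Guardrail: do not start work on a broker that is already unhealthy.
        if not is_healthy(broker):
            return completed, broker
        restart(broker)
        # Monitoring: poll until healthy, up to max_checks attempts.
        for _ in range(max_checks):
            if is_healthy(broker):
                break
        else:
            # Never recovered: stop the rollout so the remaining brokers
            # keep serving while the failure is mitigated.
            return completed, broker
        completed.append(broker)
    return completed, None
```

Stopping at the first failure keeps the impact to a single broker, which is what makes it safe to run the same update across many clusters.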
• Scalable monitoring and alarming system
• Automated detection and mitigation
• Regular and continuous patching
• Controlled mutation of clusters
Thank you!
Mehari Beyene
mehbey@amazon.com
Tom Schutte
tomschu@amazon.com

Running Thousands of Kafka Clusters on AWS With Mehari Beyene and Tom Schutte | Current 2022
