SlideShare a Scribd company logo
Building A kafka cluster in AWS that
will survive AZ crash
A Little bit about our Production
● 2.5 Billion requests per day and
growing
● Located at AWS
● Micro service architecture
● Kafka is our main message bus
● Most of the code Is written in
Clojure
● Almost all of the services are
consuming and/or producing
from/to Kafka
What This lecture include
● Quick overview on Kafka
● Why did we choose Kafka?
● Decisions to make when building a Kafka cluster
● Planing for fault tolerance
● Setting the defaults
● Automate the fault tolerance
● Reassign partitions and changing retention on the fly
● Adding Metrics
● Testing the cluster
● Demo for managing the cluster
Quick Kafka overview
● Open Source Message Bus developed by Linkedin
● Designed as a distributed system
● Offers high throughput for both publishing and subscribing
● Persist messages on disk
● Supports multi-subscribers and automatically balances the consumers during failure
TERMS
● A stream of messages of a particular type is defined as a topic
● A Message is defined as a payload and a Topic is a category to which messages are
published
● A Producer can be anyone who can publish messages to a Topic
● The published messages stored at a set of servers called Brokers or cluster
● A Consumer can subscribe to one or more Topics and consume messages from
brokers
Why Kafka?
● It fit's our Architecture for stream of
events: most of our services consume, run
logic and then act or produce a new event
● We need resilient solution since it's our
main message bus
● It's nicely scale out
● Same messages are often consumed by
different services, that enabled natively by
Kafka. Messages are not deleted when
consumed but after retention period
● Our large and growing number of
message require high throughput
Key decisions when building a cluster
● Which instance type to use?
● How many brokers do we need?
● How to spread brokers between AZ?
● Whats the right defaults regarding retention, number of
partitions, replication factor, flush intervals, etc
● What the right setting for each topic
● Log directories split up
● Zookeeper ensemble size
Planing fault tolerance
● Launch enough brokers to support
failures
● Spread brokers between AZ
● Set the replication factor to match at
least the number of AZ
● Grantee that each partition is spread
between all configured AZ
● Make sure that Zookeeper instances
are spread between AZ
● Add automation to add new brokers
fast
● Add alerts for failures
Automate the fault tolerance
● Auto calculation the spread of brokers per topic that will guarantee
at least one broker in each AZ and evenly spread partitions between
the brokers to balance load.
● Generate a JSON with the data above that compatible with Kafka
reassign format.
● Script the above steps per topic and get topic,partitions as
parameters.
● Add the automation above also to new created topics.
● Automate broker addition with chef and aws prepared AMI.
● Enable auto leader rebalance
DEMO
● Show basic commands / scripts / Kafka Usage
● Show internal scripts to automate brokers split
up of a topic between AZ
● Show how to reassign partitions and keep fault
tolerance
● Show how to change retention
● Show how to run console consumer for testing
● Show AppsFlyer cluster / Kafawebview /
metrics / dashboards
Collecting metrics, Building
dashboard and alerts
● Metrics are being send to statsd and
graphite via AirBnb reporter
● Additional application metrics are
being sent by internal service that
measure lag
● We create dashboard with all the
relevant metrics
● Alerts are being setup upon the
relevant metrics to monitor
● Health check is being set for each
broker
Testing the cluster
● Build a dashboard to see the testing
effects
● Stopping one/two brokers
● Kill an entire AZ
● Stop one ZK
● Reassign partitions in runtime
● Change retention in Runtime
● Generate additional load and check
performance
● Do combination of all the above

More Related Content

What's hot

Building Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache KafkaBuilding Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache KafkaGuozhang Wang
 
How to Lock Down Apache Kafka and Keep Your Streams Safe
How to Lock Down Apache Kafka and Keep Your Streams SafeHow to Lock Down Apache Kafka and Keep Your Streams Safe
How to Lock Down Apache Kafka and Keep Your Streams Safeconfluent
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaSteven Wu
 
Everything you ever needed to know about Kafka on Kubernetes but were afraid ...
Everything you ever needed to know about Kafka on Kubernetes but were afraid ...Everything you ever needed to know about Kafka on Kubernetes but were afraid ...
Everything you ever needed to know about Kafka on Kubernetes but were afraid ...HostedbyConfluent
 
Kafka Streams for Java enthusiasts
Kafka Streams for Java enthusiastsKafka Streams for Java enthusiasts
Kafka Streams for Java enthusiastsSlim Baltagi
 
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...confluent
 
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...confluent
 
Troubleshooting Kafka's socket server: from incident to resolution
Troubleshooting Kafka's socket server: from incident to resolutionTroubleshooting Kafka's socket server: from incident to resolution
Troubleshooting Kafka's socket server: from incident to resolutionJoel Koshy
 
Introduction to Apache Kafka and why it matters - Madrid
Introduction to Apache Kafka and why it matters - MadridIntroduction to Apache Kafka and why it matters - Madrid
Introduction to Apache Kafka and why it matters - MadridPaolo Castagna
 
Exactly-once Semantics in Apache Kafka
Exactly-once Semantics in Apache KafkaExactly-once Semantics in Apache Kafka
Exactly-once Semantics in Apache Kafkaconfluent
 
Devoxx Morocco 2016 - Microservices with Kafka
Devoxx Morocco 2016 - Microservices with KafkaDevoxx Morocco 2016 - Microservices with Kafka
Devoxx Morocco 2016 - Microservices with KafkaLászló-Róbert Albert
 
Kafka Summit NYC 2017 - Apache Kafka in the Enterprise: What if it Fails?
Kafka Summit NYC 2017 - Apache Kafka in the Enterprise: What if it Fails? Kafka Summit NYC 2017 - Apache Kafka in the Enterprise: What if it Fails?
Kafka Summit NYC 2017 - Apache Kafka in the Enterprise: What if it Fails? confluent
 
... No it's Apache Kafka!
... No it's Apache Kafka!... No it's Apache Kafka!
... No it's Apache Kafka!makker_nl
 
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...confluent
 
Streaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in ProductionStreaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in Productionconfluent
 
Hello, kafka! (an introduction to apache kafka)
Hello, kafka! (an introduction to apache kafka)Hello, kafka! (an introduction to apache kafka)
Hello, kafka! (an introduction to apache kafka)Timothy Spann
 
Reducing Microservice Complexity with Kafka and Reactive Streams
Reducing Microservice Complexity with Kafka and Reactive StreamsReducing Microservice Complexity with Kafka and Reactive Streams
Reducing Microservice Complexity with Kafka and Reactive Streamsjimriecken
 
Stream Me Up, Scotty: Transitioning to the Cloud Using a Streaming Data Platform
Stream Me Up, Scotty: Transitioning to the Cloud Using a Streaming Data PlatformStream Me Up, Scotty: Transitioning to the Cloud Using a Streaming Data Platform
Stream Me Up, Scotty: Transitioning to the Cloud Using a Streaming Data Platformconfluent
 

What's hot (20)

Building Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache KafkaBuilding Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka
 
kafka for db as postgres
kafka for db as postgreskafka for db as postgres
kafka for db as postgres
 
How to Lock Down Apache Kafka and Keep Your Streams Safe
How to Lock Down Apache Kafka and Keep Your Streams SafeHow to Lock Down Apache Kafka and Keep Your Streams Safe
How to Lock Down Apache Kafka and Keep Your Streams Safe
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Everything you ever needed to know about Kafka on Kubernetes but were afraid ...
Everything you ever needed to know about Kafka on Kubernetes but were afraid ...Everything you ever needed to know about Kafka on Kubernetes but were afraid ...
Everything you ever needed to know about Kafka on Kubernetes but were afraid ...
 
Kafka Streams for Java enthusiasts
Kafka Streams for Java enthusiastsKafka Streams for Java enthusiasts
Kafka Streams for Java enthusiasts
 
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
 
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
 
Troubleshooting Kafka's socket server: from incident to resolution
Troubleshooting Kafka's socket server: from incident to resolutionTroubleshooting Kafka's socket server: from incident to resolution
Troubleshooting Kafka's socket server: from incident to resolution
 
Introduction to Apache Kafka and why it matters - Madrid
Introduction to Apache Kafka and why it matters - MadridIntroduction to Apache Kafka and why it matters - Madrid
Introduction to Apache Kafka and why it matters - Madrid
 
Kafka at scale facebook israel
Kafka at scale   facebook israelKafka at scale   facebook israel
Kafka at scale facebook israel
 
Exactly-once Semantics in Apache Kafka
Exactly-once Semantics in Apache KafkaExactly-once Semantics in Apache Kafka
Exactly-once Semantics in Apache Kafka
 
Devoxx Morocco 2016 - Microservices with Kafka
Devoxx Morocco 2016 - Microservices with KafkaDevoxx Morocco 2016 - Microservices with Kafka
Devoxx Morocco 2016 - Microservices with Kafka
 
Kafka Summit NYC 2017 - Apache Kafka in the Enterprise: What if it Fails?
Kafka Summit NYC 2017 - Apache Kafka in the Enterprise: What if it Fails? Kafka Summit NYC 2017 - Apache Kafka in the Enterprise: What if it Fails?
Kafka Summit NYC 2017 - Apache Kafka in the Enterprise: What if it Fails?
 
... No it's Apache Kafka!
... No it's Apache Kafka!... No it's Apache Kafka!
... No it's Apache Kafka!
 
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...
 
Streaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in ProductionStreaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in Production
 
Hello, kafka! (an introduction to apache kafka)
Hello, kafka! (an introduction to apache kafka)Hello, kafka! (an introduction to apache kafka)
Hello, kafka! (an introduction to apache kafka)
 
Reducing Microservice Complexity with Kafka and Reactive Streams
Reducing Microservice Complexity with Kafka and Reactive StreamsReducing Microservice Complexity with Kafka and Reactive Streams
Reducing Microservice Complexity with Kafka and Reactive Streams
 
Stream Me Up, Scotty: Transitioning to the Cloud Using a Streaming Data Platform
Stream Me Up, Scotty: Transitioning to the Cloud Using a Streaming Data PlatformStream Me Up, Scotty: Transitioning to the Cloud Using a Streaming Data Platform
Stream Me Up, Scotty: Transitioning to the Cloud Using a Streaming Data Platform
 

Similar to Kafka aws

Copy of Kafka-Camus
Copy of Kafka-CamusCopy of Kafka-Camus
Copy of Kafka-CamusDeep Shah
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka IntroductionAmita Mirajkar
 
Insta clustr seattle kafka meetup presentation bb
Insta clustr seattle kafka meetup presentation   bbInsta clustr seattle kafka meetup presentation   bb
Insta clustr seattle kafka meetup presentation bbNitin Kumar
 
apachekafka-160907180205.pdf
apachekafka-160907180205.pdfapachekafka-160907180205.pdf
apachekafka-160907180205.pdfTarekHamdi8
 
Confluent Tech Talk Korea
Confluent Tech Talk KoreaConfluent Tech Talk Korea
Confluent Tech Talk Koreaconfluent
 
Kafka basics and best prectices
Kafka basics and best precticesKafka basics and best prectices
Kafka basics and best precticesRohitSingh542417
 
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, TwitterTwitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, TwitterHostedbyConfluent
 
Introduction to Kafka and Event-Driven
Introduction to Kafka and Event-DrivenIntroduction to Kafka and Event-Driven
Introduction to Kafka and Event-DrivenDimosthenis Botsaris
 
Introduction to Kafka and Event-Driven
Introduction to Kafka and Event-DrivenIntroduction to Kafka and Event-Driven
Introduction to Kafka and Event-Drivenarconsis
 
OnPrem Monitoring.pdf
OnPrem Monitoring.pdfOnPrem Monitoring.pdf
OnPrem Monitoring.pdfTarekHamdi8
 
Streaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit LogStreaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit LogJoe Stein
 
Non-Kafkaesque Apache Kafka - Yottabyte 2018
Non-Kafkaesque Apache Kafka - Yottabyte 2018Non-Kafkaesque Apache Kafka - Yottabyte 2018
Non-Kafkaesque Apache Kafka - Yottabyte 2018Otávio Carvalho
 

Similar to Kafka aws (20)

kafka
kafkakafka
kafka
 
Copy of Kafka-Camus
Copy of Kafka-CamusCopy of Kafka-Camus
Copy of Kafka-Camus
 
Event driven-arch
Event driven-archEvent driven-arch
Event driven-arch
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka Introduction
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Insta clustr seattle kafka meetup presentation bb
Insta clustr seattle kafka meetup presentation   bbInsta clustr seattle kafka meetup presentation   bb
Insta clustr seattle kafka meetup presentation bb
 
apachekafka-160907180205.pdf
apachekafka-160907180205.pdfapachekafka-160907180205.pdf
apachekafka-160907180205.pdf
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Kafka tutorial
Kafka tutorialKafka tutorial
Kafka tutorial
 
Confluent Tech Talk Korea
Confluent Tech Talk KoreaConfluent Tech Talk Korea
Confluent Tech Talk Korea
 
Apache kafka part 1
Apache kafka part  1Apache kafka part  1
Apache kafka part 1
 
Apache Kafka
Apache Kafka Apache Kafka
Apache Kafka
 
Kafka basics and best prectices
Kafka basics and best precticesKafka basics and best prectices
Kafka basics and best prectices
 
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, TwitterTwitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
 
Introduction to Kafka and Event-Driven
Introduction to Kafka and Event-DrivenIntroduction to Kafka and Event-Driven
Introduction to Kafka and Event-Driven
 
Introduction to Kafka and Event-Driven
Introduction to Kafka and Event-DrivenIntroduction to Kafka and Event-Driven
Introduction to Kafka and Event-Driven
 
OnPrem Monitoring.pdf
OnPrem Monitoring.pdfOnPrem Monitoring.pdf
OnPrem Monitoring.pdf
 
Streaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit LogStreaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit Log
 
Non-Kafkaesque Apache Kafka - Yottabyte 2018
Non-Kafkaesque Apache Kafka - Yottabyte 2018Non-Kafkaesque Apache Kafka - Yottabyte 2018
Non-Kafkaesque Apache Kafka - Yottabyte 2018
 

More from Ariel Moskovich (12)

Consul scale
Consul scaleConsul scale
Consul scale
 
Kafka ops-new
Kafka ops-newKafka ops-new
Kafka ops-new
 
Docker appsflyer
Docker appsflyerDocker appsflyer
Docker appsflyer
 
Advanced Code Flow, Notes From the Field
Advanced Code Flow, Notes From the FieldAdvanced Code Flow, Notes From the Field
Advanced Code Flow, Notes From the Field
 
Practical Monitoring Techniques
Practical Monitoring TechniquesPractical Monitoring Techniques
Practical Monitoring Techniques
 
Consul
ConsulConsul
Consul
 
sensu
sensusensu
sensu
 
devopstools
devopstoolsdevopstools
devopstools
 
Bouncer
BouncerBouncer
Bouncer
 
Devopstools
DevopstoolsDevopstools
Devopstools
 
Docker in prod
Docker in prodDocker in prod
Docker in prod
 
Docker tlv
Docker tlvDocker tlv
Docker tlv
 

Recently uploaded

Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupCatarinaPereira64715
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsVlad Stirbu
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Product School
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Alison B. Lowndes
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoTAnalytics
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaRTTS
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...Sri Ambati
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...Product School
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backElena Simperl
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesThousandEyes
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1DianaGray10
 

Recently uploaded (20)

Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
 

Kafka aws

  • 1. Building A kafka cluster in AWS that will survive AZ crash
  • 2. A Little bit about our Production ● 2.5 Billion requests per day and growing ● Located at AWS ● Micro service architecture ● Kafka is our main message bus ● Most of the code Is written in Clojure ● Almost all of the services are consuming and/or producing from/to Kafka
  • 3. What This lecture include ● Quick overview on Kafka ● Why did we choose Kafka? ● Decisions to make when building a Kafka cluster ● Planing for fault tolerance ● Setting the defaults ● Automate the fault tolerance ● Reassign partitions and changing retention on the fly ● Adding Metrics ● Testing the cluster ● Demo for managing the cluster
  • 4. Quick Kafka overview ● Open Source Message Bus developed by Linkedin ● Designed as a distributed system ● Offers high throughput for both publishing and subscribing ● Persist messages on disk ● Supports multi-subscribers and automatically balances the consumers during failure TERMS ● A stream of messages of a particular type is defined as a topic ● A Message is defined as a payload and a Topic is a category to which messages are published ● A Producer can be anyone who can publish messages to a Topic ● The published messages stored at a set of servers called Brokers or cluster ● A Consumer can subscribe to one or more Topics and consume messages from brokers
  • 5. Why Kafka? ● It fit's our Architecture for stream of events: most of our services consume, run logic and then act or produce a new event ● We need resilient solution since it's our main message bus ● It's nicely scale out ● Same messages are often consumed by different services, that enabled natively by Kafka. Messages are not deleted when consumed but after retention period ● Our large and growing number of message require high throughput
  • 6. Key decisions when building a cluster ● Which instance type to use? ● How many brokers do we need? ● How to spread brokers between AZ? ● Whats the right defaults regarding retention, number of partitions, replication factor, flush intervals, etc ● What the right setting for each topic ● Log directories split up ● Zookeeper ensemble size
  • 7. Planing fault tolerance ● Launch enough brokers to support failures ● Spread brokers between AZ ● Set the replication factor to match at least the number of AZ ● Grantee that each partition is spread between all configured AZ ● Make sure that Zookeeper instances are spread between AZ ● Add automation to add new brokers fast ● Add alerts for failures
  • 8. Automate the fault tolerance ● Auto calculation the spread of brokers per topic that will guarantee at least one broker in each AZ and evenly spread partitions between the brokers to balance load. ● Generate a JSON with the data above that compatible with Kafka reassign format. ● Script the above steps per topic and get topic,partitions as parameters. ● Add the automation above also to new created topics. ● Automate broker addition with chef and aws prepared AMI. ● Enable auto leader rebalance
  • 9. DEMO ● Show basic commands / scripts / Kafka Usage ● Show internal scripts to automate brokers split up of a topic between AZ ● Show how to reassign partitions and keep fault tolerance ● Show how to change retention ● Show how to run console consumer for testing ● Show AppsFlyer cluster / Kafawebview / metrics / dashboards
  • 10. Collecting metrics, Building dashboard and alerts ● Metrics are being send to statsd and graphite via AirBnb reporter ● Additional application metrics are being sent by internal service that measure lag ● We create dashboard with all the relevant metrics ● Alerts are being setup upon the relevant metrics to monitor ● Health check is being set for each broker
  • 11. Testing the cluster ● Build a dashboard to see the testing effects ● Stopping one/two brokers ● Kill an entire AZ ● Stop one ZK ● Reassign partitions in runtime ● Change retention in Runtime ● Generate additional load and check performance ● Do combination of all the above