Strictly Confidential 1
Flavors of HA
Sep 26, 2023
Chandra Kuchi, Sreeram Ramji
Robinhood Markets, Inc.
“Robinhood” and the Robinhood feather logo are registered trademarks of Robinhood Markets, Inc. All other names are trademarks of and/or registered trademarks of their respective owners.
01 What is High Availability (HA)?
● High Availability (HA) refers to systems that are designed to be robust and to operate continuously without noticeable downtime
● The objective of HA is to eliminate or minimize disruptions by ensuring that failures within the system do not result in service interruptions or data loss
Tenets of High Availability
● Building simple systems: Availability tends to decrease as the systems being built grow more complex, because every additional component is one more thing that can fail
● Redundancy: Multiple instances of applications or systems are run so that if one fails, another can take over without interruption. The system has no single point of failure
● Failover: The ability of a system to automatically transfer control to a standby system when a failure occurs
● Observability-first systems: Observability is built in from the start, with tight SLAs for critical end-user journeys
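The first two tenets can be illustrated with basic availability arithmetic. The sketch below is not from the talk: it assumes component failures are independent, and the numbers are purely illustrative. A chain in which every component must be up multiplies availabilities (more components, lower availability), while N redundant instances are unavailable only if all N fail at once.

```python
from math import prod


def series_availability(components):
    """Availability of a chain in which every component must be up."""
    return prod(components)


def redundant_availability(instance_availability, replicas):
    """Availability when any one of `replicas` independent instances suffices."""
    return 1 - (1 - instance_availability) ** replicas


# Illustrative numbers only (assumptions, not measurements from the talk).
chain_of_three = series_availability([0.999, 0.999, 0.999])
two_replicas = redundant_availability(0.99, 2)
three_replicas = redundant_availability(0.99, 3)
```

Under these assumptions, three 99.9% components in series land slightly below 99.9%, while two 99% replicas together reach roughly 99.99%. Correlated failures (for example, a shared availability zone) break the independence assumption, which is why fault domains matter.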
02 Why HA in Streaming systems?
Some examples of systems that can't afford downtime or delay:
● Equities/Crypto order placement flow
● Market data streaming and serving
● Clearing systems
● Account management flows
Robinhood relies heavily on streaming services like Kafka and on event-driven activity snapshots for its critical user journeys, so downtime in these systems is not tolerable.
03 How?
Strategy/Approach to High Availability
● Improve reliability of a single Kafka cluster deployment
● Create redundant Kafka clusters
● Improve client behavior on failures
● Increase application redundancy
04 Reliability of a single Kafka cluster
AZ redundancy
ZooKeeper multi-AZ
● Kafka cluster == many EC2 machines
● Brokers are spread across 3 availability zones (AZs) per environment
● Topic-partitions are replicated across AZs, so we can tolerate an entire AZ outage
● ZooKeeper cluster == 5 EC2 machines
○ spread across AZs as well
● We now have several layers of HA
○ tolerate machine failure
○ tolerate AWS AZ outage
○ tolerate ZK node failure
[Diagram: Kafka cluster spread across three availability zones (us-blah-x, us-blah-y, us-blah-z) with brokers 10000-10002, 20000-20002, and 40000-40002; topic-partitions A, B, and C each appear three times across the brokers; ZooKeeper nodes sit alongside the brokers.]
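The AZ-outage claim can be checked mechanically. The sketch below is an illustration, not Robinhood's tooling: the broker ids come from the diagram, but the zone names and the replica placement are assumptions. It removes every broker in one zone and reports which replicas remain, and it also computes how many node failures a majority-quorum ensemble (such as a 5-node ZooKeeper cluster) tolerates.

```python
def surviving_replicas(assignment, broker_az, failed_az):
    """Per partition: the replicas left after every broker in failed_az is lost."""
    return {
        partition: [b for b in brokers if broker_az[b] != failed_az]
        for partition, brokers in assignment.items()
    }


def tolerated_failures(ensemble_size):
    """A majority quorum of n nodes stays available with (n - 1) // 2 nodes down."""
    return (ensemble_size - 1) // 2


# Broker ids follow the diagram; zone names and placement are assumed.
broker_az = {
    10000: "az-x", 10001: "az-x", 10002: "az-x",
    20000: "az-y", 20001: "az-y", 20002: "az-y",
    40000: "az-z", 40001: "az-z", 40002: "az-z",
}
assignment = {
    "A": [10000, 20001, 40001],
    "B": [10001, 20000, 40000],
    "C": [10002, 20002, 40002],
}

after_outage = surviving_replicas(assignment, broker_az, "az-y")
zk_failures_ok = tolerated_failures(5)
```

With one replica per zone, losing any single zone leaves two replicas of every partition. A 5-node ensemble tolerates two failed nodes, so a layout such as 2/2/1 across three zones keeps quorum through a single-zone outage.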
Data Redundancy
● kafka-broker-provisioner
○ in-house provisioning script
○ creates an EBS volume (external AWS disk) and mounts it on each Kafka broker machine
○ the EBS volume remains after a node is churned, so data is not lost when a node dies; the volume is mounted on the replacement node (faster bootstrapping) using volume tags managed by the provisioner
[Diagram: broker 10000 in the Kafka cluster with an EBS disk (link: /local/filesystem); system: kafka.server with Kafka config files; supervisorctl: statsd; supervisorctl: pki; certs.]
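The tag-based reattachment described above can be sketched as follows. This is not the kafka-broker-provisioner itself: the `Volume` class, the `claim_volume` function, and the tag names (`cluster`, `broker_id`) are hypothetical, and an in-memory list stands in for the AWS volume inventory so the logic can run without any cloud calls.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Volume:
    volume_id: str
    tags: dict
    attached_to: Optional[str] = None  # instance id, or None when detached


def claim_volume(volumes, cluster, broker_id, instance_id):
    """Attach the detached volume tagged for this broker, or create a new one.

    Returns (volume, created), where created is False when existing data is reused.
    """
    wanted = {"cluster": cluster, "broker_id": str(broker_id)}
    for vol in volumes:
        if vol.tags == wanted and vol.attached_to is None:
            vol.attached_to = instance_id  # reuse: the dead node's data is kept
            return vol, False
    vol = Volume(f"vol-new-{broker_id}", wanted, instance_id)
    volumes.append(vol)
    return vol, True


# A replacement node for broker 10000 finds the volume its predecessor left behind.
inventory = [Volume("vol-1", {"cluster": "kafka-a", "broker_id": "10000"})]
reused, created = claim_volume(inventory, "kafka-a", 10000, "i-replacement")
```

Because the replacement broker picks up the old disk, it only needs to catch up on data written while it was down, instead of re-replicating every partition from scratch.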
05 Redundant Kafka clusters
Sharded Clusters
Applications using Sharded Kafka
06 Improve client behavior
Improve Kafka client behavior on failures
● Abstract the concept of sharding away from end users
● Build MultiClusterProducer/Consumer clients with no change in interface
● Automated failure-based fallbacks by deny-listing a shard once it crosses an error threshold
● Chaos testing to ensure that one cluster outage does not make the Kafka clients fail open
● Building a consumer proxy (please check out the deep dive on this tomorrow at 11:30 AM)
● Dead letter queue enforcement for all critical consumers (check out the deep dive on this from Current 2022)
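The deny-listing fallback can be sketched as below. Only the name MultiClusterProducer comes from the slides; the constructor parameters, the `produce` signature, and the idea of passing each shard in as a plain `send(topic, value)` callable are assumptions made so the example runs without a Kafka client. A shard that fails `error_threshold` times in a row is skipped for `deny_seconds`, and callers keep using the same `produce` call throughout.

```python
import time


class MultiClusterProducer:
    """Sends to one of several shards, skipping shards that are deny-listed."""

    def __init__(self, shards, error_threshold=3, deny_seconds=30.0,
                 clock=time.monotonic):
        self._shards = dict(shards)  # shard name -> send(topic, value) callable
        self._errors = {name: 0 for name in self._shards}
        self._denied_until = {}      # shard name -> time the deny entry expires
        self._threshold = error_threshold
        self._deny_seconds = deny_seconds
        self._clock = clock

    def healthy_shards(self):
        """Shards that are not currently deny-listed, in configured order."""
        now = self._clock()
        return [n for n in self._shards
                if self._denied_until.get(n, float("-inf")) <= now]

    def produce(self, topic, value):
        """Try healthy shards in order; return the name of the shard that accepted."""
        for name in self.healthy_shards():
            try:
                self._shards[name](topic, value)
            except Exception:
                self._errors[name] += 1
                if self._errors[name] >= self._threshold:
                    # Too many consecutive errors: stop trying this shard for a while.
                    self._denied_until[name] = self._clock() + self._deny_seconds
                    self._errors[name] = 0
                continue
            self._errors[name] = 0
            return name
        raise RuntimeError("no healthy shard accepted the message")
```

A deny entry expires on its own, so a recovered shard rejoins the rotation without operator action. The sketch deliberately ignores ordering and duplicate delivery across shards, which a real client would have to address.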
07 Increase Application Redundancy
Multi Kubernetes Cluster deployments
[Diagram: Kube cluster 1, Kube cluster 2, ... Kube cluster N running against kafka cluster 1 (topic A) and kafka cluster 2 (topic B), each on EC2 VMs.]
● All applications/consumers should be equally distributed across fault domains (Kubernetes clusters, availability zones, network boundaries)
● Singletons such as change data consumers run across N clusters in active-standby mode using leader election
● All important stateful applications, such as Flink applications, need to run in active-active mode across N clusters, with active-standby for output consumption
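Active-standby via leader election can be sketched with a time-limited lease. Everything here is illustrative: `LeaseStore` is a hypothetical in-memory stand-in for a shared coordination service, and `run_once` is an assumed name for one iteration of a replica's loop. The replica holding the lease does the work; the others stay on standby and take over once the lease expires without renewal.

```python
import time


class LeaseStore:
    """In-memory stand-in for a shared lock service (hypothetical, single process)."""

    def __init__(self, clock=time.monotonic):
        self._holder = None
        self._expires = float("-inf")
        self._clock = clock

    def try_acquire(self, candidate, ttl):
        """Grant or renew the lease if it is free, expired, or already held by candidate."""
        now = self._clock()
        if self._holder in (None, candidate) or self._expires <= now:
            self._holder, self._expires = candidate, now + ttl
            return True
        return False


def run_once(store, replica, ttl, work):
    """One loop iteration: only the current lease holder performs the work."""
    if store.try_acquire(replica, ttl):
        work(replica)
        return "active"
    return "standby"
```

For example, if replicas named `kube-1` and `kube-2` both call `run_once` in a loop, the first to acquire the lease stays active as long as it keeps renewing, and the other becomes active only after the lease lapses. A production setup would need a store shared across clusters with an atomic compare-and-set, which this sketch does not provide.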
08 Future
● Multi Produce/Consume (Spray) for market data
● Multi-region backups
● Consumer proxy integration for Spray
● Moving this architecture to be Kubernetes native on top of our custom Kafka operator
Thank you
Vector Search @ sw2con for slideshare.pptx
 
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
 
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
 
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 
Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 
الأمن السيبراني - ما لا يسع للمستخدم جهله
الأمن السيبراني - ما لا يسع للمستخدم جهلهالأمن السيبراني - ما لا يسع للمستخدم جهله
الأمن السيبراني - ما لا يسع للمستخدم جهله
 
Frisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdf
Frisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdfFrisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdf
Frisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdf
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptx
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsContinuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
 

Flavors of HA

  • 1. Strictly Confidential 1 Flavors of HA Sep 26, 2023 Chandra Kuchi Sreeram Ramji Robinhood Markets, Inc “Robinhood” and the Robinhood feather logo are registered trademarks of Robinhood Markets, Inc. All other names are trademarks of and/or registered trademarks of their respective owners.
  • 2. 01 What is High Availability (HA)?
  • 3.
      ● High Availability (HA) refers to systems that are designed to be robust and continuously operational, without any noticeable downtime
      ● The objective of HA is to eliminate or minimize disruptions by ensuring that failures within the system do not result in service interruptions or data loss
      Tenets of High Availability
      ● Building simple systems: Availability decreases as the systems being built grow more complex; it is inversely proportional to their complexity
      ● Redundancy: Multiple instances of applications or systems are run so that if one fails, another can take over without interruption. The system does not have any single point of failure
      ● Failover: The ability of a system to automatically transfer control to a standby system when a failure occurs
      ● Observability first system: An observability first system with tight SLAs for critical end user journeys
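
The first two tenets above can be illustrated with standard availability arithmetic. The sketch below is not from the deck: it assumes component failures are independent, and the function names are made up for illustration. A chain of components that must all be up multiplies their availabilities (more parts, lower availability), while a redundant group is down only when every replica is down.

```python
from math import prod


def serial_availability(components):
    """Availability of a chain in which every component must be up.

    Assumes independent failures; each entry is a fraction such as 0.999.
    """
    return prod(components)


def redundant_availability(replicas):
    """Availability of a group in which any single replica being up is enough.

    Assumes independent failures; the group is down only if all replicas are down.
    """
    return 1 - prod(1 - a for a in replicas)
```

Under these assumptions, two dependencies at 99% each in series give roughly 98.01%, whereas two redundant replicas at 99% each give roughly 99.99%.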
  • 4. 02 Why HA in Streaming systems?
  • 5. Some examples of systems that can’t afford to have downtime or delay:
      ● Equities/Crypto order placement flow
      ● Marketdata streaming and serving
      ● Clearing systems
      ● Account management flows
      Robinhood heavily relies on streaming services like kafka and event driven activity snapshots for its critical user journeys. Thus no downtime is tolerable in these systems
  • 6. 03 How?
  • 7. Strategy/Approach to High Availability
      ● Improve reliability of a single kafka cluster deployment
      ● Create redundant kafka clusters
      ● Improve client behaviour on failures
      ● Increase application redundancy
  • 8. 04 Reliability of a single kafka cluster
  • 9. AZ redundancy zookeeper multi AZ
      ● kafka cluster == many EC2 machines
      ● spread across 3 availability zones (AZ) per environment
      ● topic-partitions are replicated across AZs, so we can tolerate entire AZ outages
      ● zookeeper cluster == 5 EC2 machines
          ○ spread across AZs as well
      ● We now have 4 layers of HA
          ○ tolerate machine failure
          ○ tolerate AWS AZ outage
          ○ tolerate ZK node failure
      [Diagram: a kafka cluster spanning availability zones us-blah-x, us-blah-y, and us-blah-z, with brokers 10000 to 10002, 20000 to 20002, and 40000 to 40002, replicas of topic-partitions A, B, and C placed on brokers across the zones, and zookeeper nodes alongside.]
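
The AZ-outage claim on this slide can be checked with a small sketch. This is an illustration only, not the deck's tooling and not Kafka's own assignment algorithm: the mapping of broker IDs to zones is an assumption read off the diagram, and `place_replicas` and `survives_az_outage` are hypothetical helpers. Each partition gets one replica per zone, so losing any single zone still leaves replicas elsewhere.

```python
# Assumed layout: broker IDs and zone names come from the slide's diagram,
# but which broker sits in which zone is a guess made for this example.
BROKERS_BY_AZ = {
    "us-blah-x": [10000, 10001, 10002],
    "us-blah-y": [20000, 20001, 20002],
    "us-blah-z": [40000, 40001, 40002],
}


def place_replicas(partitions, brokers_by_az):
    """Give every partition one replica in each zone, rotating through brokers."""
    placement = {}
    for index, partition in enumerate(partitions):
        placement[partition] = [
            brokers[index % len(brokers)] for brokers in brokers_by_az.values()
        ]
    return placement


def survives_az_outage(placement, brokers_by_az, failed_az):
    """True if every partition keeps at least one replica outside the failed zone."""
    failed = set(brokers_by_az[failed_az])
    return all(
        any(broker not in failed for broker in replicas)
        for replicas in placement.values()
    )
```

With three zones and one replica per zone, `survives_az_outage` holds for each zone in turn. In a real cluster, zone awareness is typically expressed through the broker's `broker.rack` setting rather than a hand-written placement function.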
  • 10. Data Redundancy
      ● kafka-broker-provisioner
          ○ in-house provisioning script
          ○ creates an EBS (external AWS disk) volume and mounts it to each kafka broker machine
          ○ the EBS volume remains after a node is churned, so data is not lost when a node dies; it is mounted to the new replacement node (faster bootstrapping) using volume tags managed by the provisioner
      [Diagram: broker 10000 in the kafka cluster, with an EBS disk linked at /local/filesystem; system: kafka.server; kafka config files; supervisorctl: statsd; supervisorctl: pki …. certs]
  • 11. 05 Redundant kafka clusters
  • 12. Sharded Clusters
  • 13. Applications using Sharded Kafka
  • 14. 06 Improve client behavior
  • 15. Improve kafka client behavior on failures
      ● Abstract the concept of sharding from end-users
      ● Building MultiClusterProducer/Consumer clients with no change in interface
      ● Automated failure-based fallbacks by deny listing a shard once an error threshold is reached
      ● Chaos testing to ensure one cluster outage does not make the kafka clients fail open
      ● Building a consumer proxy - Please check out a deep dive on this tomorrow at 11:30 AM
      ● Dead letter queue enforcement for all critical consumers - Check out a deep dive on this in Current 2022
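
The deny-listing fallback described above can be sketched roughly as follows. This is not Robinhood's MultiClusterProducer: the class body, parameter names, and thresholds are assumptions, and each shard is represented by a plain callable rather than a real kafka client so the example stays self-contained.

```python
import time


class MultiClusterProducer:
    """Illustrative only: tries shards in order and deny lists failing ones."""

    def __init__(self, shards, error_threshold=3, deny_seconds=30.0,
                 clock=time.monotonic):
        # shards maps a shard name to a callable send(topic, value);
        # in practice each callable would wrap a per-cluster producer.
        self._shards = dict(shards)
        self._error_threshold = error_threshold
        self._deny_seconds = deny_seconds
        self._clock = clock
        self._errors = {name: 0 for name in self._shards}
        self._denied_until = {}

    def healthy_shards(self):
        """Shards that are not currently on the deny list."""
        now = self._clock()
        return [
            name for name in self._shards
            if name not in self._denied_until or self._denied_until[name] <= now
        ]

    def produce(self, topic, value):
        """Send to the first healthy shard that accepts the record."""
        for name in self.healthy_shards():
            try:
                self._shards[name](topic, value)
            except Exception:
                self._errors[name] += 1
                if self._errors[name] >= self._error_threshold:
                    # Threshold reached: skip this shard for a while.
                    self._denied_until[name] = self._clock() + self._deny_seconds
                    self._errors[name] = 0
                continue
            self._errors[name] = 0
            return name
        # Raise instead of silently dropping the record when nothing is healthy.
        raise RuntimeError("no healthy shard accepted the record")
```

Callers see a single `produce` call regardless of how many clusters sit behind it, which matches the goal of hiding sharding from end-users. Injecting the clock makes the deny window easy to exercise in tests.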
  • 16. 07 Increase Application Redundancy
  • 17. Multi Kubernetes Cluster deployments
      [Diagram: Kube cluster 1, Kube cluster 2, ... Kube cluster N on ec2 VMs, alongside kafka cluster 1 (topic A) and kafka cluster 2 (topic B).]
      ● All applications/consumers should be equally distributed across fault domains (Kubernetes clusters / Availability zones / network boundaries)
      ● If there are singletons like change data consumers, then they run across N clusters in active-standby mode using leader election
      ● All important stateful applications like flink applications need to run in active-active mode across N clusters, with active-standby for output consumption
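
Active-standby via leader election, as used for the singletons above, is commonly built on a lease with a time-to-live. The sketch below is an assumption-laden illustration, not the deck's implementation: `LeaseStore` is a hypothetical in-memory stand-in for whatever shared, strongly consistent store the clusters would coordinate through, and time is passed in explicitly to keep the example deterministic.

```python
class LeaseStore:
    """In-memory stand-in for a shared lease record (hypothetical)."""

    def __init__(self):
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, candidate, now, ttl):
        """Acquire or renew the lease; return True if candidate is now the leader.

        A candidate succeeds if the lease is free, already its own, or expired.
        """
        if self.holder in (None, candidate) or now >= self.expires_at:
            self.holder = candidate
            self.expires_at = now + ttl
            return True
        return False
```

Each replica calls `try_acquire` periodically. The one that holds the lease does the work, the others stay on standby, and a standby takes over only after the active replica stops renewing and the lease expires.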
  • 19.
      ● Multi Produce/Consume (Spray) for marketdata
      ● Multi region backups
      ● Consumer proxy integration for Spray
      ● Moving this architecture to be kubernetes native on top of our custom kafka operator
  • 20. Thank you