Robinhood uses Kafka in every line of its business, from stock and crypto trading to its self-clearing system and online data analytics. Different systems need different reliability guarantees. We will talk about the various architectures we use to achieve these disparate goals while maintaining sanity in our client-side code.
This talk discusses how we removed single points of failure (SPoF) through investments in Kafka infrastructure and our client libraries, letting us support the varied requirements of systems inside Robinhood. In addition we will discuss:
- Lessons learned from building client libraries to support these HA strategies
- K8s sidecars that help our applications work with a sharded architecture
- Observability and debuggability tools we built to support these use cases
Flavors of HA
Strictly Confidential
Sep 26, 2023
Chandra Kuchi
Sreeram Ramji
Robinhood Markets, Inc
“Robinhood” and the Robinhood feather logo are registered trademarks of Robinhood Markets, Inc. All other names are trademarks of and/or registered trademarks of their respective owners.
01 What is High Availability (HA)?
● High Availability (HA) refers to systems that are designed to be robust and operational continuously without any noticeable downtime
● The objective of HA is to eliminate or minimize disruptions by ensuring that failures within the system do not result in service interruptions or data loss
Tenets of High Availability
● Building simple systems: Availability decreases as the systems being built grow more complex
● Redundancy: Multiple instances of applications or systems are run so that if one fails, another can take over without interruption. The system has no single point of failure
● Failover: The ability of a system to automatically transfer control to a standby system when a failure occurs
● Observability-first systems: Observability is built in from the start, with tight SLAs for critical end-user journeys
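The first two tenets can be illustrated with a little arithmetic. The sketch below is not from the talk; it assumes component failures are independent, which real outages often are not, and the function names are hypothetical.

```python
def serial_availability(components):
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def parallel_availability(single, replicas):
    """Availability of `replicas` redundant instances when one survivor is enough.

    Assumes failures are independent (an assumption, not a claim from the talk).
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    return 1.0 - (1.0 - single) ** replicas


# More components in series lower availability; more replicas raise it.
chain = serial_availability([0.99, 0.99])
pair = parallel_availability(0.99, 2)
```

Under these assumptions, chaining two 99% components gives roughly 98% availability, while running two 99% replicas side by side gives roughly 99.99%.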
02 Why HA in streaming systems?
Some examples of systems that can’t afford downtime or delay:
● Equities/crypto order placement flow
● Market data streaming and serving
● Clearing systems
● Account management flows
Robinhood relies heavily on streaming services like Kafka and on event-driven activity snapshots for its critical user journeys, so no downtime is tolerable in these systems.
03 How?
Strategy/Approach to High Availability
● Improve reliability of a single Kafka cluster deployment
● Create redundant Kafka clusters
● Improve client behaviour on failures
● Increase application redundancy
04 Reliability of a single Kafka cluster
AZ redundancy
ZooKeeper multi-AZ
● Kafka cluster == many EC2 machines
● Spread across 3 availability zones (AZs) per environment
● Topic-partitions are replicated across AZs, so we can tolerate an entire AZ outage
● ZooKeeper cluster == 5 EC2 machines
○ Spread across AZs as well
● We now have 4 layers of HA
○ Tolerate machine failure
○ Tolerate AWS AZ outage
○ Tolerate ZK node failure
Diagram: a Kafka cluster spread across three availability zones (us-blah-x, us-blah-y, us-blah-z), with brokers 10000-10002, 20000-20002 and 40000-40002. Each zone holds a replica of topic-partitions A, B and C, and the ZooKeeper nodes are spread across the zones as well.
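The AZ-tolerance claim above can be checked mechanically. The following is a minimal sketch, not the speakers' tooling: the broker-to-zone mapping and both function names are assumptions made for illustration, using the placeholder IDs and zone names from the slide. Kafka itself supports rack-aware replica placement through the `broker.rack` setting; the talk does not say whether that is what is used here.

```python
from itertools import cycle

# Hypothetical layout: which AZ each broker lives in (assumed, not stated in the talk).
BROKER_AZ = {
    10000: "us-blah-x", 10001: "us-blah-x", 10002: "us-blah-x",
    20000: "us-blah-y", 20001: "us-blah-y", 20002: "us-blah-y",
    40000: "us-blah-z", 40001: "us-blah-z", 40002: "us-blah-z",
}


def assign_replicas(partitions, broker_az):
    """Give each partition one replica per AZ, rotating through that AZ's brokers."""
    by_az = {}
    for broker, az in sorted(broker_az.items()):
        by_az.setdefault(az, []).append(broker)
    pickers = {az: cycle(brokers) for az, brokers in by_az.items()}
    return {p: [next(pickers[az]) for az in sorted(pickers)] for p in partitions}


def survives_az_outage(assignment, broker_az, failed_az):
    """True if every partition keeps at least one replica outside the failed AZ."""
    return all(
        any(broker_az[b] != failed_az for b in replicas)
        for replicas in assignment.values()
    )


assignment = assign_replicas(["A", "B", "C"], BROKER_AZ)
ok = all(survives_az_outage(assignment, BROKER_AZ, az) for az in set(BROKER_AZ.values()))
```

A placement that puts all replicas of a partition in one zone fails the same check, which is the situation cross-AZ replication is meant to rule out.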
Data Redundancy
● kafka-broker-provisioner
○ In-house provisioning script
○ Creates an EBS volume (external AWS disk) and mounts it on each Kafka broker machine
○ The EBS volume remains after a node is churned, so data is not lost when a node dies. The volume is found through volume tags managed by the provisioner and mounted on the replacement node (faster bootstrapping)
Diagram: broker 10000 in the Kafka cluster, with its EBS disk linked at /local/filesystem, the kafka.server system service, Kafka config files, certs, and supervisorctl-managed processes (statsd, pki, ….).
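The reattach-by-tag behaviour described for the provisioner can be sketched as follows. This is an in-memory model only, not the in-house script: the `Volume` class, the `kafka-broker-id` tag key, and `provision_disk` are all hypothetical names, and a real implementation would call the AWS EC2 API instead of scanning a list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Volume:
    volume_id: str
    tags: dict
    attached_to: Optional[str] = None  # instance ID, or None when detached


def provision_disk(volumes, broker_id, instance_id):
    """Reattach the broker's surviving volume if one exists, else create a new one.

    Returns (volume, reused). The tag key is an assumption made for this sketch.
    """
    for vol in volumes:
        if vol.tags.get("kafka-broker-id") == str(broker_id) and vol.attached_to is None:
            vol.attached_to = instance_id  # existing data comes back with the disk
            return vol, True
    vol = Volume(
        volume_id=f"vol-new-{broker_id}",
        tags={"kafka-broker-id": str(broker_id)},
        attached_to=instance_id,
    )
    volumes.append(vol)
    return vol, False


# Broker 10000's old node died and left a detached, tagged volume behind.
volumes = [Volume("vol-1", {"kafka-broker-id": "10000"})]
reused_vol, reused = provision_disk(volumes, 10000, "i-replacement")
fresh_vol, fresh_reused = provision_disk(volumes, 10001, "i-other")
```

Reusing the tagged volume means the replacement broker only has to catch up on what it missed while it was down, rather than re-replicating every partition from scratch.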
05 Redundant Kafka clusters
Sharded Clusters
Applications using Sharded Kafka
06 Improve client behavior
Improve Kafka client behavior on failures
● Abstract the concept of sharding away from end users
● Build MultiClusterProducer/Consumer clients with no change in interface
● Automated failure-based fallbacks: deny-list a shard once it crosses an error threshold
● Chaos testing to ensure that a single cluster outage does not make the Kafka clients fail open
● Building a consumer proxy: please check out the deep dive on this tomorrow at 11:30 AM
● Dead-letter queue enforcement for all critical consumers: check out the deep dive on this from Current 2022
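The deny-listing idea above can be sketched in a few lines. This is an illustration under stated assumptions, not Robinhood's client library: only the name MultiClusterProducer comes from the slide, while the constructor parameters, the per-shard send callables, and the time-based re-admission of a denied shard are all hypothetical.

```python
import time


class MultiClusterProducer:
    """Sends to the first healthy shard; deny-lists a shard after repeated errors."""

    def __init__(self, shards, error_threshold=3, deny_seconds=30.0, clock=time.monotonic):
        self._shards = dict(shards)  # shard name -> callable(topic, value)
        self._errors = {name: 0 for name in self._shards}
        self._denied_until = {}
        self._threshold = error_threshold
        self._deny_seconds = deny_seconds
        self._clock = clock

    def _healthy(self):
        now = self._clock()
        return [n for n in self._shards if self._denied_until.get(n, 0.0) <= now]

    def send(self, topic, value):
        healthy = self._healthy()
        if not healthy:
            raise RuntimeError("no healthy shard available")
        for name in healthy:
            try:
                self._shards[name](topic, value)
            except Exception:
                self._errors[name] += 1
                if self._errors[name] >= self._threshold:
                    # Stop trying this shard until the deny period has passed.
                    self._denied_until[name] = self._clock() + self._deny_seconds
                    self._errors[name] = 0
                continue
            self._errors[name] = 0
            return name  # which shard accepted the record
        raise RuntimeError("all healthy shards failed")
```

Callers see a single `send` call, which matches the goal of hiding sharding behind an unchanged interface; the clock is injectable so the deny window can be exercised in tests without sleeping.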
07 Increase application redundancy
Multi Kubernetes Cluster deployments
Diagram: Kube cluster 1, Kube cluster 2, Kube cluster N.., alongside Kafka cluster 1 (topic A) and Kafka cluster 2 (topic B) running on EC2 VMs.
● All applications/consumers should be equally distributed across fault domains (Kubernetes clusters / availability zones / network boundaries)
● Singletons such as change data consumers run across N clusters in active-standby mode using leader election
● All important stateful applications, such as Flink applications, need to run in active-active mode across N clusters, with active-standby for output consumption
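Active-standby with leader election is commonly built on a lease that the active instance keeps renewing. The sketch below is a generic, in-memory illustration of that pattern and makes no claim about the mechanism used in the talk; `LeaseStore` and `try_acquire` are hypothetical names, and a real deployment would back the lease with a shared, consistent store.

```python
class LeaseStore:
    """In-memory stand-in for a shared lease (a real one lives in a consistent store)."""

    def __init__(self):
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, candidate, now, ttl):
        """Take or renew the lease; succeeds if it is free, expired, or already ours."""
        if self.holder in (None, candidate) or now >= self.expires_at:
            self.holder = candidate
            self.expires_at = now + ttl
            return True
        return False


store = LeaseStore()
timeline = [
    store.try_acquire("cluster-1", now=0.0, ttl=10.0),   # cluster-1 becomes active
    store.try_acquire("cluster-2", now=5.0, ttl=10.0),   # standby is refused
    store.try_acquire("cluster-1", now=8.0, ttl=10.0),   # active instance renews
    store.try_acquire("cluster-2", now=17.0, ttl=10.0),  # lease still valid
    store.try_acquire("cluster-2", now=18.0, ttl=10.0),  # cluster-1 stopped renewing
    store.try_acquire("cluster-1", now=19.0, ttl=10.0),  # old leader must stand by
]
```

Only the instance holding the lease does the singleton work; the standby takes over once the lease lapses, so the failover delay is bounded by the lease TTL.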
● Multi produce/consume (Spray) for market data
● Multi-region backups
● Consumer proxy integration for Spray
● Moving this architecture to be Kubernetes-native on top of our custom Kafka operator
Thank you