Disaster Recovery Solutions Deep Dive
Customer Success Engineering
August 2022
Table of Contents
1. Brokers, Zookeeper, Producers & Consumers
A quick primer
2. Disaster Recovery Options - Cluster Linking & Schema Linking
An asynchronous, multi-region solution
3. Stretch Clusters & Multi-Region Cluster
A synchronous, and optionally asynchronous, solution
4. Summary
Which solution is right for me?
01. Brokers, Zookeeper, Producers & Consumers 101
Brokers & Zookeeper
Apache Kafka: Scale Out Vs. Failover
[Diagram: Topic1 has four partitions spread across Brokers 1-4 for scale-out; each partition is replicated on three brokers for failover.]
Apache Zookeeper - Cluster coordination
[Diagram: Brokers 1-4 (Broker 2 is the controller) coordinate through a three-node Zookeeper ensemble; Zookeeper 3 is the leader.]
Zookeeper stores cluster metadata: heartbeats, watches, controller elections, cluster/topic configs, and permissions. Writes go to the leader.
Clients
Smart clients, dumb pipes
Producer
[Diagram: A producer writes to partitions 1-4 of a topic.]
A Kafka producer sends data to multiple partitions based on a partitioning strategy (default: hash(key) % number of partitions).
Data is batched per partition, and batches are grouped into one request per broker.
You can configure the batch size, linger time, and parallel connections per broker.
Producer
[Diagram: A producer writes to partitions 1-4; replicas acknowledge the write.]
A producer can request acknowledgement (acks) from 0, 1, or all in-sync replicas of the partition.
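As a minimal sketch of these producer settings (the broker address, topic name, and tuning values are illustrative placeholders, not taken from the deck):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ClicksProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32 * 1024); // max bytes in a per-partition batch
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);         // wait up to 10 ms to fill a batch
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION_CONFIG, 5); // parallel requests per broker
        props.put(ProducerConfig.ACKS_CONFIG, "all");           // "0", "1", or "all" (in-sync replicas)

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The default partitioner hashes the key, so "user-42" always lands on the same partition.
            producer.send(new ProducerRecord<>("clicks", "user-42", "page-view"));
        }
    }
}
```

With acks=all, durability also depends on the topic's min.insync.replicas setting, which drives several of the failure scenarios later in this deck.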
Consumer
[Diagram: A consumer reads from its assigned partitions.]
A consumer polls data from the partitions it has been assigned, based on its subscription.
Consumer
[Diagram: The consumer polls records, sends heartbeats, and commits offsets back to the cluster.]
As the consumer reads and processes the data, it can commit offsets (where it has read up to) in different ways: per time interval, per individual record, or at the "end of the current batch".
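A minimal consumer sketch along those lines, committing at the end of each polled batch (the bootstrap address, group id, and topic are placeholders):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ClicksConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "clicks-app");            // the consumer group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit manually instead

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("clicks"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // process(record);
                }
                consumer.commitSync(); // the "end of current batch" commit style
            }
        }
    }
}
```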
Consumers - Consumer Groups
[Diagram: Two consumer groups, C1 and C2, each read all partitions of the same topic.]
Different applications (consumer groups) can independently read the same topic partitions at their own pace.
Consumers - Consumer group members
[Diagram: Four consumers in one group share the topic's partitions.]
Within the same application (consumer group), different partitions can be assigned to different consumers to increase parallel consumption as well as support failover.
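A hypothetical sketch of both ideas: two different group.ids read the same topic independently, while extra members inside one group split its partitions (addresses and names are placeholders):

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TwoGroups {
    static KafkaConsumer<String, String> consumerFor(String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("clicks"));
        return consumer;
    }

    public static void main(String[] args) {
        // Different group.ids: each application independently receives every record.
        KafkaConsumer<String, String> reporting = consumerFor("reporting-app");
        KafkaConsumer<String, String> billing = consumerFor("billing-app");
        // Starting a second consumer with group.id "reporting-app" would instead
        // split the topic's partitions between the two members of that group.
    }
}
```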
Make Kafka Widely Accessible to Developers
Enable all developers to leverage Kafka throughout the organization with a wide variety of Confluent clients.
Confluent Clients: battle-tested and high-performing producer and consumer APIs (plus an admin client).
2. Disaster Recovery Options
Why Disaster Recovery?
Recent Regional Cloud Outages

AWS
● Dec 2021: An unexplained AWS outage created business disruptions all day (CNBC)
● Nov 2020: A Kinesis outage brought down over a dozen AWS services for 17 hours in us-east-1 (CRN, AWS)

Azure
● Apr 1 2021: Some critical Azure services were unavailable for an hour (Coralogix)
● Sept 2018: The South Central US region was unavailable for over a day (The Register)

GCP
● Nov 2021: An outage that affected Home Depot, Snap, Spotify, and Etsy (Bloomberg)
Outages hurt business performance
● Data Center has an outage: a data center or a region may be down for multiple hours, up to a day, based on historical outages.
● Mission-critical applications fail: the applications in that data center that run your business go offline.
● Customer impact: customers are unable to place orders, discover products, receive service, etc.
● Financial/reputational damage: revenue is lost directly from the inability to do business during downtime, and indirectly by damaging brand image and customer trust.
Failure Types

Transient Failures
● Transient failures in data centers or clusters are common and worth protecting against for business continuity purposes.
● Regional outages are rare but still worth protecting against for mission-critical systems.

Permanent Failures (Data Loss)
● Outages are typically transient but occasionally permanent: users accidentally delete topics, human error occurs.
● If your data is unrecoverable and mission critical, you need an additional, complementary solution.
Failure Scenarios

Data-Center / Regional Outages
● Data centers have single points of failure associated with hardware, resulting in associated outages.
● Regional outages arise from failures in the underlying cloud provider.

Human Error
● People delete topics, clusters, and worse.
● Unexpected behaviour arises from standard operations and within the CI/CD pipeline.

Platform Failures
● Load is applied unevenly or in short bursts by batch processing systems.
● Performance limitations arise unexpectedly.
● Bugs occur in Kafka, Zookeeper, and associated systems.
Cluster Linking & Schema Linking
Cluster Linking
Cluster Linking, built into Confluent Platform and Confluent Cloud, allows you to directly connect clusters together, mirroring topics from one cluster to another.
Cluster Linking makes it easier to build multi-cluster, multi-cloud, and hybrid cloud deployments.
[Diagram: Producers and consumers use the "clicks" topics on the active cluster in the Primary Region; a cluster link mirrors them to mirror topics on the DR cluster in the DR Region.]
Schema Linking
Schema Linking, built into Schema Registry, allows you to directly connect Schema Registry clusters together, mirroring subjects or entire contexts.
Contexts, introduced alongside Schema Linking, allow you to create namespaces within Schema Registry, which ensures mirrored subjects don't run into schema naming clashes.
[Diagram: A schema link mirrors the "clicks" schemas from the active cluster in the Primary Region to mirror schemas on the DR cluster in the DR Region.]
Prefixing
Prefixing allows you to add a prefix to a topic and, if desired, the associated consumer group, to avoid topic and consumer group naming clashes between the primary and Disaster Recovery cluster.
This is important when used in an active-active setup, and it is required for a two-way Cluster Link strategy, which is the recommended approach.
[Diagram: The "clicks" topic and consumer group on the active cluster are mirrored to the DR cluster under prefixed names ("DR-topic", "DR-Consumer-Group") via a prefixed cluster link.]
Active-Passive
HA/DR Active-Passive
1. Steady state
Setup
● The cluster link can automatically create mirror topics for any new topics on the active cluster.
● Historical data is replicated & incoming data is synced in real-time.
[Diagram: Producers and consumers use the active cluster in the Primary Region; the cluster link feeds mirror topics on the DR cluster in the DR Region.]
HA/DR Active-Passive
2. Failover
1. Detect a regional outage via metrics going to zero in that region; decide to fail over.
2. Call the failover API (REST API or CLI) on mirror topics to make them writable.
3. Update DNS to point at the DR cluster.
4. Start clients in the DR region.
[Diagram: Producers and consumers move from the failed active cluster to the DR cluster, whose mirror topics have been promoted via the failover REST API or CLI.]
HA/DR Active-Passive
3. Fail forward
The standard strategy is to "fail forward", promoting the DR region to be the new Primary Region:
● Cloud regions offer identical service.
● All applications & data systems have already moved to the DR region.
● Failing back would introduce risk with little benefit.
To fail forward, simply:
1. Delete topics on the original cluster (or spin up a new cluster).
2. Establish a cluster link in the reverse direction.
[Diagram: The former DR cluster is now the active cluster; a reversed cluster link mirrors its topics back to the former primary, which is now the DR cluster.]
HA/DR Active-Passive
3. Failback (alternative)
If you can't fail forward and need to fail back to the original region:
1. Delete topics on the Primary cluster (or spin up a new cluster).
2. Establish a cluster link in the reverse direction.
3. When the Primary has caught up, migrate producers & consumers back:
a. Stop clients.
b. Promote the mirror topic(s).
c. Restart clients pointed at the Primary cluster.
[Diagram: A reversed cluster link syncs the DR cluster's topics back to mirror topics on the Primary cluster asynchronously.]
HA/DR - Consumers must tolerate some duplicates
Consumers must tolerate duplicate messages because Cluster Linking is asynchronous.
[Diagram: Consumer X had read up to message C on the primary cluster at the time of the outage, but its replicated offset on the DR cluster lags behind; after failover it re-reads from that earlier offset and consumes message C twice.]
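The deck doesn't prescribe a dedup mechanism; the following is only a hypothetical application-level idempotency sketch, where a business key per record lets a message like C, redelivered after failover, be applied once:

```java
import java.util.HashSet;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerRecord;

public class DedupingProcessor {
    private final Set<String> processedKeys = new HashSet<>(); // in production: a durable store

    public void process(ConsumerRecord<String, String> record) {
        String businessKey = record.key(); // assumes the key uniquely identifies the event
        if (!processedKeys.add(businessKey)) {
            return; // duplicate delivered after failover; skip it
        }
        // ... apply the record's effect exactly once ...
    }
}
```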
Active-Passive
Bi-Directional Cluster Linking
HA/DR Bi-Directional Cluster Linking: Automatic Data Recovery & Failback
1. Steady state
Setup, for a topic named clicks:
● We create duplicate topics on both the Primary and DR cluster.
● Create prefixed cluster links in both directions.
● Produce records to clicks on the Primary cluster.
● Consumers consume from a regex pattern (.*clicks), as sketched below.
[Diagram: Primary cluster "West" holds clicks plus mirror topic east.clicks; DR cluster "East" holds clicks plus mirror topic west.clicks. Each link adds its source cluster's prefix ("west." or "east.").]
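The regex subscription uses the standard Kafka consumer API; a minimal sketch (the bootstrap address and group id are placeholders):

```java
import java.time.Duration;
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RegexConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "west-broker:9092"); // placeholder address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "clicks-app");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Matches both the local "clicks" and the mirrored "east.clicks" /
            // "west.clicks", so the same subscription works on either cluster.
            consumer.subscribe(Pattern.compile(".*clicks"));
            while (true) {
                consumer.poll(Duration.ofMillis(500)).forEach(record -> { /* process(record) */ });
            }
        }
    }
}
```

Because the pattern matches both clicks and the prefixed mirror topics, the same application code runs unchanged against either cluster.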
HA/DR Bi-Directional Cluster Linking: Automatic Data Recovery & Failback
2. Outage strikes!
An outage in the primary region:
● stops producers & consumers in the primary region,
● temporarily pauses cluster link mirroring,
● and a small set of data may not have been replicated yet to the DR cluster – this is your "RPO".
[Diagram: The West cluster and its clients are down; East retains clicks and the west.clicks mirror, which may lag slightly.]
HA/DR Bi-Directional Cluster Linking: Automatic Data Recovery & Failback
3. Failover
To failover:
● Move consumers and producers to the DR cluster; keep the same topic names / regex.
● Consumers consume both:
○ pre-failover data in west.clicks
○ post-failover data in clicks
● Don't delete the cluster link.
● Disable clicks -> west.clicks offset replication.
[Diagram: Clients now produce to and consume from the East cluster; the West cluster is down.]
HA/DR Bi-Directional Cluster Linking: Automatic Data Recovery & Failback
4. Recovery
If/when the outage is over:
● The primary-to-DR cluster link automatically recovers the lagged data (RPO) from the primary cluster. Note: this data will be "late arriving" to the consumers.
● New records produced to the DR cluster will automatically begin replicating to the primary.
[Diagram: The links recover lagged data into the DR cluster's west.clicks and fail back new data into the primary's east.clicks.]
HA/DR Bi-Directional Cluster Linking: Automatic Data Recovery & Failback
5. Failback
To fail back to the primary region, consumers need to pick up at the end of the writable topics, so:
● Ensure that all consumer groups have 0 consumer lag for their DR topics, e.g. west.clicks.
● Reset all consumer offsets to the last offset (the Log End Offset, LEO); this can be done by the platform operator, as sketched below.
Finally, move consumers & producers back to the Primary:
● Each producer / consumer group can be moved independently.
[Diagram: Consumers are reset to resume at the log end offset on the primary; producers and consumers move back from East to West.]
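The deck doesn't show how the operator performs that reset; one way is the standard Kafka AdminClient, as in this sketch (the bootstrap address, topic, partitions, and group name are placeholders, and the group must have no active members while its offsets are altered):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ResetToLatest {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "primary-broker:9092"); // placeholder address
        try (Admin admin = Admin.create(props)) {
            // Partitions of the writable topic the group should resume from.
            Map<TopicPartition, OffsetSpec> request = new HashMap<>();
            request.put(new TopicPartition("clicks", 0), OffsetSpec.latest());
            request.put(new TopicPartition("clicks", 1), OffsetSpec.latest());

            ListOffsetsResult latest = admin.listOffsets(request);

            Map<TopicPartition, OffsetAndMetadata> newOffsets = new HashMap<>();
            for (TopicPartition tp : request.keySet()) {
                long leo = latest.partitionResult(tp).get().offset(); // the log end offset
                newOffsets.put(tp, new OffsetAndMetadata(leo));
            }
            // Point the consumer group at the LEO so it skips nothing and re-reads nothing.
            admin.alterConsumerGroupOffsets("clicks-app", newOffsets).all().get();
        }
    }
}
```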
HA/DR Bi-Directional Cluster Linking: Automatic Data Recovery & Failback
6. And beyond
● Re-enable clicks -> west.clicks consumer offset replication.
● Once consumer lag is 0 on east.clicks, reset all consumer groups to the Log End Offset (the last offset of the partition) of clicks on the DR cluster.
[Diagram: Steady state restored; both clusters mirror each other's clicks topic again, and consumers resume at the log end offset.]
Active-Active
Bi-Directional Cluster Linking
[Diagram: A load balancer (as an example) splits application / web traffic across both regions. The West cluster holds clicks plus mirror topic east.clicks; the East cluster holds clicks plus mirror topic west.clicks. Producers in each region write to their local clicks, and consumers in each region subscribe to .*clicks.]
HA/DR Bi-Directional Cluster Linking: Active-Active
1. Steady state
[Diagram: Same topology as above; traffic flows to both clusters, and each cluster mirrors the other's clicks topic.]
HA/DR Bi-Directional Cluster Linking: Active-Active
2. Outage strikes!
[Diagram: The West cluster goes down; the load balancer re-routes all traffic to East.]
HA/DR Bi-Directional Cluster Linking: Active-Active
3. Failover
Any remaining pre-failure data is automatically recovered by the consumers.
[Diagram: All traffic is re-routed to the East cluster; its consumers pick up the pre-failure data from the west.clicks mirror.]
HA/DR Bi-Directional Cluster Linking: Active-Active
4. Return to Steady State
3. Stretch Clusters & Multi-Region Cluster

Stretch Cluster
A Stretch Cluster is ONE Kafka cluster that is "stretched" across multiple availability zones or data centers.
It uses Kafka's internal replication features to achieve RPO = 0 & low RTO.
Stretch Cluster - Why?
Stretch Cluster: Non-Stretch Cluster Behaviour
1. Steady State
Setup
● Any number of brokers, represented here by brokers 1-4, spread across 2 DCs.
● A standard three-node Zookeeper cluster spread across 2 DCs.
[Diagram: DC "West" hosts Brokers 1-2 and Zookeepers 1-2; DC "East" hosts Brokers 3-4 and Zookeeper 3.]
Stretch Cluster: Non-Stretch Cluster Behaviour
1. Steady State… continued
Setup
● Any number of brokers, represented here by brokers 1-4, spread across 2 DCs.
● A standard three-node Zookeeper cluster spread across 2 DCs.
● We'll also assume a replication factor of 3, min.insync.replicas of 2, and acks=all.
[Diagram: Replicas 1-2 in DC "West", Replica 3 in DC "East"; one broker in "East" is unused.]
Stretch Cluster: Non-Stretch Cluster Behaviour
2. DC Outage
An outage in DC "West"
● … let's start by just focusing on Kafka.
[Diagram: DC "West" (Replicas 1-2, Zookeepers 1-2) is down; DC "East" retains Replica 3 and Zookeeper 3.]
Stretch Cluster: Non-Stretch Cluster Behaviour
2. DC Outage
An outage in DC "West"
● min.insync.replicas can no longer be met, and we lose availability.
[Diagram: Only Replica 3 in DC "East" survives, below min.insync.replicas=2.]
Stretch Cluster: Non-Stretch Cluster Behaviour
3. Fixing Broker Availability
Increase to rf=4
● Looks like we've solved our issue…
[Diagram: Replicas 1-2 in DC "West", Replicas 3-4 in DC "East".]
Stretch Cluster: Non-Stretch Cluster Behaviour
3. Fixing Broker Availability… But
Increase to rf=4
● Looks like we've solved our issue… but if our 2 surviving replicas are out of sync, then we lose availability unless we trigger an unclean leader election and accept data loss.
[Diagram: DC "West" is down; Replicas 3-4 in DC "East" are both out of sync.]
Stretch Cluster: Non-Stretch Cluster Behaviour
4. Fixing Data Loss
Increase min.insync.replicas to 3
● Consumers continue to operate.
● Producers continue to operate once we revert to min.insync.replicas=2.
[Diagram: With min.insync.replicas=3, at least one replica in DC "East" is guaranteed in sync for every acknowledged write, so losing DC "West" loses no acknowledged data.]
Stretch Cluster: Non-Stretch Cluster Behaviour
4. Fixing Data Loss… But What About Zookeeper?
[Diagram: Zookeepers 1-2 sit in DC "West" and Zookeeper 3 in DC "East". If DC "West" goes down, only one of the three Zookeeper nodes survives, so quorum is lost and the whole cluster becomes unavailable.]
Stretch Cluster - 2 DC
Stretch Cluster: 2 DC + Observer
1. Steady State
Setup
● A minimum of 4 brokers.
● 6 Zookeeper nodes, one of which is an observer.
● Replication factor of 4, min.insync.replicas of 3, and acks=all (topic creation with these settings is sketched below).
[Diagram: DC "West" hosts Brokers 1-2 and Zookeepers 1-3; DC "East" hosts Brokers 3-4, Zookeepers 4-5, and Zookeeper 6 (observer).]
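Creating a topic with these settings via the standard AdminClient might look like the following sketch (the bootstrap address, topic name, and partition count are placeholders; spreading the 4 replicas across the DCs is assumed to be handled by broker.rack placement):

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateStretchTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "west-broker:9092"); // placeholder address
        try (Admin admin = Admin.create(props)) {
            // 6 partitions (placeholder), replication factor 4, min.insync.replicas 3.
            NewTopic topic = new NewTopic("clicks", 6, (short) 4)
                    .configs(Map.of("min.insync.replicas", "3"));
            admin.createTopics(Set.of(topic)).all().get();
        }
    }
}
```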
Stretch Cluster: 2 DC + Observer
2. DC Outage - On observer DC
An outage in DC "East"
● Consumers continue to operate.
● Producers continue to operate once we revert to min.insync.replicas=2 (see the sketch below).
[Diagram: DC "East", including the Zookeeper observer, is down; DC "West" retains Brokers 1-2 and Zookeepers 1-3, which still form a quorum.]
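That revert to min.insync.replicas=2 can be done per topic with the standard AdminClient, sketched here (the bootstrap address and topic are placeholders; the same setting can also be changed at the broker level):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RelaxMinIsr {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "west-broker:9092"); // placeholder: a broker in the surviving DC
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "clicks");
            // Lower min.insync.replicas so producers with acks=all can make progress again.
            AlterConfigOp lowerMinIsr = new AlterConfigOp(
                    new ConfigEntry("min.insync.replicas", "2"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(lowerMinIsr))).all().get();
        }
    }
}
```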
Stretch Cluster: 2 DC + Observer
3. DC Outage - On non-observer DC
An outage in DC "West"
● We can't reach Zookeeper quorum!
[Diagram: DC "West" (Brokers 1-2, Zookeepers 1-3) is down; the remaining voters, Zookeepers 4-5, are fewer than the 3 of 5 needed for quorum, and Zookeeper 6 is only an observer.]
Stretch Cluster: 2 DC + Observer
3. DC Outage - On non-observer DC… but
An outage in DC "West"
● We promote the Zookeeper observer to a full follower.
● Remove Zookeeper 1, 2 & 3 from the quorum list.
● Perform a rolling restart of the Zookeeper nodes.
[Diagram: DC "East" now runs a three-node quorum of Zookeepers 4, 5, and 6.]
Stretch Cluster: 2 DC + Observer
3. DC Outage - On non-observer DC
An outage in DC "West"
● Consumers continue to operate.
● Producers continue to operate once we revert to min.insync.replicas=2.
[Diagram: Brokers 3-4 and Zookeepers 4-6 in DC "East" carry the cluster.]
Stretch Cluster: 2 DC + Observer
4. Network Partition
A network partition occurs between DCs
● Consumers continue to operate as usual up until they've consumed all fully replicated data.
● Producers will fail, as we can no longer meet min.insync.replicas=3 (see the sketch below).
[Diagram: The link between DC "West" and DC "East" is severed; each side can still reach only its local brokers and Zookeeper nodes.]
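On the producer side, that failure surfaces through the send callback; a sketch of detecting it (addresses and names are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.NotEnoughReplicasException;
import org.apache.kafka.common.errors.TimeoutException;
import org.apache.kafka.common.serialization.StringSerializer;

public class MinIsrAwareProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "west-broker:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("clicks", "user-42", "page-view"), (metadata, exception) -> {
                // NotEnoughReplicasException is retriable, so the producer retries internally;
                // if the partition stays below min.insync.replicas, the send eventually fails,
                // often surfacing as a TimeoutException after delivery.timeout.ms.
                if (exception instanceof NotEnoughReplicasException
                        || exception instanceof TimeoutException) {
                    // Alert operators; writes resume after failover or after min.insync.replicas is lowered.
                }
            });
            producer.flush();
        }
    }
}
```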
Stretch Cluster: 2 DC + Observer
5. Fixing Network Partition
A network partition occurs between DCs
● We manually shut down DC "East", then update min.insync.replicas=2.
● Clients resume operating as normal.
● Consumers failing over from DC "East" will consume some duplicate records.
[Diagram: DC "East" is shut down; DC "West" continues with Brokers 1-2 and Zookeepers 1-3.]
Observer Risk!
Zookeeper observers solve our availability and split-brain issues but risk data loss!
[Diagram: The Zookeeper leader and two followers in DC "West" form the quorum; the two followers and the observer in DC "East" are all out of sync, so losing DC "West" loses committed Zookeeper state.]
Hierarchical Quorum
Hierarchical Quorum involves getting consensus between multiple Zookeeper "groups", each of which forms its own quorum. In the case of a two-DC hierarchy, consensus must be reached between BOTH DCs.
[Diagram: Zookeepers 1-3 (leader in "West") form one group and Zookeepers 4-6 in "East" form another; both groups participate in the overall quorum.]
Stretch Cluster: 2 DC + Hierarchical Quorum
1. Steady State
Setup
● A minimum of 4 brokers.
● 6 Zookeeper nodes, arranged into two groups.
● Replication factor of 4, min.insync.replicas of 3, and acks=all.
[Diagram: DC "West" hosts Brokers 1-2 and Zookeepers 1-3; DC "East" hosts Brokers 3-4 and Zookeepers 4-6.]
Stretch Cluster: 2 DC + Hierarchical Quorum
2. DC Outage
An outage in DC "East"
● Consumers continue to operate for leaders on DC "West".
● Leaders can't be elected and configuration updates can't be made until we have hierarchical quorum.
[Diagram: DC "East" (Brokers 3-4, Zookeepers 4-6) is down.]
Stretch Cluster: 2 DC + Hierarchical Quorum
3. DC Outage
An outage in DC "East"
● Remove the DC "East" Zookeeper group from the hierarchy.
● Revert to min.insync.replicas=2.
[Diagram: DC "West" continues alone with Brokers 1-2 and Zookeepers 1-3.]
Stretch Cluster: 2 DC + Hierarchical Quorum
4. Network Partition
A network partition occurs between DCs
● Consumers continue to operate as usual up until they've consumed all fully replicated data.
● Producers will fail, as we can no longer meet min.insync.replicas=3.
[Diagram: The link between the two DCs is severed; neither side alone can reach hierarchical quorum.]
Stretch Cluster: 2 DC + Hierarchical Quorum
5. Fixing Network Partition
A network partition occurs between DCs
● We manually shut down DC "East", remove it from the hierarchy & update min.insync.replicas=2.
● Clients resume operating as normal.
● Consumers failing over from DC "East" will consume some duplicate records.
[Diagram: DC "West" continues alone.]
Stretch Cluster - 2.5 DC
Stretch Cluster: 2.5 DC
1. Steady State
Setup
● A minimum of 4 brokers.
● 3 Zookeeper nodes.
● Replication factor of 4, min.insync.replicas of 3, and acks=all.
● Note: it's actually better for the DCs with brokers to be closest to each other.
[Diagram: DC "West" hosts Brokers 1-2 and Zookeeper 1; DC "East" hosts Brokers 3-4 and Zookeeper 3; DC "Central", the "half" DC, hosts only Zookeeper 2.]
Stretch Cluster: 2.5 DC
2. DC Outage
An outage in DC "West"
● Consumers continue to operate.
● Producers continue to operate once we revert to min.insync.replicas=2.
[Diagram: DC "West" is down; Zookeepers 2-3 in "Central" and "East" retain quorum.]
Stretch Cluster: 2.5 DC
3. DC Network Partition
A network partition in DC "West"
● Consumers connected to DC "East" continue to operate.
[Diagram: DC "West" is cut off from "Central" and "East".]
Stretch Cluster: 2.5 DC
3. DC Network Partition
A network partition in DC "West"
● Consumers connected to DC "West" continue to operate until they've processed all fully replicated records.
[Diagram: same topology as above.]
Stretch Cluster: 2.5 DC
3. DC Network Partition
A network partition in DC "West"
● Producers connected to DC "East" continue to operate once we revert to min.insync.replicas=2.
[Diagram: same topology as above.]
Stretch Cluster: 2.5 DC
3. DC Network Partition
A network partition in DC "West"
● Producers connected to DC "West" continue to operate once we shut down DC "West", fail over, and revert to min.insync.replicas=2.
[Diagram: same topology as above.]
Multi-Region Cluster
Multi-Region Clusters: Followers vs Observers
Followers are normal replicas; observers act the same, except that they are not counted for acks=all produce requests.
[Diagram: In DC "West", producers write to the leader, which replicates synchronously to a follower; the observer in DC "East" replicates asynchronously.]
Multi-Region Clusters: Automatic Observer Promotion
As of Confluent Platform 6.1, observers can be configured to be promoted according to an ObserverPromotionPolicy, including:
● under-min-isr: promoted if the in-sync replica count drops below min.insync.replicas.
● under-replicated: promoted to cover any replica which is no longer in sync.
● leader-is-observer: promoted if the current leader is an observer.
[Diagram: The asynchronous observer in DC "East" becomes a follower when the policy triggers.]
Multi-Region Clusters: 2.5 DC
1. Steady State
Setup
● A minimum of 6 brokers.
● 3 Zookeeper nodes.
● Replication factor of 4, plus 2 additional observers, min.insync.replicas of 3, and acks=all.
[Diagram: DC "West" hosts Brokers 1-2, Broker 3 (Observer 1), and Zookeeper 1; DC "East" hosts Brokers 4-5, Broker 6 (Observer 2), and Zookeeper 3; DC "Central" hosts Zookeeper 2.]
Multi-Region Clusters: 2.5 DC
2. DC Outage
An outage in DC "West"
● The observer in DC "East" is promoted.
● Consumers and producers continue to operate as usual.
● RPO = 0
● RTO ~ 0
[Diagram: Broker 6 in DC "East" is promoted from Observer 2 to Replica 5; Brokers 4-5 hold Replicas 3-4.]
Multi-Region Clusters: 2.5 DC
3. DC Network Partition
A network partition in DC "West"
● The observer in DC "East" is promoted.
● Consumers and producers connected to DC "East" continue to operate as usual.
[Diagram: DC "West" is cut off; DC "East" runs Replicas 3-5.]
Multi-Region Clusters: 2.5 DC
3. DC Network Partition
A network partition in DC "West"
● The observer in DC "West" cannot be promoted, as it has no Zookeeper quorum.
[Diagram: Broker 3 (Observer 1) in DC "West" stays an observer; only Zookeeper 1 is reachable from "West".]
Multi-Region Clusters: 2.5 DC
3. DC Network Partition
A network partition in DC "West"
● Consumers connected to DC "West" continue to operate until they've processed all fully replicated records. Once we shut down DC "West", the consumers will fail over and consume from the same point. This will result in duplicate consumption.
[Diagram: same topology as above.]
Multi-Region Clusters: 2.5 DC
3. DC Network Partition
A network partition in DC "West"
● Producers connected to DC "West" fail, as we can no longer meet min.insync.replicas=3.
[Diagram: same topology as above.]
Multi-Region Clusters: 2.5 DC
3. DC Network Partition
A network partition in DC "West"
● To continue operating as normal, we must manually shut down DC "West".
[Diagram: same topology as above.]
Stretch Cluster - 3 DC (ICG EAP KaaS)
Multi-Region Clusters: 3 DC
1. Steady State
Setup
● 9 brokers, of which the 3 brokers in DC "MW" host only observers.
● 5 Zookeeper nodes.
● Replication factor of 4, plus 1 additional observer, min.insync.replicas of 3, and acks=all.
[Diagram: DC "MW" hosts Brokers 1-3 (observers) and Zookeeper 1; DC "NY" hosts Brokers 4-6 and Zookeepers 2-3; DC "NJ" hosts Brokers 7-9 and Zookeepers 4-5.]
Multi-Region Clusters: 3 DC
2. DC Outage
An outage in DC "NY"
● The observers in DC "MW" are promoted.
● Consumers and producers continue to operate as usual.
● RPO = 0
● RTO ~ 0
[Diagram: Brokers 1-3 in DC "MW" are promoted from observers to replicas.]
Multi-Region Clusters: 3 DC
3. DC Network Partition
A network partition in DC "NJ"
● The observer in DC "MW" is promoted.
● Consumers and producers connected to DC "NY" continue to operate as usual.
[Diagram: DC "NJ" is cut off; "MW" and "NY" continue together.]
Multi-Region Clusters: 3 DC
3. DC Network Partition
A network partition in DC "NJ"
● Consumers connected to DC "NJ" continue to operate until they've processed all fully replicated records. Once we shut down DC "NJ", or the application is restarted, the consumers will fail over and consume from the same point. This will result in duplicate consumption.
[Diagram: same topology as above.]
Multi-Region Clusters: 3 DC
3. DC Network Partition
A network partition in DC "NJ"
● Producers connected to DC "NJ" fail, as we can no longer meet min.insync.replicas=3.
● Once the application is restarted, the producers will fail over and produce their data by connecting to DC "NY" / DC "MW".
[Diagram: same topology as above.]
Multi-Region Clusters: 3 DC
3. DC Network Partition
A network partition in DC "NJ"
● To continue operating as normal, we must manually shut down DC "NJ".
[Diagram: same topology as above.]
4. Summary
Comparison

Supported                          | Cluster Linking | Stretch / Multi-Region Cluster | Replicator / MirrorMaker 2
RPO=0                              |                 | ✓                              |
RTO=~0                             | ✓               | ✓                              | ✓
Active-Active                      | ✓               | ✓                              | ✓
Failover With All Clients          | ✓               | ✓                              |
Failover With Transactions         | ✓               | ✓                              |
Failover Maintains Record Ordering | ✓               | ✓                              |
Smooth Failback                    | ✓               | ✓                              |
Handles Full Cluster Failure       | ✓               |                                | ✓
Hybrid Cloud / Multi-Cloud         | ✓               |                                | ✓
Open Source                        |                 | ✓*                             | ✓*
Preserves Metadata                 | ✓               | ✓                              | ✓*
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 

Citi Tech Talk Disaster Recovery Solutions Deep Dive

  • 1. Disaster Recovery Solutions Deep Dive Customer Success Engineering August 2022
  • 2. Table of Contents 2 1. Brokers, Zookeeper, Producers & Consumers A quick Primer 3. Stretch Clusters & Multi-Region Cluster An asynchronous, multi-region solution 2. Disaster Recovery Options - Cluster Linking & Schema Linking A synchronous and optionally asynchronous solution 4. Summary Which solution is right for me?
  • 5. Apache Kafka: Scale Out Vs. Failover 5 Broker 1 Topic1 partition1 Broker 2 Broker 3 Broker 4 Topic1 partition1 Topic1 partition1 Topic1 partition2 Topic1 partition2 Topic1 partition2 Topic1 partition3 Topic1 partition4 Topic1 partition3 Topic1 partition3 Topic1 partition4 Topic1 partition4
  • 6. Apache Zookeeper - Cluster coordination 6 6 Broker 1 partition Broker 2 (controller) Broker 3 Broker 4 Zookeeper 2 partition partition Zookeeper 1 Zookeeper 3 (leader) partition partition partition partition Stores metadata: heartbeats, watches, controller elections, cluster/topic configs, permissions writes go to leader
  • 8. Producer 8 P partition 1 partition 2 partition 3 partition 4 A Kafka producer sends data to multiple partitions based on partitioning strategy (default is hash(key) % no of partitions). Data is sent in batch per partition then batched in request per broker. Can configure batch size, linger, parallel connections per broker
  • 9. Producer 9 P partition 1 partition 2 partition 3 partition 4 A producer can choose to get acknowledgement (acks) from 0, 1, or ALL (in-sync) replicas of the partition
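To make the batching and acknowledgement knobs above concrete, here is a minimal Java producer sketch using the Apache Kafka client. The broker address, topic name and tuning values are placeholders for illustration, not recommendations.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ClickProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");      // wait for all in-sync replicas to acknowledge
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536); // bytes batched per partition (example value)
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);     // wait up to 10 ms to fill a batch
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5); // parallel requests per broker
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records sharing a key are hashed to the same partition by the partitioning strategy.
            producer.send(new ProducerRecord<>("clicks", "user-42", "page=/home"));
        }
    }
}
```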
  • 10. Consumer 10 C A consumer polls data from partitions it has been assigned based on a subscription
  • 11. Consumer 11 C As the consumer reads the data and processes the data, it can commit offsets (where it has read up to) in different ways (per time interval, individual records, or “end of current batch”) commit offset heartbeat poll records
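The "end of current batch" commit style described above looks like this in a minimal Java sketch; the broker address, group id and topic are placeholders, and process() stands in for real application logic.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ClickConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "click-processor");       // placeholder group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Commit manually at the end of each processed batch rather than on a timer.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("clicks"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // application logic
                }
                consumer.commitSync(); // "end of current batch" commit style
            }
        }
    }
    static void process(ConsumerRecord<String, String> record) { /* ... */ }
}
```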
  • 12. Consumers - Consumer Groups 12 C C C1 C C C2 Different applications can independently read from same topic partitions at their own pace
  • 13. Consumers - Consumer group members 13 C C C C Within the same application (consumer group), different partitions can be assigned to different consumers to increase parallel consumption as well as support failover
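A sketch of that parallel consumption, assuming a hypothetical 4-thread layout and placeholder connection details: each KafkaConsumer below shares the same group.id, so Kafka divides the topic's partitions among them, and a failed member's partitions are reassigned to the survivors.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ParallelGroup {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // never shut down in this sketch
        for (int i = 0; i < 4; i++) {
            pool.submit(() -> {
                Properties props = new Properties();
                props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
                // Same group.id: the 4 consumers split the topic's partitions between them.
                props.put(ConsumerConfig.GROUP_ID_CONFIG, "click-processor");
                props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
                props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
                try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                    consumer.subscribe(List.of("clicks"));
                    while (!Thread.currentThread().isInterrupted()) {
                        consumer.poll(Duration.ofMillis(500)).forEach(r -> { /* process */ });
                    }
                }
            });
        }
    }
}
```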
  • 14. Make Kafka Widely Accessible to Developers 14 Enable all developers to leverage Kafka throughout the organization with a wide variety of Confluent clients Confluent Clients Battle-tested and high performing producer and consumer APIs (plus admin client)
  • 17. Recent Regional Cloud Outages 1 7 AWS Azure GCP Dec 2021: An unexplained AWS outage created business disruptions all day (CNBC) Nov 2020: A Kinesis outage brought down over a dozen AWS services for 17 hours in us-east-1 (CRN, AWS) Apr 1 2021: Some critical Azure services were unavailable for an hour (Coralogix) Sept 2018: South Central US region was unavailable for over a day (The Register) Nov 2021: An outage that affected Home Depot, Snap, Spotify, and Etsy (Bloomberg)
  • 18. Outages hurt business performance 18 A data center or a region may be down for multiple hours–up to a day–based on historical experience Data Center has an outage The applications in that data center that run your business go offline Mission-critical applications fail Customers are unable to place orders, discover products, receive service, etc. Customer Impact Revenue is lost directly from the inability to do business during downtime, and indirectly by damaging brand image and customer trust Financial/Reputational Impact
  • 19. Failure Types 1 9 Transient Failures Permanent Failures (Data Loss) Transient failures in data-centers or clusters are common and worth protecting against for business continuity purposes. Regional outages are rare but still worth protecting against for mission critical systems. Outages are typically transient but occasionally permanent. Users accidentally delete topics, human error occurs. If your data is unrecoverable and mission critical, you need an additional complementary solution.
  • 20. Failure Scenarios Data-Center / Regional Outages Platform Failures Human Error Data-Centers have single points of failure associated with hardware, resulting in outages. Regional Outages arise from failures in the underlying cloud provider. People delete topics, clusters and worse. Unexpected behaviour arises from standard operations and within the CI/CD pipeline. Load is applied unevenly or in short bursts by batch processing systems. Performance limitations arise unexpectedly. Bugs occur in Kafka, Zookeeper and associated systems.
  • 21. Cluster Linking & Schema Linking
  • 22. 22 Cluster Linking Cluster Linking, built into Confluent Platform and Confluent Cloud, allows you to directly connect clusters together, mirroring topics from one cluster to another. Cluster Linking makes it easier to build multi-cluster, multi-cloud, and hybrid cloud deployments. Active cluster Consumers Producers clicks clicks Topics DR cluster clicks clicks Mirror Topics Cluster Link Primary Region DR Region
  • 23. 23 Schema Linking Schema Linking, built into Schema Registry, allows you to directly connect Schema Registry clusters together, mirroring subjects or entire contexts. Contexts, introduced alongside Schema Linking, allow you to create namespaces within Schema Registry, which ensures mirrored subjects don’t run into schema naming clashes. Active cluster Consumers Producers clicks clicks Schemas DR cluster clicks clicks Mirror Schemas Schema Link Primary Region DR Region Consumers Producers
  • 24. 24 Prefixing Prefixing allows you to add a prefix to a topic and, if desired, the associated consumer group, to avoid topic and consumer group naming clashes between the primary and Disaster Recovery cluster. This is important in an active-active setup and required for the two-way Cluster Link strategy, which is the recommended approach. Active cluster Consumer-Group clicks clicks Topic DR cluster clicks clicks DR-topic Cluster Link Primary Region DR Region DR-Consumer-Group
  • 26. HA/DR Active-Passive 1. Steady state Setup ● The cluster link can automatically create mirror topics for any new topics on the active cluster ● Historical data is replicated & incoming data is synced in real-time Active cluster Consumers Producers clicks clicks topics DR cluster clicks clicks mirror topics Cluster Link Primary Region DR Region
  • 27. HA/DR Active-Passive 2. Failover 1. Detect a regional outage via metrics going to zero in that region; decide to fail over 2. Call the failover API on mirror topics to make them writable 3. Update DNS to point at the DR cluster 4. Start clients in the DR region Active cluster Consumers Producers clicks clicks topics DR cluster clicks clicks mirror topics failover REST API or CLI Consumers Producers Primary Region DR Region
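Step 2 above is an API call against the DR cluster. As a rough sketch, the Java snippet below posts the failover request over HTTP; the endpoint path and payload follow the Confluent v3 REST API for cluster links as we understand it, and the host, cluster id and link name are placeholders — verify the exact route against your Confluent version's API reference before relying on it.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class MirrorFailover {
    public static void main(String[] args) throws Exception {
        String restEndpoint = "https://dr-kafka-rest:8090"; // placeholder host
        String clusterId = "dr-cluster-id";                 // placeholder cluster id
        String linkName = "dr-link";                        // placeholder link name
        // Ask the DR cluster to make the listed mirror topics writable.
        String body = "{\"mirror_topic_names\":[\"clicks\"]}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(restEndpoint + "/kafka/v3/clusters/" + clusterId
                        + "/links/" + linkName + "/mirrors:failover"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```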
  • 28. HA/DR Active-Passive 3. Fail forward The standard strategy is to “fail forward”, promoting the DR region to be the new Primary Region: ● Cloud regions offer identical service ● You have already moved all of your applications & data systems to the DR region ● Failing back would introduce risk with little benefit To fail forward, simply: 1. Delete topics on the original cluster (or spin up a new cluster) 2. Establish a cluster link in the reverse direction Active DR cluster clicks clicks mirror topics DR Active cluster clicks clicks mirror topics Cluster Link Consumers Producers Primary DR Region DR Primary Region
  • 29. HA/DR Active-Passive 3. Failback (alternative) If you can’t fail forward and need to fail back to the original region: 1. Delete topics on the Primary cluster (or spin up a new cluster) 2. Establish a cluster link in the reverse direction 3. When the Primary has caught up, migrate producers & consumers back: a. Stop clients b. Promote mirror topic(s) c. Restart clients pointed at the Primary cluster DR cluster clicks clicks mirror topics Consumers Producers Cluster Link Primary Region DR Region Primary cluster clicks clicks mirror topics
  • 30. Synced asynchronously HA/DR - Consumers must tolerate some duplicates Consumers must tolerate duplicate messages because Cluster Linking is asynchronous. Primary cluster Consumer X A B C D Topic Consumer X offset at time of outage DR cluster A B C D Mirror Topic Consumer X offset at time of failover ... ... A B C C D ... Consumes message C twice
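One common way to tolerate such duplicates is to make processing idempotent by de-duplicating on a business-level identifier; offsets can't be used for this, since they differ between a topic and its mirror after failover. A minimal sketch, assuming the record key carries a unique event id and an in-memory set stands in for a real persistent store:

```java
import java.time.Duration;
import java.util.HashSet;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DedupingProcessor {
    // In production this would be a bounded or persistent store; a set keeps the sketch short.
    private final Set<String> processedIds = new HashSet<>();

    public void run(KafkaConsumer<String, String> consumer) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                // Assumed: the key is a unique business-level event id.
                String eventId = record.key();
                if (processedIds.add(eventId)) {
                    process(record);
                } // else: duplicate redelivered after failover; skip it
            }
            consumer.commitSync();
        }
    }
    void process(ConsumerRecord<String, String> record) { /* idempotent side effects */ }
}
```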
  • 32. DR cluster “East” HA/DR Bi-Directional Cluster Linking: Automatic Data Recovery & Failback 1. Steady state Setup For a topic named clicks ● We create duplicate topics on both the Primary and DR cluster ● Create prefixed cluster links in both directions ● Produce records to clicks on the Primary cluster ● Consumers consume from a Regex pattern Primary cluster “West” clicks Consumers .*clicks Producers clicks Add prefix west clicks clicks clicks west.clicks east.clicks Add prefix east
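The regex subscription in the last bullet is plain Kafka consumer API. A minimal sketch with placeholder connection details — the same ".*clicks" pattern picks up clicks, west.clicks and east.clicks, so the application code is identical on either cluster:

```java
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RegexSubscriber {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "west-broker:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "click-processor");           // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Matches clicks, west.clicks and east.clicks, before or after failover.
            consumer.subscribe(Pattern.compile(".*clicks"));
            // poll loop as in the earlier consumer example...
        }
    }
}
```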
  • 33. DR cluster “East” HA/DR Bi-Directional Cluster Linking: Automatic Data Recovery & Failback 2. Outage strikes! An outage in the primary region: ● Stops producers & consumers in the primary region ● Temporarily pauses cluster link mirroring ● A small set of data may not have been replicated yet to the DR cluster – this is your “RPO” Primary cluster “West” clicks Consumers .*clicks Producers clicks clicks clicks clicks west.clicks east.clicks
  • 34. DR cluster “East” HA/DR Bi-Directional Cluster Linking: Automatic Data Recovery & Failback 3. Failover To fail over: ● Move consumers and producers to the DR cluster - keep the same topic names / regex ● Consumers consume both ○ Pre-failover data in west.clicks ○ Post-failover data in clicks ● Don’t delete the cluster link ● Disable clicks -> west.clicks offset replication Primary cluster “West” clicks Consumers .*clicks Producers clicks clicks clicks clicks west.clicks east.clicks
  • 35. DR cluster “East” HA/DR Bi-Directional Cluster Linking: Automatic Data Recovery & Failback 4. Recovery If/when the outage is over: ● The primary-to-DR cluster link automatically recovers the lagged data (RPO) from the primary cluster Note: this data will be “late arriving” to the consumers ● New records generated to the DR cluster will automatically begin replicating to the primary Primary cluster “West” clicks Consumers .*clicks Producers clicks Recovers data Fails back data clicks clicks clicks west.clicks east.clicks
  • 36. DR cluster “East” HA/DR Bi-Directional Cluster Linking: Automatic Data Recovery & Failback 5. Failback To fail back to the primary region, consumers need to pick up at the end of the writable topics, so: ● Ensure that all consumer groups have 0 consumer lag for their DR topics e.g. west.clicks ● Reset all consumer offsets to the last offsets (LEO); this can be done by the platform operator Finally, move consumers & producers back to the Primary ● Each producer / consumer group can be moved independently Primary cluster “West” clicks Consumers .*clicks Producers clicks Recovers data Fails back data clicks clicks clicks west.clicks east.clicks Reset consumers to resume here move move
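The operator-driven offset reset in these failback steps can be scripted with the standard Java AdminClient: list each partition's log-end offset, then commit it for the group. A sketch with placeholder addresses, topic and group names; the group must have no active members while its offsets are altered.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ResetToLatest {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "primary:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            // Partitions of the writable topic the group should resume from.
            Map<TopicPartition, OffsetSpec> query = Map.of(
                    new TopicPartition("clicks", 0), OffsetSpec.latest(),
                    new TopicPartition("clicks", 1), OffsetSpec.latest());
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                    admin.listOffsets(query).all().get();
            // Point the group's committed offsets at the log-end offset of each partition.
            Map<TopicPartition, OffsetAndMetadata> newOffsets = new HashMap<>();
            ends.forEach((tp, info) -> newOffsets.put(tp, new OffsetAndMetadata(info.offset())));
            admin.alterConsumerGroupOffsets("click-processor", newOffsets).all().get();
        }
    }
}
```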
  • 37. DR cluster “East” Primary cluster “West” clicks Recovers data Fails back data clicks clicks clicks west.clicks east.clicks HA/DR Bi-Directional Cluster Linking: Automatic Data Recovery & Failback 6. And beyond Re-enable clicks -> west.clicks consumer offset replication Once consumer lag is 0 on east.clicks, then reset all consumer groups to Log End Offset (last offset of the partition) on “clicks” on DR cluster Consumers .*clicks Producers clicks Reset consumers to resume here
  • 39. West cluster clicks east.clicks East cluster west.clicks clicks Consumers .*clicks Producers Add prefix west Add prefix east Consumers .*clicks Producers Applications / Web Traffic Load Balancer (example) Applications Applications 39 HA/DR Bi-Directional Cluster Linking: Active-Active 1. Steady state
  • 40. West cluster clicks east.clicks East cluster west.clicks clicks Consumers .*clicks Producers Add prefix west Add prefix east Consumers .*clicks Producers Applications / Web Traffic Load Balancer (example) Applications Applications 40 HA/DR Bi-Directional Cluster Linking: Active-Active 2. Outage strikes!
  • 41. West cluster clicks east.clicks East cluster west.clicks clicks Consumers .*clicks Producers Add prefix west Add prefix east Consumers .*clicks Producers Applications / Web Traffic Load Balancer (example) Applications Applications re-route 41 HA/DR Bi-Directional Cluster Linking: Active-Active 3. Failover
  • 42. West cluster clicks east.clicks East cluster west.clicks clicks Consumers .*clicks Producers Add prefix west Add prefix east Consumers .*clicks Producers Applications / Web Traffic Load Balancer (example) Applications Applications Any remaining pre-failure data is automatically recovered by the consumers re-route 42 HA/DR Bi-Directional Cluster Linking: Active-Active 4. Return to Steady State
  • 43. 43 Stretch Cluster A Stretch Cluster is ONE Kafka cluster that is “stretched” across multiple availability zones or data centers. Uses Kafka internal replication features to achieve RPO = 0 & low RTO.
  • 44. 3. Stretch Clusters & Multi-Region Cluster 44
  • 45. Stretch Cluster - Why? 45
  • 46. Stretch Cluster: Non-Stretch Cluster Cluster Behaviour 1. Steady State Setup ● An arbitrary number of brokers, represented here by brokers 1-4, spread across 2 DCs ● A standard three node Zookeeper cluster spread across 2 DCs DC “West” Broker 1 Broker 2 Zookeeper 1 Zookeeper 2 DC “East” Broker 3 Broker 4 Zookeeper 3
  • 47. Stretch Cluster: Non-Stretch Cluster Cluster Behaviour 1. Steady State… continued Setup ● An arbitrary number of brokers, represented here by brokers 1-4, spread across 2 DCs ● A standard three node Zookeeper cluster spread across 2 DCs ● We’ll also assume a replication-factor of 3, min.insync.replicas of 2 and acks=all DC “West” Replica 1 Replica 2 Zookeeper 1 Zookeeper 2 DC “East” Replica 3 Unused Broker Zookeeper 3
  • 48. Stretch Cluster: Non-Stretch Cluster Cluster Behaviour 2. DC Outage An outage in DC “West” ● … let’s start by just focusing on Kafka. DC “West” Replica 1 Replica 2 Zookeeper 1 Zookeeper 2 DC “East” Replica 3 Unused Broker Zookeeper 3
  • 49. Stretch Cluster: Non-Stretch Cluster Cluster Behaviour 2. DC Outage An outage in DC “West” ● Min.insync.replicas can no longer be met and we lose availability DC “West” Replica 1 Replica 2 Zookeeper 1 Zookeeper 2 DC “East” Replica 3 Unused Broker Zookeeper 3
  • 50. Stretch Cluster: Non-Stretch Cluster Cluster Behaviour 3. Fixing Broker Availability Increase to rf=4 ● Looks like we’ve solved our issue… DC “West” Replica 1 Replica 2 Zookeeper 1 Zookeeper 2 DC “East” Replica 3 Replica 4 Zookeeper 3
  • 51. Stretch Cluster: Non-Stretch Cluster Cluster Behaviour 3. Fixing Broker Availability… But Increase to rf=4 ● Looks like we’ve solved our issue… but, if our 2 remaining replicas are down or out of sync then we lose availability unless we trigger an unclean leader election and accept data loss. DC “West” Replica 1 Replica 2 Zookeeper 1 Zookeeper 2 DC “East” Replica 3 Out of Sync Replica 4 Out of Sync Zookeeper 3
  • 52. Stretch Cluster: Non-Stretch Cluster Cluster Behaviour 4. Fixing Data Loss Increase min.insync.replicas to 3 ● Consumers continue to operate ● Producers continue to operate once we revert to min.insync.replicas=2 DC “West” Replica 1 Replica 2 Zookeeper 1 Zookeeper 2 DC “East” Replica 3 Replica 4 Out of Sync Zookeeper 3
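The min.insync.replicas changes used throughout these scenarios are dynamic topic (or broker) config updates. A minimal Java AdminClient sketch that relaxes a topic back to 2 during a failure, with placeholder connection and topic names:

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RelaxMinIsr {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "east-broker:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "clicks");
            // Lower the bar so producers with acks=all can make progress on the surviving DC.
            AlterConfigOp relax = new AlterConfigOp(
                    new ConfigEntry("min.insync.replicas", "2"), AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> updates = Map.of(topic, List.of(relax));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}
```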
  • 53. Stretch Cluster: Non-Stretch Cluster Cluster Behaviour 4. Fixing Data Loss… But What About Zookeeper? DC “West” Replica 1 Replica 2 Zookeeper 1 Zookeeper 2 DC “East” Replica 3 Replica 4 Out of Sync Zookeeper 3
  • 54. Stretch Cluster: Non-Stretch Cluster Cluster Behaviour 4. Fixing Data Loss… But What About Zookeeper? DC “West” Zookeeper 1 Zookeeper 2 DC “East” Zookeeper 3 Broker 1 Broker 2 Broker 3 Broker 4
  • 55. Stretch Cluster: Non-Stretch Cluster Cluster Behaviour 4. Fixing Data Loss… But What About Zookeeper? DC “West” Broker 1 Broker 2 Zookeeper 1 Zookeeper 2 DC “East” Broker 3 Broker 4 Zookeeper 3
  • 56. Stretch Cluster - 2 DC 56
  • 57. Stretch Cluster: 2 DC + Observer 1. Steady State Setup ● A minimum of 4 brokers ● 6 Zookeeper nodes, one of which is an observer ● Replication factor of 4, min.insync.replicas of 3 and acks=all DC “West” Broker 1 Broker 2 Zookeeper 1 Zookeeper 2 DC “East” Broker 3 Broker 4 Zookeeper 4 Zookeeper 5 Zookeeper 3 Zookeeper 6 (Observer)
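Under this setup, topics would be created with the stated settings. A minimal Java AdminClient sketch, with the partition count, broker address and topic name as placeholders:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateStretchTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "west-broker:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            // 6 partitions (example), replication factor 4, min.insync.replicas 3.
            NewTopic clicks = new NewTopic("clicks", 6, (short) 4)
                    .configs(Map.of("min.insync.replicas", "3"));
            admin.createTopics(List.of(clicks)).all().get();
        }
    }
}
```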
  • 58. Stretch Cluster: 2 DC + Observer 2. DC Outage - On observer DC An outage in DC “East” ● Consumers continue to operate ● Producers continue to operate once we revert to min.insync.replicas=2 DC “West” Broker 1 Broker 2 Zookeeper 1 Zookeeper 2 DC “East” Broker 3 Broker 4 Zookeeper 4 Zookeeper 5 Zookeeper 3 Zookeeper 6 (Observer)
  • 59. Stretch Cluster: 2 DC + Observer 3. DC Outage - On non-observer DC An outage in DC “West” ● We can’t reach Zookeeper quorum! DC “West” Broker 1 Broker 2 Zookeeper 1 Zookeeper 2 DC “East” Broker 3 Broker 4 Zookeeper 4 Zookeeper 5 Zookeeper 3 Zookeeper 6 (Observer)
  • 60. Stretch Cluster: 2 DC + Observer 3. DC Outage - On non-observer DC… but An outage in DC “West” ● We promote the Zookeeper observer to a full follower ● Remove Zookeeper 1, 2 & 3 from quorum list ● Perform rolling restart of Zookeeper nodes DC “West” Broker 1 Broker 2 Zookeeper 1 Zookeeper 2 DC “East” Broker 3 Broker 4 Zookeeper 4 Zookeeper 5 Zookeeper 3 Zookeeper 6
  • 61. Stretch Cluster: 2 DC + Observer 3. DC Outage - On non-observer DC An outage in DC “West” ● Consumers continue to operate ● Producers continue to operate once we revert to min.insync.replicas=2 DC “West” Broker 1 Broker 2 Zookeeper 1 Zookeeper 2 DC “East” Broker 3 Broker 4 Zookeeper 4 Zookeeper 5 Zookeeper 3 Zookeeper 6
  • 62. Stretch Cluster: 2 DC + Observer 4. Network Partition A network partition occurs between DCs ● Consumers continue to operate as usual up until they’ve consumed all fully replicated data ● Producers will fail as we can no longer meet min.insync.replicas=3 DC “West” Broker 1 Broker 2 Zookeeper 1 Zookeeper 2 DC “East” Broker 3 Broker 4 Zookeeper 4 Zookeeper 5 Zookeeper 3 Zookeeper 6 (Observer)
  • 63. Stretch Cluster: 2 DC + Observer 5. Fixing Network Partition A network partition occurs between DCs ● We manually shut down DC “East”, then update min.insync.replicas=2 ● Clients resume operating as normal ● Consumers failing over from DC “East” will consume some duplicate records DC “West” Broker 1 Broker 2 Zookeeper 1 Zookeeper 2 DC “East” Broker 3 Broker 4 Zookeeper 4 Zookeeper 5 Zookeeper 3 Zookeeper 6 (Observer)
  • 64. 64 Observer Risk! Zookeeper observers solve our availability and split-brain issues but risk data loss! DC “West” Zookeeper Leader Zookeeper Follower Zookeeper Follower DC “East” Zookeeper Follower (Out of Sync) Zookeeper Follower (Out of Sync) Zookeeper Observer (Out of Sync) Quorum
  • 65. 65 Hierarchical Quorum Hierarchical Quorum involves getting consensus between multiple Zookeeper “groups” which each form their own quorum. In the case of a two-DC hierarchy, consensus must be reached between BOTH DCs. DC “West” Zookeeper 1 (Leader) Zookeeper 2 Zookeeper 3 DC “East” Zookeeper 4 Zookeeper 5 Zookeeper 6 Quorum
  • 66. Stretch Cluster: 2 DC + Hierarchical Quorum 1. Steady State Setup ● A minimum of 4 brokers ● 6 Zookeeper nodes, arranged into two groups ● Replication factor of 4, min.insync.replicas of 3 and acks=all DC “West” Broker 1 Broker 2 Zookeeper 1 Zookeeper 2 DC “East” Broker 3 Broker 4 Zookeeper 4 Zookeeper 5 Zookeeper 3 Zookeeper 6
  • 67. Stretch Cluster: 2 DC + Hierarchical Quorum 2. DC Outage An outage in DC “East” ● Consumers continue to operate for leaders on DC “West” ● Leaders can’t be elected and configuration updates can’t be made until we have hierarchical quorum DC “West” Broker 1 Broker 2 Zookeeper 1 Zookeeper 2 DC “East” Broker 3 Broker 4 Zookeeper 4 Zookeeper 5 Zookeeper 3 Zookeeper 6
  • 68. Stretch Cluster: 2 DC + Hierarchical Quorum 3. DC Outage An outage in DC “East” ● Remove DC “East” Zookeeper group from hierarchy ● Revert to min.insync.replicas=2 DC “West” Broker 1 Broker 2 Zookeeper 1 Zookeeper 2 DC “East” Broker 3 Broker 4 Zookeeper 4 Zookeeper 5 Zookeeper 3 Zookeeper 6
  • 69. Stretch Cluster: 2 DC + Hierarchical Quorum 4. Network Partition A network partition occurs between DCs ● Consumers continue to operate as usual up until they’ve consumed all fully replicated data ● Producers will fail as we can no longer meet min.insync.replicas=3 DC “West” Broker 1 Broker 2 Zookeeper 1 Zookeeper 2 DC “East” Broker 3 Broker 4 Zookeeper 4 Zookeeper 5 Zookeeper 3 Zookeeper 6
  • 70. Stretch Cluster: 2 DC + Hierarchical Quorum 5. Fixing Network Partition A network partition occurs between DCs ● We manually shut down DC “East”, remove it from the hierarchy & update min.insync.replicas=2 ● Clients resume operating as normal ● Consumers failing over from DC “East” will consume some duplicate records DC “West” Broker 1 Broker 2 Zookeeper 1 Zookeeper 2 DC “East” Broker 3 Broker 4 Zookeeper 4 Zookeeper 5 Zookeeper 3 Zookeeper 6
  • 71. Stretch Cluster - 2.5 DC 71
  • 72. Stretch Cluster: 2.5 DC 1. Steady State Setup ● A minimum of 4 brokers ● 3 Zookeeper nodes ● Replication factor of 4, min.insync.replicas of 3 and acks=all ● Note: it’s actually better for the DCs with brokers to be closest together DC “West” Broker 1 Broker 2 Zookeeper 1 DC “East” Broker 3 Broker 4 Zookeeper 3 DC “Central” Zookeeper 2
  • 73. Stretch Cluster: 2.5 DC 2. DC Outage An outage in DC “West” ● Consumers continue to operate ● Producers continue to operate once we revert to min.insync.replicas=2 DC “West” Broker 1 Broker 2 Zookeeper 1 DC “East” Broker 3 Broker 4 Zookeeper 3 DC “Central” Zookeeper 2
  • 74. Stretch Cluster: 2.5 DC 3. DC Network Partition A network partition in DC “West” ● Consumers connected to DC “East” continue to operate DC “West” Broker 1 Broker 2 Zookeeper 1 DC “East” Broker 3 Broker 4 Zookeeper 3 DC “Central” Zookeeper 2
  • 75. Stretch Cluster: 2.5 DC 3. DC Network Partition A network partition in DC “West” ● Consumers connected to DC “West” continue to operate until they’ve processed all fully replicated records DC “West” Broker 1 Broker 2 Zookeeper 1 DC “East” Broker 3 Broker 4 Zookeeper 3 DC “Central” Zookeeper 2
  • 76. Stretch Cluster: 2.5 DC 3. DC Network Partition A network partition in DC “West” ● Producers connected to DC “East” continue to operate once we revert to min.insync.replicas=2 DC “West” Broker 1 Broker 2 Zookeeper 1 DC “East” Broker 3 Broker 4 Zookeeper 3 DC “Central” Zookeeper 2
  • 77. Stretch Cluster: 2.5 DC 3. DC Network Partition A network partition in DC “West” ● Producers connected to DC “West” continue to operate once we shut down DC “West”, fail over and revert to min.insync.replicas=2 DC “West” Broker 1 Broker 2 Zookeeper 1 DC “East” Broker 3 Broker 4 Zookeeper 3 DC “Central” Zookeeper 2
  • 79. 79 Multi-Region Clusters: Followers vs Observers Followers are normal replicas; observers act the same except that they are not considered for acks=all produce requests. DC “West” Producers Follower Synchronous Leader DC “East” Observer Asynchronous
  • 80. 80 Multi-Region Clusters: Automatic Observer Promotion As of Confluent Platform v6.1 observers can be configured to be promoted to meet the ObserverPromotionPolicy, including: ● Under-min-isr: Promoted if the in-sync replica size drops below min.insync.replicas ● Under-replicated: Promoted to cover any replica which is no longer in sync ● Leader-is-observer: Promoted if the current leader is an observer DC “West” Producers Follower Synchronous Leader DC “East” Follower Asynchronous
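In Confluent Server these policies are expressed in the replica placement JSON attached to a topic via the confluent.placement.constraints config. The sketch below sets a version-2 placement with under-min-isr promotion; the rack names, replica counts, broker address and topic name are assumptions for illustration, so check the replica placement schema for your CP version:

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class ObserverPlacement {
    public static void main(String[] args) throws Exception {
        // Version-2 replica placement JSON; rack names and counts are illustrative assumptions.
        String placement = "{\"version\":2,"
                + "\"replicas\":[{\"count\":2,\"constraints\":{\"rack\":\"west\"}}],"
                + "\"observers\":[{\"count\":2,\"constraints\":{\"rack\":\"east\"}}],"
                + "\"observerPromotionPolicy\":\"under-min-isr\"}";
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "west-broker:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "clicks");
            Map<ConfigResource, Collection<AlterConfigOp>> updates = Map.of(topic, List.of(
                    new AlterConfigOp(new ConfigEntry("confluent.placement.constraints", placement),
                            AlterConfigOp.OpType.SET)));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}
```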
  • 81. Multi-Region Clusters: 2.5 DC 1. Steady State Setup ● A minimum of 6 brokers ● 3 Zookeeper nodes ● Replication factor of 4, 2 additional observers, min.insync.replicas of 3 and acks=all DC “West” Broker 1 Broker 2 Zookeeper 1 DC “East” Broker 4 Broker 5 Zookeeper 3 DC “Central” Zookeeper 2 Broker 3 (Observer1) Broker 6 (Observer2)
  • 82. Multi-Region Clusters: 2.5 DC 2. DC Outage An outage in DC “West” ● The Observer in DC “East” is promoted ● Consumers and Producers continue to operate as usual ● RPO = 0 ● RTO ~ 0 DC “West” Broker 1 Broker 2 Zookeeper 1 DC “East” Broker 4 (Replica 3) Broker 5 (Replica 4) Zookeeper 3 DC “Central” Zookeeper 2 Broker 3 (Observer1) Broker 6 (Replica 5)
  • 83. Multi-Region Clusters: 2.5 DC 3. DC Network Partition A network partition in DC “West” ● The Observer in DC “East” is promoted ● Consumers and Producers connected to DC “East” continue to operate as usual DC “West” Broker 1 Broker 2 Zookeeper 1 DC “East” Broker 4 (Replica 3) Broker 5 (Replica 4) Zookeeper 3 DC “Central” Zookeeper 2 Broker 3 (Observer1) Broker 6 (Replica 5)
  • 84. Multi-Region Clusters: 2.5 DC 3. DC Network Partition A network partition in DC “West” ● The Observer in DC “West” cannot be promoted as it has no Zookeeper Quorum DC “West” Broker 1 Broker 2 Zookeeper 1 DC “East” Broker 4 (Replica 3) Broker 5 (Replica 4) Zookeeper 3 DC “Central” Zookeeper 2 Broker 3 (Observer 1) Broker 6 (Replica 5)
  • 85. Multi-Region Clusters: 2.5 DC 3. DC Network Partition A network partition in DC “West” ● Consumers connected to DC “West” continue to operate until they’ve processed all fully replicated records. Once we shut down DC “West” the consumers will fail over and consume from the same point. This will result in duplicate consumption DC “West” Broker 1 Broker 2 Zookeeper 1 DC “East” Broker 4 (Replica 3) Broker 5 (Replica 4) Zookeeper 3 DC “Central” Zookeeper 2 Broker 3 (Observer 1) Broker 6 (Replica 5)
  • 86. Multi-Region Clusters: 2.5 DC 3. DC Network Partition A network partition in DC “West” ● Producers connected to DC “West” fail as we can no longer meet min.insync.replicas=3 DC “West” Broker 1 Broker 2 Zookeeper 1 DC “East” Broker 4 (Replica 3) Broker 5 (Replica 4) Zookeeper 3 DC “Central” Zookeeper 2 Broker 3 (Observer 1) Broker 6 (Replica 5)
  • 87. Multi-Region Clusters: 2.5 DC 3. DC Network Partition A network partition in DC “West” ● To continue operating as normal we must manually shut down DC “West” DC “West” Broker 1 Broker 2 Zookeeper 1 DC “East” Broker 4 (Replica 3) Broker 5 (Replica 4) Zookeeper 3 DC “Central” Zookeeper 2 Broker 3 (Observer 1) Broker 6 (Replica 5)
  • 88. Stretch Cluster - 3 DC (ICG EAP KaaS) 88
  • 89. Multi-Region Clusters: 3 DC 1. Steady State Setup ● 9 brokers, of which the 3 brokers in DC “MW” host only observers ● 5 Zookeeper nodes ● Replication factor of 4, 1 additional observer, min.insync.replicas of 3 and acks=all DC “MW” Broker 1 (Observer) Broker 2 (Observer) Zookeeper 1 DC “NJ” Broker 7 Broker 8 Zookeeper 4 DC “NY” Zookeeper 2 Broker 3 (Observer) Broker 9 Broker 4 Broker 5 Broker 6 Zookeeper 3 Zookeeper 5
  • 90. Multi-Region Clusters: 3 DC 2. DC Outage An outage in DC “NY” ● The Observers in DC “MW” are promoted ● Consumers and Producers continue to operate as usual ● RPO = 0 ● RTO ~ 0 DC “MW” Broker 1 (Replica) Broker 2 (Replica) Zookeeper 1 DC “NJ” Broker 7 Broker 8 Zookeeper 4 DC “NY” Zookeeper 2 Broker 3 (Replica) Broker 9 Broker 4 Broker 5 Broker 6 Zookeeper 3 Zookeeper 5
  • 91. Multi-Region Clusters: 3 DC 3. DC Network Partition A network partition in DC “NJ” ● The Observer in DC “MW” is promoted ● Consumers and Producers connected to DC “NY” continue to operate as usual DC “MW” Broker 1 (Replica) Broker 2 (Replica) Zookeeper 1 DC “NJ” Broker 7 Broker 8 Zookeeper 4 DC “NY” Zookeeper 2 Broker 3 (Replica) Broker 9 Broker 4 Broker 5 Broker 6 Zookeeper 3 Zookeeper 5
  • 92. Multi-Region Clusters: 3 DC 3. DC Network Partition A network partition in DC “NJ” ● Consumers connected to DC “NJ” continue to operate until they’ve processed all fully replicated records. Once we shut down DC “NJ” or the application is restarted, the consumers will fail over and consume from the same point. This will result in duplicate consumption DC “MW” Broker 1 (Replica) Broker 2 (Replica) Zookeeper 1 DC “NJ” Broker 7 Broker 8 Zookeeper 4 DC “NY” Zookeeper 2 Broker 3 (Replica) Broker 9 Broker 4 Broker 5 Broker 6 Zookeeper 3 Zookeeper 5
  • 93. Multi-Region Clusters: 3 DC 3. DC Network Partition A network partition in DC “NJ” ● Producers connected to DC “NJ” fail as we can no longer meet min.insync.replicas=3 ● Once the application is restarted, the producers will fail over and produce data by connecting to DC “NY” / DC “MW” DC “MW” Broker 1 (Replica) Broker 2 (Replica) Zookeeper 1 DC “NJ” Broker 7 Broker 8 Zookeeper 4 DC “NY” Zookeeper 2 Broker 3 (Replica) Broker 9 Broker 4 Broker 5 Broker 6 Zookeeper 3 Zookeeper 5
  • 94. Multi-Region Clusters: 3 DC 3. DC Network Partition A network partition in DC “NJ” ● To continue operating as normal we must manually shut down DC “NJ” DC “MW” Broker 1 (Replica) Broker 2 (Replica) Zookeeper 1 DC “NJ” Broker 7 Broker 8 Zookeeper 4 DC “NY” Zookeeper 2 Broker 3 (Replica) Broker 9 Broker 4 Broker 5 Broker 6 Zookeeper 3 Zookeeper 5
  • 96. Comparison 96

    Supported                            Cluster Linking   Stretch Cluster / Multi-Region Cluster   Replicator / MirrorMaker 2
    RPO=0                                                  ✓
    RTO=~0                               ✓                 ✓                                        ✓
    Active-Active                        ✓                 ✓                                        ✓
    Failover With All Clients            ✓                 ✓
    Failover With Transactions                             ✓                                        ✓
    Failover Maintains Record Ordering                     ✓                                        ✓
    Smooth Failback                      ✓                 ✓
    Handles Full Cluster Failure         ✓                                                          ✓
    Hybrid Cloud / Multi-Cloud           ✓                                                          ✓
    Open Source                                            ✓*                                       ✓*
    Preserves Metadata                   ✓                 ✓                                        ✓*

Editor's Notes

  1. Today we are going to go through a comparison of disaster recovery solutions.
  2. So, a quick summary of our agenda: We’ll discuss what disaster recovery solutions are and why we should use them; We’ll discuss cluster linking & schema linking as a complementary component; We’ll cover Stretch Clusters and multi-region clusters as an extension; We’ll go into S3 backup and restore as a complementary solution; We’ll cover legacy solutions including Replicator and MirrorMaker 2; and Finally, we’ll wrap up with a summary.
  3. So, let’s start with what disaster recovery solutions are and why you should use them.
  4. So, let’s start with what disaster recovery solutions are and why you should use them.
  5. Let’s start with why you should care. I’ve included here five examples of regional outages across the three largest cloud providers. It should go without saying, but if these incidents occur on services like AWS, Azure and GCP then we should assume they can happen to you too.
  6. So, why does this matter? Simply put, outages hurt business performance! As an example of this: Say you experience a cloud region outage - This service may be down for multiple hours, up to a day based on historical experience; Then mission-critical applications will fail - The applications in that region that run your business go offline; As such customers are impacted - Customers are unable to place orders, discover products, receive service, etc; and Ultimately, we are impacted financially - Revenue is lost directly from the inability to do business during downtime, and indirectly by damaging brand image and customer trust.
  7. Now that we know why we want to avoid them, let’s review the types of failures we need to consider: First, we have transient failures - Transient failures refer to disaster scenarios which involve a temporary failure where part or the entirety of the platform goes offline, but no records or information are necessarily lost forever; Second, we have permanent failures - Permanent failures are characterized by their associated data loss due to hardware failures or human error.
  8. There are an endless number of failure scenarios but they broadly fall under: Data-center & regional outages - These occur due to a range of hardware, infrastructure and cloud provider issues resulting in both transient and permanent failures. However, they only impact services running within the zone of impact; Platform failures - These arise from a range of issues such as batch processing systems overwhelming the Kafka cluster and bugs in Kafka or third party software. The key issue here is that transient or permanent, the entire Kafka cluster may be impacted and occasionally this failure will propagate between Kafka clusters; and Human error - People make mistakes and design choices in our software can have broad and unexpected consequences. At best these cause transient outages. At worst they cause permanent data loss across our Kafka clusters.
  9. Generally Available (GA) since Confluent Platform v7, Cluster Linking is Confluent’s preferred and long-term supported solution for Disaster Recovery. Schema Linking is a complementary feature, Generally Available since Confluent Platform v7.1.
  10. Cluster Linking, built into Confluent Platform and Confluent Cloud, allows you to directly connect clusters together, mirroring topics from one cluster to another. Cluster Linking works by replicating topics byte for byte including records with their associated offsets, consumer groups with their associated offsets, topics with their associated configuration and ACLs. Some key features Cluster Linking supports are: Low Recovery Time Objectives, effectively RTO=0; It supports active-active setups, so you can make use of both clusters and fail over in whichever direction is required; Failover support for all clients including librdkafka-based clients; Smooth failback after disaster recovery scenarios; The ability to support multi-cloud and hybrid cloud setups; As described it preserves metadata including offsets, consumer groups, topics and ACLs; As it copies data byte for byte it avoids decompression and recompression, saving significant resources; and As it’s built into CP Server you don’t need any additional components like Kafka Connect. Some of the challenges associated with using Cluster Linking are: The nature of asynchronous replication means that it cannot support RPO=0; Presently the DR cluster is unaware of transactions meaning it cannot cancel a transaction midway through processing during a disaster recovery scenario; and To support high availability you must accept a small to moderate breach of record ordering guarantees.
  11. Schema Linking, built into Schema Registry allows you to directly connect Schema Registry clusters together mirroring subjects or entire contexts. Contexts, introduced alongside Schema Linking allow you to create namespaces within Schema Registry which ensures mirrored subjects don’t run into schema naming clashes. Schema Linking complements Cluster Linking by allowing you to copy all associated schemas alongside the data without needing a centralised Schema Registry which would otherwise be the case.
  12. Prefixing allows you to add a prefix to a topic and if desired the associated consumer group to avoid topic and consumer group naming clashes between the primary and DR cluster. This is important when used in an active-active setup and required to use a two way Cluster Link strategy which is the recommended approach and we’ll go into this later.
  13. So, how does this work in practice? Let’s start by discussing a standard active-passive setup
  14. First, starting with a standard cluster and an empty DR cluster, we create our Cluster Linking rules, which can be used to copy any current or new topics matching our criteria, and which will replicate historic data as well as sync all future data in real time.
  15. Now, we experience the hypothetical failure of our primary cluster. During this scenario our monitoring alerts us to the failure and we manually (ideally via a script) trigger a failover using the REST API or CLI. We then update our DNS entry to redirect all clients to the Disaster Recovery cluster and either restart our clients or wait for them to reconnect.
  16. A standard strategy is to “fail forward” promoting the DR region to be their new Primary Region, this is because: Cloud regions offer identical service; They already moved all of their applications & data systems to the DR region; and Failing back would introduce risk with little benefit. To fail forward, simply: Delete topics on original cluster (or spin up new cluster) Establish cluster link in reverse direction You may optionally implement a solution to retrieve the subset of data which had not yet been replicated at the time of the original failover.
  17. If it is required that you fail back to the original cluster, the solution is to wipe the original cluster, cluster link back until you’re synchronised, and fail over again back to the original cluster.
  18. As cluster linking is asynchronous it means that the consumer offsets which describe which records the consumer has processed may not yet have been replicated across at the time of failover. This may result in the consumer reprocessing these records when it fails over to the DR cluster.
  19. Bi-directional Cluster Linking is an alternative which vastly simplifies your DR strategy.
  20. Let’s start with our steady state, we: Create duplicate topics on both the Primary and DR cluster; Create prefixed cluster links in both directions; Produce records to clicks on the Primary cluster; and Consume records from all variants of the clicks topic on the Primary cluster using a regex pattern. Note, at this point data is being generated to the clicks topic in the primary cluster and replicated to west.clicks in the DR cluster. However, no data is being produced to the clicks topic in the DR cluster.
  21. Now, an outage occurs impacting the primary region. This: Brings down the producers and consumers in the primary region; and Temporarily pauses the cluster links. It’s important to note here that a small amount of data has not yet been replicated to your DR cluster at this point.
  22. To recover, we update our DNS entry to redirect all clients to the Disaster Recovery cluster and either restart our clients or wait for them to reconnect. It’s important to specify that we don’t change topic name or regex during this failover. The consumers will continue to consume pre-failover data in west.clicks and post-failover data in clicks, both from the DR cluster. It’s important to note that unlike the prior strategy we don’t delete the cluster links. You’ll also need to temporarily disable offset replication from clicks to west.clicks. This will stop the stale consumer offsets overwriting the new consumer offsets when the Primary Cluster is brought back online.
  23. When the outage is over we automatically recover the records which had yet to be replicated. This means that although the records will arrive out of order, assuming the primary cluster eventually recovers, your RPO=0. New records generated to the DR cluster will also automatically begin replicating to the primary.
  24. To failback to the primary region. Consumers need to pick up at the end of the writable topics, so: Ensure that all consumer groups have 0 consumer lag for their DR topics e.g. west.clicks Reset all consumer offsets to the last offsets (LEO), this can be done by the platform operator Finally, move consumers & producers back to Primary cluster: Each producer / consumer group can be moved independently
  25. Now that we’ve moved our consumers back to the Primary Cluster we can re-enable consumer offset replication between clicks and west.clicks. Once consumer lag is 0 on east.clicks, then reset all consumer groups to Log End Offset (last offset of the partition) on “clicks” on the DR cluster.
  26. Bi-directional Cluster Linking is an alternative which vastly simplifies your DR strategy.
  27. Bi-Directional Cluster Linking easily translates to an active-active strategy as seen here where we use a load balancer to spread load across the clusters.
  28. Again, we see an outage which brings down our West cluster (formerly considered our primary cluster).
  29. Now we utilise our load balancer to re-route traffic to our East cluster.
  30. Finally once our West cluster recovers we re-route traffic back to it.
  31. Looking at Stretch Clusters A Stretch Cluster works by splitting a cluster over more than one data center. Some key features Stretch Clusters supports are: RPO=0 and low RTO; As it’s a single cluster it doesn’t require failover / failback or syncing of metadata; and For the same reasons it supports all clients as well as transactions and maintains record ordering. Some of the challenges associated with using a Stretch Cluster are: It requires low latency connections between data centers. We recommend 50ms, but it can support up to 100ms; It increases end-to-end latency; and It doesn’t protect against certain types of failures, such as: Full cluster failures; or Deleting topics. There are a few different approaches to implementing Stretch Clusters and we’ll review them now.
  32. Let’s start with discussing what problem Stretch Clusters are solving.
  33. Let’s start with discussing what problem Stretch Clusters are solving.
  34. Let’s start with our steady state which is: An unknown number of brokers represented here by brokers one through four spread across two DCs; and A standard three node Zookeeper cluster spread across two DCs.
  35. We’ll also assume a replication-factor of three, min.insync.replicas of two and acks=all.
  36. Now, let’s look at what happens when DC “West” fails. But, let’s start by just focusing on Kafka.
  37. First, although we have no data loss, we are now no longer able to meet min.insync.replicas of two, so we lose availability. We could drop min.insync.replicas to one, but then if another broker failed during this period we would lose data.
  38. To resolve this, the immediate solution is to increase replication factor to four. Now when we lose the first two replicas we still have two replicas available to meet the min.insync.replicas requirement.
  39. But, if our 2 replicas are down or out of sync then we lose availability unless we trigger an unclean leader election and accept data loss.
  40. To resolve this, we increase min.insync.replicas to three and in the case of a failure scenario we roll back to min.insync.replicas of two.
  41. But… what about Zookeeper?
  42. We were ignoring it as a factor but now that we’ve solved the Kafka component of the issue we need to consider Zookeepers behaviour as well.
  43. For a Zookeeper cluster to be considered available it needs a minimum of (n/2 + 1) nodes available. This allows it to achieve quorum, which is required to elect a leader or commit writes. Here, Zookeeper has two out of three nodes unavailable and as such cannot form quorum, and is now offline.
  44. This is where our first Stretch Cluster design comes into play.
  45. Here, our steady state is: An unknown number of brokers represented here by brokers one through four spread across two DCs; Six Zookeeper nodes with three in each DCs of which one is an observer; and Replication factor of 4, min.insync.replicas of 3 and acks=all.
  46. Now, let’s assume DC “East” which has our Zookeeper observer has an outage. We still have three copies of the Zookeeper data allowing us to reach a “degraded” quorum. Consumers continue to operate as normal and producers continue to operate once we revert to min.insync.replicas=2.
  47. Now, let’s assume DC West has an outage. We can no longer reach Zookeeper quorum so new leaders can’t be elected.
  48. But… we can modify Zookeeper six’s configuration to change it to a standard follower. We must also update Zookeeper four, five and six’s configuration to remove Zookeeper one, two and three from the list of quorum participants and perform a rolling restart so they receive the new configuration.
  49. Now consumers continue to operate. Producers continue to operate once we revert to min.insync.replicas=2.
  50. Now let’s assume a network partition arises. Consumers continue to operate as usual up until they’ve consumed all fully replicated data. Producer will fail as we can no longer meet min.insync.replicas=3.
  51. We manually shutdown DC “East” then update min.insync.replicas=2. Clients resume operating as normal. Consumers failing over from DC “East” will consume some duplicate records.
  52. It’s important to raise the risk we run with using Zookeeper observers. While they solve our availability and split-brain issues, they risk data loss in the unlikely scenario that the data-center with the observer is out of sync at the time the data-center without one fails.
  53. Our second option is to use hierarchical quorum. Hierarchical quorum involves getting consensus between multiple Zookeeper “groups” which each form their own quorum.
  54. Here, our steady state is: An unknown number of brokers represented here by brokers one through four spread across two DCs; Six Zookeeper nodes with three in each DCs and each DC representing a Zookeeper group in the hierarchy; and Replication factor of 4, min.insync.replicas of 3 and acks=all.
  55. Now, let’s assume a DC outage on DC “East”. We still have three copies of the Zookeeper data, however, we only have consensus from one group, meaning we no longer meet the requirement for hierarchical quorum. Consumers continue to operate for leaders on DC “West”, but new leaders can’t be elected on DC “West” and configuration updates can’t be made until we have hierarchical quorum.
  56. To resolve this we must remove the DC “East” Zookeeper group from hierarchy then update min.insync.replicas to two.
  57. Now let’s assume a network partition arises. Consumers continue to operate as usual up until they’ve consumed all fully replicated data. Producer will fail as we can no longer meet min.insync.replicas=3 on either DC.
  58. We manually shutdown DC “East”, remove from the hierarchy & update min.insync.replicas=2 Clients now resume operating as normal, but, consumers failing over from DC “East” will consume some duplicate records.
  59. Next, let’s review the “gold standard” for Stretch Clusters which is 2.5 DCs.
  60. Just like our previous solution we use a minimum of four brokers split across two DCs, replication factor of four, min.insync.replicas of three and acks=all. Where this solution differs is that instead of using Zookeeper observers, we spread the Zookeeper nodes across three DCs; ideally the DCs containing brokers should be located closer together. This ensures that, under either a single network partition or a DC failure, we still always have exactly one leader.
  61. We still have two copies of the Zookeeper data allowing us to reach a “degraded” quorum. Consumers continue to operate Producers continue to operate once we revert to min.insync.replicas=2
  62. Now, let’s assume DC “West” becomes network partitioned from the rest of the cluster. Consumers connected to DC “East” continue to operate.
  63. Consumers connected to DC “West” continue to operate until they’ve processed all fully replicated records.
  64. Producers connected to DC “East” continue to operate once we revert to min.insync.replicas=2.
  65. Producers connected to DC “West” continue to operate once we shutdown DC “West”, failover and revert to min.insync.replicas=2.
  66. Let’s looks at how Multi-Region Clusters can enhance a standard Stretch Cluster.
  67. Followers are normal replicas, however, observers act the same except that they are not considered for acks=all produce requests.
  68. As of CP v6.1 observers can be configured to be promoted to meet the ObserverPromotionPolicy, including: Under-min-isr: Promoted if the in-sync replica size drops below min.insync.replicas Under-replicated: Promoted to cover any replica which is no longer in sync Leader-is-observer: Promoted if the current leader is an observer
  69. In this example we extend our 2.5 DC Stretch Cluster with an additional Observer on each of the two primary DCs. Just like our previous solution we use a minimum of four brokers split across two DCs, replication factor of four, min.insync.replicas of three and acks=all. Where this solution differs is that instead of using a hierarchical Zookeeper cluster, we spread the Zookeeper nodes across three DCs; ideally the DCs containing brokers should be located closer together. This ensures that, under either a single network partition or a DC failure, we still always have exactly one leader.
  70. In the case of an outage on DC “West”, our Observer on DC “East” gets promoted to a full-fledged replica, and as such we continue to meet min.insync.replicas of 3 and operate as usual.
  71. In the case of a network partition which separates DC “West” from DC “Central” and DC “East” the Observer in DC “East” is promoted to a full replica and all clients connected to it continue to operate as normal.
  72. The Observer in DC “West” cannot be promoted as it has no Zookeeper Quorum.
  73. Consumers connected to DC “West” continue to operate until they’ve processed all fully replicated records. Once we shutdown DC “West” the consumers will failover and consume from the same point. This will result in duplicate consumption.
  74. Producers connected to DC “West” fail as we can no longer meet min.insync.replicas=3.
  75. To continue operating as normal we must manually shut down DC “West”.
  83. Finally, we’ll wrap up with a quick summary and recommendations.
  84. So, I’ll leave this here for you to review in your own time, however, the key details are: Cluster Linking should be considered the default solution; Stretch Clusters should be utilised when RPO=0 is required and cross-DC latency is acceptable; and MirrorMaker 2 can be used if an open-source solution is mandatory, but this is strongly advised against.
  85. Thanks for listening.