This presentation was delivered in a fault tolerance class and discusses achieving fault tolerance in databases through replication. Several commercial databases were studied to examine the approaches they take to replication. Based on that study, an architecture was suggested for a military database design using an asynchronous approach and cluster patterns.
Architecture for building a scalable and highly available Postgres Cluster - Ashnikbiz
As PostgreSQL has made its way into business-critical applications, many customers who use Oracle RAC for high availability and load balancing have asked for similar functionality in PostgreSQL.
In this Hangout session we discuss architectures and alternatives, based on real-life experience, for achieving high availability and load balancing when you deploy PostgreSQL. We also present some of the key tools and how to deploy them effectively in this architecture.
Cross Datacenter Replication (CDCR) has been a long-requested feature in Apache Solr. In this talk, we will discuss CDCR as released in Apache Solr 6.0 and beyond to understand its use cases, limitations, setup, and performance. We will also take a quick look at future enhancements that can further simplify and scale this feature.
Tungsten Connector / Proxy is truly the secret sauce for the Tungsten Clustering solution. Watch this webinar to learn how the Tungsten Connector enables zero-downtime MySQL maintenance via the manual switch operation, and gain an understanding of the various configuration options for doing local reads in remote composite clusters.
AGENDA
- Review the cluster architecture
- Understand the role of the Connector
- Describe Connector deployment best practices (app, dedicated with lb, db with lb)
- Explore zero-downtime MySQL maintenance using the manual role switch procedure
- Learn about Connector routing patterns inside a composite cluster
- Illustrate a manual site switch
- Explain read affinity and the vast performance improvement of local reads
- Examine Connector multi-cluster support
Resilience Planning & How the Empire Strikes Back - C4Media
Video and slides synchronized, mp3 and slide download available at http://bit.ly/1pGpnbd.
Bhakti Mehta presents best practices for building resilient, stable, and predictable services: preventing cascading failures, the timeout pattern, the retry pattern, circuit breakers, and other techniques used pervasively at Blue Jeans Network. Filmed at qconsf.com.
Bhakti Mehta is the author of "RESTful Java Patterns and Best Practices" and "Developing RESTful Services with JAX-RS 2.0, WebSockets, and JSON". Bhakti is a Senior Software Engineer at Blue Jeans Network. As part of her current role, she works on developing RESTful services that can be consumed by ISV partners and the developer community.
MariaDB Galera Cluster for High Availability - OSSCube
Want to understand how to set up high availability solutions for MySQL using MariaDB Galera Cluster? Join this webinar and learn from experts. During this webinar, you will also get guidance on how to implement MariaDB Galera Cluster.
Cloud Deployment of Data Harmony
Jeffrey Gordon, Lead Developer, Access Innovations, Inc.
Jeffrey will describe the cloud deployment of the Data Harmony software.
Transforming Legacy Applications Into Dynamically Scalable Web Services - Adam Takvam
The tools and technologies used to power the modern data center are evolving at a pace faster than most companies can keep up with. Aging web services built on LAMP, WAMP, or ASP cannot readily take advantage of the latest scalable web platforms and technologies. In this presentation, we will discuss the factors that must be considered for your aging web service to take advantage of technologies such as Apache Mesos, Marathon, Docker, Apache Kafka, and more.
This talk is intended for software developers, operations engineers, and IT managers who are looking to modernize existing privately hosted web applications. We will look at the transformation of the data center from a high-level perspective, examining before-and-after topology examples using Key Performance Indicators and Key Performance Metrics to show how leveraging modern design principles can both improve application performance and reduce operational costs. We will then look at some example applications and show what needs to be done, from both the software development and infrastructure perspectives, to accomplish the transformation successfully.
Database as a Service (DBaaS) on Kubernetes - ObjectRocket
Learn about ObjectRocket's adventures in Kubernetes. We'll cover why we chose Kubernetes for our DBaaS platform, the challenges we faced, and how we overcame them. A presentation for DevWeek Austin 2018.
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo - OpenNebula Project
The Science and Technology Facilities Council is a UK Research Council which funds research and provides large facilities to the UK scientific community. This includes running a Tier 1 site for the LHC computing project, the JASMIN Super Data Cluster, and a number of other HPC and HTC facilities. The Scientific Computing Department at the Rutherford Appleton Laboratory has been developing a cloud for use across both sites of the Department and in the wider scientific community. This is an OpenNebula cloud backed by Ceph block storage. I will give a brief background of the project, describe our setup, some use cases, and the work we have done around OpenNebula (including a simplified web front-end and a number of hooks to provide us with traceability). I will also discuss how we are creating an elastic boundary between our HTC batch farm and cloud.
Author Biography
I am a Systems Administrator in the Scientific Computing Department of the UK’s Science and Technology Facilities Council. I work as part of the cloud team and I also work on a number of Grid services including our HTC batch farm for the LHC computing project.
Prior to my position here I worked in IT at an SMB focusing on storage and virtualisation, in particular Hyper-V and VMware.
- Introduction to Kubernetes features
- A look at Kubernetes Networking and Service Discovery
- New features in Kubernetes 1.6
- Kubernetes Installation options
To know more about our Kubernetes expertise, visit our center of excellence at: http://www.opcito.com/kubernetes/
Continuous Delivery of Cloud Applications: Blue/Green and Canary Deployments - Praveen Yalagandula
Continuous delivery is becoming increasingly critical; however, its implementation remains a hard problem many enterprises struggle with. Canary upgrades and Blue/Green deployment are the two commonly used patterns to implement continuous delivery. In Canary upgrades, a small portion of the production traffic is sent to the new version under test. In Blue/Green deployments, all the traffic is switched to the new version at once.
We will show how to fully automate the above steps to achieve true continuous delivery in K8s. We will show how to use analytics to express and automate application evaluation and ML-based traffic switching without any downtime.
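The canary pattern described above boils down to weighted traffic routing; a minimal sketch (the 5% weight and version names are illustrative, not from the talk):

```python
import random

def make_canary_router(canary_weight):
    """Return a router sending a `canary_weight` fraction of traffic to the canary."""
    def route(request):
        # Each request independently has a `canary_weight` chance of
        # hitting the new version under test.
        return "canary" if random.random() < canary_weight else "stable"
    return route

route = make_canary_router(0.05)  # send ~5% of production traffic to the canary
targets = [route(i) for i in range(10_000)]
print(targets.count("canary"))   # roughly 500 of the 10,000 requests
```

A Blue/Green switch is the degenerate case of the same router: flip the weight from 0.0 to 1.0 in a single step.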
Exchange Server 2013: High Availability and Redundancy Mechanisms - Microsoft Technet France
The new version of Exchange Server 2013 includes a host of new features that make it the most secure and reliable messaging server on the market today. The experience gained by Microsoft teams in operating cloud messaging solutions has been fed directly into this new version of the product, enabling you to deploy an ultra-resilient messaging system. Scott Schnoll, Principal Technical Writer on the Exchange team at Microsoft Corp., will walk you through all of the high availability mechanisms and cross-site resilience solutions in the finest detail. Come learn directly from the expert who worked on these topics at Microsoft! Note: a very technical session, in English.
Globus Connect Server Deep Dive - GlobusWorld 2024 - Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
May Marketo Masterclass, London MUG, May 22, 2024 - Adele Miller
Can't make Adobe Summit in Vegas? No sweat because the EMEA Marketo Engage Champions are coming to London to share their Summit sessions, insights and more!
This is a MUG with a twist you don't want to miss.
We describe the deployment and use of Globus Compute for remote computation. This content is aimed at researchers who wish to compute on remote resources using a unified programming interface, as well as system administrators who will deploy and operate Globus Compute services on their research computing infrastructure.
Prosigns: Transforming Business with Tailored Technology Solutions - Prosigns
Unlocking Business Potential: Tailored Technology Solutions by Prosigns
Discover how Prosigns, a leading technology solutions provider, partners with businesses to drive innovation and success. Our presentation showcases our comprehensive range of services, including custom software development, web and mobile app development, AI & ML solutions, blockchain integration, DevOps services, and Microsoft Dynamics 365 support.
Custom Software Development: Prosigns specializes in creating bespoke software solutions that cater to your unique business needs. Our team of experts works closely with you to understand your requirements and deliver tailor-made software that enhances efficiency and drives growth.
Web and Mobile App Development: From responsive websites to intuitive mobile applications, Prosigns develops cutting-edge solutions that engage users and deliver seamless experiences across devices.
AI & ML Solutions: Harnessing the power of Artificial Intelligence and Machine Learning, Prosigns provides smart solutions that automate processes, provide valuable insights, and drive informed decision-making.
Blockchain Integration: Prosigns offers comprehensive blockchain solutions, including development, integration, and consulting services, enabling businesses to leverage blockchain technology for enhanced security, transparency, and efficiency.
DevOps Services: Prosigns' DevOps services streamline development and operations processes, ensuring faster and more reliable software delivery through automation and continuous integration.
Microsoft Dynamics 365 Support: Prosigns provides comprehensive support and maintenance services for Microsoft Dynamics 365, ensuring your system is always up-to-date, secure, and running smoothly.
Learn how our collaborative approach and dedication to excellence help businesses achieve their goals and stay ahead in today's digital landscape. From concept to deployment, Prosigns is your trusted partner for transforming ideas into reality and unlocking the full potential of your business.
Join us on a journey of innovation and growth. Let's partner for success with Prosigns.
An Enterprise Resource Planning (ERP) system includes various modules that reduce a business's workload. Additionally, it organizes workflows, which drives enhanced productivity. Here is a detailed explanation of the ERP modules; going through the points will help you understand how the software is changing work dynamics.
For more details, visit: https://blogs.nyggs.com/nyggs/enterprise-resource-planning-erp-system-modules/
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv... - Shahin Sheidaei
Games are powerful teaching tools, fostering hands-on engagement and fun. But they require careful consideration to succeed. Join me to explore factors in running and selecting games, ensuring they serve as effective teaching tools. Learn to maintain focus on learning objectives while playing, and how to measure the ROI of gaming in education. Discover strategies for pitching gaming to leadership. This session offers insights, tips, and examples for coaches, team leads, and enterprise leaders seeking to teach from simple to complex concepts.
Utilocate offers a comprehensive solution for locate ticket management by automating and streamlining the entire process. By integrating with Geospatial Information Systems (GIS), it provides accurate mapping and visualization of utility locations, enhancing decision-making and reducing the risk of errors. The system's advanced data analytics tools help identify trends, predict potential issues, and optimize resource allocation, making the locate ticket management process smarter and more efficient. Additionally, automated ticket management ensures consistency and reduces human error, while real-time notifications keep all relevant personnel informed and ready to respond promptly.
The system's ability to streamline workflows and automate ticket routing significantly reduces the time taken to process each ticket, making the process faster and more efficient. Mobile access allows field technicians to update ticket information on the go, ensuring that the latest information is always available and accelerating the locate process. Overall, Utilocate not only enhances the efficiency and accuracy of locate ticket management but also improves safety by minimizing the risk of utility damage through precise and timely locates.
In software engineering, the right architecture is essential for robust, scalable platforms. Wix has undergone a pivotal shift from event sourcing to a CRUD-based model for its microservices. This talk will chart the course of this pivotal journey.
Event sourcing, which records state changes as immutable events, provided robust auditing and "time travel" debugging for Wix Stores' microservices. Despite its benefits, the complexity it introduced in state management slowed development. Wix responded by adopting a simpler, unified CRUD model. This talk will explore the challenges of event sourcing and the advantages of Wix's new "CRUD on steroids" approach, which streamlines API integration and domain event management while preserving data integrity and system resilience.
Participants will gain valuable insights into Wix's strategies for ensuring atomicity in database updates and event production, as well as caching, materialization, and performance optimization techniques within a distributed system.
Join us to discover how Wix has mastered the art of balancing simplicity and extensibility, and learn how the re-adoption of the modest CRUD has turbocharged their development velocity, resilience, and scalability in a high-growth environment.
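One common way to get the atomicity of database updates and event production that the talk mentions is a transactional outbox; a minimal sqlite3 sketch (table names and event format are hypothetical, not Wix's actual mechanism):

```python
import sqlite3

# Minimal transactional-outbox sketch: the row change and its domain event
# are written in one transaction, so a relay can publish events later
# without ever losing one or publishing one for a rolled-back update.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         event TEXT, published INTEGER DEFAULT 0);
""")

def update_product(conn, pid, name):
    # Either both statements commit or neither does.
    with conn:
        conn.execute("INSERT OR REPLACE INTO products VALUES (?, ?)", (pid, name))
        conn.execute("INSERT INTO outbox (event) VALUES (?)",
                     (f"ProductUpdated:{pid}",))

update_product(conn, 1, "mug")
print(conn.execute("SELECT event FROM outbox WHERE published = 0").fetchall())
# [('ProductUpdated:1',)]
```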
Software Engineering, Software Consulting, Tech Lead. Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Security, Spring Transaction, Spring MVC, Log4j, REST/SOAP web services.
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient... - Mind IT Systems
Healthcare providers often struggle with the complexities of chronic conditions and remote patient monitoring, as each patient requires personalized care and ongoing monitoring. Off-the-shelf solutions may not meet these diverse needs, leading to inefficiencies and gaps in care. This is where custom healthcare software offers a tailored solution, ensuring improved care and effectiveness.
How Recreation Management Software Can Streamline Your Operations - wottaspaceseo
Recreation management software streamlines operations by automating key tasks such as scheduling, registration, and payment processing, reducing manual workload and errors. It provides centralized management of facilities, classes, and events, ensuring efficient resource allocation and facility usage. The software offers user-friendly online portals for easy access to bookings and program information, enhancing customer experience. Real-time reporting and data analytics deliver insights into attendance and preferences, aiding in strategic decision-making. Additionally, effective communication tools keep participants and staff informed with timely updates. Overall, recreation management software enhances efficiency, improves service delivery, and boosts customer satisfaction.
Developing Distributed High-performance Computing Capabilities of an Open Sci... - Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus... - Globus
Large Language Models (LLMs) are currently the center of attention in the tech world, particularly for their potential to advance research. In this presentation, we'll explore a straightforward and effective method for quickly initiating inference runs on supercomputers using the vLLM tool with Globus Compute, specifically on the Polaris system at ALCF. We'll begin by briefly discussing the popularity and applications of LLMs in various fields. Following this, we will introduce the vLLM tool, and explain how it integrates with Globus Compute to efficiently manage LLM operations on Polaris. Attendees will learn the practical aspects of setting up and remotely triggering LLMs from local machines, focusing on ease of use and efficiency. This talk is ideal for researchers and practitioners looking to leverage the power of LLMs in their work, offering a clear guide to harnessing supercomputing resources for quick and effective LLM inference.
Field Employee Tracking System | MiTrack App | Best Employee Tracking Solution | ... - informapgpstrackings
Keep tabs on your field staff effortlessly with Informap Technology Centre LLC. Real-time tracking, task assignment, and smart features for efficient management. Request a live demo today!
For more details, visit us : https://informapuae.com/field-staff-tracking/
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam - takuyayamamoto1800
In these slides, we show a simulation example and how to compile the solver.
The Helmholtz equation can be solved with helmholtzFoam. The Helmholtz equation with uniformly dispersed bubbles can be simulated with helmholtzBubbleFoam.
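For reference, the Helmholtz equation these solvers address has the standard form (notation is the textbook one, not taken from the slides):

```latex
\nabla^2 u(\mathbf{x}) + k^2\, u(\mathbf{x}) = f(\mathbf{x})
```

where \(u\) is the unknown field, \(k\) the wavenumber, and \(f\) a source term; the bubble variant additionally accounts for a uniformly dispersed bubble phase.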
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart... - Globus
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data, and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis to the large ESGF data archives, transferring only the resultant analysis (ex. visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
Software Engineering, Software Consulting, Tech Lead, Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Transaction, Spring MVC, OpenShift Cloud Platform, Kafka, REST, SOAP, LLD & HLD.
Top Features to Include in Your Winzo Clone App for Business Growth - rickgrimesss22
Discover the essential features to incorporate in your Winzo clone app to boost business growth, enhance user engagement, and drive revenue. Learn how to create a compelling gaming experience that stands out in the competitive market.
2. • Committer on the Apache Pulsar project.
• Former Principal Software Engineer on Splunk’s Pulsar-as-a-Service team.
• Global Streaming Practice Director at Streamlio & Hortonworks
3. • Author of Pulsar in Action
• Co-author, Practical Hive
4. When Failure is Not an Option
Introducing Pulsar’s Failover Client
5. Defining Availability
• Availability is measured as the ratio of uptime to total time (uptime plus downtime) within a year.
• Each layer builds on the previous one.
6. Multifaceted Availability
• Availability is a concern across multiple layers.
• Each of these layers has its own uptime metric.
• Application uptime is equal to the lowest uptime metric across all layers.
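The "weakest link" rule on this slide is just a minimum over the per-layer metrics; a quick sketch with illustrative numbers (not from the deck):

```python
# Application availability is bounded by the least-available layer it depends on.
layer_uptime = {
    "platform": 0.9999,   # illustrative figures only
    "data":     0.99995,
    "client":   0.999,
}

app_uptime = min(layer_uptime.values())
weakest = min(layer_uptime, key=layer_uptime.get)
print(f"application uptime <= {app_uptime} (limited by the {weakest} layer)")
```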
8. Platform Availability Features
• Stateless brokers
• Redundant components across all layers
• Ability to leverage cloud-native features such as StatefulSets to maintain a minimum replica count
9. Data Availability Features
• Self-healing replicated data storage.
• Rack placement policies
• Geo-replication of data
10. Application Availability Features
• Connection-aware clients that automatically detect and recover when a client disconnects from one of the brokers.
• Completely transparent to the application.
11. Availability in Pulsar Before 2.10
• Apache Pulsar could only provide high availability.
• Application availability is the weakest link.
12. What was missing?
• Until now, Pulsar clients could only interact with a single Pulsar cluster and were unable to detect and respond to a cluster-level failure event.
• In the event of a complete cluster failure, these clients could not reroute their messages to a secondary/standby cluster automatically.
• In such a scenario, any application that uses the Pulsar client is vulnerable to a prolonged outage, since the client cannot establish a connection to an active cluster.
13. Pre-2.10 Cluster Failover
• To redirect the clients from the “active” cluster to the standby cluster, the DNS entry for the Pulsar endpoint that the client applications use must be updated to point to the load balancer of the standby cluster.
• Pulsar clients are configured to use a single static URL to connect.
• The DNS record is updated to point to the regional load balancer.
14. What is wrong with this approach?
• It requires your DevOps team to monitor the health of your Pulsar clusters and manually update the DNS record to point to the standby cluster when the active cluster is down.
• This cutover is not automatic, and the recovery time is determined by the response time of your DevOps team.
• Even after the DNS record has been changed, it will take some additional time before the DNS cache is refreshed.
16. Two new approaches
• There are two new cluster failover strategies included in the upcoming 2.10 release.
• One supports automatic failover in the event of a cluster outage, while the other lets you control the switch-over through an HTTP endpoint.
17. Automated Failover
• The AutoClusterFailover strategy automatically switches from the primary cluster to a stand-by cluster in the event of a cluster outage.
• This behavior is controlled by a probe task that monitors the primary cluster.
• When it finds the primary cluster is unavailable for more than failoverDelayMs, it will switch the client connections over to the secondary cluster.
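The probe-and-delay logic above can be sketched as a small state machine. This is an illustration of the mechanism, not the actual Pulsar AutoClusterFailover implementation; the probe and clock are injected callables so the behavior can be driven deterministically:

```python
class AutoFailover:
    """Sketch of the AutoClusterFailover logic: a probe monitors the
    primary cluster, and once the primary has been unavailable for
    longer than failover_delay_ms the client switches to the secondary."""

    def __init__(self, primary, secondary, failover_delay_ms, probe, clock):
        self.primary = primary
        self.secondary = secondary
        self.failover_delay_ms = failover_delay_ms
        self.probe = probe          # returns True if the given cluster is healthy
        self.clock = clock          # returns the current time in milliseconds
        self.current = primary
        self.down_since = None      # when the primary was first seen down

    def check(self):
        if self.probe(self.primary):
            self.down_since = None  # primary healthy: reset the outage timer
            return self.current
        if self.down_since is None:
            self.down_since = self.clock()          # outage just detected
        elif self.clock() - self.down_since > self.failover_delay_ms:
            self.current = self.secondary           # switch client connections
        return self.current
```

Because the switch only happens after the delay has elapsed, a transient blip shorter than failover_delay_ms never triggers a failover, which mirrors the guard described above.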
18.
19. Controlled Failover
• The ControlledClusterFailover strategy supports switching from the primary cluster to a stand-by cluster in response to a signal sent from an external service.
• This strategy enables your administrators to trigger the cluster switch-over.
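The client side of this strategy reduces to a simple polling loop. The sketch below illustrates the idea only; in the real ControlledClusterFailover the provider is an HTTP endpoint, which is stood in for here by a plain callable:

```python
class ControlledFailover:
    """Sketch of the ControlledClusterFailover idea: the client
    periodically asks an external provider which cluster it should be
    connected to, and switches whenever the answer changes."""

    def __init__(self, default_url, url_provider):
        self.current = default_url
        self.url_provider = url_provider  # stands in for the HTTP endpoint

    def poll(self):
        target = self.url_provider()
        if target and target != self.current:
            self.current = target         # administrator-triggered switch
        return self.current
```

The switch-over happens only when the provider's answer changes, so the administrators, not the client, decide when and where to fail over.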
23. What am I going to demo?
• Automatic Failover:
• Step 1: Start an application that uses the Automatic Failover client to produce data to a topic.
• Step 2: Start consumers on both the active & standby clusters.
• Step 3: Stop the active Pulsar cluster.
• Step 4: Observe the flow of data shift from the active to the standby cluster.
• Step 5: Restart the primary cluster.
• Step 6: Observe the flow of data shift back to the primary cluster.
24. What am I going to demo?
• Controlled Failover:
• Step 1: Start the REST Endpoint service.
• Step 2: Start an application that uses the Controlled Failover client to produce data to a topic.
• Step 3: Start consumers on both the active & standby clusters.
• Step 4: Trigger the controller to switch to a different Pulsar cluster after approximately 20 messages.
• Step 5: Observe the flow of data shift from the active to the standby cluster.
• Step 6: Trigger the controller to switch to the original Pulsar cluster after approximately 30 messages.
25. Summary
• Release 2.10 of Pulsar includes two new failover clients that provide continuous availability for your Pulsar applications.
• I demonstrated how to configure and use the Automatic Failover client when producing messages.
• The Controlled Failover client is harder to implement because it requires an additional service to be written, but it does provide more flexibility.
26. Thanks for Attending
Scan the QR Code to learn more about Apache Pulsar.
Explore the Code
https://github.com/david-streamlio/cluster-failover-demo
Welcome to my talk entitled “when failure is not an option”.
Today I will be discussing the additions to the Apache Pulsar project that can help provide continuous availability for your applications that interact with Pulsar.
My name is David Kjerrumgaard, and I am proud to be a committer on the Apache Pulsar project.
I am currently a Developer Advocate at StreamNative, the company behind Apache Pulsar
Previously I was a principal software engineer at Splunk, where I worked on their Pulsar-as-a-Service team
I am also the author of Pulsar in Action from Manning and co-author of Practical Hive from Apress.
Developing a continuously-available application requires more than just utilizing fault-tolerant services such as Apache Pulsar in your software stack.
It also requires immediate failure detection and resolution including built-in failover when there are data center outages.
Up until now, Pulsar clients could only interact with a single Pulsar cluster and were unable to detect and respond to a cluster-level failure event. In the event of a complete cluster failure, these clients cannot reroute their messages to a secondary/standby cluster automatically.
This can lead to application failure, which for many is not an option.
Uptime is typically measured by calculating the ratio of uptime to total time within a year, then expressing that ratio as a percentage.
The concept of “five-nines” — availability of 99.999% — has been an industry gold standard for many years.
Systems that can only survive failures at the hardware layer (including individual server outages) are considered "fault-tolerant"
Systems that can survive an AZ outage are considered “highly-available”
The ability to survive one or more regional outages is considered “continuously available”
When people use the term availability, they tend to think only of PLATFORM availability, i.e., is the system up or down?
This is because availability is generally considered a DevOps concern, but it is an APPLICATION and DATA concern as well.
One approach to providing high-availability is to distribute the platform resources across different zones and/or geographical regions.
While necessary, this isn't enough. The data used by the system must be kept in sync across those zones and regions as well.
A system with a missing or incomplete dataset is often worse than not having the system available at all, as it can lead to incorrect information, duplicate processing, etc.
From an application perspective, it is incumbent upon your application to immediately detect a failure in the system and automatically switch over to the "active" platform in a seamless manner.
Let’s start with a quick review of all of Pulsar’s availability features already inside the platform.
Let’s look at Pulsar’s platform availability features.
Pulsar’s multi-tiered design makes it highly-available by default.
Separating the serving layer from the data storage layer allows Pulsar’s brokers to be 100% stateless.
Consequently, any broker can serve data from any topic by reading it from the separate storage layer instead of from local disk (as other messaging systems such as Kafka do).
Additionally, stateless brokers that fail can be easily replaced with new broker instances without any additional setup steps.
Pulsar’s storage layer maintains multiple replicas of the data on different bookie nodes to ensure that the loss of one or more bookies does not result in a loss of the data.
From a Data availability perspective,
Pulsar’s storage layer is self-healing. It will automatically detect any under-replicated data and re-create new copies of the data for you.
This allows us to easily replace any failed bookies with new bookie instances and let the self-healing mechanism re-populate the new bookie with data.
This ensures data availability within an individual cluster.
Furthermore, Pulsar supports rack-placement to ensure that at least one replica of the data in the storage layer is stored in a different AZ within the same geographical region.
Pulsar’s geo-replication mechanism allows you to asynchronously replicate data across multiple clusters to maintain consistent copies of your datasets between regions.
These capabilities combine to provide continuous data availability.
At the application level, Pulsar provides connection-aware clients that insulate the application from intermittent network outages.
The Pulsar client automatically detects these network issues and re-establishes the connection rather than throwing an exception that (if uncaught) could cause the application to crash.
This behavior is completely hidden from the application code and provides resiliency to broker failures.
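The reconnect-and-retry behavior described here can be sketched in a few lines. This is a generic illustration of what a connection-aware client does internally, not the Pulsar client's actual code; `send` and `reconnect` are hypothetical callables standing in for the client internals:

```python
import time

def send_with_reconnect(send, reconnect, max_attempts=5, base_delay_s=0.1):
    # Instead of surfacing a connection error to the application, the
    # client re-establishes the connection and retries with exponential
    # backoff, keeping the failure invisible to the application code.
    for attempt in range(max_attempts):
        try:
            return send()
        except ConnectionError:
            time.sleep(base_delay_s * (2 ** attempt))  # back off before retrying
            reconnect()                                # transparent to the app
    raise ConnectionError("broker unreachable after retries")
```

Only after the retry budget is exhausted does the failure become visible, which is why intermittent broker outages never reach the application.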
Prior to the 2.10 release Pulsar was able to provide continuous availability at only the platform and data level.
Pulsar’s geo-replication mechanism allows you to replicate the data across multiple geographic regions, ensuring that your data will remain available even in the event of a regional failure.
Similarly, Pulsar’s architecture supports multiple clusters spread across different geographical regions, ensuring that a complete Pulsar cluster will be readily available in the event of a regional failure.
The one missing piece to the continuous availability story was the application layer.
As noted earlier, Pulsar clients could only interact with a single cluster and could not reroute their messages to a standby cluster after a complete cluster failure.
This would eventually lead to prolonged outages at the application level.
Prior to the 2.10 release of Pulsar, the best you could do was to provide a single static endpoint for Pulsar as shown here.
Oftentimes, the connection URL to Pulsar is provided by a configuration file. This value is read once and remains static inside the application.
Then when a regional failure occurred, you had to manually change the DNS entry for that URL to point to the stand-by cluster.
Starting with release 2.10 of Pulsar, we have added a new feature called failover clients that solves these problems.
There are two distinct types of failover clients that are available in the 2.10 release
The first is one that will automatically reroute your client connections to a different Pulsar cluster as soon as it detects a cluster outage.
The second one allows you to trigger the failover through an exposed HTTP endpoint. This client will periodically invoke the endpoint to get the connection details of the cluster it is supposed to connect to. This approach gives your admins more control over the failover process.
So, let’s discuss the automated failover client first.
As the name implies, this failover client will automatically switch clients over to a designated standby cluster if and when it detects an outage on the primary cluster.
This is accomplished by a probe task that periodically interrogates the primary cluster to determine if it is running or not.
Once it has detected that the primary cluster is unavailable, it starts a timer to measure the length of the outage. This is to ensure that we don’t inadvertently switch over due to a transient network issue.
If the outage continues for longer than the user-configured duration, then the switch-over occurs.
Let’s look at how this automatic failover client is configured and used
The first thing to note is the creation of a separate set of authentication credentials for the secondary cluster.
Next, note that there are both a primary cluster URL property and a secondary property.
The primary property takes the broker URL for your preferred cluster connection, while the secondary takes a list of one or more alternative clusters to connect to.
This allows you to have multiple stand-by clusters, which matches Pulsar's geo-replication capability of supporting multiple clusters.
The failoverDelay property specifies how long the primary cluster outage must be before switching over to the standby cluster.
The switchback property specifies how long the client waits to switch back to the primary cluster once it detects that the primary cluster is back up and running.
This is because the probe against the primary cluster will continue to run even after the client has failed over to the standby cluster. Once it has detected that the primary cluster is back up it will wait this long to switch back to the primary cluster
The checkInterval controls the frequency at which the probe is executed.
Finally, the failover configuration is then used to build a Pulsar client.
Now let’s discuss the controlled failover client
As the name implies, this client allows you to control when and where your Pulsar client fails over.
This is accomplished via a REST service that YOU must implement.
Let’s look at how this controlled failover client is configured and used
The first thing to note is the creation of a separate set of authentication credentials. These are for accessing the REST endpoint (NOT the standby cluster).
The default service URL property takes the broker URL for your preferred cluster connection.
The checkInterval controls the frequency at which the REST endpoint is invoked.
The urlProvider is where you specify the address of the REST service you implemented, and the urlHeader is where you provide the contents of the HTTP header.
The header can be used to provide authentication credentials, etc.
Finally, the failover configuration is then used to build a Pulsar client.
Let’s look at a simple example of a REST endpoint service
First, notice that the expected return type is a JSON object that contains the four fields shown here.
This data structure allows you to provide all the necessary authentication credentials required to connect to a Pulsar cluster.
Also note that this information is generated dynamically in the code, so in theory it could be read from a database, etc.
This provides much more flexibility than the Automated failover client which requires you to provide a hard-coded list of Pulsar broker URLs.
In this example, I am forcing a switch over to a standby cluster based on the number of times the REST endpoint is called
This is to demonstrate a failover to a standby cluster and back to the active one, as we shall see.
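A minimal stand-in for such a service can be sketched with Python's standard library. The real endpoint returns a JSON object with the four fields shown on the slide (service URL plus authentication details); for brevity this sketch returns only an illustrative serviceUrl field, and, like the demo, it forces a switch to the standby cluster after a fixed number of calls:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

PRIMARY = "pulsar://primary:6650"    # illustrative cluster URLs
STANDBY = "pulsar://standby:6650"
SWITCH_AFTER = 3                     # force a switch after this many calls

class FailoverController(BaseHTTPRequestHandler):
    calls = 0

    def do_GET(self):
        FailoverController.calls += 1
        # Advertise the standby cluster once the call count passes the
        # threshold, mimicking the demo's call-count-based switch-over.
        url = STANDBY if FailoverController.calls > SWITCH_AFTER else PRIMARY
        body = json.dumps({"serviceUrl": url}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):    # keep the demo output quiet
        pass

def fetch_service_url(port):
    # What the failover client does on each checkInterval tick.
    with urllib.request.urlopen(f"http://localhost:{port}/") as resp:
        return json.loads(resp.read())["serviceUrl"]

server = HTTPServer(("localhost", 0), FailoverController)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
```

Each poll before the threshold returns the primary cluster's URL; afterwards the standby's URL is returned, and the client switches accordingly.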
Next I will demonstrate both of these failover clients in action.
For those of you that are interested, the source code for this demo is available in the GitHub repo shown here.