Apache Pulsar is a highly available, distributed messaging system that provides guarantees of no message loss and strong message ordering with predictable read and write latency. In this talk, learn how these guarantees can be validated for Apache Pulsar deployments on Kubernetes. Various failures are injected using Chaos Mesh to simulate network and other infrastructure failure conditions. Many questions are asked about failure scenarios, but it can be hard to find answers to them. When a failure happens, how long does it take to recover? Does it cause unavailability? How does it impact throughput and latency? Are the guarantees of no message loss and strong message ordering kept, even when components fail? If a complete availability zone (AZ) fails, is the system configured correctly to handle the failure? This talk will help you find answers to these questions and apply the tooling and practices to your own testing and validation.
Validating Apache Pulsar’s Behavior under Failure Conditions - Pulsar Summit SF 2022
1. Pulsar Summit San Francisco
Hotel Nikko
August 18, 2022
Tech Deep Dive
Validating Apache Pulsar’s Behavior under Failure Conditions
Lari Hotari
Engineering Coach • DataStax
2. Lari Hotari is an Apache Pulsar committer and PMC member. He has worked on the Java platform since 1997 and has contributed to open source for over 20 years.
Lari Hotari
Engineering Coach, Streaming
Customer Reliability Engineering
DataStax
Lari.Hotari@datastax.com
@lhotari
3. Validating Apache Pulsar’s Behavior under Failure Conditions
“Apache Pulsar is a highly available, distributed messaging system that provides guarantees of no message loss and strong message ordering with predictable read and write latency.”
4. Expectation: Provided service meets the service consumer’s requirements with very low downtime.
Expectation: “two nines” (99% available) or more.
5. Availability
Availability %                        Downtime per day (24 hours)
99% ("two nines")                     14.4 minutes
99.5% ("two and a half nines")        7.20 minutes
99.9% ("three nines")                 1.44 minutes
99.95% ("three and a half nines")     43.2 seconds
99.99% ("four nines")                 8.64 seconds
99.995% ("four and a half nines")     4.32 seconds
99.999% ("five nines")                864 milliseconds
● During uptime, the provided service meets the agreed level of operational quality and performance defined in the operational SLA
● The service consumer’s needs are met when service disruptions don’t cause significant negative business impact.
Some factors impacting the availability figures
● Reporting interval
● What is considered as downtime?
○ Total Failure vs Service Degradation / Partial Failure
○ High error rate? Exceeding latency requirements?
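The downtime figures in the availability table follow directly from the availability percentage and the length of the reporting interval. The following is a minimal sketch of that arithmetic; the function names are chosen here for illustration and are not part of any tool mentioned in the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(availability_percent, interval_seconds=SECONDS_PER_DAY):
    """Return the downtime budget, in seconds, for one reporting interval."""
    return (1.0 - availability_percent / 100.0) * interval_seconds


def format_duration(seconds):
    """Render a duration in the unit that reads best for its magnitude."""
    if seconds >= 60:
        return f"{seconds / 60:.2f} minutes"
    if seconds >= 1:
        return f"{seconds:.2f} seconds"
    return f"{seconds * 1000:.0f} milliseconds"


if __name__ == "__main__":
    for percent in (99.0, 99.5, 99.9, 99.95, 99.99, 99.995, 99.999):
        budget = allowed_downtime_seconds(percent)
        print(f"{percent}% -> {format_duration(budget)} per day")
```

The reporting interval matters: the same 99.9% target allows about 1.44 minutes of downtime per day but about 43.2 minutes when measured over a 30-day interval, so a single longer outage can fit within a monthly budget while breaching a daily one.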
6. Expectation: At-least-once message delivery. Published messages aren’t lost in the system under any circumstances.
Consuming state is preserved so that messages aren’t skipped when consuming.
The system will redeliver messages which aren’t acknowledged.
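The at-least-once expectation can be illustrated with a toy in-memory model. This is a sketch only: `ToySubscription` and its methods are hypothetical names invented for this example, they are not the Pulsar client API, and the model ignores brokers, bookies, and networking entirely. It shows only the idea that a message stays tracked until it is acknowledged and is delivered again otherwise.

```python
from collections import deque


class ToySubscription:
    """Hypothetical in-memory stand-in for a subscription (not the Pulsar API)."""

    def __init__(self, messages):
        self._pending = deque(messages)  # (msg_id, payload) not yet delivered
        self._unacked = {}               # delivered but not yet acknowledged

    def receive(self):
        """Deliver the next message, or None when nothing is pending."""
        if not self._pending:
            return None
        msg_id, payload = self._pending.popleft()
        self._unacked[msg_id] = payload
        return msg_id, payload

    def acknowledge(self, msg_id):
        """Mark a message as consumed so it is never delivered again."""
        self._unacked.pop(msg_id, None)

    def redeliver_unacknowledged(self):
        """Put unacknowledged messages back at the front, in original order."""
        for msg_id in sorted(self._unacked, reverse=True):
            self._pending.appendleft((msg_id, self._unacked.pop(msg_id)))


def consume_with_one_crash(messages, crash_on_id):
    """Process all messages; fail once after processing but before acking."""
    subscription = ToySubscription(messages)
    processed = []
    crashed = False
    while True:
        item = subscription.receive()
        if item is None:
            break
        msg_id, _payload = item
        processed.append(msg_id)
        if msg_id == crash_on_id and not crashed:
            crashed = True
            subscription.redeliver_unacknowledged()
            continue
        subscription.acknowledge(msg_id)
    return processed


if __name__ == "__main__":
    messages = [(i, f"payload-{i}") for i in range(5)]
    print(consume_with_one_crash(messages, crash_on_id=2))
```

In this model no message is skipped, but the message that was being processed during the failure is seen twice. A validation workload therefore has to distinguish duplicates, which at-least-once delivery permits, from loss, which it does not.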
7. Expectation: Messages are delivered to a consumer in the same order in which the publisher published them to a single topic.
8. Expectation: The messaging system can be used for use cases where there is a low latency requirement.
Applications can expect messages to be published with low latency, and the end-to-end latency from publishing to consuming is expected to be low and predictable.
9. Summary of Expectations
Highly available
No message loss
Strong message ordering
Predictable read and write latency
11. Failure Conditions
What could possibly go wrong?
12. How to think about the different ways things can go wrong and decide what to validate?
● Learning from real production systems
○ Incident reports / post mortems
● System analysis methods coming from
○ Reliability Engineering
■ Reliability Modeling
○ Systems Reliability Theory
■ FMEA/FMECA (Failure mode and effects analysis)
○ Risk assessment theory
■ Risk analysis
13. Examples of failure conditions for Pulsar validation
● Broker/Bookie/ZooKeeper node fails
● All components in an availability zone fail
● Network disconnected -> Network partitioning / Split-Brain
● Network limited bandwidth / increased latency
● Network flappy connectivity
● Network packet loss
● Bookie/ZooKeeper disk fails
14. Examples of other conditions for Pulsar validation
● Broker scale-up / scale-down
● Bookie scale-up / scale-down
● Broker/Bookie/ZooKeeper software upgrade
Performance / Load testing related failure conditions:
● Message publishing overload
● Message consuming overload
15. Unknown failure conditions - these will always exist
“Reports that say that something hasn't happened are always
interesting to me, because as we know, there are known knowns;
there are things we know we know. We also know there are known
unknowns; that is to say we know there are some things we do not
know. But there are also unknown unknowns—the ones we don't
know we don't know. And if one looks throughout the history of our
country and other free countries [incident reports]*, it is the latter
category that tends to be the difficult ones.”
- Donald Rumsfeld
* adapted to SRE
17. Test plans and test reports
● Useful for collaboration and communicating with stakeholders
● Written test plan with specific test cases and documented expectations
○ Test case descriptions include the definition of the failure condition
● Test reports that capture essential results for analysis
18. Test plan example
Test case format:
- Test case identifier + title
- Description and intent
- Procedure
- Expected outcome
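The test case format above maps naturally onto a small record type, which makes it easy to keep test plans next to the automation that runs them. The sketch below is one possible shape; the class name, the field names, and the sample test case content are all hypothetical and are not taken from the test plan shown in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResilienceTestCase:
    """One test plan entry, using the four-part format listed above."""

    identifier: str
    title: str
    description: str        # description and intent, incl. the failure condition
    procedure: tuple        # ordered steps
    expected_outcome: str

    def render(self):
        """Render the test case as plain text for a written test plan."""
        lines = [f"{self.identifier}: {self.title}",
                 f"Description and intent: {self.description}",
                 "Procedure:"]
        lines += [f"  {n}. {step}" for n, step in enumerate(self.procedure, 1)]
        lines.append(f"Expected outcome: {self.expected_outcome}")
        return "\n".join(lines)


# Hypothetical example content, for illustration only.
BROKER_FAILURE = ResilienceTestCase(
    identifier="TC-001",
    title="Broker node fails",
    description="A single broker is terminated while the test workload runs.",
    procedure=(
        "Start the test workload and wait for steady state",
        "Inject the broker failure",
        "Observe latency and error metrics until recovery",
    ),
    expected_outcome="No message loss and message order is preserved.",
)

if __name__ == "__main__":
    print(BROKER_FAILURE.render())
```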
19. Test report example
Analysis and status update to stakeholders
20. Validation approaches
Test Environment with Test Workload
● Resilience Testing
● Chaos Testing
Production Environment with Production Workload
● Resilience Engineering
● Chaos Testing
21. Chaos Testing
● Requires test tooling for fault injection
● Fault injection can be used to put specific infrastructure components into a failed or degraded state which can be controlled by the chaos testing framework
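With Chaos Mesh, a fault is described declaratively as a Kubernetes custom resource. The sketch below builds two such resources as Python dictionaries and prints them as JSON, which `kubectl apply -f` accepts alongside YAML. The resource names, the `pulsar` namespace, and the `component` label values are assumptions made for this example; they must be adjusted to match the labels of the actual Pulsar deployment.

```python
import json


def network_delay_chaos(name, namespace, label_selectors, latency, duration):
    """Build a NetworkChaos resource that adds latency to all matching pods."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "all",
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": label_selectors,
            },
            "delay": {"latency": latency},
            "duration": duration,
        },
    }


def pod_kill_chaos(name, namespace, label_selectors):
    """Build a PodChaos resource that kills one matching pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",
            "mode": "one",
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": label_selectors,
            },
        },
    }


if __name__ == "__main__":
    # Assumed namespace and labels; check them against the real deployment.
    experiments = [
        network_delay_chaos("bookie-network-delay", "pulsar",
                            {"component": "bookie"}, "200ms", "5m"),
        pod_kill_chaos("broker-pod-kill", "pulsar", {"component": "broker"}),
    ]
    for experiment in experiments:
        print(json.dumps(experiment, indent=2))
```

Generating the resources from code keeps each failure condition in the test plan tied to a reproducible definition that test control scripts can apply and remove.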
22. Test workload
Simulated Workload Created With Test Tooling
Test Applications In A Test Environment
Anonymized / Shadowed Production Traffic
23. Test workload generation
● NoSQLBench, ASL 2.0 license, https://github.com/nosqlbench/nosqlbench
○ Originally created for testing NoSQL databases, but has since been adapted for testing messaging systems
● pulsar-perf
○ Comes with the Apache Pulsar distribution
● Custom test workload generator applications
24. Tooling requirement for validating Pulsar’s behavior
● End-to-end observability
○ NoSQLBench Pulsar driver features:
■ Measure end-to-end message processing latency
■ Detect out-of-order messages, message loss, and message duplication
Highly available
No message loss
Strong message ordering
Predictable read and write latency
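One common way to detect ordering issues, loss, and duplication is to have the producer attach a monotonically increasing sequence number to every message and to analyze the numbers on the consuming side. The following is a minimal sketch of that idea, assuming a single producer that publishes sequence numbers 0 to N-1 to a single topic. It is not a description of how the NoSQLBench Pulsar driver implements its checks.

```python
def analyze_sequence(received, expected_count):
    """Classify received sequence numbers from one producer on one topic.

    Assumes the producer published 0..expected_count-1 exactly once each.
    """
    seen = set()
    duplicates = []
    out_of_order = []
    highest = -1
    for seq in received:
        if seq in seen:
            # Already delivered once; at-least-once delivery permits this.
            duplicates.append(seq)
            continue
        seen.add(seq)
        if seq < highest:
            # Arrived after a message that was published later.
            out_of_order.append(seq)
        else:
            highest = seq
    lost = sorted(set(range(expected_count)) - seen)
    return {"duplicates": duplicates, "out_of_order": out_of_order, "lost": lost}


if __name__ == "__main__":
    # Message 2 arrives late and twice; message 4 never arrives.
    print(analyze_sequence([0, 1, 3, 2, 2, 5], expected_count=6))
```

Loss can only be declared once the workload has finished and redelivery has had time to complete, since an unacknowledged message may still arrive later.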
25. Example of NoSQLBench Pulsar driver metrics rendered with Grafana
End-to-end publish-to-consume latency and error metrics
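End-to-end latency is usually summarized as percentiles, because the tail (for example p99) shows the effect of a failure far more clearly than the average does. The sketch below computes nearest-rank percentiles from pairs of publish and consume timestamps. The function names and sample data are made up for illustration, and the approach assumes that the producer and consumer clocks are synchronized.

```python
import math


def end_to_end_latencies_ms(samples):
    """Return sorted latencies from (publish_time_ms, consume_time_ms) pairs."""
    return sorted(consumed - published for published, consumed in samples)


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list, for 0 < p <= 100."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


if __name__ == "__main__":
    # Made-up timestamps; one message is delayed, e.g. during a failover.
    samples = [(0, 5), (10, 12), (20, 60), (30, 33)]
    latencies = end_to_end_latencies_ms(samples)
    for p in (50, 99):
        print(f"p{p}: {percentile(latencies, p)} ms")
```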
27. Detecting ordering issues
Pulsar Java client ordering issues fixed since Pulsar version 2.8.2:
● [Java Client] Remove data race in MultiTopicsConsumerImpl to ensure correct message order #12456
● [Java Client] Use epoch to version producer's cnx to prevent early delivery of messages #12779
28. Automation choices
● No automation - interactive testing
● Custom script / in-house test framework
● Fallout
○ Open source test orchestration harness
○ Automates creation of environment, workload execution, data collection and analysis
○ Plugin architecture integrates with common tools
30. Deployment view of example setup
k8s cluster
Chaos Mesh
Pulsar deployment: brokers, bookies, zookeepers
Test workload: NoSQLBench jobs run as k8s jobs on dedicated k8s node pool
Prometheus Graphite Exporter
Prometheus
Grafana
Grafana dashboards
Grafana renderer
Test control scripts
34. Four Cornerstones of Resilience
Anticipation: Knowing what to EXPECT
Monitoring: Knowing what to LOOK FOR
Response: Knowing what to DO
Learning: Knowing what has HAPPENED
Erik Hollnagel’s Four Cornerstones of Resilience