This document discusses testing distributed databases like Cassandra for fault tolerance. It recommends testing at scale by simulating production workloads and failure scenarios over extended periods. Critical factors to test include node performance, configuration, repair, and mean time to recovery from single node, rack, availability zone and full data center failures both within and beyond the hint window. The goal is to validate that the database can sustain workloads and recover from failures at the expected utilization levels.
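As a rough illustration of the failure scenarios listed above, here is a minimal sketch that enumerates a test matrix of failure scope against outage duration. The `Scenario` class, the `build_matrix` helper, and the example outage durations are hypothetical names and values chosen for illustration. The 3-hour hint window is an assumption based on Cassandra's default setting, so check the value configured on your own cluster.

```python
from dataclasses import dataclass
from itertools import product

# Failure scopes named in the summary above.
FAILURE_SCOPES = ["single node", "rack", "availability zone", "data center"]

# Assumption: 3 hours, Cassandra's default hint window. Verify against your config.
HINT_WINDOW_HOURS = 3.0


@dataclass(frozen=True)
class Scenario:
    """One failure test case: what fails and for how long (hypothetical helper)."""
    scope: str
    outage_hours: float

    @property
    def beyond_hint_window(self) -> bool:
        # Outages longer than the hint window cannot be healed by hints alone.
        return self.outage_hours > HINT_WINDOW_HOURS


def build_matrix(outage_hours=(1.0, 6.0)):
    """Cross every failure scope with every outage duration."""
    return [Scenario(scope, hours) for scope, hours in product(FAILURE_SCOPES, outage_hours)]


if __name__ == "__main__":
    for sc in build_matrix():
        window = "beyond hint window" if sc.beyond_hint_window else "within hint window"
        print(f"{sc.scope:18s} {sc.outage_hours:4.1f}h  {window}")
```

Each row is a candidate test run. For each one you would record the time to recovery while the cluster carries its expected workload.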
So let’s start off with some assumptions we are going to make.
You have adequate hardware to meet your workload needs (or at least you think you do.)
You are adept at installing and configuring Cassandra or DSE.
You are able to design, and have designed, a sound data model. There are lots of other presentations about data modeling…
So why do we build distributed systems and especially use distributed databases like Cassandra?
To get these specific capabilities. None of this should be a surprise to anyone.
While these are fantastic reasons for using distributed systems, these capabilities don’t come without any cost.
Understanding how distributed systems can fail is even more critical, because there are many connected parts that make up the powerful whole.
If a component fails and you aren’t prepared for it, don’t understand what failure looks like, or don’t know what recovery takes, you are no better off than with a single image system. In fact, you may be in a worse place.
While distributed systems are much more complex than single image systems
Go through list…
These are assumptions that I see people using Cassandra in the field make all the time.
The scariest part of this is that all of these are true (well, except for the first one).
But they aren’t true in all circumstances, or even in many circumstances.
Making blind assumptions about how something as complex as Cassandra is going to work is dangerous.
What we are here to talk about today are the factors we need to be aware of in a testing plan and approaches you should be taking to ensure that Cassandra / DSE is going to provide you the benefits it promises.
Testing for a high write workload on STCS (size-tiered compaction) does not prepare you for a high read workload on LCS (leveled compaction).
This is the test that almost everyone runs (at least once)