Every company likes to brag about their successes, but not many are willing to talk about their failures. At PagerDuty we have been rigorously tracking downtime in order to analyze it and learn from our mistakes - we even blog about these failures publicly. Despite being a highly available system, we have had three outages caused by problems with our production Cassandra clusters over the past year. We'll take a look at each of these outages: what we saw from the inside, the actions we took to recover, and most importantly the procedures and monitoring that will help prevent it from happening to you.