The design of Apache Cassandra allows applications to provide constant uptime. Peer-to-Peer technology ensures there are no single points of failure, and the Consistency guarantees allow applications to function correctly while some nodes are down. There is also a wealth of information provided by the JMX API and the system log. All of this means that when things go wrong you have the time, information and platform to resolve them without downtime. This presentation will cover some of the common, and not so common, performance issues, failures and management tasks observed in running clusters. Aaron will discuss how to gather information and how to act on it. Operators, Developers and Managers will all benefit from this exposition of Cassandra in the wild.