Early detection and correction of cluster health issues is a vital part of daily cluster management, no matter the size. Building and managing a healthy cluster is the best cure for meeting service level agreements and preventing or avoiding elongated troubleshooting. A cluster is effective and efficient when problems are detected and eliminated early. Fortunately, deploying simple tools and processes prevents minor problems from becoming major headaches. This talk covers how we developed, tested, and deployed a comprehensive health process based on real life events and experiences. The table driven health check runs a full scan in ~2 seconds and includes: a checklist, ‘positive’ error pattern matching, enabling and disabling node blacklisting, logging, validating file systems, processing very large log files, trapping in-rack network faults (adds 5 seconds to accurately detect packet loss), and recommissioning nodes into production.