As the Yelp infrastructure and engineering team grew, so did the pain of managing Nagios. Problems like splitting alerting across multiple teams, providing high availability, and managing Nagios systems in multiple environments had become pressing. As we moved towards a service-oriented architecture and pushed some services out into the cloud, we quickly found we needed more automated monitoring configuration.
An evolutionary solution wasn’t going to solve all of our problems; we needed to revolutionize our monitoring. Sensu is built from the ground up to solve many of these issues and to be easy to extend.
This talk covers our Puppet ‘monitoring_check’ API, which sets up monitoring for our services within Puppet, and how and why we deploy Sensu and our custom handlers and escalations. It also covers how we provide automatic ‘self service’ monitoring for dynamic services and how we deal with the challenges posed by the more ephemeral nature of cloud architectures.