At Netflix we are building a content delivery network called OpenConnect to serve traffic to Netflix customers, which currently accounts for up to 36.5% of peak internet traffic in the US. The network currently consists of thousands of caches spread around the world, and we are actively deploying more as Netflix adds customers and enters new markets.
Naturally, monitoring is an important part of our workday: as we operate and grow the system and make changes to the network and to the software powering the caches, we need to make sure Netflix customers are not affected.
Although we follow a 'testing in production' development style, we don't have a 24/7 NOC, and the whole network is maintained by a relatively small operations team. Given the size of the system, something is failing all the time, but the network is resilient to small failures. Therefore, while we want to track all issues, not all of them are equally urgent.
Given the specifics of the problem domain, we decided to build our own monitoring system, optimized for our environment and providing:
* Integration with different metric sources to get monitoring signals
* A programmatic API for automated tools to communicate with the monitoring system
* Prioritization of issues
* Aggregation of metrics by logical groups that represent the structure of the monitored system
* UI elements that give ops control over data visualization, issue troubleshooting, and triage
Although our monitoring system currently targets our problem domain, we believe that our experience in building it will benefit the community and that the approach can be adapted to any distributed system.
Main topics:
* The concept of stateful monitoring and alerting based on state changes
* Issue aggregation and prioritization
* Building a UI that turns your monitoring system into a collaborative tool for ops to detect, triage, and troubleshoot issues
* Lessons learned
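
To make the first topic concrete, here is a minimal sketch of alerting on state changes rather than on every failing sample. It is an illustration only and not the OpenConnect implementation: the `StatefulMonitor` class, the `State` values, the check name `cache-1/error_rate`, and the 0.05 threshold are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    OK = "ok"
    FAILING = "failing"


@dataclass
class StatefulMonitor:
    """Hypothetical monitor: remembers the last state of each check
    and reports only transitions between states."""
    threshold: float  # assumed: values above this count as failing
    states: dict = field(default_factory=dict)

    def observe(self, check_id: str, value: float):
        new_state = State.FAILING if value > self.threshold else State.OK
        # A check we have never seen is assumed to start in the OK state.
        old_state = self.states.get(check_id, State.OK)
        self.states[check_id] = new_state
        if new_state is old_state:
            return None  # no state change, so no alert
        return (check_id, old_state, new_state)


monitor = StatefulMonitor(threshold=0.05)
samples = [0.01, 0.09, 0.12, 0.02]  # made-up error rates for one cache
events = [monitor.observe("cache-1/error_rate", v) for v in samples]
transitions = [e for e in events if e is not None]

for check_id, old, new in transitions:
    print(f"{check_id}: {old.value} -> {new.value}")
```

With these made-up samples, the two consecutive failing readings produce a single alert when the check enters the failing state, plus one notification when it recovers. This keeps the alert volume tied to the number of issues, not to the number of samples.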