Monitorama 2015 Monitoring OpenConnect CDN

At Netflix we are building a Content Delivery Network called OpenConnect to carry the traffic of Netflix customers, which currently accounts for up to 36.5% of peak internet traffic in the US. The network currently consists of thousands of caches spread around the world, and we are actively deploying more as Netflix adds customers and enters new markets.
Naturally, monitoring is an important part of our workday: as we operate and grow the system and make changes to the network and the software powering the caches, we need to make sure Netflix customers are not affected.
While we follow a 'testing in production' development style, we don't have a 24/7 NOC, and the whole network is maintained by a relatively small operations team. Given the size of the system, something is failing all the time, but the network is resilient to small failures. Therefore, while we want to track all issues, not all of them are equally urgent.
Given the specifics of the problem domain, we decided to build our own monitoring system, optimized for our environment and providing:
* Integration with different metric sources to get monitoring signals
* A programmable API for automated tools to communicate with the monitoring system
* Prioritization of issues
* Aggregation of metrics per logical group, representing the structure of the monitored system
* UI elements giving ops control over data visualization, issue troubleshooting, and triage

While our monitoring system currently targets our own problem domain, we believe that our experience in building these monitoring tools will benefit the community and can be adapted to any distributed system.
Main topics:
* The concept of stateful monitoring and alerting based on state changes
* Issue aggregation and prioritization
* Building a UI that turns your monitoring system into a collaborative tool for ops to detect, triage, and troubleshoot issues
* Lessons learned

  1. Monitoring OpenConnect CDN. Sergey Fedorov, Netflix, Monitorama 2015
  2. What is OpenConnect: 36.5% of US downstream traffic* (* 2015 Sandvine report)
  3. OpenConnect Cache Appliance: space/power optimized; 10/40 Gb/s network interface; FreeBSD OS; NGINX server; BIRD routing proxy. Gizmodo, "This box can hold an entire Netflix", http://gizmodo.com/this-box-can-hold-an-entire-netflix-1592590450
  4. Network: Transit; Internet Exchange; ISP embedded
  5. Intelligent clients
  6. Control Plane: end-user content request router; client location; network conditions; server utilization; content distribution
  7. Who we are: Sergey Fedorov, Stefan Praszalowicz
  8. Monitoring challenge
  9. Testing in prod*: Network changes; Firmware deployments; App pushes; Updating content; ...
  10. Data sources: Caches; Clients; Control Plane; Micro services; Network; Capacity; Config; Content; Telemetry (Atlas); Logs (ElasticSearch). METRICS
  11. Something breaks all the time
  12. Big problems start small
  13. Context matters
  14. (no slide text)
  15. Small SRE team
  16. Elastic
  17. How we do it
  18. Data sources: Netflix Clients, Caches, Network, Config, ...
  19. Data sources: Netflix Clients, Caches, Network, Config, ... Orchestration. Data processing: stream processors, pollers
  20. FSM. State processing. Data sources: Netflix Clients, Caches, Network, Config, ... Orchestration. Data processing: stream processors, pollers
  21. MAINTENANCE; start fixing; end fixing; threshold=75%
  22. MAINTENANCE; start fixing; end fixing; threshold=75%; action: ok, from: cpu
  23. MAINTENANCE; start fixing; end fixing; threshold=75%; action: ok, from: cpu
  24. MAINTENANCE; start fixing; end fixing; threshold=75%; action: ok, from: cpu
  25. MAINTENANCE; start fixing; end fixing; threshold=75%; action: ok, from: cpu
  26. MAINTENANCE; start fixing; end fixing; threshold=75%; action: silence, from: config
  27. MAINTENANCE; start fixing; end fixing; threshold=75%; action: ok, from: cpu
  28. MAINTENANCE; start fixing; end fixing; threshold=75%; action: silence, from: config
  29. MAINTENANCE; start fixing; end fixing; threshold=75%; action: break, from: cpu
  30. MAINTENANCE; start fixing; end fixing; threshold=75%; action: break, from: cpu
  31. MAINTENANCE; start fixing; end fixing; threshold=75%; action: break, from: cpu
  32. MAINTENANCE; start fixing; end fixing; threshold=75%; action: break, from: cpu
  33. MAINTENANCE; start fixing; end fixing; threshold=75%; action: break, from: cpu
  34. MAINTENANCE; start fixing; end fixing; threshold=75%; action: break, from: cpu
  35. MAINTENANCE; start fixing; end fixing; threshold=75%; action: ok, from: cpu
  36. MAINTENANCE; start fixing; end fixing; threshold=75%; action: unsilence, from: config
  37. MAINTENANCE; start fixing; end fixing; threshold=75%; action: ok, from: cpu
  38. MAINTENANCE; start fixing; end fixing; threshold=75%; action: ok, from: cpu
  39. MAINTENANCE; start fixing; end fixing; threshold=75%; action: ok, from: cpu
  40. MAINTENANCE; start fixing; end fixing; threshold=75%; action: ok, from: cpu
  41. MAINTENANCE; start fixing; end fixing; threshold=75%; action: break, from: cpu
  42. MAINTENANCE; start fixing; end fixing; threshold=75%; action: break, from: cpu
  43. MAINTENANCE; start fixing; end fixing; threshold=75%; action: start_fix, from: user
  44. MAINTENANCE; start fixing; end fixing; threshold=75%; action: break, from: cpu
  45. MAINTENANCE; start fixing; end fixing; threshold=75%; action: break, from: cpu
  46. MAINTENANCE; start fixing; end fixing; threshold=75%; action: break, from: cpu
  47. MAINTENANCE; start fixing; end fixing; threshold=75%; action: break, from: cpu
  48. MAINTENANCE; start fixing; end fixing; threshold=75%; action: break, from: cpu
  49. MAINTENANCE; start fixing; end fixing; threshold=75%; action: break, from: cpu
  50. MAINTENANCE; start fixing; end fixing; threshold=75%; action: break, from: cpu
  51. MAINTENANCE; start fixing; end fixing; threshold=75%; action: ok, from: cpu
  52. MAINTENANCE; start fixing; end fixing; threshold=75%; action: ok, from: cpu
  53. MAINTENANCE; start fixing; end fixing; threshold=75%; action: end_fix, from: user
  54. MAINTENANCE; start fixing; end fixing; threshold=75%
  55. FSM. State processing. Data sources: Netflix Clients, Caches, Network, Config, ... Orchestration. Data processing: stream processors, pollers
  56. FSM. State processing. Data sources: Netflix Clients, Caches, Network, Config, ... Orchestration. Data processing: stream processors, pollers. Events processing. Event handlers
  57. STATE TRANSITION EVENT: old state; new state; input action; metric name; action metadata (metric value, comments, tags, timestamp, ...). Triggers an event. Event handlers. RULES
  58. Events priority. Severity: Info, Notice, Warning, Critical. Escalation: 0, 1, 2, 3. Do Never, Do Last, Do Next, Do Now
  59. Severity: Info, Notice, Warning, Critical. Escalation: 0, 1, 2, 3 (shown twice). Notifications
  60. FSM. State processing. Data sources: Netflix Clients, Caches, Network, Config, ... Orchestration. Data processing: stream processors, pollers. Events processing. Event handlers
  61. Aggregation (C, Cluster). Cache state = aggregation of states of its metrics. Cluster state = aggregation of states of its caches. OK: all OK. DEGRADED: some BROKEN or DEGRADED. BROKEN: most BROKEN. All caches are OK → cluster state is OK
  62. Aggregation (C, Cluster). OK: all OK. DEGRADED: some BROKEN or DEGRADED. BROKEN: most BROKEN. 2/12 caches are BROKEN → cluster state is DEGRADED
  63. Aggregation (C, Cluster). OK: all OK. DEGRADED: some BROKEN or DEGRADED. BROKEN: most BROKEN. 7/12 caches are BROKEN → cluster state is BROKEN
  64. FSM. State processing. Data sources: Netflix Clients, Caches, Network, Config, ... Orchestration. Data processing: stream processors, pollers. Events processing. Event handlers
  65. Challenges: Setup
  66. Challenges: Setup; Predefined groupings
  67. Challenges: Setup; Predefined groupings; UI
  68. Challenges: Setup; Predefined groupings; UI; Issues correlation
  69. Challenges: Setup; Predefined groupings; UI; Issues correlation; Failure forecasting
  70. Challenges: Setup; Predefined groupings; UI; Issues correlation; Failure forecasting; OSS
  71. Feedback
  72. jobs.netflix.com/jobs/1693/; jobs.netflix.com/jobs/2240/. Sergey Fedorov, OpenConnect, Netflix, sfedorov@netflix.com
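
Slides 21 to 54 step a single metric's state machine through a stream of actions (`ok`, `break`, `silence`, `unsilence`, `start_fix`, `end_fix`) arriving from sources such as `cpu`, `config`, and `user`. The following Python sketch illustrates that idea. Only the action names, the MAINTENANCE state, and the 75% threshold come from the slides; the other state names (`OK`, `BROKEN`, `FIXING`), the transition table, and the `MetricFSM` and `cpu_action` helpers are assumptions made for illustration, not the system described in the talk.

```python
from dataclasses import dataclass, field

# Hypothetical transition table: (current state, action) -> next state.
# Any pair not listed here leaves the state unchanged, which is how this
# sketch ignores "break" while a cache is silenced or being fixed.
TRANSITIONS = {
    ("OK", "break"): "BROKEN",
    ("OK", "silence"): "MAINTENANCE",
    ("BROKEN", "ok"): "OK",
    ("BROKEN", "silence"): "MAINTENANCE",
    ("BROKEN", "start_fix"): "FIXING",
    ("MAINTENANCE", "unsilence"): "OK",
    ("FIXING", "end_fix"): "OK",
}


@dataclass
class MetricFSM:
    state: str = "OK"
    history: list = field(default_factory=list)

    def feed(self, action: str, source: str) -> bool:
        """Apply one action; return True only if the state changed."""
        new_state = TRANSITIONS.get((self.state, action), self.state)
        changed = new_state != self.state
        if changed:
            # Only state changes are recorded, so repeated "break" inputs
            # in the same state produce no new entries.
            self.history.append((self.state, new_state, action, source))
            self.state = new_state
        return changed


def cpu_action(cpu_percent: float, threshold: float = 75.0) -> str:
    """Map a raw CPU reading to an action (assumed: strictly above threshold breaks)."""
    return "break" if cpu_percent > threshold else "ok"


# Replay a shortened version of the action stream shown on the slides.
fsm = MetricFSM()
replay = [
    ("ok", "cpu"), ("silence", "config"), ("break", "cpu"),
    ("unsilence", "config"), ("break", "cpu"), ("start_fix", "user"),
    ("break", "cpu"), ("ok", "cpu"), ("end_fix", "user"),
]
changes = [fsm.feed(action, source) for action, source in replay]
```

In this sketch nine inputs produce only five state changes, which is the point of alerting on state transitions rather than on every threshold crossing.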
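
Slide 57 lists the fields of a state transition event (old state, new state, input action, metric name, action metadata) and indicates that rules decide which event handlers run. One way to picture that is the sketch below. The `StateTransitionEvent` and `EventDispatcher` names, the rule and handler signatures, and the sample values are all hypothetical; the talk does not show the actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class StateTransitionEvent:
    # Field names mirror the list on slide 57.
    old_state: str
    new_state: str
    input_action: str
    metric_name: str
    metadata: Dict[str, object] = field(default_factory=dict)


Rule = Callable[[StateTransitionEvent], bool]
Handler = Callable[[StateTransitionEvent], None]


class EventDispatcher:
    """Runs every handler whose rule matches the event (assumed semantics)."""

    def __init__(self) -> None:
        self._rules: List[Tuple[Rule, Handler]] = []

    def register(self, rule: Rule, handler: Handler) -> None:
        self._rules.append((rule, handler))

    def dispatch(self, event: StateTransitionEvent) -> int:
        """Invoke matching handlers and return how many fired."""
        fired = 0
        for rule, handler in self._rules:
            if rule(event):
                handler(event)
                fired += 1
        return fired


# Usage: one handler for newly broken metrics, one for the end of maintenance.
notified: List[str] = []
dispatcher = EventDispatcher()
dispatcher.register(
    lambda e: e.new_state == "BROKEN",
    lambda e: notified.append(e.metric_name),
)
dispatcher.register(
    lambda e: e.old_state == "MAINTENANCE",
    lambda e: notified.append("maintenance ended"),
)

event = StateTransitionEvent(
    old_state="OK",
    new_state="BROKEN",
    input_action="break",
    metric_name="cpu",
    metadata={"metric_value": 91.0, "tags": ["example"]},
)
fired = dispatcher.dispatch(event)
```

Keeping rules separate from handlers lets the same handler (for example, a notifier) be reused under different conditions.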
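
Slides 58 and 59 combine two axes, severity (Info, Notice, Warning, Critical) and escalation (0 to 3), into four priorities (Do Never, Do Last, Do Next, Do Now). The transcript does not say how the axes map to priorities, so the quadrant rule below is purely an assumption used to show the shape of such a function; the cut-off points and the `priority` helper are hypothetical.

```python
from enum import IntEnum


class Severity(IntEnum):
    INFO = 0
    NOTICE = 1
    WARNING = 2
    CRITICAL = 3


def priority(severity: Severity, escalation: int) -> str:
    """Map (severity, escalation) to a priority label.

    Assumed rule: WARNING and above counts as severe, escalation level 2
    and above counts as escalated, and the four combinations map to the
    four labels from the slide.
    """
    if escalation not in range(4):
        raise ValueError("escalation must be 0, 1, 2 or 3")
    severe = severity >= Severity.WARNING
    escalated = escalation >= 2
    if severe and escalated:
        return "Do Now"
    if severe:
        return "Do Next"
    if escalated:
        return "Do Last"
    return "Do Never"
```

For example, under this assumed rule `priority(Severity.CRITICAL, 3)` yields `"Do Now"` and `priority(Severity.INFO, 0)` yields `"Do Never"`.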
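
Slides 61 to 63 define aggregation: a group is OK when all members are OK, DEGRADED when some are BROKEN or DEGRADED, and BROKEN when most are BROKEN. A minimal sketch of that rule follows. Reading "most" as "strictly more than half" is an assumption, as are the `aggregate` function name and the handling of an empty group; the two worked examples (2/12 and 7/12) are taken from the slides.

```python
from collections import Counter
from typing import Iterable


def aggregate(states: Iterable[str]) -> str:
    """Aggregate member states into a single group state."""
    counts = Counter(states)
    total = sum(counts.values())
    if total == 0:
        return "OK"  # assumption: an empty group is reported as OK
    if counts["BROKEN"] * 2 > total:  # "most BROKEN", read as more than half
        return "BROKEN"
    if counts["BROKEN"] or counts["DEGRADED"]:
        return "DEGRADED"
    # Any other member state is ignored by this sketch.
    return "OK"


# The same function serves both levels: metrics -> cache, caches -> cluster.
cache_state = aggregate(["OK", "BROKEN", "BROKEN"])      # metric states of one cache
cluster_2_of_12 = aggregate(["BROKEN"] * 2 + ["OK"] * 10)
cluster_7_of_12 = aggregate(["BROKEN"] * 7 + ["OK"] * 5)
```

With these inputs `cluster_2_of_12` is `"DEGRADED"` and `cluster_7_of_12` is `"BROKEN"`, matching the examples on slides 62 and 63.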
