Herding cats & catching fire: Workday's telemetry & middleware

In this talk from Sensu Summit 2018, David Beaurpere, Principal Software Engineer for the Observability Group at Workday Ltd, discusses how Sensu 1.x evolved from a Nagios replacement into the backbone of monitoring data collection and transport at Workday.

Published in: Technology

  1. Herding Cats & Catching Fire: Workday’s telemetry middleware (David Beaurpere, Sensu Summit 2018)
  2. A Classic Sensu Tale
  3. Multisite SNMP
  4. Multisite SNMP Appliances
  5. Appliances Uchiwa Multisite SNMP
  6. Appliances Fabrics Uchiwa Multisite SNMP Ephemerals Kubernetes
  7. Clients All the Metrics
  8. Clients
     ● REST API specification
       ○ Near zero uptake
     ● Prometheus
       ○ Instrumentation libraries for all major stacks
       ○ All the primitives: counters, gauges, histograms, labelling, etc.
       ○ Very active community
       ○ Enough “Rad factor” for an easier sale
       ○ ...
     ● JMX based custom protocol
       ○ Set up the collector once, then 100% “in app”
       ○ Richer decoration options
       ○ Implies in-house instrumentation utilities
       ○ Implies JVM
     ● NRPE scripts and PERFDATA
       ○ 100% backward compatible (minus some limitations)
       ○ Still tons of those scripts in the wild
       ○ More metrics means new “checks”
       ○ Limited decorability
  9. Why bother polling Prometheus via Sensu?
     ● A more flexible discovery and polling model
     ● Less networking hassle
     ● Generally happier SEC folks
     ● Bridge to non-Prometheus TSDB
     What did we get from it?
     ● Metric HTTP Endpoints and the PCL
       ○ Consistent means to empower customer teams
       ○ Major tool in our “As A Service” belt
     ● Exporters
       ○ Convenient bridge to third-party instrumentation
       ○ Node Exporter for consistent host-level metrics
       ○ SNMP Exporter, a path out of Nagios
     ● Prometheus server’s Query API
       ○ Collect summaries off local high-density data sets
       ○ Bridge islands (e.g. Kubernetes) to main TSDB
     Diagram labels: Exporter, Metric Endpoint, Query API
  10. The Plumbing
      Diagram labels: Exporter, Metric Endpoint, Query API
  11. ● Prometheus collector Plugin
        ○ A Sensu check plugin
        ○ Basically makes an HTTP GET and outputs the result to STDOUT
        ○ Multiple auth methods supported
        ○ Optional filters
        ○ Alerting switches

      # /opt/mon/plugins/wd_get_prometheus_metrics.rb --help
      Usage: /opt/mon/plugins/wd_get_prometheus_metrics.rb
          -H, --host STRING          endpoint host. Default to "localhost"
          -p, --port INT             endpoint port. Default to XXXX
          -q, --query_path STRING    endpoint path. Default to '/metrics'
          ...
          -R, --read_timeout INT     read timeout limit
          -l, --limit LIMIT          alerts if allowance is exceeded
          -W, --warn-no-data         alerts (warning) on no data
          -E, --excludes regexes     ";" separated list of regex to exclude from the scrape
          -I, --includes regexes     ";" separated list of regex to include in the scrape

      Diagram labels: Exporter, Metric Endpoint, Query API
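The scrape, filter and limit flow on this slide can be sketched in Python as below. This is only an illustration under stated assumptions: the real plugin is a Ruby script whose source is not shown, so the function names (`fetch_metrics`, `filter_metrics`, `check_status`), the order in which includes and excludes are applied, the placeholder port, and the exit-code mapping are all assumptions.

```python
import re
import urllib.request


def fetch_metrics(host="localhost", port=9100, path="/metrics", read_timeout=10):
    """HTTP GET against a metrics endpoint; returns the body as text.

    The port default is a placeholder: the plugin's real default is elided
    ("XXXX") on the slide.
    """
    url = f"http://{host}:{port}{path}"
    with urllib.request.urlopen(url, timeout=read_timeout) as resp:
        return resp.read().decode("utf-8")


def filter_metrics(body, includes="", excludes=""):
    """Keep sample lines matching any include regex and no exclude regex.

    Both arguments are ";"-separated regex lists, mirroring -I and -E.
    Comment lines (# HELP / # TYPE) and blank lines are dropped.
    """
    inc = [re.compile(p) for p in includes.split(";") if p]
    exc = [re.compile(p) for p in excludes.split(";") if p]
    kept = []
    for line in body.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        if inc and not any(p.search(line) for p in inc):
            continue
        if any(p.search(line) for p in exc):
            continue
        kept.append(line)
    return kept


def check_status(lines, limit=None, warn_no_data=False):
    """Map the scrape result to a Nagios-style exit code (assumed mapping)."""
    if limit is not None and len(lines) > limit:
        return 1  # allowance exceeded (-l)
    if warn_no_data and not lines:
        return 1  # warning on no data (-W)
    return 0


# Canned payload so the sketch runs without a live endpoint.
sample = """# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
go_gc_duration_seconds_count 17
"""
lines = filter_metrics(sample, includes="^node_", excludes="cpu")
print("\n".join(lines))  # a check plugin reports its result through STDOUT
```

Keeping the filtering separate from the HTTP call makes the include/exclude logic easy to exercise against canned payloads.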
  12. ● Prometheus collector Plugin
      ● Prometheus extension
        ○ A Sensu server handler
        ○ Enforce allowance
        ○ Decorate the payload
        ○ Maintain throughput metric
        ○ Persist the payload on the file system

      {
        "handlers": {
          "prometheus_to_file": {
            "type": "set",
            "handlers": [
              "prometheus_to_file_ext",
              "Bigpanda_ext"
            ]
          }
        },
        "prometheus_to_file_ext": {
          "input_dir": "/data/sensu/spool/wrap-prometheus-input",
          "push_metric_interval": 60,
          "allowances": {
            "default": 5000,
            "prometheus_snmp": 50000,
            "kubelet": 10000,
            ...
          }
        }
      }
      ...

      Diagram labels: Exporter, Metric Endpoint, Query API
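As a rough illustration of the "enforce allowance" step, the sketch below looks up a per-check allowance with a fallback to `default` and compares it with the number of lines in a payload. The allowance values come from the slide; keying the lookup by check name, counting non-blank lines, and rejecting the whole payload when over the allowance are assumptions, since the extension's source is not shown.

```python
# Values copied from the slide; its elided ("...") entries are omitted here.
ALLOWANCES = {"default": 5000, "prometheus_snmp": 50000, "kubelet": 10000}


def allowance_for(check_name, allowances=ALLOWANCES):
    """Return the allowance for a check, falling back to "default"."""
    return allowances.get(check_name, allowances["default"])


def enforce_allowance(check_name, payload, allowances=ALLOWANCES):
    """Return (accepted, line_count) for a scraped payload.

    Assumption: the payload is rejected as a whole when it exceeds the
    allowance; the real extension might truncate or only alert instead.
    """
    count = sum(1 for line in payload.splitlines() if line.strip())
    return count <= allowance_for(check_name, allowances), count


print(allowance_for("kubelet"), allowance_for("some_other_check"))
print(enforce_allowance("some_other_check", "a 1\nb 2\n"))
```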
  13. ● Prometheus collector Plugin
      ● Prometheus extension
      ● Wavefront Sender
        ○ Continuously watch the payload staging directory
        ○ Parse the payload & convert to Wavefront format
        ○ Handle histograms
        ○ Flush to Wavefront

      {
        "input_dir": "/data/sensu/spool/wrap-prometheus-input",
        "output_dir": "/data/sensu/spool/wrap-prometheus-processed",
        "bad_output_dir": "/data/sensu/spool/wrap-prometheus-rejected",
        "wavefront_host": "wavefront.services.wd",
        "wavefront_port": 2878,
        "wavefront_histogram": 29400,
        "histogram_timeout": 600,
        "redis_host": "localhost",
        "redis_port": 26379,
        "redis_password_file": "/etc/redis/.password"
      }

      Diagram labels: WF sender, Exporter, Metric Endpoint, Query API
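The "parse the payload & convert to Wavefront format" step might look roughly like the sketch below, which turns one line of Prometheus text exposition into a Wavefront data-format line (`<metric> <value> <timestamp> source=<source> <tags>`). It is not the sender's actual code: the function name `to_wavefront`, the regexes, and the choice to pass labels through unchanged as point tags are assumptions, and histogram handling is left out entirely.

```python
import re

# metric name, optional {labels}, value, optional timestamp in milliseconds
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?\s*$'
)
LABEL_RE = re.compile(
    r'(?P<key>[a-zA-Z_][a-zA-Z0-9_]*)="(?P<val>(?:[^"\\]|\\.)*)"'
)


def to_wavefront(line, source, now):
    """Convert one Prometheus sample line; return None for non-sample lines.

    `now` (epoch seconds) is used when the sample carries no timestamp.
    Prometheus timestamps are in milliseconds, Wavefront expects seconds here.
    """
    m = SAMPLE_RE.match(line)
    if m is None:
        return None
    ts = int(m.group("ts")) // 1000 if m.group("ts") else int(now)
    labels = m.group("labels") or ""
    tags = [f'{key}="{val}"' for key, val in LABEL_RE.findall(labels)]
    parts = [m.group("name"), m.group("value"), str(ts), f"source={source}"]
    return " ".join(parts + tags)


print(to_wavefront('node_load1 0.42', "web01", 1700000000))
print(to_wavefront(
    'http_requests_total{method="GET",code="200"} 1027 1395066363000',
    "web01", 1700000000))
```

A production sender would also need to sanitise names and tag values that Wavefront does not accept; the sketch skips that.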
  14. Histograms? A bit of a pain really.
      Diagram labels: WF sender, Exporter, Metric Endpoint, Query API
  15. From cumulative histogram of counters to standard histogram of gauges
      … which means dealing with:
      ● Parsing ( PPCL )
      ● Caching ( Redis )
      ● Bucket resize
      ● Mixed data
      ● Out of order processing
      Diagram labels: WF sender, Exporter, Metric Endpoint, Query API
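The core of that conversion can be sketched as two subtractions: first against the previous scrape (the counters only ever grow), then against the next-lower bucket (Prometheus `le` buckets are cumulative). The Python below is a simplified illustration, not the Workday implementation: a plain dict stands in for the Redis cache, the function name `interval_histogram` is hypothetical, and bucket resizes, mixed data and out-of-order payloads are deliberately not handled.

```python
import math


def interval_histogram(current, previous=None):
    """Convert cumulative bucket counters into per-bucket interval counts.

    `current` and `previous` map an upper bound (`le`) to the cumulative
    counter value seen at two consecutive scrapes. Returns a list of
    (upper_bound, count) pairs, where each count covers only that bucket
    and only the interval between the two scrapes.
    """
    previous = previous or {}
    deltas = {b: current[b] - previous.get(b, 0) for b in current}
    if any(d < 0 for d in deltas.values()):
        # A counter went backwards (e.g. the process restarted): treat the
        # current values as the whole interval.
        deltas = dict(current)
    result = []
    below = 0  # interval observations already counted in lower buckets
    for bound in sorted(deltas):
        result.append((bound, deltas[bound] - below))
        below = deltas[bound]
    return result


prev_scrape = {0.1: 5, 0.5: 8, math.inf: 10}
this_scrape = {0.1: 7, 0.5: 12, math.inf: 15}
print(interval_histogram(this_scrape, prev_scrape))
```

In this example 5 observations arrived between the scrapes: 2 at or below 0.1, 2 between 0.1 and 0.5, and 1 above 0.5.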
  16. Mass Targeted SNMP Polling (Nagios’s killing blow)
      Diagram labels: SNMP, SNMP Exporter
  17. Assembly kit for a Scalable SNMP Poller
      1. SNMP collector nodes
      Diagram labels: SNMP, SNMP Exporter, Client plugin
  18. Assembly kit for a Scalable SNMP Poller
      1. SNMP collector nodes
      2. Inventory SNMP devices as proxy clients
      Diagram labels: SNMP, SNMP Exporter, Client plugin, Machine DB API, P-Client
  19. Assembly kit for a Scalable SNMP Poller
      1. SNMP collector nodes
      2. Inventory SNMP devices as proxy clients
      3. Proxy request checks (with tokens)

      {
        "name": "core02.net.az1.eng.pdx.wd",
        "proxy": true,
        "snmp_device": true,
        "snmp_cluster_id": "AZ1",
        "snmp_module": "switches_cisco_nexus",
        ...
      }

      "checks": {
        "snmp.AZ1": {
          "command": "wd_get_prometheus_metrics.rb ...",
          …

      Diagram labels: SNMP, SNMP Exporter, Client plugin, Machine DB API, P-Client, P-checks
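Sensu 1.x check commands can embed tokens of the form `:::attribute:::` or `:::attribute|default:::`, which are replaced with client attributes before the check runs. The Python sketch below mimics that substitution for a proxy client like the one on this slide. It is an approximation written for illustration, not Sensu's own code, and the command template and the `snmp_port` attribute are hypothetical because the real command line is elided on the slide.

```python
import re

# :::key::: or :::key|default::: (dots in the key address nested attributes)
TOKEN_RE = re.compile(r":::([^:|]+?)(?:\|([^:]*?))?:::")


def substitute_tokens(command, client):
    """Replace each token with the matching client attribute or its default."""
    def lookup(match):
        key, default = match.group(1), match.group(2)
        value = client
        for part in key.split("."):
            value = value.get(part) if isinstance(value, dict) else None
            if value is None:
                break
        if value is None:
            if default is None:
                raise KeyError(f"unmatched token: {key}")
            return default
        return str(value)

    return TOKEN_RE.sub(lookup, command)


# Client attributes taken from the slide.
client = {
    "name": "core02.net.az1.eng.pdx.wd",
    "proxy": True,
    "snmp_device": True,
    "snmp_cluster_id": "AZ1",
    "snmp_module": "switches_cisco_nexus",
}
# Hypothetical command template, not the real plugin invocation.
template = ("hypothetical_poll --target :::name::: "
            "--module :::snmp_module::: --port :::snmp_port|161:::")
print(substitute_tokens(template, client))
```

Because the device-specific values live on the proxy client, a single check definition per cluster can serve every device in the inventory.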
  20. Assembly kit for a Scalable SNMP Poller
      1. SNMP collector nodes
      2. Inventory SNMP devices as proxy clients
      3. Proxy request checks (with tokens)

      {
        "name": "core02.net.az1.eng.pdx.wd",
        "proxy": true,
        "snmp_device": true,
        "snmp_cluster_id": "AZ1",
        "snmp_module": "switches_cisco_nexus",
        ...
      }

      "checks": {
        "snmp.AZ1": {
          "command": "wd_get_prometheus_metrics.rb ...",
          "proxy_requests": {
            "client_attributes": {
              "snmp_device": true,
              "snmp_cluster_id": "AZ1"
            },
            "splay": true
          },
          …

      Diagram labels: SNMP, SNMP Exporter, Client plugin, Machine DB API, P-Client, P-checks
  21. Assembly kit for a Scalable SNMP Poller
      1. SNMP collector nodes
      2. Inventory SNMP devices as proxy clients
      3. Proxy request checks (with tokens)
      4. Round-robin client subscriptions

      {
        "name": "core02.net.az1.eng.pdx.wd",
        "proxy": true,
        "snmp_device": true,
        "snmp_cluster_id": "AZ1",
        "snmp_module": "switches_cisco_nexus",
        ...
      }

      "checks": {
        "snmp.AZ1": {
          "command": "wd_get_prometheus_metrics.rb ...",
          "proxy_requests": {
            "client_attributes": {
              "snmp_device": true,
              "snmp_cluster_id": "AZ1"
            },
            "splay": true
          },
          "subscribers": [
            "roundrobin:snmp.collector.az1"
          ],
          …

      Diagram labels: SNMP, SNMP Exporter, Client plugin, Machine DB API, P-Client, P-checks
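How the proxy request and the round-robin subscription combine can be approximated in a few lines of Python: select the proxy clients whose attributes match `client_attributes`, then hand each resulting check request to one collector in turn. This is a simplification for illustration only. In Sensu the round-robin effect comes from collectors consuming a shared queue, so the real distribution depends on which collector is free; apart from the first device, the client names and collector names below are hypothetical.

```python
from itertools import cycle


def matching_proxy_clients(clients, client_attributes):
    """Select clients whose attributes equal every requested attribute."""
    return [
        c for c in clients
        if all(c.get(k) == v for k, v in client_attributes.items())
    ]


def assign_round_robin(targets, collectors):
    """Pair each target with the next collector in turn (idealised)."""
    return [(t["name"], col) for t, col in zip(targets, cycle(collectors))]


inventory = [
    {"name": "core02.net.az1.eng.pdx.wd", "proxy": True,
     "snmp_device": True, "snmp_cluster_id": "AZ1"},
    {"name": "hypothetical-switch-b", "proxy": True,
     "snmp_device": True, "snmp_cluster_id": "AZ1"},
    {"name": "hypothetical-switch-c", "proxy": True,
     "snmp_device": True, "snmp_cluster_id": "AZ2"},
]
targets = matching_proxy_clients(
    inventory, {"snmp_device": True, "snmp_cluster_id": "AZ1"})
plan = assign_round_robin(targets, ["collector-1", "collector-2"])
for device, collector in plan:
    print(f"{collector} polls {device}")
```

Adding a collector node to the subscription spreads the same device list over more pollers without touching the check definition, which is what makes the kit scale.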
  23. Back to the big picture...
  24. Appliances Fabrics Uchiwa SNMP Ephemerals Kubernetes
  25. Meanwhile at Sensu Inc ...
  26. In short, Sensu@Workday was the answer to two major conundrums:
      ● Enabling monitoring for a heterogeneous and constantly evolving ecosystem
      ● Providing a noninvasive and low-maintenance metric pipeline
  27. Questions?
  28. Thank you