observability pre-release: using prometheus to test and fix new software

The pillars of observability have long been accepted as key components of any microservice in production. But what about new products and features that have yet to be released? Properly instrumenting and leveraging metrics at this stage is perhaps even more crucial: before a product ships, identifying and addressing bugs early is critical.

See how the team at DigitalOcean leveraged Prometheus to instrument and test features within their software-defined networking pillar. This session highlights instrumentation, key visualizations, and takeaways from their experience. You’ll also hear about areas for improvement and find out how to apply these lessons to your own releases.

observability pre-release: using prometheus to test and fix new software

  1. about me: @snehainguva, software engineer @DigitalOcean, currently network services
  2. who are we? cloud-hosting company; 12 data centers; compute, storage, k8saas, and more! ☁
  3. scale: 300+ microservices; 1.15MM droplets (VMs); 1.8MM metrics samples/second
  4. a time of change
  5. the past: services run on virtual machines; deploy using chef, bash scripts, etc.; blackbox monitoring: icinga, nagios
  6. the present: services run in containers on k8s; deploy using docc CLI tool; metrics, tracing, and logging
  7. the result: observability into production
  8. Is our service up/down? Can we trace this request error through our services? Should we alert due to an abnormal metric?
  9. what about new services?
  10. observability pre-release: using prometheus to test and fix new software
  11. the plan: ✔ observability at DigitalOcean / metrics to measure / what is prometheus? / build. instrument. test. iterate. / examples
  12. pillars of observability: metrics, logging, tracing
  13. metrics: time-series of sampled data
  14. metrics: apply functions for analysis; look at aggregate trends; configure alerts
  15. tracing: propagating metadata through different requests, threads, and processes
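
      To make the idea concrete, here is a minimal sketch of context propagation with the OpenTracing Go client (illustrative only; this is not code from the talk, and the operation name, tag, and downstream call are made up):

      package sketch

      import (
          "context"

          opentracing "github.com/opentracing/opentracing-go"
      )

      // doDownstreamCall stands in for any call that continues the trace.
      func doDownstreamCall(ctx context.Context) {}

      func handleRequest(ctx context.Context) {
          // Start a span as a child of whatever span is already in ctx;
          // the returned ctx carries the new span to downstream calls.
          span, ctx := opentracing.StartSpanFromContext(ctx, "handle_request")
          defer span.Finish()

          span.SetTag("droplet.id", 42)
          doDownstreamCall(ctx) // child spans started from ctx join the same trace
      }
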
  16. tracing: analyze causality; watch a single request traverse multiple services; build an ecosystem map
  17. logging: record of discrete events over time
  18. logging: harder to store; can locate needles in the haystack
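
      The slides do not name a logging library; as one possible sketch, a structured logger such as uber-go/zap makes those discrete events searchable later (the message and field names here are invented):

      package main

      import "go.uber.org/zap"

      func main() {
          logger, err := zap.NewProduction() // JSON logs, easy to ship to centralized logging
          if err != nil {
              panic(err)
          }
          defer logger.Sync()

          // Structured fields are what let you find the needle in the haystack.
          logger.Info("chassis converge failed",
              zap.String("chassis_id", "chassis-123"),
              zap.Int("attempt", 3),
          )
      }
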
  19. the plan: ✔ observability at DigitalOcean / ✔ metrics to measure / what is prometheus? / build. instrument. test. iterate. / examples
  20. metrics: what do we measure?
  21. four golden signals
  22. latency: time to service a request; traffic: requests/second; error: error rate of requests; saturation: fullness of a service
  23. Utilization, Saturation, Error rate
  24. utilization: avg time resource busy servicing work; saturation: degree to which resource has extra work it can’t service; error rate: rate of error events
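
      As one small sketch of USE-style instrumentation (not from the talk; the queue and metric name are hypothetical), a gauge can report how saturated a bounded work queue is:

      package pool

      import "github.com/prometheus/client_golang/prometheus"

      type request struct{} // placeholder work item

      // A bounded queue: 0.0 means idle, 1.0 means full and extra work will wait.
      var workC = make(chan request, 128)

      var queueSaturation = prometheus.NewGaugeFunc(
          prometheus.GaugeOpts{
              Name: "worker_queue_saturation_ratio",
              Help: "Fraction of the work queue currently occupied.",
          },
          func() float64 { return float64(len(workC)) / float64(cap(workC)) },
      )

      func init() { prometheus.MustRegister(queueSaturation) }
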
  25. “USE metrics often allow users to solve 80% of server issues with 5% of the effort.”
  26. the plan: ✔ observability at DigitalOcean / ✔ metrics to measure / ✔ what is prometheus? / build. instrument. test. iterate. / examples
  27. [image slide]
  28. pull-based metrics solution written in go; has an alerting solution, alertmanager
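
      Because Prometheus pulls metrics, a service only needs to expose an HTTP endpoint for the server to scrape. A minimal sketch with the standard Go client library (the port is arbitrary):

      package main

      import (
          "log"
          "net/http"

          "github.com/prometheus/client_golang/prometheus/promhttp"
      )

      func main() {
          // Prometheus scrapes this endpoint on its own schedule; the service just serves it.
          http.Handle("/metrics", promhttp.Handler())
          log.Fatal(http.ListenAndServe(":9100", nil))
      }
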
  29. 85+ instances of prometheus, used for datacenter-wide hypervisor and service monitoring; 1.8+ million samples/sec
  30. 1. easy to deploy 2. flexible and extensible 3. labelling 4. service discovery 5. strong query language
  31. 4 base metrics: counters, gauges, histograms, summaries
  32. error/request rate → counters + rate(); utilization → gauges; latency → histograms; latency → summaries
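
      To make that mapping concrete, here is a small sketch declaring the four base types with the Go client (the metric names are illustrative, not the talk's actual metrics); the comments note the PromQL that turns each into a signal:

      package metrics

      import "github.com/prometheus/client_golang/prometheus"

      var (
          // Counter: only increases; rate(requests_total[5m]) gives traffic and,
          // combined with an errors counter, the error rate.
          requestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
              Name: "requests_total",
              Help: "Total requests handled.",
          })

          // Gauge: goes up and down; suited to utilization-style measurements.
          inflightRequests = prometheus.NewGauge(prometheus.GaugeOpts{
              Name: "inflight_requests",
              Help: "Requests currently being served.",
          })

          // Histogram: buckets observations so the server can aggregate latency
          // across instances, e.g. histogram_quantile(0.99, ...).
          requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
              Name:    "request_duration_seconds",
              Help:    "Time taken to serve a request.",
              Buckets: prometheus.DefBuckets,
          })

          // Summary: quantiles computed in the client; cheaper to query but not
          // aggregatable across instances the way histograms are.
          requestDurationSummary = prometheus.NewSummary(prometheus.SummaryOpts{
              Name:       "request_duration_summary_seconds",
              Help:       "Time taken to serve a request.",
              Objectives: map[float64]float64{0.5: 0.05, 0.99: 0.001},
          })
      )

      func init() {
          prometheus.MustRegister(requestsTotal, inflightRequests, requestDuration, requestDurationSummary)
      }
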
  33. the plan: ✔ observability at DigitalOcean / ✔ metrics to measure / ✔ what is prometheus? / ✔ build. instrument. test. iterate. / examples
  34. build: design the service; write it in go; use internally shared libraries
  35. build: doge/dorpc - shared rpc library

      var DefaultInterceptors = []string{
          StdLoggingInterceptor,
          StdMetricsInterceptor,
          StdTracingInterceptor,
      }

      func NewServer(opt ...ServerOpt) (*Server, error) {
          opts := serverOpts{
              name:             "server",
              clientTLSAuth:    tls.VerifyClientCertIfGiven,
              intercept:        interceptor.NewPathInterceptor(interceptor.DefaultInterceptors...),
              keepAliveParams:  DefaultServerKeepAlive,
              keepAliveEnforce: DefaultServerKeepAliveEnforcement,
          }
          …
      }

  36. instrument: send logs to centralized logging; send spans to trace-collectors; set up prometheus metrics
  37. metrics instrumentation: go-client

      func (s *server) initializeMetrics() {
          s.metrics = metricsConfig{
              attemptedConvergeChassis: s.metricsNode.Gauge("attempted_converge_chassis",
                  "number of chassis converger attempting to converge"),
              ...
          }
      }

      func (s *server) ConvergeAllChassis(...) {
          ...
          s.metrics.failedConvergeChassis(float64(len(failed)))
          ...
      }

  38. Quick Q & A: Collector interface

      // A collector must be registered.
      prometheus.MustRegister(collector)

      type Collector interface {
          // Describe sends descriptors to the channel.
          Describe(chan<- *Desc)
          // Collect is used by the prometheus registry on a scrape.
          // Metrics are sent to the provided channel.
          Collect(chan<- Metric)
      }

  39. metrics instrumentation: third-party exporters

      Built using the collector interface. Sometimes we build our own; often we use others:
      github.com/prometheus/mysqld_exporter
      github.com/kbudde/rabbitmq_exporter
      github.com/prometheus/node_exporter
      github.com/digitalocean/openvswitch_exporter

  40. metrics instrumentation: in-service collectors

      type RateMapCollector struct {
          RequestRate *prometheus.Desc
          rm          *RateMap
          buckets     []float64
      }

      var _ prometheus.Collector = &RateMapCollector{}

      func (r *RateMapCollector) Describe(ch chan<- *prometheus.Desc) {
          ds := []*prometheus.Desc{r.RequestRate}
          for _, d := range ds {
              ch <- d
          }
      }

      func (r *RateMapCollector) Collect(ch chan<- prometheus.Metric) {
          ...
          ch <- prometheus.MustNewConstHistogram(
              r.RequestRate, count, sum, rateCount)
      }

  41. metrics instrumentation: dashboard #1 (state metrics)
  42. metrics instrumentation: dashboard #2 (request rate, request latency)
  43. metrics instrumentation: dashboard #3 (utilization metrics)
  44. metrics instrumentation: dashboard #4 (queries/second, utilization)
  45. metrics instrumentation: dashboards #5 and #6 (saturation metrics)
  46. test: load testing: grpc-clients and goroutines; chaos testing: take down a component of a system; integration testing: how does this feature integrate with the cloud?
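
      A load tester in that style can be a handful of goroutines sharing a generated gRPC client; this sketch is illustrative only (pb.NewExampleClient, its Ping method, and the address are placeholders, not the team's actual service):

      package main

      import (
          "context"
          "log"
          "sync"
          "time"

          "google.golang.org/grpc"

          pb "example.com/internal/example/pb" // placeholder generated package
      )

      func main() {
          conn, err := grpc.Dial("localhost:50051", grpc.WithInsecure())
          if err != nil {
              log.Fatal(err)
          }
          defer conn.Close()
          client := pb.NewExampleClient(conn)

          const workers, requestsPerWorker = 50, 1000
          var wg sync.WaitGroup
          wg.Add(workers)
          for i := 0; i < workers; i++ {
              go func() {
                  defer wg.Done()
                  for j := 0; j < requestsPerWorker; j++ {
                      ctx, cancel := context.WithTimeout(context.Background(), time.Second)
                      if _, err := client.Ping(ctx, &pb.PingRequest{}); err != nil {
                          log.Printf("request failed: %v", err) // failures also show up in the service's own metrics
                      }
                      cancel()
                  }
              }()
          }
          wg.Wait()
      }
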
  47. testing: identify key issues: how is our latency? is there a goroutine leak? does resource usage increase with traffic? is there a high error rate? how are our third-party services? → use tracing to dig down; use a worker pool; use cpu and memory profiling; check logs for type of error
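
      One cheap way to answer the goroutine-leak and resource questions during a load test is Go's built-in profiler; importing net/http/pprof (a sketch, not necessarily how the team wired it up) exposes goroutine counts and CPU/heap profiles over HTTP:

      package main

      import (
          "log"
          "net/http"
          _ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
      )

      func main() {
          // While load testing, watch /debug/pprof/goroutine for leaks and pull
          // /debug/pprof/profile and /debug/pprof/heap for CPU and memory usage.
          log.Fatal(http.ListenAndServe(":6060", nil))
      }
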
  48. testing: tune metrics: are these metrics actually useful? should we collect more data? do we need more labels for our metrics?
  49. testing: tune alerts: state-based alerting (Is our service up or down?); threshold alerting (When does our service fail?)
  50. testing: documentation: set up operational playbooks; document recovery efforts
  51. iterate! (but really, let’s look at some examples…)
  52. the plan: ✔ observability at DigitalOcean / ✔ metrics to measure / ✔ what is prometheus? / ✔ build. instrument. test. iterate. / ✔ examples
  53. product #1: DHCP (hvaddrd)
  54. 54. digitalocean.com product #1: DHCP hvflowd hvaddrd OvS br0 RNS OpenFlow SetParameters addr0 bolt DHCPv4 NDP gRPC DHCPv6 tapX dropletX hvaddrd traffic AddFlows Hypervisor main
  55. DHCP: load testing
  56. DHCP: load testing
  57. DHCP: custom conn collector

      package dhcp4conn

      var _ prometheus.Collector = &collector{}

      type collector struct {
          ReadBytesTotal    *prometheus.Desc
          ReadPacketsTotal  *prometheus.Desc
          WriteBytesTotal   *prometheus.Desc
          WritePacketsTotal *prometheus.Desc
      }

      Implements the net.Conn interface and allows us to process Ethernet frames for validation and other purposes.

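      The slide shows only the descriptors; a Collect in the same style (a sketch, not the actual dhcp4conn code; the wrapped connection and its atomic counters are assumed) would snapshot the connection's counters on every scrape:

      package sketch

      import (
          "sync/atomic"

          "github.com/prometheus/client_golang/prometheus"
      )

      // statsConn stands in for the wrapped net.Conn, which is assumed to keep
      // atomic counters of bytes read and written.
      type statsConn struct {
          readBytes, writeBytes uint64
      }

      type connCollector struct {
          conn            *statsConn
          ReadBytesTotal  *prometheus.Desc
          WriteBytesTotal *prometheus.Desc
      }

      var _ prometheus.Collector = &connCollector{}

      func newConnCollector(c *statsConn) *connCollector {
          return &connCollector{
              conn: c,
              ReadBytesTotal: prometheus.NewDesc("dhcp4conn_read_bytes_total",
                  "Bytes read from the connection.", nil, nil),
              WriteBytesTotal: prometheus.NewDesc("dhcp4conn_write_bytes_total",
                  "Bytes written to the connection.", nil, nil),
          }
      }

      func (c *connCollector) Describe(ch chan<- *prometheus.Desc) {
          ch <- c.ReadBytesTotal
          ch <- c.WriteBytesTotal
      }

      func (c *connCollector) Collect(ch chan<- prometheus.Metric) {
          // Const metrics are built fresh on each scrape from the current counter values.
          ch <- prometheus.MustNewConstMetric(c.ReadBytesTotal, prometheus.CounterValue,
              float64(atomic.LoadUint64(&c.conn.readBytes)))
          ch <- prometheus.MustNewConstMetric(c.WriteBytesTotal, prometheus.CounterValue,
              float64(atomic.LoadUint64(&c.conn.writeBytes)))
      }
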
  58. DHCP: custom conn collector
  59. DHCP: goroutine worker pools

      workC := make(chan request, Workers)

      workWG.Add(Workers)
      for i := 0; i < Workers; i++ {
          go func() {
              defer workWG.Done()
              for r := range workC {
                  s.serve(r.buf, r.from)
              }
          }()
      }

      Uses a buffered channel to process requests, limiting goroutines and resource usage.

  60. DHCP: rate limiter collector

      type RateMap struct {
          mu sync.Mutex
          ...
          rateMap map[string]*rate
      }

      type RateMapCollector struct {
          RequestRate *prometheus.Desc
          rm          *RateMap
          buckets     []float64
      }

      func (r *RateMapCollector) Collect(ch chan<- prometheus.Metric) {
          …
          ch <- prometheus.MustNewConstHistogram(
              r.RequestRate, count, sum, rateCount)
      }

      The ratemap calculates an exponentially weighted moving average on a per-client basis and limits requests; the collector gives us a snapshot of rate distributions.

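      As a rough illustration of the per-client rate tracking (not the actual ratemap implementation; the smoothing factor is an assumed value), an exponentially weighted moving average can be updated in constant space on each request:

      package sketch

      import "time"

      const alpha = 0.3 // smoothing factor: higher forgets old traffic faster (illustrative)

      // ewmaRate keeps a constant-space estimate of one client's request rate.
      type ewmaRate struct {
          rate float64   // smoothed requests per second
          last time.Time // time of the previous request
      }

      // observe records one request at time now and returns the updated estimate.
      func (e *ewmaRate) observe(now time.Time) float64 {
          if !e.last.IsZero() {
              if dt := now.Sub(e.last).Seconds(); dt > 0 {
                  instant := 1.0 / dt // rate implied by the gap since the last request
                  e.rate = alpha*instant + (1-alpha)*e.rate
              }
          }
          e.last = now
          return e.rate
      }
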
  61. 61. DHCP: rate alerts digitalocean.com Rate Limiter Centralized LoggingCentralized LoggingCentralized LoggingCentralized Logging Elastalert emits log line
  62. DHCP: the final result
  63. DHCP: conclusion: 1. Load tested our service. 2. Used metrics to identify resource spikes. 3. Added additional metrics, built a rate limiter, and set up alerts.
  64. product #2: VPC
  65. product #2: VPC
  66. VPC: load-testing: load tester repeatedly makes some RPC calls
  67. VPC: latency issues: as load testing continued, we started to notice latency in different rpc calls
  68. VPC: latency issues: use tracing to take a look at the /SyncInitialChassis call
  69. VPC: latency issues: Note that spans for some traces were being dropped. Slowing down the load tester, however, eventually ameliorated that problem.
  70. VPC: latency issues: “The fix was to be smarter and do the queries more efficiently. The repetitive loop of queries to rnsdb really stood out in the lightstep data.” - Bob Salmi
  71. VPC: remove component: can the queue be replaced with a simple request-response system?
  72. VPC: chaos testing: induce south service failure and see how rabbit responds; drop primary and recover from secondary; induce northd failure and ensure failover works
  73. VPC: add alerts: state-based alert
  74. VPC: add alerts: threshold alert
  75. VPC: conclusion: 1. Load and chaos tested our service. 2. Used metrics and tracing to identify latency issues. 3. Configured useful alerts.
  76. conclusion
  77. 1. Measure the 4 golden signals and USE metrics. 2. Instrument services as early as possible. 3. Combine with profiling, logging, and tracing.
  78. thanks! @snehainguva
