
Observability and Product Release

Leveraging Prometheus to Build, Test, and Release New Products at DigitalOcean.



  1. @snehainguva
  2. observability and product release: leveraging prometheus to build and test new products
  3. about me: software engineer @DigitalOcean, currently network services, <3 cats
  4. some stats
  5. 90M+ time series, 85 instances of Prometheus, 1.7M+ samples/sec
  6. the history
  7. ye olden days: Nagios + various plugins for monitoring, collectd + StatsD + Graphite, OpenTSDB
  8. lovely Prometheus: white-box monitoring, a multi-dimensional data model, a fantastic query language
  9. glorious Kubernetes: easily deploy and update services, scalability, combine with Prometheus + Alertmanager
  10. sneha joins networking: set up monitoring for VPC, working on DHCP; how can we use Prometheus even before release?
  11. the plan: ✔ observability at DigitalOcean; build --- instrument --- test --- iterate; examples
  12. metrics: time series of sampled data; tracing: propagating metadata through different requests, threads, and processes; logging: a record of discrete events over time
  13. metrics: what do we measure?
  14. four golden signals
  15. latency: time to service a request; traffic: requests/second; error: error rate of requests; saturation: fullness of a service
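A minimal sketch of how these four signals might be wired into a Go service with prometheus/client_golang; the metric names, buckets, and middleware shape are illustrative assumptions, not code from the talk:

```go
// Package signals sketches instrumentation for the four golden signals.
package signals

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// latency: time to service a request
	requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "myservice_request_duration_seconds",
		Help:    "Time to service a request.",
		Buckets: prometheus.DefBuckets,
	})
	// traffic: requests/second (derived with rate() over this counter)
	requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "myservice_requests_total",
		Help: "Total requests served.",
	})
	// error: error rate of requests
	requestErrors = promauto.NewCounter(prometheus.CounterOpts{
		Name: "myservice_request_errors_total",
		Help: "Total requests that failed.",
	})
	// saturation: fullness of the service, approximated here by in-flight requests
	inFlight = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "myservice_in_flight_requests",
		Help: "Requests currently being handled.",
	})
)

// instrument wraps an HTTP handler and records the four signals around it.
func instrument(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		inFlight.Inc()
		requestsTotal.Inc()
		defer func() {
			inFlight.Dec()
			requestDuration.Observe(time.Since(start).Seconds())
		}()
		// A real handler would also increment requestErrors based on the
		// response status or returned error.
		next(w, r)
	}
}
```

Traffic and error rate then fall out of PromQL rate() queries over the two counters.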
  16. Utilization, Saturation, Error rate (USE)
  17. “USE metrics often allow users to solve 80% of server issues with 5% of the effort.”
  18. the plan: ✔ observability at DigitalOcean; ✔ build --- instrument --- test --- iterate; examples
  19. build: design the service, write it in Go, use internally shared libraries
  20. build: doge/dorpc - shared RPC library

      var DefaultInterceptors = []string{
          StdLoggingInterceptor,
          StdMetricsInterceptor,
          StdTracingInterceptor,
      }

      func NewServer(opt ...ServerOpt) (*Server, error) {
          opts := serverOpts{
              name:             "server",
              clientTLSAuth:    tls.VerifyClientCertIfGiven,
              intercept:        interceptor.NewPathInterceptor(interceptor.DefaultInterceptors...),
              keepAliveParams:  DefaultServerKeepAlive,
              keepAliveEnforce: DefaultServerKeepAliveEnforcement,
          }
          …
      }
  21. instrument: send logs to centralized logging, send spans to trace collectors, set up Prometheus metrics
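For the last step, a minimal sketch of exposing the default registry so Prometheus can scrape the service; the port and path are illustrative choices:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// promhttp.Handler serves everything registered with the default
	// prometheus registry (including the Go and process collectors).
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```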
  22. metrics instrumentation: go-client

      func (s *server) initializeMetrics() {
          s.metrics = metricsConfig{
              attemptedConvergeChassis: s.metricsNode.Gauge("attempted_converge_chassis",
                  "number of chassis converger attempting to converge"),
              failedConvergeChassis: s.metricsNode.Gauge("failed_converge_chassis",
                  "number of chassis that failed to converge"),
          }
      }

      func (s *server) ConvergeAllChassis(...) {
          ...
          s.metrics.attemptedConvergeChassis(float64(len(attempted)))
          s.metrics.failedConvergeChassis(float64(len(failed)))
          ...
      }
  23. 23. digitalocean.com Quick Q & A: Collector Interface // A collector must be registered. prometheus.MustRegister(collector) type Collector interface { // Describe sends descriptors to channel. Describe(chan<- *Desc) // Collect is used by the prometheus registry on a scrape. // Metrics are sent to the provided channel. Collect(chan<- Metric) }
  24. 24. digitalocean.com metrics instrumentation: third-party exporters Built using the collector interface Sometimes we build our own Often we use others: github.com/prometheus/mysqld_exporter github.com/kbudde/rabbitmq_exporter github.com/prometheus/node_exporter github.com/digitalocean/openvswitch_exporter
  25. 25. digitalocean.com metrics instrumentation: in-service collectors type RateMap struct { mu sync.Mutex ... rateMap map[string]*rate } var _ prometheus.Collector = &RateMapCollector{} func (r *RateMapCollector) Describe(ch chan<- *prometheus.Desc) { ds := []*prometheus.Desc{ r.RequestRate} for _, d := range ds { ch <- d } } func (r *RateMapCollector) Collect(ch chan<- prometheus.Metric) { ... ch <- prometheus.MustNewConstHistogram( r.RequestRate, count, sum, rateCount) }
  26. metrics instrumentation: dashboard #1 - state metrics
  27. metrics instrumentation: dashboard #2 - request rate, request latency
  28. metrics instrumentation: dashboard #3 - utilization metrics
  29. metrics instrumentation: dashboard #4 - queries/second, utilization
  30. metrics instrumentation: dashboards #5 and #6 - saturation metric
  31. test: load testing (gRPC clients and goroutines); chaos testing (take down a component of a system); integration testing (how does this feature integrate with the cloud?)
  32. testing: identify key issues. Questions: how is our latency? is there a goroutine leak? does resource usage increase with traffic? is there a high error rate? how are our third-party services? Approaches: use tracing to dig down, use CPU and memory profiling, use a worker pool, check logs for types of errors
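For the goroutine-leak and CPU/memory questions above, the usual Go approach is to expose net/http/pprof alongside the service; a small sketch, with an illustrative listen address:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Check for goroutine leaks during a load test:
	//   go tool pprof http://localhost:6060/debug/pprof/goroutine
	// Capture a 30-second CPU profile:
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```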
  33. testing: tune metrics + alerts. Do we need more labels for our metrics? Should we collect more data? State-based alerting: is our service up or down? Threshold alerting: when does our service fail?
  34. testing: documentation - set up operational playbooks and document recovery efforts
  35. iterate! (but really, let’s look at some examples…)
  36. the plan: ✔ observability at DigitalOcean; ✔ build --- instrument --- test --- iterate; ✔ examples
  37. product #1: DHCP (hvaddrd)
  38. product #1: DHCP [architecture diagram on the hypervisor: hvaddrd and hvflowd, OvS br0, RNS, OpenFlow, SetParameters, addr0, bolt, DHCPv4/DHCPv6/NDP, gRPC, AddFlows, traffic over tapX to dropletX]
  39. DHCP: load testing
  40. DHCP: load testing (2)
  41. DHCP: custom conn collector - implements the net.Conn interface and allows us to process Ethernet frames for validation and other purposes

      package dhcp4conn

      var _ prometheus.Collector = &collector{}

      // A collector gathers connection metrics.
      type collector struct {
          ReadBytesTotal    *prometheus.Desc
          ReadPacketsTotal  *prometheus.Desc
          WriteBytesTotal   *prometheus.Desc
          WritePacketsTotal *prometheus.Desc
      }
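The slide shows only the descriptors; a hedged sketch of how such a collector might satisfy Describe/Collect from counters maintained by the wrapped connection. The counter bookkeeping below is an assumption for illustration, not the real dhcp4conn code:

```go
package dhcp4conn

import (
	"sync/atomic"

	"github.com/prometheus/client_golang/prometheus"
)

// collector exports byte/packet counters kept by the wrapped net.Conn.
type collector struct {
	ReadBytesTotal    *prometheus.Desc
	ReadPacketsTotal  *prometheus.Desc
	WriteBytesTotal   *prometheus.Desc
	WritePacketsTotal *prometheus.Desc

	// Hypothetical counters, incremented on every read/write of the conn.
	readBytes, readPackets, writeBytes, writePackets uint64
}

var _ prometheus.Collector = &collector{}

func (c *collector) Describe(ch chan<- *prometheus.Desc) {
	for _, d := range []*prometheus.Desc{
		c.ReadBytesTotal, c.ReadPacketsTotal,
		c.WriteBytesTotal, c.WritePacketsTotal,
	} {
		ch <- d
	}
}

func (c *collector) Collect(ch chan<- prometheus.Metric) {
	// Snapshot the counters at scrape time as const metrics.
	ch <- prometheus.MustNewConstMetric(c.ReadBytesTotal,
		prometheus.CounterValue, float64(atomic.LoadUint64(&c.readBytes)))
	ch <- prometheus.MustNewConstMetric(c.ReadPacketsTotal,
		prometheus.CounterValue, float64(atomic.LoadUint64(&c.readPackets)))
	ch <- prometheus.MustNewConstMetric(c.WriteBytesTotal,
		prometheus.CounterValue, float64(atomic.LoadUint64(&c.writeBytes)))
	ch <- prometheus.MustNewConstMetric(c.WritePacketsTotal,
		prometheus.CounterValue, float64(atomic.LoadUint64(&c.writePackets)))
}
```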
  42. DHCP: custom conn collector
  43. DHCP: goroutine worker pools - uses a buffered channel to process requests, limiting goroutines and resource usage

      // workWG and its Add are assumed setup, shown so the Done calls balance.
      var workWG sync.WaitGroup
      workWG.Add(Workers)

      workC := make(chan request, Workers)
      for i := 0; i < Workers; i++ {
          go func() {
              defer workWG.Done()
              for r := range workC {
                  s.serve(r.buf, r.from)
              }
          }()
      }
  44. DHCP: rate limiter collector - the RateMap calculates an exponentially weighted moving average on a per-client basis and limits requests; the collector gives us a snapshot of rate distributions

      type RateMap struct {
          mu      sync.Mutex
          ...
          rateMap map[string]*rate
      }

      type RateMapCollector struct {
          RequestRate *prometheus.Desc
          rm          *RateMap
          buckets     []float64
      }

      func (r *RateMapCollector) Collect(ch chan<- prometheus.Metric) {
          …
          ch <- prometheus.MustNewConstHistogram(
              r.RequestRate, count, sum, rateCount)
      }
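A hedged sketch of the per-client EWMA rate limiting described on this slide; the smoothing factor, limit, and method names are illustrative assumptions rather than the real RateMap implementation:

```go
package ratemap

import (
	"sync"
	"time"
)

type rate struct {
	ewma     float64 // requests/second, exponentially weighted
	lastSeen time.Time
}

// RateMap tracks a per-client EWMA request rate and enforces a limit.
type RateMap struct {
	mu    sync.Mutex
	alpha float64          // smoothing factor in (0, 1]
	limit float64          // allowed requests/second per client
	rates map[string]*rate // keyed by client (e.g. MAC or IP)
}

func New(alpha, limit float64) *RateMap {
	return &RateMap{alpha: alpha, limit: limit, rates: make(map[string]*rate)}
}

// Allow records one request for the client and reports whether the client's
// smoothed rate is still within the configured limit.
func (m *RateMap) Allow(client string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()

	now := time.Now()
	r, ok := m.rates[client]
	if !ok {
		m.rates[client] = &rate{lastSeen: now}
		return true
	}

	// Instantaneous rate since this client's previous request.
	dt := now.Sub(r.lastSeen).Seconds()
	if dt <= 0 {
		dt = 1e-9
	}
	inst := 1 / dt

	// EWMA update: new = alpha*instantaneous + (1-alpha)*old.
	r.ewma = m.alpha*inst + (1-m.alpha)*r.ewma
	r.lastSeen = now

	return r.ewma <= m.limit
}
```

The collector on the slide then turns a snapshot of these per-client rates into a const histogram at scrape time.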
  45. DHCP: rate alerts [diagram: the rate limiter emits a log line to centralized logging, and Elastalert alerts on it]
  46. DHCP: the final result
  47. product #2: VPC
  48. product #2: VPC
  49. VPC: load testing - the load tester repeatedly makes some RPC calls
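A hedged sketch of such a load tester: a pool of goroutines making RPCs over a shared gRPC connection until a deadline. The address, worker count, pacing, and the callOnce placeholder (and any pb.* client it mentions) are illustrative assumptions, not the real service API:

```go
package main

import (
	"context"
	"log"
	"sync"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

const (
	workers  = 50
	testTime = 5 * time.Minute
	pause    = 10 * time.Millisecond // pacing between calls per worker
)

func main() {
	conn, err := grpc.Dial("vpc-service.internal:8080",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), testTime)
	defer cancel()

	var wg sync.WaitGroup
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer wg.Done()
			for ctx.Err() == nil {
				callOnce(ctx, conn)
				time.Sleep(pause)
			}
		}()
	}
	wg.Wait()
}

// callOnce would invoke the RPC under test with a generated client, e.g.
// pb.NewChassisClient(conn).SyncInitialChassis(ctx, &pb.SyncRequest{}),
// and record latency and errors; both names here are hypothetical.
func callOnce(ctx context.Context, conn *grpc.ClientConn) {
	_ = ctx
	_ = conn
}
```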
  50. VPC: latency issues (1) - as load testing continued, we started to notice latency in different RPC calls
  51. VPC: latency issues (2) - use tracing to take a look at the /SyncInitialChassis call
  52. VPC: latency issues (3) - note that spans for some traces were being dropped; slowing down the load tester, however, eventually ameliorated that problem
  53. VPC: latency issues (4) - “The fix was to be smarter and do the queries more efficiently. The repetitive loop of queries to rnsdb really stood out in the lightstep data.” - Bob Salmi
  54. VPC: remove a component - can the queue be replaced with a simple request-response system? source: https://programmingisterrible.com/post/162346490883/how-do-you-cut-a-monolith-in-half
  55. VPC: chaos testing - induce a south service failure and see how rabbit responds; drop the primary and recover from the secondary; induce a northd failure and ensure failover works
  56. VPC: add alerts (1) - state-based alerts
  57. VPC: add alerts (2) - threshold alert
  58. conclusion
  59. what? four golden signals and USE metrics; when? as early as possible; how? combine with profiling, logging, and tracing
  60. thanks! @snehainguva
