
Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | Tim Hall | InfluxData


In this session, Tim will cover principles, learnings, and practical advice from operating multiple cloud services at scale, including of course our InfluxDB Cloud service. What do we monitor, what do we alert on, and how did we architect it all? What are our underlying architectural and operational principles?

Published in: Technology

  1. 1. Lessons Learned: Running InfluxDB Cloud at Scale. Tim E. Hall (@thallinflux), VP, Products, InfluxData
  2. 2. Discussion Topics • Brief History of InfluxDB Cloud • Gathering Metrics...and Logs • Visualization, Monitoring, and Alerting • Troubleshooting Scenarios • What did we miss? So many things…
  3. 3. A Brief History of InfluxDB Cloud 1.0…
     May 2014: • Open Source DBaaS • Hosted on DigitalOcean
     April 2016: • Enterprise Edition DBaaS • Kapacitor Add-On • Hosted on AWS
     August 2017: • Enterprise Edition DBaaS • Chronograf and limited Kapacitor included • Co-monitoring • Pay-as-you-go storage
  4. 4. From development to production • Establish monitoring baselines • Ensure visibility into health of the system • Notifications for most common issues, before they become outages
  5. 5. From OSS to Enterprise • InfluxDB OSS • InfluxDB Enterprise: Meta 1, Meta 2, Meta 3; Data Node 1, Data Node 2
  6. 6. InfluxDB Cloud 1: Deployment Diagram (diagram labels)
     • AWS Account (separate accounts for Development/Acceptance and Production)
     • Monitoring Cluster; Kubernetes cluster; ssh Bastion (ssh access :22); software image repository
     • Subscriptions (Single Tenant)
     • Cluster Manager; Cluster Backup Service
     • API access: :443 TLS listeners; Chronograf UI access: :443 TLS listeners
     • InfluxDB Enterprise Meta Nodes; InfluxDB Enterprise Data Nodes; Chronograf + Kapacitor
     • Add-Ons: Kapacitor, Grafana
     • Papertrail (log archival)
     • Running procs: Docker, ssh, etcd
  7. 7. InfluxDB Cloud 1: Deployment Diagram (cluster detail; diagram labels)
     • Node types: Meta Nodes (Meta Node Quorum), Data Nodes, Kapacitor Node (optional add-on)
     • Running procs on each node: Docker, ssh, etcd
     • Docker containers: InfluxEnterprise Meta, InfluxEnterprise Data, Kapacitor (Chronograf access only), Chronograf, Automatron, LogSpout, Telegraf, SkyDNS
     • External services: Papertrail (log archival), InfluxData Monitoring, InfluxData Provisioning
     • Browser-based access: :8088 (Chronograf); :443 TLS listeners
     • CLI and/or programmatic access: :8086 (Data Node), :9092 (Kapacitor Node); :443 TLS listeners
     • ALB (shared across n clusters)
     • Shared Security Group (open ports between nodes): :3000, :4001, :7001, :8083, :8086, :8088, :8089, :8091, :9092
     • Other port access: :46939 (Provisioning System); :22 (open to bastion host only, for ssh)
  8. 8. Description of common processes and services within InfluxCloud
     Running processes: each node has the following processes running
     • Docker: container infrastructure within which ALL InfluxEnterprise components execute
     • ssh: secure shell to allow for secure, remote login
     • etcd: provides a common rendezvous point for InfluxDB Enterprise components in the event of changes in the underlying infrastructure
     Docker containers common across nodes
     • LogSpout gathers InfluxEnterprise-related log output and delivers it to Papertrail for storage, archival, and search.
     • Telegraf gathers metrics and events from the system services and InfluxEnterprise components to facilitate remote monitoring.
     • Automatron is a custom-built provisioning infrastructure that delivers software updates to any of the containers deployed across the nodes.
     (Diagram labels: Papertrail (log archival), Automatron, LogSpout, Telegraf, SkyDNS, InfluxData Monitoring, InfluxData Provisioning; running procs: Docker, ssh, etcd)
  9. 9. Deploy Telegraf on all nodes (meta and data)
     Enabling these plugins measures the KPIs routinely associated with infrastructure and database performance and serves as a good starting point for monitoring.
     Minimum Recommendation:
     1. CPU: collects standard CPU metrics
     2. System: gathers general stats on system load, uptime, and number of users logged in
     3. Processes: gathers counts of processes by state
     4. DiskIO: gathers metrics about disk traffic and timing
     5. Disk: gathers metrics about disk usage
     6. Mem: collects system memory metrics
     7. NetStat: gathers network-related metrics
     8. http_response: set up a local ping
     9. filestat: gathers stats about specified files (meta node only)
     10. InfluxDB: gathers stats from the InfluxDB instance (data node only)
     Optional:
     1. Logs: requires syslog
     2. Swap: collects system swap metrics
     3. Internal: gathers Telegraf-related stats
     4. Docker: if deployed in containers
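The minimum recommendation above can be turned into a quick configuration check. The following is a minimal sketch, assuming the plugin names map to Telegraf's `[[inputs.NAME]]` section headers; the role labels `meta` and `data` and both function names are hypothetical and chosen only for this illustration.

```python
import re

# Minimum plugin sets taken from the slide; "meta"/"data" are labels chosen for this sketch.
COMMON = {"cpu", "system", "processes", "diskio", "disk", "mem", "netstat", "http_response"}
REQUIRED = {
    "meta": COMMON | {"filestat"},
    "data": COMMON | {"influxdb"},
}

def enabled_inputs(conf_text):
    """Return the input plugin names that have an uncommented [[inputs.NAME]] header."""
    pattern = r"^\s*\[\[inputs\.([A-Za-z0-9_]+)\]\]"
    return set(re.findall(pattern, conf_text, flags=re.MULTILINE))

def missing_inputs(conf_text, role):
    """Return the recommended plugins that the config does not enable, sorted by name."""
    return sorted(REQUIRED[role] - enabled_inputs(conf_text))
```

Calling `missing_inputs(open("telegraf.conf").read(), "data")` on a node would list any recommended plugin that is absent or still commented out.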
  10. 10. Telegraf Configuration: Global
      [global_tags]
        cluster_id = "$CLUSTER_ID"
        environment = "$ENVIRONMENT"
      [agent]
        interval = "10s"
        round_interval = true
        metric_buffer_limit = 10000
        metric_batch_size = 1000
        collection_jitter = "0s"
        flush_interval = "30s"
        flush_jitter = "30s"
        debug = false
        hostname = ""
      All plugins are controlled by the telegraf.conf file. Administrators can enable or disable plugins and options by uncommenting or commenting out the corresponding sections.
      Global tags: specified in the [global_tags] section of the config file in key="value" format. Use a GUID that uniquely identifies each “cluster”, and ensure the environment variables are set consistently on all hosts (meta and data). Optionally, add other tags, for example dev or prod for environment.
      Agent configuration: recommended settings for InfluxDB data collection. Adjust interval and flush_interval based on:
      ● the desired “speed of observability”
      ● the retention policy for the data
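To make the "speed of observability" trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes a simplified model that is not from the talk: a data point can be up to one collection interval old, then waits at most one flush interval plus the maximum flush jitter before it is written. The function names are hypothetical.

```python
def parse_seconds(value):
    """Parse a duration such as "10s" or "5m" into seconds (only s, m, h suffixes handled)."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(value[:-1]) * units[value[-1]]

def worst_case_delay(interval, flush_interval, flush_jitter):
    """Rough upper bound in seconds: one interval + one flush wait + maximum jitter."""
    return (
        parse_seconds(interval)
        + parse_seconds(flush_interval)
        + parse_seconds(flush_jitter)
    )

# Values from the [agent] section on the slide.
print(worst_case_delay("10s", "30s", "30s"))  # 70
```

Under this model, the slide's settings mean a change on a host may take a little over a minute to appear in a dashboard; lowering flush_interval or flush_jitter shortens that at the cost of more frequent writes.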
  11. 11. Telegraf Configuration: Inputs (common)
      # INPUTS
      [[inputs.cpu]]
        percpu = false
        totalcpu = true
        fieldpass = ["usage_idle", "usage_user", "usage_system", "usage_steal"]
      [[inputs.mem]]
      [[inputs.netstat]]
      [[inputs.system]]
      [[inputs.diskio]]
      Input configuration: these inputs collect metrics from the infrastructure, database, and system components in play. Only the cpu plugin is customized here; for the other plugins, the default config is sufficient.
  12. 12. Telegraf Configuration: Inputs Data Nodes
      # INPUTS
      [[inputs.influxdb]]
        interval = "15s"
        urls = ["http://<localhost>:8086/debug/vars"]
        timeout = "15s"
      [[inputs.http_response]]
        # DATA
        address = "http://<localhost>:8086/ping"
      [[inputs.disk]]
        mount_points = ["/var/lib/influxdb/data", "/var/lib/influxdb/wal", "/var/lib/influxdb/hh", "/"]
      influxdb grabs all metrics from the exposed /debug/vars endpoint.
      http_response allows you to ping individual data nodes and track the response. You can also set up a separate Telegraf agent elsewhere within your infrastructure to ping the available cluster(s) through the load balancer.
      disk allows you to configure the various volumes/mount points on disk -- locations of data, wal, hinted handoff -- and root. (default config options shown)
  13. 13. Telegraf Configuration: Inputs Meta Nodes
      # INPUTS
      [[inputs.http_response]]
        # META
        address = "http://<localhost>:8091/ping"
      [[inputs.filestat]]
        files = ["/var/lib/influxdb/meta/snapshots/*/state.bin"]
        md5 = false
      [[inputs.disk]]
        mount_points = ["/var/lib/influxdb/meta", "/"]
      http_response allows you to ping individual meta nodes and track the response.
      filestat allows you to monitor metadata snapshots.
      disk allows you to configure the various volumes/mount points on disk -- location of the meta store -- and root. (default config options shown)
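For the separate health probe through the load balancer, a hand-rolled check could look like the sketch below. It relies on the InfluxDB 1.x `/ping` endpoint answering a healthy request with HTTP 204; the function name `ping_ok` and the injectable `opener` parameter are assumptions made for this example so it can be exercised without a live cluster.

```python
import urllib.request

def ping_ok(url, opener=urllib.request.urlopen, timeout=5):
    """Return True if the endpoint answers with HTTP 204 (healthy InfluxDB 1.x /ping)."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 204
    except OSError:
        # urllib's URLError and HTTPError are OSError subclasses, as are socket timeouts.
        return False
```

A typical call would be `ping_ok("http://<load balancer>:8086/ping")`, with the placeholder replaced by the address of the cluster being probed.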
  14. 14. Telegraf Configuration: Outputs
      # OUTPUTS
      [[outputs.influxdb]]
        urls = [ "<target URL of DB>" ]
        database = "telegraf"
        retention_policy = "autogen"
        timeout = "10s"
        username = "<uname>"
        password = "<pword>"
        content_encoding = "gzip"
      Output configuration: tells Telegraf which output sink to send the data to. Multiple output sinks can be specified in the configuration file.
      NOTE: If you are storing the metrics in a cluster, urls should point to the load balancer.
  15. 15. Telegraf Configuration: Gathering Logs
      # INPUT
      [[inputs.syslog]]
      # OUTPUTS
      [[outputs.influxdb]]
        urls = [ "http://localhost:8086" ]
        database = "telegraf"
        # Drop all measurements that start with "syslog"
        namedrop = [ "syslog*" ]
      [[outputs.influxdb]]
        urls = [ "http://localhost:8086" ]
        database = "telegraf"
        retention_policy = "14days"
        # Only accept syslog data:
        namepass = [ "syslog*" ]
      InfluxDB can be used to capture both metrics and events. The syslog protocol is used to gather the logs.
      Input configuration: add the syslog input plugin and review its settings for your environment.
      Output configuration: use namepass/namedrop to direct metrics and logs to different db.rp targets.
      NOTE: If you are storing the metrics in a cluster, urls should point to the load balancer.
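The namepass/namedrop routing can be simulated to see where a given measurement ends up. This is a minimal sketch, assuming glob-style matching on measurement names; the `OUTPUTS` table, the target labels, and the function names are illustrative and not part of Telegraf.

```python
from fnmatch import fnmatchcase

def accepts(name, namepass=None, namedrop=None):
    """Apply namepass first (must match if set), then namedrop (rejects on match)."""
    if namepass and not any(fnmatchcase(name, p) for p in namepass):
        return False
    if namedrop and any(fnmatchcase(name, p) for p in namedrop):
        return False
    return True

# The two outputs from the slide, keyed by an illustrative "database (retention policy)" label.
OUTPUTS = {
    "telegraf (default rp)": {"namedrop": ["syslog*"]},
    "telegraf (14days)": {"namepass": ["syslog*"]},
}

def route(name):
    """Return the output targets that would receive a measurement with this name."""
    return [target for target, filters in OUTPUTS.items() if accepts(name, **filters)]

print(route("cpu"))     # ['telegraf (default rp)']
print(route("syslog"))  # ['telegraf (14days)']
```

Because the two filters are complements of each other, every measurement lands in exactly one target: logs get the shorter retention policy while metrics keep the default one.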
  16. 16. Visualization, Monitoring, Alerting
  17. 17. We’ve gathered a wide variety of metrics... now what? Dashboards!
  18. 18. Alerting: Common Metrics to Watch • Disk Usage • Hinted Handoff Queue • No metrics… aka Deadman
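The deadman idea (alerting when metrics stop arriving) can be sketched independently of any alerting tool. The example below assumes you already have the timestamp of the most recent point per host; the function name, the 600-second default, and the host names in the usage note are hypothetical.

```python
def silent_hosts(last_seen, now, max_silence=600):
    """Return hosts whose latest metric is older than max_silence seconds, sorted by name.

    last_seen maps host name -> Unix timestamp of the most recent point received.
    """
    return sorted(host for host, ts in last_seen.items() if now - ts > max_silence)
```

For example, `silent_hosts({"prod-data-1": 1000, "prod-meta-1": 1900}, now=2000)` returns `["prod-data-1"]`, because that host has been silent for 1000 seconds while the other reported 100 seconds ago.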
  19. 19. Disk Usage Batch Task: TICKscript
      // Monitor disk usage for all hosts
      // The result is aliased so the alert can reference "used_percent";
      // without an alias the field would be named "last".
      var data = batch
          |query('''
              SELECT last(used_percent) AS "used_percent"
              FROM "telegraf"."autogen"."disk"
              WHERE ("host" =~ /prod-.*/)
                AND ("path" = '/var/lib/influxdb/data'
                  OR "path" = '/var/lib/influxdb/wal'
                  OR "path" = '/var/lib/influxdb/hh'
                  OR "path" = '/')
          ''')
              .period(5m)
              .every(10m)
              .groupBy('host', 'role', 'environment', 'device')
  20. 20. Disk Usage Alert: TICKscript
      var warn_threshold = 85
      var critical_threshold = 95
      data
          |alert()
              .id('Host: {{ index .Tags "host" }}, Environment: {{ index .Tags "environment" }}')
              .message('Alert: Disk Usage, Level: {{ .Level }}, Device: {{ index .Tags "device" }}, {{ .ID }}, Usage: {{ index .Fields "used_percent" }}%')
              .warn(lambda: "used_percent" > warn_threshold)
              .crit(lambda: "used_percent" > critical_threshold)
              .slack()
              .channel('#monitoring')
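The threshold logic of the disk usage alert can be checked in isolation before deploying the task. This is a minimal sketch that mirrors the two lambda expressions (strictly greater than 85 for a warning, strictly greater than 95 for critical); the function name `disk_level` is hypothetical.

```python
def disk_level(used_percent, warn=85, crit=95):
    """Map a disk usage percentage to an alert level using strict '>' comparisons."""
    if used_percent > crit:
        return "CRITICAL"
    if used_percent > warn:
        return "WARNING"
    return "OK"

print(disk_level(85))    # OK
print(disk_level(90.5))  # WARNING
print(disk_level(96))    # CRITICAL
```

Note that a value of exactly 85 does not trigger a warning, since the comparison is strict.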
  21. 21. Hinted Handoff Queue Batch Task: TICKscript
      // This generates alerts for high hinted-handoff queues for InfluxEnterprise
      var queue_size = batch
          |query('''
              SELECT max(queueBytes) as "max"
              FROM "telegraf"."autogen"."influxdb_hh_processor"
              WHERE ("host" =~ /prod-.*/)
          ''')
              .groupBy('host', 'cluster_id')
              .period(5m)
              .every(10m)
          // Convert bytes to MB
          |eval(lambda: "max" / 1048576.0)
              .as('queue_size_mb')
  22. 22. Hinted Handoff Queue Alert: TICKscript
      var warn_threshold = 3500
      var crit_threshold = 5000
      queue_size
          |alert()
              .id('InfluxEnterprise/{{ .TaskName }}/{{ index .Tags "cluster_id" }}/{{ index .Tags "host" }}')
              .message('Host {{ index .Tags "host" }} (cluster {{ index .Tags "cluster_id" }}) has a hinted-handoff queue size of {{ index .Fields "queue_size_mb" }}MB')
              .details('')
              .warn(lambda: "queue_size_mb" > warn_threshold)
              .crit(lambda: "queue_size_mb" > crit_threshold)
              .stateChangesOnly()
              .slack()
              .pagerDuty()
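The effect of `.stateChangesOnly()` on notification volume can be illustrated with a small simulation. The sketch below converts queue sizes from bytes to MB as the batch task does, classifies them against the 3500 MB and 5000 MB thresholds, and emits an event only when the level changes. It assumes the alert starts in the OK state; the function names are hypothetical.

```python
MIB = 1048576.0  # same divisor as the eval step in the batch task

def hh_level(queue_bytes, warn_mb=3500, crit_mb=5000):
    """Classify a hinted-handoff queue size given in bytes."""
    queue_mb = queue_bytes / MIB
    if queue_mb > crit_mb:
        return "CRITICAL"
    if queue_mb > warn_mb:
        return "WARNING"
    return "OK"

def state_changes(samples_bytes):
    """Return (sample index, level) pairs for samples whose level differs from the previous one."""
    events = []
    previous = "OK"  # assumption: no alert is active before the first sample
    for i, queue_bytes in enumerate(samples_bytes):
        level = hh_level(queue_bytes)
        if level != previous:
            events.append((i, level))
            previous = level
    return events
```

With samples of 1000, 4000, 4200, 6000, and 100 MB, only three notifications go out (WARNING, CRITICAL, then OK) instead of one per evaluation, which is what keeps a persistently full queue from paging repeatedly.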
  24. 24. Troubleshooting
  25. 25. Common Troubleshooting Scenarios • OOM Loop • Runaway Series Cardinality
  26. 26. Common Troubleshooting Scenarios
      Workload Type
      • Which type are we looking at?
        – Read heavy
        – Write heavy
        – Mixed?
        – Establish baselines and understand “normal” using metrics and visualization
        – Baselines allow us to understand change over time and help determine when it is time to scale up
      Log Analysis
      • Metrics first!
        – Metrics highlight where you should look within the log files
      • Logs allow for pinpointing the root cause of an issue observed in the metrics
        – Cache max memory size
        – Hinted Handoff Queue “Blocked”
      IOPS & Disk Throughput
      • Understand the capabilities of the hardware by plan size
        – Develop and review sizing guidelines
        – Understand max read and write limits based on machine class and drive types; these can change as you scale!
  27. 27. What did we miss? So many things…
      • Head for the balcony!
        – Shift from instance-based dashboards to “fleet management”
      • What’s the experience of the “customer”?
        – Real user monitoring from the front door
        – Integration with subscription management system
      • SSL cert expiration
      • E-commerce system monitoring
        – Health and availability of supporting components
  28. 28. Recap • Gather Metrics...and Logs (for context) • Visualize, Monitor, and Alert… tune based on your environment • Iterate and address “new” scenarios to eliminate alert fatigue
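One simple way to turn a baseline into a check is to compare the current value with the historical mean and flag large deviations. The sketch below uses a three-standard-deviation rule, which is an assumption chosen for illustration and not a method described in the talk; the function name is hypothetical.

```python
from statistics import mean, stdev

def outside_baseline(history, current, k=3.0):
    """Return True if current deviates from the mean of history by more than k sample standard deviations."""
    mu = mean(history)
    sigma = stdev(history)  # requires at least two historical values
    return abs(current - mu) > k * sigma

# Hypothetical write rates (points per second) observed during normal operation.
writes_per_sec = [100, 102, 98, 101, 99]
print(outside_baseline(writes_per_sec, 103))  # False
print(outside_baseline(writes_per_sec, 110))  # True
```

A sustained run of values outside the baseline, rather than a single spike, is the kind of change over time that would suggest revisiting the plan size.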
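For the SSL certificate expiration item, the remaining lifetime can be computed from a certificate's notAfter string with the Python standard library. This is a minimal sketch, assuming the string is in the format returned by `ssl.SSLSocket.getpeercert()`; the function name and the idea of passing the current time explicitly are choices made for this example.

```python
import ssl

def days_until_expiry(not_after, now_epoch):
    """Return the days between now_epoch and a notAfter string such as 'Jan  5 09:34:43 2018 GMT'."""
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    return (expiry_epoch - now_epoch) / 86400.0
```

Writing the result to a time series on a schedule would let the same threshold-style alerting used for disk usage warn well before a certificate lapses.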
  30. 30. Thank You