Tim E. Hall @thallinflux
VP, Products, InfluxData
Lessons Learned: Running InfluxDB Cloud at Scale
Discussion Topics
Brief History of InfluxDB Cloud
Gathering Metrics...and Logs
Visualization, Monitoring, and Alerting
Troubleshooting Scenarios
What did we miss? So many things…
A Brief History of InfluxDB Cloud 1.0…
May 2014
• Open Source DBaaS
• Hosted on Digital Ocean
April 2016
• Enterprise Edition DBaaS
• Kapacitor Add-On
• Hosted on AWS
August 2017
• Enterprise Edition DBaaS
• Chronograf and limited Kapacitor included
• Co-monitoring
• Pay-as-you-go storage
From development to production
• Establish monitoring baselines
• Ensure visibility into health of the system
• Notifications for most common issues, before they become outages
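From OSS to Enterprise
[Diagram: a single InfluxDB OSS instance next to an InfluxDB Enterprise cluster made up of three meta nodes (Meta 1, Meta 2, Meta 3) and two data nodes (Data Node 1, Data Node 2).]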
InfluxDB Cloud 1: Deployment Diagram
[Diagram: account-level deployment. Recoverable labels:]
• AWS Account (separate accounts for Development/Acceptance and Production)
• Monitoring cluster and Kubernetes cluster
• Bastion host, with ssh access on :22
• Cluster Manager and Cluster Backup Service; Cluster Manager API access through TLS listeners on :443
• Chronograf UI access through TLS listeners on :443
• Subscriptions (single tenant), each with InfluxDB Enterprise meta nodes, InfluxDB Enterprise data nodes, Chronograf, and Kapacitor
• Add-ons: Kapacitor, Grafana
• Quay.io software image repository
• Papertrail (log archival)
• Legend: running processes per host (ssh only, or Docker, ssh, and etcd); boxes designate services
InfluxDB Cloud 1: Deployment Diagram
[Diagram: layout of a single cluster. Recoverable labels:]
• Node types: meta nodes (meta node quorum), data nodes, an optional Kapacitor add-on node, and a "Kach" node running Chronograf plus a Kapacitor that is reachable through Chronograf only
• Running processes on every node: Docker, ssh, etcd
• Docker containers on every node: Automatron, LogSpout, Telegraf, SkyDNS, plus the node's own component (InfluxEnterprise Meta, InfluxEnterprise Data, Kapacitor, or Chronograf)
• External services: Papertrail (log archival), InfluxData Monitoring, InfluxData Provisioning
• CLI and/or programmatic access: TLS listeners on :443, forwarding to :8086 (data node) and :9092 (Kapacitor node)
• Browser-based access: TLS listeners on :443, forwarding to :8088 (Chronograf)
• ALB shared across n clusters
• Shared security group (open ports between nodes): :3000, :4001, :7001, :8083, :8086, :8088, :8089, :8091, :9092
• Other port access: :46939 for the provisioning system; :22 open to the bastion host only (for ssh)
Description of common processes and services within InfluxCloud
Running processes
– Each node has the following processes running
• Docker – container infrastructure within which ALL InfluxEnterprise components execute
• ssh – secure shell to allow for secure, remote login
• etcd – provides a common rendezvous point for InfluxDB Enterprise components in the event of changes in the underlying infrastructure
– Docker containers common across nodes
• LogSpout gathers InfluxEnterprise-related log output and delivers it to Papertrail for storage, archival, and search.
• Telegraf gathers metrics and events from the system services and InfluxEnterprise components to facilitate remote monitoring.
• Automatron is a custom-built provisioning infrastructure that allows for delivery of software updates to any of the containers deployed across the nodes.
[Diagram: a node running Docker, ssh, and etcd, with the Automatron, LogSpout, Telegraf, and SkyDNS containers, alongside Papertrail (log archival), InfluxData Monitoring, and InfluxData Provisioning.]
Deploy Telegraf on all nodes (meta and data)
Enabling these plugins measures the KPIs routinely associated with infrastructure and database performance, which serves as a good starting point for monitoring.
Minimum Recommendation:
1. CPU: collects standard CPU metrics
2. System: gathers general stats on system load, uptime, and number of users logged in
3. Processes: counts processes, grouped by state
4. DiskIO: gathers metrics about disk traffic and timing
5. Disk: gathers metrics about disk usage
6. Mem: collects system memory metrics
7. NetStat: network-related metrics
8. http_response: set up a local ping
9. filestat: files to gather stats about (meta node only)
10. InfluxDB: gathers stats from the InfluxDB instance (data node only)
Optional:
1. Logs: requires syslog
2. Swap: collects system swap metrics
3. Internal: gathers Telegraf-related stats
4. Docker: if deployed in containers
Telegraf Configuration: Global
[global_tags]
  # Environment variables used as string values must be quoted
  cluster_id = "$CLUSTER_ID"
  environment = "$ENVIRONMENT"
[agent]
  interval = "10s"
  round_interval = true
  metric_buffer_limit = 10000
  metric_batch_size = 1000
  collection_jitter = "0s"
  flush_interval = "30s"
  flush_jitter = "30s"
  debug = false
  hostname = ""
All plugins are controlled by the telegraf.conf file. Administrators can enable or disable plugins and their options by editing that file.
Global tags: specified in the [global_tags] section of the config file in key="value" format. Use a GUID that uniquely identifies each “cluster”, and ensure that the environment variable exists consistently on all hosts (meta and data). Optionally, add other tags, for example dev or prod for environment.
Agent configuration: recommended settings for InfluxDB data collection. Adjust interval and flush_interval based on:
● the desired “speed of observability”
● the retention policy for the data
Telegraf Configuration: Inputs (common)
# INPUTS
[[inputs.cpu]]
  percpu = false
  totalcpu = true
  fieldpass = ["usage_idle", "usage_user", "usage_system", "usage_steal"]
[[inputs.mem]]
[[inputs.netstat]]
[[inputs.system]]
[[inputs.diskio]]
Input configuration: gathers metrics from the various infrastructure, database, and system components in play. The cpu input is restricted to four fields with fieldpass.
For the other plug-ins, the default config is sufficient.
Telegraf Configuration: Inputs Data Nodes
# INPUTS
[[inputs.influxdb]]
  interval = "15s"
  urls = ["http://<localhost>:8086/debug/vars"]
  timeout = "15s"
[[inputs.http_response]] # DATA
  address = "http://<localhost>:8086/ping"
[[inputs.disk]]
  mount_points = ["/var/lib/influxdb/data", "/var/lib/influxdb/wal", "/var/lib/influxdb/hh", "/"]
The influxdb input grabs all metrics from the exposed /debug/vars endpoint.
http_response allows you to ping individual data nodes and track the response output. You can also set up a separate Telegraf agent elsewhere within your infrastructure to ping the available cluster(s) through the load balancer.
disk allows you to configure the various volumes/mount points on disk: the locations of data, wal, and hinted handoff, plus root. (default config options shown)
Telegraf Configuration: Inputs Meta Nodes
# INPUTS
[[inputs.http_response]] # META
  address = "http://<localhost>:8091/ping"
[[inputs.filestat]]
  files = ["/var/lib/influxdb/meta/snapshots/*/state.bin"]
  md5 = false
[[inputs.disk]]
  mount_points = ["/var/lib/influxdb/meta", "/"]
http_response allows you to ping individual meta nodes and track the response output.
filestat allows you to monitor metadata snapshots.
disk allows you to configure the various volumes/mount points on disk: the location of the meta store, plus root. (default config options shown)
Telegraf Configuration: Outputs
# OUTPUTS
[[outputs.influxdb]]
  urls = ["<target URL of DB>"]
  database = "telegraf"
  retention_policy = "autogen"
  timeout = "10s"
  username = "<uname>"
  password = "<pword>"
  content_encoding = "gzip"
Output configuration: tells Telegraf which output sink to send the data to. Multiple output sinks can be specified in the configuration file.
** NOTE: This should point to the load balancer if you are storing the metrics in a cluster.
Telegraf Configuration: Gathering Logs
# INPUT
[[inputs.syslog]]
# OUTPUTS
[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
  database = "telegraf"
  # Drop all measurements that start with "syslog"
  namedrop = ["syslog*"]
[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
  database = "telegraf"
  retention_policy = "14days"
  # Only accept syslog data:
  namepass = ["syslog*"]
Output configuration: use namepass/namedrop to direct metrics and logs to different db.rp targets.
** NOTE: This should point to the load balancer if you are storing the metrics in a cluster.
Input configuration: add the syslog input plug-in. Review the settings for your environment.
InfluxDB can be used to capture both metrics and events. The syslog protocol is used to gather the logs.
Visualization, Monitoring, Alerting
We’ve gathered a wide variety of metrics...so now what?
Dashboards!
Alerting: Common Metrics to Watch
Disk Usage
Hinted Handoff Queue
No metrics…. aka Deadman
Disk Usage Batch Task: TICKscript
// Monitor disk usage for all hosts
var data = batch
    |query('''
        // Alias the result so the alert can reference the "used_percent" field
        SELECT last(used_percent) AS used_percent
        FROM "telegraf"."autogen"."disk"
        WHERE ("host" =~ /prod-.*/)
        AND ("path" = '/var/lib/influxdb/data'
        OR "path" = '/var/lib/influxdb/wal'
        OR "path" = '/var/lib/influxdb/hh'
        OR "path" = '/')
    ''')
        .period(5m)
        .every(10m)
        .groupBy('host', 'role', 'environment', 'device')
Disk Usage Alert: TICKscript
var warn_threshold = 85
var critical_threshold = 95
data
    |alert()
        .id('Host: {{ index .Tags "host" }}, Environment: {{ index .Tags "environment" }}')
        .message('Alert: Disk Usage, Level: {{ .Level }}, Device: {{ index .Tags "device" }}, {{ .ID }}, Usage: %{{ index .Fields "used_percent" }}')
        .warn(lambda: "used_percent" > warn_threshold)
        .crit(lambda: "used_percent" > critical_threshold)
        .slack()
        .channel('#monitoring')
Hinted Handoff Queue Batch Task: TICKscript
// This generates alerts for high hinted-handoff queues for InfluxEnterprise
var queue_size = batch
    |query('''
        SELECT max(queueBytes) as "max"
        FROM "telegraf"."autogen"."influxdb_hh_processor"
        WHERE ("host" =~ /prod-.*/)
    ''')
        .groupBy('host', 'cluster_id')
        .period(5m)
        .every(10m)
    |eval(lambda: "max" / 1048576.0)
        .as('queue_size_mb')
Hinted Handoff Queue Alert: TICKscript
var warn_threshold = 3500
var crit_threshold = 5000
queue_size
    |alert()
        .id('InfluxEnterprise/{{ .TaskName }}/{{ index .Tags "cluster_id" }}/{{ index .Tags "host" }}')
        .message('Host {{ index .Tags "host" }} (cluster {{ index .Tags "cluster_id" }}) has a hinted-handoff queue size of {{ index .Fields "queue_size_mb" }}MB')
        .details('')
        .warn(lambda: "queue_size_mb" > warn_threshold)
        .crit(lambda: "queue_size_mb" > crit_threshold)
        .stateChangesOnly()
        .slack()
        .pagerDuty()
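Deadman: TICKscript (sketch)
The slides list "no metrics, aka deadman" among the common metrics to watch but show no script for it. The following is a minimal sketch, not the task InfluxData actually runs. It assumes the same "telegraf"."autogen" target, prod- host naming, and cluster_id tag used in the scripts above; the choice of the cpu measurement, the 5m interval, and the #monitoring channel are illustrative assumptions.

```tickscript
// Sketch only: alert when a host stops reporting Telegraf metrics.
// Assumed for illustration: 'cpu' measurement, 5m interval, #monitoring channel.
stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('cpu')
        .where(lambda: "host" =~ /prod-.*/)
        .groupBy('host', 'cluster_id')
    // Fire when the number of points received in 5m is at or below 0
    |deadman(0.0, 5m)
        .id('Deadman/{{ index .Tags "cluster_id" }}/{{ index .Tags "host" }}')
        .message('Host {{ index .Tags "host" }} (cluster {{ index .Tags "cluster_id" }}) is {{ if eq .Level "OK" }}reporting again{{ else }}not reporting metrics{{ end }}')
        .stateChangesOnly()
        .slack()
        .channel('#monitoring')
```

The interval should comfortably exceed the agent's flush_interval plus flush_jitter, otherwise normal batching delay can look like a dead host.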
https://docs.influxdata.com
Troubleshooting
Common Troubleshooting Scenarios
• OOM Loop
• Runaway Series Cardinality
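Series Cardinality Alert: TICKscript (sketch)
Runaway series cardinality can be watched with the same batch-plus-alert pattern as the disk and hinted-handoff tasks. This is a sketch under stated assumptions: it assumes the Telegraf influxdb input writes a numSeries field to an influxdb_database measurement tagged by database (verify this against your own telegraf database), and both thresholds are hypothetical placeholders to be tuned per plan size.

```tickscript
// Sketch only: alert on growth in series count per database.
// Assumed: measurement "influxdb_database", field "numSeries", tag "database".
// Thresholds are hypothetical placeholders, not recommendations.
var warn_threshold = 1000000
var crit_threshold = 5000000

var cardinality = batch
    |query('''
        SELECT max("numSeries") AS "num_series"
        FROM "telegraf"."autogen"."influxdb_database"
        WHERE ("host" =~ /prod-.*/)
    ''')
        .period(5m)
        .every(10m)
        .groupBy('host', 'cluster_id', 'database')

cardinality
    |alert()
        .id('InfluxEnterprise/{{ .TaskName }}/{{ index .Tags "cluster_id" }}/{{ index .Tags "host" }}/{{ index .Tags "database" }}')
        .message('Database {{ index .Tags "database" }} on host {{ index .Tags "host" }} (cluster {{ index .Tags "cluster_id" }}) has {{ index .Fields "num_series" }} series')
        .warn(lambda: "num_series" > warn_threshold)
        .crit(lambda: "num_series" > crit_threshold)
        .stateChangesOnly()
        .slack()
```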
Common Troubleshooting Scenarios
Workload Type
• Which type are we looking at?
– Read heavy
– Write heavy
– Mixed?
– Establish baselines and understand “normal” using metrics and visualization
– Baselines allow us to understand change over time and help determine when it is time to scale up
Log Analysis
• Metrics First!
– Highlights where you should look within the log files
• Logs allow for pinpointing the root cause of an issue observed by metrics
– Cache max memory size
– Hinted Handoff Queue “Blocked”
IOPS & Disk Throughput
• Understand the capabilities of the hardware by plan size
– Develop and review sizing guidelines
– Understand max read and write limits based on machine class and drive types – these can change as you scale!
What did we miss? So many things…
Head for the balcony!
– Shift from instance-based dashboards to “fleet management”
What’s the experience of the “customer”?
– Real user monitoring from the front-door
– Integration with subscription management system
SSL Cert expiration
E-commerce system monitoring
– Health and availability of supporting components
Recap
Gather Metrics...and Logs (for context)
Visualize, Monitor, and Alert… tune based on your environment
Iterate and address “new” scenarios to eliminate alert fatigue
https://community.influxdata.com https://docs.influxdata.com
https://www.influxdata.com/products/influxdb-cloud/
Thank You

Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
InfluxData
 

Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | Tim Hall | InfluxData

  • 1. Tim E. Hall @thallinflux VP, Products InfluxData Lessons Learned: Running InfluxDB Cloud at Scale
  • 2. Discussion Topics Brief History of InfluxDB Cloud Gathering Metrics...and Logs Visualization, Monitoring, and Alerting Troubleshooting Scenarios What did we miss? So many things…
  • 3. A Brief History of InfluxDB Cloud 1.0… May 2014: • Open Source DBaaS • Hosted on DigitalOcean. April 2016: • Enterprise Edition DBaaS • Kapacitor Add-On • Hosted on AWS. August 2017: • Enterprise Edition DBaaS • Chronograf and limited Kapacitor included • Co-monitoring • Pay-as-you-go storage
  • 4. From development to production • Establish monitoring baselines • Ensure visibility into health of the system • Notifications for most common issues, before they become outages
  • 5. From OSS to Enterprise InfluxDB OSS InfluxDB Enterprise: Meta 1, Meta 2, Meta 3; Data Node 1, Data Node 2
  • 6. InfluxDB Cloud 1: Deployment Diagram AWS Account (Separate Accounts for Development/Acceptance and Production) Monitoring Cluster Kubernetes cluster ssh Bastion Subscriptions (Single Tenant) Running procs: ssh Running procs: Docker ssh etcd Designates: Service Running procs: Docker ssh etcd Cluster Manager API Access :443 TLS Listeners Chronograf UI Access :443 TLS Listeners Cluster Manager Cluster Backup Service ssh Access :22 Quay.io software image repository InfluxDB Enterprise Data Nodes InfluxDB Enterprise Meta Nodes Chronograf Kapacitor InfluxDB Enterprise Meta Nodes InfluxDB Enterprise Data Nodes Chronograf + Kapacitor Add-Ons: Kapacitor Grafana Papertrail (log archival)
  • 7. Data Nodes InfluxDB Cloud 1: Deployment Diagram Meta Node Quorum Data Nodes Kapacitor Node (optional add-on) Kach Node Meta Nodes Papertrail (log archival) Running procs: Docker ssh etcd Running procs: Docker ssh etcd Running procs: Docker ssh etcd Designates: Docker Container Kapacitor (Chronograf access only) Automatron LogSpout SkyDNS Telegraf InfluxData Monitoring InfluxData Provisioning Chronograf Automatron LogSpout Telegraf SkyDNS Running procs: Docker ssh etcd Browser-based access CLI and/or Programmatic Access :8086 (Data Node) :9092 (Kapacitor Node) :443 TLS Listeners :8088 (Chronograf) :443 TLS Listeners InfluxEnterprise Meta InfluxEnterprise Data Automatron LogSpout Telegraf SkyDNS Kapacitor SkyDNS Automatron LogSpout Telegraf ALB (Shared across n clusters) Shared Security Group (Open ports between nodes) :3000 :4001 :7001 :8083, :8086, :8088, :8089, :8091 :9092 Other Port Access :46939 – Provisioning System :22 – open to bastion host only (for ssh)
  • 8. Description of common processes and services within InfluxCloud Running processes – Each node has the following processes running • Docker -- container infrastructure within which ALL InfluxEnterprise components execute • ssh – secure shell to allow for secure, remote login • etcd – provides a common rendezvous point for InfluxDB Enterprise components in the event of changes in the underlying infrastructure – Docker containers common across nodes • LogSpout gathers InfluxEnterprise-related log output and delivers it to Papertrail for storage, archival, and search. • Telegraf gathers metrics and events from the system services and InfluxEnterprise components to facilitate remote monitoring. • Automatron is a custom-built provisioning infrastructure which allows for delivery of software updates to any of the containers deployed across the nodes. Papertrail (log archival) Automatron LogSpout Telegraf InfluxData Monitoring InfluxData Provisioning SkyDNS Running procs: Docker ssh etcd
  • 9. Deploy Telegraf on all nodes (meta and data) With these plugins enabled, the KPIs routinely associated with infrastructure and database performance can be measured and serve as a good starting point for monitoring. Minimum Recommendation: 1. CPU: collects standard CPU metrics 2. System: gathers general stats on system load, uptime, and number of users logged in 3. Processes: gathers the number of processes, grouped by state 4. DiskIO: gathers metrics about disk traffic and timing 5. Disk: gathers metrics about disk usage 6. Mem: collects system memory metrics 7. NetStat: network-related metrics 8. http_response: set up a local ping 9. filestat: files to gather stats about (meta node only) 10. InfluxDB: gathers stats from the InfluxDB instance (data node only) Optional: 1. Logs: requires syslog 2. Swap: collects system swap metrics 3. Internal: gathers Telegraf-related stats 4. Docker: if deployed in containers
  • 10. Telegraf Configuration: Global [global_tags] cluster_id = $CLUSTER_ID environment = $ENVIRONMENT [agent] interval = "10s" round_interval = true metric_buffer_limit = 10000 metric_batch_size = 1000 collection_jitter = "0s" flush_interval = "30s" flush_jitter = "30s" debug = false hostname = "" All plugins are controlled by the telegraf.conf file. Administrators can easily enable/disable plugins and options by activating them. Global tags can be specified in the [global_tags] section of the config file in key="value" format. Use a GUID which uniquely identifies each “cluster” and ensure that environment variable exists consistently on all hosts (meta and data). Optionally, add other tags if desired. Example: dev, prod for environment. Agent Configuration recommended config settings for InfluxDB data collection. Adjust the interval and flush_interval based on: ● desire around “speed of observability” ● retention policy for the data
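Laid out as a file, the global section from slide 10 might look like the sketch below. The values are the ones shown on the slide. The tag values are quoted and written in the `${VAR}` form on the assumption that the slide's bare `$CLUSTER_ID` is shorthand: tag values are strings, and Telegraf's environment-variable substitution expects string variables to sit inside quotes.

```toml
# telegraf.conf (global section), values as shown on slide 10

[global_tags]
  # Assumes CLUSTER_ID and ENVIRONMENT are exported on every host (meta and data).
  cluster_id = "${CLUSTER_ID}"      # a GUID that uniquely identifies the cluster
  environment = "${ENVIRONMENT}"    # for example "dev" or "prod"

[agent]
  interval = "10s"                  # how often inputs are collected
  round_interval = true
  metric_buffer_limit = 10000
  metric_batch_size = 1000
  collection_jitter = "0s"
  flush_interval = "30s"            # how often outputs are flushed
  flush_jitter = "30s"
  debug = false
  hostname = ""                     # empty: use the OS hostname
```

Shorter `interval` and `flush_interval` values give faster observability at the cost of more points written, so they are worth revisiting together with the retention policy of the target database.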
  • 11. Telegraf Configuration: Inputs (common) # INPUTS [[inputs.cpu]] percpu = false totalcpu = true fieldpass = ["usage_idle", "usage_user", "usage_system", "usage_steal"] [[inputs.mem]] [[inputs.netstat]] [[inputs.system]] [[inputs.diskio]] Input Configuration items include grabbing metrics from the various infrastructure, database, and system components in play. For the other plug-ins, default config is sufficient.
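For readability, the common inputs from slide 11 are set out below as they would appear in `telegraf.conf`. Nothing is added beyond what the slide shows; the comments are explanatory only.

```toml
# INPUTS common to meta and data nodes (slide 11)

[[inputs.cpu]]
  percpu = false      # no per-core series
  totalcpu = true     # report the aggregate "cpu-total" series
  # Keep only the fields that are charted and alerted on.
  fieldpass = ["usage_idle", "usage_user", "usage_system", "usage_steal"]

# The remaining plugins run with their default settings.
[[inputs.mem]]
[[inputs.netstat]]
[[inputs.system]]
[[inputs.diskio]]
```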
  • 12. Telegraf Configuration: Inputs Data Nodes # INPUTS [[inputs.influxdb]] interval = "15s" urls = ["http://<localhost>:8086/debug/vars"] timeout = "15s" [[inputs.http_response]] #DATA address = "http://<localhost>:8086/ping" [[inputs.disk]] mount_points = ["/var/lib/influxdb/data","/var/lib/influxdb/wal", "/var/lib/influxdb/hh","/"] The influxdb input grabs all metrics from the exposed endpoint. http_response allows you to ping individual data nodes and track response output. You can also set up a separate Telegraf agent elsewhere within your infrastructure to ping the available cluster(s) through the load balancer. disk allows you to configure the various volumes/mount points on disk -- locations of data, wal, hinted handoff -- and root. (default config options shown)
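A cleaned-up layout of the data-node inputs from slide 12 is sketched below, with plain ASCII quotes throughout. `<localhost>` is kept as the slide's placeholder for the node's own address. The `address` key is the one the slide uses; newer Telegraf releases may prefer a `urls` list for `http_response`, so check the plugin documentation for your version.

```toml
# INPUTS for data nodes (slide 12)

[[inputs.influxdb]]
  interval = "15s"
  urls = ["http://<localhost>:8086/debug/vars"]   # InfluxDB internal stats endpoint
  timeout = "15s"

[[inputs.http_response]]
  address = "http://<localhost>:8086/ping"        # local health check of the data node

[[inputs.disk]]
  # data, write-ahead log, hinted handoff, and root
  mount_points = [
    "/var/lib/influxdb/data",
    "/var/lib/influxdb/wal",
    "/var/lib/influxdb/hh",
    "/",
  ]
```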
  • 13. Telegraf Configuration: Inputs Meta Nodes # INPUTS [[inputs.http_response]] #META address = "http://<localhost>:8091/ping" [[inputs.filestat]] files = ["/var/lib/influxdb/meta/snapshots/*/state.bin"] md5 = false [[inputs.disk]] mount_points = ["/var/lib/influxdb/meta", "/"] http_response allows you to ping individual meta nodes and track response output. filestat allows you to monitor metadata snapshots. disk allows you to configure the various volumes/mount points on disk -- locations of meta store -- and root. (default config options shown)
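The meta-node inputs from slide 13 can be laid out as in the sketch below. The `filestat` path is written under `/var/lib/influxdb/meta`, on the assumption that it sits beside the meta store listed in `mount_points`; adjust it if your meta directory lives elsewhere. `<localhost>` is the slide's placeholder.

```toml
# INPUTS for meta nodes (slide 13)

[[inputs.http_response]]
  address = "http://<localhost>:8091/ping"   # local health check of the meta node

[[inputs.filestat]]
  # Track the metadata snapshot files (existence, size, modification time).
  files = ["/var/lib/influxdb/meta/snapshots/*/state.bin"]
  md5 = false                                # skip checksum calculation

[[inputs.disk]]
  # meta store and root
  mount_points = ["/var/lib/influxdb/meta", "/"]
```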
  • 14. Telegraf Configuration: Outputs # OUTPUTS [[outputs.influxdb]] urls = [ "<target URL of DB>" ] database = "telegraf" retention_policy = "autogen" timeout = "10s" username = <uname> password = <pword> content_encoding = "gzip" Output Configuration tells Telegraf which output sink to send the data to. Multiple output sinks can be specified in the configuration file. ** NOTE: This should point to the load balancer if you are storing the metrics in a cluster.
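As a file, the output section from slide 14 might read as follows. The angle-bracket values are the slide's placeholders and must be replaced; they are quoted here because `username` and `password` are string settings.

```toml
# OUTPUTS (slide 14)

[[outputs.influxdb]]
  # Point this at the load balancer when the metrics are stored in a cluster.
  urls = ["<target URL of DB>"]
  database = "telegraf"
  retention_policy = "autogen"
  timeout = "10s"
  username = "<uname>"          # placeholder
  password = "<pword>"          # placeholder
  content_encoding = "gzip"     # compress request bodies
```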
  • 15. Telegraf Configuration: Gathering Logs # INPUT [[inputs.syslog]] # OUTPUTS [[outputs.influxdb]] urls = [ "http://localhost:8086" ] database = "telegraf" # Drop all measurements that start with "syslog" namedrop = [ "syslog*" ] [[outputs.influxdb]] urls = [ "http://localhost:8086" ] database = "telegraf" retention_policy = "14days" # Only accept syslog data: namepass = [ "syslog*" ] Output Configuration use namepass/namedrop to direct metrics/logs to different db.rp targets ** NOTE: This should point to the load balancer, if you are storing the metrics into a cluster. Input Configuration add the syslog input plug-in. Review the settings for your environment. InfluxDB can be used to capture both metrics and events. The syslog protocol is used to gather the logs.
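The routing described on slide 15 relies on two `outputs.influxdb` blocks that see the same stream and filter it by measurement name. A sketch using the slide's values is below. The `syslog` input is left at its defaults, which should be reviewed for your environment, and the `14days` retention policy is assumed to exist already on the target database.

```toml
# INPUT: receive logs over the syslog protocol
[[inputs.syslog]]

# OUTPUT 1: metrics only (default retention policy)
[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
  database = "telegraf"
  # Drop all measurements whose name starts with "syslog"
  namedrop = ["syslog*"]

# OUTPUT 2: logs only, kept for a shorter period
[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
  database = "telegraf"
  retention_policy = "14days"
  # Only accept syslog data
  namepass = ["syslog*"]
```

Because the two filters are complementary, each point is written exactly once, either to the default retention policy or to `14days`.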
  • 17. We’ve gathered a wide variety of metrics...so now what? Dashboards!
  • 18. Alerting: Common Metrics to Watch Disk Usage Hinted Handoff Queue No metrics…. aka Deadman
  • 19. Disk Usage Batch Task: TICKscript // Monitor disk usage for all hosts var data = batch |query(''' SELECT last(used_percent) AS "used_percent" FROM "telegraf"."autogen"."disk" WHERE ("host" =~ /prod-.*/) AND ("path" = '/var/lib/influxdb/data' OR "path" = '/var/lib/influxdb/wal' OR "path" = '/var/lib/influxdb/hh' OR "path" = '/') ''') .period(5m) .every(10m) .groupBy('host', 'role', 'environment', 'device')
  • 20. Disk Usage Alert: TICKscript var warn_threshold = 85 var critical_threshold = 95 data |alert() .id('Host: {{ index .Tags "host" }}, Environment: {{ index .Tags "environment" }}') .message('Alert: Disk Usage, Level: {{ .Level }}, Device: {{ index .Tags "device" }}, {{ .ID }}, Usage: %{{ index .Fields "used_percent" }}') .warn(lambda: "used_percent" > warn_threshold) .crit(lambda: "used_percent" > critical_threshold) .slack() .channel('#monitoring')
  • 21. Hinted Handoff Queue Batch Task: TICKscript // This generates alerts for high hinted-handoff queues for InfluxEnterprise var queue_size = batch |query(''' SELECT max(queueBytes) as "max" FROM "telegraf"."autogen"."influxdb_hh_processor" WHERE ("host" =~ /prod-.*/) ''') .groupBy('host', 'cluster_id') .period(5m) .every(10m) |eval(lambda: "max" / 1048576.0) .as('queue_size_mb')
  • 22. Hinted Handoff Queue Alert: TICKscript var warn_threshold = 3500 var crit_threshold = 5000 queue_size |alert() .id('InfluxEnterprise/{{ .TaskName }}/{{ index .Tags "cluster_id" }}/{{ index .Tags "host" }}') .message('Host {{ index .Tags "host" }} (cluster {{ index .Tags "cluster_id" }}) has a hinted-handoff queue size of {{ index .Fields "queue_size_mb" }}MB') .details('') .warn(lambda: "queue_size_mb" > warn_threshold) .crit(lambda: "queue_size_mb" > crit_threshold) .stateChangesOnly() .slack() .pagerDuty()
  • 25. Common Troubleshooting Scenarios • OOM Loop • Runaway Series Cardinality
  • 26. Common Troubleshooting Scenarios Workload Type • Which type are we looking at? – Read heavy – Write heavy – Mixed? – Establish baselines and understand “normal” using metrics and visualization – Baselines allow us to understand change over time and help determine when it is time to scale up Log Analysis • Metrics First! – Metrics highlight where you should look within the log files • Logs allow for pinpointing the root cause of an issue observed in the metrics – Cache max memory size – Hinted Handoff Queue “Blocked” IOPS & Disk Throughput • Understand the capabilities of the hardware by plan size – Develop and review sizing guidelines – Understand max read and write limits based on machine class and drive types – these can change as you scale!
  • 27. What did we miss? So many things… Head for the balcony! – Shift from instance-based dashboards to “fleet management” What’s the experience of the “customer”? – Real user monitoring from the front-door – Integration with subscription management system SSL Cert expiration E-commerce system monitoring – Health and availability of supporting components
  • 28. Recap Gather Metrics...and Logs (for context) Visualize, Monitor, and Alert… tune based on your environment Iterate and address “new” scenarios to eliminate alert fatigue https://community.influxdata.com https://docs.influxdata.com