Monitoring InfluxEnterprise

Tim E. Hall @thallinflux
VP, Products InfluxData

Discussion Topics
• Background
• Gathering Metrics...and Logs
• Visualization, Monitoring, and Alerting
• Troubleshooting Scenarios

From development to production
• Change is required
• Establish monitoring baselines
• Ensure visibility into health of the system
• Notifications for most common issues,
before they become outages

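From OSS to Enterprise
[Diagram: a single InfluxDB OSS instance compared with an InfluxDB Enterprise cluster of three meta nodes (Meta 1, Meta 2, Meta 3) and two data nodes (Data Node 1, Data Node 2)]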

Deploy Telegraf on all nodes (meta and data)
Enabling these plugins measures the KPIs routinely associated with infrastructure and database
performance and serves as a good starting point for monitoring.
Minimum Recommendation:
1. CPU: collects standard CPU metrics
2. System: gathers general stats on system load, uptime, and number of users logged in
3. Processes: gathers counts of processes by state
4. DiskIO: gathers metrics about disk traffic and timing
5. Disk: gathers metrics about disk usage
6. Mem: collects system memory metrics
7. NetStat: network-related metrics
8. http_response: set up a local ping
9. filestat: Files to gather stats about (meta node only)
10. InfluxDB: gather stats from the InfluxDB Instance. (data node only)
Optional:
1. Logs: requires syslog
2. Swap: collects system swap metrics
3. Internal: gather Telegraf related stats
4. Docker: if deployed in containers
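
As a rough sketch, the plugins from the lists above that are not configured elsewhere in this deck could be enabled as follows. The option values are assumptions for illustration (the Docker socket path in particular depends on the host) and should be checked against the Telegraf version in use.

```toml
# Minimum list: process counts by state (no options required)
[[inputs.processes]]

# Optional: system swap metrics (no options required)
[[inputs.swap]]

# Optional: stats about the Telegraf agent itself
[[inputs.internal]]
# Assumption: also collect the agent's Go runtime memory stats
collect_memstats = true

# Optional: only when the nodes run in containers
[[inputs.docker]]
# Assumption: default local Docker socket; adjust for your host
endpoint = "unix:///var/run/docker.sock"
```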

But where should these metrics land?
• You’ve got lots of options
– Typical recommendation: use an Open Source instance as the “watcher of the watchers”
• If only a small number of clusters need to be monitored, this is the simplest way to go
– Other options to consider:
• 2 instances that monitor each other
• Separate instances by environment, and eliminate the environment global tag in the Telegraf config
• Unleash your creativity…

Key Point
– Production InfluxDB instances should not monitor themselves
– WHY?
• Because visibility is lost if the database is unreachable, for any reason.
# influxdb.conf: do not record internal statistics in the local instance
[monitor]
store-enabled = false

Telegraf Configuration: Global
[global_tags]
cluster_id = "$CLUSTER_ID"
environment = "$ENVIRONMENT"
[agent]
interval = "10s"
round_interval = true
metric_buffer_limit = 10000
metric_batch_size = 1000
collection_jitter = "0s"
flush_interval = "30s"
flush_jitter = "30s"
debug = false
hostname = ""
All plugins are controlled by the telegraf.conf file. Administrators enable or disable plugins and options by
uncommenting or commenting out the relevant sections.
Global tags are specified in the [global_tags] section of the config file in key="value" format. Use
a GUID that uniquely identifies each “cluster”, and ensure the corresponding environment variable is set
consistently on all hosts (meta and data). Optionally, add other tags, for example dev or prod for environment.
Agent configuration: the settings shown are recommended for InfluxDB data collection. Adjust interval and
flush_interval based on:
● the desired “speed of observability”
● the retention policy for the data

Telegraf Configuration: Inputs (common)
# INPUTS
[[inputs.cpu]]
percpu = false
totalcpu = true
fieldpass = ["usage_idle",
"usage_user", "usage_system",
"usage_steal"]
[[inputs.mem]]
[[inputs.netstat]]
[[inputs.system]]
[[inputs.diskio]]
Input configuration: these plugins gather metrics from the infrastructure, database, and
system components in play.
For the other plugins, the default configuration is sufficient.

Telegraf Configuration: Inputs Data Nodes
# INPUTS
[[inputs.influxdb]]
interval = "15s"
urls = ["http://<localhost>:8086/debug/vars"]
timeout = "15s"
[[inputs.http_response]] #DATA
address = "http://<localhost>:8086/ping"
[[inputs.disk]]
mount_points = ["/var/lib/influxdb/data", "/var/lib/influxdb/wal",
"/var/lib/influxdb/hh", "/"]
influxdb gathers all metrics from the exposed /debug/vars endpoint.
http_response allows you to ping individual data nodes and track the response output.
You can also set up a separate Telegraf agent elsewhere within your infrastructure to ping the available
cluster(s) through the load balancer.
disk allows you to configure the various volumes/mount points on disk: the locations of data, wal,
hinted handoff, and root. (default config options shown)
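
A minimal sketch of the separate "outside-in" agent mentioned above, which pings the cluster through the load balancer rather than a single node. The hostname is hypothetical, and the timeout value is an assumption. Newer Telegraf releases replace address with a urls list, so check the plugin documentation for the installed version.

```toml
# Runs on a host outside the cluster, not on the data nodes
[[inputs.http_response]]
# Hypothetical load balancer hostname; replace with your own
address = "http://influx-lb.example.internal:8086/ping"
# Assumption: treat anything slower than 5s as a failed check
response_timeout = "5s"
method = "GET"
```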

Telegraf Configuration: Inputs Meta Nodes
# INPUTS
[[inputs.http_response]] #META
address = "http://<localhost>:8091/ping"
[[inputs.filestat]]
files = ["/var/lib/influxdb/meta/snapshots/*/state.bin"]
md5 = false
[[inputs.disk]]
mount_points = ["/var/lib/influxdb/meta", "/"]
http_response allows you to ping individual meta nodes and track the response output.
filestat allows you to monitor metadata snapshots.
disk allows you to configure the various volumes/mount points on disk: the location of the meta store
and root. (default config options shown)

Telegraf Configuration: Outputs
# OUTPUTS
[[outputs.influxdb]]
urls = [ "<target URL of DB>" ]
database = "telegraf"
retention_policy = "autogen"
timeout = "10s"
username = "<uname>"
password = "<pword>"
content_encoding = "gzip"
Output configuration tells Telegraf which output sink to send the data to. Multiple
output sinks can be specified in the configuration file.
** NOTE: The URL should point to the load balancer if you are storing the metrics in a
cluster.
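
To illustrate the point about multiple output sinks, here is a small sketch with two outputs in one telegraf.conf. The monitoring hostname is hypothetical, and the second sink (printing line protocol to stdout, which can help when debugging an agent) is an assumption chosen for illustration, not something the deck recommends.

```toml
# Sink 1: the monitoring InfluxDB instance (hypothetical hostname)
[[outputs.influxdb]]
urls = ["http://monitor.example.internal:8086"]
database = "telegraf"

# Sink 2 (assumption, for debugging): echo every metric to stdout
[[outputs.file]]
files = ["stdout"]
data_format = "influx"
```

Every metric is sent to every configured output unless filters such as namepass or namedrop restrict it.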

Telegraf Configuration: Gathering Logs
# INPUT
[[inputs.syslog]]
# OUTPUTS
[[outputs.influxdb]]
urls = [ "http://localhost:8086" ]
# Drop all measurements that start with "syslog"
namedrop = [ "syslog*" ]
[[outputs.influxdb]]
urls = [ "http://localhost:8086" ]
retention_policy = "14days"
# Only accept syslog data:
namepass = [ "syslog*" ]
Input configuration: add the syslog input plugin and review its settings for your environment.
Output configuration: use namepass/namedrop to direct metrics and logs to different db.rp targets.
** NOTE: The output URLs should point to the load balancer if you are storing the metrics in a
cluster.
InfluxDB can be used to capture both metrics and events. The syslog protocol is used to gather the logs.
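
As a sketch of what reviewing the syslog input settings might involve: the plugin listens on a socket, and the host's syslog daemon has to be configured separately to forward messages to it. The listen address below is an assumption for illustration.

```toml
[[inputs.syslog]]
# Assumption: listen for syslog messages over TCP on local port 6514;
# the local syslog daemon must be set up to forward to this address
server = "tcp://localhost:6514"
```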

Visualization, Monitoring, Alerting

We’ve gathered a wide variety of metrics...so now what?
• Dashboards!

Alerting: Common Metrics to Watch
• Disk Usage
• Hinted Handoff Queue
• No metrics (aka Deadman)

Disk Usage Batch Task: TICKscript
// Monitor disk usage for all hosts
var data = batch
|query('''
SELECT last(used_percent) AS used_percent
FROM "telegraf"."autogen"."disk"
WHERE ("host" =~ /prod-.*/)
AND ("path" = '/var/lib/influxdb/data'
OR "path" = '/var/lib/influxdb/wal'
OR "path" = '/var/lib/influxdb/hh'
OR "path" = '/')
''')
.period(5m)
.every(10m)
.groupBy('host', 'role', 'environment', 'device')

Disk Usage Alert: TICKscript
var warn_threshold = 85
var critical_threshold = 95
data
|alert()
.id('Host: {{ index .Tags "host" }}, Environment: {{ index .Tags "environment" }}')
.message('Alert: Disk Usage, Level: {{ .Level }}, Device: {{ index .Tags "device" }}, {{ .ID }}, Usage: %{{ index .Fields "used_percent" }}')
.warn(lambda: "used_percent" > warn_threshold)
.crit(lambda: "used_percent" > critical_threshold)
.slack()
.channel('#monitoring')

Hinted Handoff Queue Batch Task: TICKscript
// This generates alerts for high hinted-handoff queues for InfluxEnterprise
var queue_size = batch
|query('''
SELECT max(queueBytes) as "max"
FROM "telegraf"."autogen"."influxdb_hh_processor"
''')
.groupBy('host', 'cluster_id')
.period(5m)
.every(10m)
|eval(lambda: "max" / 1048576.0)
.as('queue_size_mb')

Hinted Handoff Queue Alert: TICKscript
var warn_threshold = 3500
var crit_threshold = 5000
queue_size
|alert()
.id('InfluxEnterprise/{{ .TaskName }}/{{ index .Tags "cluster_id" }}/{{ index .Tags "host" }}')
.message('Host {{ index .Tags "host" }} (cluster {{ index .Tags "cluster_id" }}) has a hinted-handoff queue size of {{ index .Fields "queue_size_mb" }}MB')
.details('')
.warn(lambda: "queue_size_mb" > warn_threshold)
.crit(lambda: "queue_size_mb" > crit_threshold)
.stateChangesOnly()
.slack()
.pagerDuty()
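
The .slack() and .pagerDuty() handlers in these alerts only deliver notifications if the matching sections are enabled in kapacitor.conf. A minimal sketch follows; the bracketed values are placeholders to fill in, and section syntax varies by Kapacitor version (older releases use a single [slack] table), so treat this as an assumption to check against the installed version.

```toml
[[slack]]
enabled = true
default = true
url = "<Slack incoming webhook URL>"
# Used when the TICKscript does not call .channel()
channel = "#monitoring"

[pagerduty]
enabled = true
service-key = "<PagerDuty integration key>"
```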

Deadman Batch Task: TICKscript
// Ensure hosts are running. If no CPU usage statistics can be retrieved,
// we assume the host has locked up, disappeared, or is otherwise unreachable
var cpu_stats = batch
|barrier().idle(5m)
|query('''
SELECT count(usage_system)
FROM "telegraf"."autogen"."cpu"
''')
.period(5m)
.every(10m)
.groupBy('cluster_id', 'host')

Deadman Alert: TICKscript
var trigger = cpu_stats
|deadman(0.0, 10m)
.id('Host: {{ index .Tags "host" }}, Cluster ID: {{ index .Tags "cluster_id" }}')
.message('Alert: Kapacitor Deadman, Level: {{ .Level }}, {{ .ID }}')
.idTag('alertID')
.messageField('message')
.durationField('duration')
.levelTag('level')
.stateChangesOnly()
.slack()
.channel('#monitoring')

Deadman Evaluate & Visualize Alert in Chronograf: TICKscript
trigger
|eval(lambda: "emitted")
.as('value')
.keep('value', 'message', 'duration')
|eval(lambda: float("value"))
.as('value')
.keep()
|influxDBOut()
.create()
.database('chronograf')
.retentionPolicy('autogen')
.measurement('alerts')
.tag('alertName', 'Deadman')
.tag('triggerType', 'deadman')
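The influxDBOut() node writes each alert to the alerts measurement in the chronograf database so that Chronograf can display it.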

Common Troubleshooting Scenarios
• OOM Loop
• Runaway Series Cardinality

Common Troubleshooting Scenarios
Workload Type
• Which type are you?
– Read heavy
– Write heavy
– Mixed?
– Establish baselines and understand “normal” using metrics and visualization
– Baselines allow you to understand change over time and help determine when it is time to scale up
Log Analysis
• Metrics First!
– Metrics highlight where you should look within the log files
• Logs allow for pinpointing the root cause of an issue observed in the metrics
– Cache max memory size
– Hinted Handoff Queue “Blocked”
IOPS & Disk Throughput
• Understand the capabilities of your hardware
– We recommend SSD-based deployments
• Deploying in an IaaS environment?
– Understand max read and write limits based on machine class and drive types; these can change as you scale!
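
The “Cache max memory size” log message and the runaway series cardinality scenario both relate to limits in the [data] section of the data node configuration. The sketch below lists the relevant settings with what are, to the best of my knowledge, the InfluxDB 1.x defaults. Treat the values as assumptions to verify against your version, not as tuning advice.

```toml
[data]
# Writes are rejected once a shard's in-memory cache reaches this size
cache-max-memory-size = "1g"
# Guards against runaway series cardinality (0 disables the limit)
max-series-per-database = 1000000
max-values-per-tag = 100000
```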

Recap
• Gather Metrics...and Logs
• Visualize, Monitor, and Alert… tune based on your environment
• Review Common Troubleshooting Scenarios
https://community.influxdata.com https://docs.influxdata.com
