OPTIMIZING THE TICK STACK

Sam Dillard, Solutions Architect
Getting Started Series
Optimizing your TICK Stack

© 2017 InfluxData. All rights reserved.
Agenda
✓ Optimizing
• Hardware/Architecture
• Schema
• Configuration
• Queries
✓ Q&A

Resource Utilization
• No specific OS tuning required
• OSS
  • 70% CPU utilization; leave headroom for:
    • Peak periods
    • Compactions
    • Supporting node failure (clusters only)
    • Backfilling data
• Enterprise (clustered)
  • 49% CPU on each node (ideal)

Telegraf
• Lightweight and written in Go
• Plug-in driven
• Optimized for writing to InfluxDB
  • Formatting
  • Retries
  • Modifiable batch sizes and jitter
  • Tag sorting
• Preprocessing
  • Converting tags to fields, fields to tags
  • Regex transformations
  • Renaming measurements, tags
• Aggregate
  • Mean, min, max, count, variance, stddev
  • Histogram
  • ValueCounter (e.g., for HTTP response codes)
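
The aggregate step listed above amounts to reducing a window of raw values to a few summary statistics before anything is written. The sketch below is plain Python, not Telegraf code. The function name `summarize_window` and the choice of population variance are assumptions made for illustration only.

```python
import statistics


def summarize_window(values):
    """Reduce one window of raw samples to summary statistics.

    Hypothetical helper: it illustrates aggregating before writing and
    does not reproduce Telegraf's own implementation.
    """
    if not values:
        raise ValueError("window must contain at least one value")
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        # Population variance is an assumption; a real aggregator may
        # use sample variance instead.
        "variance": statistics.pvariance(values),
        "stddev": statistics.pstdev(values),
    }


window = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(summarize_window(window))
```

Writing one summary point per window instead of every raw sample is what reduces the write load on the database.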

Telegraf
[Diagram: Telegraf plugin pipeline. Inputs: CPU, Mem, Disk, Docker, Kubernetes, /metrics, Kafka, MySQL. Process: transform, decorate, filter. Aggregate: mean, min, max, count, variance, stddev. Outputs: File, InfluxDB, Kafka, CloudWatch.]

[Diagram: many Telegraf agents, a message queue with Telegraf, and InfluxDB]

[Diagram: many Telegraf agents and InfluxDB]

Regard the Shard
• Shards are defined by durations
• Longer
  • Better frontend performance (writes & reads)
  • More efficient compactions
• Shorter
  • More manageable
  • More efficient drops (happens every time RPs are enforced)
  • More efficient for moving/copying
  • More efficient for recording incremental backups
• Other North Stars:
  • Each shard group has at least 100,000 points
  • Each shard group has at least 1,000 points per series
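
The two point-count targets above can be checked with simple arithmetic before choosing a shard duration. The sketch below assumes every series writes one point at a fixed interval, which real workloads only approximate. The function names and the example workload are hypothetical.

```python
def points_per_shard_group(shard_duration_s, write_interval_s, series_count):
    """Return (points per series, total points) for one shard group.

    Assumes each series writes exactly one point every
    `write_interval_s` seconds (a simplifying assumption).
    """
    per_series = shard_duration_s // write_interval_s
    return per_series, per_series * series_count


def meets_north_stars(shard_duration_s, write_interval_s, series_count):
    """Check the 100,000-points and 1,000-points-per-series targets."""
    per_series, total = points_per_shard_group(
        shard_duration_s, write_interval_s, series_count
    )
    return total >= 100_000 and per_series >= 1_000


HOUR = 60 * 60
DAY = 24 * HOUR

# Hypothetical workload: 500 series, one point every 10 seconds.
print(points_per_shard_group(DAY, 10, 500))  # 1-day shard groups
print(meets_north_stars(DAY, 10, 500))
print(meets_north_stars(HOUR, 10, 500))      # 1-hour shard groups
```

With this workload a 1-day shard group holds 8,640 points per series and passes both checks, while a 1-hour shard group holds only 360 points per series and does not.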

Schema Design Goals
• By reducing...
– Cardinality
– Information-encoding
– Key lengths
• You increase…
– Write performance
– Query performance
– Readability

Data Format
● Points are written to InfluxDB using Line Protocol, which follows this format:
<measurement>,tag-key=tag-value field-key=field-value <timestamp>
● Punctuation is paramount!
cpu_usage,host=server02,az=us-west-1b user=25.0,system=55.0 <timestamp>
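Measurement: cpu_usage
Tag set: host=server02,az=us-west-1b
Field set: user=25.0,system=55.0
A single space separates the tag set from the field set, and the field set from the timestamp.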
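
Because the punctuation matters, it can help to see a point assembled programmatically. This is a minimal sketch, not a client library: `to_line_protocol` is a hypothetical helper that handles only float fields and does no escaping of spaces, commas, or equals signs. It sorts tags by key, in the spirit of the tag sorting Telegraf performs.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as a Line Protocol string.

    Simplified on purpose: float field values only, and no escaping
    of special characters in keys or values.
    """
    if not fields:
        raise ValueError("a point needs at least one field")
    # Each tag is introduced by a comma, directly after the measurement.
    tag_part = "".join(f",{key}={value}" for key, value in sorted(tags.items()))
    # Fields are comma-separated, with no spaces inside the field set.
    field_part = ",".join(f"{key}={float(value)}" for key, value in fields.items())
    # Exactly one space before the field set and one before the timestamp.
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


line = to_line_protocol(
    "cpu_usage",
    {"host": "server02", "az": "us-west-1b"},
    {"user": 25.0, "system": 55.0},
    1444234982000000000,
)
print(line)
```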

Schema Insight
● A Measurement is a namespace for like metrics (comparable to a SQL table)
● What to make a Measurement?
○ Logically-alike metrics
○ E.g., CPU has many metrics associated with it
○ E.g., Transactions
■ “usd”, “response_time”, “duration_ms”, “timeout”, whatever else…
● What to make a Tag?
○ Metadata that describe unique sources of metrics
○ Often this translates to “things you need to `GROUP BY`”
● What to make a Field?
○ Actual metrics
○ More specifically?
■ Things you need to do math on or other operations
■ Things that have high value variance and that you don’t need to group by

DON'T ENCODE DATA INTO THE MEASUREMENT NAME
Measurement names like:
cpu.server-5.us-west value=2 1444234982000000000
cpu.server-6.us-west value=4 1444234982000000000
mem-free.server-6.us-west value=2500 1444234982000000000
Encode that information as tags:
cpu,host=server-5,region=us-west value=2 1444234982000000000
cpu,host=server-6,region=us-west value=4 1444234982000000000
mem-free,host=server-6,region=us-west value=2500 1444234982000000000
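
If existing data already uses dotted measurement names, the rewrite into tags is mechanical. The sketch below assumes names always have exactly three dot-separated parts (metric, host, region) and carry no tags of their own. `split_measurement` is a hypothetical helper, not part of any InfluxData tool.

```python
def split_measurement(line):
    """Move host and region out of a dotted measurement name into tags.

    Assumes the name has exactly three dot-separated parts and that the
    line has no existing tag set.
    """
    name, rest = line.split(" ", 1)
    metric, host, region = name.split(".")
    return f"{metric},host={host},region={region} {rest}"


for old in [
    "cpu.server-5.us-west value=2 1444234982000000000",
    "mem-free.server-6.us-west value=2500 1444234982000000000",
]:
    print(split_measurement(old))
```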

DON’T OVERLOAD TAGS
BAD:
cpu,server=localhost.us-west value=2 1444234982000000000
cpu,server=localhost.us-east value=3 1444234982000000000
GOOD: Separate out into different tags:
cpu,host=localhost,region=us-west value=2 1444234982000000000
cpu,host=localhost,region=us-east value=3 1444234982000000000

Cardinality
• The number of unique database, measurement, tag set, and field key combinations in an InfluxDB instance.
• In practice we generally just care about the tagset
For example:
Measurement:
• http
Tags:
• host -- 35 possible hosts
• appName -- 10 possible apps
• datacenter -- 4 possible DCs
Fields:
• responseCode
• responseTime
Assuming each app can reside on each host, the total possible series cardinality for this measurement is 1 * (35 * 10) + 2 = 352.
Note 1: Data center is a dependent tag. Each host can reside in only one data center, so adding data center as a tag does not increase cardinality.
Note 2: All hosts probably don’t support all 10 apps. Therefore, “real” cardinality is likely less than 352.
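
The dependent-tag point in Note 1 can be verified by enumerating tag sets. The sketch below counts only the tag-set combinations (the 35 * 10 term), not the full figure above. The host, app, and data center names and the host-to-data-center mapping are invented for illustration.

```python
from itertools import product

# Hypothetical tag values matching the counts in the example.
hosts = [f"host{i:02d}" for i in range(35)]
apps = [f"app{i}" for i in range(10)]
datacenters = [f"dc{i}" for i in range(4)]

# Dependent tag: each host lives in exactly one data center.
host_dc = {host: datacenters[i % len(datacenters)] for i, host in enumerate(hosts)}

# Unique tag sets without and with the dependent datacenter tag.
without_dc = {(host, app) for host, app in product(hosts, apps)}
with_dc = {(host, app, host_dc[host]) for host, app in product(hosts, apps)}

# For contrast: if datacenter varied independently of host.
independent = set(product(hosts, apps, datacenters))

print(len(without_dc), len(with_dc), len(independent))
```

The first two counts are both 350, so the dependent tag adds nothing. An independent four-valued tag would instead multiply the count to 1,400.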

TSI - How does it help
• Memory use:
  • TSI: 30M series = 1-2 GB
  • In-memory: 30M series = inconceivable
• Startup performance: startup time should be insignificant, even for very large indexes.
• Each shard has its own tsi1 index, while the inmem index is shared across all shards.
• Write performance: the tsi1 index only needs to consult index files relevant to the hot data.

Tuning Parameters
• max-concurrent-queries
• max-select-point
• max-select-series
• max-select-buckets
• Rate limiting compactions
  • max-concurrent-compactions
  • compact-full-write-cold-duration
  • compact-throughput (new in v1.7!)
  • compact-throughput-burst (new in v1.7!)

Tuning Parameters
• cache-max-memory-size
• cache-snapshot-memory-size
• cache-snapshot-write-cold-duration
• max-series-per-database
• max-values-per-tag
• Fine-grained auth instead of multiple databases

Queries and Shards
• Shard durations should be longer than your longest typical query
• When thinking about balancing writes/reads:
                            High Query Load      Low Query Load
High Write Load             Balanced duration    Shorter duration
Bursty or Low Write Load    Longer duration      Balanced duration

Query Performance
• Streaming functions > batch functions
  • Batch funcs
    • percentile(), spread(), stddev(), median(), mode(), holt_winters()
  • Stream funcs
    • mean(), bottom(), first(), last(), max(), top(), count(), etc.
• Distributed functions (clusters only) > local functions
  • Distributed
    • first(), last(), max(), min(), count(), mean(), sum()
  • Local
    • percentile(), derivative(), spread(), top(), bottom(), elapsed(), etc.

Query Performance
• Boundaries!
  • Time-bounding and series-bounding with `WHERE` clause
  • `SELECT *` generally not a best practice
• Agg functions instead of raw queries
  • `SELECT mean(<field>)` > `SELECT <field>`
• Subqueries
  • When appropriate, process data from an already processed subset of data
  • SELECT SUM("max") FROM (SELECT MAX("water_level") FROM "h2o_feet" WHERE time > now() - 1d GROUP BY "location")
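
Time-bounding, series-bounding, and aggregation can be combined in one query. The sketch below only builds an InfluxQL string; it does not talk to a server. `bounded_mean_query` is a hypothetical helper that does no escaping, so it is unsuitable for untrusted input, and the tag value `santa_monica` is assumed for illustration.

```python
# Template for one series-bounding condition on a tag.
TAG_CONDITION = "\"{key}\" = '{value}'"


def bounded_mean_query(measurement, field, lookback, tag_filters):
    """Build an InfluxQL query bounded by time and by tag values.

    Illustrative only: no escaping or validation is performed.
    """
    # Time bound first, then one condition per tag filter.
    conditions = [f"time > now() - {lookback}"]
    for key, value in tag_filters.items():
        conditions.append(TAG_CONDITION.format(key=key, value=value))
    where = " AND ".join(conditions)
    # Aggregate with mean() rather than selecting raw field values.
    return f'SELECT mean("{field}") FROM "{measurement}" WHERE {where}'


query = bounded_mean_query("h2o_feet", "water_level", "1d", {"location": "santa_monica"})
print(query)
```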

DON'T CREATE TOO MANY LOGICAL CONTAINERS
Or rather, don’t write to too many databases:
• Dozens of databases should be fine
• Hundreds might be okay
• Thousands probably aren't without careful design
Too many databases lead to more open files, more query iterators in RAM, and more shards expiring. Expiring shards have a non-trivial RAM and CPU cost to clean up the indices.

sam@influxdata.com @SDillard12
