This document discusses scaling issues with Graphite and solutions implemented at Similarweb to handle high volumes of metrics. Key points:
1) Graphite struggled with high IOPS and a single-threaded carbon-cache. Replacing carbon-cache with the multi-threaded go-carbon removed the CPU bottleneck, while moving to SSDs in a JBOD layout addressed the IOPS bottleneck.
2) carbon-relay was replaced with the faster C implementation carbon-c-relay to load balance metrics among go-carbon instances.
3) statsd was replaced with the C implementation statsite for better performance and capabilities like quantiles.
4) The final setup consisted of statsite sending to multiple carbon-c-relay and go-carbon instances, handling a peak of around 1M metric updates per minute with room to spare.
3. The problem
No metrics across the board
● Hard to debug issues
● No intuitive way to measure efficiency, usage
● Capacity planning?
● Dashboards
4. The problem
No metrics across the board
● We knew graphite
● We wanted statsd for application metrics
● And we heard that collectd is nice, so we installed it
500 physical machines
6. Graphite
"Store numeric time series data"
"Render graphs of this data on demand"
Example graphs: write throughput across our Hadoop fleet; ingress traffic to our load-balancing layer
15. Graphite
+ Remember the bottlenecks we had
● Carbon-cache reached 100% CPU on a single core (it's probably single threaded)
● Disks reached maximum IOPS capacity
17. Graphite
+ Carbon-cache
● Persists metrics to disk and serves the hot cache to graphite
● Python, single threaded
● So we replaced carbon-cache with go-carbon: a Golang implementation of the Graphite/Carbon server with the classic architecture: Agent -> Cache -> Persister (rough sketch below)
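To make that Agent -> Cache -> Persister split concrete, here is a rough, hypothetical Go sketch of the pipeline shape (not go-carbon's actual code): an agent parses incoming plaintext lines, a cache buffers points per metric name, and a persister periodically drains the cache and writes each metric's points in one batch.

```go
// Minimal sketch of the classic Agent -> Cache -> Persister shape that
// go-carbon follows. Illustrative only, not go-carbon's code.
package main

import (
	"fmt"
	"sync"
	"time"
)

// Point is one value of a time series.
type Point struct {
	Value float64
	Time  time.Time
}

// Cache buffers points per metric name until the persister drains them.
type Cache struct {
	mu   sync.Mutex
	data map[string][]Point
}

func NewCache() *Cache { return &Cache{data: make(map[string][]Point)} }

// Add is called by the agent for every incoming metric line.
func (c *Cache) Add(name string, p Point) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[name] = append(c.data[name], p)
}

// Drain swaps the buffer out so the persister can write it without
// blocking the agent for long.
func (c *Cache) Drain() map[string][]Point {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := c.data
	c.data = make(map[string][]Point)
	return out
}

func main() {
	cache := NewCache()
	lines := make(chan string)

	// Agent: parse plaintext carbon lines ("metric value timestamp") and
	// push them into the cache. A real agent would read from TCP/UDP.
	go func() {
		for line := range lines {
			var name string
			var value float64
			var ts int64
			if _, err := fmt.Sscanf(line, "%s %f %d", &name, &value, &ts); err == nil {
				cache.Add(name, Point{Value: value, Time: time.Unix(ts, 0)})
			}
		}
	}()

	// Persister: periodically drain the cache and write each metric's
	// points as one batch (go-carbon writes whisper files at this step).
	go func() {
		for range time.Tick(1 * time.Second) {
			for name, points := range cache.Drain() {
				fmt.Printf("persist %s: %d point(s)\n", name, len(points))
			}
		}
	}()

	lines <- "servers.web01.cpu.user 12.5 1500000000"
	lines <- "servers.web01.cpu.user 13.0 1500000060"
	time.Sleep(1500 * time.Millisecond)
}
```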
18. Graphite
+ go-carbon
The result of replacing "carbon" with "go-carbon" on a server handling up to 900 thousand metrics per minute is shown in the go-carbon README (reference below).
Reference: https://github.com/lomik/go-carbon
21. Graphite
+ IOPS
● RAID 0? The RAID controller became the bottleneck, and it wasn't enough anyway
● SSD? Yes! But one wasn't enough :(
● Hadoop inspiration! JBOD (no RAID)
● Influx? No!
24. Graphite
+ carbon-relay
● "Load balancer" between metric producers and go-
carbon instances
● Same metric is routed to the same go-carbon instance
via a consistent hashing algorithm
● But… is a single-threaded Python app so your mileage
may vary
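As the summary above notes, carbon-relay was replaced with carbon-c-relay, a fast C implementation that does the same consistent-hash routing. Below is a minimal config sketch under assumed hostnames, ports and instance labels; the carbon_ch cluster type is what pins each metric name to one go-carbon instance (for example, one instance per JBOD disk):

```
# carbon-c-relay config sketch - hostnames, ports and instance labels
# are hypothetical; adjust to your own go-carbon fleet.
cluster gocarbon
    carbon_ch
        10.0.0.11:2003=a
        10.0.0.12:2003=b
        10.0.0.13:2003=c
    ;

# route every incoming metric to the cluster above
match *
    send to gocarbon
    stop
    ;
```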
33. Graphite
+ Statsite
● Wire compatible with statsd (drop-in replacement)
● Pure C with a tight event loop (very fast)
● Low memory footprint
● Supports quantiles, histograms and much more
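Since statsite is wire compatible with statsd, producers keep sending plain name:value|type lines and only the daemon changes. A minimal Go sketch of such a producer is below; the host, port and metric names are assumptions, and it uses TCP rather than UDP in line with the backpressure tip later in the deck:

```go
// Tiny statsd-protocol producer sketch (works against statsd or statsite).
// Host, port and metric names are hypothetical.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// TCP gives backpressure; swap "tcp" for "udp" for fire-and-forget
	// (see the UDP caveats in the tuning tips).
	conn, err := net.DialTimeout("tcp", "statsite.internal:8125", 2*time.Second)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// statsd line format: <metric>:<value>|<type>
	fmt.Fprintf(conn, "web.requests:1|c\n")         // counter
	fmt.Fprintf(conn, "web.response_time:123|ms\n") // timer -> quantiles/histograms
	fmt.Fprintf(conn, "web.active_sessions:42|g\n") // gauge
}
```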
38. Graphite
+ Pros
● Beast of a graphite stack: peaked at 1M updates per minute, with room for more
● Very efficient: ~10% user-land CPU usage leaves more room for IRQs (disk, network)
● We can still scale out the whole stack with another layer of carbon-c-relay, but we never needed to go there
39. Graphite
+ Cons
● SSDs are still expensive and wear out quickly under heavy random-write workloads - less relevant on AWS :-)
● Bugs - custom components are somewhat less field tested
● Data is not highly available with JBOD
● Doing metrics right is demanding - go SaaS!
40. Graphite
+ Some tuning tips
● UDP creates correlated loss and has shitty backpressure behaviour (actually, NO backpressure). Use TCP when possible
● High-frequency UDP packets (statsite) can generate a shit-load of IRQs - balance your interrupts or enforce affinity
● High Carbon PPU (points per update) signals I/O latency
● Tune the go-carbon cache, especially if you alert on metrics (see the config sketch below)
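On that last point, the main knob is the size of go-carbon's in-memory cache: if it fills up, points queue or get dropped, and anything alerting on fresh data starts lagging. A sketch of the relevant go-carbon.conf section is below; the key names follow go-carbon's example config, so verify them against the version you run, and the values are placeholders rather than recommendations:

```
[cache]
# upper bound on buffered points (not metric names); 0 = unlimited
max-size = 1000000
# which metrics to flush first: "max" (hottest first), "sorted" or "noop"
write-strategy = "max"
```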