Monitoring with
Clickhouse
Berlin DevOps 2018-09-26
Ilya @GoEuro
GoEuro Scale:
● 20 mio+ visitors / month
● 150+ Engineers
● 300+ microservices in production
● 600+ releases per week
Monitoring in GoEuro
● Push-based
● Graphite + Grafana
● 30MBps ingress traffic
● 8 Mio data points per minute
● Tags
● Hostname as a part of each metric
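Push-based here means each host writes lines of Graphite's plaintext protocol (`metric.path value timestamp`) to carbon. A minimal sketch of that, with the hostname embedded in the metric path as on this slide — the host, port, and metric names are illustrative, not GoEuro's actual setup:

```python
import socket
import time

def format_metric(path, value, timestamp=None):
    """Render one line of Graphite's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else timestamp
    return f"{path} {value} {ts}\n"

def metric_path(service, name):
    # Hostname is part of each metric path, e.g. servers.web-1.cpu.load
    host = socket.gethostname().replace(".", "_")
    return f"servers.{host}.{service}.{name}"

def push(lines, host="graphite.local", port=2003):
    # carbon listens for plaintext metrics on TCP 2003 by default
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode())
```

Because the hostname lives inside the path, every new host fans out into a new subtree of metrics — one reason the write volume above grows so quickly.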
Common Graphite infrastructure
Evolution of our Graphite Setup
1. You start with a common Graphite Stack:
Default components, one mirror (2 replicas), no sharding
2. First performance issues:
Bigger VMs, SSD, memcached, carbon-c-relay, no sharding
* go-carbon - that could have won us some time - it’s way faster than
carbon-cache
3. Bigger performance issues:
Multiple instances, jump hash for sharding, carbonate to rebalance the
cluster, custom cleanup jobs, filling replication gaps, having to deal
with coupled reads and writes
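"Jump hash" refers to the jump consistent hash of Lamping and Veach, which relays like carbon-c-relay can use to map a metric to a shard with minimal remapping when the cluster grows. A minimal Python sketch of the algorithm; deriving the 64-bit key from the metric name via MD5 is an illustrative choice, real relays differ:

```python
import hashlib

def jump_hash(key, num_buckets):
    """Jump consistent hash: map a 64-bit integer key to a bucket.

    When the bucket count grows from n to n+1, a key either keeps
    its bucket or moves to the new bucket n -- nothing else shuffles.
    """
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * (1 << 31) / ((key >> 33) + 1))
    return b

def shard_for(metric, num_shards):
    # Illustrative: turn the metric name into a 64-bit key first
    key = int.from_bytes(hashlib.md5(metric.encode()).digest()[:8], "big")
    return jump_hash(key, num_shards)
```

That minimal-remapping property is exactly why rebalancing with carbonate is needed at all: the data written before a resize still sits on the old shard for the fraction of keys that moved.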
We are building a distributed
database, aren’t we?
Let’s look around in 2018
Criteria for a new backend:
● Replication
● Sharding
● Scaling out
● Aggregation/retention engine
● Graphite compatible for both reads and writes
● Price
● Complexity
● Monitoring
● Robustness, e.g. risk of data loss
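The aggregation/retention criterion is the Graphite-style rollup: as points age, they are collapsed into coarser intervals. A minimal sketch of that downsampling step in Python — the interval and the averaging function are illustrative defaults, mirroring what a retention engine (or ClickHouse's GraphiteMergeTree) does on merge:

```python
from collections import defaultdict

def rollup(points, interval, agg=lambda vs: sum(vs) / len(vs)):
    """Downsample (timestamp, value) points into interval-second buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Snap each timestamp down to the start of its bucket
        buckets[ts - ts % interval].append(value)
    return sorted((ts, agg(vs)) for ts, vs in buckets.items())

# e.g. two points inside the first minute collapse into their average
print(rollup([(0, 1.0), (10, 3.0), (60, 5.0)], 60))  # [(0, 2.0), (60, 5.0)]
```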
Graphite backends evaluated
● ElasticSearch - too much effort to make it scale
● KairosDB - no retention mechanism out of the box; no Graphite reader
● BigGraphite - too slow; Cassandra has a pretty steep learning curve
● Prometheus - doesn't scale out of the box; we'd have to switch the whole
company from pushing metrics to pulling them
● GlusterFS - 8x slower on writes vs. the same storage attached locally;
requires a lot of tuning
● Ceph - also too slow
● OpenTSDB - runs on HBase on top of HDFS, which makes it a super complex
choice from the start
● InfluxDB - you need to come up with an external search index
● Clickhouse - our winner
What is Clickhouse
ClickHouse is an open source column-oriented database
management system capable of real-time generation of
analytical data reports using SQL queries.
https://clickhouse.yandex/
What is Clickhouse
● Blazing Fast
● Linearly Scalable
● Hardware Efficient
● Fault Tolerant
● Sharding and replication out of the box
● Custom table engines (including GraphiteMergeTree)
Clickhouse as a Graphite backend
● Ecosystem is there
● 100% coverage of the Graphite query
language
● We had a seamless experience with lomik's
Go implementation
Downsides
● Depends on ZooKeeper for sharding and
replication (we don’t use it now)
● Sharding requires some attention
● Read queries against shards are slower
● Well known in the Russian-speaking world
but not outside it
Current performance
● Uses 2 cores and 2GB of RAM on our scale
● Graphite-web response times before and after:
Questions?