ClickHouse
Capacity Planning
Methodology for OLAP Workloads
2019-12-03 SF ClickHouse Meetup | Mik Kocikowski | mik@cloudflare.com
Intro
Cloudflare is a big and enthusiastic user of ClickHouse.
- 1PB of data per day going in (true in one sense, false in many others)
- Internal and user-facing analytics
- Operational support, data “science”, reports, etc.
Started out with a single small cluster almost 3 years ago. Grew “organically” into
a monster. Have been breaking it up into individual workloads and setting up
smaller clusters dedicated to specific products.
Purpose of capacity planning
1. Meet current needs
2. Know how to meet future needs (“100x” exercise)
Specific use case
- User-facing analytics (“how many visitors” dashboard) for a product
- HTTP-type logs coming in at 750K records/s, ~500B compressed per record
- On-prem HDD machines: 256GB RAM, 100TB disk in RAID 0
- ClickHouse server version 19.1.16, revision 54413
Decide on targets:
- P99 latency for a single SELECT query: 1s
- 150 queries/s
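A back-of-envelope check on the figures above (my arithmetic, not from the deck):

$$750{,}000\ \text{rec/s} \times 500\ \text{B} \approx 375\ \text{MB/s} \approx 32\ \text{TB/day (compressed)}$$

So a single 100TB host holds roughly three days of raw ingest before replication or on-disk overhead are counted; retention targets drive host count as much as query load does.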
tl;dr
- It is impossible to calculate cluster capacity on paper
- The only way to do it is by running real queries on real data
- But it can be done on a single host and then extrapolated
- Real data must be used in benchmarks
- Time based limits and quotas are effective
- ClickHouse documentation is good and the defaults are sane
Process
1. Benchmark the hardware
2. Max out single host using real data
3. Iterate on primary and partitioning key design
a. Sharding strategy comes here if sharding unavoidable
4. Max out the cluster until it is too slow
5. Put in per-host limits and time based quotas
Benchmark the hardware
- Get a rough idea of individual host performance (orders of magnitude)
- How long to fill a disk? How long to read it all? DO THE MATH! Is 30h to read 100TB ok? (worked through in the fio sketch below)
- How does it degrade with the disk full or when running out of IOPS?
- Make sure all hosts roughly the same (compare relative performance)
- Slowest node determines max speed of the cluster (especially for Distributed engine)
- One node faster than the rest ok; one node slower not ok
- No need to simulate exact CH disk access patterns
We ended up settling on a simple test using fio.
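The deck does not show the actual fio job, so here is a minimal sketch of the kind of sequential-throughput test that answers the questions above (path, size, and runtime are placeholders):

```bash
# Sequential read throughput on the data volume.
# Sanity check: at the ~1 GB/s a healthy HDD RAID 0 array might sustain,
# reading 100TB takes ~10^14 B / 10^9 B/s ~= 28h -- the "30h" figure above.
fio --name=seqread --filename=/data/fio.testfile --size=100G \
    --rw=read --bs=1M --direct=1 --ioengine=libaio --iodepth=16 \
    --runtime=300 --time_based --group_reporting
```

Run the same job on every host and compare relative numbers; the single slow host is the one to find.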
Max out INSERTs
- We use the HTTP interface with data formatted as RowBinary
- No performance advantage to using the “native” protocol
- No significant CPU hit on CH side due to sorting
- 1B going in != 1B on disk (on-disk compression and merges change the footprint)
- Observe disk use and calculate max possible retention
- If using “made up” data consider how it compresses
- If you can’t keep up with INSERTs then you must shard
For this use case, it turned out that a single host can handle all the INSERTs no
problem (batch size 2M rows), so there is no need to shard (split the input set into
sub-sets). Follow the instructions in the CH documentation.
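As a sketch, a batched insert over the HTTP interface looks like this (table name and payload file are illustrative, not from the deck):

```bash
# POST one ~2M-row batch, pre-encoded as RowBinary, to the HTTP port (8123).
curl -sS --data-binary @batch.rowbinary \
    'http://ch-host:8123/?query=INSERT%20INTO%20logs.http%20FORMAT%20RowBinary'
```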
Max out SELECTs
- SELECT speed depends on how many bytes need to be read from disk. The
number of part blocks that need to be read from disk is determined by:
- The primary and partitioning keys
- Number of columns SELECTed
- Index granularity
- Orders of magnitude difference for:
- Same query with different key (big user vs small user, 1 hour vs 1 month)
- Same key with different columns (count distinct HTTP status codes vs count distinct URLs)
- Throughput is a function of latency and parallelism (Little’s Law)
- The only way to know is to observe it
Max out SELECTs
Is it better to process 10 queries in parallel with each query taking 1s, or 20
queries in parallel with each query taking 2s? Latency vs throughput. Little’s Law.
- Get hold of a representative set of queries
- Get hold of a representative set of keys (user, time)
- Iterate test runs, increasing parallelism until the optimum is found
For us the “sweet spot” turned out to be parallelism 25, resulting in p95 query
latency of 465ms. That is roughly 50 queries per second per host.
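One way to run that sweep is the stock clickhouse-benchmark tool fed with the representative query set (host and file names are placeholders):

```bash
# Ramp concurrency and watch the reported QPS and latency percentiles;
# stop when p95/p99 crosses the latency target (1s for this use case).
for c in 5 10 25 50 100; do
  echo "== concurrency $c =="
  clickhouse-benchmark --host ch-host --concurrency "$c" \
      --iterations 10000 < representative_queries.sql
done
```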
Max out SELECTs
Primary and partitioning keys need to be aligned with use case.
- Primary key: (user, time)
- Partitioning key: (time,)
Exercise: consider instead using the time the record was inserted into ClickHouse:
- Less merge churn on inserts
- Deterministic SELECTs
- Results skewed by late-arriving data (but churn reduced)
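In DDL the keys above come out roughly as follows; the engine choice, table, and column names are my illustration, not from the deck:

```bash
clickhouse-client --query "
CREATE TABLE logs.http (
    user_id UInt64,   -- the 'user' part of the primary key
    ts      DateTime, -- the 'time' part of the primary key
    status  UInt16,
    url     String
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(ts)   -- partitioning key: (time,)
ORDER BY (user_id, ts)        -- primary key: (user, time)
"
```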
Max out SELECTs
- ClickHouse stores data in “parts” (files on disk)
- Each column has its own set of parts
- The index (always stored in RAM) maps primary key to record offsets
- “Marks” map record offsets to byte offsets in “parts”
To find a record, ClickHouse looks up the key in the index, then looks up the byte
offset in the mark files for each column. Keeping the marks in RAM (the “mark cache”)
makes a big difference. Marks for 70TB of data take up 70GB (you can look this up in
the filesystem). The more data, the more marks; the more columns, the more marks.
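Besides du on the .mrk files, the same number can be pulled from system tables; a sketch using the marks_bytes column of system.parts:

```bash
clickhouse-client --query "
SELECT table, formatReadableSize(sum(marks_bytes)) AS marks
FROM system.parts
WHERE active
GROUP BY table
ORDER BY sum(marks_bytes) DESC"
```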
Decide on sharding (do not shard if at all possible!)
This is a huge topic deserving a separate presentation, but...
- Naive sharding spreads all data across shards
- Need the Distributed engine to query
- Max SELECT speed determined by the slowest host
- Parallelism stays flat as hosts are added to the cluster (every query touches every shard)
- Keyed sharding puts all records for a given key into the same shard
- Need additional logic for INSERTs
- Danger of imbalance (Pareto distribution)
- Key selection determines layout: user? time?
- 2-level sharding combines naive with keyed: a cluster of clusters
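The naive and keyed strategies differ only in the sharding key of the Distributed table; a sketch (cluster, database, and table names assumed):

```bash
# Naive: spread rows randomly across shards.
clickhouse-client --query "
CREATE TABLE logs.http_dist AS logs.http
ENGINE = Distributed(my_cluster, logs, http, rand())"

# Keyed: co-locate all rows for a given user on one shard.
clickhouse-client --query "
CREATE TABLE logs.http_by_user AS logs.http
ENGINE = Distributed(my_cluster, logs, http, cityHash64(user_id))"
```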
Build the cluster
- Shards for INSERTs, replicas for SELECTs
- Remember things break
Max out the cluster until it is too slow
Experiment with a variety of loads over a period of days.
- Clickhouse degrades gracefully (hard to break)
- Adding replicas increases SELECT performance linearly
- Interaction between INSERT and SELECT must be established empirically
- Invested people are REALLY RELUCTANT to break things
- Test not just failure, but also recovery (throttle recovery?)
Put in per-host limits and quotas
User / profile limits set upper boundaries on individual queries (“up to 10M rows
per query”). Quotas set cumulative limits per time period (“up to 10GB per
minute”).
- Get the system as hot as you are willing to ever have it in production
- Run it like that for a good while
- Collect query execution time from system.query_log
- Set limits and quotas accordingly
- Separate INSERTs from SELECTs
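A sketch of the system.query_log pull that informs those numbers (the string form type = 'QueryFinish' is from recent releases; older versions use the numeric enum):

```bash
clickhouse-client --query "
SELECT user,
       count()                           AS queries,
       quantile(0.95)(query_duration_ms) AS p95_ms,
       max(query_duration_ms)            AS max_ms
FROM system.query_log
WHERE type = 'QueryFinish'
GROUP BY user"
```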
Put in per-host limits and quotas
User / profile limits set upper boundaries on individual queries.
- Use max_execution_time to prevent runaway queries
- Be aggressive; analyze the data set and query patterns
- Having an upper bound on query time makes request rate limiting possible
- Use max_rows_to_read to short circuit “impossible” queries
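The same ceilings expressed as per-session settings, as a sketch (in production they belong in the user's settings profile; table name and values are illustrative):

```bash
clickhouse-client \
    --max_execution_time=1 \
    --max_rows_to_read=10000000 \
    --query "SELECT count() FROM logs.http WHERE user_id = 42"
```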
Put in per-host limits and quotas
Quotas set cumulative limits per time period. Time-based quotas (“in 60s you can
spend up to 850s processing queries for user A and 150s processing queries for
user B”) do not “care” about the reason the quota was exceeded, which makes
them very general in application.
- Execution time is the final metric (things are either too slow or fast enough)
- Do not care why it is slow (traffic spike, broken disk, bad ToR switch)
- Slowest node is taken out of rotation first (byte-based quotas penalize the fastest instead)
We ended up with 2 users: “api” and “inserter”.
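On recent ClickHouse releases a quota like the “api” one above can be declared in SQL (on 19.x the equivalent lives in users.xml); a sketch:

```bash
clickhouse-client --query "
CREATE QUOTA api_time
FOR INTERVAL 1 MINUTE MAX execution_time = 850
TO api"
```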
After going live
If limits and quotas have been set correctly, you can “walk away” as users add
columns and queries (if you don’t care about end user experience).
- The only “hard” limit is running out of disk space
- But by that time it is likely way too late
- Repeat the capacity planning exercise for each addition (column, query)
- But you can do it on a single host
- Multiple clusters for long retention (q1 cluster, q2 cluster, etc)
- Limits and quotas are there to prevent catastrophes only
- You need layered access control and rate limiting
Things I try to remember
Given my RDBMS background, I try to remember, in no particular order:
- It is about bytes, not records
- A trivial change to a SELECT query can make it a MILLION TIMES SLOWER
- Index construction is paramount
- Execution time is a great top-level metric
- You can’t delete things
- ClickHouse is very well written by really smart people
- There is no magic (cough Distributed engine)
- Unreasonable to expect the end user to understand any of this (“but it is SQL”)
Recap
1. Remove all constraints
2. Empirically establish optimal parameters for the given workload
a. The big thing is you can do it on a single host
3. Protect these parameters with limits and quotas
Devote a million lifetimes to:
- Study of primary key construction
- Design of infinitely scalable bi-level sharding schemes
Thank you!