Your first ClickHouse data warehouse
Robert Hodges - 2 December 2020
SF Bay Area ClickHouse Meetup
Presenter and Company Bio
www.altinity.com
Enterprise provider for ClickHouse, a popular, open source data warehouse. Community sponsor and major committer to the ClickHouse project.
Robert Hodges - Altinity CEO
30+ years on DBMS plus virtualization and security. Using Kubernetes since 2018.
Introducing ClickHouse
ClickHouse is an open source data warehouse
● Single binary
● Understands SQL
● Runs on bare metal to cloud
● Stores data in columns
● Parallel and vectorized execution
● Scales to many petabytes
● Open source (Apache 2.0)
And it’s really fast!
[Diagram: four ClickHouse servers each scanning columns a, b, c, d in parallel]
Installing ClickHouse goodness on Linux
Debian packages, RPMs, and tarballs are all available.
# UBUNTU/DEBIAN INSTALL
sudo apt-get install apt-transport-https ca-certificates dirmngr
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 \
  --recv E0C56BD4
echo "deb https://repo.clickhouse.tech/deb/stable/ main/" | sudo tee \
  /etc/apt/sources.list.d/clickhouse.list
sudo apt-get update
sudo apt-get install -y clickhouse-server clickhouse-client
sudo systemctl start clickhouse-server
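Once the server is up, a quick smoke test from clickhouse-client confirms it responds:

-- Run inside clickhouse-client
SELECT version();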
ClickHouse goodness delivered by Docker
mkdir $HOME/clickhouse-data
docker run -d --name clickhouse-server \
  --ulimit nofile=262144:262144 \
  --volume=$HOME/clickhouse-data:/var/lib/clickhouse \
  -p 8123:8123 -p 9000:9000 \
  yandex/clickhouse-server
The --ulimit setting keeps ClickHouse happy, the --volume mount persists data, and the -p flags make the HTTP and native ports visible.
Is there ClickHouse cloud goodness? YES!
● Yandex Managed Service for ClickHouse -- runs in Yandex.Cloud
● Altinity.Cloud -- runs in the Amazon public cloud
Where is the documentation?
https://clickhouse.tech/
Getting started with app development
First step: The ClickHouse Tutorial
https://clickhouse.tech/docs/en/getting-started/tutorial/
Second step: Design table(s) and load data
CREATE TABLE meetup.readings (
sensor_id Int32,
time DateTime,
date Date,
temperature Decimal(5,2)
)
Engine = MergeTree
PARTITION BY toYYYYMM(time)
ORDER BY (sensor_id, time);
● Don’t stress about data types
● Use MergeTree table types
● Partition by month or day
● Sort by “keys” to find data (see the query below)
● LZ4 compression by default
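The ORDER BY and PARTITION BY choices pay off immediately at query time: a filter on sensor_id and time can use the sparse index and skip whole monthly partitions. A hypothetical query against the table above (sensor 12 is illustrative):

SELECT avg(temperature)
FROM meetup.readings
WHERE sensor_id = 12
  AND time >= '2019-01-01 00:00:00'
  AND time <  '2019-02-01 00:00:00';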
Your friend: the MergeTree table type
[Diagram: a MergeTree table is divided into parts whose rows match the PARTITION BY expression; each part holds a sparse index plus columns sorted on the ORDER BY columns and stored in compressed blocks]
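You can inspect the parts directly in the system.parts table; a minimal sketch, assuming the meetup.readings table created earlier:

SELECT partition, name, rows,
       formatReadableSize(bytes_on_disk) AS size
FROM system.parts
WHERE database = 'meetup' AND table = 'readings' AND active
ORDER BY partition;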
Popular formats for loading data
CSVWithNames
"sensor_id","time","date","temperature"
0,"2019-01-01 00:00:00","2019-01-01",43.31
0,"2019-01-01 00:01:00","2019-01-01",43.35
JSONEachRow
{"sensor_id":0,"time":"2019-01-01 00:00:00","date":"2019-01-01",...}
{"sensor_id":0,"time":"2019-01-01 00:01:00","date":"2019-01-01",...}
{"sensor_id":0,"time":"2019-01-01 00:02:00","date":"2019-01-01",...}
Loading through clickhouse-client
# Load CSV
cat readings.csv | clickhouse-client \
  --query "INSERT INTO meetup.readings FORMAT CSVWithNames"
# Load JSON
cat readings.json | clickhouse-client \
  --query "INSERT INTO meetup.readings FORMAT JSONEachRow"
Loading through table functions
-- Load using the file() table function.
sudo mkdir -p /var/lib/clickhouse/user_files
sudo chmod 777 /var/lib/clickhouse/user_files
sudo cp readings.json /var/lib/clickhouse/user_files
clickhouse-client
pika :) INSERT INTO meetup.readings
        SELECT *
        FROM file('readings.json', 'JSONEachRow',
          'sensor_id Int32, time DateTime, date Date, temperature Decimal(5,2)')
NEW: loading data from S3 (20.8+)
-- Insert from S3
INSERT INTO meetup.readings
SELECT * FROM
  s3('https://s3.us-east-1.amazonaws.com/altinity-data-1/readings.csv',
     'CSVWithNames',
     'sensor_id Int32, time DateTime, date Date, temperature Decimal(5,2)')
Third step: Go crazy with your own queries
https://clickhouse.tech/docs/en/sql-reference/statements/select/
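For example, a typical open-ended aggregation over the readings table (the one-week window is illustrative):

SELECT sensor_id,
       toStartOfDay(time) AS day,
       min(temperature) AS low,
       max(temperature) AS high,
       avg(temperature) AS mean
FROM meetup.readings
WHERE date BETWEEN '2019-01-01' AND '2019-01-07'
GROUP BY sensor_id, day
ORDER BY sensor_id, day;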
But what about client libraries??
Language            Popular Drivers
C++                 https://github.com/ClickHouse/clickhouse-cpp
Golang              https://github.com/ClickHouse/clickhouse-go
Java                https://github.com/ClickHouse/clickhouse-jdbc
ODBC                https://github.com/ClickHouse/clickhouse-odbc
Python              https://github.com/mymarilyn/clickhouse-driver
PHP and JavaScript  Use a library listed on ClickHouse.tech *or* roll your own using the ClickHouse HTTP interface
ClickHouse database self-defense
Database Choices
[Diagram: row store vs. column store; the column store is the “data warehouse” choice]
MySQL: Row Store Access
Read row data serially
[Diagram: rows a b c d e f g h i j k l m n o... stored contiguously and scanned one row at a time]
Column Store Access
Read compressed columns in parallel
[Diagram: the same rows stored column by column; only the needed columns are scanned, in parallel]
There is no penalty for wide tables
“Pay” only for the columns you read
Compression makes data even smaller
Data Type               Codec        Compression
LowCardinality(String)  (none)       LZ4
UInt32                  DoubleDelta  ZSTD(1)
Optimize compression to reduce I/O!
CREATE TABLE billy.readings (
sensor_id Int32 Codec(DoubleDelta, ZSTD(1)),
time DateTime Codec(DoubleDelta, ZSTD(1)),
date ALIAS toDate(time),
temperature Decimal(5,2) Codec(T64, ZSTD(1))
)
Engine = MergeTree
PARTITION BY toYYYYMM(time)
ORDER BY (sensor_id, time);
In the DDL above, each column pairs a codec with a compression method, and date is a computed (ALIAS) value.
Query system.columns to see compression
[Chart: per-column compression results: 3.22%, 0.13%, 3.34%, 0.14%, 43.8%, 29.3%]
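A minimal sketch of such a query, assuming the billy.readings table above (ALIAS columns are not materialized, so they report zero bytes):

SELECT name,
       formatReadableSize(data_uncompressed_bytes) AS uncompressed,
       formatReadableSize(data_compressed_bytes) AS compressed,
       round(100.0 * data_compressed_bytes / data_uncompressed_bytes, 2)
         AS pct_of_original
FROM system.columns
WHERE database = 'billy' AND table = 'readings';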
Materialized views restructure/reduce data
[Diagram: the readings table (MergeTree) ingests all sensor readings; the materialized view readings_daily_mv acts as an insert trigger that populates readings_daily, an AggregatingMergeTree holding daily max/min by sensor]
CREATE MATERIALIZED VIEW billy.readings_daily_mv
TO billy.readings_daily AS
SELECT sensor_id, date,
minState(temperature) as temp_min,
maxState(temperature) as temp_max
FROM billy.readings
GROUP BY sensor_id, date;
readings: Size 544GB, Rows 500B → readings_daily: Size 1.7GB, Rows 347M
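The TO table has to exist before the view is created. A minimal sketch of what billy.readings_daily could look like; the deck does not show this DDL, so the partitioning here is an assumption:

CREATE TABLE billy.readings_daily (
  sensor_id Int32,
  date Date,
  temp_min AggregateFunction(min, Decimal(5,2)),
  temp_max AggregateFunction(max, Decimal(5,2))
)
Engine = AggregatingMergeTree
PARTITION BY toYYYYMM(date)
ORDER BY (sensor_id, date);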
Materialized views function like indexes!
SELECT maxMerge(temp_max)
FROM billy.readings_daily
WHERE sensor_id = 55

┌─maxMerge(temp_max)─┐
│              75.91 │
└────────────────────┘

1 rows in set. Elapsed: 0.011 sec. Processed 180.22 thousand rows, 1.44 MB (15.86 million rows/s., 126.84 MB/s.)
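Aggregate states written with minState/maxState are read back with the matching -Merge combinators, as in the query above. A sketch covering both columns, assuming the readings_daily layout sketched earlier (the one-week window is illustrative):

SELECT sensor_id,
       minMerge(temp_min) AS low,
       maxMerge(temp_max) AS high
FROM billy.readings_daily
WHERE date BETWEEN '2019-01-01' AND '2019-01-07'
GROUP BY sensor_id;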
ClickHouse performance tuning is different...
The bad news…
● No query optimizer
● No EXPLAIN PLAN
● May need to move [a lot of] data for performance
The good news…
● No query optimizer!
● System log is great
● System tables are too
● Performance drivers are simple: I/O and CPU
● Constantly improving
Your friend: the ClickHouse query log
# Return log messages to clickhouse-client
clickhouse-client --send_logs_level=trace
# View all log messages on the server
sudo less /var/log/clickhouse-server/clickhouse-server.log
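If query logging is enabled (the log_queries setting), the system.query_log table records finished queries with timing and I/O statistics; a minimal sketch for spotting the slowest recent queries:

SELECT event_time, query_duration_ms, read_rows,
       formatReadableSize(read_bytes) AS read, query
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY query_duration_ms DESC
LIMIT 10;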
Strengths and weaknesses of ClickHouse
OLTP (“Online Transaction Processing”):
(-) Lots of “small” lookups
(-) Lots of updates
(-) High concurrency
(-) Consistency critical
OLAP (“Online Analytical Processing”):
(+) Very long tables
(+) Very wide tables
(+) Open-ended questions
(+) Lots of aggregates
ClickHouse >> MySQL for analytic queries
More information and references
● Community docs on ClickHouse.tech
○ Everything ClickHouse
● ClickHouse YouTube channel
○ Piles of community videos
● Altinity Blog
○ Lots of articles about ClickHouse usage
● Altinity Webinars
○ Webinars on all aspects of ClickHouse
● ClickHouse source code on GitHub
○ Check out tests for examples of detailed usage
Thank you! We’re hiring
ClickHouse: https://github.com/ClickHouse/ClickHouse
Documentation: https://clickhouse.tech
Altinity Website: https://www.altinity.com