Your first ClickHouse data warehouse
Robert Hodges - 2 December 2020
SF Bay Area ClickHouse Meetup
Presenter and Company Bio
www.altinity.com
Enterprise provider for ClickHouse, a popular, open source data warehouse. Community sponsor and major committer to the ClickHouse project.
Robert Hodges - Altinity CEO
30+ years on DBMS plus virtualization and security. Using Kubernetes since 2018.
Introducing ClickHouse
ClickHouse is an open source data warehouse
● Single binary
● Understands SQL
● Runs on bare metal to cloud
● Stores data in columns
● Parallel and vectorized execution
● Scales to many petabytes
● Open source (Apache 2.0)
And it’s really fast!
[Diagram: four ClickHouse servers each scanning columns a, b, c, d in parallel]
Installing ClickHouse goodness on Linux
Debian packages, RPMs, and tarballs are all available.
# UBUNTU/DEBIAN INSTALL
sudo apt-get install apt-transport-https ca-certificates dirmngr
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 \
  --recv E0C56BD4
echo "deb https://repo.clickhouse.tech/deb/stable/ main/" | sudo tee \
  /etc/apt/sources.list.d/clickhouse.list
sudo apt-get update
sudo apt-get install -y clickhouse-server clickhouse-client
sudo systemctl start clickhouse-server
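Once the server is up, a quick smoke test from clickhouse-client confirms it responds:

-- Run inside clickhouse-client
SELECT version();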
ClickHouse goodness delivered by Docker
mkdir $HOME/clickhouse-data
docker run -d --name clickhouse-server \
  --ulimit nofile=262144:262144 \
  --volume=$HOME/clickhouse-data:/var/lib/clickhouse \
  -p 8123:8123 -p 9000:9000 \
  yandex/clickhouse-server
The --ulimit setting keeps ClickHouse happy, the --volume mount persists data, and the -p flags make the HTTP and native ports visible.
Is there ClickHouse cloud goodness? YES!
● Yandex Managed Service for ClickHouse -- runs in Yandex.Cloud
● Altinity.Cloud -- runs in the Amazon public cloud
Where is the documentation?
https://clickhouse.tech/
Getting started with app development
First step: The ClickHouse Tutorial
https://clickhouse.tech/docs/en/getting-started/tutorial/
Second step: Design table(s) and load data
CREATE TABLE meetup.readings (
sensor_id Int32,
time DateTime,
date Date,
temperature Decimal(5,2)
)
Engine = MergeTree
PARTITION BY toYYYYMM(time)
ORDER BY (sensor_id, time);
● Don’t stress about data types
● Use MergeTree table types
● Partition by month or day
● Sort by “keys” to find data (see the query below)
● LZ4 compression by default
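The ORDER BY and PARTITION BY choices pay off immediately at query time: a filter on sensor_id and time can use the sparse index and skip whole monthly partitions. A hypothetical query against the table above (sensor 12 is illustrative):

SELECT avg(temperature)
FROM meetup.readings
WHERE sensor_id = 12
  AND time >= '2019-01-01 00:00:00'
  AND time <  '2019-02-01 00:00:00';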
Your friend: the MergeTree table type
[Diagram: a MergeTree table is divided into parts whose rows match the PARTITION BY expression; each part holds a sparse index plus columns sorted on the ORDER BY columns and stored in compressed blocks]
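You can inspect the parts directly in the system.parts table; a minimal sketch, assuming the meetup.readings table created earlier:

SELECT partition, name, rows,
       formatReadableSize(bytes_on_disk) AS size
FROM system.parts
WHERE database = 'meetup' AND table = 'readings' AND active
ORDER BY partition;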
Popular formats for loading data
CSVWithNames
"sensor_id","time","date","temperature"
0,"2019-01-01 00:00:00","2019-01-01",43.31
0,"2019-01-01 00:01:00","2019-01-01",43.35
JSONEachRow
{"sensor_id":0,"time":"2019-01-01 00:00:00","date":"2019-01-01",...}
{"sensor_id":0,"time":"2019-01-01 00:01:00","date":"2019-01-01",...}
{"sensor_id":0,"time":"2019-01-01 00:02:00","date":"2019-01-01",...}
Loading through clickhouse-client
# Load CSV
cat readings.csv | clickhouse-client \
  --query "INSERT INTO meetup.readings FORMAT CSVWithNames"
# Load JSON
cat readings.json | clickhouse-client \
  --query "INSERT INTO meetup.readings FORMAT JSONEachRow"
Loading through table functions
-- Load using the file() table function.
sudo mkdir -p /var/lib/clickhouse/user_files
sudo chmod 777 /var/lib/clickhouse/user_files
sudo cp readings.json /var/lib/clickhouse/user_files
clickhouse-client
pika :) INSERT INTO meetup.readings
        SELECT *
        FROM file('readings.json', 'JSONEachRow',
          'sensor_id Int32, time DateTime, date Date, temperature Decimal(5,2)')
NEW: loading data from S3 (20.8+)
-- Insert from S3
INSERT INTO meetup.readings
SELECT * FROM
  s3('https://s3.us-east-1.amazonaws.com/altinity-data-1/readings.csv',
     'CSVWithNames',
     'sensor_id Int32, time DateTime, date Date, temperature Decimal(5,2)')
Third step: Go crazy with your own queries
https://clickhouse.tech/docs/en/sql-reference/statements/select/
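For example, a typical open-ended aggregation over the readings table (the one-week window is illustrative):

SELECT sensor_id,
       toStartOfDay(time) AS day,
       min(temperature) AS low,
       max(temperature) AS high,
       avg(temperature) AS mean
FROM meetup.readings
WHERE date BETWEEN '2019-01-01' AND '2019-01-07'
GROUP BY sensor_id, day
ORDER BY sensor_id, day;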
But what about client libraries??
Language            Popular Drivers
C++                 https://github.com/ClickHouse/clickhouse-cpp
Golang              https://github.com/ClickHouse/clickhouse-go
Java                https://github.com/ClickHouse/clickhouse-jdbc
ODBC                https://github.com/ClickHouse/clickhouse-odbc
Python              https://github.com/mymarilyn/clickhouse-driver
PHP and JavaScript  Use a library listed on ClickHouse.tech *or* roll your own using the ClickHouse HTTP interface
ClickHouse database self-defense
Database Choices
[Diagram: row store vs. column store; the column store is the “data warehouse” choice]
MySQL: Row Store Access
Read row data serially
[Diagram: rows a b c d e f g h i j k l m n o... stored contiguously and scanned one row at a time]
Column Store Access
Read compressed columns in parallel
[Diagram: the same rows stored column by column; only the needed columns are scanned, in parallel]
There is no penalty for wide tables
“Pay” only for the columns you read
Compression makes data even smaller
Data Type               Codec        Compression
LowCardinality(String)  (none)       LZ4
UInt32                  DoubleDelta  ZSTD(1)
Optimize compression to reduce I/O!
CREATE TABLE billy.readings (
sensor_id Int32 Codec(DoubleDelta, ZSTD(1)),
time DateTime Codec(DoubleDelta, ZSTD(1)),
date ALIAS toDate(time),
temperature Decimal(5,2) Codec(T64, ZSTD(1))
)
Engine = MergeTree
PARTITION BY toYYYYMM(time)
ORDER BY (sensor_id, time);
In the DDL above, each column pairs a codec with a compression method, and date is a computed (ALIAS) value.
Query system.columns to see compression
[Chart: per-column compression results: 3.22%, 0.13%, 3.34%, 0.14%, 43.8%, 29.3%]
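A minimal sketch of such a query, assuming the billy.readings table above (ALIAS columns are not materialized, so they report zero bytes):

SELECT name,
       formatReadableSize(data_uncompressed_bytes) AS uncompressed,
       formatReadableSize(data_compressed_bytes) AS compressed,
       round(100.0 * data_compressed_bytes / data_uncompressed_bytes, 2)
         AS pct_of_original
FROM system.columns
WHERE database = 'billy' AND table = 'readings';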
Materialized views restructure/reduce data
[Diagram: the readings table (MergeTree) ingests all sensor readings; the materialized view readings_daily_mv acts as an insert trigger that populates readings_daily, an AggregatingMergeTree holding daily max/min by sensor]
CREATE MATERIALIZED VIEW billy.readings_daily_mv
TO billy.readings_daily AS
SELECT sensor_id, date,
minState(temperature) as temp_min,
maxState(temperature) as temp_max
FROM billy.readings
GROUP BY sensor_id, date;
readings: Size 544GB, Rows 500B → readings_daily: Size 1.7GB, Rows 347M
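The TO table has to exist before the view is created. A minimal sketch of what billy.readings_daily could look like; the deck does not show this DDL, so the partitioning here is an assumption:

CREATE TABLE billy.readings_daily (
  sensor_id Int32,
  date Date,
  temp_min AggregateFunction(min, Decimal(5,2)),
  temp_max AggregateFunction(max, Decimal(5,2))
)
Engine = AggregatingMergeTree
PARTITION BY toYYYYMM(date)
ORDER BY (sensor_id, date);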
Materialized views function like indexes!
SELECT maxMerge(temp_max)
FROM billy.readings_daily
WHERE sensor_id = 55

┌─maxMerge(temp_max)─┐
│              75.91 │
└────────────────────┘

1 rows in set. Elapsed: 0.011 sec. Processed 180.22 thousand rows, 1.44 MB (15.86 million rows/s., 126.84 MB/s.)
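Aggregate states written with minState/maxState are read back with the matching -Merge combinators, as in the query above. A sketch covering both columns, assuming the readings_daily layout sketched earlier (the one-week window is illustrative):

SELECT sensor_id,
       minMerge(temp_min) AS low,
       maxMerge(temp_max) AS high
FROM billy.readings_daily
WHERE date BETWEEN '2019-01-01' AND '2019-01-07'
GROUP BY sensor_id;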
ClickHouse performance tuning is different...
The bad news…
● No query optimizer
● No EXPLAIN PLAN
● May need to move [a lot of] data for performance
The good news…
● No query optimizer!
● System log is great
● System tables are too
● Performance drivers are simple: I/O and CPU
● Constantly improving
Your friend: the ClickHouse query log
# Return log messages to clickhouse-client
clickhouse-client --send_logs_level=trace
# View all log messages on the server
sudo less /var/log/clickhouse-server/clickhouse-server.log
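If query logging is enabled (the log_queries setting), the system.query_log table records finished queries with timing and I/O statistics; a minimal sketch for spotting the slowest recent queries:

SELECT event_time, query_duration_ms, read_rows,
       formatReadableSize(read_bytes) AS read, query
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY query_duration_ms DESC
LIMIT 10;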
Strengths and weaknesses of ClickHouse
OLTP (“Online Transaction Processing”):
(-) Lots of “small” lookups
(-) Lots of updates
(-) High concurrency
(-) Consistency critical
OLAP (“Online Analytical Processing”):
(+) Very long tables
(+) Very wide tables
(+) Open-ended questions
(+) Lots of aggregates
ClickHouse >> MySQL for analytic queries
More information and references
● Community docs on ClickHouse.tech
○ Everything ClickHouse
● ClickHouse YouTube channel
○ Piles of community videos
● Altinity Blog
○ Lots of articles about ClickHouse usage
● Altinity Webinars
○ Webinars on all aspects of ClickHouse
● ClickHouse source code on GitHub
○ Check out tests for examples of detailed usage
Thank you! We’re hiring
ClickHouse: https://github.com/ClickHouse/ClickHouse
Documentation: https://clickhouse.tech
Altinity Website: https://www.altinity.com