Presenter Bios
Robert Hodges - Altinity CEO
30+ years working on DBMSs, plus
virtualization and security.
ClickHouse is DBMS #20
James Hartig - Admiral CTO
Co-founder of Admiral, currently
working on distributed systems
in Golang to build the Admiral
platform
Company Intros
www.altinity.com
Leading software and services
provider for ClickHouse
Major committer and community
sponsor in US and Western Europe
www.getadmiral.com
The Visitor Relationship Management
Company
A single platform to help publishers
grow visitor relationships and revenue
Admiral Overview
● Sustainable publishing through relationships
● Subscriptions
● Engagement
○ Email newsletter
○ Adblocking
● Privacy (GDPR + CCPA)
● Simple one-tag installation
Custom Experiences
● Custom design
● Elaborate frequencies
● Targeting on:
○ Referrers
○ Subscription State
○ Geo
○ Key-Value pairs
● Targeting performed in real-time
○ Without any code changes for publisher
Targeting In Action
1. User visits publisher’s site
2. JS collects data points about visit
3. Request to Front-End Node
4. Collect recent months of user events
5. Send everything to targeting
(Diagram: the Front-End Node (FEN) sits between the publisher's site and the targeting service, forwarding the URL, environment, key-value pairs, and event history to targeting.)
User Event Storage
● Pageview, Engage, Subscribe, Consent, etc
● Generating over 2,500 events a second
● Requires fast lookups for targeting
● Long-term storage for case studies and product development
● Aggregate events to build a session
● Chose ClickHouse
○ Long-term storage on HDD
○ Fast lookups with SSD + in-memory cache
○ Materialized views for storing a queue of events
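As a sketch, the event storage described above might look like the following in ClickHouse. Column names, types, and the sort key are assumptions for illustration, not Admiral's actual schema:

```sql
-- Hypothetical event table; columns follow the fields shown in this deck.
CREATE TABLE events (
    time DateTime,
    user UUID,
    site LowCardinality(String),
    type UInt8,              -- Pageview, Engage, Subscribe, Consent, ...
    url String
) ENGINE = MergeTree
PARTITION BY toYYYYMM(time)
ORDER BY (user, site, time);
```

Sorting by (user, site, time) is one way to support the fast per-user lookups targeting needs.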
Tech Stack
● GCP (Compute + PubSub + Memorystore)
● Go backend
● Microservice architecture
● 5 regions across 3 continents
● Over 10,000 HTTP requests per second
● Less than 250 VMs
ClickHouse
features that
enable Admiral
MergeTree is the workhorse ClickHouse table
-- Create table
CREATE TABLE mt (
`key` UInt32,
`value` Int32
) ENGINE = MergeTree()
PARTITION BY tuple() ORDER BY key
-- Add data.
INSERT INTO mt VALUES (1, 1);
INSERT INTO mt VALUES (1, -1);
SummingMergeTree is a useful variant
-- Create table with same schema
CREATE TABLE smt AS mt
ENGINE = SummingMergeTree()
ORDER BY key
-- Add data and select
INSERT INTO smt SELECT * FROM mt
-- When you select with FINAL, “zero” rows disappear!
SELECT key, sum(value) FROM smt FINAL GROUP BY key
0 rows in set. Elapsed: 0.001 sec.
Compression and codecs are configurable
CREATE TABLE test_codecs (
a_lz4 String CODEC(LZ4),
a_zstd String DEFAULT a_lz4 CODEC(ZSTD),
a_lc_lz4 LowCardinality(String) DEFAULT a_lz4 CODEC(LZ4),
a_lc_zstd LowCardinality(String) DEFAULT a_lz4 CODEC(ZSTD)
)
Engine = MergeTree
PARTITION BY tuple() ORDER BY tuple();
Effect on storage size is dramatic
(Chart: compressed size as a percentage of uncompressed data for the columns above: 20.84%, 12.28%, 10.61%, 10.65%, and 7.89%.)
Materialized views reorganize data for speed
ClickHouse materialized views are synchronous
post-insert triggers
Common uses:
● Aggregation
● Automatic reads from Kafka
● Build pipelines using chained views
● Pre-computing last-point queries
● Changing sorting or primary key
○ (Similar to Vertica projections)
(Diagram: an INSERT into the cpu table (MergeTree) fires the cpu_last_point_mv materialized view, whose SELECT acts as a trigger that populates cpu_last_point_agg (SummingMergeTree). The last-point table is ~0.0009% of the source size compressed, ~0.002% uncompressed.)
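The last-point pattern above can be sketched as follows. Table names follow the diagram; the measurement columns are assumed for illustration:

```sql
-- Source table of raw CPU measurements (columns assumed).
CREATE TABLE cpu (
    created_at DateTime,
    tags_id String,
    usage_user Float64
) ENGINE = MergeTree
PARTITION BY toYYYYMM(created_at)
ORDER BY (tags_id, created_at);

-- Target table holds aggregate states, one row per tags_id after merges.
-- SummingMergeTree combines AggregateFunction columns like AggregatingMergeTree.
CREATE TABLE cpu_last_point_agg (
    tags_id String,
    max_created_at AggregateFunction(max, DateTime),
    usage_user_state AggregateFunction(argMax, Float64, DateTime)
) ENGINE = SummingMergeTree
ORDER BY tags_id;

-- The materialized view fires on each INSERT into cpu.
CREATE MATERIALIZED VIEW cpu_last_point_mv TO cpu_last_point_agg AS
SELECT
    tags_id,
    maxState(created_at) AS max_created_at,
    argMaxState(usage_user, created_at) AS usage_user_state
FROM cpu
GROUP BY tags_id;

-- Read the last point per series by merging the states.
SELECT
    tags_id,
    maxMerge(max_created_at) AS last_time,
    argMaxMerge(usage_user_state) AS last_usage
FROM cpu_last_point_agg
GROUP BY tags_id;
```

The read touches only one small row per series instead of scanning the full history, which is where the huge size reduction comes from.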
Clusters enable horizontal scaling
(Diagram: hosts arranged in a grid of shards and replicas. Replicas help with concurrency; shards add IOPS.)
More table engines to enable clustering
● Distributed: “umbrella” table that knows the location of shards and replicas
● ReplicatedMergeTree: table that automatically propagates changes to the other replicas in its shard
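A sketch of the two engines together; the cluster name, database, ZooKeeper path, and macros are placeholders, not Admiral's actual configuration:

```sql
-- Local replicated table, created on every node of the cluster.
CREATE TABLE events_local ON CLUSTER my_cluster (
    time DateTime,
    user UUID,
    value Int32
) ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/events_local', '{replica}')
PARTITION BY toYYYYMM(time)
ORDER BY (user, time);

-- Umbrella table that routes reads and writes across the shards.
CREATE TABLE events_all ON CLUSTER my_cluster AS events_local
ENGINE = Distributed(my_cluster, default, events_local, rand());
```

Writes to events_all are sharded by the last argument (here rand()); reads fan out to every shard and merge the results.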
ClickHouse distributes queries over shards
(Diagram: the application queries the Distributed table ontime, which forwards the innermost subselect to each shard's ontime_local table. Each shard computes AggregateStates locally; the aggregates are merged on the initiator node.)
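For example, assuming ontime is the Distributed table over ontime_local from the diagram, a simple aggregation runs in parallel on every shard:

```sql
-- Each shard scans its own ontime_local and computes partial
-- aggregate states; the initiator node merges them and sorts.
SELECT Carrier, count() AS flights
FROM ontime
GROUP BY Carrier
ORDER BY flights DESC
LIMIT 5;
```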
Read performance using distributed tables
● Best case performance is linear
with number of nodes
● For fast queries network latency
may dominate parallelization
Tiered storage matches storage to access
(Diagram: for time-series data, 95% of queries touch the last day, 4% the last month, and 1% the last year. Recent data belongs on high-IOPS NVMe/SSD; older data on high-density HDD.)
Storage configurations enable tiering
(Diagram: disks default, data1, and data2, backed by /data1, /data2, and /data3, are grouped by the 'tiered' policy into 'fast' and 'slow' volumes.)
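A server-side storage configuration matching the diagram might look roughly like this; the mount paths are assumptions, and older releases (such as the 20.1.x mentioned below) use `<yandex>` as the root tag instead of `<clickhouse>`:

```xml
<!-- Sketch of a config.d/storage.xml fragment; paths are illustrative. -->
<clickhouse>
  <storage_configuration>
    <disks>
      <data1><path>/data2/clickhouse/</path></data1>
      <data2><path>/data3/clickhouse/</path></data2>
    </disks>
    <policies>
      <tiered>
        <volumes>
          <fast>
            <disk>default</disk>
          </fast>
          <slow>
            <disk>data1</disk>
            <disk>data2</disk>
          </slow>
        </volumes>
      </tiered>
    </policies>
  </storage_configuration>
</clickhouse>
```

The default disk comes from the server's main data path, so only the extra disks need entries; tables opt in with SETTINGS storage_policy = 'tiered', as in the next slide.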
CREATE TABLE fast_readings (
sensor_id Int32 Codec(DoubleDelta, LZ4),
time DateTime Codec(DoubleDelta, LZ4),
date ALIAS toDate(time),
temperature Decimal(5,2) Codec(T64, LZ4)
) Engine = MergeTree
PARTITION BY toYYYYMM(time)
ORDER BY (sensor_id, time)
TTL time + INTERVAL 1 DAY TO VOLUME 'slow',
time + INTERVAL 1 YEAR DELETE
SETTINGS storage_policy = 'tiered'
TTLs control flow in storage
Available in
version 20.1.x
Admiral’s Path
to ClickHouse
First Attempt: Sharded Mongo
● Familiar database
● Sharded Mongo by region
● Flexible data structures
● Large documents
○ Hard to prune old visits
○ Huge indexes (long rebuilds during scaling)
● Primary/secondary/mongos
○ Complicated deployment/updates
○ Vertical Scaling
● Bugs encountered with sharding
○ Shard boundaries
○ Cleanup after shard split
https://jira.mongodb.org/browse/SERVER-38971, https://jira.mongodb.org/browse/SERVER-38969
{
"_id": "alex",
"dc": "gce-us-east1",
"site": "games",
"events": [
{
"time": "10:01",
"type": "visit",
"url": "...",
...
},
{
"time": "10:02",
"type": "engage",
"url": "...",
...
},
...
],
"lastEvent": "10:02"
}
Current: ClickHouse + Redis
● MVs and time-based parts
● Horizontal scaling
○ Rolling updates without downtime
○ Manual intervention needed to add a new replica
● High compression ratio
● 50% of RAM dedicated to uncompressed cache
● 3 ClickHouse servers per region
● Memorystore (Redis) cluster per region
○ Synchronously add into Redis
○ Asynchronously send to Pub/Sub
Per Region
ClickHouse Storage
● Inserts are batched into “events”
○ Spinning HDD for cost
● Materialized Views create 2 other rows
○ Pending user count
○ Smaller SSD events table
● Fast reads from SSD for targeting
● Future: TTL-based tiered storage
Time User Site Type URL ...
10:01 Alex Games Visit ... ...
10:02 Alex Games Engage ... ...
10:25 Marie News Visit ... ...
Hour User Site Pending
10:00 Joe News 1
10:00 Alex Games 5
10:00 Marie News 4
Performance
● CH: 95th percentile <20ms, 50th percentile 7ms
○ Goal was 100ms for 95th percentile
○ Decreased 95th percentile compared to MongoDB
● Redis: 95th percentile <12ms, 50th percentile 3ms
● Over 1,000 CH queries/sec globally
○ >400 queries/sec in busiest region
● ~50% of queries hit ClickHouse
○ Tail of events kept in Redis to know if full history is cached
○ 85%-90% uncompressed cache hit rate
Compression
● Global ZSTD Level 1
○ Optimized for speed
○ Future: Per-column compression levels
● LowCardinality type
○ Dictionary with stored positions
○ Country
○ Site
○ Engagement ID
○ 99%+ compression
SELECT
name, type,
1 - (data_compressed_bytes /
data_uncompressed_bytes)
FROM system.columns
WHERE table = ?
ORDER BY data_uncompressed_bytes DESC
user_agent String 0.95936
url String 0.65275
user UUID 0.75514
type UInt8 0.98113
User Sessions
● Session ends after 30 minutes of inactivity
○ Or midnight
● Materialized view into SummingMergeTree
○ Sums value for same primary key
○ Deletes rows with 0 value
● Every hour fetch events for the user
○ Decide if session ended
○ Insert negative value
○ After merge row is removed
10:00 Alex Games -5
Hour User Site Pending
10:00 Joe News 1
10:00 Alex Games 5
10:00 Marie News 4
Hour User Site Pending
10:00 Joe News 1
10:00 Marie News 4
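The session-ending pattern above can be sketched as follows. Table and column names follow the slides; the types are assumed:

```sql
-- Pending counts per session key; rows with the same key are summed on merge.
CREATE TABLE pending_users (
    hour DateTime,
    user String,
    site LowCardinality(String),
    pending Int32
) ENGINE = SummingMergeTree
ORDER BY (hour, user, site);

-- The hourly job ends Alex's session by inserting a negative row.
-- Once parts merge, 5 + (-5) = 0 and the row is deleted entirely.
INSERT INTO pending_users VALUES ('2020-04-01 10:00:00', 'Alex', 'Games', -5);
```

Until a merge happens, reads must GROUP BY the key and filter with HAVING pending > 0, exactly as in the queue query on the previous slide.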
Queue in ClickHouse
SELECT groupArray(partition)
FROM system.parts
WHERE active AND database = ? AND table = ? AND rows > 0
SELECT
hour, user, site, sum(pending) as pending
FROM pending_users
PREWHERE hour = ?
GROUP BY hour, user, site
HAVING pending > 0
ORDER BY (user, site);
Queue in ClickHouse
Each hour the number of table parts decreases as rows are removed and merged. At
midnight all sessions expire and the number of parts drops dramatically.
Future ClickHouse Usage
● Public/Internal Alerts
○ Signed up, enabled feature, etc. sent to Slack
○ Popular article (realtime optimizations)
● Publisher Analytics
○ Currently storing >100TB in Bigtable
○ Remove hourly aggregation into Mongo
○ SQL instead of custom query language
● Audit Logging
Wrap-up
Takeaways
● Multiple benefits with switching to ClickHouse:
○ Expanded storage capacity
○ Increased scaling and performance
○ Reduced complexity in deployments
● First non-MongoDB datastore
● Shoutout to Altinity
○ Customized Training
○ POC assistance (design and schema optimization)
● Expanding to new projects
Thank you!
Admiral:
https://www.getadmiral.com
Altinity:
https://www.altinity.com

Big Data in Real-Time: How ClickHouse powers Admiral's visitor relationships for publishers
