Building an Analytic Extension to MySQL with ClickHouse and Open Source
In this webinar, Percona and Altinity offer tips on how to recognize when MySQL is overburdened with analytics and can benefit from ClickHouse's unique capabilities. They also walk through important patterns for integrating MySQL and ClickHouse, enabling powerful and cost-efficient applications that leverage the strengths of both databases.
13. Leveraging Analytical Benefits of ClickHouse
● Identify databases/tables in MySQL to be replicated
● Create schemas/databases in ClickHouse (see the sketch after this list)
● Transfer data from MySQL to ClickHouse
https://github.com/Altinity/clickhouse-sink-connector
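The schema/database-creation step can be done up front; a minimal sketch, with a hypothetical database name (the initial-load scripts and the sink connector can also auto-create these):

-- Run on ClickHouse before the initial load (hypothetical database name).
CREATE DATABASE IF NOT EXISTS employees;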
14. Fully wired, continuous replication
[Diagram: an OLTP app writes to MySQL; the MySQL binlog feeds Debezium, which publishes an event stream to Kafka*; the Altinity Sink Connector consumes the stream and writes to ClickHouse table engine(s) (ReplacingMergeTree), seeded by an initial dump/load; an analytic app queries ClickHouse.]
*Including Pulsar and RedPanda
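Because the target tables use the ReplacingMergeTree engine keyed on a version column, updates and deletes arrive as new row versions that are collapsed in the background. A minimal query sketch, assuming the employees_predated table shown later and a hypothetical key value:

-- FINAL collapses rows to the latest _version per ORDER BY key at query time.
SELECT *
FROM employees_predated FINAL
WHERE emp_no = 10001;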
16. 1. Initial Dump/Load
Why do we need custom load/dump tools?
● Data type limits and data types are not the same in MySQL and ClickHouse: max Date is 9999-12-31 in MySQL but 2299-12-31 (Date32) in ClickHouse (see the sketch after this list).
● Read the MySQL schema and translate it into a ClickHouse schema (identify the PK and partitioning and translate them to ORDER BY in ClickHouse ReplacingMergeTree).
● Faster transfer, leveraging existing MySQL and ClickHouse tools.
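A minimal sketch of the Date range mismatch (the exact out-of-range behavior may vary by ClickHouse version):

-- MySQL: DATE supports values up to 9999-12-31.
SELECT CAST('9999-12-31' AS DATE);  -- 9999-12-31

-- ClickHouse: Date32 tops out at 2299-12-31; out-of-range values are
-- clamped to the boundary on recent versions.
SELECT toDate32('9999-12-31');      -- 2299-12-31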
20. 1. Initial Dump/Load
ClickHouse:
CREATE TABLE IF NOT EXISTS `employees_predated` (
`emp_no` int NOT NULL,
`birth_date` Date32 NOT NULL,
`first_name` varchar(14) NOT NULL,
`last_name` varchar(16) NOT NULL,
`gender` enum('M','F') NOT NULL,
`hire_date` Date32 NOT NULL,
`salary` bigint unsigned NULL,
`num_years` tinyint unsigned NULL,
`bonus` mediumint unsigned NULL,
`small_value` smallint unsigned NULL,
`int_value` int unsigned NULL,
`discount` bigint NULL,
`num_years_signed` tinyint NULL,
`bonus_signed` mediumint NULL,
`small_value_signed` smallint NULL,
`int_value_signed` int NULL,
`last_modified_date_time` DateTime64(0) NULL,
`last_access_time` String NULL,
`married_status` char(1) NULL,
`perDiemRate` decimal(30,12) NULL,
`hourlyRate` double NULL,
`jobDescription` text NULL,
`updated_time` String NULL,
`bytes_date` longblob NULL,
`binary_test_column` varbinary(255) NULL,
`blob_med` mediumblob NULL,
`blob_new` blob NULL,
`_sign` Int8 DEFAULT 1,
`_version` UInt64 DEFAULT 0
) ENGINE = ReplacingMergeTree(_version) ORDER BY (`emp_no`)
SETTINGS index_granularity = 8192;
MySQL:
CREATE TABLE `employees_predated` (
`emp_no` int NOT NULL,
`birth_date` date NOT NULL,
`first_name` varchar(14) NOT NULL,
`last_name` varchar(16) NOT NULL,
`gender` enum('M','F') NOT NULL,
`hire_date` date NOT NULL,
`salary` bigint unsigned DEFAULT NULL,
`num_years` tinyint unsigned DEFAULT NULL,
`bonus` mediumint unsigned DEFAULT NULL,
`small_value` smallint unsigned DEFAULT NULL,
`int_value` int unsigned DEFAULT NULL,
`discount` bigint DEFAULT NULL,
`num_years_signed` tinyint DEFAULT NULL,
`bonus_signed` mediumint DEFAULT NULL,
`small_value_signed` smallint DEFAULT NULL,
`int_value_signed` int DEFAULT NULL,
`last_modified_date_time` datetime DEFAULT NULL,
`last_access_time` time DEFAULT NULL,
`married_status` char(1) DEFAULT NULL,
`perDiemRate` decimal(30,12) DEFAULT NULL,
`hourlyRate` double DEFAULT NULL,
`jobDescription` text,
`updated_time` timestamp NULL DEFAULT NULL,
`bytes_date` longblob,
`binary_test_column` varbinary(255) DEFAULT NULL,
`blob_med` mediumblob,
`blob_new` blob,
PRIMARY KEY (`emp_no`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
COLLATE=utf8mb4_0900_ai_ci
/*!50100 PARTITION BY RANGE (`emp_no`)
(PARTITION p1 VALUES LESS THAN (1000) ENGINE = InnoDB,
PARTITION p2 VALUES LESS THAN MAXVALUE ENGINE = InnoDB)
*/
21. 2. Validate Data
Why is a basic count check not enough?
● It is essential to validate the values themselves, for example decimal/floating-point precision and data type limits.
● Data types are different between MySQL and ClickHouse.
Solution: an MD5 checksum of column data (courtesy: Sisense), sketched in SQL after the steps below:
1. Take the MD5 of each column. Use a space for NULL values.
2. Concatenate those results, and MD5 this result.
3. Split into 4 8-character hex strings.
4. Convert into 32-bit integers and sum.
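A minimal sketch of this checksum in MySQL, assuming the employees_predated table and only two columns for brevity (the db_compare scripts below generate the full column list automatically):

SELECT SUM(
    CONV(SUBSTRING(row_hash, 1, 8), 16, 10)
  + CONV(SUBSTRING(row_hash, 9, 8), 16, 10)
  + CONV(SUBSTRING(row_hash, 17, 8), 16, 10)
  + CONV(SUBSTRING(row_hash, 25, 8), 16, 10)) AS table_checksum
FROM (
  SELECT MD5(CONCAT(
    COALESCE(MD5(emp_no), ' '),     -- step 1: MD5 per column, space for NULL
    COALESCE(MD5(first_name), ' ')  -- step 2: concatenate, then MD5 again
  )) AS row_hash                    -- steps 3-4 happen in the outer query
  FROM employees_predated
) AS t;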
python db_compare/mysql_table_checksum.py --mysql_host localhost \
  --mysql_user root --mysql_password root \
  --mysql_database menagerie --tables_regex "^pet" --debug_output

python db_compare/clickhouse_table_checksum.py --clickhouse_host localhost \
  --clickhouse_user root --clickhouse_password root \
  --clickhouse_database menagerie --tables_regex "^pet" --debug_output

diff out.pet.ch.txt out.pet.mysql.txt | grep -E "<|>"
Credits: Arnaud
22. 3. Set Up CDC Replication
[Diagram: MySQL (binlog file: mysql.bin.00001, binlog position: 100002, or GTID: 1233:223232323) → Debezium → Kafka* event stream → Altinity Sink Connector → ClickHouse]
Set up Debezium to start from a binlog file/position or GTID:
https://github.com/Altinity/clickhouse-sink-connector/blob/develop/doc/debezium_setup.md
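The starting coordinates come from the point at which the initial dump was taken. A minimal sketch of capturing them on MySQL (mysqlsh and mydumper also record equivalent coordinates in their dump metadata):

-- Capture the binlog coordinates to hand to Debezium.
SHOW MASTER STATUS;             -- File, Position, Executed_Gtid_Set
SELECT @@global.gtid_executed;  -- GTID alternative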
23. Final step - Deploy
● Docker Compose (Debezium Strimzi, Sink Strimzi)
https://hub.docker.com/repository/docker/altinity/clickhouse-sink-connector
● Kubernetes (Docker images)
● JAR file
24. Simplified Architecture
[Diagram: MySQL (binlog file: mysql.bin.00001, binlog position: 100002, or GTID: 1233:223232323) → Debezium + Altinity Sink Connector, combined into one executable running as one service → ClickHouse]
25. Final step - Monitor
● Monitor Lag
● Connector Status
● Kafka monitoring
● CPU/Memory Stats
30. ALTER TABLE support

MySQL                                     | ClickHouse
ADD COLUMN <col_name> varchar(1000) NULL  | ADD COLUMN <col_name> Nullable(String)
ADD INDEX, type btree                     | ADD INDEX, type minmax
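A sketch of this mapping as full statements, using the employees_predated table and hypothetical column/index names:

-- MySQL:
ALTER TABLE employees_predated ADD COLUMN notes varchar(1000) NULL;
ALTER TABLE employees_predated ADD INDEX salary_idx (salary) USING BTREE;

-- Corresponding ClickHouse DDL (a btree index maps to a skip index such as minmax):
ALTER TABLE employees_predated ADD COLUMN notes Nullable(String);
ALTER TABLE employees_predated ADD INDEX salary_idx salary TYPE minmax GRANULARITY 1;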
32. Replicating Schema Changes
● Debezium does not provide events for all DDL changes.
● Complete DDL is only available in a separate topic (not a SinkRecord).
● Parallel Kafka workers might process messages out of order.
34. Where can I get more information?
Altinity Sink Connector for ClickHouse
https://github.com/Altinity/clickhouse-sink-connector
https://github.com/ClickHouse/ClickHouse
https://github.com/mydumper/mydumper
35. Project Roadmap and Next Steps
- PostgreSQL, MongoDB, SQL Server support
- ClickHouse shards/replicas support
- Transaction support
Experience deploying to customers, and the tools we have developed in the process.
It's a complicated set of steps, so it is easier to automate the entire process. For creating schemas/databases, we have scripts for the initial load that simplify the process, and the sink connector can also auto-create tables. The goal is a complete suite of tools that simplifies the process end to end.
Existing data in MySQL might be big, so the solution needs to make the initial transfer fast (ClickHouse needs to be in sync): an end-to-end solution for transferring data from MySQL to ClickHouse for production deployments.
Debezium provides initial snapshotting, but it's slow, and it can hit the statement execution timeout (MAX_EXECUTION_TIME). The source DB might also have limited permissions; you might not have permission to perform OUTFILE.
Step 1: Perform a dump of the data from MySQL and load it into ClickHouse; the Debezium initial snapshot might not be faster. mysqlsh requires a PK: if a PK is not present, it does not parallelize and does not provide chunking. MySQL Shell uses the zstd compression standard by default, and the --threads option provides parallelism. clickhouse_loader creates the ClickHouse schema and adds version and sign columns for UPDATEs/DELETEs.
Step 2: After the dump is loaded, validate the data. Compare the results of the aggregation table that drives your dashboard; sales numbers have to be accurate.
Step 3: Set up CDC replication using Debezium and the Altinity Sink Connector.
Different environments: we also maintain images for Debezium/Strimzi and Sink/Strimzi.
Set up alerts for connectors being down, for lag, and for errors. We also bundle the Debezium dashboard and the Kafka dashboard.
Coordination is key! There is a tradeoff between parallelism and consistency.