
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay


LIVE WEBINAR: October 21, 2021 | 10 am PT
SPEAKERS: Jun Li, Principal Architect, eBay & Robert Hodges, CEO, Altinity
eBay depends on Kafka to solve the impedance mismatch between rapidly arriving messages in event streams and efficient block inserts into ClickHouse clusters. Naïve loading procedures from Kafka to ClickHouse generate non-deterministic blocks, which can lead to data loss and incorrect results in applications. The eBay team solved this problem with a block aggregator that leverages Kafka to store message processing metadata, together with ClickHouse deduplication, to ensure blocks are loaded to ClickHouse exactly once. The block aggregator allows eBay to support a sharded ClickHouse architecture across multiple data centers that can tolerate failures in any individual part of the system. Join us to learn how eBay developed this unique architecture and how they use it to deliver low-latency analytics to users.



  1. Real-Time, Exactly-Once Data Ingestion from Kafka to ClickHouse. Mohammad Roohitavaf, Jun Li. October 21, 2021
  2. The Real-Time Analytics Processing Pipeline
  3. ClickHouse as a Real-Time Analytics Database
  • ClickHouse: an open-source columnar database that supports OLAP
  • Data insertion favors large blocks over individual rows
  • Kafka serves as a data buffer
  • A Block Aggregator is a data loader that aggregates Kafka messages into large blocks before loading them to ClickHouse
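The aggregation idea on this slide can be sketched as a simple buffer that collects rows and flushes them as one large block once a size threshold is reached. This is an illustrative sketch, not eBay's implementation; the `max_rows` threshold and `flush` callback are hypothetical (the real aggregator also flushes on a timer and tracks Kafka offsets):

```python
class BlockBuffer:
    """Buffer incoming rows and flush them as one large block."""

    def __init__(self, max_rows, flush):
        self.max_rows = max_rows
        self.flush = flush          # callback that inserts one whole block
        self.rows = []

    def append(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.max_rows:
            self.flush(list(self.rows))   # one INSERT for the whole block
            self.rows.clear()

blocks = []
buf = BlockBuffer(max_rows=3, flush=blocks.append)
for i in range(7):
    buf.append(i)
# 7 rows with max_rows=3 yield two full blocks; one row stays buffered
```

Batching this way is what lets ClickHouse see a few large inserts instead of thousands of single-row inserts.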
  4. Block Aggregator Failures
  With respect to the block aggregator:
  • Kafka can fail
  • The database backend can fail
  • Network connections to Kafka and the database can fail
  • The block aggregator itself can crash
  Blindly retrying data loads leads to data loss or duplication in the data persisted to the database. Kafka's transaction mechanism cannot be applied here.
  5. Our Solution: Exactly-Once Message Delivery to ClickHouse
  • Have the aggregator deterministically produce identical blocks to ClickHouse
  • Build on existing runtime support:
    • Kafka's metadata store to keep track of execution state, and
    • ClickHouse's block duplication detection
  6. The Outline of the Talk
  • The block aggregator developed for multi-DC deployment
  • The deterministic message replay protocol in the block aggregator
  • The runtime verifier as a monitoring/debugging tool for the block aggregator
  • Issues and experiences in the block aggregator's implementation and deployment
  • The block aggregator deployment in production
  7. The Multi-DC Kafka/ClickHouse Deployment
  • Each database shard has its own topic
  • #partitions in a topic = #replicas in the shard
  • A block aggregator is co-located with each replica (as two containers in a Kubernetes pod)
  • The block aggregator only inserts data into the local database replica (the ClickHouse replication protocol replicates data to the other replicas)
  • Each block aggregator subscribes to both Kafka clusters
  8. The Multi-DC Kafka/ClickHouse Failure Scenario (1): Kafka DC Down
  9. The Multi-DC Kafka/ClickHouse Failure Scenario (2): ClickHouse DC Down
  • ClickHouse insert-quorum = 2
  10. The Multi-DC Kafka/ClickHouse Failure Scenario (3): Kafka DC Down and ClickHouse DC Down
  • ClickHouse insert-quorum = 2
  11. Mappings of Topics, Tables, Rows, and Messages
  • One topic contains messages associated with multiple tables in the database
  • One message contains multiple rows belonging to the same table
  • Each message is an opaque byte array in Kafka, encoded with a protobuf-based mechanism
  • The block aggregator relies on the ClickHouse table schema to decode Kafka messages
  • When a new table is added to the database, no schema changes are needed in the Kafka clusters
  • The number of topics does not grow as tables continue to be added
  • Table rows constructed from Kafka messages in the two Kafka DCs get merged in the database
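Because one topic carries messages for many tables, each message must identify its own table so the aggregator can route its rows to the right buffer. A minimal sketch of that envelope idea; eBay uses a protobuf-based encoding, so the JSON envelope and field names here are purely illustrative:

```python
import json

def decode_message(raw: bytes):
    """Split an opaque Kafka message into (table, rows).

    Illustrative only: the real format is protobuf, decoded against the
    ClickHouse table schema; this JSON envelope is a stand-in.
    """
    msg = json.loads(raw)
    return msg["table"], msg["rows"]

table, rows = decode_message(b'{"table": "table1", "rows": [[1, "a"], [2, "b"]]}')
```

With the table name inside the message, adding a new database table requires no new topic and no Kafka-side schema change.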
  12. The Block Aggregator Architecture
  13. The Key Features of the Block Aggregator
  • Supports the multi-datacenter deployment model
  • Multiple tables per topic/partition
  • No data loss/duplication
  • Monitoring with over a hundred metrics:
    • Message processing rates
    • Block insertion rate and failure rate
    • Block size distribution
    • Block loading time distribution
    • Kafka metadata commit time and failure rate
    • Whether abnormal message consumption behaviors occurred (such as message offsets rewound or skipped)
  14. The Outline of the Talk
  • The block aggregator developed for multi-DC deployment
  • The deterministic message replay protocol in the block aggregator
  • The runtime verifier as a monitoring/debugging tool for the block aggregator
  • Issues and experiences in the block aggregator's implementation and deployment
  • The block aggregator deployment in production
  15. A Naïve Way for the Block Aggregator to Replay Messages (1)
  16. A Naïve Way for the Block Aggregator to Replay Messages (2)
  17. Our Solution: Block-Level Deduplication in ClickHouse (1)
  • ClickHouse relies on ZooKeeper to store metadata
  • Each stored block has a hash value
  • New blocks to be inserted have their hashes checked for uniqueness
  • Two blocks are identical if they:
    • have the same block size,
    • contain the same rows,
    • and have the rows in the same order
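Deduplication works because the hash covers the block's full contents, so a block is a duplicate only if size, rows, and row order all match. A minimal sketch of that identity property; the hashing scheme here is illustrative, not ClickHouse's actual checksum:

```python
import hashlib

def block_hash(rows):
    """Hash a block so that size, row contents, and row order all matter.

    Illustrative only: ClickHouse computes its own internal checksums.
    """
    h = hashlib.sha256()
    for row in rows:
        h.update(repr(row).encode())
        h.update(b"\x00")               # row separator, so order matters
    return h.hexdigest()

a = block_hash([(1, "x"), (2, "y")])
b = block_hash([(1, "x"), (2, "y")])   # identical block -> same hash
c = block_hash([(2, "y"), (1, "x")])   # same rows, different order -> different hash
```

This is why the aggregator must replay blocks byte-for-byte identically: any difference in size, content, or order produces a new hash and defeats deduplication.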
  18. Our Solution: Guarantee to Form Identical Blocks (2)
  • Store metadata back to Kafka describing the latest blocks formed for each table
  • In case of failure, the next block aggregator that picks up the partition knows exactly how to reconstruct the latest blocks formed for each table by the previous block aggregator
  • The two block aggregators can be on two different ClickHouse replicas if Kafka partition rebalancing happens
  19. The Metadata Structure
  For each Kafka connector, the metadata persisted to Kafka, per partition, is:
  replica-id, [table-name, begin-msg-offset, end-msg-offset, count]+
  Metadata.min = MIN(begin-msg-offset); Metadata.max = MAX(end-msg-offset)
  Example: replica_1,table1,0,29,20,table2,5,20,10
  • The last block for table1 decided to load to ClickHouse covers offsets [0, 29]; we have consumed 20 messages for table1.
  • The last block for table2 decided to load to ClickHouse covers offsets [5, 20]; we have consumed 10 messages for table2.
  • In total, we have consumed all 30 messages from offset min=0 to offset max=29: 20 for table1 and 10 for table2.
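The metadata string above can be parsed mechanically; a sketch that decodes the slide's example and computes Metadata.min/max as defined (function and field names are my own, not eBay's):

```python
def parse_metadata(s):
    """Parse 'replica-id,[table,begin,end,count]+' into per-table tuples."""
    parts = s.split(",")
    replica, rest = parts[0], parts[1:]
    tables = {}
    for i in range(0, len(rest), 4):
        name = rest[i]
        begin, end, count = int(rest[i + 1]), int(rest[i + 2]), int(rest[i + 3])
        tables[name] = (begin, end, count)
    return replica, tables

def metadata_min_max(tables):
    """Metadata.min = MIN(begin-msg-offset); Metadata.max = MAX(end-msg-offset)."""
    return (min(t[0] for t in tables.values()),
            max(t[1] for t in tables.values()))

replica, tables = parse_metadata("replica_1,table1,0,29,20,table2,5,20,10")
lo, hi = metadata_min_max(tables)
```

Running this on the slide's example yields the block [0, 29] with 20 messages for table1, [5, 20] with 10 messages for table2, and an overall consumed range of [0, 29].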
  20. The Metadata Structure for a Special Block
  • Special block: when begin-msg-offset = end-msg-offset + 1
    • Either no message for the table has an offset less than begin-msg-offset,
    • or every message for the table with an offset less than begin-msg-offset has been received and acknowledged by ClickHouse
  • Example: replica_id,table1,30,29,20,table2,5,20,10
    • All messages with offsets less than 30 for table1 have been acknowledged by ClickHouse
  21. Message Processing Sequence: Consume/Commit/Load
  The message processing shown here is per partition.
  22. Two Execution Modes
  • The aggregator starts from the message offset previously committed
  • REPLAY: the aggregator re-sends the last block formed for each table, to avoid data loss
  • CONSUME: the aggregator is done with REPLAY and is in the normal state
  • Mode switching:
  DetermineState(current_offset, saved_metadata) {
    begin = saved_metadata.min
    end = saved_metadata.max
    if (current_offset > end)
      state = CONSUME
    else
      state = REPLAY
  }
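The slide's mode-switching pseudocode translates directly into a runnable form. A sketch under the slide's definitions (md_min is unused in the decision but kept for parity with the pseudocode):

```python
def determine_state(current_offset, md_min, md_max):
    """Decide the aggregator's start mode from the saved metadata.

    If consumption has already passed the metadata's max offset, the last
    blocks were fully formed and acknowledged, so normal consumption can
    resume; otherwise the last blocks must be replayed identically.
    md_min is kept only for parity with the slide's pseudocode.
    """
    return "CONSUME" if current_offset > md_max else "REPLAY"

# Saved metadata covers offsets [0, 29]:
determine_state(30, 0, 29)  # past the last block -> CONSUME
determine_state(15, 0, 29)  # inside the covered range -> REPLAY
```

Note the boundary: an offset equal to md_max still triggers REPLAY, since the last block may not have been acknowledged.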
  23. The Top-Level Processing Loop of a Kafka Connector
  For each Kafka connector:
  while (running) {                        // outer loop
    wait for ClickHouse and Kafka to be healthy and connected
    while (running) {                      // inner loop; elapsed time <= max_poll_interval
      batch = read a batch from Kafka; if error, break inner loop
      for (msg : batch.messages) {         // consume loop
        partitionHandlers[msg.partition].consume(msg)   // append message to its table's buffer
        if error, break inner loop
      }
      for (ph : partitionHandlers) {       // check-buffers loop: commit to Kafka, flush to ClickHouse
        if (ph.state == CONSUME) {
          ph.checkBuffers()
          if error, break inner loop
        }
      }
    }
    disconnect from Kafka
    clear partitionHandlers
  }
  24. Some Clarifications
  • Partition handlers can be dynamically created or deleted by the Kafka broker's decision
  • Under some failure conditions, one Kafka connector can have more than one partition assigned
  • The partition handler performs the metadata commit on its partition
  • Each partition handler can process multiple tables (because a Kafka partition can carry multiple tables)
  • At any given time, each partition handler can have only one in-flight block per table being inserted into ClickHouse
    • No new block can be submitted until the current in-flight block gets a successful ACK from ClickHouse
    • Thus the committed metadata is just one block per table ahead, i.e., "write-ahead logging with one block"
    • In other words, when replay happens, at most one block per table needs to be replayed
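The "write-ahead logging with one block" rule above can be sketched as commit-then-flush per table: the metadata describing a block is committed to Kafka before the block is sent, and no second block for that table may form until the first is acknowledged. A simplified sketch; `commit_metadata` and `insert_block` are hypothetical stand-ins for the real Kafka commit and ClickHouse insert:

```python
class TableLoader:
    """Per-table loader enforcing at most one in-flight block.

    Sketch only: callbacks stand in for Kafka and ClickHouse calls.
    """

    def __init__(self, commit_metadata, insert_block):
        self.commit_metadata = commit_metadata
        self.insert_block = insert_block
        self.in_flight = None

    def load(self, begin, end, rows):
        assert self.in_flight is None, "previous block not yet ACKed"
        # Write-ahead: record the block's offsets in Kafka first...
        self.commit_metadata((begin, end, len(rows)))
        self.in_flight = (begin, end)
        # ...then send the block; on a crash, REPLAY rebuilds this exact block.
        self.insert_block(rows)

    def on_ack(self):
        self.in_flight = None   # ClickHouse acknowledged; the next block may form

log, inserted = [], []
loader = TableLoader(log.append, inserted.append)
loader.load(0, 29, ["r"] * 20)
loader.on_ack()
loader.load(30, 41, ["r"] * 8)
```

Because the metadata log is always exactly one block ahead of the acknowledged state, a recovering aggregator never has to replay more than one block per table.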
  25. Some Clarifications (cont'd)
  • If a block insertion to ClickHouse fails:
    • The outermost loop disconnects the Kafka connector from the Kafka broker
    • Kafka consumer group rebalancing is triggered automatically
    • A different replica's Kafka connector is assigned the partition, and block insertion continues at this new replica
    • Thus rebalancing provides "global retries with the last committed state" across multiple replicas
  • The same failure handling mechanism applies, for example, when a metadata commit to Kafka fails
  • Thus Kafka consumer group rebalancing is an indicator of a failure that cannot be recovered by a single block aggregator
  26. Example of Partition Rebalancing Across Replicas
  The following diagram shows two aggregators in one shard being killed (to simulate one datacenter going down); the block insertion traffic gets picked up by the two remaining aggregators in the same shard.
  27. The Outline of the Talk
  • The block aggregator developed for multi-DC deployment
  • The deterministic message replay protocol in the block aggregator
  • The runtime verifier as a monitoring/debugging tool for the block aggregator
  • Issues and experiences in the block aggregator's implementation and deployment
  • The block aggregator deployment in production
  28. Runtime Verification
  • Aggregator Verifier (AV): checks that the blocks flushed by all aggregators to ClickHouse do not cause any data loss/duplication
  • How can the AV know which blocks the aggregators flushed?
    • Each aggregator commits metadata to Kafka before flushing anything to ClickHouse, for each partition
    • All metadata records committed by the aggregators are appended to an internal Kafka topic called __consumer_offsets
    • Thus the AV subscribes to this topic and learns about all blocks flushed to ClickHouse by all aggregators
  29. Runtime Verification Algorithm
  Let M.t.start and M.t.end be the start offset and end offset for table t in metadata M, respectively.
  For any given metadata instances M and M', where M was committed before M' in time:
  • Backward anomaly: for some table t, M'.t.end < M.t.start
  • Overlap anomaly: for some table t, M.t.start < M'.t.end AND M'.t.start < M.t.end
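The two anomaly conditions translate verbatim into predicates over per-table offset intervals. A sketch with hypothetical field names (`start`/`end` per table), implementing exactly the slide's formulas:

```python
def backward_anomaly(m, m2):
    """m2 committed after m, yet its block ends before m's block starts."""
    return m2["end"] < m["start"]

def overlap_anomaly(m, m2):
    """The two blocks' offset intervals overlap: M.t.start < M'.t.end
    AND M'.t.start < M.t.end, per the slide."""
    return m["start"] < m2["end"] and m2["start"] < m["end"]

m = {"start": 10, "end": 29}
backward_anomaly(m, {"start": 0, "end": 5})     # offsets went backward
overlap_anomaly(m, {"start": 20, "end": 40})    # [10,29] and [20,40] overlap
overlap_anomaly(m, {"start": 30, "end": 40})    # disjoint intervals, no anomaly
```

The verifier evaluates these predicates per table over consecutive metadata instances; either anomaly firing signals potential loss or duplication.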
  30. Runtime Verifier Implementation
  • The verifier reads metadata instances in their commit order to Kafka, stored in the system topic __consumer_offsets.
  • __consumer_offsets is a partitioned topic, and Kafka does not guarantee ordering across partitions.
  • We order metadata instances by their commit timestamps at the brokers. This requires the Kafka brokers' clocks to be synchronized with an uncertainty window smaller than the time between two metadata commits; thus we should not commit metadata to Kafka too frequently.
  • This is not a problem for the block aggregator, as it commits metadata for each block every several seconds, which is infrequent compared to the clock skew.
  31. The Outline of the Talk
  • The block aggregator developed for multi-DC deployment
  • The deterministic message replay protocol in the block aggregator
  • The runtime verifier as a monitoring/debugging tool for the block aggregator
  • Issues and experiences in the block aggregator's implementation and deployment
  • The block aggregator deployment in production
  32. Compiling and Linking ClickHouse into the Block Aggregator
  • Instead of using the C++ client library in the ClickHouse repo, we compiled and linked the entire ClickHouse codebase into the block aggregator
  • This lets us leverage the native ClickHouse implementation:
    • Native TCP/IP communication protocol (with TLS and connection pooling)
    • Select query capabilities, just like clickhouse-client (for testing purposes)
    • Table schema retrieval and block header construction from the schema
    • Column construction from protobuf-based Kafka message deserialization
    • Column default expression evaluation
    • The ZooKeeper client for distributed locking
  33. Dynamic Table Schema Update
  • To dynamically update a table schema:
    • Step 1: The table schema is updated on each ClickHouse shard
    • Step 2: The block aggregators in each shard are restarted, so they load the updated schema from the co-located ClickHouse replica
    • Step 3: After offline confirmation of the schema update, the client application updates its logic to follow the updated schema when producing new Kafka messages
  • Requirement: the block aggregator must be able to deserialize Kafka messages into blocks whether or not the messages follow the updated schema
  • Solution: enforce that columns in a table schema can only be added, never deleted
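The append-only schema rule above can be expressed as a small compatibility check: every old column must survive in the same position with the same type, so messages encoded against the old schema still decode under the new one. A sketch of the rule; real ClickHouse schemas carry more attributes (defaults, codecs) than the (name, type) pairs assumed here:

```python
def schema_compatible(old_columns, new_columns):
    """Allow only appending columns: the new schema must start with the
    old schema unchanged. Sketch of the slide's rule, not eBay's code."""
    if len(new_columns) < len(old_columns):
        return False                # a column was dropped
    return new_columns[: len(old_columns)] == old_columns

old = [("id", "UInt64"), ("name", "String")]
schema_compatible(old, old + [("ts", "DateTime")])        # column appended: OK
schema_compatible(old, [("id", "UInt64")])                # column dropped: rejected
schema_compatible(old, [("id", "UInt32"), ("name", "String")])  # type changed: rejected
```

Appended columns are filled from their default expressions when old-format messages arrive, which is why deletion and type changes must be forbidden but appends are safe.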
  34. Multiple ZooKeeper Clusters for One ClickHouse Cluster
  • ClickHouse relies on ZooKeeper as its metadata store and for replication coordination
  • Each block insertion takes roughly 15 remote calls to the ZooKeeper server cluster
  • Block insertion is performed per table
  • Our ZooKeeper cluster (version 3.5.8) is deployed across three datacenters with ~20 ms cross-datacenter communication latency
  • For a large ClickHouse cluster with 250 shards (each shard having 4 replicas), a single ZooKeeper deployment can introduce a high ZooKeeper "hardware exception" rate
    • The exception is due to ZooKeeper sessions frequently expiring
  • Instead, multiple ZooKeeper clusters are deployed, each allocated a subset of the ClickHouse shards
    • In our deployment, 50 shards share one ZooKeeper cluster
    • The right ratio depends on the block insertion rate per table and the total number of tables involved in real-time insertion
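The shard-to-ZooKeeper allocation above is just a static partitioning of shards. A trivial sketch, assuming contiguous shard ranges (the slide gives the 50:1 ratio; the contiguous layout is my assumption, not eBay's stated scheme):

```python
def zk_cluster_for_shard(shard_id, shards_per_zk=50):
    """Map a ClickHouse shard to its ZooKeeper cluster index.

    50 shards share one ZooKeeper cluster in the deployment described;
    contiguous block allocation here is illustrative.
    """
    return shard_id // shards_per_zk

zk_cluster_for_shard(0)     # first ZooKeeper cluster
zk_cluster_for_shard(249)   # 250 shards spread over 5 ZooKeeper clusters
```

Splitting the ZooKeeper load this way caps the per-cluster rate of the ~15 ZooKeeper calls each block insertion incurs.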
  35. Distributed Locking at the Block Aggregator
  • Before "insert_quorum_parallel" was introduced in ClickHouse:
    • In each shard, for each table, only one replica is allowed to perform data insertion
    • Distributed locking is used to coordinate block insertion among the block aggregators
    • The ZooKeeper locking implementation in ClickHouse is used
  • More recent ClickHouse versions have "insert_quorum_parallel":
    • Its default value is true
    • According to the Altinity blog article, the current ClickHouse implementation breaks sequential consistency and may have other side effects
    • In our recent product release based on ClickHouse 21.8, we turned this option off
    • And we still enforce distributed locking at the block aggregator
  36. Testing the Block Aggregator
  • Resiliency testing (in an 8-shard cluster with 32 replicas):
    • Follows the "Chaos Monkey" approach
    • Kill individual processes and individual containers across ZooKeeper, ClickHouse, and the block aggregator
    • Kill all processes and containers in one datacenter across ZooKeeper, ClickHouse, and the block aggregator
    • Validate that data loading can recover and continue
  • Smaller-scale integration testing:
    • The whole cluster runs on a single machine, with multiple processes for ZooKeeper, ClickHouse, and the block aggregators
    • Programmatically control process start/stop, along with small table insertions
    • In addition, turn on fault injection at predefined points in the block aggregator code (for example, deliberately refusing Kafka messages for 10 seconds)
    • Validate whether data loss or data duplication happens
  37. ClickHouse Troubleshooting and Remediation
  • The setting "insert_quorum = 2" guarantees high data reliability
  • A ClickHouse exception (error code 286) can happen occasionally:
  2021.04.10 16:26:38.896509 [ 59963 ] {8421e4d6-43f0-4792-8570-7ef2bf8f595a} <Error> executeQuery: Code: 286, e.displayText() = DB::Exception: Quorum for previous write has not been satisfied yet. Status: version: 1 part_name: 20210410-0_990_990_0 required_number_of_replicas: 2 actual_number_of_replicas: 1 replicas: SLC-74137
  • Data insertion in the whole shard stops when this exception happens!
  38. ClickHouse Troubleshooting and Remediation (cont'd)
  • An in-house tool was developed to:
    • scan the ZooKeeper subtree associated with the log replication queues
    • inspect why queued commands cannot be performed
  • Once the queued commands all get cleared, the quorum automatically gets satisfied
    • Afterwards, data insertion resumes in the shard
  • Real-time alerts are defined for:
    • A long duration in which a shard has no block insertion
    • Block insertion experiencing a non-zero failure rate with error code 286
    • Some replicas having replication queues that are too large
  39. The Outline of the Talk
  • The block aggregator developed for multi-DC deployment
  • The deterministic message replay protocol in the block aggregator
  • The runtime verifier as a monitoring/debugging tool for the block aggregator
  • Issues and experiences in the block aggregator's implementation and deployment
  • The block aggregator deployment in production
  40. Block Aggregator Deployment in Production
  One example deployment:
  • Kafka clusters: 2 datacenters
  • The ClickHouse cluster: 2 datacenters, 250 shards, each shard having 4 replicas (2 replicas per DC), with an aggregator co-located in each replica

  Metric | Measured Result
  Total messages processed/sec (peak) | 280 K
  Total message bytes processed/sec (peak) | 220 MB/sec
  95th-percentile block insertion time (quorum=2) | 3.8 sec (table 1), 1.1 sec (table 2), 4.0 sec (table 3)
  95th-percentile block size | 0.16 MB (table 1), 0.03 MB (table 2), 0.46 MB (table 3)
  95th-percentile number of rows in a block | 1358 rows (table 1), 1.8 rows (table 2), 1894 rows (table 3)
  95th-percentile Kafka commit time | 64 ms
  End-to-end message consumption lag time | < 30 sec
  41. Block Aggregator Deployment in Production
  • The block insertion rate at the shard level in a 24-hour window
  42. Block Aggregator Deployment in Production
  • The message consumption lag time at the shard level captured in a 24-hour window
  43. Block Aggregator Deployment in Production
  • The Kafka group rebalance rate at the shard level in a 24-hour window (always 0)
  44. Block Aggregator Deployment in Production
  • The ZooKeeper hardware exception rate in a 24-hour window (close to 0)
  45. Summary
  • Using streaming platforms like Kafka is one standard way to transfer data across data processing systems
  • For a columnar DB, block loading is more efficient than loading individual records
  • Under failure conditions, replaying Kafka messages may cause data loss or data duplication at the block loaders
  • Our solution deterministically produces identical blocks under various failure conditions, so that the backend columnar DB can detect and remove duplicated blocks
  • The same solution allows us to verify that blocks are always produced correctly under failure conditions
  • This solution has been developed and deployed into production
