
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay


LIVE WEBINAR: October 21, 2021 | 10 am PT
SPEAKERS: Jun Li, Principal Architect, eBay & Robert Hodges, CEO, Altinity
eBay depends on Kafka to solve the impedance mismatch between rapidly arriving messages in event streams and efficient block inserts into ClickHouse clusters. Naïve loading procedures from Kafka to ClickHouse generate non-deterministic blocks, which can lead to data loss and incorrect results in applications. The eBay team solved this problem with a block aggregator that leverages Kafka to store message processing metadata, together with ClickHouse deduplication, to ensure blocks are loaded to ClickHouse exactly once. The block aggregator allows eBay to support a sharded ClickHouse architecture across multiple data centers that can tolerate failures in any individual part of the system. Join us to learn how eBay developed this unique architecture and how they use it to deliver low-latency analytics to users.



  1. Real-Time, Exactly-Once Data Ingestion from Kafka to ClickHouse. Mohammad Roohitavaf, Jun Li. October 21, 2021
  2. The Real-Time Analytics Processing Pipeline
  3. ClickHouse as a Real-Time Analytics Database
  • ClickHouse: an open-source columnar database that supports OLAP
  • Data insertion favors large blocks over individual rows
  • Kafka serves as a data buffer
  • A Block Aggregator is a data loader that aggregates Kafka messages into large blocks before loading them to ClickHouse
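The aggregation idea on this slide can be sketched as a simple buffer that collects rows and flushes them as one large block once a size threshold is reached. This is an illustrative sketch, not eBay's implementation; the `max_rows` threshold and `flush` callback are hypothetical (the real aggregator also flushes on a timer and tracks Kafka offsets):

```python
class BlockBuffer:
    """Buffer incoming rows and flush them as one large block."""

    def __init__(self, max_rows, flush):
        self.max_rows = max_rows
        self.flush = flush          # callback that inserts one whole block
        self.rows = []

    def append(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.max_rows:
            self.flush(list(self.rows))   # one INSERT for the whole block
            self.rows.clear()

blocks = []
buf = BlockBuffer(max_rows=3, flush=blocks.append)
for i in range(7):
    buf.append(i)
# 7 rows with max_rows=3 yield two full blocks; one row stays buffered
```

Batching this way is what lets ClickHouse see a few large inserts instead of thousands of single-row inserts.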
  4. Block Aggregator Failures
  With respect to the block aggregator:
  • Kafka can fail
  • The database backend can fail
  • Network connections to Kafka and the database can fail
  • The block aggregator itself can crash
  Blindly retrying data loads leads to data loss or duplication in the data persisted to the database. Kafka's transaction mechanism cannot be applied here.
  5. Our Solution: Exactly-Once Message Delivery to ClickHouse
  • Have the aggregator deterministically produce identical blocks to ClickHouse
  • Build on existing runtime support:
    • Kafka's metadata store to keep track of execution state, and
    • ClickHouse's block duplication detection
  6. The Outline of the Talk
  • The block aggregator developed for multi-DC deployment
  • The deterministic message replay protocol in the block aggregator
  • The runtime verifier as a monitoring/debugging tool for the block aggregator
  • Issues and experiences in the block aggregator's implementation and deployment
  • The block aggregator deployment in production
  7. The Multi-DC Kafka/ClickHouse Deployment
  • Each database shard has its own topic
  • #partitions in a topic = #replicas in the shard
  • A block aggregator is co-located with each replica (as two containers in a Kubernetes pod)
  • The block aggregator only inserts data into the local database replica (the ClickHouse replication protocol replicates data to the other replicas)
  • Each block aggregator subscribes to both Kafka clusters
  8. The Multi-DC Kafka/ClickHouse Failure Scenario (1): Kafka DC Down
  9. The Multi-DC Kafka/ClickHouse Failure Scenario (2): ClickHouse DC Down
  • ClickHouse insert-quorum = 2
  10. The Multi-DC Kafka/ClickHouse Failure Scenario (3): Kafka DC Down and ClickHouse DC Down
  • ClickHouse insert-quorum = 2
  11. Mappings of Topics, Tables, Rows, and Messages
  • One topic contains messages associated with multiple tables in the database
  • One message contains multiple rows belonging to the same table
  • Each message is an opaque byte array in Kafka, encoded with a protobuf-based mechanism
  • The block aggregator relies on the ClickHouse table schema to decode Kafka messages
  • When a new table is added to the database, no schema changes are needed in the Kafka clusters
  • The number of topics does not grow as tables continue to be added
  • Table rows constructed from Kafka messages in the two Kafka DCs get merged in the database
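Because one topic carries messages for many tables, each message must identify its own table so the aggregator can route its rows to the right buffer. A minimal sketch of that envelope idea; eBay uses a protobuf-based encoding, so the JSON envelope and field names here are purely illustrative:

```python
import json

def decode_message(raw: bytes):
    """Split an opaque Kafka message into (table, rows).

    Illustrative only: the real format is protobuf, decoded against the
    ClickHouse table schema; this JSON envelope is a stand-in.
    """
    msg = json.loads(raw)
    return msg["table"], msg["rows"]

table, rows = decode_message(b'{"table": "table1", "rows": [[1, "a"], [2, "b"]]}')
```

With the table name inside the message, adding a new database table requires no new topic and no Kafka-side schema change.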
  12. The Block Aggregator Architecture
  13. The Key Features of the Block Aggregator
  • Supports the multi-datacenter deployment model
  • Multiple tables per topic/partition
  • No data loss/duplication
  • Monitoring with over a hundred metrics:
    • Message processing rates
    • Block insertion rate and failure rate
    • Block size distribution
    • Block loading time distribution
    • Kafka metadata commit time and failure rate
    • Whether abnormal message consumption behaviors occurred (such as message offsets rewound or skipped)
  14. The Outline of the Talk
  • The block aggregator developed for multi-DC deployment
  • The deterministic message replay protocol in the block aggregator
  • The runtime verifier as a monitoring/debugging tool for the block aggregator
  • Issues and experiences in the block aggregator's implementation and deployment
  • The block aggregator deployment in production
  15. A Naïve Way for the Block Aggregator to Replay Messages (1)
  16. A Naïve Way for the Block Aggregator to Replay Messages (2)
  17. Our Solution: Block-Level Deduplication in ClickHouse (1)
  • ClickHouse relies on ZooKeeper to store metadata
  • Each stored block has a hash value
  • New blocks to be inserted have their hashes checked for uniqueness
  • Two blocks are identical if they:
    • have the same block size,
    • contain the same rows,
    • and have the rows in the same order
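Deduplication works because the hash covers the block's full contents, so a block is a duplicate only if size, rows, and row order all match. A minimal sketch of that identity property; the hashing scheme here is illustrative, not ClickHouse's actual checksum:

```python
import hashlib

def block_hash(rows):
    """Hash a block so that size, row contents, and row order all matter.

    Illustrative only: ClickHouse computes its own internal checksums.
    """
    h = hashlib.sha256()
    for row in rows:
        h.update(repr(row).encode())
        h.update(b"\x00")               # row separator, so order matters
    return h.hexdigest()

a = block_hash([(1, "x"), (2, "y")])
b = block_hash([(1, "x"), (2, "y")])   # identical block -> same hash
c = block_hash([(2, "y"), (1, "x")])   # same rows, different order -> different hash
```

This is why the aggregator must replay blocks byte-for-byte identically: any difference in size, content, or order produces a new hash and defeats deduplication.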
  18. Our Solution: Guarantee to Form Identical Blocks (2)
  • Store metadata back to Kafka describing the latest blocks formed for each table
  • In case of failure, the next block aggregator that picks up the partition knows exactly how to reconstruct the latest blocks formed for each table by the previous block aggregator
  • The two block aggregators can be on two different ClickHouse replicas if Kafka partition rebalancing happens
  19. The Metadata Structure
  For each Kafka connector, the metadata persisted to Kafka, per partition, is:
  replica-id, [table-name, begin-msg-offset, end-msg-offset, count]+
  Metadata.min = MIN(begin-msg-offset); Metadata.max = MAX(end-msg-offset)
  Example: replica_1,table1,0,29,20,table2,5,20,10
  • The last block for table1 decided to load to ClickHouse covers offsets [0, 29]; we have consumed 20 messages for table1.
  • The last block for table2 decided to load to ClickHouse covers offsets [5, 20]; we have consumed 10 messages for table2.
  • In total, we have consumed all 30 messages from offset min=0 to offset max=29: 20 for table1 and 10 for table2.
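The metadata string above can be parsed mechanically; a sketch that decodes the slide's example and computes Metadata.min/max as defined (function and field names are my own, not eBay's):

```python
def parse_metadata(s):
    """Parse 'replica-id,[table,begin,end,count]+' into per-table tuples."""
    parts = s.split(",")
    replica, rest = parts[0], parts[1:]
    tables = {}
    for i in range(0, len(rest), 4):
        name = rest[i]
        begin, end, count = int(rest[i + 1]), int(rest[i + 2]), int(rest[i + 3])
        tables[name] = (begin, end, count)
    return replica, tables

def metadata_min_max(tables):
    """Metadata.min = MIN(begin-msg-offset); Metadata.max = MAX(end-msg-offset)."""
    return (min(t[0] for t in tables.values()),
            max(t[1] for t in tables.values()))

replica, tables = parse_metadata("replica_1,table1,0,29,20,table2,5,20,10")
lo, hi = metadata_min_max(tables)
```

Running this on the slide's example yields the block [0, 29] with 20 messages for table1, [5, 20] with 10 messages for table2, and an overall consumed range of [0, 29].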
  20. The Metadata Structure for a Special Block
  • Special block: when begin-msg-offset = end-msg-offset + 1
    • Either no message for the table has an offset less than begin-msg-offset,
    • or every message for the table with an offset less than begin-msg-offset has been received and acknowledged by ClickHouse
  • Example: replica_id,table1,30,29,20,table2,5,20,10
    • All messages with offsets less than 30 for table1 have been acknowledged by ClickHouse
  21. Message Processing Sequence: Consume/Commit/Load
  The message processing shown here is per partition.
  22. Two Execution Modes
  • The aggregator starts from the message offset previously committed
  • REPLAY: the aggregator re-sends the last block formed for each table, to avoid data loss
  • CONSUME: the aggregator is done with REPLAY and is in the normal state
  • Mode switching:
  DetermineState(current_offset, saved_metadata) {
    begin = saved_metadata.min
    end = saved_metadata.max
    if (current_offset > end)
      state = CONSUME
    else
      state = REPLAY
  }
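The slide's mode-switching pseudocode translates directly into a runnable form. A sketch under the slide's definitions (md_min is unused in the decision but kept for parity with the pseudocode):

```python
def determine_state(current_offset, md_min, md_max):
    """Decide the aggregator's start mode from the saved metadata.

    If consumption has already passed the metadata's max offset, the last
    blocks were fully formed and acknowledged, so normal consumption can
    resume; otherwise the last blocks must be replayed identically.
    md_min is kept only for parity with the slide's pseudocode.
    """
    return "CONSUME" if current_offset > md_max else "REPLAY"

# Saved metadata covers offsets [0, 29]:
determine_state(30, 0, 29)  # past the last block -> CONSUME
determine_state(15, 0, 29)  # inside the covered range -> REPLAY
```

Note the boundary: an offset equal to md_max still triggers REPLAY, since the last block may not have been acknowledged.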
  23. The Top-Level Processing Loop of a Kafka Connector
  For each Kafka connector:
  while (running) {                        // outer loop
    wait for ClickHouse and Kafka to be healthy and connected
    while (running) {                      // inner loop; elapsed time <= max_poll_interval
      batch = read a batch from Kafka; if error, break inner loop
      for (msg : batch.messages) {         // consume loop
        partitionHandlers[msg.partition].consume(msg)   // append message to its table's buffer
        if error, break inner loop
      }
      for (ph : partitionHandlers) {       // check-buffers loop: commit to Kafka, flush to ClickHouse
        if (ph.state == CONSUME) {
          ph.checkBuffers()
          if error, break inner loop
        }
      }
    }
    disconnect from Kafka
    clear partitionHandlers
  }
  24. Some Clarifications
  • Partition handlers can be dynamically created or deleted by the Kafka broker's decision
  • Under some failure conditions, one Kafka connector can have more than one partition assigned
  • The partition handler performs the metadata commit on its partition
  • Each partition handler can process multiple tables (because a Kafka partition can carry multiple tables)
  • At any given time, each partition handler can have only one in-flight block per table being inserted into ClickHouse
    • No new block can be submitted until the current in-flight block gets a successful ACK from ClickHouse
    • Thus the committed metadata is just one block per table ahead, i.e., "write-ahead logging with one block"
    • In other words, when replay happens, at most one block per table needs to be replayed
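The "write-ahead logging with one block" rule above can be sketched as commit-then-flush per table: the metadata describing a block is committed to Kafka before the block is sent, and no second block for that table may form until the first is acknowledged. A simplified sketch; `commit_metadata` and `insert_block` are hypothetical stand-ins for the real Kafka commit and ClickHouse insert:

```python
class TableLoader:
    """Per-table loader enforcing at most one in-flight block.

    Sketch only: callbacks stand in for Kafka and ClickHouse calls.
    """

    def __init__(self, commit_metadata, insert_block):
        self.commit_metadata = commit_metadata
        self.insert_block = insert_block
        self.in_flight = None

    def load(self, begin, end, rows):
        assert self.in_flight is None, "previous block not yet ACKed"
        # Write-ahead: record the block's offsets in Kafka first...
        self.commit_metadata((begin, end, len(rows)))
        self.in_flight = (begin, end)
        # ...then send the block; on a crash, REPLAY rebuilds this exact block.
        self.insert_block(rows)

    def on_ack(self):
        self.in_flight = None   # ClickHouse acknowledged; the next block may form

log, inserted = [], []
loader = TableLoader(log.append, inserted.append)
loader.load(0, 29, ["r"] * 20)
loader.on_ack()
loader.load(30, 41, ["r"] * 8)
```

Because the metadata log is always exactly one block ahead of the acknowledged state, a recovering aggregator never has to replay more than one block per table.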
  25. Some Clarifications (cont'd)
  • If a block insertion to ClickHouse fails:
    • The outermost loop disconnects the Kafka connector from the Kafka broker
    • Kafka consumer group rebalancing is triggered automatically
    • A different replica's Kafka connector is assigned the partition, and block insertion continues at this new replica
    • Thus rebalancing provides "global retries with the last committed state" across multiple replicas
  • The same failure handling mechanism applies, for example, when a metadata commit to Kafka fails
  • Thus Kafka consumer group rebalancing is an indicator of a failure that cannot be recovered by a single block aggregator
  26. Example of Partition Rebalancing Across Replicas
  The following diagram shows two aggregators in one shard being killed (to simulate one datacenter going down); the block insertion traffic gets picked up by the two remaining aggregators in the same shard.
  27. The Outline of the Talk
  • The block aggregator developed for multi-DC deployment
  • The deterministic message replay protocol in the block aggregator
  • The runtime verifier as a monitoring/debugging tool for the block aggregator
  • Issues and experiences in the block aggregator's implementation and deployment
  • The block aggregator deployment in production
  28. Runtime Verification
  • Aggregator Verifier (AV): checks that the blocks flushed by all aggregators to ClickHouse do not cause any data loss/duplication
  • How can the AV know which blocks the aggregators flushed?
    • Each aggregator commits metadata to Kafka before flushing anything to ClickHouse, for each partition
    • All metadata records committed by the aggregators are appended to an internal Kafka topic called __consumer_offsets
    • Thus the AV subscribes to this topic and learns about all blocks flushed to ClickHouse by all aggregators
  29. Runtime Verification Algorithm
  Let M.t.start and M.t.end be the start offset and end offset for table t in metadata M, respectively.
  For any given metadata instances M and M', where M was committed before M' in time:
  • Backward anomaly: for some table t, M'.t.end < M.t.start
  • Overlap anomaly: for some table t, M.t.start < M'.t.end AND M'.t.start < M.t.end
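The two anomaly conditions translate verbatim into predicates over per-table offset intervals. A sketch with hypothetical field names (`start`/`end` per table), implementing exactly the slide's formulas:

```python
def backward_anomaly(m, m2):
    """m2 committed after m, yet its block ends before m's block starts."""
    return m2["end"] < m["start"]

def overlap_anomaly(m, m2):
    """The two blocks' offset intervals overlap: M.t.start < M'.t.end
    AND M'.t.start < M.t.end, per the slide."""
    return m["start"] < m2["end"] and m2["start"] < m["end"]

m = {"start": 10, "end": 29}
backward_anomaly(m, {"start": 0, "end": 5})     # offsets went backward
overlap_anomaly(m, {"start": 20, "end": 40})    # [10,29] and [20,40] overlap
overlap_anomaly(m, {"start": 30, "end": 40})    # disjoint intervals, no anomaly
```

The verifier evaluates these predicates per table over consecutive metadata instances; either anomaly firing signals potential loss or duplication.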
  30. Runtime Verifier Implementation
  • The verifier reads metadata instances in their commit order to Kafka, stored in the system topic __consumer_offsets.
  • __consumer_offsets is a partitioned topic, and Kafka does not guarantee ordering across partitions.
  • We order metadata instances by their commit timestamps at the brokers. This requires the Kafka brokers' clocks to be synchronized with an uncertainty window smaller than the time between two metadata commits; thus we should not commit metadata to Kafka too frequently.
  • This is not a problem for the block aggregator, as it commits metadata for each block every several seconds, which is infrequent compared to the clock skew.
  31. The Outline of the Talk
  • The block aggregator developed for multi-DC deployment
  • The deterministic message replay protocol in the block aggregator
  • The runtime verifier as a monitoring/debugging tool for the block aggregator
  • Issues and experiences in the block aggregator's implementation and deployment
  • The block aggregator deployment in production
  32. Compiling and Linking ClickHouse into the Block Aggregator
  • Instead of using the C++ client library in the ClickHouse repo, we compiled and linked the entire ClickHouse codebase into the block aggregator
  • This lets us leverage the native ClickHouse implementation:
    • Native TCP/IP communication protocol (with TLS and connection pooling)
    • Select query capabilities, just like clickhouse-client (for testing purposes)
    • Table schema retrieval and block header construction from the schema
    • Column construction from protobuf-based Kafka message deserialization
    • Column default expression evaluation
    • The ZooKeeper client for distributed locking
  33. Dynamic Table Schema Update
  • To dynamically update a table schema:
    • Step 1: The table schema is updated on each ClickHouse shard
    • Step 2: The block aggregators in each shard are restarted, so they load the updated schema from the co-located ClickHouse replica
    • Step 3: After offline confirmation of the schema update, the client application updates its logic to follow the updated schema when producing new Kafka messages
  • Requirement: the block aggregator must be able to deserialize Kafka messages into blocks whether or not the messages follow the updated schema
  • Solution: enforce that columns in a table schema can only be added, never deleted
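The append-only schema rule above can be expressed as a small compatibility check: every old column must survive in the same position with the same type, so messages encoded against the old schema still decode under the new one. A sketch of the rule; real ClickHouse schemas carry more attributes (defaults, codecs) than the (name, type) pairs assumed here:

```python
def schema_compatible(old_columns, new_columns):
    """Allow only appending columns: the new schema must start with the
    old schema unchanged. Sketch of the slide's rule, not eBay's code."""
    if len(new_columns) < len(old_columns):
        return False                # a column was dropped
    return new_columns[: len(old_columns)] == old_columns

old = [("id", "UInt64"), ("name", "String")]
schema_compatible(old, old + [("ts", "DateTime")])        # column appended: OK
schema_compatible(old, [("id", "UInt64")])                # column dropped: rejected
schema_compatible(old, [("id", "UInt32"), ("name", "String")])  # type changed: rejected
```

Appended columns are filled from their default expressions when old-format messages arrive, which is why deletion and type changes must be forbidden but appends are safe.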
  34. Multiple ZooKeeper Clusters for One ClickHouse Cluster
  • ClickHouse relies on ZooKeeper as its metadata store and for replication coordination
  • Each block insertion takes roughly 15 remote calls to the ZooKeeper server cluster
  • Block insertion is performed per table
  • Our ZooKeeper cluster (version 3.5.8) is deployed across three datacenters with ~20 ms cross-datacenter communication latency
  • For a large ClickHouse cluster with 250 shards (each shard having 4 replicas), a single ZooKeeper deployment can introduce a high ZooKeeper "hardware exception" rate
    • The exception is due to ZooKeeper sessions frequently expiring
  • Instead, multiple ZooKeeper clusters are deployed, each allocated a subset of the ClickHouse shards
    • In our deployment, 50 shards share one ZooKeeper cluster
    • The right ratio depends on the block insertion rate per table and the total number of tables involved in real-time insertion
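The shard-to-ZooKeeper allocation above is just a static partitioning of shards. A trivial sketch, assuming contiguous shard ranges (the slide gives the 50:1 ratio; the contiguous layout is my assumption, not eBay's stated scheme):

```python
def zk_cluster_for_shard(shard_id, shards_per_zk=50):
    """Map a ClickHouse shard to its ZooKeeper cluster index.

    50 shards share one ZooKeeper cluster in the deployment described;
    contiguous block allocation here is illustrative.
    """
    return shard_id // shards_per_zk

zk_cluster_for_shard(0)     # first ZooKeeper cluster
zk_cluster_for_shard(249)   # 250 shards spread over 5 ZooKeeper clusters
```

Splitting the ZooKeeper load this way caps the per-cluster rate of the ~15 ZooKeeper calls each block insertion incurs.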
  35. Distributed Locking at the Block Aggregator
  • Before "insert_quorum_parallel" was introduced in ClickHouse:
    • In each shard, for each table, only one replica is allowed to perform data insertion
    • Distributed locking is used to coordinate block insertion among the block aggregators
    • The ZooKeeper locking implementation in ClickHouse is used
  • More recent ClickHouse versions have "insert_quorum_parallel":
    • Its default value is true
    • According to the Altinity blog article, the current ClickHouse implementation breaks sequential consistency and may have other side effects
    • In our recent product release based on ClickHouse 21.8, we turned this option off
    • And we still enforce distributed locking at the block aggregator
  36. Testing the Block Aggregator
  • Resiliency testing (in an 8-shard cluster with 32 replicas):
    • Follows the "Chaos Monkey" approach
    • Kill individual processes and individual containers across ZooKeeper, ClickHouse, and the block aggregator
    • Kill all processes and containers in one datacenter across ZooKeeper, ClickHouse, and the block aggregator
    • Validate that data loading can recover and continue
  • Smaller-scale integration testing:
    • The whole cluster runs on a single machine, with multiple processes for ZooKeeper, ClickHouse, and the block aggregators
    • Programmatically control process start/stop, along with small table insertions
    • In addition, turn on fault injection at predefined points in the block aggregator code (for example, deliberately refusing Kafka messages for 10 seconds)
    • Validate whether data loss or data duplication happens
  37. ClickHouse Troubleshooting and Remediation
  • The setting "insert_quorum = 2" guarantees high data reliability
  • A ClickHouse exception (error code 286) can happen occasionally:
  2021.04.10 16:26:38.896509 [ 59963 ] {8421e4d6-43f0-4792-8570-7ef2bf8f595a} <Error> executeQuery: Code: 286, e.displayText() = DB::Exception: Quorum for previous write has not been satisfied yet. Status: version: 1 part_name: 20210410-0_990_990_0 required_number_of_replicas: 2 actual_number_of_replicas: 1 replicas: SLC-74137
  • Data insertion in the whole shard stops when this exception happens!
  38. ClickHouse Troubleshooting and Remediation (cont'd)
  • An in-house tool was developed to:
    • scan the ZooKeeper subtree associated with the log replication queues
    • inspect why queued commands cannot be performed
  • Once the queued commands all get cleared, the quorum automatically gets satisfied
    • Afterwards, data insertion resumes in the shard
  • Real-time alerts are defined for:
    • A long duration in which a shard has no block insertion
    • Block insertion experiencing a non-zero failure rate with error code 286
    • Some replicas having replication queues that are too large
  39. The Outline of the Talk
  • The block aggregator developed for multi-DC deployment
  • The deterministic message replay protocol in the block aggregator
  • The runtime verifier as a monitoring/debugging tool for the block aggregator
  • Issues and experiences in the block aggregator's implementation and deployment
  • The block aggregator deployment in production
  40. Block Aggregator Deployment in Production
  One example deployment:
  • Kafka clusters: 2 datacenters
  • The ClickHouse cluster: 2 datacenters, 250 shards, each shard having 4 replicas (2 replicas per DC), with an aggregator co-located in each replica

  Metric | Measured Result
  Total messages processed/sec (peak) | 280 K
  Total message bytes processed/sec (peak) | 220 MB/sec
  95th-percentile block insertion time (quorum=2) | 3.8 sec (table 1), 1.1 sec (table 2), 4.0 sec (table 3)
  95th-percentile block size | 0.16 MB (table 1), 0.03 MB (table 2), 0.46 MB (table 3)
  95th-percentile number of rows in a block | 1358 rows (table 1), 1.8 rows (table 2), 1894 rows (table 3)
  95th-percentile Kafka commit time | 64 ms
  End-to-end message consumption lag time | < 30 sec
  41. Block Aggregator Deployment in Production
  • The block insertion rate at the shard level in a 24-hour window
  42. Block Aggregator Deployment in Production
  • The message consumption lag time at the shard level captured in a 24-hour window
  43. Block Aggregator Deployment in Production
  • The Kafka group rebalance rate at the shard level in a 24-hour window (always 0)
  44. Block Aggregator Deployment in Production
  • The ZooKeeper hardware exception rate in a 24-hour window (close to 0)
  45. Summary
  • Using streaming platforms like Kafka is one standard way to transfer data across data processing systems
  • For a columnar DB, block loading is more efficient than loading individual records
  • Under failure conditions, replaying Kafka messages may cause data loss or data duplication at the block loaders
  • Our solution deterministically produces identical blocks under various failure conditions, so that the backend columnar DB can detect and remove duplicated blocks
  • The same solution allows us to verify that blocks are always produced correctly under failure conditions
  • This solution has been developed and deployed into production
