
Building robust CDC pipeline with Apache Hudi and Debezium

We cover the need for CDC and the benefits of building a CDC pipeline. We compare various CDC streaming and reconciliation frameworks, and cover the architecture and the challenges we faced while running this system in production. Finally, we conclude the talk by covering Apache Hudi, Schema Registry and Debezium in detail, along with our contributions to the open-source community.

Published in: Data & Analytics
Building robust CDC pipeline with Apache Hudi and Debezium

  1. BUILDING ROBUST CDC PIPELINE WITH APACHE HUDI AND DEBEZIUM @SCALE • PRATYAKSH • PURUSHOTHAM • SYED • SHAIK • Hadoop Meetup Bangalore (Dec-2019)
  2. What is CDC? • Benefits of CDC • Comparison of CDC Streaming Systems • Comparison of Reconciler Systems • CDC Platform Architecture @ Tathastu • Challenges • Contribution • Roadmap • Questions
  3. CHANGE DATA CAPTURE (CDC): A set of software design patterns used to determine (and track) the data that has changed so that action can be taken using the changed data.
  4. Low latency • Event processing • Real-time analytics and dashboarding • Audit logging • Distribute the load round the clock
  5. Log-based vs. query-based CDC:
     Tools: Debezium (log-based) / JDBC Connector (query-based)
     Schema Evolution: Yes / Yes
     Processing: Stream / Batch
     Audit Track: Preserved / Partially Preserved
     Latency: Low / High
     Cost: High / Low
     Delete Track: Yes / No
  6. Comparison of CDC streaming systems:
     Maxwell: Bootstrap: Yes; Formats: JSON; Message Queues: Kafka, Kinesis, SQS, Google Pub/Sub, RabbitMQ, Redis, Custom Producer; Schema Evolution: Yes; Latency: Low; Supported Databases: MySQL; Onboarding: Command Driven; State Storage/Checkpoints: External Database
     Apache NiFi: Bootstrap: No; Formats: JSON; Message Queues: NiFi connections; Schema Evolution: No; Latency: Medium; Supported Databases: MySQL; Onboarding: Config and API Driven; State Storage/Checkpoints: Zookeeper, External Cache
     Debezium: Bootstrap: Yes; Formats: JSON, Avro; Message Queues: Kafka; Schema Evolution: Yes; Latency: Low; Supported Databases: MySQL, PostgreSQL, Oracle, SQL Server, MongoDB, Cassandra; Onboarding: Purely API Driven; State Storage/Checkpoints: Kafka topics
  7. Comparison of reconciler systems (credits: Qubole):
     Delta.io (Databricks): Updates/Deletes: Yes; Compactions: Manual cleanup, no compaction; File Format: Parquet; Engine: Spark, Presto (recently); SQL DML: No; Write Amplification: High; Apache Governance: Yes (recently)
     Apache HUDI: Updates/Deletes: Yes; Compactions: Automatic, manual; File Format: Parquet, Avro; Engine: Spark, Presto, Hive, EMR, Athena (with workaround); SQL DML: No; Write Amplification: Low; Apache Governance: Yes
     Apache Hive (LLAP): Updates/Deletes: Yes; Compactions: Automatic, manual; File Format: ORC; Engine: Hive, Spark (LLAP); SQL DML: Yes; Write Amplification: Low; Apache Governance: Yes
  8. Hadoop Upserts Deletes and Incrementals • Consists of a self-contained Spark library • Hudi key = Record key + Partition key • Storage types – COPY_ON_WRITE and MERGE_ON_READ • Query engines – SparkSQL, Hive, Presto • Multiple cleaning and compaction policies supported • Key classes – HoodieDeltaStreamer, HiveSyncTool
  9. Schema evolution • Handling datatypes (JDBC) • Handling RDS internal commands • Making libraries compatible with the latest versions of Kafka and Spark • Multi-table support in DeltaStreamer • Enhancing Kafka batch read for bootstrapping (source limit) • Hive Metastore settings • Queryable HUDI dataset – making it compatible with Athena
  10. CONTRIBUTION • HUDI-288 • HUDI-340 • HUDI-259 • HUDI-114 • HUDI-118 • HUDI-245 • DBZ-1521 • DBZ-1492 • 563 • 311 • NIFI-6501 • NIFI-6914 • NIFI-6119
  11. • Build the single-click UI for orchestration • Data profiler UI for validation and alerts • Config-store for configs and credentials • ACL for tables and databases (via Ranger) • Managing the subscriber list for notifications and alerts
  12. • QUBOLE CDC RECONCILER COMPARISON • HUDI DETAILED ARCHITECTURE DISCUSSION • ADVANTAGES OF LOG-BASED OVER QUERY-BASED
  13. HUDI Command (a sketch of the referenced ${HUDI_CONFIG} properties file follows the slide list):
      spark-submit --name debz_futurepay --queue etl --files jaas.conf,custom_config.json \
        --master yarn --deploy-mode cluster \
        --driver-memory 4g --executor-memory 4g --num-executors 50 \
        --class org.apache.hudi.utilities.deltastreamer.CDCStreamer hudi-utilities-bundle-0.5.1-SNAPSHOT.jar \
        --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
        --storage-type COPY_ON_WRITE \
        --source-ordering-field __ts_ms \
        --target-base-path s3://{BASE_PATH}/hudi/${DATABASE}/${TABLE}/ \
        --target-table cdc_flat_cow \
        --props ${HUDI_CONFIG} \
        --enable-hive-sync \
        --custom-props custom_config.json \
        --continuous \
        --source-limit 1000000

      Hive Metastore Properties:
      hive.metastore.disallow.incompatible.col.type.changes=false;
      parquet.column.index.access='false'
  14. HUDI Properties (for Athena):
      # Cleanup policy
      hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
      hoodie.cleaner.fileversions.retained=1
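
The CDC events themselves (slide 3, and the Debezium row in slide 6) arrive on Kafka as change-event envelopes. A trimmed sketch of a Debezium update event, shown only to illustrate the envelope; the table, columns and values are made up, while the field names (before, after, source, op, ts_ms) and the op codes c/u/d/r are standard Debezium:

    {
      "before": { "id": 42, "status": "PENDING",   "amount": 120.00 },
      "after":  { "id": 42, "status": "COMPLETED", "amount": 120.00 },
      "source": { "connector": "mysql", "db": "payments", "table": "orders", "ts_ms": 1575541020000 },
      "op": "u",
      "ts_ms": 1575541020312
    }

The --source-ordering-field __ts_ms in slide 13 most likely corresponds to this event timestamp after the envelope is flattened, which is what lets HUDI keep only the latest version of each record during upserts.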
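
For the query-based column in slide 5, the Kafka Connect JDBC source connector is the usual example: it polls the table on an interval and infers changes from an incrementing id plus a timestamp column, which is why deletes are not tracked and latency is tied to the poll interval. A minimal sketch; the connector name, connection details and column names are placeholders:

    {
      "name": "orders-jdbc-source",
      "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:mysql://mysql-host:3306/payments",
        "connection.user": "cdc_user",
        "connection.password": "********",
        "mode": "timestamp+incrementing",
        "incrementing.column.name": "id",
        "timestamp.column.name": "updated_at",
        "table.whitelist": "orders",
        "topic.prefix": "jdbc-payments-",
        "poll.interval.ms": "60000"
      }
    }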
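
The log-based path from slides 5-6 runs Debezium as a Kafka Connect connector that tails the MySQL binlog, emits one topic per table, and keeps its history and checkpoints in Kafka topics. A minimal registration sketch against the Kafka Connect REST API; hostnames, credentials, the server id and the Schema Registry URL are placeholders, and the property names follow the Debezium 1.x MySQL connector:

    curl -X POST -H "Content-Type: application/json" http://connect-host:8083/connectors -d '{
      "name": "payments-debezium",
      "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql-host",
        "database.port": "3306",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.server.id": "5400",
        "database.server.name": "payments",
        "table.whitelist": "payments.orders",
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes.payments",
        "key.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "key.converter.schema.registry.url": "http://schema-registry:8081",
        "value.converter.schema.registry.url": "http://schema-registry:8081"
      }
    }'

Using the Avro converter with Schema Registry is what makes the topics consumable by the AvroKafkaSource in slide 13 and gives the pipeline a place to enforce schema-evolution rules.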
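
For the schema-evolution challenge in slide 9, one common control point is the compatibility level in Confluent Schema Registry: with a per-subject setting, an upstream column change that would break existing Avro consumers is rejected when the connector tries to register the new schema. A sketch using the Schema Registry REST API; the URL and subject name are placeholders:

    # Set BACKWARD compatibility for one change-event topic's value schema
    curl -X PUT -H "Content-Type: application/vnd.schemaregistry.v1+json" \
      --data '{"compatibility": "BACKWARD"}' \
      http://schema-registry:8081/config/payments.payments.orders-value

    # Inspect the subject-level setting
    curl http://schema-registry:8081/config/payments.payments.orders-value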
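
Slide 13 passes most of the table-level wiring through --props ${HUDI_CONFIG}. The slides do not show that file, so the following is only a sketch of what such a DeltaStreamer properties file typically contains; the field names, topic, subject and hostnames are placeholders, while the hoodie.* keys are standard HUDI/DeltaStreamer properties. It also shows where the "Hudi key = Record key + Partition key" idea from slide 8 is configured:

    # Hudi key: record key + partition path (slide 8)
    hoodie.datasource.write.recordkey.field=id
    hoodie.datasource.write.partitionpath.field=created_date

    # Kafka source consumed by AvroKafkaSource
    hoodie.deltastreamer.source.kafka.topic=payments.payments.orders
    bootstrap.servers=kafka:9092
    auto.offset.reset=earliest

    # Avro schemas pulled from Confluent Schema Registry
    hoodie.deltastreamer.schemaprovider.registry.url=http://schema-registry:8081/subjects/payments.payments.orders-value/versions/latest

    # Hive sync (used with --enable-hive-sync)
    hoodie.datasource.hive_sync.database=payments
    hoodie.datasource.hive_sync.table=orders_cdc
    hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://hive-host:10000
    hoodie.datasource.hive_sync.partition_fields=created_date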
