Building robust CDC pipeline with Apache Hudi and Debezium

Tathastu.ai
BUILDING ROBUST CDC PIPELINE WITH APACHE HUDI AND DEBEZIUM @SCALE
• PRATYAKSH
• PURUSHOTHAM
• SYED
• SHAIK
Hadoop Meetup Bangalore (Dec-2019)
AGENDA
• What is CDC?
• Benefits of CDC
• Comparison of CDC Streaming Systems
• Comparison of Reconciler Systems
• CDC Platform Architecture @ Tathastu
• Challenges
• Contribution
• Roadmap
• Questions
CHANGE DATA CAPTURE (CDC): A set of
software design patterns used to determine
(and track) the data that has changed so that
action can be taken using the changed data.
BENEFITS OF CDC
• Low latency
• Event processing
• Real-time analytics and dashboarding
• Audit logging
• Distribute the load round the clock
Method            Log-Based   Query-Based
Tools             Debezium    JDBC Connector
Schema Evolution  Yes         Yes
Processing        Stream      Batch
Audit Track       Preserved   Partially Preserved
Latency           Low         High
Cost              High        Low
Delete Track      Yes         No
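The "Delete Track" and "Audit Track" rows can be illustrated with a small simulation. This is only a sketch: the events mimic the shape of a Debezium change-event envelope (`op`, `before`, `after`, `ts_ms`), while the table contents, the `updated_at` column, and the polling times are made up for illustration.

```python
from typing import Dict, List

# Hypothetical change events shaped like a Debezium envelope
# (op: c = create, u = update, d = delete).
EVENTS: List[dict] = [
    {"op": "c", "before": None,
     "after": {"id": 1, "status": "NEW", "updated_at": 100}, "ts_ms": 100},
    {"op": "c", "before": None,
     "after": {"id": 2, "status": "NEW", "updated_at": 110}, "ts_ms": 110},
    {"op": "u", "before": {"id": 1, "status": "NEW", "updated_at": 100},
     "after": {"id": 1, "status": "PAID", "updated_at": 120}, "ts_ms": 120},
    {"op": "d", "before": {"id": 2, "status": "NEW", "updated_at": 110},
     "after": None, "ts_ms": 130},
]


def apply_log_events(events: List[dict]) -> Dict[int, dict]:
    """Log-based CDC: replay every event in order, including deletes."""
    target: Dict[int, dict] = {}
    for event in events:
        if event["op"] == "d":
            target.pop(event["before"]["id"], None)
        else:
            target[event["after"]["id"]] = event["after"]
    return target


def source_rows_at(events: List[dict], ts: int) -> List[dict]:
    """Rows present in the source table at time ts (simulates a SELECT)."""
    return list(apply_log_events([e for e in events if e["ts_ms"] <= ts]).values())


def poll_query_based(events: List[dict], poll_times: List[int]) -> Dict[int, dict]:
    """Query-based CDC: each poll fetches rows WHERE updated_at > last_seen."""
    target: Dict[int, dict] = {}
    last_seen = -1
    for ts in poll_times:
        changed = [r for r in source_rows_at(events, ts) if r["updated_at"] > last_seen]
        for row in changed:
            target[row["id"]] = row
        if changed:
            last_seen = max(r["updated_at"] for r in changed)
    return target


log_target = apply_log_events(EVENTS)
query_target = poll_query_based(EVENTS, poll_times=[115, 140])
print("log-based ids:  ", sorted(log_target))
print("query-based ids:", sorted(query_target))
```

In this run the log-based target ends with only row 1, while the query-based target still holds row 2: a deleted row never shows up in a later `updated_at` query, so the delete is not propagated.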
Solution                   Maxwell                     Apache NiFi            Debezium
Bootstrap                  Yes                         No                     Yes
Formats                    JSON                        JSON                   JSON, Avro
Message Queues             Kafka, Kinesis, SQS,        NiFi connections       Kafka
                           Google Pub/Sub, RabbitMQ,
                           Redis, Custom Producer
Schema Evolution           Yes                         No                     Yes
Latency                    Low                         Medium                 Low
Supported Databases        MySQL                       MySQL                  MySQL, PostgreSQL, Oracle,
                                                                              SQL Server, MongoDB,
                                                                              Cassandra
Onboarding                 Command Driven              Config and API Driven  Purely API Driven
State Storage/checkpoints  External Database           Zookeeper, External    Kafka topics
                                                       Cache
Solution             Delta.io (Databricks)   Apache HUDI                 Apache Hive (LLAP)
Updates / Deletes    Yes                     Yes                         Yes
Compactions          Manual cleanup,         Automatic, Manual           Automatic, Manual
                     No Compaction
File Format          Parquet                 Parquet, AVRO               ORC
Engine               Spark,                  Spark, Presto, Hive,        Hive, Spark (LLAP)
                     Presto (Recently)       EMR,
                                             Athena (with workaround)
SQL DML              NO                      NO                          YES
Write Amplification  HIGH                    LOW                         LOW
Apache Governance    YES (Recently)          YES                         YES
Credits: Qubole
APACHE HUDI: Hadoop Upserts Deletes and Incrementals
Consists of a self-contained Spark library
Hudi key = Record key + Partition key
Storage types – COPY_ON_WRITE and MERGE_ON_READ
Query Engines – SparkSQL, Hive, Presto
Multiple Cleaning and Compaction policies supported
Key classes – HoodieDeltaStreamer, HiveSyncTool
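The "Hudi key = Record key + Partition key" point, together with an ordering field such as `__ts_ms`, can be pictured with a toy in-memory model. This is a conceptual sketch, not Hudi's API: the field names `id` and `country`, the helper names, and the dict-based "table" are all hypothetical, and it assumes duplicates within one incoming batch are reduced to the row with the highest ordering value before the keyed overwrite.

```python
from typing import Dict, List, Tuple

HudiKey = Tuple[str, str]  # (partition path, record key)
Row = Dict[str, object]


def hudi_key(row: Row, record_key_field: str = "id",
             partition_field: str = "country") -> HudiKey:
    """Build the composite key that identifies a record in the table."""
    return (str(row[partition_field]), str(row[record_key_field]))


def dedupe_batch(batch: List[Row], ordering_field: str = "__ts_ms") -> Dict[HudiKey, Row]:
    """Keep one row per key within a batch: the one with the highest ordering value."""
    latest: Dict[HudiKey, Row] = {}
    for row in batch:
        key = hudi_key(row)
        if key not in latest or row[ordering_field] >= latest[key][ordering_field]:
            latest[key] = row
    return latest


def upsert(table: Dict[HudiKey, Row], batch: List[Row]) -> Dict[HudiKey, Row]:
    """Insert new keys and overwrite existing keys with the deduplicated batch."""
    table.update(dedupe_batch(batch))
    return table


table: Dict[HudiKey, Row] = {}
upsert(table, [
    {"id": 7, "country": "IN", "amount": 10, "__ts_ms": 100},
    {"id": 7, "country": "IN", "amount": 15, "__ts_ms": 200},  # later change, same key
    {"id": 8, "country": "US", "amount": 5, "__ts_ms": 150},
])
upsert(table, [{"id": 8, "country": "US", "amount": 9, "__ts_ms": 300}])
print(table)
```

Two changes to the same key in one batch collapse into a single row, which is why the ordering field has to reflect the order of changes at the source.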
CHALLENGES
• Schema evolution
• Handling datatypes (JDBC)
• Handling RDS internal commands
• Making libraries compatible with latest versions of Kafka and Spark
• Multi-table support in DeltaStreamer
• Enhancing Kafka batch read for bootstrapping (Source Limit)
• Hive Metastore settings
• Queryable HUDI dataset: making it compatible with Athena
CONTRIBUTION
• HUDI-288
• HUDI-340
• HUDI-259
• HUDI-114
• HUDI-118
• HUDI-245
• DBZ-1521
• DBZ-1492
• 563
• 311
• NIFI-6501
• NIFI-6914
• NIFI-6119
ROADMAP
• Build a single-click UI for orchestration
• Data profiler UI for validation and alerts
• Config-store for configs and credentials
• ACL for tables and databases (via Ranger)
• Managing the subscriber list for notifications and alerts
• QUBOLE CDC RECONCILER COMPARISON
• HUDI DETAILED ARCHITECTURE DISCUSSION
• ADVANTAGES OF LOG-BASED OVER QUERY-BASED
HUDI Command

    spark-submit --name debz_futurepay --queue etl \
      --files jaas.conf,custom_config.json \
      --master yarn --deploy-mode cluster \
      --driver-memory 4g --executor-memory 4g --num-executors 50 \
      --class org.apache.hudi.utilities.deltastreamer.CDCStreamer \
      hudi-utilities-bundle-0.5.1-SNAPSHOT.jar \
      --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
      --storage-type COPY_ON_WRITE \
      --source-ordering-field __ts_ms \
      --target-base-path s3://{BASE_PATH}/hudi/${DATABASE}/${TABLE}/ \
      --target-table cdc_flat_cow \
      --props ${HUDI_CONFIG} \
      --enable-hive-sync \
      --custom-props custom_config.json \
      --continuous \
      --source-limit 1000000

Hive Metastore Properties

    hive.metastore.disallow.incompatible.col.type.changes=false;
    parquet.column.index.access='false'
HUDI Properties (For Athena)

    #Cleanup policy
    hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
    hoodie.cleaner.fileversions.retained=1
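The effect of the cleanup policy above can be sketched as follows. This is a simplified model, not Hudi's cleaner: the file-group names and commit timestamps are invented, and it assumes only that `KEEP_LATEST_FILE_VERSIONS` with `retained=1` keeps the newest version of each file group and marks older ones for deletion. One plausible reason for pairing this with Athena (an assumption, not stated on the slide) is that an engine listing the Parquet files directly would otherwise see several versions of the same records.

```python
from typing import Dict, List, Tuple


def files_to_clean(versions_by_file_group: Dict[str, List[str]],
                   retained: int = 1) -> List[Tuple[str, str]]:
    """Return (file group, commit time) pairs older than the newest `retained` versions."""
    to_delete: List[Tuple[str, str]] = []
    for file_group, commit_times in versions_by_file_group.items():
        # Commit times are fixed-width timestamp strings, so they sort lexicographically.
        newest_first = sorted(commit_times, reverse=True)
        to_delete.extend((file_group, t) for t in newest_first[retained:])
    return sorted(to_delete)


# Hypothetical file groups with the commit times of their file versions.
versions = {
    "fg-1": ["20191201100000", "20191201110000", "20191201120000"],
    "fg-2": ["20191201110000"],
}
cleaned = files_to_clean(versions, retained=1)
print(cleaned)
```

With one version retained, each file group ends up with a single current file; the trade-off is that older versions are no longer available for queries that started before the latest commit.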