Change Data Capture to Data Lakes Using Apache Pulsar/Hudi
Pulsar Summit NA 2021
Speaker Bio
PMC Chair/Creator of Apache Hudi
Sr. Staff Eng @ Uber (Data Infra/Platforms, Networking)
Principal Eng @ Confluent (ksqlDB, Kafka/Streams)
Staff Eng @ LinkedIn (Voldemort, DDS)
Sr. Eng @ Oracle (CDC/GoldenGate/XStream)
Agenda
1) Background On CDC
2) CDC to Lakes
3) Hudi Overview
4) Onwards
Background
CDC - What, Why
Change Data Capture
Design Pattern for Data Integration
- Not tied to any particular technology
- Delivers changes with low latency
System for tracking, fetching new data
- Not concerned with how to use such data
- Ideally, incremental update downstream
- Minimizing the number of bits read/written per change
Change is the ONLY Constant
- Even in Computer science
- Data is immutable = Myth (well, kinda)
Polling an external API
- Timestamps, status indicators, versions
- Simple, works for small-scale data changes
- E.g: Polling the GitHub events API
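A minimal polling sketch, assuming the public GitHub events endpoint and using its ETag/If-None-Match headers as the incremental watermark; the poll interval and the processing hook are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Polling-based CDC: repeatedly ask the source "anything new since my watermark?"
// Here the watermark is the ETag GitHub returned on the previous poll.
public class GithubEventsPoller {
  public static void main(String[] args) throws Exception {
    HttpClient http = HttpClient.newHttpClient();
    String etag = null; // last-seen watermark

    while (true) {
      HttpRequest.Builder req = HttpRequest.newBuilder()
          .uri(URI.create("https://api.github.com/events"))
          .timeout(Duration.ofSeconds(10));
      if (etag != null) {
        req.header("If-None-Match", etag); // only fetch if something changed
      }

      HttpResponse<String> resp = http.send(req.build(), HttpResponse.BodyHandlers.ofString());
      if (resp.statusCode() == 200) {
        etag = resp.headers().firstValue("ETag").orElse(etag);
        process(resp.body());            // hand the new events downstream
      } // 304 Not Modified => nothing changed since the last poll

      Thread.sleep(60_000);              // poll interval; respect rate limits
    }
  }

  static void process(String jsonBody) { /* parse and forward events */ }
}
```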
Emit Event From App
- Data model to encode deltas
- Scales for high-volume data changes
- E.g: Emitting sensor state changes to Pulsar
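A minimal sketch of the app-emitted pattern with the Pulsar Java client; the topic name and JSON payload shape are illustrative.

```java
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

// The application itself encodes and emits each state change as an event.
public class SensorChangePublisher {
  public static void main(String[] args) throws Exception {
    PulsarClient client = PulsarClient.builder()
        .serviceUrl("pulsar://localhost:6650")
        .build();

    Producer<String> producer = client.newProducer(Schema.STRING)
        .topic("persistent://public/default/sensor-state-changes")
        .create();

    // Key by sensor id so all changes for one sensor stay ordered per key.
    producer.newMessage()
        .key("sensor-42")
        .value("{\"sensor_id\":\"sensor-42\",\"state\":\"OPEN\",\"ts\":1625097600000}")
        .send();

    producer.close();
    client.close();
  }
}
```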
Scan Database Redo Log
- SCN and other watermarks to extract data/metadata changes
- Operationally heavy, very high fidelity
- E.g: Using Debezium to obtain changelogs from MySQL
CDC vs Extract-Transform-Load?
CDC is merely Incremental Extraction
- Not really competing concepts
- ETL needs one-time full bootstrap
CDC changes T and L significantly
- T on change streams, not just table state
- L incrementally, not just bulk reloads
Incremental L = Apply
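To make "Incremental L = Apply" concrete, a small sketch that applies keyed change records against current state instead of bulk reloading; the ChangeRecord envelope is hypothetical, standing in for whatever the CDC format (e.g. Debezium's op/before/after fields) actually carries.

```java
import java.util.HashMap;
import java.util.Map;

// Incremental load: apply each change (upsert or delete) against the sink,
// keyed by primary key, rather than rewriting the whole table.
public class ChangeApplier {
  enum Op { UPSERT, DELETE }

  record ChangeRecord(String key, Op op, String after) {}

  private final Map<String, String> table = new HashMap<>();

  public void apply(ChangeRecord change) {
    switch (change.op()) {
      case UPSERT -> table.put(change.key(), change.after());
      case DELETE -> table.remove(change.key());
    }
  }
}
```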
CDC vs Stream Processing
CDC enables Streaming ETL
- Why bulk T & L anymore?
- Process change streams
- Mutable Sinks
Reliable Stream Processing needs distributed logs
- Rewind/Replay CDC logs
- Absorb spikes/batch writes to sinks
Ideal CDC Source
Support reliable incremental consumption
- Offsets are stable, deterministic
- Efficient fetching of new changes
Support rewinding/replay
- Database redo logs are typically purged frequently
- Event Logs offer tiering/large retention
Support ordering of changes
- Out-of-order apply => incorrect results
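With Pulsar as the change log, these properties map onto ordinary consumer APIs: stable offsets (MessageIds), replay via seek(), and in-order delivery per partition on an exclusive subscription. The topic name and the downstream apply step below are illustrative.

```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.SubscriptionType;

public class ChangeLogReplay {
  public static void main(String[] args) throws Exception {
    PulsarClient client = PulsarClient.builder()
        .serviceUrl("pulsar://localhost:6650")
        .build();

    Consumer<String> consumer = client.newConsumer(Schema.STRING)
        .topic("persistent://public/default/mysql.orders.changelog") // illustrative topic
        .subscriptionName("lake-ingest")
        .subscriptionType(SubscriptionType.Exclusive) // preserves order within a partition
        .subscribe();

    // Rewind to the start of retained history (or to a timestamp) to replay changes.
    consumer.seek(MessageId.earliest);

    while (true) {
      Message<String> msg = consumer.receive(30, TimeUnit.SECONDS);
      if (msg == null) break;          // caught up for now
      applyDownstream(msg.getValue()); // incremental apply into the sink
      consumer.acknowledge(msg);
    }
    consumer.close();
    client.close();
  }

  static void applyDownstream(String change) { /* write to the lake */ }
}
```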
Ideal CDC Sink
Mutable, Transactional
- Reliably apply changes
Quickly absorb changes
- Sinks are often bottlenecks
- Random I/O
Bonus: Also act as CDC Source
- Keep the stream flowing
CDC to Lakes
Putting Pulsar and Hudi to work
What is a Data Lake?
Architectural Pattern for Analytical Data
- Data Lake != Spark, Flink
- Data Lake != Files on S3
- Raw data (OLTP schema)
- Derived Data (OLAP/BI, ML schema)
Vast Storage
- Object storage vs dedicated storage nodes
- Open formats (data + metadata)
Scalable Compute
- Many query engine choices
Source: https://martinfowler.com/bliki/images/dataLake/context.png
[Diagram: operational data infrastructure (databases, events, apps/services, external sources) feeds a change stream into analytics data infrastructure (tables on DFS/cloud storage, serving queries). Caption: Why not?]
Challenges
Data Lakes are often file dumps
- Reliably change subset of files
- Transactional, Concurrency Control
Getting “ALL” data quickly
- Apply updates quickly
- Scalable Deletes, to ensure compliance
Lakes think in big batches
- Difficult to align batch intervals, to join
- Large skews for “event_time” streaming joins
Apache Hudi
Transactional Writes, MVCC/OCC
- Fully managed file/object storage
- Automatic compaction, clustering, sizing
First class support for Updates, Deletes
- Record level Update/Deletes inspired by stream processors
CDC Streams From Lake Storage
- Storage Layout optimized for incremental fetches
- Hudi’s unique contribution in the space
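A minimal upsert into a Hudi table through the Spark DataSource (Java API); the table name, path and field names are illustrative, while the write options shown are standard Hudi configs.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

// Apply a batch of change records to a Hudi table as a transactional upsert.
public class HudiUpsertExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-cdc-upsert")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate();

    Dataset<Row> changes = spark.read().format("json")
        .load("s3a://my-bucket/incoming/orders-changes/"); // illustrative input

    changes.write().format("hudi")
        .option("hoodie.table.name", "orders")
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.datasource.write.recordkey.field", "order_id")     // record key
        .option("hoodie.datasource.write.precombine.field", "updated_at")  // latest wins
        .option("hoodie.datasource.write.partitionpath.field", "order_date")
        .mode(SaveMode.Append)
        .save("s3a://my-bucket/lake/orders");
  }
}
```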
Pulsar (Source) + Hudi (Sink)
[Diagram: change streams from Pulsar flow through Pulsar source connectors into Hudi’s DeltaStreamer (de-dupe, indexing, transactions) and land as tables on DFS/cloud storage, with clustering, optimization and compaction applied.]
Applying Event Logs : PR#3096
Applying Database Changes : Coming Soon..
Streaming ETL using Hudi
Hudi Overview
Intro, Components, APIs, Design Choices
Hudi Data Lake
Original pioneer of the transactional data lake movement
Embeddable, Serverless, Distributed Database abstraction layer over DFS
- We invented this!
Hadoop Upserts, Deletes & Incrementals
Provide transactional updates/deletes
First class support for record level CDC streams
Stream Processing is Fast & Efficient
Streaming Stack
+ Intelligent, Incremental
+ Fast, Efficient
- Row oriented
- Not scan optimized
Batch Stack
+ Scans, Columnar formats
+ Scalable Compute
- Naive, Inefficient
What If: Streaming Model on Batch Data?
The Incremental Stack
+ Intelligent, Incremental
+ Fast, Efficient
+ Scans, Columnar formats
+ Scalable Compute
https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/; 2016
Hudi : Open Sourcing & Evolution..
2015 : Published core ideas/principles for incremental processing (O’Reilly article)
2016 : Project created at Uber & powers all database/business-critical feeds @ Uber
2017 : Project open sourced by Uber & work began on Merge-On-Read, Cloud support
2018 : Picked up adopters, hardening, async compaction..
2019 : Incubated into ASF, community growth, added more platform components.
2020 : Top level Apache project, Over 10x growth in community, downloads, adoption
2021 : SQL DMLs, Flink Continuous Queries, More indexing schemes, Metaserver, Caching
Apache Hudi - Adoption
Committers/Contributors: Uber, AWS, Alibaba, Tencent, Robinhood, Moveworks, Confluent, Snowflake, Bytedance, Zendesk, Yotpo and more
https://hudi.apache.org/docs/powered_by.html
The Hudi Stack
Complete “data” lake platform
Tightly integrated, Self managing
Write using Spark, Flink
Query using Spark, Flink, Hive, Presto, Trino, Impala, AWS Athena/Redshift, Aliyun DLA etc.
Out-of-the-box tools/services for painless dataops
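For the query side, a Hudi table reads back like any Spark DataSource; the sketch below assumes the illustrative orders table written earlier, and the other engines query the same storage through their own connectors.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Snapshot query of a Hudi table via the Spark DataSource API.
public class HudiSnapshotQuery {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("hudi-query").getOrCreate();

    Dataset<Row> orders = spark.read().format("hudi")
        .load("s3a://my-bucket/lake/orders"); // illustrative path

    orders.createOrReplaceTempView("orders");
    spark.sql("SELECT order_date, count(*) FROM orders GROUP BY order_date").show();
  }
}
```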
Our Design Goals
Streaming/Incremental
- Upsert/Delete Optimized
- Key based operations
Faster
- Frequent Commits
- Design around logs
- Minimize overhead
Delta Logs at File Level over Global
Each file group is its own self-contained log
- Constant metadata size, controlled by “retention” parameters
- Leverage append() when available; lower metadata overhead
Merges are local to each file group
- UUID keys throw off any range pruning
Record Indexes over Just File/Column Stats
Index maps key to a file group
- During upsert/deletes
- Much like a streaming state store
Workloads have different shapes
- Late-arriving updates; totally random
- Trickle down to derived tables
Many pluggable options
- Bloom Filters + Key ranges
- HBase, Join based
- Global vs Local
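The index choice surfaces as writer configuration; a sketch of two option bundles, using real Hudi config keys with illustrative values (ZooKeeper quorum, table name), meant to be merged into the writer options from the earlier upsert example.

```java
import java.util.Map;

// Pluggable index configuration for the Hudi writer.
public class HudiIndexOptions {
  public static Map<String, String> bloomIndex() {
    return Map.of(
        "hoodie.index.type", "BLOOM",                   // bloom filters + key ranges
        "hoodie.bloom.index.prune.by.ranges", "true");  // skip files via min/max key ranges
  }

  public static Map<String, String> globalHbaseIndex() {
    return Map.of(
        "hoodie.index.type", "HBASE",                   // external, global key -> file group index
        "hoodie.index.hbase.zkquorum", "zk1:2181",      // illustrative ZooKeeper quorum
        "hoodie.index.hbase.table", "hudi_orders_index");
  }
}
```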
MVCC Concurrency Control over Only OCC
Frequent commits => more frequent clustering/compaction => more contention
Differentiate writers vs table services
- Much like what databases do
- Table services don’t contend with writers
- Async compaction/clustering
Don’t be so “Optimistic”
- OCC b/w writers; works, until it doesn’t
- Retries, split txns, wasted resources
- MVCC/log based between writers/table services
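These choices also show up as configuration: keep compaction/clustering out of the writer's critical path, and reserve OCC for genuine multi-writer setups. A sketch with real Hudi config keys and illustrative values follows.

```java
import java.util.Map;

// Separate writers from table services, and opt into OCC only when needed.
public class HudiConcurrencyOptions {
  public static Map<String, String> asyncTableServices() {
    return Map.of(
        "hoodie.compact.inline", "false",            // don't compact inside the write path
        "hoodie.clustering.async.enabled", "true");  // cluster asynchronously
  }

  public static Map<String, String> multiWriterOcc() {
    return Map.of(
        "hoodie.write.concurrency.mode", "optimistic_concurrency_control",
        "hoodie.write.lock.provider",
            "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
        "hoodie.write.lock.zookeeper.url", "zk1:2181"); // illustrative
  }
}
```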
Record Level Merge API over Only Overwrites
More generalized approach
- Default: overwrite w/ latest writer wins
- Support business-specific resolution
Log partial updates
- Log just the changed columns
- Drastic reduction in write amplification
Log based reconciliation
- Delete, Undelete based on business logic
- CRDT, Operational Transform-like delayed conflict resolution
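Merge behaviour is selected per table via the record payload class; a sketch of choosing between the default latest-write-wins payload and a non-defaults-overwriting payload. The class names come from hudi-common; everything else here is illustrative.

```java
import java.util.Map;

// The payload class decides how an incoming record merges with the stored one.
public class HudiMergeOptions {
  public static Map<String, String> latestWinsDefault() {
    return Map.of("hoodie.datasource.write.payload.class",
        "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload");
  }

  public static Map<String, String> keepExistingForUnsetFields() {
    // Overwrites only fields the incoming record actually sets (non-default values),
    // a partial-update style merge.
    return Map.of("hoodie.datasource.write.payload.class",
        "org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload");
  }
}
```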
Specialized Database over Generalized Format
Approach it more like a shared-nothing database
- Daemons aware of each other
- E.g: Compaction, Cleaning in RocksDB
E.g: Clustering & Compaction know each other
- Reconcile metadata based on time order
- Compactions avoid redundant scheduling
Self Managing
- Sorting, time-order preservation, file sizing
Record level CDC over File/Snapshot Diffing
Per record metadata
- _hoodie_commit_time : Kafka-style compacted change streams in commit order
- _hoodie_commit_seqno : consume large commits in chunks, a la Kafka offsets
File group design => CDC friendly
- Efficient retrieval of old, new values
- Efficient retrieval of all values for a key
Infinite retention/lookback coming later in 2021
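Consuming the table as a change stream is an incremental query keyed off _hoodie_commit_time; a Spark sketch with an illustrative path and begin instant.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Incremental query: return only records committed after a given instant.
public class HudiIncrementalPull {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("hudi-incr").getOrCreate();

    Dataset<Row> changes = spark.read().format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", "20210601000000") // last processed commit
        .load("s3a://my-bucket/lake/orders");

    changes.select("_hoodie_commit_time", "_hoodie_commit_seqno", "order_id").show();
  }
}
```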
Onwards
Ideas, Ongoing work, Future Plans
Pulsar Source In Hudi
PR#3096 is up and WIP! (Contributions welcome)
- Supports Avro/KEY_VALUE, partitioned/non-partitioned topics, checkpointing
- All Hudi operations, bells-and-whistles
- Consolidate with Kafka-Debezium Source work
Hudi facing work
- Adding transformers, record payload for parsing Debezium logs
- Hardening, functional/scale testing
Pulsar facing work
- Better Spark batch query datasource support in Apache Pulsar
- Streamnative/pulsar-spark : upgrade to Spark 3 (we know it’s painful) / Scala 2.12, support for KV records
Hudi Sink from Pulsar
Push to the lake in real-time
- Today’s model is “pull based”, micro-batch
- Transactional, Concurrency Control
Hudi facing work
- Harden hudi-java-client
- ~1 min commit frequency, while retaining a month of history
Pulsar facing work
- How to write exactly once across tasks?
- How to perform indexing etc. efficiently without shuffling data around
Pulsar Tiered Storage
Columnar reads off Hudi
- Leverage Hudi’s metadata to track ordering/changes
- Push projections/filters to Hudi
- Faster backfills!
Work/Challenges
- Most open-ended of the lot
- Pluggable Tiered storage API in Pulsar (exists?)
- Mapping offsets to _hoodie_commit_seqno
- Leverage Hudi’s compaction and other machinery
Engage With Our Community
User Docs : https://hudi.apache.org
Technical Wiki : https://cwiki.apache.org/confluence/display/HUDI
Github : https://github.com/apache/hudi/
Twitter : https://twitter.com/apachehudi
Mailing list(s) : dev-subscribe@hudi.apache.org (send an empty email to subscribe)
dev@hudi.apache.org (actual mailing list)
Slack : https://join.slack.com/t/apache-hudi/signup
Thanks!
Questions?
