Change Data Capture to Data Lakes Using Apache Pulsar/Hudi
Pulsar Summit NA 2021
Speaker Bio
PMC Chair/Creator of Apache Hudi
Sr. Staff Eng @ Uber (Data Infra/Platforms, Networking)
Principal Eng @ Confluent (ksqlDB, Kafka/Streams)
Staff Eng @ LinkedIn (Voldemort, DDS)
Sr. Eng @ Oracle (CDC/GoldenGate/XStream)
Agenda
1) Background On CDC
2) CDC to Lakes
3) Hudi Overview
4) Onwards
Background
CDC - What, Why
Change Data Capture
Design Pattern for Data Integration
- Not tied to any particular technology
- Delivers changes with low latency
System for tracking, fetching new data
- Not concerned with how to use such data
- Ideally, incremental update downstream
- Minimizing the number of bits read/written per change
Change is the ONLY Constant
- Even in Computer science
- Data is immutable = Myth (well, kinda)
Polling an external API
- Timestamps, status indicators, versions
- Simple, works for small-scale data changes
- E.g: Polling the GitHub events API
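A minimal polling sketch, assuming the public GitHub events endpoint and using its ETag/If-None-Match headers as the incremental watermark; the poll interval and the processing hook are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Polling-based CDC: repeatedly ask the source "anything new since my watermark?"
// Here the watermark is the ETag GitHub returned on the previous poll.
public class GithubEventsPoller {
  public static void main(String[] args) throws Exception {
    HttpClient http = HttpClient.newHttpClient();
    String etag = null; // last-seen watermark

    while (true) {
      HttpRequest.Builder req = HttpRequest.newBuilder()
          .uri(URI.create("https://api.github.com/events"))
          .timeout(Duration.ofSeconds(10));
      if (etag != null) {
        req.header("If-None-Match", etag); // only fetch if something changed
      }

      HttpResponse<String> resp = http.send(req.build(), HttpResponse.BodyHandlers.ofString());
      if (resp.statusCode() == 200) {
        etag = resp.headers().firstValue("ETag").orElse(etag);
        process(resp.body());            // hand the new events downstream
      } // 304 Not Modified => nothing changed since the last poll

      Thread.sleep(60_000);              // poll interval; respect rate limits
    }
  }

  static void process(String jsonBody) { /* parse and forward events */ }
}
```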
Emit Event From App
- Data model to encode deltas
- Scales for high-volume data changes
- E.g: Emitting sensor state changes to Pulsar
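A minimal sketch of the app-emitted pattern with the Pulsar Java client; the topic name and JSON payload shape are illustrative.

```java
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

// The application itself encodes and emits each state change as an event.
public class SensorChangePublisher {
  public static void main(String[] args) throws Exception {
    PulsarClient client = PulsarClient.builder()
        .serviceUrl("pulsar://localhost:6650")
        .build();

    Producer<String> producer = client.newProducer(Schema.STRING)
        .topic("persistent://public/default/sensor-state-changes")
        .create();

    // Key by sensor id so all changes for one sensor stay ordered per key.
    producer.newMessage()
        .key("sensor-42")
        .value("{\"sensor_id\":\"sensor-42\",\"state\":\"OPEN\",\"ts\":1625097600000}")
        .send();

    producer.close();
    client.close();
  }
}
```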
Scan Database Redo Log
- SCN and other watermarks to extract data/metadata changes
- Operationally heavy, very high fidelity
- E.g: Using Debezium to obtain changelogs from MySQL
CDC vs Extract-Transform-Load?
CDC is merely Incremental Extraction
- Not really competing concepts
- ETL needs one-time full bootstrap
CDC changes T and L significantly
- T on change streams, not just table state
- L incrementally, not just bulk reloads
Incremental L = Apply
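To make "Incremental L = Apply" concrete, a small sketch that applies keyed change records against current state instead of bulk reloading; the ChangeRecord envelope is hypothetical, standing in for whatever the CDC format (e.g. Debezium's op/before/after fields) actually carries.

```java
import java.util.HashMap;
import java.util.Map;

// Incremental load: apply each change (upsert or delete) against the sink,
// keyed by primary key, rather than rewriting the whole table.
public class ChangeApplier {
  enum Op { UPSERT, DELETE }

  record ChangeRecord(String key, Op op, String after) {}

  private final Map<String, String> table = new HashMap<>();

  public void apply(ChangeRecord change) {
    switch (change.op()) {
      case UPSERT -> table.put(change.key(), change.after());
      case DELETE -> table.remove(change.key());
    }
  }
}
```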
CDC vs Stream Processing
CDC enables Streaming ETL
- Why bulk T & L anymore?
- Process change streams
- Mutable Sinks
Reliable Stream Processing needs distributed logs
- Rewind/Replay CDC logs
- Absorb spikes/batch writes to sinks
Ideal CDC Source
Support reliable incremental consumption
- Offsets are stable, deterministic
- Efficient fetching of new changes
Support rewinding/replay
- Database redo logs are typically purged frequently
- Event Logs offer tiering/large retention
Support ordering of changes
- Out-of-order apply => incorrect results
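With Pulsar as the change log, these properties map onto ordinary consumer APIs: stable offsets (MessageIds), replay via seek(), and in-order delivery per partition on an exclusive subscription. The topic name and the downstream apply step below are illustrative.

```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.SubscriptionType;

public class ChangeLogReplay {
  public static void main(String[] args) throws Exception {
    PulsarClient client = PulsarClient.builder()
        .serviceUrl("pulsar://localhost:6650")
        .build();

    Consumer<String> consumer = client.newConsumer(Schema.STRING)
        .topic("persistent://public/default/mysql.orders.changelog") // illustrative topic
        .subscriptionName("lake-ingest")
        .subscriptionType(SubscriptionType.Exclusive) // preserves order within a partition
        .subscribe();

    // Rewind to the start of retained history (or to a timestamp) to replay changes.
    consumer.seek(MessageId.earliest);

    while (true) {
      Message<String> msg = consumer.receive(30, TimeUnit.SECONDS);
      if (msg == null) break;          // caught up for now
      applyDownstream(msg.getValue()); // incremental apply into the sink
      consumer.acknowledge(msg);
    }
    consumer.close();
    client.close();
  }

  static void applyDownstream(String change) { /* write to the lake */ }
}
```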
Ideal CDC Sink
Mutable, Transactional
- Reliably apply changes
Quickly absorb changes
- Sinks are often bottlenecks
- Random I/O
Bonus: Also act as CDC Source
- Keep the stream flowing
CDC to Lakes
Putting Pulsar and Hudi to work
What is a Data Lake?
Architectural Pattern for Analytical Data
- Data Lake != Spark, Flink
- Data Lake != Files on S3
- Raw data (OLTP schema)
- Derived Data (OLAP/BI, ML schema)
Vast Storage
- Object storage vs dedicated storage nodes
- Open formats (data + metadata)
Scalable Compute
- Many query engine choices
Source: https://martinfowler.com/bliki/images/dataLake/context.png
[Diagram: operational data infrastructure (databases, events, apps/services, external sources) feeds a change stream into analytics data infrastructure (tables on DFS/cloud storage, serving queries). Caption: Why not?]
Challenges
Data Lakes are often file dumps
- Reliably change subset of files
- Transactional, Concurrency Control
Getting “ALL” data quickly
- Apply updates quickly
- Scalable Deletes, to ensure compliance
Lakes think in big batches
- Difficult to align batch intervals, to join
- Large skews for “event_time” streaming joins
Apache Hudi
Transactional Writes, MVCC/OCC
- Fully managed file/object storage
- Automatic compaction, clustering, sizing
First class support for Updates, Deletes
- Record level Update/Deletes inspired by stream processors
CDC Streams From Lake Storage
- Storage Layout optimized for incremental fetches
- Hudi’s unique contribution in the space
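A minimal upsert into a Hudi table through the Spark DataSource (Java API); the table name, path and field names are illustrative, while the write options shown are standard Hudi configs.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

// Apply a batch of change records to a Hudi table as a transactional upsert.
public class HudiUpsertExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-cdc-upsert")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate();

    Dataset<Row> changes = spark.read().format("json")
        .load("s3a://my-bucket/incoming/orders-changes/"); // illustrative input

    changes.write().format("hudi")
        .option("hoodie.table.name", "orders")
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.datasource.write.recordkey.field", "order_id")     // record key
        .option("hoodie.datasource.write.precombine.field", "updated_at")  // latest wins
        .option("hoodie.datasource.write.partitionpath.field", "order_date")
        .mode(SaveMode.Append)
        .save("s3a://my-bucket/lake/orders");
  }
}
```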
Pulsar (Source) + Hudi (Sink)
[Diagram: change streams from Pulsar flow through Pulsar source connectors into Hudi’s DeltaStreamer (de-dupe, indexing, transactions) and land as tables on DFS/cloud storage, with clustering, optimization and compaction applied.]
Applying Event Logs : PR#3096
Applying Database Changes : Coming Soon..
Streaming ETL using Hudi
Hudi Overview
Intro, Components, APIs, Design Choices
Hudi Data Lake
Original pioneer of the transactional data lake movement
Embeddable, Serverless, Distributed Database abstraction layer over DFS
- We invented this!
Hadoop Upserts, Deletes & Incrementals
Provide transactional updates/deletes
First class support for record level CDC streams
Stream Processing is Fast & Efficient
Streaming Stack
+ Intelligent, Incremental
+ Fast, Efficient
- Row oriented
- Not scan optimized
Batch Stack
+ Scans, Columnar formats
+ Scalable Compute
- Naive, Inefficient
What If: Streaming Model on Batch Data?
The Incremental Stack
+ Intelligent, Incremental
+ Fast, Efficient
+ Scans, Columnar formats
+ Scalable Compute
https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/; 2016
Hudi : Open Sourcing & Evolution..
2015 : Published core ideas/principles for incremental processing (O’Reilly article)
2016 : Project created at Uber & powers all database/business-critical feeds @ Uber
2017 : Project open sourced by Uber & work began on Merge-On-Read, Cloud support
2018 : Picked up adopters, hardening, async compaction..
2019 : Incubated into ASF, community growth, added more platform components.
2020 : Top level Apache project, Over 10x growth in community, downloads, adoption
2021 : SQL DMLs, Flink Continuous Queries, More indexing schemes, Metaserver, Caching
Apache Hudi - Adoption
Committers/Contributors: Uber, AWS, Alibaba, Tencent, Robinhood, Moveworks, Confluent, Snowflake, Bytedance, Zendesk, Yotpo and more
https://hudi.apache.org/docs/powered_by.html
The Hudi Stack
Complete “data” lake platform
Tightly integrated, Self managing
Write using Spark, Flink
Query using Spark, Flink, Hive, Presto, Trino, Impala, AWS Athena/Redshift, Aliyun DLA etc.
Out-of-the-box tools/services for painless dataops
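For the query side, a Hudi table reads back like any Spark DataSource; the sketch below assumes the illustrative orders table written earlier, and the other engines query the same storage through their own connectors.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Snapshot query of a Hudi table via the Spark DataSource API.
public class HudiSnapshotQuery {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("hudi-query").getOrCreate();

    Dataset<Row> orders = spark.read().format("hudi")
        .load("s3a://my-bucket/lake/orders"); // illustrative path

    orders.createOrReplaceTempView("orders");
    spark.sql("SELECT order_date, count(*) FROM orders GROUP BY order_date").show();
  }
}
```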
Our Design Goals
Streaming/Incremental
- Upsert/Delete Optimized
- Key based operations
Faster
- Frequent Commits
- Design around logs
- Minimize overhead
Delta Logs at File Level over Global
Each file group is its own self-contained log
- Constant metadata size, controlled by “retention” parameters
- Leverage append() when available; lower metadata overhead
Merges are local to each file group
- UUID keys throw off any range pruning
Record Indexes over Just File/Column Stats
Index maps key to a file group
- During upsert/deletes
- Much like a streaming state store
Workloads have different shapes
- Late-arriving updates; totally random
- Trickle down to derived tables
Many pluggable options
- Bloom Filters + Key ranges
- HBase, Join based
- Global vs Local
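The index choice surfaces as writer configuration; a sketch of two option bundles, using real Hudi config keys with illustrative values (ZooKeeper quorum, table name), meant to be merged into the writer options from the earlier upsert example.

```java
import java.util.Map;

// Pluggable index configuration for the Hudi writer.
public class HudiIndexOptions {
  public static Map<String, String> bloomIndex() {
    return Map.of(
        "hoodie.index.type", "BLOOM",                   // bloom filters + key ranges
        "hoodie.bloom.index.prune.by.ranges", "true");  // skip files via min/max key ranges
  }

  public static Map<String, String> globalHbaseIndex() {
    return Map.of(
        "hoodie.index.type", "HBASE",                   // external, global key -> file group index
        "hoodie.index.hbase.zkquorum", "zk1:2181",      // illustrative ZooKeeper quorum
        "hoodie.index.hbase.table", "hudi_orders_index");
  }
}
```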
MVCC Concurrency Control over Only OCC
Frequent commits => more frequent clustering/compaction => more contention
Differentiate writers vs table services
- Much like what databases do
- Table services don’t contend with writers
- Async compaction/clustering
Don’t be so “Optimistic”
- OCC b/w writers; works, until it doesn’t
- Retries, split txns, wasted resources
- MVCC/log based between writers/table services
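These choices also show up as configuration: keep compaction/clustering out of the writer's critical path, and reserve OCC for genuine multi-writer setups. A sketch with real Hudi config keys and illustrative values follows.

```java
import java.util.Map;

// Separate writers from table services, and opt into OCC only when needed.
public class HudiConcurrencyOptions {
  public static Map<String, String> asyncTableServices() {
    return Map.of(
        "hoodie.compact.inline", "false",            // don't compact inside the write path
        "hoodie.clustering.async.enabled", "true");  // cluster asynchronously
  }

  public static Map<String, String> multiWriterOcc() {
    return Map.of(
        "hoodie.write.concurrency.mode", "optimistic_concurrency_control",
        "hoodie.write.lock.provider",
            "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
        "hoodie.write.lock.zookeeper.url", "zk1:2181"); // illustrative
  }
}
```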
Record Level Merge API over Only Overwrites
More generalized approach
- Default: overwrite w/ latest writer wins
- Support business-specific resolution
Log partial updates
- Log just the changed columns
- Drastic reduction in write amplification
Log based reconciliation
- Delete, Undelete based on business logic
- CRDT, Operational Transform-like delayed conflict resolution
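Merge behaviour is selected per table via the record payload class; a sketch of choosing between the default latest-write-wins payload and a non-defaults-overwriting payload. The class names come from hudi-common; everything else here is illustrative.

```java
import java.util.Map;

// The payload class decides how an incoming record merges with the stored one.
public class HudiMergeOptions {
  public static Map<String, String> latestWinsDefault() {
    return Map.of("hoodie.datasource.write.payload.class",
        "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload");
  }

  public static Map<String, String> keepExistingForUnsetFields() {
    // Overwrites only fields the incoming record actually sets (non-default values),
    // a partial-update style merge.
    return Map.of("hoodie.datasource.write.payload.class",
        "org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload");
  }
}
```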
Specialized Database over Generalized Format
Approach it more like a shared-nothing database
- Daemons aware of each other
- E.g: Compaction, Cleaning in RocksDB
E.g: Clustering & Compaction know each other
- Reconcile metadata based on time order
- Compactions avoid redundant scheduling
Self Managing
- Sorting, time-order preservation, file sizing
Record level CDC over File/Snapshot Diffing
Per record metadata
- _hoodie_commit_time : Kafka-style compacted change streams in commit order
- _hoodie_commit_seqno : consume large commits in chunks, a la Kafka offsets
File group design => CDC friendly
- Efficient retrieval of old, new values
- Efficient retrieval of all values for a key
Infinite retention/lookback coming later in 2021
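Consuming the table as a change stream is an incremental query keyed off _hoodie_commit_time; a Spark sketch with an illustrative path and begin instant.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Incremental query: return only records committed after a given instant.
public class HudiIncrementalPull {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("hudi-incr").getOrCreate();

    Dataset<Row> changes = spark.read().format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", "20210601000000") // last processed commit
        .load("s3a://my-bucket/lake/orders");

    changes.select("_hoodie_commit_time", "_hoodie_commit_seqno", "order_id").show();
  }
}
```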
Onwards
Ideas, Ongoing work, Future Plans
Pulsar Source In Hudi
PR#3096 is up and WIP! (Contributions welcome)
- Supports Avro/KEY_VALUE, partitioned/non-partitioned topics, checkpointing
- All Hudi operations, bells-and-whistles
- Consolidate with Kafka-Debezium Source work
Hudi facing work
- Adding transformers, record payload for parsing Debezium logs
- Hardening, functional/scale testing
Pulsar facing work
- Better Spark batch query datasource support in Apache Pulsar
- Streamnative/pulsar-spark : upgrade to Spark 3 (we know it’s painful) / Scala 2.12, support for KV records
Hudi Sink from Pulsar
Push to the lake in real-time
- Today’s model is “pull based”, micro-batch
- Transactional, Concurrency Control
Hudi facing work
- Harden hudi-java-client
- ~1 min commit frequency, while retaining a month of history
Pulsar facing work
- How to write exactly once across tasks?
- How to perform indexing etc. efficiently without shuffling data around
Pulsar Tiered Storage
Columnar reads off Hudi
- Leverage Hudi’s metadata to track ordering/changes
- Push projections/filters to Hudi
- Faster backfills!
Work/Challenges
- Most open-ended of the lot
- Pluggable Tiered storage API in Pulsar (exists?)
- Mapping offsets to _hoodie_commit_seqno
- Leverage Hudi’s compaction and other machinery
Engage With Our Community
User Docs : https://hudi.apache.org
Technical Wiki : https://cwiki.apache.org/confluence/display/HUDI
Github : https://github.com/apache/hudi/
Twitter : https://twitter.com/apachehudi
Mailing list(s) : dev-subscribe@hudi.apache.org (send an empty email to subscribe)
dev@hudi.apache.org (actual mailing list)
Slack : https://join.slack.com/t/apache-hudi/signup
Thanks!
Questions?
