
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsar Summit NA 2021


Apache Hudi is an open data lake platform, designed around the streaming data model. At its core, Hudi provides transactions, upserts, and deletes on data lake storage, while also enabling CDC capabilities. Hudi also provides a coherent set of table services, which can clean, compact, cluster, and optimize storage layout for better query performance. Finally, Hudi's data services provide out-of-the-box support for streaming data from event systems into lake storage in near real-time.

In this talk, we will walk through an end-to-end use case for change data capture from a relational database, starting with capturing changes using the Pulsar CDC connector, and then demonstrate how you can use the Hudi DeltaStreamer tool to apply these changes to a table on the data lake. We will discuss various tips for operationalizing and monitoring such pipelines. We will conclude with some guidance on future integrations between the two projects, including a native Hudi/Pulsar connector and Hudi tiered storage.


  1. Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
  2. Speaker Bio: PMC Chair/Creator of Hudi; Sr. Staff Eng @ Uber (Data Infra/Platforms, Networking); Principal Eng @ Confluent (ksqlDB, Kafka/Streams); Staff Eng @ LinkedIn (Voldemort, DDS); Sr. Eng @ Oracle (CDC/GoldenGate/XStream)
  3. Agenda: 1) Background on CDC 2) Make a Lake 3) Hudi Deep Dive 4) Onwards
  4. Background: CDC, Data Lakes - What, Why
  5. Change Data Capture: A design pattern for data integration - not tied to any particular technology - a system for tracking and fetching new data with low latency - not concerned with how such data is used - ideally updating downstream incrementally, minimizing the number of bits read/written per change. Change is the ONLY constant - even in computer science - "data is immutable" is a myth (well, kinda).
  6. Examples of CDC: Polling an external API for new events - timestamps, status indicators, versions - simple, works for small-scale data changes - e.g., polling the GitHub events API. Emitting events directly from the application - a data model to encode deltas - scales for high-volume data changes - e.g., emitting sensor state changes to Pulsar. Scanning a database's redo log - SCN and other watermarks to extract data/metadata changes - operationally heavy, very high fidelity - e.g., using Debezium to obtain changelogs from MySQL.
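A minimal sketch of the second pattern - an application emitting its own change events to Pulsar. The Pulsar client calls below are the real Java client API used from Scala; the service URL, topic name, and event payload are illustrative assumptions:

```scala
import org.apache.pulsar.client.api.{PulsarClient, Schema}

object SensorStateEmitter {
  def main(args: Array[String]): Unit = {
    // Connect to a Pulsar broker (service URL is an assumption)
    val client = PulsarClient.builder()
      .serviceUrl("pulsar://localhost:6650")
      .build()

    // "sensor-state-changes" is a hypothetical topic; each message encodes
    // only the delta, not the full sensor state
    val producer = client.newProducer(Schema.STRING)
      .topic("sensor-state-changes")
      .create()

    producer.send("""{"sensorId": "s-42", "field": "temperature", "old": 20.1, "new": 20.7}""")

    producer.close()
    client.close()
  }
}
```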
  7. CDC vs ETL? CDC is merely incremental extraction - not really competing concepts - ETL needs a one-time full bootstrap - <> CDC changes T and L significantly - T on change streams, not just table state - L incrementally, not just bulk reloads
  8. CDC vs Stream Processing: CDC enables streaming ETL - why bulk T & L anymore? - process change streams - mutable sinks. Reliable stream processing needs distributed logs - rewind/replay CDC logs - absorb spikes/batch writes to sinks
  9. Ideal CDC Source: Support reliable incremental consumption - <> Support rewinding/replay - <> Support ordering of changes - <>
  10. Ideal CDC Sink: Mutable, transactional - <> Quickly absorb changes - <> Bonus: also act as a CDC source - <>
  11. Data Lakes: Architectural pattern for analytical data - Data Lake != Spark, Flink - Data Lake != files on S3 - <> Raw data - <> Derived data - <>
  12. CDC to Data Lakes <architecture diagram: change streams carry events from the operational data infrastructure (database, apps/services, external sources) to tables on DFS/cloud storage in the analytics data infrastructure, which serve queries>
  13. Make a Lake: Putting Pulsar and Hudi to work
  14. Data Flow Design <show diagram showing e2e data flow> - <..>
  15. Prerequisites: Running MySQL instance (RDS) - <..> Running Pulsar cluster (??) - <..> Running Spark cluster (e.g., EMR) - <..>
  16. Test Data: Explain ‘users’ table - <..> Explain ‘github_events’ data emitted into Pulsar - <..>
  17. #1: Setup CDC Connector <Show configurations> - <..> <Sample data out of Pulsar> - <..>
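As a sketch of what sampling the change stream out of Pulsar might look like - the topic follows Debezium's server.database.table naming convention, but the exact topic name, broker URL, and subscription name are assumptions:

```scala
import org.apache.pulsar.client.api.{PulsarClient, Schema, SubscriptionInitialPosition}

object SampleCdcEvents {
  def main(args: Array[String]): Unit = {
    val client = PulsarClient.builder()
      .serviceUrl("pulsar://localhost:6650") // assumption: local broker
      .build()

    // Debezium emits JSON envelopes carrying before/after row images
    val consumer = client.newConsumer(Schema.BYTES)
      .topic("mysql-server.demo.users") // hypothetical Debezium topic name
      .subscriptionName("cdc-sampler")
      .subscriptionInitialPosition(SubscriptionInitialPosition.Earliest)
      .subscribe()

    // Print a handful of raw change events
    for (_ <- 1 to 5) {
      val msg = consumer.receive()
      println(new String(msg.getData, "UTF-8"))
      consumer.acknowledge(msg)
    }

    consumer.close()
    client.close()
  }
}
```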
  18. #2: Kick Off Hudi DeltaStreamer <Show configurations, Command to submit> - <..> <Query data out of Hudi tables> - <..>
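Once DeltaStreamer has applied the changes, the table reads like any other Spark source. A minimal snapshot-query sketch - the base path and column names are assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("query-hudi-users")
  // Hudi recommends Kryo serialization for Spark jobs
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

// Snapshot query: the latest committed state of the table
val users = spark.read.format("hudi")
  .load("s3://my-bucket/lake/users") // assumption: table base path

users.createOrReplaceTempView("users")
spark.sql("SELECT _hoodie_commit_time, id, email FROM users LIMIT 10").show()
```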
  19. #3: Streaming ETL using Hudi <Show how to CDC from Hudi itself> - <..> <Sample pipeline that does some enrichment of events> - <..>
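One way such a pipeline could look in Spark, reusing the session above - incrementally pull changes from the upstream Hudi table, enrich them, and upsert into a derived table. The paths, field names, join key, and checkpoint instant are all illustrative assumptions:

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Pull only the records that changed after the last processed commit
val changes = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20210616000000") // last checkpoint
  .load("s3://my-bucket/lake/users")

// Enrich the change stream (dimension table and columns are made up);
// select plain columns to avoid clashing Hudi metadata fields on the join
val segments = spark.read.format("hudi")
  .load("s3://my-bucket/lake/user_segments")
  .select("id", "segment")

val enriched = changes
  .withColumn("source_commit_time", col("_hoodie_commit_time")) // keep upstream commit time
  .drop(changes.columns.filter(_.startsWith("_hoodie_")): _*)
  .join(segments, Seq("id"))

// Upsert the enriched records into a derived Hudi table downstream
enriched.write.format("hudi")
  .option("hoodie.table.name", "users_enriched")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "source_commit_time")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save("s3://my-bucket/lake/users_enriched")
```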
  20. Hudi Deep Dive: Intro, Components, APIs, Design Choices
  21. Hudi Data Lake: The original pioneer of the transactional data lake movement. An embeddable, serverless, distributed database abstraction layer over DFS - we invented this! Hadoop Upserts, Deletes & Incrementals. Provides transactional updates/deletes, with first-class support for record-level CDC streams.
  22. Stream Processing is Fast & Efficient. Streaming stack: + intelligent, incremental + fast, efficient - row-oriented - not scan-optimized. Batch stack: + scans, columnar formats + scalable compute - naive, inefficient.
  23. What If: Streaming Model on Batch Data? The Incremental Stack: + intelligent, incremental + fast, efficient + scans, columnar formats + scalable compute. https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/ (2016)
  24. Hudi: Open Sourcing & Evolution. 2015: Published core ideas/principles for incremental processing (O'Reilly article). 2016: Project created at Uber & powers all database/business-critical feeds @ Uber. 2017: Project open sourced by Uber & work began on Merge-On-Read, cloud support. 2018: Picked up adopters, hardening, async compaction. 2019: Incubated into ASF, community growth, added more platform components. 2020: Top-level Apache project, over 10x growth in community, downloads, adoption. 2021: SQL DMLs, Flink Continuous Queries, more indexing schemes, Metaserver, Caching.
  25. Apache Hudi - Adoption. Committers/Contributors: Uber, AWS, Alibaba, Tencent, Robinhood, Moveworks, Confluent, Snowflake, Bytedance, Zendesk, Yotpo, and more. https://hudi.apache.org/docs/powered_by.html
  26. The Hudi Stack: A complete "data" lake platform. Tightly integrated, self-managing. Write using Spark, Flink. Query using Spark, Flink, Hive, Presto, Trino, Impala, AWS Athena/Redshift, Aliyun DLA, etc. Out-of-the-box tools/services for painless dataops.
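For instance, writing with Spark is a single DataFrame write against the Hudi format. A hedged sketch - the option keys are real Hudi write configs, while the table name, fields, and path are assumptions:

```scala
import org.apache.spark.sql.SaveMode

// df is any DataFrame with a stable record key and an ordering field
df.write.format("hudi")
  .option("hoodie.table.name", "github_events")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ") // or COPY_ON_WRITE
  .option("hoodie.datasource.write.recordkey.field", "event_id")
  .option("hoodie.datasource.write.precombine.field", "created_at")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save("s3://my-bucket/lake/github_events")
```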
  27. Design of a Hudi Table
  28. File Layout
  29. File Groups & Slices
  30. Query Types: Read Optimized Query at 10:10; Snapshot Query at 10:10; Incremental Query (10:08, 10:10)
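In Spark, these three map onto a single read option; the option keys and values below are Hudi's actual query types, while basePath is an assumed table path (using the session from earlier):

```scala
// Snapshot: latest merged view of base + log files
val snapshot = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "snapshot")
  .load(basePath)

// Read optimized: base files only - cheaper, but can lag on MERGE_ON_READ tables
val readOptimized = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "read_optimized")
  .load(basePath)

// Incremental: only records that changed between two commit instants
val incremental = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20210616100800")
  .option("hoodie.datasource.read.end.instanttime", "20210616101000")
  .load(basePath)
```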
  31. Our Design Goals: Streaming/Incremental - upsert/delete optimized - key-based operations. Faster - frequent commits - design around logs - minimize overhead.
  32. Delta Logs at File Level over Global: Each file group is its own self-contained log - constant metadata size, controlled by "retention" parameters - leverage append() when available; lower metadata overhead. Merges are local to each file group - UUID keys throw off any range pruning.
  33. Record Indexes over Just File/Column Stats: Index maps key to a file group - during upserts/deletes - much like a streaming state store. Workloads have different shapes - late-arriving updates; totally random - trickle down to derived tables. Many pluggable options - bloom filters + key ranges - HBase, join-based - global vs local.
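The pluggable options surface as a single writer config. A sketch where the index type value is a real Hudi index type (BLOOM, GLOBAL_BLOOM, SIMPLE, HBASE, ...) and the rest is illustrative, as in the write example under slide 26:

```scala
// BLOOM (bloom filters + key ranges) is the usual default; GLOBAL_BLOOM enforces
// key uniqueness across partitions; HBASE delegates to an external HBase index
df.write.format("hudi")
  .option("hoodie.table.name", "github_events")
  .option("hoodie.datasource.write.recordkey.field", "event_id")
  .option("hoodie.datasource.write.precombine.field", "created_at")
  .option("hoodie.index.type", "GLOBAL_BLOOM")
  .mode(org.apache.spark.sql.SaveMode.Append)
  .save("s3://my-bucket/lake/github_events")
```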
  34. MVCC Concurrency Control over Only OCC: Frequent commits => more frequent clustering/compaction => more contention. Differentiate writers vs table services - much like what databases do - table services don't contend with writers - async compaction/clustering. Don't be so "optimistic" - OCC between writers works, until it doesn't - retries, split txns, wasted resources - MVCC/log-based between writers and table services.
  35. Record Level Merge API over Only Overwrites: More generalized approach - default: overwrite, latest writer wins - support business-specific resolution. Log partial updates - log just the changed columns - drastic reduction in write amplification. Log-based reconciliation - delete, undelete based on business logic - CRDT, Operational Transform-like delayed conflict resolution.
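Business-specific resolution plugs in through the payload class write config. A sketch - hoodie.datasource.write.payload.class is a real Hudi option (the stock behavior is OverwriteWithLatestAvroPayload), but com.example.LatestNonNullPayload is a hypothetical class implementing Hudi's HoodieRecordPayload contract:

```scala
// Route record merges through a custom, business-specific payload
df.write.format("hudi")
  .option("hoodie.table.name", "users")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "updated_at")
  .option("hoodie.datasource.write.payload.class", "com.example.LatestNonNullPayload") // hypothetical
  .mode(org.apache.spark.sql.SaveMode.Append)
  .save("s3://my-bucket/lake/users")
```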
  36. Specialized Database over Generalized Format: Approach it more like a shared-nothing database - daemons aware of each other - e.g., compaction and cleaning in RocksDB. E.g., clustering & compaction know about each other - reconcile metadata based on time order - compactions avoid redundant scheduling. Self-managing - sorting, time-order preservation, file-sizing.
  37. Record Level CDC over File/Snapshot Diffing: Per-record metadata - _hoodie_commit_time: Kafka-style compacted change streams in commit order - _hoodie_commit_seqno: consume large commits in chunks, à la Kafka offsets. File group design => CDC-friendly - efficient retrieval of old/new values - efficient retrieval of all values for a key. Infinite retention/lookback coming later in 2021.
  38. Onwards: Ideas, Ongoing Work, Future Plans
  39. Scalable, Multi-Modal Indexes: Partitions are very coarse file-level indexes. Finer-grained indexes as new partitions in the metadata table - bloom filters, bitmaps - column ranges (RFC-27) - HFile/hash indexes - search? External indexes - DynamoDB, Spanner + other cloud stores - C*, Mongo, and others.
  40. Caching: LRU cache à la DB buffer pool. Frequent commits => small objects/blocks - today: aggressive table services - tomorrow: file group/Hudi file model aware caching - mutable data => filesystem/block-level caches are not that effective. Benefits - great performance for CDC tables - avoid open/close costs for small objects.
  41. Timeline Metaserver: Interesting fact - Hudi has a metaserver already - runs on the Spark driver; serves FileSystem RPCs + queries on the timeline - backed by RocksDB, updated incrementally on every timeline action - very useful in streaming jobs - but still standalone. Data lakes need a new metaserver - flat-file metastores are cool? (really?) - sometimes I miss HMS (sometimes..) - let's learn from cloud warehouses.
  42. Beyond Just Lake Engines
  43. Pulsar Sink <Outline strawman design, Hudi facing work, Call for collab>
  44. Pulsar Tiered Storage <Research sharing current challenges, call for collaboration>
  45. Engage With Our Community: User Docs: https://hudi.apache.org | Technical Wiki: https://cwiki.apache.org/confluence/display/HUDI | GitHub: https://github.com/apache/hudi/ | Twitter: https://twitter.com/apachehudi | Mailing lists: dev-subscribe@hudi.apache.org (send an empty email to subscribe), dev@hudi.apache.org (actual mailing list) | Slack: https://join.slack.com/t/apache-hudi/signup
  46. Thanks! Questions?
  47. Hudi @ Uber: Hudi powers one of the largest transactional data lakes on the planet @ Uber. Operated a 150PB+ data lake platform for 4+ years. Multi-engine environment with Presto, Spark, Hive, Vertica & more. Architected several data services for deletion/GDPR across 15K+ data users. Mission-critical to all of Uber, with data monitoring/schemas/quality enforcement. ~8000 tables; 150+ PB; 3-30 min freshness; ~1.5 PB/day; ~850 million vcore-secs; ~4 engines.
