1
How Robinhood Built a Streaming Lake House
to Bring Data Freshness from 24h to <15min
Balaji Varadarajan
Senior Staff Engineer, Robinhood Markets
Apache Hudi PMC
Running Production CDC Ingestion
Pipelines at Scale in Robinhood
Pritam Dey
Tech Lead, Robinhood Markets
© 2022 Robinhood Markets, Inc.
• Legacy Data Lake and Ingestion Framework
• Deep Dive - Change Data Capture (CDC)
• Design
• Lessons Learned
• Deep Dive - Data Lakehouse Ingestion
• Apache Hudi
• Lessons learned
• End to End Setup
Agenda
2
Data Lake
3
All trademarks, logos and brands are the property of their respective owners.
Daily Snapshots
All trademarks, logos and brands are the property of their respective owners.
• Daily snapshotting of tables in RDBMS
(RDS)
• High Read & Write amplifications
• Dedicated Replicas to isolate snapshot
queries
• Bottlenecked by Replica I/O
• 24+ hours data latency
Need Faster Intraday Ingestion
Pipeline
Unlock Data Lake for business critical
applications
Change Data Capture
● Each CRUD operation streamed from the DB to a subscriber
● Merge changes into the lakehouse
● Efficient & Fast -> Capture and Apply only deltas
All trademarks, logos and brands are the property of their respective owners.
Deep Dive
CDC using Debezium
7
Debezium
● Open source & distributed Kafka Connect service for change data capture
● Supports CDC from diverse RDBMSs (Postgres, MySQL, MongoDB, etc.)
● Pluggable sinks through Kafka
High Level Architecture
Primary
RDS
Replica
RDS
DeltaStreamer
DeltaStreamer
Bootstrap
DATA LAKE
(s3://xxx/…
Update schema
and partition
Write incremental data
and checkpoint offsets
All trademarks, logos and brands are the property of their respective owners.
Table Topic
(Partitioned)
Initial Bootstrap
Debezium - Zooming In
Primary DB (RDS)
WriteAheadLogs (WALs)
1. All updates to the Postgres RDS database are logged into binary files called WriteAheadLogs (WALs).
2. Debezium consumes WALs using Postgres Logical Replication.
3. Debezium updates and validates Avro schemas using the Kafka Schema Registry.
4. Debezium performs transformations and writes Avro-serialized updates into table-level Kafka topics.
Table_1 Topic
Table_2 Topic
Table_n Topic
AVRO Schema Registry
All trademarks, logos and brands are the property of their respective owners.
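As a rough illustration of the flow above, a Debezium Postgres connector is registered through the Kafka Connect REST API. This is a minimal sketch, not Robinhood's actual configuration: hostnames, credentials, slot and table names, and the registry URL are all placeholders.

```python
# Hypothetical sketch: register a Debezium Postgres connector with Kafka Connect.
# All hostnames, credentials, slot/table names and URLs below are placeholders.
import json
import requests

connector = {
    "name": "orders-db-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",                     # built-in logical decoding plugin (Postgres 10+)
        "database.hostname": "orders-db.example.internal",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "********",
        "database.dbname": "orders",
        "database.server.name": "orders",              # topic prefix (Debezium 1.x; newer versions use topic.prefix)
        "slot.name": "debezium_orders_slot",           # logical replication slot on the primary
        "table.include.list": "public.orders,public.order_events",
        # Serialize change events as Avro and register schemas with the Schema Registry.
        "key.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "key.converter.schema.registry.url": "http://schema-registry.example.internal:8081",
        "value.converter.schema.registry.url": "http://schema-registry.example.internal:8081",
    },
}

resp = requests.post(
    "http://kafka-connect.example.internal:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```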
Why did we choose Debezium over alternatives? (Debezium vs. AWS Database Migration Service (DMS))
● Operational overhead: High (Debezium) vs. Low (DMS)
● Cost: Debezium is free, with an engineering time cost; DMS is relatively expensive, with negligible engineering time cost
● Speed: High (Debezium) vs. Not enough (DMS)
● Customizations: Yes (Debezium) vs. No (DMS)
● Community support: Debezium has a very active and helpful Gitter community; DMS is limited to AWS support
● Ease of development: Easy for both
All trademarks, logos and brands are the property of their respective owners.
Initial table bootstrap
All trademarks, logos and brands are the property of their respective owners.
1. Debezium initial snapshot
Advantages:
● Replays all rows of a table
● Seamless and reliable way to bootstrap a CDC table
● Works really well for medium-sized tables
Challenges:
● Overhead of persisting messages in the desired format (Avro)
● Puts too much pressure on the Postgres primary and the Kafka infrastructure
● Other downstream Kafka consumers that use the CDC data for other use cases are impacted while the snapshot runs
Initial table bootstrap
All trademarks, logos and brands are the property of their respective owners.
2. Bootstrap using JDBC table query
Advantages:
● Faster bootstrap for larger tables
● Can be done asynchronously from the CDC pipeline
● Can be distributed by configuring JDBC query predicates based on primary-key ranges (see the sketch below)
● Non-lakehouse consumers are not impacted
Challenges:
● Resource cost of maintaining a high-IOPS read replica
● Query predicates need to be tuned for tables with skewed data
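A minimal sketch of the distributed JDBC bootstrap described above, assuming a numeric primary key named id; the connection details, table name, key bounds, and partition count are illustrative assumptions.

```python
# Hypothetical sketch: distributed JDBC bootstrap of one table from the read replica.
# Connection details, table name, key bounds and partition count are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-bootstrap-sketch").getOrCreate()

bootstrap_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://orders-replica.example.internal:5432/orders")
    .option("dbtable", "public.orders")
    .option("user", "bootstrap_reader")
    .option("password", "********")
    .option("driver", "org.postgresql.Driver")
    # Split the scan into parallel primary-key range queries so the read is
    # distributed across executors instead of a single full-table scan.
    .option("partitionColumn", "id")
    .option("lowerBound", "1")
    .option("upperBound", "500000000")
    .option("numPartitions", "200")
    .load()
)
# bootstrap_df is then bulk-inserted into the Hudi table (see the bootstrap-path sketch later).
```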
Initial table bootstrap
All trademarks, logos and brands are the property of their respective owners.
3. Bootstrap using RDS export
Advantages:
● No overhead of maintaining a read replica
● Faster bootstrap for larger tables than a JDBC query
● Can be done asynchronously from the CDC pipeline
● Can easily be distributed without any primary-key predicates, since the exported data is already in Parquet format
● Non-lakehouse consumers are not impacted
Challenges:
● AWS export SLAs are unreliable
● The service has issues handling large, skewed tables
1. Postgres Primary Dependency
ONLY the Postgres primary publishes WriteAheadLogs (WALs).
Disk Space:
- If a consumer dies, Postgres will keep accumulating WALs to ensure data consistency
- Can eat up all the disk space
- Need proper monitoring and alerting (see the sketch below)
CPU:
- Each logical replication consumer uses a small amount of CPU
- Postgres 10+ uses pgoutput (built-in): lightweight; Postgres 9 uses wal2json (third-party): heavier
- Need to upgrade to Postgres 10+
Debezium: Lessons Learned
Postgres Primary:
- Publishes WALs
- Record LogSequenceNumber
(LSN) for each consumer
Consumer-1
Consumer-n
Consumer-2
LSN-2
LSN-1
LSN-n
Consumer-ID LSN
Consumer-1 A3CF/BC
Consumer-n A3CF/41
All trademarks, logos and brands are the property of their respective owners.
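One way to catch the WAL buildup described above is to watch the WAL retained by each logical replication slot on the primary. A minimal monitoring sketch for Postgres 10+, with an assumed connection and alert threshold:

```python
# Hypothetical sketch: alert when a logical replication slot retains too much WAL.
# Connection details and the threshold are placeholders; requires Postgres 10+.
import psycopg2

RETAINED_WAL_ALERT_BYTES = 50 * 1024 ** 3  # e.g. alert once 50 GiB of WAL is retained

conn = psycopg2.connect(host="orders-db.example.internal", dbname="orders",
                        user="monitor", password="********")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT slot_name,
               active,
               pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes
        FROM pg_replication_slots
        WHERE slot_type = 'logical'
    """)
    for slot_name, active, retained_bytes in cur.fetchall():
        if retained_bytes is not None and retained_bytes > RETAINED_WAL_ALERT_BYTES:
            # Hook this into whatever alerting system is in place.
            print(f"ALERT: slot {slot_name} (active={active}) retains {retained_bytes} bytes of WAL")
```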
Debezium: Lessons Learned
2. Initial Table Snapshot (Bootstrapping)
Need for bootstrapping:
- Each table to replicate requires an initial snapshot, on top of which ongoing logical updates are applied
Problem:
- As discussed before, each bootstrap strategy has its own set of challenges
Solution using Hudi DeltaStreamer:
- Custom bootstrapping framework using partitioned and distributed Spark reads
- Uses a hybrid of the strategies discussed in the bootstrapping section above
Primary
RDS
Replica
RDS
Topic per
table
DeltaStreamer
Bootstrap
DeltaStreamer
All trademarks, logos and brands are the property of their respective owners.
Debezium: Lessons Learned
3. AVRO vs JSON vs JSON + Schema
Throughput (benchmarked using a db.r5.24xlarge Postgres RDS instance):
- AVRO: up to 40K mps
- JSON: up to 11K mps (JSON records are larger than AVRO)
- JSON + Schema: up to 3K mps (the schema adds considerable size to JSON records)
Data types:
- AVRO: supports a considerably high number of primitive and complex data types out of the box; great for type safety
- JSON: values must be one of 6 data types: String, Number, JSON object, Array, Boolean, Null
- JSON + Schema: same as JSON
Schema Registry:
- AVRO: required by clients to deserialize the data
- JSON: optional
- JSON + Schema: optional
All trademarks, logos and brands are the property of their respective owners.
4. Multiple logical replication streams for resource isolation
- Multiple large/busy tables can overwhelm a single Debezium connector and also inflate SLAs for smaller/less busy tables
- Split the tables across multiple logical replication slots, which are then connected to their respective Debezium connectors (see the sketch below)
Total throughput = throughput_per_connector * num_connectors
- Each connector does have a small CPU cost
Debezium: Lessons Learned
Table-1
Table-4
Table-2
Table-5
Table-3
Table-n
Consumer-1
Consumer-n
Consumer-2
Table-1
Table-2
Table-3
Table-4
Table-5
Table-n
All trademarks, logos and brands are the property of their respective owners.
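To sketch the isolation idea, the busy tables and the long tail get their own connectors, each bound to a separate replication slot. The slot names and table groupings below are hypothetical.

```python
# Hypothetical sketch: two Debezium connector configs on the same database, each with
# its own logical replication slot, so busy tables cannot starve the long tail.
common = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "orders-db.example.internal",
    "database.dbname": "orders",
}

hot_tables_connector = {
    **common,
    "slot.name": "debezium_slot_hot",
    "table.include.list": "public.orders,public.order_events",   # large/busy tables
}

long_tail_connector = {
    **common,
    "slot.name": "debezium_slot_tail",
    "table.include.list": "public.accounts,public.instruments",  # smaller/quieter tables
}
# Each dict is registered as its own Kafka Connect connector, as in the earlier sketch;
# each extra connector costs a little CPU on the primary.
```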
Debezium: Lessons Learned
5. Schema evolution and the value of freezing schemas
Failed assumption: schema changes are infrequent and always backwards compatible.
- Examples:
1. Adding non-nullable columns (most common, 99/100)
2. Deleting columns
3. Changing data types
4. Can happen anytime during the day #always_on_call
How to handle non-backwards-compatible changes?
- Re-bootstrap the table
Alternatives? Freeze the schema
- Debezium allows specifying the list of captured columns per table (see the sketch below).
- Pros:
- #not_always_on_call
- Batch the changes for a maintenance window
- Cons:
- Schema is temporarily out of sync
Table-X
Backwards
Incompatible Schema
Change
Consumer-2
- Specific columns
Consumer-1
- All columns
All trademarks, logos and brands are the property of their respective owners.
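Freezing the schema can be expressed with Debezium's per-table column list, so a surprise column addition does not flow downstream until the list is updated during a planned window. The table and column names below are hypothetical.

```python
# Hypothetical sketch: freeze the captured schema by listing the columns explicitly.
# Table and column names are placeholders.
frozen_schema_connector = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "slot.name": "debezium_slot_hot",
    "table.include.list": "public.orders",
    # Only these columns are captured; a non-nullable column added to public.orders
    # mid-day will not break downstream consumers until the list is revised.
    "column.include.list": "public.orders.id,public.orders.account_id,"
                           "public.orders.state,public.orders.updated_at",
}
```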
Deep Dive
Lakehouse
20
Lakehouse - Requirements
● Transaction support
● Scalable Storage and compute
● Openness
● Direct access to files
● End-to-end streaming
● Diverse use cases
All trademarks, logos and brands are the property of their respective owners.
Hudi Table APIs
Apache Hudi - Introduction
● Transactional Lakehouse pioneered by Hudi
● Serverless, transactional layer over lakes.
● Multi-engine, Decoupled storage from
engine/compute
● Upserts, Change capture on lakes
● Introduced Copy-On-Write and Merge-on-Read
● Ideas now heavily borrowed outside
Cloud Storage
Files Metadata Txn log
Writers Queries
https://eng.uber.com/hoodie/ Mar 2017
All trademarks, logos and brands are the property of their respective owners.
Apache Hudi - Upserts & Incrementals
Hudi
Table
upsert(records)
at time t
Changes
to table
Changes
from table
incremental_query(t-1, t)
table_snapshot() at time t
Latest committed records
Streaming
Ingest Compaction Z-ordering/
Clustering
Indexing
All trademarks, logos and brands are the property of their respective owners.
Write Ahead
Log of RDBMS
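The upsert and incremental-pull primitives above map onto the Hudi Spark datasource roughly as follows. This is a sketch only; the table path, record key, and precombine field are assumptions rather than the production layout.

```python
# Hypothetical sketch: upsert a batch of changed rows into a Hudi table, then read back
# only the rows committed after a given instant. Paths and field names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()  # Hudi bundle jar assumed on the classpath

# Stand-in for a batch of changed rows pulled from the CDC topics.
changes_df = spark.createDataFrame(
    [(1, "filled", "2022-10-04T10:00:00Z")], ["id", "state", "updated_at"])

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

(changes_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://data-lake/raw/orders"))

# Incremental query: scan only records committed after `last_commit_time`.
last_commit_time = "20221004100000"  # a prior Hudi commit instant
incremental_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", last_commit_time)
    .load("s3://data-lake/raw/orders")
)
```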
24
Apache Hudi - Storage Layout
The Community
2000+
Slack Members
225+
Contributors
1000+
GH Engagers
20+
Committers
Pre-installed on 5 cloud providers
Diverse PMC/Committers
1M DLs/month
(400% YoY)
800B+
Records/Day
(from even just 1 customer!)
Rich community of participants
All trademarks, logos and brands are the property of their respective owners.
Apache Hudi - Community
2K+
All trademarks, logos and brands are the property of their respective owners.
27
Apache Hudi - Relevant Features
DFS/Cloud Storage
Raw Tables
Data Lake
Derived Tables
Hudi upsert()
15 mins
Hudi Incremental query()
scan 10-100x less data
Updated/
Created rows
from databases
● Database abstraction for cloud storage/hdfs
● Near real-time ingestion
● Incremental, Efficient ETL downstream
● ACID Guarantees
All trademarks, logos and brands are the property of their respective owners.
Deep Dive
Ingestion
28
Recap - High Level Architecture
Primary
RDS
Replica
RDS
Table Topic
DeltaStreamer
DeltaStreamer
Bootstrap
DATA LAKE
(s3://xxx/…
Update schema
and partition
Write incremental data
and checkpoint offsets
All trademarks, logos and brands are the property of their respective owners.
Data Lake Ingestion - CDC Path
Schema Registry
Hudi Table
Hudi Metadata
1. Get Kafka checkpoint
2. Spark Kafka batch read and union from the most recently committed Kafka offsets
3. Deserialize using the Avro schema from the schema registry
4. Apply Copy-On-Write updates and update the Kafka checkpoint
Shard 1 Table 1
Shard 2 Table 1
Shard N Table 1
Shard 1 Table 2
Shard 2 Table 2
Shard N Table 2
All trademarks, logos and brands are the property of their respective owners.
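A heavily simplified sketch of the four steps above. The production pipeline uses Hudi DeltaStreamer; the brokers, topic names, and offsets below are placeholders, and Avro deserialization against the schema registry is only noted in comments.

```python
# Hypothetical, simplified sketch of the CDC ingestion step; the real pipeline uses
# Hudi DeltaStreamer. Brokers, topics and offsets are placeholders.
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-ingest-sketch").getOrCreate()

# 1. Checkpoint recovered from the previous run (e.g. stored with the last Hudi commit).
last_offsets = {"orders.public.orders": {"0": 1200345, "1": 1188021}}
latest_offsets = {"orders.public.orders": {"0": -1, "1": -1}}  # -1 means "latest"

# 2. Batch-read the delta between the two checkpoints from Kafka.
raw_df = (
    spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "kafka.example.internal:9092")
    .option("subscribe", "orders.public.orders")
    .option("startingOffsets", json.dumps(last_offsets))
    .option("endingOffsets", json.dumps(latest_offsets))
    .load()
)

# 3. raw_df.value holds Avro-encoded Debezium events; deserialize them against the
#    schema registry (omitted here) into a flat DataFrame of row images.
# 4. Upsert the deserialized rows into the Copy-On-Write Hudi table and record the new
#    Kafka offsets as the next checkpoint, as in the earlier Hudi sketch.
```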
Data Lake Ingestion - Bootstrap Path
Hudi Table
AWS RDS
Replica
Shard 1 Table
Topic
Shard 2 Table
Topic
Shard n Table
Topic
Hudi Table
1. Get latest topic offsets
2. Store offsets in Hudi metadata
3. Wait for the replica to catch up to the latest checkpoint
4. Spark JDBC read
5. Bulk-insert into the Hudi table
All trademarks, logos and brands are the property of their respective owners.
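The bootstrap path ends with a bulk insert rather than an upsert. A sketch under the same assumptions as the earlier JDBC read, with a placeholder table path:

```python
# Hypothetical sketch of the bootstrap write: after capturing the topic offsets and
# waiting for the replica to pass them, bulk-insert the JDBC snapshot into Hudi.
# `bootstrap_df` is the partitioned JDBC read from the earlier sketch; paths and
# field names are placeholders.
bootstrap_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    # bulk_insert skips index lookups and merging, which is much faster than upsert
    # for the one-time initial load.
    "hoodie.datasource.write.operation": "bulk_insert",
}

(bootstrap_df.write.format("hudi")
    .options(**bootstrap_options)
    .mode("overwrite")
    .save("s3://data-lake/raw/orders"))

# The captured Kafka offsets are stored alongside this commit so the CDC path can
# resume from exactly where the bootstrap snapshot left off.
```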
Operating at Scale
32
33
Production - Challenges
● Lean Team
● 4000+ tables
● Complex Pipelines: Multi-Hop Disparate Systems, Stateful
● Not all Tables are equal
● Managing Cost of Pipelines
● Monitoring & Handling Failures
● Assuring Data Freshness & Quality
34
● Onboarding new tables, databases …
○ Configuration Settings
○ Resource Provisioning
○ Monitoring Setup
● Re-bootstrapping
● Dynamic resource Management to handle traffic patterns
● Self Healing to handle intermittent failures
Automation
35
Tiering
● Distinguish critical tables by Tiers
○ Tier 0 - Critical
○ Tier N : Best Effort
● Resource Isolation
○ Postgres Replication Slots
○ Debezium Connectors
○ Compute & Storage
● SLA Guarantees - Freshness,
36
Freshness
● Need freshness to trigger downstream
● CDC Tables are Live Tables
● Freshness Tracking:
○ Daily Snapshots : Freshness time at source
○ CDC : Not Straightforward
■ Handle Multi-Hop
■ Handle “Slow” moving Tables
37
Freshness Tracking
Debezium_offsets topic: key = connector; value = commit_lsn, timestamp
commits_topic: commit stats and caught-up signal from all ingestion jobs
debezium_state
Freshness
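A sketch of the freshness computation, including the "slow-moving table" case: if the ingestion job has applied everything Debezium has published, the table is considered fresh even when no new rows arrive. The record shapes for the two internal topics are assumptions.

```python
# Hypothetical sketch: compute per-table freshness from the latest Debezium offsets
# record and the latest caught-up signal reported by the ingestion jobs.
# The record shapes below are placeholders for the internal state topics.
from datetime import datetime, timezone

def lsn_to_int(lsn: str) -> int:
    """Convert a Postgres LSN such as 'A3CF/BC' into a comparable integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def freshness_seconds(debezium_state: dict, ingestion_commit: dict) -> float:
    """debezium_state: latest value from the Debezium offsets topic,
       e.g. {"commit_lsn": "A3CF/BC", "timestamp": "2022-10-04T10:00:00+00:00"}.
       ingestion_commit: latest caught-up signal from the ingestion job,
       e.g. {"caught_up_lsn": "A3CF/BC", "commit_time": "2022-10-04T10:07:30+00:00"}."""
    source_ts = datetime.fromisoformat(debezium_state["timestamp"])
    if lsn_to_int(ingestion_commit["caught_up_lsn"]) >= lsn_to_int(debezium_state["commit_lsn"]):
        # The lake has applied everything the source produced ("slow" tables included):
        # freshness is the end-to-end delay of the latest applied change.
        lake_ts = datetime.fromisoformat(ingestion_commit["commit_time"])
        return (lake_ts - source_ts).total_seconds()
    # Still catching up: report the lag of the oldest unapplied change relative to now.
    return (datetime.now(timezone.utc) - source_ts).total_seconds()
```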
Improved Freshness
39
Balaji Varadarajan Pritam Dey
Senior Staff Engineer, Robinhood Markets Tech Lead, Robinhood Markets
Apache Hudi PMC
Thank you!
We’ll hangout at back of the room for any QA.
