Building large-scale Transactional Data Lake using Apache Hudi
About me
Satish Kotha
- Apache Hudi committer
- Engineer @ Uber
- Previously worked on building MetricsDB and BlobStore at Twitter
Apache Hudi: Overview

HUDI @ UBER
- 500B+ records/day
- 150+ PB transactional data lake
- 7,000+ tables
HUDI use cases
- Data Consistency: datacenter agnostic, xDC replication, strong consistency; time travel queries
- Efficient updates: support efficient updates and deletes over DFS
- Data Freshness: < 15 min of freshness on lake & warehouse
- Incremental Processing: order-of-magnitude efficiency by processing only changes
- Adaptive Data Layout: stitch files, optimize layout, prune columns, encrypt rows/columns on demand through a standardized interface
- Hudi for Data Applications: feature store for ML
- Data Accuracy: semantic validations for columns (NotNull, Range, etc.)
Motivation
Batch ingestion is too slow: entire tables/partitions are rewritten several times a day, and late-arriving data is a nightmare.
[Diagram: updated/created rows from databases and streaming data land in raw tables on DFS/cloud storage (the data lake) via big, big batch jobs. Example: a 120 TB HBase table is ingested every 8 hrs, while the actual change is < 500 GB.]
Write amplification from derived tables
[Diagram: each update to a source table cascades into updates on derived ETL tables A, B, ..., rewriting unaffected data along with the new and updated data.]
Other challenges
- How to avoid duplicate records in a dataset?
- How to roll back a bad batch of ingestion?
- What if bad data gets through? How to restore the dataset?
- Queries can see dirty data
- Solving the small-file problem while keeping data fresh
Obtain changelogs & upsert()

// Command to extract incrementals using sqoop
bin/sqoop import \
  -Dmapreduce.job.user.classpath.first=true \
  --connect jdbc:mysql://localhost/users \
  --username root \
  --password ******* \
  --table users \
  --as-avrodatafile \
  --target-dir s3:///tmp/sqoop/import-1/users
// Spark Datasource
import org.apache.hudi.DataSourceWriteOptions._

// Use Spark datasource to read avro
Dataset<Row> inputDataset =
    spark.read.avro("s3://tmp/sqoop/import-1/users/*");

// save it as a Hudi dataset
inputDataset.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
  .option(RECORDKEY_FIELD_OPT_KEY(), "userID")
  .option(PARTITIONPATH_FIELD_OPT_KEY(), "country")
  .option(PRECOMBINE_FIELD_OPT_KEY(), "last_mod")
  .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
  .mode(SaveMode.Append)
  .save("/path/on/dfs");
Step 1: Extract new changes to the users table in MySQL as avro data files on DFS, or use a data integration tool of choice to feed db changelogs to Kafka/an event queue.

Step 2: Use the datasource to read the extracted data and directly "upsert" the users table on DFS/Hive, or use the Hudi DeltaStreamer tool.
Update using Hudi Copy-On-Write Tables
[Diagram: Batch 1 (Key1-Key4) arrives at Ts1 and is written as parquet files versioned at commit C1; one file holds Key1/Key3, the other Key2/Key4. Batch 2 (Key1, Key3) arrives at Ts2 and is upserted: the file containing Key1/Key3 is rewritten as a new version at C2, while the other file stays at its C1 version. The commit timeline records Commit 1 and Commit 2 moving from inflight to DONE, and a read-optimized query sees only data from completed commits.]
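
For reference, a snapshot read of a Copy-on-Write table through the Spark datasource is a one-liner. A minimal sketch in Scala, assuming spark is an active SparkSession and the base path/partition glob are placeholders (older Hudi releases required the partition wildcard when loading through the datasource):

// Snapshot read: sees only data from completed commits (the DONE
// entries on the timeline above); inflight commits are never visible.
val usersDF = spark.read
  .format("org.apache.hudi")
  .load("/path/on/dfs/*/*") // base path + partition glob (placeholders)

usersDF.createOrReplaceTempView("users")
spark.sql("select count(*) from users").show()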
More efficient updates using Merge on Read Table
[Diagram: in a Hudi-managed table, Batch 1 (Key1-Key4) is written as parquet base files versioned at C1. Batch 2's updates (K1, K2, K3 at Ts2) are appended as unmerged updates to delta logs at C2 instead of rewriting the parquet files. Read-optimized queries see only the C1 base files; real-time queries merge the base files with the unmerged updates. The commit timeline again tracks Commit 1 and Commit 2 from inflight to done.]
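
Merge-on-Read tables expose both query styles through the Spark datasource. A hedged sketch using the 0.5/0.6-era DataSourceReadOptions constants that also appear later in this deck (newer releases renamed them to QUERY_TYPE_*):

import org.apache.hudi.DataSourceReadOptions._

// Read-optimized query: scans only the compacted columnar base files,
// so unmerged log entries (K1/K2/K3 at C2 above) are not yet visible.
val roDF = spark.read.format("org.apache.hudi")
  .option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_READ_OPTIMIZED_OPT_VAL)
  .load("/path/on/dfs/*/*")

// Real-time query: merges base files with the delta logs at read time,
// trading query cost for data freshness.
val rtDF = spark.read.format("org.apache.hudi")
  .option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_REALTIME_OPT_VAL)
  .load("/path/on/dfs/*/*")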
Copy-on-Write vs Merge-on-Read

Trade-off           | Copy-On-Write                   | Merge-On-Read
File Format         | Exclusively columnar format     | Columnar format snapshots + row format write-ahead log
Update cost (I/O)   | Higher (rewrite entire parquet) | Lower (append to delta log)
Parquet File Size   | Smaller (high update I/O cost)  | Larger (low update cost)
Write Amplification | Higher                          | Lower (depending on compaction strategy)
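
The trade-off is chosen per table at write time. A minimal sketch, reusing inputDataset from the earlier snippet; TABLE_TYPE_OPT_KEY and the COPY_ON_WRITE/MERGE_ON_READ values come from DataSourceWriteOptions:

import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

inputDataset.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
  // MERGE_ON_READ buys cheap updates at the cost of later compaction;
  // COPY_ON_WRITE (the default) keeps files exclusively columnar.
  .option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL)
  .option(RECORDKEY_FIELD_OPT_KEY, "userID")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "country")
  .option(PRECOMBINE_FIELD_OPT_KEY, "last_mod")
  .mode(SaveMode.Append)
  .save("/path/on/dfs")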
Indexing

How is the index used?
[Diagram: an upsert batch at t2 (Key1-Key4) passes through the indexing step, which tags each record with its location: Key1 -> (partition, f1), Key2 -> (partition, f2), Key3 -> (partition, f1), Key4 -> (partition, f2). With this index metadata, Key1/Key3 are routed to file group f1 and Key2/Key4 to file group f2, producing new data/log slices f1-t2 and f2-t2 on top of f1-t1 and f2-t1 from commit C1.]
Indexing Scope
Global index
- Enforces uniqueness of keys across all partitions of a table
- Maintains a mapping from record_key to (partition, fileId)
- Update/delete cost grows with the size of the table: O(size of table)
Local index
- Enforces this constraint only within a specific partition
- Writer must provide the same consistent partition path for a given record key
- Maintains a mapping (partition, record_key) -> (fileId)
- Update/delete cost is O(number of records updated/deleted)
Types of Index
- Bloom Index (default): ideal workload is late-arriving updates
- Simple Index: ideal workload is random updates/deletes to a dimension table
- HBase Index: ideal workload is anything needing a global index
- Custom Index: users can provide a custom index implementation
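
The index implementation is a write-side choice. A hedged sketch using the raw config key (hoodie.index.type accepts values such as BLOOM, GLOBAL_BLOOM, SIMPLE, GLOBAL_SIMPLE and HBASE, depending on the release):

inputDataset.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
  .option(RECORDKEY_FIELD_OPT_KEY, "userID")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "country")
  // BLOOM is the default local index; GLOBAL_BLOOM enforces key
  // uniqueness across all partitions at O(size of table) update cost.
  .option("hoodie.index.type", "GLOBAL_BLOOM")
  .mode(SaveMode.Append)
  .save("/path/on/dfs")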
Indexing Limitations
- Indexing only works on the primary key today; WIP to make this available as a secondary index on other columns
- Index information is only used in the write path; WIP to use it in the read path to improve query performance
- Index is not centralized; WIP to move the index info from parquet metadata into Hudi metadata
Incremental Reading
- Brings streaming APIs to the data lake
- Order of magnitude faster: leverages Hudi metadata to update only the partitions that have changes
  - Previously: sync all of the latest N-day partitions, causing huge IO amplification even when there are very few changes
- No need to create a staging table
- Integration with Hive/Spark
[Diagram: incremental pulls from a Hudi source table feed a "transform new entries" step; the results are joined against a staging table and upserted into the ETL table via Hive + the Spark DataSource.]
Streaming Style/Incremental pipelines!
// Spark Datasource
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.DataSourceReadOptions._

// Use Spark datasource to read the incremental view: only records
// written after the given commit instant
Dataset<Row> hoodieIncViewDF = spark.read().format("org.apache.hudi")
  .option(VIEW_TYPE_OPT_KEY(), VIEW_TYPE_INCREMENTAL_OPT_VAL())
  .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(), commitInstantFor8AM)
  .load("s3://tables/transactions");

Dataset<Row> stdDF = standardize_payments(hoodieIncViewDF);

// save the standardized output as a Hudi dataset
stdDF.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.std_payments")
  .option(RECORDKEY_FIELD_OPT_KEY(), "id")
  .option(PARTITIONPATH_FIELD_OPT_KEY(), "datestr")
  .option(PRECOMBINE_FIELD_OPT_KEY(), "time")
  .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
  .mode(SaveMode.Append)
  .save("/path/on/dfs");
Hudi Write APIs
- Insert
- Upsert
- Bulk Insert
- Insert Overwrite
- Delete
- Rollback / Restore
- Bootstrap
- Hive Registration
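
Deletes, for example, reuse the same datasource write path with a different operation. A minimal sketch, assuming a hypothetical deletesDF that holds the record keys (and partition paths) of the rows to remove; DELETE_OPERATION_OPT_VAL maps to the "delete" operation:

// deletesDF (hypothetical) contains only the keys/partitions to delete.
deletesDF.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
  .option(RECORDKEY_FIELD_OPT_KEY, "userID")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "country")
  .option(OPERATION_OPT_KEY, DELETE_OPERATION_OPT_VAL) // instead of upsert
  .mode(SaveMode.Append)
  .save("/path/on/dfs")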
Hudi Read APIs

Snapshot Read
● This is the typical read pattern
● Read data at the latest time (standard)
● Read data at some point in time (time travel)

Incremental Read
● Read records modified only after a certain time or operation
● Can be used in incremental processing pipelines
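
A hedged sketch of the two snapshot patterns; the "as.of.instant" time-travel option is an assumption about newer Hudi releases (roughly 0.9+), while older versions reach point-in-time data through the incremental API shown earlier:

// Read data at the latest time (standard snapshot read).
val latestDF = spark.read.format("org.apache.hudi")
  .load("/path/on/dfs/*/*")

// Time travel: read the table as of a past commit instant (placeholder
// value; availability of this option depends on the Hudi release).
val asOfDF = spark.read.format("org.apache.hudi")
  .option("as.of.instant", "20210101123000")
  .load("/path/on/dfs/*/*")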
Data Lake Evolution
[Architecture diagram: batch and streaming sources feed the Hudi storage format, which consists of base files (columnar), delta files (columnar/row), primary & secondary indexes, a transaction log, a metastore, and optimized data layout. Hudi table services (cleaning, clustering, replication, archiving, compaction) maintain the tables, which query engines consume as the data lake/warehouse.]
Hudi Table Services
- Clustering: makes reads more efficient by changing the physical layout of records across files
- Compaction: converts files on disk into read-optimized files (applicable for Merge on Read)
- Clean: removes Hudi data files that are no longer needed
- Archiving: archives Hudi metadata files that are no longer being actively used
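
These services are largely driven by write-side configs. A hedged sketch of commonly used knobs, reusing inputDataset from the earlier snippet (key names are from the Hudi configuration docs; the values are illustrative and defaults vary by release):

inputDataset.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
  // Clean: retain the file versions needed by the last 10 commits.
  .option("hoodie.cleaner.commits.retained", "10")
  // Archiving: bound the active timeline to 20-30 commits.
  .option("hoodie.keep.min.commits", "20")
  .option("hoodie.keep.max.commits", "30")
  // Compaction (Merge on Read): compact inline every 5 delta commits.
  .option("hoodie.compact.inline", "true")
  .option("hoodie.compact.inline.max.delta.commits", "5")
  .mode(SaveMode.Append)
  .save("/path/on/dfs")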
Clustering: use case
Ingestion and query engines are optimized for different things:
- Data locality: ingestion stores data based on arrival time, while queries work better when data that is often queried together is co-located
- File size: ingestion prefers small files to increase parallelism, while query performance typically degrades when there are a lot of small files
Clustering Overview
● Clustering is a framework to change data layout
○ Pluggable strategy to "re-organize" data
○ Sorting/stitching strategies provided in the open source version
● Flexible policies
○ Configuration to select partitions for rewrite
○ Different partitions can be laid out differently
○ Clustering granularity: global vs local vs custom
● Provides snapshot isolation and time travel for improving operations
○ Clustering is compatible with Hudi Rollback/Restore
○ Updates Hudi metadata and index
● Leverages Multi Version Concurrency Control
○ Clustering can be executed in parallel with ingestion
○ Clustering and other Hudi table services such as compaction can run concurrently
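
Clustering can be scheduled and executed inline with regular writes. A hedged sketch using config keys from the 0.7-era clustering work (names and defaults may differ across releases; sorting on a,b mirrors the example query on the next slides):

inputDataset.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
  // Trigger inline clustering every 4 commits.
  .option("hoodie.clustering.inline", "true")
  .option("hoodie.clustering.inline.max.commits", "4")
  // Stitch files smaller than 600 MB and sort by the query predicates.
  .option("hoodie.clustering.plan.strategy.small.file.limit", "629145600")
  .option("hoodie.clustering.plan.strategy.sort.columns", "a,b")
  .mode(SaveMode.Append)
  .save("/path/on/dfs")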
Query Plan before clustering
● Test setup: popular production table with 1 partition, no clustering
● Query translates to something like: select c, d from table where a == x, b == y
Query Plan after clustering
● Test setup: table with 1 partition, clustering performed by sorting on a, b
● Query: select c, d from table where a == x, b == y
Performance summary
● 10x reduction in input data processed
● 4x reduction in CPU cost
● More than 50% reduction in query latency

Table State   | Input data size | Input rows | CPU cost
Non-clustered | 2,290 MB        | 29 M       | 27.56 sec
Clustered     | 182 MB          | 3 M        | 6.94 sec
On-Going Work
➔ Concurrent Writers [RFC-22] & [PR-2374]
◆ Multiple Writers to Hudi tables with file level concurrency control
➔ Hudi Observability [RFC-23]
◆ Collect metrics such as Physical vs Logical, Users, Stage Skews
◆ Use to feedback jobs for auto-tuning
➔ Point index [RFC-08]
◆ Target usage for primary key indexes, e.g., B+ Tree
➔ ORC support [RFC]
◆ Support for ORC file format
➔ Range Index [RFC-15]
◆ Target usage for column ranges and pruning files/row groups (secondary/column indexes)
➔ Enhance Hudi on Flink [RFC-24]
◆ Full feature support for Hudi on Flink version 1.11+
◆ First class support for Flink
➔ Spark-SQL extensions [RFC-25]
◆ DML/DDL operations such as create, insert, merge, etc.
◆ Spark DatasourceV2 (Spark 3+)
Big Picture
Fills a clear void in data ingestion, storage and processing!
Leads the convergence towards streaming style processing!
Brings transactional semantics to managing data
Positioned to solve impending demand for scale & speed
Evolving as the data lake format!
Resources
User Docs : https://hudi.apache.org
Technical Wiki : https://cwiki.apache.org/confluence/display/HUDI
Github : https://github.com/apache/incubator-hudi/
Twitter : https://twitter.com/apachehudi
Mailing list(s) : dev-subscribe@hudi.apache.org (send an empty email to subscribe)
dev@hudi.apache.org (actual mailing list)
Slack : https://join.slack.com/t/apache-hudi/signup
Thanks!
Questions?