Building large-scale Transactional Data Lake using
Apache Hudi
About me
Satish Kotha
- Apache Hudi committer
- Engineer @ Uber
- Previously worked on building MetricsDB and BlobStore at Twitter
Apache Hudi : Overview
HUDI @ UBER
- 500B+ records/day
- 150+ PB Transactional Data Lake
- 7000+ Tables
HUDI use cases
- Data Consistency: Datacenter agnostic, xDC replication, strong consistency
- Time travel queries: Support efficient updates and deletes
- Efficient updates: Support efficient updates and deletes over DFS
- Data Freshness: < 15 min of freshness on Lake & warehouse
- Incremental Processing: Order of magnitude efficiency to process only changes
- Adaptive Data Layout: Stitch files, Optimize layout, Prune columns, Encrypt rows/columns on demand through a standardized interface
- Hudi for Data Application: Feature store for ML
- Data Accuracy: Semantic validations for columns: NotNull, Range etc
Motivation
Batch ingestion is too slow..
Rewrite entire tables/partitions
several times a day!
Late arriving data is a
nightmare
[Figure: Updated/created rows from databases and streaming data are loaded by big batch jobs into raw tables in the data lake on DFS/cloud storage; the tables hold new data, updated data, and unaffected data. Example: a 120TB HBase table ingested every 8 hrs; actual change < 500GB.]
[Figure: Write amplification from derived tables. An update to the source table leads to an update of ETL table A, then ETL table B, and so on.]
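The HBase example on this slide can be turned into a rough overhead figure. A minimal Python sketch of the arithmetic, assuming decimal units (1 TB = 1000 GB), a full rewrite on every run, and the stated "< 500GB" taken as an upper bound on the change; the function name is illustrative only.

```python
def rewrite_overhead(table_gb: float, changed_gb: float) -> float:
    """Ratio of data rewritten by a full re-ingestion to data that actually changed."""
    return table_gb / changed_gb


# 120 TB table, re-ingested every 8 hours, with under 500 GB of real change.
TABLE_GB = 120 * 1000
CHANGED_GB = 500
RUNS_PER_DAY = 24 // 8

overhead = rewrite_overhead(TABLE_GB, CHANGED_GB)
rewritten_per_day_tb = TABLE_GB * RUNS_PER_DAY / 1000

print(f"each run rewrites at least {overhead:.0f}x the changed data")
print(f"about {rewritten_per_day_tb:.0f} TB rewritten per day")
```

Because the real change is below 500 GB, the ratio is a lower bound.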
Other challenges
How to avoid duplicate records in dataset?
How to rollback a bad batch of ingestion?
What if bad data gets through? How to restore dataset?
Queries can see dirty data
Solving the small file problem, while keeping data fresh
Obtain changelogs & upsert()
# Command to extract incrementals using sqoop
bin/sqoop import \
  -Dmapreduce.job.user.classpath.first=true \
  --connect jdbc:mysql://localhost/users \
  --username root \
  --password ******* \
  --table users \
  --as-avrodatafile \
  --target-dir s3:///tmp/sqoop/import-1/users
// Spark Datasource (Java)
import static org.apache.hudi.DataSourceWriteOptions.*;
import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
// Use Spark datasource to read avro (needs the spark-avro package on the classpath)
Dataset<Row> inputDataset =
    spark.read().format("avro").load("s3://tmp/sqoop/import-1/users/*");
// save it as a Hudi dataset
inputDataset.write().format("org.apache.hudi")
    .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
    .option(RECORDKEY_FIELD_OPT_KEY(), "userID")
    .option(PARTITIONPATH_FIELD_OPT_KEY(), "country")
    .option(PRECOMBINE_FIELD_OPT_KEY(), "last_mod")
    .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
    .mode(SaveMode.Append)
    .save("/path/on/dfs");
Step 1: Extract new changes to users table
in MySQL, as avro data files on DFS
(or)
Use data integration tool of choice to feed
db changelogs to Kafka/event queue
Step 2: Use datasource to read extracted
data and directly “upsert” the users table
on DFS/Hive
(or)
Use the Hudi DeltaStreamer tool
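What "upsert" does with the record key (userID) and precombine field (last_mod) can be sketched in plain Python. This is a simplified model and not Hudi code: it assumes the batch is first de-duplicated per key by keeping the largest last_mod, and that the surviving record then overwrites any stored one. The actual merge behaviour depends on the configured payload class; the helper names and sample rows are hypothetical.

```python
from typing import Dict, Iterable

Record = Dict[str, object]


def precombine(batch: Iterable[Record], key: str, ordering: str) -> Dict[object, Record]:
    """Keep one record per key: the one with the largest ordering value."""
    latest: Dict[object, Record] = {}
    for rec in batch:
        current = latest.get(rec[key])
        if current is None or rec[ordering] >= current[ordering]:
            latest[rec[key]] = rec
    return latest


def upsert(table: Dict[object, Record], batch: Iterable[Record],
           key: str = "userID", ordering: str = "last_mod") -> Dict[object, Record]:
    """Insert new keys and overwrite existing ones with the de-duplicated batch."""
    merged = dict(table)
    merged.update(precombine(batch, key, ordering))
    return merged


users = {1: {"userID": 1, "country": "US", "last_mod": 10}}
changes = [
    {"userID": 1, "country": "CA", "last_mod": 12},
    {"userID": 1, "country": "MX", "last_mod": 11},  # older duplicate, dropped
    {"userID": 2, "country": "IN", "last_mod": 12},  # new key, inserted
]
users = upsert(users, changes)
print(users[1]["country"], len(users))
```

Keying the table on the record key is what keeps duplicates out of the dataset, one of the challenges listed earlier.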
Update using Hudi Copy-On-Write Tables
[Figure: Batch 1 (Ts1) carries Key1 to Key4 and is written at commit C1 into two Parquet files, one holding Key1 and Key3 and the other Key2 and Key4. Batch 2 (Ts2) carries Key1 and Key3; its upsert at commit C2 writes a new version of the file holding those keys (Key1 C2, Key3 C2), while the other file stays at version C1. On the commit timeline, each commit moves from inflight to DONE. Read Optimized queries read the Parquet files.]
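The copy-on-write behaviour in this slide can be modelled in a few lines of Python. This is a toy sketch, not Hudi's implementation: a file group is assumed to be a list of immutable versions tagged with the commit that wrote them, and all names are illustrative.

```python
from typing import Dict, List, Tuple

# A file group is a list of file versions; each version is (commit, {key: value}).
FileGroup = List[Tuple[str, Dict[str, str]]]


def cow_upsert(group: FileGroup, commit: str, updates: Dict[str, str]) -> int:
    """Write a new full version of the file with the updates applied.

    Returns the number of records written, to show the rewrite cost.
    """
    _, latest = group[-1]
    new_version = {**latest, **updates}
    group.append((commit, new_version))
    return len(new_version)


def read_as_of(group: FileGroup, commit: str) -> Dict[str, str]:
    """Return the newest version written at or before the given commit."""
    visible = [data for c, data in group if c <= commit]
    return visible[-1]


file_a: FileGroup = [("C1", {"Key1": "v1", "Key3": "v1"})]
written = cow_upsert(file_a, "C2", {"Key1": "v2"})

print(written)                   # records rewritten for a one-record update
print(read_as_of(file_a, "C1"))  # the older version is still readable
print(read_as_of(file_a, "C2"))
```

Every update rewrites the whole file, which is the "higher update cost" side of the trade-off; in exchange, readers only ever see complete columnar files.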
More efficient updates using Merge on Read Table
[Figure: A Hudi managed table. Batch 1 (Ts1) carries Key1 to Key4 and is written at commit C1 into Parquet files (Key1 and Key3 in one, Key2 and Key4 in the other). Batch 2 (Ts2) carries Key1, Key2, and Key3; its upsert at commit C2 is stored as unmerged updates (K1 C2, K2 C2, K3 C2) next to the Parquet files, which stay at version C1. On the commit timeline, each commit moves from inflight to done. Real-time queries read the Parquet files together with the unmerged updates; Read Optimized queries read only the Parquet files.]
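A toy Python model of the merge-on-read flow, under stated assumptions: one base file plus an append-only log, merging done at query time, and a compaction step that folds the log into a new base file. The class, method names, and the compaction commit "C3" are hypothetical and do not mirror Hudi's internal classes.

```python
from typing import Dict, List, Tuple


class FileSlice:
    """A columnar base file plus an append-only log of unmerged updates."""

    def __init__(self, commit: str, base: Dict[str, str]) -> None:
        self.base_commit = commit
        self.base = dict(base)
        self.log: List[Tuple[str, str, str]] = []  # (commit, key, value)

    def upsert(self, commit: str, updates: Dict[str, str]) -> int:
        """Append updates to the log; returns records written (updates only)."""
        for key, value in updates.items():
            self.log.append((commit, key, value))
        return len(updates)

    def read_optimized(self) -> Dict[str, str]:
        """Read only the base file: cheap, but may miss unmerged updates."""
        return dict(self.base)

    def real_time(self) -> Dict[str, str]:
        """Merge the base file with the log at query time."""
        merged = dict(self.base)
        for _, key, value in self.log:
            merged[key] = value
        return merged

    def compact(self, commit: str) -> None:
        """Fold the log into a new base file and clear the log."""
        self.base = self.real_time()
        self.base_commit = commit
        self.log.clear()


fs = FileSlice("C1", {"Key1": "v1", "Key2": "v1", "Key3": "v1", "Key4": "v1"})
written = fs.upsert("C2", {"Key1": "v2", "Key2": "v2", "Key3": "v2"})
stale = fs.read_optimized()["Key1"]
fresh = fs.real_time()["Key1"]
fs.compact("C3")
print(written, stale, fresh, fs.read_optimized()["Key1"])
```

Only the three changed records are written at C2; the cost of merging is deferred to real-time reads and to compaction.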
Copy-on-Write vs Merge-on-Read
Trade-off | Copy-On-Write | Merge-On-Read
File Format | Exclusively columnar format | Columnar format snapshots + row format write ahead log
Update cost (I/O) | Higher (rewrite entire parquet) | Lower (append to delta log)
Parquet File Size | Smaller (high update (I/O) cost) | Larger (low update cost)
Write Amplification | Higher | Lower (depending on compaction strategy)
Indexing
[Figure: How is index used? The incoming batch at t2 (Key1 to Key4) passes through indexing, which tags each key with its location: Key1 -> (partition, f1), Key2 -> (partition, f2), Key3 -> (partition, f1), Key4 -> (partition, f2). The upsert then sends Key1 and Key3 to f1-t2 and Key2 and Key4 to f2-t2 (data/log), the next versions of files f1-t1 and f2-t1, which hold these keys at C1.]
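The tagging step in this slide can be sketched as a dictionary lookup followed by a group-by. A minimal Python illustration, assuming the index is an in-memory map from record key to (partition, file id); Key5 is a hypothetical new key added to show the insert path.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Location = Tuple[str, str]  # (partition, file_id)


def tag_locations(keys: List[str], index: Dict[str, Location]) -> Dict[str, Optional[Location]]:
    """Attach the known file location to each incoming key (None means insert)."""
    return {key: index.get(key) for key in keys}


def route(tagged: Dict[str, Optional[Location]]) -> Tuple[Dict[Location, List[str]], List[str]]:
    """Group updates by target file; keys without a location are new inserts."""
    updates: Dict[Location, List[str]] = defaultdict(list)
    inserts: List[str] = []
    for key, loc in tagged.items():
        if loc is None:
            inserts.append(key)
        else:
            updates[loc].append(key)
    return dict(updates), inserts


index = {
    "Key1": ("partition", "f1"), "Key2": ("partition", "f2"),
    "Key3": ("partition", "f1"), "Key4": ("partition", "f2"),
}
updates, inserts = route(tag_locations(["Key1", "Key2", "Key3", "Key4", "Key5"], index))
print(updates)
print(inserts)
```

Grouping by file is what lets each writer task touch only the files that actually receive updates.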
Indexing Scope
Global index
- Enforce uniqueness of keys across all partitions of a table
- Maintain mapping for record_key to (partition, fileId)
- Update/delete cost grows with size of the table: O(size of table)
Local index
- Enforce this constraint only within a specific partition. The writer must provide the same partition path for a given record key
- Maintain mapping (partition, record_key) -> (fileId)
- Update/delete cost: O(number of records updated/deleted)
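The two mappings differ only in what they use as the lookup key. A small Python sketch of that difference, assuming in-memory dictionaries; the class names and the date-style partition paths are illustrative, not Hudi classes.

```python
from typing import Dict, Optional, Tuple


class GlobalIndex:
    """record_key -> (partition, file_id); a key is unique across the table."""

    def __init__(self) -> None:
        self._map: Dict[str, Tuple[str, str]] = {}

    def put(self, partition: str, key: str, file_id: str) -> None:
        self._map[key] = (partition, file_id)

    def lookup(self, partition: str, key: str) -> Optional[Tuple[str, str]]:
        # The incoming partition is ignored: the key alone identifies the record.
        return self._map.get(key)


class LocalIndex:
    """(partition, record_key) -> file_id; uniqueness holds per partition only."""

    def __init__(self) -> None:
        self._map: Dict[Tuple[str, str], str] = {}

    def put(self, partition: str, key: str, file_id: str) -> None:
        self._map[(partition, key)] = file_id

    def lookup(self, partition: str, key: str) -> Optional[Tuple[str, str]]:
        file_id = self._map.get((partition, key))
        return None if file_id is None else (partition, file_id)


global_idx, local_idx = GlobalIndex(), LocalIndex()
for idx in (global_idx, local_idx):
    idx.put("2021/01/01", "Key1", "f1")

# The same key arrives with a different partition path.
print(global_idx.lookup("2021/01/02", "Key1"))  # resolved to its original partition
print(local_idx.lookup("2021/01/02", "Key1"))   # no match: treated as a new record
```

This is why a local index needs the writer to send a given key to the same partition every time.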
Types of Index
Bloom Index (default)
- Ideal workload: Late arriving updates
Simple Index
- Ideal workload: Random updates/deletes to a dimension table
HBase Index
- Ideal workload: Global index
Custom Index
- Users can provide custom index implementation
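To make the Bloom index idea concrete, here is a minimal Python sketch. It assumes each data file carries a min/max record-key range and a Bloom filter of its keys, and that a lookup first prunes by range and then by filter. The filter is a toy built on hashlib, not Hudi's implementation, and the file contents are made up.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


class BloomFilter:
    """Fixed-size Bloom filter: may return false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key: str) -> Iterable[int]:
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))


FileMeta = Tuple[str, str, BloomFilter]  # (min_key, max_key, bloom)


def build(keys: List[str]) -> FileMeta:
    """Collect the key range and Bloom filter for one data file."""
    bloom = BloomFilter()
    for k in keys:
        bloom.add(k)
    return min(keys), max(keys), bloom


def candidate_files(key: str, files: Dict[str, FileMeta]) -> List[str]:
    """Return files whose key range and Bloom filter both admit the key."""
    return [
        name for name, (lo, hi, bloom) in files.items()
        if lo <= key <= hi and bloom.might_contain(key)
    ]


files = {"f1": build(["Key1", "Key3"]), "f2": build(["Key5", "Key7"])}
print(candidate_files("Key3", files))
```

Only the candidate files then need to be opened to confirm whether the key is really present.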
Indexing Limitations
Indexing only works on primary key today
- WIP to make this available as secondary index on other columns.
Index information is only used in write path
- WIP to use this in read path to improve query performance.
Index not centralized
- Move the index info from parquet metadata into hudi metadata
Incremental Reading
- Bring Streaming APIs on Data Lake
- Order of magnitude faster
- Can leverage Hudi metadata to update all partitions that have changes
- Previously, all of the latest N-day partitions were synced, causing huge IO amplification even when there were very few changes
- No need to create staged table
- Integration with Hive/Spark
[Figure: Incremental pipeline. Labels: Source Table, Incremental Pulls, Transform new entries, Staging table, Join, Upserts, ETL Table, Hive + Spark DataSource.]
Streaming Style/Incremental pipelines!
// Spark Datasource (Java)
import static org.apache.hudi.DataSourceWriteOptions.*;
import org.apache.hudi.DataSourceReadOptions;
import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
// Incrementally read only the records committed after commitInstantFor8AM
Dataset<Row> hoodieIncViewDF = spark.read().format("org.apache.hudi")
    .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY(),
            DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL())
    .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(), commitInstantFor8AM)
    .load("s3://tables/transactions");
// standardize_payments is the pipeline's own transformation
Dataset<Row> stdDF = standardize_payments(hoodieIncViewDF);
// save the standardized records as a Hudi dataset
stdDF.write().format("org.apache.hudi")
    .option(HoodieWriteConfig.TABLE_NAME, "hoodie.std_payments")
    .option(RECORDKEY_FIELD_OPT_KEY(), "id")
    .option(PARTITIONPATH_FIELD_OPT_KEY(), "datestr")
    .option(PRECOMBINE_FIELD_OPT_KEY(), "time")
    .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
    .mode(SaveMode.Append)
    .save("/path/on/dfs");
Hudi Write APIs
Rollback / Restore
Bulk Insert
Hive Registration
Insert
Upsert
Insert Overwrite
Delete
Bootstrap
Hudi Read APIs
Snapshot Read
● This is the typical read pattern
● Read data at latest time (standard)
● Read data at some point in time (time
travel)
Incremental Read
● Read records modified only after a
certain time or operation.
● Can be used in incremental processing
pipelines.
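The three read patterns can be contrasted on a toy commit timeline. A plain-Python sketch, assuming commits are kept in time order, that instant times compare as strings, and that an incremental read returns changes strictly after the given instant; the function names and sample data are hypothetical.

```python
from typing import Dict, List, Tuple

# Each commit is (instant_time, {record_key: value}), listed in time order.
Timeline = List[Tuple[str, Dict[str, str]]]


def snapshot(timeline: Timeline, as_of: str = "99999999999999") -> Dict[str, str]:
    """Latest value of every key, considering commits at or before as_of."""
    state: Dict[str, str] = {}
    for instant, changes in timeline:
        if instant <= as_of:
            state.update(changes)
    return state


def incremental(timeline: Timeline, begin_instant: str) -> Dict[str, str]:
    """Only the records changed by commits strictly after begin_instant."""
    changed: Dict[str, str] = {}
    for instant, changes in timeline:
        if instant > begin_instant:
            changed.update(changes)
    return changed


timeline: Timeline = [
    ("20210101080000", {"id1": "a", "id2": "b"}),
    ("20210101090000", {"id1": "a2"}),
    ("20210101100000", {"id3": "c"}),
]
print(snapshot(timeline))                       # standard read at the latest commit
print(snapshot(timeline, "20210101080000"))     # time travel to the first commit
print(incremental(timeline, "20210101080000"))  # only changes after 8 AM
```

The incremental result is the input a downstream pipeline would transform and upsert, instead of rescanning the full snapshot.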
Data Lake Evolution
[Figure: Batch and streaming sources feed a data lake/warehouse built on the Hudi storage format (columnar base files, columnar/row delta files, primary & secondary indexes, transaction log) with an optimized data layout. Hudi table services (cleaning, clustering, replication, archiving, compaction) operate on the tables. The diagram also shows a metastore and query engines.]
Hudi Table Services
Clustering
Clustering can make reads more efficient
by changing the physical layout of
records across files
Compaction
Convert files on disk into read optimized files
(applicable for Merge on Read).
Clean
Remove Hudi data files that are no longer
needed
Archiving
Archive Hudi metadata files that are no longer
being actively used
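As an illustration of the Clean service, the sketch below drops old file versions while retaining the newest few per file group. Retention by version count is one possible policy, assumed here for simplicity; the function and the file naming are hypothetical.

```python
from typing import Dict, List


def clean(file_groups: Dict[str, List[str]], versions_to_keep: int) -> List[str]:
    """Drop all but the newest versions of each file group.

    file_groups maps a file id to its versions, oldest first.
    Returns the names of the removed files.
    """
    removed: List[str] = []
    for file_id, versions in file_groups.items():
        cutoff = max(len(versions) - versions_to_keep, 0)
        removed.extend(f"{file_id}-{v}" for v in versions[:cutoff])
        file_groups[file_id] = versions[cutoff:]
    return removed


groups = {"f1": ["t1", "t2", "t3"], "f2": ["t1"]}
removed = clean(groups, versions_to_keep=2)
print(removed)
print(groups)
```

The retained versions are what keep time travel and incremental reads possible over the recent history.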
Clustering: use case
Ingestion and query engines are optimized for different things
Aspect | Data Ingestion | Query Engines
Data Locality | Data is stored based on arrival time | Works better when data queried often is co-located together
File Size | Prefers small files to increase parallelism | Typically performance degrades when there are a lot of small files
● Clustering is a framework to change data layout
○ Pluggable strategy to “re-organize” data
○ Sorting/Stitching strategies provided in open source version
● Flexible policies
○ Configuration to select partitions for rewrite
○ Different partitions can be laid out differently
○ Clustering granularity: global vs local vs custom
● Provides snapshot isolation and time travel for improving operations
○ Clustering is compatible with Hudi Rollback/Restore
○ Updates Hudi metadata and index
● Leverages Multi Version Concurrency Control
○ Clustering can be executed parallel to ingestion
○ Clustering and other hudi table services such as compaction can run concurrently
Clustering Overview
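A sort-and-stitch strategy can be sketched in a few lines. The Python below assumes files are lists of rows and that a query engine skips a file when the filter value lies outside the file's min/max range for that column. It is a toy model of the idea, not the clustering framework itself, and the data is synthetic.

```python
from typing import Dict, List, Tuple

Row = Dict[str, int]


def cluster(files: List[List[Row]], sort_cols: Tuple[str, ...], rows_per_file: int) -> List[List[Row]]:
    """Stitch all input files together, sort the rows, and rewrite into larger files."""
    rows = sorted((r for f in files for r in f), key=lambda r: tuple(r[c] for c in sort_cols))
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


def files_to_scan(files: List[List[Row]], col: str, value: int) -> int:
    """Count files whose min/max range for col could contain value."""
    return sum(1 for f in files if min(r[col] for r in f) <= value <= max(r[col] for r in f))


# Eight small files in arrival order: each one spans a wide range of "a".
ingested = [[{"a": i, "b": i}, {"a": 15 - i, "b": i}] for i in range(8)]
clustered = cluster(ingested, sort_cols=("a", "b"), rows_per_file=4)

print(len(ingested), files_to_scan(ingested, "a", 7))
print(len(clustered), files_to_scan(clustered, "a", 7))
```

After sorting, a filter on column a touches one file instead of all of them, which is the effect behind the reduced input size in the query plans that follow.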
Query Plan before clustering
● Test setup: Popular production Table with 1 partition. No clustering
● Query translates to something like: select c, d from table where a == x, b == y
Query Plan after clustering
● Test setup: Table with 1 partition. Clustering performed by sorting on a, b
● Query: select c, d from table where a == x, b == y
Performance summary
Table State | Input data size | Input rows | CPU cost
Non-clustered | 2,290 MB | 29 M | 27.56 sec
Clustered | 182 MB | 3 M | 6.94 sec
● 10x reduction in input data processed
● 4x reduction in CPU cost
● More than 50% reduction in query latency
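The headline ratios can be recomputed from the table. A short Python check using only the numbers reported on this slide; the dictionary keys are illustrative labels.

```python
non_clustered = {"input_mb": 2290, "input_rows_m": 29, "cpu_sec": 27.56}
clustered = {"input_mb": 182, "input_rows_m": 3, "cpu_sec": 6.94}


def reduction(before: float, after: float) -> float:
    """How many times smaller the 'after' value is than the 'before' value."""
    return before / after


ratios = {m: round(reduction(non_clustered[m], clustered[m]), 1) for m in non_clustered}
for metric, ratio in ratios.items():
    print(f"{metric}: {ratio}x")
```

The input size shrinks by somewhat more than the rounded "10x", and CPU cost by almost exactly 4x. Query latency is not in the table, so it cannot be recomputed here.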
On-Going Work
➔ Concurrent Writers [RFC-22] & [PR-2374]
◆ Multiple Writers to Hudi tables with file level concurrency control
➔ Hudi Observability [RFC-23]
◆ Collect metrics such as Physical vs Logical, Users, Stage Skews
◆ Use to feedback jobs for auto-tuning
➔ Point index [RFC-08]
◆ Target usage for primary key indexes, eg. B+ Tree
➔ ORC support [RFC]
◆ Support for ORC file format
➔ Range Index [RFC-15]
◆ Target usage for column ranges and pruning files/row groups (secondary/column indexes)
➔ Enhance Hudi on Flink [RFC-24]
◆ Full feature support for Hudi on Flink version 1.11+
◆ First class support for Flink
➔ Spark-SQL extensions [RFC-25]
◆ DML/DDL operations such as create, insert, merge etc
◆ Spark DatasourceV2 (Spark 3+)
Big Picture
Fills a clear void in data ingestion, storage and processing!
Leads the convergence towards streaming style processing!
Brings transactional semantics to managing data
Positioned to solve impending demand for scale & speed
Evolve as data lake format!
Resources
User Docs : https://hudi.apache.org
Technical Wiki : https://cwiki.apache.org/confluence/display/HUDI
Github : https://github.com/apache/incubator-hudi/
Twitter : https://twitter.com/apachehudi
Mailing list(s) : dev-subscribe@hudi.apache.org (send an empty email to subscribe)
dev@hudi.apache.org (actual mailing list)
Slack : https://join.slack.com/t/apache-hudi/signup
Thanks!
Questions?

AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...Bill Liu
 
Toronto meetup 20190917
Toronto meetup 20190917Toronto meetup 20190917
Toronto meetup 20190917Bill Liu
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLPBill Liu
 

More from Bill Liu (20)

Walk Through a Real World ML Production Project
Walk Through a Real World ML Production ProjectWalk Through a Real World ML Production Project
Walk Through a Real World ML Production Project
 
Redefining MLOps with Model Deployment, Management and Observability in Produ...
Redefining MLOps with Model Deployment, Management and Observability in Produ...Redefining MLOps with Model Deployment, Management and Observability in Produ...
Redefining MLOps with Model Deployment, Management and Observability in Produ...
 
Productizing Machine Learning at the Edge
Productizing Machine Learning at the EdgeProductizing Machine Learning at the Edge
Productizing Machine Learning at the Edge
 
Transformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to HeroTransformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to Hero
 
Deep AutoViML For Tensorflow Models and MLOps Workflows
Deep AutoViML For Tensorflow Models and MLOps WorkflowsDeep AutoViML For Tensorflow Models and MLOps Workflows
Deep AutoViML For Tensorflow Models and MLOps Workflows
 
Metaflow: The ML Infrastructure at Netflix
Metaflow: The ML Infrastructure at NetflixMetaflow: The ML Infrastructure at Netflix
Metaflow: The ML Infrastructure at Netflix
 
Practical Crowdsourcing for ML at Scale
Practical Crowdsourcing for ML at ScalePractical Crowdsourcing for ML at Scale
Practical Crowdsourcing for ML at Scale
 
Deep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its ApplicationsDeep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its Applications
 
Big Data and AI in Fighting Against COVID-19
Big Data and AI in Fighting Against COVID-19Big Data and AI in Fighting Against COVID-19
Big Data and AI in Fighting Against COVID-19
 
Highly-scalable Reinforcement Learning RLlib for Real-world Applications
Highly-scalable Reinforcement Learning RLlib for Real-world ApplicationsHighly-scalable Reinforcement Learning RLlib for Real-world Applications
Highly-scalable Reinforcement Learning RLlib for Real-world Applications
 
Build computer vision models to perform object detection and classification w...
Build computer vision models to perform object detection and classification w...Build computer vision models to perform object detection and classification w...
Build computer vision models to perform object detection and classification w...
 
Causal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine LearningCausal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine Learning
 
Weekly #106: Deep Learning on Mobile
Weekly #106: Deep Learning on MobileWeekly #106: Deep Learning on Mobile
Weekly #106: Deep Learning on Mobile
 
Weekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
Weekly #105: AutoViz and Auto_ViML Visualization and Machine LearningWeekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
Weekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
 
AISF19 - On Blending Machine Learning with Microeconomics
AISF19 - On Blending Machine Learning with MicroeconomicsAISF19 - On Blending Machine Learning with Microeconomics
AISF19 - On Blending Machine Learning with Microeconomics
 
AISF19 - Travel in the AI-First World
AISF19 - Travel in the AI-First WorldAISF19 - Travel in the AI-First World
AISF19 - Travel in the AI-First World
 
AISF19 - Unleash Computer Vision at the Edge
AISF19 - Unleash Computer Vision at the EdgeAISF19 - Unleash Computer Vision at the Edge
AISF19 - Unleash Computer Vision at the Edge
 
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
 
Toronto meetup 20190917
Toronto meetup 20190917Toronto meetup 20190917
Toronto meetup 20190917
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLP
 

Recently uploaded

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 

Recently uploaded (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 

Building large scale transactional data lake using apache hudi

  • 1. Building large-scale Transactional Data Lake using Apache Hudi
  • 2. About me Satish Kotha - Apache Hudi committer - Engineer @ Uber - Previously worked on building MetricsDB and BlobStore at Twitter
  • 3. Apache Hudi : Overview
  • 4. 500B+ records/day 150+ PB Transactional Data Lake 7000+ Tables HUDI @ UBER
  • 5. Data Consistency Datacenter agnostic, xDC replication, strong consistency Time travel queries Support efficient updates and deletes Efficient updates Support efficient updates and deletes over DFS Data Freshness < 15 min of freshness on Lake & warehouse Incremental Processing Order of magnitude efficiency to process only changes Adaptive Data Layout Stitch files, Optimize layout, Prune columns, Encrypt rows/columns on demand through a standardized interface Hudi for Data Application Feature store for ML Data Accuracy Semantic validations for columns: NotNull, Range etc HUDI use cases
  • 6. Motivation Batch ingestion is too slow.. Rewrite entire tables/partitions several times a day! Late arriving data is a nightmare DFS/Cloud Storage Raw Tables Data Lake 120TB HBase table ingested every 8 hrs; Actual change < 500GB Updated/ Created rows from databases Streaming Data Big Big Batch Jobs...
  • 7. New Data Unaffected Data Updated Data update update Source table update ETL table A update …….. ETL table B Write amplification from derived tables
  • 8. Other challenges
    - How to avoid duplicate records in dataset?
    - How to rollback a bad batch of ingestion?
    - What if bad data gets through? How to restore dataset?
    - Queries can see dirty data
    - Solving the small file problem, while keeping data fresh
  • 9. Obtain changelogs & upsert()
    Step 1: Extract new changes to users table in MySQL, as avro data files on DFS (or) use data integration tool of choice to feed db changelogs to Kafka/event queue
    // Command to extract incrementals using sqoop
    bin/sqoop import \
      -Dmapreduce.job.user.classpath.first=true \
      --connect jdbc:mysql://localhost/users \
      --username root \
      --password ******* \
      --table users \
      --as-avrodatafile \
      --target-dir s3:///tmp/sqoop/import-1/users
    Step 2: Use datasource to read extracted data and directly "upsert" the users table on DFS/Hive (or) use the Hudi DeltaStreamer tool
    // Spark Datasource
    import org.apache.hudi.DataSourceWriteOptions._
    // Use Spark datasource to read avro
    Dataset<Row> inputDataset = spark.read.avro("s3://tmp/sqoop/import-1/users/*");
    // save it as a Hudi dataset
    inputDataset.write.format("org.apache.hudi")
      .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
      .option(RECORDKEY_FIELD_OPT_KEY(), "userID")
      .option(PARTITIONPATH_FIELD_OPT_KEY(), "country")
      .option(PRECOMBINE_FIELD_OPT_KEY(), "last_mod")
      .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
      .mode(SaveMode.Append)
      .save("/path/on/dfs");
  • 10. Key1 .....……... ... Key2 …..……... ... Key3 …..……... ... Key4 …..……... ... Batch 1 Ts1 Key1 .....……... ... Key3 …..……... ... Batch 2 Ts2 Commit Timeline C1 Commit 1 inflight C2 Commit 2 inflight C1 Commit 1 DONE C2 Commit 2 DONE upsert Key1 C2 .. Key3 C2 .. Version at C2 Version at C1 Version at C1 Parquet Files File 2 Key1 C1 .. Key3 C1 .. Key2 C1 .. Key4 C1 .. File 1 Read Optimized Query Update using Hudi Copy-On-Write Tables
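The upsert behaviour in slides 9 and 10 can be modelled in plain Java without any Hudi dependency. This is only a sketch under stated assumptions: `UpsertSketch` and `Row` are hypothetical names, the table is an in-memory map, and the precombine step is assumed to keep the row with the largest `last_mod` per record key within a batch before the batch overwrites the stored rows.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UpsertSketch {
    // Hypothetical row shape: record key, precombine (ordering) field, payload.
    static final class Row {
        final String userId;
        final long lastMod;
        final String value;

        Row(String userId, long lastMod, String value) {
            this.userId = userId;
            this.lastMod = lastMod;
            this.value = value;
        }
    }

    // Collapse a batch to one row per key, keeping the row with the largest lastMod.
    static Map<String, Row> precombine(List<Row> batch) {
        Map<String, Row> latest = new HashMap<>();
        for (Row r : batch) {
            latest.merge(r.userId, r, (a, b) -> b.lastMod >= a.lastMod ? b : a);
        }
        return latest;
    }

    // Insert unseen keys and overwrite existing keys (assumed overwrite-with-latest policy).
    static void upsert(Map<String, Row> table, List<Row> batch) {
        table.putAll(precombine(batch));
    }

    public static void main(String[] args) {
        Map<String, Row> table = new HashMap<>();
        upsert(table, List.of(new Row("k1", 1, "a"), new Row("k2", 1, "b")));
        upsert(table, List.of(new Row("k1", 3, "c"), new Row("k1", 2, "stale"), new Row("k3", 2, "d")));
        System.out.println(table.size() + " " + table.get("k1").value); // prints "3 c"
    }
}
```

The second batch carries two rows for `k1`; only the one with the larger `lastMod` survives, which is the role the slide assigns to the precombine field.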
  • 11. More efficient updates using Merge on Read Table Hudi Managed Table Version at C1 upsert Key1 .....……... ... Key2 …..……... ... Key3 …..……... ... Key4 …..……... ... Batch 1 Ts1 Version at C1 Parquet Files Key1 .....……... ... Key2 …..……... ... Key3 …..……... ... Batch 2 Ts2 K1 C2 ... ... Unmerged update K2 C2 ... Unmerged update Key1 C1 .. Key3 C1 .. Key2 C1 .. Key4 C1 .. K3 C2 Real-time Queries Read Optimized Queries Commit Timeline C1 Commit 1 C2 Commit 2 C1 Commit 1 inflight C2 Commit 2 inflight done done
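The Merge-on-Read picture in slide 11 can be approximated with a base map plus an append-only delta log. The sketch below is a toy model, not Hudi code: `MergeOnReadSketch` and its method names are hypothetical, and a single in-memory "file" stands in for the parquet base file and its log.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MergeOnReadSketch {
    // Stand-in for the columnar base file.
    final Map<String, String> base = new TreeMap<>();
    // Stand-in for the row-based delta log of unmerged updates: {key, value} pairs.
    final List<String[]> deltaLog = new ArrayList<>();

    // An upsert only appends to the log; the base file is not rewritten.
    void upsert(String key, String value) {
        deltaLog.add(new String[] {key, value});
    }

    // Read-optimized query: base file only, so unmerged updates are not visible.
    Map<String, String> readOptimized() {
        return new TreeMap<>(base);
    }

    // Real-time query: merge the log over the base at read time.
    Map<String, String> realTime() {
        Map<String, String> view = new TreeMap<>(base);
        for (String[] entry : deltaLog) {
            view.put(entry[0], entry[1]);
        }
        return view;
    }

    // Compaction: fold the log into the base and clear the log.
    void compact() {
        base.putAll(realTime());
        deltaLog.clear();
    }

    public static void main(String[] args) {
        MergeOnReadSketch table = new MergeOnReadSketch();
        table.base.put("Key1", "C1");
        table.base.put("Key2", "C1");
        table.upsert("Key1", "C2");
        System.out.println(table.readOptimized()); // {Key1=C1, Key2=C1}
        System.out.println(table.realTime());      // {Key1=C2, Key2=C1}
        table.compact();
        System.out.println(table.readOptimized()); // {Key1=C2, Key2=C1}
    }
}
```

The write path stays cheap because it only appends; the merge cost moves to real-time reads until compaction runs.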
  • 12. Copy-on-Write vs Merge-on-Read
    Trade-off | Copy-On-Write | Merge-On-Read
    File Format | Exclusively columnar format | Columnar format snapshots + row format write ahead log
    Update cost (I/O) | Higher (rewrite entire parquet) | Lower (append to delta log)
    Parquet File Size | Smaller (high update (I/O) cost) | Larger (low update cost)
    Write Amplification | Higher | Lower (depending on compaction strategy)
  • 14. How is index used ? Key1 ... Key2 ... Key3 ... Key4 ... upsert Indexing Key1 partition, f1 ... Key2 partition, f2 ... Key3 partition, f1 ... Key4 partition, f2 ... Batch at t2 with index metadata Key1, Key3 Key2, Key4 f1-t2 (data/log) f2-t2 (data/log) Key1 C1 .. Key3 C1 .. Key2 C1 .. Key4 C1 .. Batch at t2 f1-t1 f2-t1
  • 15. Indexing Scope
    Global index
      - Enforce uniqueness of keys across all partitions of a table
      - Maintain mapping for record_key to (partition, fileId)
      - Update/delete cost grows with size of the table: O(size of table)
    Local index
      - Enforce this constraint only within a specific partition
      - Writer to provide the same consistent partition path for a given record key
      - Maintain mapping (partition, record_key) -> (fileId)
      - Update/delete cost: O(number of records updated/deleted)
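The two index scopes in slide 15 differ mainly in what the lookup key is. A minimal sketch, assuming plain in-memory maps and hypothetical class names (`IndexScopeSketch`, `GlobalIndex`, `LocalIndex`); it does not use Hudi's index classes:

```java
import java.util.HashMap;
import java.util.Map;

public class IndexScopeSketch {
    // Global scope: record_key -> "partition/fileId", one location per key table-wide.
    static final class GlobalIndex {
        final Map<String, String> locations = new HashMap<>();

        void tag(String recordKey, String partition, String fileId) {
            locations.put(recordKey, partition + "/" + fileId);
        }
    }

    // Local scope: (partition, record_key) -> fileId, unique only within a partition.
    static final class LocalIndex {
        final Map<String, String> locations = new HashMap<>();

        void tag(String recordKey, String partition, String fileId) {
            locations.put(partition + "|" + recordKey, fileId);
        }
    }

    public static void main(String[] args) {
        GlobalIndex global = new GlobalIndex();
        LocalIndex local = new LocalIndex();

        // The same record key arrives under two different partition paths.
        global.tag("Key1", "us", "f1");
        global.tag("Key1", "in", "f2");
        local.tag("Key1", "us", "f1");
        local.tag("Key1", "in", "f2");

        System.out.println(global.locations.size() + " " + local.locations.size()); // prints "1 2"
    }
}
```

With the local index the key ends up in two places, which is why the slide requires the writer to supply a consistent partition path for each record key.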
  • 16. Types of Index
    - Bloom Index (default). Ideal workload: Late arriving updates
    - Simple Index. Ideal workload: Random updates/deletes to a dimension table
    - HBase Index. Ideal workload: Global index
    - Custom Index. Users can provide custom index implementation
  • 17. Indexing Limitations
    - Indexing only works on primary key today. WIP to make this available as secondary index on other columns.
    - Index information is only used in write path. WIP to use this in read path to improve query performance.
    - Index not centralized. Move the index info from parquet metadata into hudi metadata.
  • 18. - Bring Streaming APIs on Data Lake - Order of magnitude faster - Can leverage Hudi metadata to update all partitions that have changes - Previously sync all latest N-day partitions - Huge IO amplification even if there are very small number of changes - No need to create staged table - Integration with Hive/Spark Source Table ETL Table Transform new entries Staging table Join Incremental Pulls Upserts Hive + Spark DataSource Incremental Reading
  • 19. Streaming Style/Incremental pipelines!
    // Spark Datasource
    import org.apache.hudi.DataSourceWriteOptions._
    import org.apache.hudi.DataSourceReadOptions._
    // Use Spark datasource to read the changes committed to a Hudi table after a given instant
    Dataset<Row> hoodieIncViewDF = spark.read().format("org.apache.hudi")
      .option(VIEW_TYPE_OPT_KEY(), VIEW_TYPE_INCREMENTAL_OPT_VAL())
      .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(), commitInstantFor8AM)
      .load("s3://tables/transactions");
    Dataset<Row> stdDF = standardize_payments(hoodieIncViewDF);
    // save it as a Hudi dataset
    stdDF.write.format("org.apache.hudi")
      .option(HoodieWriteConfig.TABLE_NAME, "hoodie.std_payments")
      .option(RECORDKEY_FIELD_OPT_KEY(), "id")
      .option(PARTITIONPATH_FIELD_OPT_KEY(), "datestr")
      .option(PRECOMBINE_FIELD_OPT_KEY(), "time")
      .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
      .mode(SaveMode.Append)
      .save("/path/on/dfs");
  • 20. Hudi Write APIs Rollback / Restore Bulk Insert Hive Registration Insert Upsert Insert Overwrite Delete Bootstrap
  • 21. Hudi Read APIs
    Snapshot Read
      ● This is the typical read pattern
      ● Read data at latest time (standard)
      ● Read data at some point in time (time travel)
    Incremental Read
      ● Read records modified only after a certain time or operation.
      ● Can be used in incremental processing pipelines.
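The difference between the two read patterns in slide 21 can be illustrated over a toy commit log. This is a simplified model rather than Hudi's reader: `ReadSketch` and `Version` are hypothetical names, commits are plain integers, and the log is assumed to be ordered by commit time.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReadSketch {
    // One written version of a record: key, commit that wrote it, payload.
    static final class Version {
        final String key;
        final int commit;
        final String value;

        Version(String key, int commit, String value) {
            this.key = key;
            this.commit = commit;
            this.value = value;
        }
    }

    // Snapshot read: latest version of every key as of the given commit (time travel if older).
    static Map<String, String> snapshot(List<Version> log, int asOfCommit) {
        Map<String, String> view = new TreeMap<>();
        for (Version v : log) {
            if (v.commit <= asOfCommit) {
                view.put(v.key, v.value);
            }
        }
        return view;
    }

    // Incremental read: only keys written after beginCommit, with their latest value.
    static Map<String, String> incremental(List<Version> log, int beginCommit) {
        Map<String, String> changed = new TreeMap<>();
        for (Version v : log) {
            if (v.commit > beginCommit) {
                changed.put(v.key, v.value);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        List<Version> log = List.of(
            new Version("Key1", 1, "a1"), new Version("Key2", 1, "b1"),
            new Version("Key3", 1, "c1"), new Version("Key4", 1, "d1"),
            new Version("Key1", 2, "a2"), new Version("Key3", 2, "c2"));
        System.out.println(snapshot(log, 1));    // {Key1=a1, Key2=b1, Key3=c1, Key4=d1}
        System.out.println(snapshot(log, 2));    // {Key1=a2, Key2=b1, Key3=c2, Key4=d1}
        System.out.println(incremental(log, 1)); // {Key1=a2, Key3=c2}
    }
}
```

The incremental result contains two records instead of four, which is the saving a downstream pipeline gets by processing only changes.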
  • 22. Batch Base files (columnar) Delta files (columnar/row) Primary & Secondary Indexes Transaction Log Data Lake/Warehouse Optimized Data Layout S O U R C E Query Engines Cleaning Clustering Replication Archiving Compaction Streaming Metastore Hudi Table Services Hudi Storage Format Data Lake Evolution
  • 23. Hudi Table Services
    - Clustering: Clustering can make reads more efficient by changing the physical layout of records across files
    - Compaction: Convert files on disk into read optimized files (applicable for Merge on Read).
    - Clean: Remove Hudi data files that are no longer needed
    - Archiving: Archive Hudi metadata files that are no longer being actively used
  • 24. Clustering: use case
    Ingestion and query engines are optimized for different things
    Aspect | Data Ingestion | Query Engines
    Data Locality | Data is stored based on arrival time | Works better when data queried often is co-located together
    File Size | Prefers small files to increase parallelism | Typically performance degrades when there are a lot of small files
  • 25. Clustering Overview
    ● Clustering is a framework to change data layout
      ○ Pluggable strategy to "re-organize" data
      ○ Sorting/Stitching strategies provided in open source version
    ● Flexible policies
      ○ Configuration to select partitions for rewrite
      ○ Different partitions can be laid out differently
      ○ Clustering granularity: global vs local vs custom
    ● Provides snapshot isolation and time travel for improving operations
      ○ Clustering is compatible with Hudi Rollback/Restore
      ○ Updates Hudi metadata and index
    ● Leverages Multi Version Concurrency Control
      ○ Clustering can be executed parallel to ingestion
      ○ Clustering and other hudi table services such as compaction can run concurrently
  • 26. Query Plan before clustering ● Test setup: Popular production Table with 1 partition. No clustering ● Query translates to something like: select c, d from table where a == x, b == y
  • 27. Query Plan after clustering ● Test setup: Table with 1 partition. Clustering performed by sorting on a, b ● Query: select c, d from table where a == x, b == y
  • 28. Performance summary
    Table State | Input data size | Input rows | CPU cost
    Non-clustered | 2,290 MB | 29 M | 27.56 sec
    Clustered | 182 MB | 3 M | 6.94 sec
    ● 10x reduction in input data processed
    ● 4x reduction in CPU cost
    ● More than 50% reduction in query latency
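The headline reductions in slide 28 follow from the table's own numbers. The short check below recomputes the ratios; `ClusteringGains` and `ratio` are hypothetical names, and only the figures quoted on the slide are used. Query latency is not in the table, so it is not recomputed here.

```java
import java.util.Locale;

public class ClusteringGains {
    // Reduction factor: how many times smaller the clustered figure is.
    static double ratio(double nonClustered, double clustered) {
        return nonClustered / clustered;
    }

    public static void main(String[] args) {
        // Figures copied from the performance summary table.
        System.out.println(String.format(Locale.ROOT, "input size: %.1fx", ratio(2290, 182)));  // input size: 12.6x
        System.out.println(String.format(Locale.ROOT, "input rows: %.1fx", ratio(29, 3)));      // input rows: 9.7x
        System.out.println(String.format(Locale.ROOT, "cpu cost:   %.1fx", ratio(27.56, 6.94))); // cpu cost:   4.0x
    }
}
```

The roughly 10x figure on the slide sits between the byte-level ratio (about 12.6x) and the row-level ratio (about 9.7x), and the 4x CPU figure matches the table directly.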
  • 29. On-Going Work
    ➔ Concurrent Writers [RFC-22] & [PR-2374]
      ◆ Multiple Writers to Hudi tables with file level concurrency control
    ➔ Hudi Observability [RFC-23]
      ◆ Collect metrics such as Physical vs Logical, Users, Stage Skews
      ◆ Use to feedback jobs for auto-tuning
    ➔ Point index [RFC-08]
      ◆ Target usage for primary key indexes, e.g. B+ Tree
    ➔ ORC support [RFC]
      ◆ Support for ORC file format
    ➔ Range Index [RFC-15]
      ◆ Target usage for column ranges and pruning files/row groups (secondary/column indexes)
    ➔ Enhance Hudi on Flink [RFC-24]
      ◆ Full feature support for Hudi on Flink version 1.11+
      ◆ First class support for Flink
    ➔ Spark-SQL extensions [RFC-25]
      ◆ DML/DDL operations such as create, insert, merge etc
      ◆ Spark DatasourceV2 (Spark 3+)
  • 30. Big Picture
    - Fills a clear void in data ingestion, storage and processing!
    - Leads the convergence towards streaming style processing!
    - Brings transactional semantics to managing data
    - Positioned to solve impending demand for scale & speed
    - Evolve as data lake format!
  • 31. Resources
    User Docs: https://hudi.apache.org
    Technical Wiki: https://cwiki.apache.org/confluence/display/HUDI
    Github: https://github.com/apache/incubator-hudi/
    Twitter: https://twitter.com/apachehudi
    Mailing list(s): dev-subscribe@hudi.apache.org (send an empty email to subscribe), dev@hudi.apache.org (actual mailing list)
    Slack: https://join.slack.com/t/apache-hudi/signup