Data is critical infrastructure for building machine learning systems. From ensuring accurate ETAs to predicting optimal traffic routes, providing safe, seamless transportation and delivery experiences on the Uber platform requires reliable, performant large-scale data storage and analysis. In 2016, Uber developed Apache Hudi, an incremental processing framework, to power business-critical data pipelines at low latency and high efficiency; it also helps distributed organizations build and manage petabyte-scale data lakes.
In this talk, I will describe what Apache Hudi is and its architectural design, then dive deep into how it improves data operations with features such as data versioning and time travel.
We will also go over how Hudi brings the Kappa architecture to big data systems and enables efficient incremental processing for near-real-time use cases.
Speaker: Satish Kotha (Uber)
Apache Hudi committer and engineer at Uber. Previously, he worked on building real-time distributed storage systems such as Twitter MetricsDB and BlobStore.
website: https://www.aicamp.ai/event/eventdetails/W2021043010
5. Hudi use cases
- Data Consistency: datacenter-agnostic, xDC replication, strong consistency, time travel queries
- Efficient Updates: support efficient updates and deletes over DFS
- Data Freshness: < 15 min of freshness on lake & warehouse
- Incremental Processing: order-of-magnitude efficiency by processing only changes
- Adaptive Data Layout: stitch files, optimize layout, prune columns, encrypt rows/columns on demand through a standardized interface
- Hudi for Data Applications: feature store for ML
- Data Accuracy: semantic validations for columns (NotNull, Range, etc.)
6. Motivation
Batch ingestion is too slow: entire tables/partitions are rewritten several times a day! Late-arriving data is a nightmare.
[Diagram: updated/created rows from databases and streaming data flow through big batch jobs into raw tables on DFS/cloud storage (the data lake). Example: a 120 TB HBase table is ingested every 8 hrs, even though the actual change is < 500 GB.]
7. Write amplification from derived tables
[Diagram: an update to the source table cascades into derived ETL tables (A, B, ...); each hop rewrites updated data alongside new and unaffected data.]
8. Other challenges
- How to avoid duplicate records in a dataset?
- How to roll back a bad batch of ingestion?
- What if bad data gets through: how to restore the dataset?
- Queries can see dirty data
- Solving the small-file problem while keeping data fresh
(A sketch of the relevant writer configs follows this list.)
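As a hedged illustration, the snippet below sketches two writer-side knobs Hudi exposes for these problems: dropping duplicate inserts and steering new inserts into existing small files. The config key names are taken from Hudi docs of this era, but exact names and defaults vary by release, and the surrounding SparkSession/DataFrame setup is assumed.

// A minimal sketch, not a definitive recipe. Assumes an active SparkSession
// (`spark`), an input DataFrame (`df`), and Hudi 0.x-era config key names.
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

df.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
  // Duplicates: upserts are keyed on the record key, so re-ingesting a batch
  // updates rows instead of duplicating them; inserts can also drop dupes.
  .option("hoodie.datasource.write.insert.drop.duplicates", "true")
  // Small files: route new inserts into file groups below this size (bytes),
  // so files grow toward a healthy size while data stays fresh.
  .option("hoodie.parquet.small.file.limit", "104857600")
  .mode(SaveMode.Append)
  .save("/path/on/dfs")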
9. Obtain changelogs & upsert()

// Command to extract incrementals using Sqoop
bin/sqoop import \
  -Dmapreduce.job.user.classpath.first=true \
  --connect jdbc:mysql://localhost/users \
  --username root \
  --password ******* \
  --table users \
  --as-avrodatafile \
  --target-dir s3:///tmp/sqoop/import-1/users
// Spark datasource
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

// Use the Spark datasource to read the extracted Avro files
val inputDataset = spark.read.format("avro").load("s3://tmp/sqoop/import-1/users/*")

// Save it as a Hudi dataset
inputDataset.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
  .option(RECORDKEY_FIELD_OPT_KEY, "userID")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "country")
  .option(PRECOMBINE_FIELD_OPT_KEY, "last_mod")
  .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
  .mode(SaveMode.Append)
  .save("/path/on/dfs")
Step 1: Extract new changes to the users table in MySQL, as Avro data files on DFS
(or) use a data integration tool of your choice to feed db changelogs to Kafka/an event queue

Step 2: Use the datasource to read the extracted data and directly "upsert" the users table on DFS/Hive
(or) use the Hudi DeltaStreamer tool
14. How is the index used?
[Diagram: an upsert batch at t2 (Key1..Key4) goes through the indexing step, which tags each record key with its (partition, fileId) location: Key1 and Key3 map to file group f1, Key2 and Key4 to f2. The tagged batch is then written as new file slices f1-t2 and f2-t2 (data/log) on top of the existing f1-t1 and f2-t1.]
(A toy sketch of this tagging step follows.)
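To make the tagging step concrete, here is a toy sketch in plain Scala. It is illustrative only and not Hudi's internal API: the `Record` type and `index` map are hypothetical stand-ins for the index lookup that assigns each incoming key to its existing file group.

// Toy model of index tagging (hypothetical types; not Hudi internals).
case class Record(key: String, payload: String)

// The index maps record_key -> (partition, fileId), as on the slide.
val index: Map[String, (String, String)] = Map(
  "Key1" -> ("p", "f1"), "Key2" -> ("p", "f2"),
  "Key3" -> ("p", "f1"), "Key4" -> ("p", "f2"))

// Incoming upsert batch at t2.
val batch = Seq(Record("Key1", "..."), Record("Key2", "..."),
                Record("Key3", "..."), Record("Key4", "..."))

// Tagging: group records by the file group the index assigns them to, so the
// writer knows which file slices (f1-t2, f2-t2) to produce.
val byFileGroup: Map[String, Seq[Record]] = batch.groupBy(r => index(r.key)._2)
// byFileGroup == Map("f1" -> Seq(Key1, Key3), "f2" -> Seq(Key2, Key4))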
15. Indexing Scope
Global index
● Enforces uniqueness of keys across all partitions of a table
● Maintains a mapping of record_key -> (partition, fileId)
● Update/delete cost grows with the size of the table: O(size of table)
Local index
● Enforces the uniqueness constraint only within a specific partition; the writer must provide the same consistent partition path for a given record key
● Maintains a mapping of (partition, record_key) -> (fileId)
● Update/delete cost is O(number of records updated/deleted)
16. Types of Index
● Bloom Index (default): ideal workload is late-arriving updates
● Simple Index: ideal workload is random updates/deletes to a dimension table
● HBase Index: ideal workload is a global index
● Custom Index: users can provide a custom index implementation
(A sketch of selecting the index type follows.)
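For illustration, the index type (and with it the global vs. local scope from the previous slide) is chosen per writer through the `hoodie.index.type` config. A minimal sketch, assuming Hudi 0.x-era string config keys, an active SparkSession, and an existing DataFrame `df`:

// Minimal sketch, assuming Hudi 0.x config key names and an input `df`.
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

df.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
  // Partition-scoped bloom index (the default): cheap, O(records updated).
  .option("hoodie.index.type", "BLOOM")
  // Alternatives: "SIMPLE", "HBASE", or the global variants that enforce
  // key uniqueness across all partitions ("GLOBAL_BLOOM", "GLOBAL_SIMPLE").
  .mode(SaveMode.Append)
  .save("/path/on/dfs")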
17. Indexing Limitations
● Indexing only works on the primary key today; WIP to make this available as a secondary index on other columns
● Index information is only used in the write path; WIP to use this in the read path to improve query performance
● Index is not centralized; WIP to move the index info from parquet metadata into Hudi metadata
18. Incremental Reading (Hive + Spark DataSource)
- Brings streaming APIs to the data lake
- Order of magnitude faster
- Can leverage Hudi metadata to update only the partitions that have changes
- Previously: sync all of the latest N-day partitions, causing huge IO amplification even when there is a very small number of changes
- No need to create a staging table
- Integration with Hive/Spark
[Diagram: incremental pulls from the source table feed a "transform new entries" step, which joins against a staging table and upserts into the ETL table.]
19. Streaming Style/Incremental pipelines!

// Spark datasource
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

// Read all records modified after the 8 AM commit as an incremental view
val hoodieIncViewDF = spark.read.format("org.apache.hudi")
  .option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL)
  .option(BEGIN_INSTANTTIME_OPT_KEY, commitInstantFor8AM)
  .load("s3://tables/transactions")

val stdDF = standardize_payments(hoodieIncViewDF)

// Save it as a Hudi dataset
stdDF.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.std_payments")
  .option(RECORDKEY_FIELD_OPT_KEY, "id")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "datestr")
  .option(PRECOMBINE_FIELD_OPT_KEY, "time")
  .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
  .mode(SaveMode.Append)
  .save("/path/on/dfs")
21. Hudi Read APIs
Snapshot Read
● The typical read pattern
● Read data at the latest time (standard)
● Read data at some point in time (time travel)
Incremental Read
● Read only records modified after a certain time or operation
● Can be used in incremental processing pipelines
(A sketch of these patterns via the Spark datasource follows.)
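As a hedged sketch, all three patterns map onto the Spark datasource. The incremental option names appear earlier in this deck; the time-travel option (`as.of.instant`) is an assumption here, as it comes from Hudi releases later than this talk.

// Sketch, assuming an active SparkSession and Hudi 0.x option names.
import org.apache.hudi.DataSourceReadOptions._

// Snapshot read: latest state of the table.
val latestDF = spark.read.format("org.apache.hudi")
  .load("s3://tables/transactions")

// Time travel: snapshot as of a past instant (option name from later Hudi
// releases; treat it as an assumption).
val asOfDF = spark.read.format("org.apache.hudi")
  .option("as.of.instant", "20210430093000")
  .load("s3://tables/transactions")

// Incremental read: only records modified after a given commit instant.
val incDF = spark.read.format("org.apache.hudi")
  .option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL)
  .option(BEGIN_INSTANTTIME_OPT_KEY, "20210430090000")
  .load("s3://tables/transactions")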
22. Data Lake Evolution
[Architecture diagram: batch and streaming sources feed the Hudi storage format, which consists of base files (columnar), delta files (columnar/row), primary & secondary indexes, a transaction log, and an optimized data layout. Hudi table services (cleaning, clustering, replication, archiving, compaction) maintain the tables, and a metastore exposes them to query engines over the data lake/warehouse.]
23. Hudi Table Services
● Clustering: makes reads more efficient by changing the physical layout of records across files
● Compaction: converts files on disk into read-optimized files (applicable for Merge on Read)
● Clean: removes Hudi data files that are no longer needed
● Archiving: archives Hudi metadata files that are no longer being actively used
(A sketch of the writer configs that drive these services follows.)
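For illustration, these services are commonly driven by writer-side configs on a Merge-on-Read table. A minimal sketch, assuming config key names from Hudi docs of this era (exact names and defaults vary by release), a SparkSession, and a DataFrame `df`:

// Minimal sketch, assuming Hudi 0.x config keys, a SparkSession, and `df`.
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

df.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  // Compaction: fold delta logs into read-optimized base files inline,
  // every N delta commits.
  .option("hoodie.compact.inline", "true")
  .option("hoodie.compact.inline.max.delta.commits", "5")
  // Clean: drop file versions older than the retained commit window.
  .option("hoodie.cleaner.commits.retained", "10")
  // Archiving: bound the active timeline by archiving old metadata.
  .option("hoodie.keep.min.commits", "20")
  .option("hoodie.keep.max.commits", "30")
  .mode(SaveMode.Append)
  .save("/path/on/dfs")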
24. Clustering: use case
Ingestion and query engines are optimized for different things:
● Data Locality: ingestion stores data based on arrival time, while query engines work better when data queried often is co-located together
● File Size: ingestion prefers small files to increase parallelism, while query performance typically degrades when there are a lot of small files
25. Clustering Overview
● Clustering is a framework to change data layout
○ Pluggable strategy to "re-organize" data
○ Sorting/stitching strategies provided in the open source version
● Flexible policies
○ Configuration to select partitions for rewrite
○ Different partitions can be laid out differently
○ Clustering granularity: global vs local vs custom
● Provides snapshot isolation and time travel for improving operations
○ Clustering is compatible with Hudi Rollback/Restore
○ Updates Hudi metadata and index
● Leverages Multi Version Concurrency Control
○ Clustering can be executed in parallel with ingestion
○ Clustering and other Hudi table services such as compaction can run concurrently
(A sketch of enabling inline clustering follows.)
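As a hedged sketch, inline clustering with a sort-based strategy can be enabled through writer configs. The key names below come from the clustering work of this era (Hudi 0.7); treat exact names and values as assumptions. Sorting on `a,b` mirrors the test setup on the next slides.

// Minimal sketch, assuming Hudi 0.7-era clustering config keys and `df`.
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

df.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.table")
  // Schedule and execute clustering inline, every 4 commits.
  .option("hoodie.clustering.inline", "true")
  .option("hoodie.clustering.inline.max.commits", "4")
  // Stitch small files up toward the target size...
  .option("hoodie.clustering.plan.strategy.small.file.limit", "629145600")
  .option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824")
  // ...and sort on the columns the queries filter on (see next slides).
  .option("hoodie.clustering.plan.strategy.sort.columns", "a,b")
  .mode(SaveMode.Append)
  .save("/path/on/dfs")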
26. Query Plan before clustering
● Test setup: popular production table with 1 partition; no clustering
● The query translates to something like: select c, d from table where a == x, b == y
27. Query Plan after clustering
● Test setup: Table with 1 partition. Clustering performed by sorting on a, b
● Query: select c, d from table where a == x, b == y
28. Performance summary
● 10x reduction in input data processed
● 4x reduction in CPU cost
● More than 50% reduction in query latency

| Table State   | Input data size | Input rows | CPU cost  |
| Non-clustered | 2,290 MB        | 29 M       | 27.56 sec |
| Clustered     | 182 MB          | 3 M        | 6.94 sec  |
29. On-Going Work
➔ Concurrent Writers [RFC-22] & [PR-2374]
◆ Multiple Writers to Hudi tables with file level concurrency control
➔ Hudi Observability [RFC-23]
◆ Collect metrics such as Physical vs Logical, Users, Stage Skews
◆ Use to feedback jobs for auto-tuning
➔ Point index [RFC-08]
◆ Target usage for primary key indexes, e.g. B+ Tree
➔ ORC support [RFC]
◆ Support for ORC file format
➔ Range Index [RFC-15]
◆ Target usage for column ranges and pruning files/row groups (secondary/column indexes)
➔ Enhance Hudi on Flink [RFC-24]
◆ Full feature support for Hudi on Flink version 1.11+
◆ First class support for Flink
➔ Spark-SQL extensions [RFC-25]
◆ DML/DDL operations such as create, insert, merge etc
◆ Spark DatasourceV2 (Spark 3+)
30. Big Picture
Fills a clear void in data ingestion, storage, and processing!
Leads the convergence towards streaming-style processing!
Brings transactional semantics to managing data
Positioned to solve the impending demand for scale & speed
Evolving as a data lake format!
31. Resources
User Docs : https://hudi.apache.org
Technical Wiki : https://cwiki.apache.org/confluence/display/HUDI
Github : https://github.com/apache/incubator-hudi/
Twitter : https://twitter.com/apachehudi
Mailing list(s) : dev-subscribe@hudi.apache.org (send an empty email to subscribe)
dev@hudi.apache.org (actual mailing list)
Slack : https://join.slack.com/t/apache-hudi/signup