High Performance Data Lake with Apache Hudi and Alluxio at T3Go

DATA ORCHESTRATION SUMMIT
2020
High-performance data lake with Apache Hudi
and Alluxio at T3GO
Trevor Zhang | Big Data Sr. Engineer
VinoYang | Head of T3Go Big Data Platform

Agenda
1.T3GO data lake introduction
2.Why Apache Hudi
3. Hudi & Alluxio practice

Data Lake supports T3GO Intelligent Transportation
• Background check
• Face recognition
• transaction
• Behavior
• Driving
• ……
Driver
r
Vehicle Road
Data
Collection
Application
scenario
Cloud
• Safety management
• Driver management
• UBI Insurance
• Driving mode research
• ……
• Vehicle condition
• Driving
• Energy
consumption
• Accident
• Failure
• ……
• Capacity scheduling
• Active maintenance
• Product improvement
• car design
• ……
• Traffic
• Environmental
• Trajectory
• POI
• Abnormal
• ……
• Map drawing
• Real-time traffic
• Safety management
• Municipal management
• ……
• Risk control
• Capacity
• Transaction
• City
• User
• ……
• Intelligent scheduling
• Intelligent decision
• Smart marketing
• Customer
Experience
• ……

A data lake is a centralized repository that allows
you to store all your structured and
unstructured data at any scale. You can store
your data as-is, without having to first structure
the data, and run different types of analytics—from
dashboards and visualizations to big data
processing, real-time analytics, and machine
learning to guide better decisions.
What is data lake ？

Shared-nothing (pros)
• Tables are horizontally partitioned across nodes
• Every node has its own local storage
• Every node is only responsible for its local table partitions
• Elegant and easy to reason about
• Scales well for star-schema queries
• Dominant architecture in data warehousing
Network
CPU
Memory
Disk

Shared-nothing (cons)
• Shared-nothing couples compute and storage resources
• Elasticity
• Resizing compute cluster requires redistributing (lots of) data
• Cannot simply shut off unused compute resources —> no pay-per-use
• Limited availability
• Membership changes (failures, upgrades) significantly
impact performance and may cause downtime
• Homogeneous resources vs. heterogeneous workload
• Bulk loading, reporting, exploratory analysis
Network
CPU
Memory
Disk

Multi-cluster, Shared-data
• No data silos
• Storage decoupled from compute
• Any data
• Native for structeured & semi-structured
• Unlimited scalabilitiy
• Along many dimensions
• Homogeneous resources VS heterogeneous loads
• Bulk loading, reporting, exploration and analysis
Data lake Storage
Ad-Hoc Cluster
OLAP Cluster
Data Warehouse Cluster
ETL
Cluster
BI
Cluster
ML Cluster

Multi-cluster, Shared-data
• All data in one place
• Independently scale storage
and compute
• No unload / reload to
shut off compute
• Every virtual warehouse can
access all data

T3GO data lake technical architecture diagram
Aliyun OSS
YARN
Data Lake Storage
Storage format
Orchestration
acceleration
Resource management
Multiple
calculation
Computing
Storage

Why not traditional Hadoop data warehouse
Tim
e
Order payment
rate
Pay the long tail: pay before the next
trip!
• Long business closed-loop window
• The hot and cold data is updated
randomly
and cannot be identified
• Multi-level update, long link, high cost

High backtracking costs for order analysis
Order drive
r
Vehicl
e
Passeng
er
Tri
p
order_id driver_id user_id veh_id … status create_time lastupdate_time
… … … … … … …
xxx xxx xxx xxx xxx end 2020-06-01 xx:xx:xx
… … … … … … …
Order(Snapshot
Table)
driver_id
Driver(Snapshot
Table)
user_id
User(Snapshot
Table)
veh_id
Trip(Snapshot
Table)
The historical snapshot half year ago is no longer accessible!

Data ingestion pipeline cannot guarantee reliability
Business
system
Data
Warehouse
BI /
Report
Data
Ingest
Data
Processing
1. 10W data is successfully written 9.97W?
2. Incorrect calculation logic leads to dirty data?
3. Repeatedly write data due to unstable network?

Summary
Pain points of Hadoop data warehouse system
Low
Reliability
Small File
Problem
Missing Data
Version
Not support
Incremental
Processing
High
Latency

Introduction to Apache Hudi
Hadoop Upserts Deletes and Incrementals
Manage DFS/cloud ultra-large-scale (hundreds of PB)
analysis datasets
Incremental data lake processing framework supporting
insert, update, and delete
Joined Apache incubator in January 2019, graduated as
TLP in May 2020
All cloud services (AWS/Tencent Cloud/Aliyun) are
available out of the box
Has been operating stably on Uber for nearly 4 years
ACID
Storage management Time
travel
Incremental

Hudi plug-in architecture
Pluggable
Index
(Bloom/HBase)
Pluggable
Data format
(Avro, Parquet)
Timeline
Metadata
Hive
Hudi DataSet
Presto
Spark
write read
Storage type Query/View
Impala
Read Optimized Query
COW
MOR
Pluggable Storage(HDFS, OSS, S3)
Java
Flink
Spark
Python
Increamental Query
Snapshot Query

Hudi storage mode and view
Storage Mode
Supported Query
Type
Features
Copy On Write
• Snapshot Query
• Incremental Query
• Read Heavy
• Focus on low-latency queries
• Columnar Parquet data file
Merge On Read
• Snapshot Query
• Incremental Query
• Read Optimized
Query
• Write Heavy
• Focus on rapid data
ingestion
• Columnar Parquet data file
• Line Avro incremental file
Query Engine Snapshot Queries Incremental Queries
Read Optimized
Queries
Hive Y Y -
Spark SQL Y Y -
Spark Datasource Y Y -
Presto Y N -
Impala Y N -
Hive Y Y Y
Spark SQL Y Y Y
Spark Datasource Y N Y
Presto Y N Y
Impala N N Y

The time travel query makes "back in time"
Order drive
r
Vehicl
e
Passeng
er
Tri
p
order_id driver_id user_id veh_id … status create_time lastupdate_time
… … … … … … …
xxx xxx xxx xxx xxx end 2020-06-01 xx:xx:xx
… … … … … … …
Order(Snapshot
Table)
driver_id
Driver(v_2020-06-0
1)
user_id
User(v_2020-06-0
1)
veh_id
Trip(v_2020-06-0
1)
Take time back to the
moment
the order occurred!
Time
Travel
Data
Version
Hudi
Feature：

Hudi guarantees the reliability of the data ingestion pipeline
Business
system
Data
Warehouse
BI /
Report
Data
Ingest
Data
Processing
Invisible
!
All data commit rollback
!
1. 10W data is successfully written 9.97W?
2. Incorrect calculation logic leads to dirty data?
3. Repeatedly write data due to unstable network?
Deduplication based on index
!
Hudi MVCC writes update data to versioned Parquet/base and log
files!

Why T3go data lake need Alluxio
Serious network delay when reading and writing
Multi-cluster naming is not uniform
Low cluster stability
Low memory resource utilization
High timeout tolerance
Inefficient calculation
Serious network delay
Miss Cache ?
T3 Trips
Store

Data Lake benefit from Alluxio
Better read and write performance
Unified namespace
Higher cluster stability
Higher cluster resource utilization
Reduce timeout
T3 Trips
Store
Efficient Calculation
Low Latency
Alluxio

Hudi and Alluxio integration
OSS
Spark
Hudi target-base-path oss://……
Alluxio
Spark
Hudi target-base-path alluxio://……
OSS
change

How T3GO data lake uses Alluxio & Hudi
OSS
Spark Cluster
Presto workers
Write Hudi File
Alluxio Cluster B Kylin
Short-Circuit Local Reads Short-Circuit Local Reads
Read Hive Table Read Hive Table
Ad-hoc cluster Kylin Cluster
Alluxio Cluster A
Sync to OSS
Alluxio Cluster C

Hudi&Alluxio case 1 :near-real-time analysis on data lake
Low-latency data ingest
• Hudi and Spark decoupling
• Support Flink streaming
write
Efficient and fast data
processing
• Write a commit
notification
• Scheduling integration
Low-latency interactive query analysis
• Zeppelin、presto integration
• Alluxio data orchestration
acceleration
Streaming
consume
Streaming
Product
Scheduling
processing
Data
orchestration

Hudi&Alluxio case 1 :near-real-time analysis on data lake
OSS
Presto workers Alluxio Cluster B Kylin
Short-Circuit Local Reads Short-Circuit Local Reads
Read Hive Table Read Hive Table
Alluxio Cluster C
Load hudi to kylin temp tableLoad hudi to presto local worker
Ad-hoc
Query
Self-service Report
Analysis

Hudi&Alluxio case 2 : Spark multi-layer ETL and data processing
DWS
OSS
DWD
ODS
Load
Sync

Alluxio pressure test
Hudi on oss
performance is
poor! In the pressure test, after the
data volume is greater than a
certain magnitude (2400W), the
query speed using alluxio+oss
surpasses the HDFS query speed
of hybrid deployment.
After the data volume is greater
than 1E, the query speed starts to
double. After reaching 6E data, it
is up to 12 times higher than
querying native oss and 8 times
higher than querying native
HDFS.
The increase factor depends on
the machine configuration.

2020
Thanks

High Performance Data Lake with Apache Hudi and Alluxio at T3Go

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to High Performance Data Lake with Apache Hudi and Alluxio at T3Go

Similar to High Performance Data Lake with Apache Hudi and Alluxio at T3Go (20)

More from Alluxio, Inc.

More from Alluxio, Inc. (20)

Recently uploaded

Recently uploaded (20)

High Performance Data Lake with Apache Hudi and Alluxio at T3Go