DATA ORCHESTRATION SUMMIT
2020
High-performance data lake with Apache Hudi
and Alluxio at T3GO
Trevor Zhang | Big Data Sr. Engineer
Vino Yang | Head of T3Go Big Data Platform
Agenda
1. T3GO data lake introduction
2. Why Apache Hudi
3. Hudi & Alluxio practice
Data Lake supports T3GO Intelligent Transportation
• Background check
• Face recognition
• Transaction
• Behavior
• Driving
• ……
Driver
Vehicle
Road
Data Collection
Application scenario
Cloud
• Safety management
• Driver management
• UBI Insurance
• Driving mode research
• ……
• Vehicle condition
• Driving
• Energy consumption
• Accident
• Failure
• ……
• Capacity scheduling
• Active maintenance
• Product improvement
• Car design
• ……
• Traffic
• Environmental
• Trajectory
• POI
• Abnormal
• ……
• Map drawing
• Real-time traffic
• Safety management
• Municipal management
• ……
• Risk control
• Capacity
• Transaction
• City
• User
• ……
• Intelligent scheduling
• Intelligent decision
• Smart marketing
• Customer experience
• ……
A data lake is a centralized repository that allows
you to store all your structured and
unstructured data at any scale. You can store
your data as-is, without having to first structure
the data, and run different types of analytics—from
dashboards and visualizations to big data
processing, real-time analytics, and machine
learning to guide better decisions.
What is a data lake?
Shared-nothing (pros)
• Tables are horizontally partitioned across nodes
• Every node has its own local storage
• Every node is only responsible for its local table partitions
• Elegant and easy to reason about
• Scales well for star-schema queries
• Dominant architecture in data warehousing
Network
CPU
Memory
Disk
Shared-nothing (cons)
• Shared-nothing couples compute and storage resources
• Elasticity
• Resizing compute cluster requires redistributing (lots of) data
• Cannot simply shut off unused compute resources -> no pay-per-use
• Limited availability
• Membership changes (failures, upgrades) significantly
impact performance and may cause downtime
• Homogeneous resources vs. heterogeneous workload
• Bulk loading, reporting, exploratory analysis
Network
CPU
Memory
Disk
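The elasticity point above can be made concrete with a toy model. The sketch below assumes rows are placed by `key % node_count`, a simplification chosen for illustration rather than any specific warehouse's partitioning scheme, and counts how many rows change owner when a four-node cluster grows to five.

```python
def owner(key: int, node_count: int) -> int:
    """Return the node that stores `key` under simple modulo partitioning."""
    return key % node_count


def rows_moved(keys, old_nodes: int, new_nodes: int) -> int:
    """Count keys whose owning node changes when the cluster is resized."""
    return sum(1 for k in keys if owner(k, old_nodes) != owner(k, new_nodes))


keys = range(10_000)
moved = rows_moved(keys, old_nodes=4, new_nodes=5)
print(f"{moved} of {len(keys)} rows must be redistributed")
```

In this model most rows move after adding a single node. When storage is decoupled from compute, adding a compute node does not change where the data lives.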
Multi-cluster, Shared-data
• No data silos
• Storage decoupled from compute
• Any data
• Native for structured & semi-structured
• Unlimited scalability
• Along many dimensions
• Homogeneous resources VS heterogeneous loads
• Bulk loading, reporting, exploration and analysis
Data lake Storage
Ad-Hoc Cluster
OLAP Cluster
Data Warehouse Cluster
ETL Cluster
BI Cluster
ML Cluster
Multi-cluster, Shared-data
• All data in one place
• Independently scale storage and compute
• No unload / reload to shut off compute
• Every virtual warehouse can access all data
T3GO data lake technical architecture diagram
Aliyun OSS
YARN
Data Lake Storage
Storage format
Orchestration acceleration
Resource management
Multiple calculation
Computing
Storage
Why not traditional Hadoop data warehouse
[Chart: order payment rate over time]
Long-tail payments: riders can pay any time before their next trip!
• Long business closed-loop window
• Hot and cold data are updated randomly and cannot be told apart
• Multi-level updates, long pipelines, high cost
High backtracking costs for order analysis
Order
Driver
Vehicle
Passenger
Trip
order_id driver_id user_id veh_id … status create_time lastupdate_time
… … … … … … …
xxx xxx xxx xxx xxx end 2020-06-01 xx:xx:xx
… … … … … … …
Order (Snapshot Table)
driver_id
Driver (Snapshot Table)
user_id
User (Snapshot Table)
veh_id
Trip (Snapshot Table)
Historical snapshots from half a year ago are no longer accessible!
Data ingestion pipeline cannot guarantee reliability
Business system
Data Warehouse
BI / Report
Data Ingest
Data Processing
1. 100,000 records sent, but only 99,700 written successfully?
2. Incorrect calculation logic produces dirty data?
3. Data written repeatedly because of an unstable network?
Summary
Pain points of Hadoop data warehouse system
Low Reliability
Small File Problem
Missing Data Versions
No Support for Incremental Processing
High Latency
Agenda
1. T3GO data lake introduction
2. Why Apache Hudi
3. Hudi & Alluxio practice
Introduction to Apache Hudi
Hadoop Upserts Deletes and Incrementals
Manages ultra-large-scale (hundreds of PB) analytical datasets
on DFS and cloud storage
Incremental data lake processing framework supporting
insert, update, and delete
Joined the Apache Incubator in January 2019, graduated as
a top-level project (TLP) in May 2020
Works out of the box on major cloud services
(AWS/Tencent Cloud/Aliyun)
Has been running stably at Uber for nearly 4 years
ACID
Storage management
Time travel
Incremental
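The upsert, delete, and incremental ideas listed above can be sketched with a small in-memory model. This is not Hudi's API or storage layout. `ToyTable`, its methods, and the integer commit times are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """In-memory stand-in for a keyed table with commit-stamped rows."""
    rows: dict = field(default_factory=dict)      # record key -> (commit_time, row)
    deleted: dict = field(default_factory=dict)   # record key -> commit_time of the delete

    def upsert(self, commit_time: int, records: dict) -> None:
        """Insert new keys and overwrite existing ones in a single commit."""
        for key, row in records.items():
            self.rows[key] = (commit_time, row)
            self.deleted.pop(key, None)

    def delete(self, commit_time: int, keys) -> None:
        """Remove keys and remember when they were removed."""
        for key in keys:
            if self.rows.pop(key, None) is not None:
                self.deleted[key] = commit_time

    def incremental(self, since: int) -> dict:
        """Return only rows written by commits strictly after `since`."""
        return {k: row for k, (ts, row) in self.rows.items() if ts > since}


table = ToyTable()
table.upsert(1, {"o1": {"status": "created"}, "o2": {"status": "created"}})
table.upsert(2, {"o1": {"status": "paid"}})   # update of an existing key
table.delete(3, ["o2"])
changes = table.incremental(since=1)
print(changes)
```

A downstream job that remembers the last commit it processed can ask only for `incremental(since=...)` instead of rescanning the whole table. That is the property a snapshot-only warehouse lacks.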
Hudi plug-in architecture
Pluggable Index (Bloom/HBase)
Pluggable Data format (Avro, Parquet)
Timeline
Metadata
Hive
Hudi DataSet
Presto
Spark
write read
Storage type Query/View
Impala
Read Optimized Query
COW
MOR
Pluggable Storage(HDFS, OSS, S3)
Java
Flink
Spark
Python
Incremental Query
Snapshot Query
Hudi storage mode and view
Storage Mode: Copy On Write
Supported Query Types:
• Snapshot Query
• Incremental Query
Features:
• Read heavy
• Focus on low-latency queries
• Columnar Parquet data files
Storage Mode: Merge On Read
Supported Query Types:
• Snapshot Query
• Incremental Query
• Read Optimized Query
Features:
• Write heavy
• Focus on rapid data ingestion
• Columnar Parquet data files
• Row-based Avro incremental (log) files
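As a rough illustration of how the storage mode is chosen at write time, the helper below builds a dictionary of Hudi Spark datasource write options. The option keys are the ones I understand Hudi's Spark datasource to use, but key names have changed between releases (older versions used `hoodie.datasource.write.storage.type`), so check them against the configuration reference for your version. The function name, the `write_heavy` flag, and the table name are hypothetical. `order_id` and `lastupdate_time` are borrowed from the order table shown earlier in the deck.

```python
def hudi_write_options(table_name: str, write_heavy: bool) -> dict:
    """Build Hudi Spark datasource write options for a workload profile."""
    # Write-heavy ingestion favours Merge On Read; read-heavy favours Copy On Write.
    table_type = "MERGE_ON_READ" if write_heavy else "COPY_ON_WRITE"
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.operation": "upsert",
        # Assumed key and ordering columns, taken from the order table in the slides.
        "hoodie.datasource.write.recordkey.field": "order_id",
        "hoodie.datasource.write.precombine.field": "lastupdate_time",
    }


ingest_opts = hudi_write_options("orders_demo", write_heavy=True)    # hypothetical table name
report_opts = hudi_write_options("orders_demo", write_heavy=False)
print(ingest_opts["hoodie.datasource.write.table.type"])
```

With PySpark and the Hudi bundle on the classpath, a dictionary like this would typically be passed as `df.write.format("hudi").options(**ingest_opts)`. That step is omitted here to keep the sketch dependency-free.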
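Copy On Write tables
Query Engine       Snapshot Queries   Incremental Queries   Read Optimized Queries
Hive               Y                  Y                     -
Spark SQL          Y                  Y                     -
Spark Datasource   Y                  Y                     -
Presto             Y                  N                     -
Impala             Y                  N                     -
Merge On Read tables
Query Engine       Snapshot Queries   Incremental Queries   Read Optimized Queries
Hive               Y                  Y                     Y
Spark SQL          Y                  Y                     Y
Spark Datasource   Y                  N                     Y
Presto             Y                  N                     Y
Impala             N                  N                     Y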
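The support matrix above can be encoded as a small lookup, which is handy when deciding which engine to route a query to. The sketch reads the first five engine rows as Copy On Write and the next five as Merge On Read, and it reflects the state of the slide, not current Hudi releases. `SUPPORT`, `supported`, and `engines_for` are hypothetical names.

```python
# Support matrix transcribed from the slide.
# Each tuple is (snapshot, incremental, read_optimized); None marks a "-" cell.
SUPPORT = {
    "COW": {
        "Hive": (True, True, None),
        "Spark SQL": (True, True, None),
        "Spark Datasource": (True, True, None),
        "Presto": (True, False, None),
        "Impala": (True, False, None),
    },
    "MOR": {
        "Hive": (True, True, True),
        "Spark SQL": (True, True, True),
        "Spark Datasource": (True, False, True),
        "Presto": (True, False, True),
        "Impala": (False, False, True),
    },
}
QUERY_COLUMN = {"snapshot": 0, "incremental": 1, "read_optimized": 2}


def supported(table_type: str, engine: str, query_type: str) -> bool:
    """True only when the slide marks the combination with a Y."""
    return SUPPORT[table_type][engine][QUERY_COLUMN[query_type]] is True


def engines_for(table_type: str, query_type: str) -> list:
    """List the engines the slide marks as supporting a query type."""
    return sorted(e for e in SUPPORT[table_type] if supported(table_type, e, query_type))


print(engines_for("MOR", "incremental"))
```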
Time travel queries make it possible to go "back in time"
Order
Driver
Vehicle
Passenger
Trip
order_id driver_id user_id veh_id … status create_time lastupdate_time
… … … … … … …
xxx xxx xxx xxx xxx end 2020-06-01 xx:xx:xx
… … … … … … …
Order (Snapshot Table)
driver_id
Driver (v_2020-06-01)
user_id
User (v_2020-06-01)
veh_id
Trip (v_2020-06-01)
Go back to the moment the order occurred!
Hudi feature: Time Travel, Data Version
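The idea of joining an order against the dimension data as it looked when the order was created can be sketched with a versioned lookup. This is a conceptual model only, not how Hudi stores or queries versions. `VersionedDimension`, the driver key `d42`, and the sample rows are made up for illustration.

```python
from bisect import bisect_right


class VersionedDimension:
    """Keeps every committed version of a dimension row, ordered by commit date."""

    def __init__(self):
        self._dates = {}   # key -> ascending list of commit dates (ISO strings)
        self._rows = {}    # key -> rows, aligned with self._dates[key]

    def commit(self, key, date, row):
        """Append a new version; dates are assumed to arrive in ascending order."""
        self._dates.setdefault(key, []).append(date)
        self._rows.setdefault(key, []).append(row)

    def as_of(self, key, date):
        """Return the latest version committed on or before `date`, or None."""
        dates = self._dates.get(key, [])
        i = bisect_right(dates, date)
        return self._rows[key][i - 1] if i else None


drivers = VersionedDimension()
drivers.commit("d42", "2020-01-15", {"level": "silver"})   # sample data, not from the talk
drivers.commit("d42", "2020-09-01", {"level": "gold"})

order = {"order_id": "o1", "driver_id": "d42", "create_time": "2020-06-01"}
driver_then = drivers.as_of(order["driver_id"], order["create_time"])
print(driver_then)
```

ISO-formatted dates compare correctly as plain strings, which is why the sketch can use `bisect_right` on them directly.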
Hudi guarantees the reliability of the data ingestion pipeline
Business system
Data Warehouse
BI / Report
Data Ingest
Data Processing
Invisible!
Full commit rollback!
1. 100,000 records sent, but only 99,700 written successfully?
2. Incorrect calculation logic produces dirty data?
3. Data written repeatedly because of an unstable network?
Deduplication based on index!
Hudi MVCC writes updated data to versioned Parquet base files and log files!
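The three guarantees named above (uncommitted data stays invisible, a failed commit is rolled back as a whole, repeated writes are deduplicated by key) can be sketched with a toy commit log. It is a simplified model, not Hudi's timeline implementation. `ToyCommitLog` and the commit ids are hypothetical.

```python
class ToyCommitLog:
    """Stages writes per commit; readers only ever see completed commits."""

    def __init__(self):
        self.visible = {}    # record key -> row, from completed commits only
        self.inflight = {}   # commit id -> staged rows, not yet visible

    def stage(self, commit_id, key, row):
        """Buffer a row under an in-flight commit; a repeated key overwrites."""
        self.inflight.setdefault(commit_id, {})[key] = row

    def commit(self, commit_id):
        """Publish all staged rows of a commit in one step."""
        self.visible.update(self.inflight.pop(commit_id))

    def rollback(self, commit_id):
        """Discard a failed commit; nothing it staged was ever visible."""
        self.inflight.pop(commit_id, None)


log = ToyCommitLog()

# A batch that fails part-way: its rows were never visible and are dropped.
log.stage("c1", "o1", {"status": "paid"})
log.rollback("c1")
visible_after_failure = dict(log.visible)

# The retried batch sends one record twice because of a flaky network.
log.stage("c2", "o1", {"status": "paid"})
log.stage("c2", "o1", {"status": "paid"})
log.stage("c2", "o2", {"status": "created"})
log.commit("c2")
print(visible_after_failure, len(log.visible))
```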
Agenda
1. T3GO data lake introduction
2. Why Apache Hudi
3. Hudi & Alluxio practice
Why the T3GO data lake needs Alluxio
Serious network delay when reading and writing
Multi-cluster naming is not uniform
Low cluster stability
Low memory resource utilization
High timeout tolerance
Inefficient calculation
Serious network delay
Cache miss?
T3 Trips
Store
How the data lake benefits from Alluxio
Better read and write performance
Unified namespace
Higher cluster stability
Higher cluster resource utilization
Fewer timeouts
T3 Trips
Store
Efficient Calculation
Low Latency
Alluxio
Hudi and Alluxio integration
OSS
Spark
Hudi target-base-path oss://……
Alluxio
Spark
Hudi target-base-path alluxio://……
OSS
change
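The change shown above amounts to pointing the Hudi base path at an `alluxio://` URI instead of an `oss://` one. The helper below sketches that mapping under the assumption that the OSS bucket is mounted at some path in the Alluxio namespace. The bucket name, master host, and mount point are all hypothetical, since the slide elides the real paths.

```python
from urllib.parse import urlsplit


def to_alluxio_path(oss_uri: str, alluxio_master: str, mount_point: str) -> str:
    """Map an oss:// table path onto the Alluxio mount that fronts the same bucket."""
    parts = urlsplit(oss_uri)
    if parts.scheme != "oss":
        raise ValueError(f"expected an oss:// URI, got {oss_uri!r}")
    # The bucket name (netloc) is dropped because the mount point already stands for it.
    return f"alluxio://{alluxio_master}{mount_point.rstrip('/')}{parts.path}"


new_base = to_alluxio_path(
    "oss://example-bucket/hudi/orders",      # hypothetical bucket and table path
    alluxio_master="alluxio-master:19998",   # hypothetical host; 19998 is Alluxio's default master RPC port
    mount_point="/oss",                      # hypothetical Alluxio mount of the bucket
)
print(new_base)
```

The data still lives in OSS. Only the path the writer uses changes, so reads and writes go through the Alluxio layer.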
How T3GO data lake uses Alluxio & Hudi
OSS
Spark Cluster
Presto workers
Write Hudi File
Alluxio Cluster B Kylin
Short-Circuit Local Reads Short-Circuit Local Reads
Read Hive Table Read Hive Table
Ad-hoc cluster Kylin Cluster
Alluxio Cluster A
Sync to OSS
Alluxio Cluster C
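The data flow in this diagram (writers persist through one cache layer to OSS, query engines read through another layer close to them) can be sketched as a read-through and write-through cache. This is a toy model, not Alluxio's client API. The class names and file path are hypothetical, and the slide does not say whether the sync to OSS is synchronous, so the sketch simply assumes it is.

```python
class ToyUnderStore:
    """Stands in for remote object storage and counts remote reads."""

    def __init__(self):
        self.objects = {}
        self.remote_reads = 0

    def get(self, path):
        self.remote_reads += 1
        return self.objects[path]

    def put(self, path, data):
        self.objects[path] = data


class ToyCache:
    """Stands in for a caching layer colocated with a compute cluster."""

    def __init__(self, under_store):
        self.under_store = under_store
        self.cached = {}

    def write(self, path, data):
        """Keep a local copy and persist to the under store (assumed synchronous)."""
        self.cached[path] = data
        self.under_store.put(path, data)

    def read(self, path):
        """Serve the local copy when present, else load it once from the under store."""
        if path not in self.cached:
            self.cached[path] = self.under_store.get(path)
        return self.cached[path]


oss = ToyUnderStore()
ingest_cache = ToyCache(oss)   # plays the role of the cluster next to Spark
query_cache = ToyCache(oss)    # plays the role of the cluster next to Presto

ingest_cache.write("orders/part-0001.parquet", b"rows")   # hypothetical file name
for _ in range(3):
    data = query_cache.read("orders/part-0001.parquet")
print(oss.remote_reads)
```

Only the first query-side read reaches the remote store. Repeated reads are served locally, which is the effect the short-circuit local reads in the diagram aim for.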
Hudi & Alluxio case 1: near-real-time analysis on the data lake
Low-latency data ingestion
• Hudi and Spark decoupling
• Support for Flink streaming writes
Efficient and fast data processing
• Notification on write commit
• Scheduling integration
Low-latency interactive query analysis
• Zeppelin and Presto integration
• Alluxio data orchestration acceleration
Streaming consume
Streaming produce
Scheduling processing
Data orchestration
Hudi & Alluxio case 1: near-real-time analysis on the data lake
OSS
Presto workers Alluxio Cluster B Kylin
Short-Circuit Local Reads Short-Circuit Local Reads
Read Hive Table Read Hive Table
Alluxio Cluster C
Load Hudi to Kylin temp table
Load Hudi to Presto local worker
Ad-hoc Query
Self-service Report
Analysis
Hudi & Alluxio case 2: Spark multi-layer ETL and data processing
DWS
OSS
DWD
ODS
Load
Sync
Alluxio pressure test
Hudi on OSS alone performs poorly!
In the pressure test, once the data volume exceeds a certain
magnitude (24 million records), queries on Alluxio + OSS become
faster than queries on HDFS in a hybrid deployment.
Beyond 100 million records, the query speed starts to double.
At 600 million records, queries are up to 12 times faster than
on native OSS and 8 times faster than on native HDFS.
The speedup factor depends on the machine configuration.
Thanks

Alluxio, Inc.
 

High Performance Data Lake with Apache Hudi and Alluxio at T3Go

  • 1. DATA ORCHESTRATION SUMMIT 2020 High-performance data lake with Apache Hudi and Alluxio at T3GO Trevor Zhang | Big Data Sr. Engineer VinoYang | Head of T3Go Big Data Platform
  • 2. Agenda 1.T3GO data lake introduction 2.Why Apache Hudi 3. Hudi & Alluxio practice
  • 3. DATA ORCHESTRATION SUMMIT Data Lake supports T3GO Intelligent Transportation • Background check • Face recognition • Transaction • Behavior • Driving • …… Driver Vehicle Road Data Collection Application scenario Cloud • Safety management • Driver management • UBI Insurance • Driving mode research • …… • Vehicle condition • Driving • Energy consumption • Accident • Failure • …… • Capacity scheduling • Active maintenance • Product improvement • Car design • …… • Traffic • Environmental • Trajectory • POI • Abnormal • …… • Map drawing • Real-time traffic • Safety management • Municipal management • …… • Risk control • Capacity • Transaction • City • User • …… • Intelligent scheduling • Intelligent decision • Smart marketing • Customer Experience • ……
  • 4. DATA ORCHESTRATION SUMMIT What is a data lake? A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.
  • 5. DATA ORCHESTRATION SUMMIT Shared-nothing (pros) • Tables are horizontally partitioned across nodes • Every node has its own local storage • Every node is only responsible for its local table partitions • Elegant and easy to reason about • Scales well for star-schema queries • Dominant architecture in data warehousing Network CPU Memory Disk
  • 6. DATA ORCHESTRATION SUMMIT Shared-nothing (cons) • Shared-nothing couples compute and storage resources • Elasticity • Resizing compute cluster requires redistributing (lots of) data • Cannot simply shut off unused compute resources —> no pay-per-use • Limited availability • Membership changes (failures, upgrades) significantly impact performance and may cause downtime • Homogeneous resources vs. heterogeneous workload • Bulk loading, reporting, exploratory analysis Network CPU Memory Disk
  • 7. DATA ORCHESTRATION SUMMIT Multi-cluster, Shared-data • No data silos • Storage decoupled from compute • Any data • Native for structured & semi-structured • Unlimited scalability • Along many dimensions • Homogeneous resources vs. heterogeneous loads • Bulk loading, reporting, exploration and analysis Data lake Storage Ad-Hoc Cluster OLAP Cluster Data Warehouse Cluster ETL Cluster BI Cluster ML Cluster
  • 8. DATA ORCHESTRATION SUMMIT Multi-cluster, Shared-data • All data in one place • Independently scale storage and compute • No unload / reload to shut off compute • Every virtual warehouse can access all data
  • 9. DATA ORCHESTRATION SUMMIT T3GO data lake technical architecture diagram Aliyun OSS YARN Data Lake Storage Storage format Orchestration acceleration Resource management Multiple calculation Computing Storage
  • 10. DATA ORCHESTRATION SUMMIT Why not a traditional Hadoop data warehouse Time Order payment rate Long-tail payments: pay before the next trip! • Long business closed-loop window • Hot and cold data are updated randomly and cannot be told apart • Multi-level updates, long pipeline, high cost
  • 11. DATA ORCHESTRATION SUMMIT High backtracking costs for order analysis Order Driver Vehicle Passenger Trip order_id driver_id user_id veh_id … status create_time lastupdate_time … … … … … … … xxx xxx xxx xxx xxx end 2020-06-01 xx:xx:xx … … … … … … … Order(Snapshot Table) driver_id Driver(Snapshot Table) user_id User(Snapshot Table) veh_id Trip(Snapshot Table) The historical snapshot from half a year ago is no longer accessible!
  • 12. DATA ORCHESTRATION SUMMIT Data ingestion pipeline cannot guarantee reliability Business system Data Warehouse BI / Report Data Ingest Data Processing 1. Of 100,000 (10W) records, only 99,700 (9.97W) are successfully written? 2. Incorrect calculation logic leads to dirty data? 3. Data is written repeatedly due to an unstable network?
  • 13. DATA ORCHESTRATION SUMMIT Summary Pain points of Hadoop data warehouse system Low Reliability Small File Problem Missing Data Version No Support for Incremental Processing High Latency
  • 14. Agenda 1.T3GO data lake introduction 2.Why Apache Hudi 3. Hudi & Alluxio practice
  • 15. DATA ORCHESTRATION SUMMIT Introduction to Apache Hudi Hadoop Upserts Deletes and Incrementals Manages ultra-large-scale (hundreds of PB) analytical datasets on DFS/cloud storage Incremental data lake processing framework supporting insert, update, and delete Joined the Apache Incubator in January 2019, graduated as a TLP in May 2020 Available out of the box on all cloud services (AWS/Tencent Cloud/Aliyun) Has been operating stably at Uber for nearly 4 years ACID Storage management Time travel Incremental
  • 16. DATA ORCHESTRATION SUMMIT Hudi plug-in architecture Pluggable Index (Bloom/HBase) Pluggable Data format (Avro, Parquet) Timeline Metadata Hive Hudi DataSet Presto Spark write read Storage type Query/View Impala Read Optimized Query COW MOR Pluggable Storage(HDFS, OSS, S3) Java Flink Spark Python Incremental Query Snapshot Query
  • 17. DATA ORCHESTRATION SUMMIT Hudi storage mode and view Storage Mode Supported Query Type Features Copy On Write • Snapshot Query • Incremental Query • Read Heavy • Focus on low-latency queries • Columnar Parquet data file Merge On Read • Snapshot Query • Incremental Query • Read Optimized Query • Write Heavy • Focus on rapid data ingestion • Columnar Parquet data file • Row-based Avro incremental file Query Engine Snapshot Queries Incremental Queries Read Optimized Queries Hive Y Y - Spark SQL Y Y - Spark Datasource Y Y - Presto Y N - Impala Y N - Hive Y Y Y Spark SQL Y Y Y Spark Datasource Y N Y Presto Y N Y Impala N N Y
  • 18. DATA ORCHESTRATION SUMMIT Time travel queries go "back in time" Order Driver Vehicle Passenger Trip order_id driver_id user_id veh_id … status create_time lastupdate_time … … … … … … … xxx xxx xxx xxx xxx end 2020-06-01 xx:xx:xx … … … … … … … Order(Snapshot Table) driver_id Driver(v_2020-06-01) user_id User(v_2020-06-01) veh_id Trip(v_2020-06-01) Take time back to the moment the order occurred! Hudi Feature: Time Travel Data Version
  • 19. DATA ORCHESTRATION SUMMIT Hudi guarantees the reliability of the data ingestion pipeline Business system Data Warehouse BI / Report Data Ingest Data Processing Invisible! All data commits can be rolled back! 1. Of 100,000 (10W) records, only 99,700 (9.97W) are successfully written? 2. Incorrect calculation logic leads to dirty data? 3. Data is written repeatedly due to an unstable network? Deduplication based on index! Hudi MVCC writes update data to versioned Parquet/base and log files!
  • 20. Agenda 1.T3GO data lake introduction 2.Why Apache Hudi 3. Hudi & Alluxio practice
  • 21. DATA ORCHESTRATION SUMMIT Why the T3GO data lake needs Alluxio Serious network delay when reading and writing Multi-cluster naming is not uniform Low cluster stability Low memory resource utilization High timeout tolerance Inefficient calculation Serious network delay Cache miss? T3 Trips Store
  • 22. DATA ORCHESTRATION SUMMIT How the data lake benefits from Alluxio Better read and write performance Unified namespace Higher cluster stability Higher cluster resource utilization Fewer timeouts T3 Trips Store Efficient Calculation Low Latency Alluxio
  • 23. DATA ORCHESTRATION SUMMIT Hudi and Alluxio integration OSS Spark Hudi target-base-path oss://…… Alluxio Spark Hudi target-base-path alluxio://…… OSS change
  • 24. DATA ORCHESTRATION SUMMIT How T3GO data lake uses Alluxio & Hudi OSS Spark Cluster Presto workers Write Hudi File Alluxio Cluster B Kylin Short-Circuit Local Reads Short-Circuit Local Reads Read Hive Table Read Hive Table Ad-hoc cluster Kylin Cluster Alluxio Cluster A Sync to OSS Alluxio Cluster C
  • 25. DATA ORCHESTRATION SUMMIT Hudi & Alluxio case 1: near-real-time analysis on data lake Low-latency data ingest • Hudi and Spark decoupling • Support Flink streaming write Efficient and fast data processing • Write a commit notification • Scheduling integration Low-latency interactive query analysis • Zeppelin and Presto integration • Alluxio data orchestration acceleration Streaming consume Streaming Product Scheduling processing Data orchestration
  • 26. DATA ORCHESTRATION SUMMIT Hudi & Alluxio case 1: near-real-time analysis on data lake OSS Presto workers Alluxio Cluster B Kylin Short-Circuit Local Reads Short-Circuit Local Reads Read Hive Table Read Hive Table Alluxio Cluster C Load Hudi to Kylin temp table Load Hudi to Presto local worker Ad-hoc Query Self-service Report Analysis
  • 27. DATA ORCHESTRATION SUMMIT Hudi&Alluxio case 2 : Spark multi-layer ETL and data processing DWS OSS DWD ODS Load Sync
  • 28. DATA ORCHESTRATION SUMMIT Alluxio pressure test Hudi directly on OSS performs poorly! In the pressure test, once the data volume exceeds a certain magnitude (24 million records, 2400W), queries using Alluxio+OSS become faster than HDFS queries in the hybrid deployment. Beyond 100 million records (1E), the query speed starts to double. At 600 million records (6E), it is up to 12 times faster than querying native OSS and 8 times faster than querying native HDFS. The speedup factor depends on the machine configuration.
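
The integration on slide 23 comes down to pointing the Hudi target base path at alluxio://…… instead of oss://……. The following is a minimal Python sketch of that path switch, not the configuration T3GO actually used. The Alluxio master address, the mount directory, the bucket name, and the helper names are all hypothetical; the `hoodie.*` keys are standard Hudi Spark datasource write options, and the field names are taken from the order table shown on slide 11.

```python
from urllib.parse import urlparse

# Hypothetical values, not from the talk: the Alluxio master address and the
# Alluxio directory under which the OSS bucket is assumed to be mounted.
ALLUXIO_MASTER = "alluxio-master:19998"
ALLUXIO_MOUNT = "/oss"


def to_alluxio_path(oss_path: str) -> str:
    """Map oss://<bucket>/<key> to alluxio://<master><mount>/<bucket>/<key>."""
    parsed = urlparse(oss_path)
    if parsed.scheme != "oss":
        raise ValueError(f"expected an oss:// path, got {oss_path!r}")
    return f"alluxio://{ALLUXIO_MASTER}{ALLUXIO_MOUNT}/{parsed.netloc}{parsed.path}"


def hudi_write_options(table_name: str, record_key: str, precombine: str) -> dict:
    """Build Hudi Spark datasource options for an upsert write."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine,
        "hoodie.datasource.write.operation": "upsert",
    }


if __name__ == "__main__":
    # Only the base path changes; the Hudi options stay the same.
    base_path = to_alluxio_path("oss://example-bucket/lake/ods/order")
    options = hudi_write_options("order", "order_id", "lastupdate_time")
    print(base_path)
    for key, value in options.items():
        print(f"{key}={value}")
    # With a Spark session and the Hudi bundle available, the write itself
    # would look like:
    # df.write.format("hudi").options(**options).mode("append").save(base_path)
```

Because only the URI scheme and prefix change, the same job can in principle be pointed back at OSS by passing the original oss:// path, which matches the "change" arrow on the slide.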