Iceberg + Alluxio for Fast Data Analytics

Beinan Wang & Shouwei Chen @ Alluxio
2021/12/14

Introduction
Beinan Wang
● PrestoDB Committer
● PhD in CE @ Syracuse
● Email: beinan@alluxio.com
● Interactive Query / Compute Engine / Caching
Shouwei Chen
● Core Maintainer @ Alluxio
● PhD in ECE @ Rutgers
● Email: shouwei@alluxio.com
● Data lake / Structured data / Community
Find us on Alluxio community slack!
https://alluxio.io/slack
ALLUXIO 2

Outline
● Alluxio Overview
● Running Iceberg with Alluxio
● Querying your Iceberg Table with Presto
● Presto Iceberg connector updates
● Q & A

Open Source Started From UC Berkeley AMPLab in 2014
● 1,000+ contributors & growing
● 5,000+ Slack community members
● Top 10 most critical Java-based open source project
● GitHub's top 100 most valuable repositories out of 96 million
● Join the conversation on Slack: alluxio.io/slack

Data Orchestration for Analytics & AI in the Cloud

DATA ACCESSIBILITY
Access any storage using any compute

BRING DATA CLOSER TO COMPUTE ACROSS SILOS
Access-based data movement for compute and storage spread across environments
[Diagram: Amazon EMR, Cloud Dataproc, Kubernetes Engine, Compute Engine, and Hive spread across Region A, Region B, and private data centers (Datacenter 1, Datacenter 2)]

COMMON USE CASES
● CASE 01: CLOUD. Consistent SLAs, performance, and cost savings on cloud storage. [Diagram: Tensorflow on Alluxio in the public cloud]
● CASE 02: HYBRID. Hybrid cloud gateway to utilize on-prem compute for data in the cloud. [Diagram: Spark on Alluxio on premise, data in the public cloud]
● CASE 03: MULTI-DATACENTER. Cross-datacenter access without changing the ingest pipeline across regions. [Diagram: Presto on Alluxio, Datacenter 1, Datacenter 2, ingestion]

Alluxio - Key Innovations
● EFFICIENT ACCESS & EASY DATA MANAGEMENT: Acceleration, efficient representation and movement of data based on policies
● ENVIRONMENT AGNOSTIC & MULTI-CLOUD READY: Orchestrate a data platform with agility across regions for private, hybrid or multi-cloud
● UNIFY DATA LAKES: Support multiple APIs for analytics and AI with storage abstraction and streamlined data movement across the pipeline

EXAMPLE JOURNEY
On-premises storage as the source of truth
[Diagram: Region A, Region B, and private data centers; labels include Amazon EMR, Datacenter 2, ingestion/ETL, and Hive]

Why use Alluxio with Iceberg?
● Improve I/O performance and efficiency for data analytics through better data locality.
● Simplify the management of Iceberg files alongside the compute engine.
● Avoid having Iceberg talk directly to an eventually consistent file system.

How to integrate Alluxio with Iceberg?

Alluxio Write Type
Write Type        Description
MUST_CACHE        Writes directly to Alluxio
*THROUGH          Writes directly to under storage
*CACHE_THROUGH    Writes to Alluxio and under storage synchronously
ASYNC_THROUGH     Writes to Alluxio first, then asynchronously writes to the under storage
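
To make the four write types concrete, here is a minimal toy model in Python. It is only a sketch of where data lands under each type as described in the table; `ToyAlluxio`, its dictionaries, and `flush_async` are hypothetical names for illustration and are not the Alluxio client API.

```python
from enum import Enum


class WriteType(Enum):
    MUST_CACHE = "MUST_CACHE"
    THROUGH = "THROUGH"
    CACHE_THROUGH = "CACHE_THROUGH"
    ASYNC_THROUGH = "ASYNC_THROUGH"


class ToyAlluxio:
    """Toy in-memory model (hypothetical); not the real Alluxio client."""

    def __init__(self):
        self.cache = {}          # stands in for Alluxio-managed storage
        self.under_storage = {}  # stands in for the under storage, e.g. S3
        self.pending = []        # paths waiting for asynchronous persistence

    def write(self, path, data, write_type):
        # Types that place a copy in Alluxio.
        if write_type in (WriteType.MUST_CACHE,
                          WriteType.CACHE_THROUGH,
                          WriteType.ASYNC_THROUGH):
            self.cache[path] = data
        # Types that persist synchronously to the under storage.
        if write_type in (WriteType.THROUGH, WriteType.CACHE_THROUGH):
            self.under_storage[path] = data
        # ASYNC_THROUGH persists later, so only queue the path for now.
        if write_type is WriteType.ASYNC_THROUGH:
            self.pending.append(path)

    def flush_async(self):
        # Simulates the background persistence step of ASYNC_THROUGH.
        while self.pending:
            path = self.pending.pop(0)
            self.under_storage[path] = self.cache[path]


fs = ToyAlluxio()
fs.write("/t/a", b"1", WriteType.MUST_CACHE)
fs.write("/t/b", b"2", WriteType.THROUGH)
fs.write("/t/c", b"3", WriteType.CACHE_THROUGH)
fs.write("/t/d", b"4", WriteType.ASYNC_THROUGH)
```

In this model only THROUGH and CACHE_THROUGH make the data visible in the under storage by the time `write` returns, which is the property a table commit relies on when other readers look at the under storage directly.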

Alluxio + Iceberg Architecture: Option 1
When all accesses go through Alluxio (S3 is mounted as the under storage where the Iceberg tables are stored):
● Spark can read Iceberg tables from Alluxio.
● Spark can write Iceberg tables to Alluxio.
● Alluxio reads and writes Iceberg tables from/to S3.
[Diagram: Spark, Alluxio, data in S3]

Alluxio + Iceberg Architecture: Option 2
When Iceberg tables stored in the under storage (e.g. S3 here) can be updated outside Alluxio, how do we avoid reading a broken table?
● On read: Spark queries the Iceberg table with "metadata sync interval = 0" ⇒ retrieves the latest Iceberg table.
● On read: Alluxio always checks the metadata and gets the latest Iceberg file and data file from S3.
● On write: Spark writes the Iceberg file and data file to S3 with CACHE_THROUGH/THROUGH ⇒ strong consistency achieved for the Iceberg table commit.
● On write: Alluxio writes to S3 with CACHE_THROUGH/THROUGH, which guarantees strong consistency for the Iceberg table commit.
[Diagram: Spark, Alluxio, data in S3]
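
The effect of a zero metadata sync interval can be illustrated with a small toy cache in Python. This is a sketch under stated assumptions, not Alluxio's implementation: `ToyCache`, the `now` argument, and the path `table/metadata/current` are hypothetical and exist only to show why a cache that re-checks the under storage on every read does not serve a table pointer that an external writer has already replaced.

```python
class ToyCache:
    """Toy read-through cache (hypothetical); not Alluxio's implementation."""

    def __init__(self, under_storage, sync_interval):
        self.under_storage = under_storage  # dict: path -> content, stands in for S3
        self.sync_interval = sync_interval  # seconds; 0 means re-check on every read
        self.entries = {}                   # path -> (content, time of last sync)

    def read(self, path, now):
        entry = self.entries.get(path)
        if entry is None or now - entry[1] >= self.sync_interval:
            # Re-sync with the under storage before serving the read.
            entry = (self.under_storage[path], now)
            self.entries[path] = entry
        return entry[0]


path = "table/metadata/current"  # hypothetical path, for illustration only
s3 = {path: "v1"}

lazy = ToyCache(s3, sync_interval=60)
strict = ToyCache(s3, sync_interval=0)

first_lazy = lazy.read(path, now=0)
first_strict = strict.read(path, now=0)

s3[path] = "v2"  # an external writer commits a new table version directly to S3

second_lazy = lazy.read(path, now=1)      # still inside the 60 s window
second_strict = strict.read(path, now=1)  # interval 0 forces a re-check
```

With the 60-second interval the second read still returns the old pointer, while the zero interval picks up the external commit at the cost of a metadata check on every read.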

Create Table
create table iceberg.test.test1
with (format = 'PARQUET', partitioning = ARRAY['c_birth_month']) as
SELECT
    c_customer_sk,
    c_birth_day,
    c_birth_month
FROM
    tpcds.sf100.customer

Insert
insert into iceberg.test.test1
values (1000, 40, 13);

Query
Screenshot from Chunxu’s talk earlier.

Schema Evolution
Screenshot from Chunxu’s talk earlier.

New Features
● Native folder for metadata storage (Jack Ye, AWS)
● Enable Iceberg Local Cache (Baolong, Tencent)
● Upgrade to Iceberg 0.12.0 and Parquet 1.12.0 (Xinli Shang, Uber and Beinan, Alluxio)
● Predicate pushdown to Iceberg (Beinan Wang, Alluxio)

Iceberg Native Catalog
Native folder for metadata storage (Jack Ye, AWS)

Iceberg Local Cache
Enable Iceberg Local Cache (Baolong, Tencent)
Diagram is from: https://prestodb.io/blog/2021/02/04/raptorx

Predicate Pushdown
Reduce the number of partitions scanned by Presto
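
As a rough illustration of why pushing a predicate down reduces the partitions scanned, the following Python sketch partitions a few rows by `c_birth_month` (the partition column from the Create Table slide) and compares a scan that filters after reading everything with one that selects the matching partition first. The sample rows and function names are made up for illustration; the actual connector prunes using Iceberg table metadata rather than an in-memory dictionary.

```python
from collections import defaultdict

# Sample rows: (c_customer_sk, c_birth_day, c_birth_month). Values are made up.
rows = [(1, 12, 1), (2, 3, 1), (3, 25, 6), (4, 9, 6), (5, 30, 12)]


def partition_by_month(rows):
    """Group rows by the partition column c_birth_month."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[2]].append(row)
    return dict(partitions)


def scan_without_pushdown(partitions, month):
    """Read every partition, then filter. Returns (rows, partitions scanned)."""
    matched = [row for part in partitions.values() for row in part
               if row[2] == month]
    return matched, len(partitions)


def scan_with_pushdown(partitions, month):
    """Use the predicate to pick the partition first. Returns (rows, partitions scanned)."""
    if month not in partitions:
        return [], 0
    return list(partitions[month]), 1


partitions = partition_by_month(rows)
slow_rows, slow_scanned = scan_without_pushdown(partitions, month=6)
fast_rows, fast_scanned = scan_with_pushdown(partitions, month=6)
```

Both scans return the same rows, but the pushdown variant touches one partition instead of all three in this sample.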

Predicate Pushdown Resource Usage
Reduce the number of partitions scanned by Presto

Ongoing Work
● Native Iceberg IO (Jack Ye, AWS)
● Materialized view (Chunxu Tang, Twitter)
● Iceberg v2 support and row-level delete (Beinan Wang, Alluxio)
