Iceberg + Alluxio for Fast Data Analytics
Beinan Wang & Shouwei Chen @ Alluxio
2021/12/14
Introduction
Beinan Wang
● PrestoDB Committer
● PhD in CE @ Syracuse
● Email: beinan@alluxio.com
● Interactive Query / Compute Engine / Caching
Shouwei Chen
● Core Maintainer @ Alluxio
● PhD in ECE @ Rutgers
● Email: shouwei@alluxio.com
● Data lake / Structured data / Community
Find us on the Alluxio community Slack!
https://alluxio.io/slack
ALLUXIO 2
Outline
● Alluxio Overview
● Running Iceberg with Alluxio
● Querying your Iceberg Table with Presto
● Presto Iceberg connector updates
● Q & A
What is Alluxio?
Open source, started at UC Berkeley AMPLab in 2014
● 1,000+ contributors and growing
● 5,000+ Slack community members
● Top 10 most critical Java-based open source project
● One of GitHub’s top 100 most valuable repositories out of 96 million
● Join the conversation on Slack: alluxio.io/slack
Data Orchestration for Analytics & AI in the Cloud
DATA ACCESSIBILITY
Access any storage using any compute
BRING DATA CLOSER TO COMPUTE ACROSS SILOS
Access-based data movement for compute and storage spread across environments
[Diagram labels: Region A, Region B, private data centers (Datacenter 1, Datacenter 2), Amazon EMR, Cloud Dataproc, Kubernetes Engine, Compute Engine, Hive]
COMMON USE CASES
CASE 01: CLOUD
Consistent SLAs, performance, and cost savings on cloud storage
[Diagram labels: TensorFlow, Alluxio, public cloud]
CASE 02: HYBRID
Hybrid cloud gateway to utilize on-prem compute for data in the cloud
[Diagram labels: Spark, Alluxio, public cloud, on premises]
CASE 03: MULTI-DATACENTER
Cross-datacenter access without changing the ingest pipeline across regions
[Diagram labels: Presto, Alluxio, Datacenter 1, Datacenter 2, ingestion]
Alluxio - Key Innovations
● EFFICIENT ACCESS & EASY DATA MANAGEMENT: Acceleration, efficient representation, and movement of data based on policies
● ENVIRONMENT AGNOSTIC & MULTI-CLOUD READY: Orchestrate a data platform with agility across regions for private, hybrid, or multi-cloud
● UNIFY DATA LAKES: Support multiple APIs for analytics and AI with storage abstraction and streamlined data movement across the pipeline
EXAMPLE JOURNEY
On-premises storage as the source of truth
[Diagram labels: Region A, Region B, private data centers, Amazon EMR, Datacenter 2, ingestion, ETL, Hive]
Why use Alluxio with Iceberg?
● Improve I/O performance and efficiency for data analytics through better data locality.
● Simplify the management of Iceberg files together with the compute engine.
● Avoid having Iceberg talk directly to an eventually consistent file system.
How to integrate Alluxio with Iceberg?
Alluxio Write Type
Write Type        Description
MUST_CACHE        Writes directly to Alluxio
*THROUGH          Writes directly to under storage
*CACHE_THROUGH    Writes to Alluxio and under storage synchronously
ASYNC_THROUGH     Writes to Alluxio first, then asynchronously writes to the under storage
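The write types above differ in two respects: whether a copy is kept in Alluxio, and whether the data has reached the under storage by the time the write returns. The sketch below is only a toy model of the table, not Alluxio client code; the function names are made up for illustration.

```python
from enum import Enum


class WriteType(Enum):
    """The four write types listed in the table (names as on the slide)."""
    MUST_CACHE = "MUST_CACHE"
    THROUGH = "THROUGH"
    CACHE_THROUGH = "CACHE_THROUGH"
    ASYNC_THROUGH = "ASYNC_THROUGH"


def write_targets(write_type):
    """Return (cached_in_alluxio, in_under_storage_when_write_returns).

    Hypothetical helper: it encodes the slide's descriptions only.
    """
    table = {
        WriteType.MUST_CACHE: (True, False),     # Alluxio only
        WriteType.THROUGH: (False, True),        # under storage only
        WriteType.CACHE_THROUGH: (True, True),   # both, synchronously
        WriteType.ASYNC_THROUGH: (True, False),  # under storage later, asynchronously
    }
    return table[write_type]


def persisted_on_return(write_type):
    """True if the data is already in the under storage when the write returns."""
    return write_targets(write_type)[1]


if __name__ == "__main__":
    synchronous = [wt.name for wt in WriteType if persisted_on_return(wt)]
    print(synchronous)  # ['THROUGH', 'CACHE_THROUGH']
```

Only THROUGH and CACHE_THROUGH come out as persisted on return, which matches the two entries marked with an asterisk in the table.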
Alluxio + Iceberg Architecture: Option 1
When all accesses go through Alluxio (S3 is mounted as the under storage where the Iceberg tables are stored):
● Spark writes Iceberg tables to Alluxio.
● Spark reads Iceberg tables from Alluxio.
● Alluxio reads and writes the Iceberg tables from/to S3.
[Diagram labels: Spark, Alluxio, data in S3]
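One possible way to wire up Option 1 is to point an Iceberg Hadoop-type catalog in Spark at an `alluxio://` warehouse path, so every read and write goes through Alluxio. The sketch below only assembles the Spark settings as strings. The catalog name, host name, and warehouse path are placeholders, and it assumes the Alluxio client jar and the Iceberg Spark runtime are already on Spark's classpath; the talk does not show its actual configuration.

```python
def iceberg_on_alluxio_conf(catalog, alluxio_master, warehouse_path, port=19998):
    """Build Spark settings for an Iceberg Hadoop catalog stored under Alluxio.

    All arguments are illustrative placeholders; 19998 is Alluxio's
    default master RPC port.
    """
    warehouse = f"alluxio://{alluxio_master}:{port}/{warehouse_path.strip('/')}"
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hadoop",
        f"{prefix}.warehouse": warehouse,
    }


def to_spark_submit_args(conf):
    """Flatten the settings into repeated `--conf key=value` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    conf = iceberg_on_alluxio_conf("alluxio_cat", "alluxio-master", "/iceberg/warehouse")
    print(" ".join(to_spark_submit_args(conf)))
```

The S3 bucket itself is not mentioned in these settings: in this option it is mounted inside Alluxio, and Spark only ever sees the `alluxio://` path.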
Alluxio + Iceberg Architecture: Option 2
When Iceberg tables stored in the under storage (e.g. S3 here) can be updated outside Alluxio, how do we avoid reading a broken table?
● On read: Spark queries the Iceberg table with “metadata sync interval = 0”, so Alluxio always checks the metadata and gets the latest Iceberg file and data file from S3. ⇒ The latest Iceberg table is retrieved.
● On write: Spark writes the Iceberg file and data file through Alluxio to S3 with CACHE_THROUGH/THROUGH. ⇒ Strong consistency is achieved for the Iceberg table commit.
[Diagram labels: Spark, Alluxio, data in S3]
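The two Option 2 settings map onto Alluxio client properties: `alluxio.user.file.metadata.sync.interval` for the sync interval and `alluxio.user.file.writetype.default` for the write type. A minimal sketch of passing them to Spark as JVM options follows; the helper names are hypothetical, and whether you set these per job or cluster-wide is a deployment choice the talk does not cover.

```python
# Write types that reach the under storage before the write returns.
SYNCHRONOUS_WRITE_TYPES = ("CACHE_THROUGH", "THROUGH")


def alluxio_jvm_options(write_type="CACHE_THROUGH", metadata_sync_interval="0"):
    """Render the Alluxio client properties as -D JVM flags."""
    if write_type not in SYNCHRONOUS_WRITE_TYPES:
        raise ValueError(
            f"{write_type} does not persist synchronously; "
            f"use one of {SYNCHRONOUS_WRITE_TYPES}"
        )
    props = {
        "alluxio.user.file.writetype.default": write_type,
        # 0 means: check the under storage for changes on every access.
        "alluxio.user.file.metadata.sync.interval": metadata_sync_interval,
    }
    return " ".join(f"-D{key}={value}" for key, value in props.items())


def spark_conf_for_option2(write_type="CACHE_THROUGH"):
    """Apply the same flags to both the Spark driver and the executors."""
    options = alluxio_jvm_options(write_type)
    return {
        "spark.driver.extraJavaOptions": options,
        "spark.executor.extraJavaOptions": options,
    }


if __name__ == "__main__":
    for key, value in spark_conf_for_option2().items():
        print(f"{key}={value}")
```

The guard rejects MUST_CACHE and ASYNC_THROUGH because, under those types, a commit could return before its files are visible in S3 to readers outside Alluxio.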
Query your Iceberg Table
Create Table
CREATE TABLE iceberg.test.test1
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['c_birth_month']
) AS
SELECT
    c_customer_sk,
    c_birth_day,
    c_birth_month
FROM
    tpcds.sf100.customer;
Insert
INSERT INTO iceberg.test.test1
VALUES (1000, 40, 13);
Query
Screenshot from Chunxu’s talk earlier.
Schema Evolution
Screenshot from Chunxu’s talk earlier.
Iceberg Connector Updates
New Features
● Native folder for metadata storage (Jack Ye, AWS)
● Enable Iceberg local cache (Baolong, Tencent)
● Upgrade to Iceberg 0.12.0 and Parquet 1.12.0 (Xinli Shang, Uber and Beinan Wang, Alluxio)
● Predicate pushdown to Iceberg (Beinan Wang, Alluxio)
Iceberg Native Catalog
Native folder for metadata storage (Jack Ye, AWS)
Iceberg Local Cache
Enable Iceberg Local Cache (Baolong, Tencent)
Diagram is from: https://prestodb.io/blog/2021/02/04/raptorx
Predicate Pushdown
Reduce the number of partitions scanned by Presto
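To see why pushing a predicate down reduces the partitions scanned, consider a table partitioned by `c_birth_month`, as in the Create Table example. The sketch below is a conceptual illustration in plain Python, not the Presto or Iceberg implementation; the rows and function names are invented for the example.

```python
from collections import defaultdict


def group_by_partition(rows, key="c_birth_month"):
    """Group rows by partition value, mimicking a partitioned table layout."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


def scan_with_pruning(partitions, wanted):
    """Read only partitions whose value is in `wanted`.

    Returns the matching rows and the number of partitions actually read.
    """
    scanned = 0
    matched = []
    for value, rows in partitions.items():
        if value not in wanted:
            continue  # pruned: this partition's files are never opened
        scanned += 1
        matched.extend(rows)
    return matched, scanned


if __name__ == "__main__":
    rows = [
        {"c_customer_sk": 1, "c_birth_day": 3, "c_birth_month": 1},
        {"c_customer_sk": 2, "c_birth_day": 9, "c_birth_month": 1},
        {"c_customer_sk": 3, "c_birth_day": 21, "c_birth_month": 6},
        {"c_customer_sk": 4, "c_birth_day": 14, "c_birth_month": 12},
    ]
    partitions = group_by_partition(rows)
    matched, scanned = scan_with_pruning(partitions, wanted={12})
    print(len(partitions), scanned, [r["c_customer_sk"] for r in matched])  # 3 1 [4]
```

With the filter applied before reading, only one of the three partitions is touched; without pushdown, all three would be read and filtered afterwards.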
Predicate Pushdown Resource Usage
Reduce the number of partitions scanned by Presto
Ongoing Work
● Native Iceberg IO (Jack Ye, AWS)
● Materialized view (Chunxu Tang, Twitter)
● Iceberg v2 support and row-level delete (Beinan Wang, Alluxio)
Q & A
