ALLUXIO
2019
Enabling Ultra-fast Presto in the Cloud with Alluxio
Haoyuan (H.Y.) Li | Founder & CTO | Alluxio | haoyuan@alluxio.com | alluxio.io/slack
2019-12-11 @ Presto Summit NYC
Outline
• Alluxio Overview: History and its Open Source Community
• Presto Alluxio Stack (PAS) Today: Architecture, Benefit, Production Use Cases
• Alluxio Structured Data Service: Deeper Integration with SQL Engines like Presto
Alluxio Overview
History and Open Source Community
The Alluxio Story
• 2013: Originated as the Tachyon project at UC Berkeley's AMPLab, by then Ph.D. student and now Alluxio CTO, Haoyuan (H.Y.) Li.
• 2015: Open source project established and company founded to commercialize Alluxio.
• 2018-2019: Goal: orchestrate data for analytics and ML in the cloud, for data-driven apps such as big data analytics, ML, and AI.
Open Source Started From UC Berkeley AMPLab
• 1000+ contributors and growing
• 4000+ GitHub stars
• Apache 2.0 licensed
• One of GitHub's Top 100 Most Valuable Repositories, out of 96 million
• Join the conversation on Slack: slackin.alluxio.io
Companies Running Alluxio
[Logo wall grouped by industry: Consumer; Travel & Transportation; Telco & Media; Technology; Financial Services; Retail & Entertainment; Data & Analytics Services]
Four trends driving the need for a new architecture
• Separation of compute and storage
• Hybrid and multi-cloud environments
• Self-service data across the enterprise
• Rise of the object store
Data Ecosystem - Beta and Data Ecosystem 1.0
[Diagram: COMPUTE and STORAGE layers in each ecosystem generation]
Data Ecosystem 1.0 – The Challenges
[Diagram: STORAGE and COMPUTE layers]
• Complex
• Low performance
• Expensive
Data silos across data centers, regions, clouds
[Diagram: data in disparate storage systems (HDFS, NFS, object store, S3, Azure) and compute spread across many different frameworks (Hive, Presto, TensorFlow, Spark), linked over WAN]
Alluxio: an Open Source Data Orchestration System
Data Platform using a Data Orchestration Approach
[Diagram: a data orchestration layer between compute spread across many different frameworks (Hive, Presto, TensorFlow, Spark, any data app) and data in disparate storage systems (HDFS, NFS, S3)]
Presto Alluxio Stack (PAS) Today
Architecture, Benefit, Production Use Cases
Why Presto on Alluxio
§ Distributed Data Orchestration (including caching) on Demand
• Faster: lower query latency
• SLA: more consistent performance
• Efficiency: more concurrency and less data transfer
§ Deeper Presto Alluxio Integration
• New Alluxio catalog service
• New Alluxio transformation service
Now available as Developer Preview in v2.1
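How Presto Works with Alluxio
[Diagram, two configurations]
• Without Alluxio: Presto reads and writes metadata through the Hive Metastore (location=s3://bucket/table) and reads and writes data directly against storage.
• With Alluxio: the storage is mounted to Alluxio, the Hive Metastore records location=alluxio:///table, and Presto reads and writes data through Alluxio.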
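The location change between the two configurations can be sketched in Python. This is an illustrative helper only, not an Alluxio or Presto API: the function name `to_alluxio_location` and the `MOUNTS` table are hypothetical, and the sketch assumes the bucket root `s3://bucket` is mounted at the Alluxio root, so that `s3://bucket/table` maps to `alluxio:///table` as on the slide.

```python
from typing import Dict


def to_alluxio_location(location: str, mounts: Dict[str, str]) -> str:
    """Rewrite a storage URI to the Alluxio path it is mounted under.

    `mounts` maps an under-storage prefix (e.g. "s3://bucket") to an
    Alluxio mount point (e.g. "/"). Both are hypothetical inputs for
    this sketch; a real deployment would read them from its own config.
    """
    # Try the longest prefix first so nested mounts resolve correctly.
    for prefix in sorted(mounts, key=len, reverse=True):
        stripped = prefix.rstrip("/")
        if location == stripped or location.startswith(stripped + "/"):
            suffix = location[len(stripped):].lstrip("/")
            mount_point = mounts[prefix].strip("/")
            path = "/".join(part for part in (mount_point, suffix) if part)
            # An empty authority gives the alluxio:/// form used on the slide.
            return "alluxio:///" + path
    raise ValueError(f"no Alluxio mount covers {location!r}")


# Assumed mount: the bucket root is mounted at the Alluxio root.
MOUNTS = {"s3://bucket": "/"}
print(to_alluxio_location("s3://bucket/table", MOUNTS))
```

Only the table location recorded in the Hive Metastore changes; the data itself stays in the mounted storage.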
How to Use Alluxio in Presto CLI
Create a table on Alluxio:
> CREATE TABLE alluxio_table (id varchar)
  WITH (external_location = 'alluxio:///table');
Read a table from Alluxio:
> SELECT * FROM alluxio_table;
Challenges with running workloads on cloud storage
▪ S3 performance is variable, so consistent query SLAs are hard to achieve
▪ S3 metadata operations are expensive, making workloads run longer
▪ S3 egress costs add up, making the solution expensive
▪ S3 is eventually consistent, making it hard to predict query results
Compute caching for S3 / GCS
Accelerate analytical frameworks on the public cloud
[Diagram: Alluxio colocated with Spark or Presto on the same instance / container]
Challenges with Hybrid Cloud
▪ Accessing data over WAN is too slow
▪ Copying data to the compute cloud is time-consuming and complex
▪ Using another storage system like S3 means expensive application changes
▪ Using S3 via the HDFS connector leads to extremely low performance
HDFS for Hybrid Cloud
Burst big data workloads in hybrid cloud environments
[Diagram: Presto and Alluxio on the same instance / container]
Solution Benefits
▪ Same performance as local
▪ Same end-user experience
▪ 100% of I/O is offloaded
Challenges running Big Data on Object Stores & Alluxio Solution
▪ Object store performance for big data workloads can be very poor
▪ No native support for popular frameworks
▪ Expensive metadata operations reduce performance even more
▪ No direct support for hybrid environments
Transition to Object store
Dramatically speed up big data on object stores on premise
[Diagram: Presto and Alluxio on the same container / machine]
Solution Benefits
▪ Same performance as HDFS
▪ Uses HDFS APIs
▪ Same end-user experience
▪ Storage at a fraction of the cost of HDFS
Use Case | Compute Caching for Cloud
Robolox
▪ Cache hot data in Alluxio, leaving all data in S3
▪ Reduce Presto query times from 10 sec to sub-second
▪ Faster time to provide data scientists with insights
[Diagram: Presto reading from AWS S3 directly, and Presto reading through Alluxio in front of AWS S3]
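Caching hot data while leaving all data in S3 is a read-through caching pattern: repeated reads are served from a fast local tier, and the object store stays the source of truth. A minimal Python sketch of that access pattern follows. It is not Alluxio's implementation or API; the `ReadThroughCache` class, its LRU eviction policy, and the `fetch` callback standing in for a remote S3 read are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Tiny LRU read-through cache (illustrative, not Alluxio code)."""

    def __init__(self, fetch: Callable[[str], bytes], capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._fetch = fetch          # hypothetical stand-in for a remote S3 read
        self._capacity = capacity    # maximum number of cached objects
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, key: str) -> bytes:
        if key in self._entries:
            # Hot data: serve locally and mark as most recently used.
            self._entries.move_to_end(key)
            self.hits += 1
            return self._entries[key]
        # Cold read: go to the backing store, then keep a local copy.
        self.misses += 1
        data = self._fetch(key)
        self._entries[key] = data
        if len(self._entries) > self._capacity:
            # Evict the least recently used entry; the copy in the
            # backing store is untouched.
            self._entries.popitem(last=False)
        return data


# Fake backing store so the sketch runs without any cloud access.
backing_store = {"part-0": b"a", "part-1": b"b", "part-2": b"c"}
cache = ReadThroughCache(fetch=backing_store.__getitem__, capacity=2)
for key in ["part-0", "part-0", "part-1", "part-0", "part-2", "part-1"]:
    cache.read(key)
print(cache.hits, cache.misses)
```

Only the first read of each object pays the remote round trip, which is the effect behind the cold-read versus warm-read latencies reported in these use cases.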
Use Case | On-premise Caching for Presto
NetEase Games
Leading Online Game Company in China
▪ Large query variance during peak hours before Alluxio
▪ Alluxio brings data local to Presto to reduce latency during peak hours
https://www.alluxio.io/blog/presto-on-alluxio-how-netease-games-leveraged-alluxio-to-boost-ad-hoc-sql-on-hdfs/
[Diagram: Presto reading from HDFS directly, and Presto reading through Alluxio in front of HDFS]
Architecture: Colocate Alluxio with Presto
[Chart: query time without and with Alluxio]
• Black/red lines: large query variance without Alluxio
• Green line: stable query time with Alluxio
Use Case | On-premise Satellite Cluster for Presto
JD.com
Leading Online Retailer in China
▪ Presto workers may read remotely from HDFS datanodes, causing large query variance
▪ Data local to Presto accelerates workloads
https://www.slideshare.net/Alluxio/alluxio-in-jd
[Diagram: Presto and Spark on HDFS, without and with Alluxio colocated with Presto]
Architecture: Colocate Alluxio with Presto
Performance Evaluation
• Yellow line: stable query time with Alluxio
• < 1 sec after the first query (cold read)
• Green line: JD Presto without Alluxio, > 10 sec
More Examples
Details:
www.alluxio.io/powered-by-alluxio/
www.alluxio.io/data-orchestration-summit-2019/
Common Use Cases
• Accelerate query performance as cloud storage caching
• On-premise satellite compute clusters across data centers
• Zero-copy burst workloads in hybrid cloud environments
[Diagrams: Alluxio colocated with Spark, Presto, and Hive; a satellite Presto cluster with Alluxio next to the main Hadoop cluster]
Advanced Use Cases
• Enable big data on object stores across single or multiple clouds
• Orchestrate data frameworks on the public cloud
[Diagrams: Spark or Presto with Alluxio in the same data center / region, over any cloud / multi cloud; standalone Alluxio serving Spark, Presto, or Hive on any public / private cloud]
Alluxio Structured Data Service
Deeper Integration with SQL Engines like Presto
Now available as Developer Preview in v2.1
Impedance Mismatch
• Storage Systems: Files/Objects, Directories, Raw Bytes, Cost-efficiency, Durability
• SQL Frameworks: Tables, Schemas, Rows/Columns, Compute-optimized, Computation
Further Expand Benefits!
Benefits of Alluxio Data Orchestration
[Diagram: what Alluxio provides between storage systems and SQL frameworks]
• Caching
• Unified Interface/Namespace
• Schema-Aware Optimizations
• Compute-Optimized Formats
• Physical Data Independence
Alluxio Structured Data Service (from v2.1)
[Diagram components: Presto (Hive Connector, Alluxio Connector); Alluxio Caching Service; Alluxio Catalog Service; Alluxio Transformation Service; Hive Metastore; Storage]
Alluxio Structured Data Service Summary
• Significantly speed up queries!
• Detailed presentation: www.alluxio.io/resources/videos/alluxio-innovations-for-structured-data/
• Try it out!
Next Step
§ Check out more tutorials: https://www.alluxio.io/presto/
§ More videos & slides: https://www.alluxio.io/data-orchestration-summit-2019/
§ Additional reads:
• Starburst Presto + Alluxio = better together
https://www.starburstdata.com/technical-blog/starburst-presto-alluxio-better-together/
• Top 5 performance tips running Presto with Alluxio
https://www.alluxio.io/blog/top-5-performance-tuning-tips-for-running-presto-on-alluxio-1
• Presto + Alluxio + Hive Metastore on your Laptop in 10 min
https://www.alluxio.io/blog/tutorial-presto-alluxio-hive-metastore-on-your-laptop-in-10-min/
• Alluxio Structured Data Service: https://www.alluxio.io/resources/videos/alluxio-innovations-for-structured-data/
Thank you!
Questions?
www.alluxio.io | slackin.alluxio.io | @alluxio | haoyuan@alluxio.com
