Building Fast SQL Analytics on Anything with
Presto,Alluxio
Bin Fan | Founding Engineer @ Alluxio
2019/08/20
Alluxio Overview
Alluxio is Open-Source Data Orchestration
Data Orchestration for the Cloud
Java File API HDFS Interface S3 Interface REST APIPOSIX Interface
HDFS Driver GCS Driver S3 Driver Azure Driver
The Alluxio Story
Originated as Tachyon project, at UC Berkley AMPLab by
Ph.D. student Haoyuan (H.Y.) Li - now Alluxio CTO2013
2015
Open Source project established & company to
commercialize Alluxio founded
Goal: Orchestrate Data at Memory Speed for the Cloud
for data driven apps such as Big Data Analytics, ML and AI.
20192018
2019
Top 10 Big Data
2019
Top 10 Cloud Software
Fast-growing Open Source Community
4000+ Github Stars1000+ Contributors
Join the community on Slack
alluxio.io/slack
Apache 2.0 Licensed
Contribute to source code
github.com/alluxio/alluxio
Consumer Travel &
Transportation
Telco & Media Healthcare
Community Across Industries
Learn more
TechnologyFinancial Services Retail & Entertainment Data & Analytics
Services
Data Locality via Intelligent Multi-tiering
§ Local performance from remote data using multi-tier storage
RAM SSD HDD
Hot Warm Cold
Read & Write Buffering
Transparent to App
Policies for pinning,
promotion/demotion, TTL
9/13/19 7
Spark
Presto
Bash
Tensorflow
Java
~$ cat /mnt/alluxio/myInput
Data Accessibility via popular APIs
> rdd = sc.textFile(“alluxio://master:19998/myInput”)
> CREATE SCHEMA hive.web
> WITH (location = 'alluxio://master:19998/my-table/')
~$ python classify_image.py --model_dir /mnt/fuse/imagenet/
FileSystem fs = FileSystem.Factory.get();
FileInStream in = fs.openFile(new AlluxioURI("/myInput"));
Data Abstraction via Unified Namespace
Enables effective data management across different Under Store
$ ./bin/alluxio fs mount /Data s3://bucket/directory
Use Cases
Typical Use Cases
Cloud Analytics Caching
Get in-memory data access for Spark, Presto,
or any analytics framework on Cloud storage
Hybrid Cloud Analytics
Get in-memory data access for Spark, Presto,
or any analytics framework on Cloud storage
Deployment Approaches
Spark
Alluxio
Storage
Co-locate Alluxio Workers with Spark for
optimal I/O performance
Any Cloud
Same instance
/ container
Spark
Alluxio
Storage
Deploy Alluxio as standalone cluster
between Spark and Storage
Any Cloud
Same data
center / region
Presto
Use Case | On-premise Caching for Presto
HDFS
§ Large query variance during peak hours before
§ Alluxio brings data local to Presto to reduce
the latency during peak hours
NetEase Games
Leading Online Game Company in China
https://www.alluxio.io/blog/presto-on-alluxio-how-netease-
games-leveraged-alluxio-to-boost-ad-hoc-sql-on-hdfs/
Presto
HDFS
Presto
Alluxio
Architecture: Colocate Alluxio with Presto
• Black/Red line – Large Query variance without Alluxio
• Green line - Stable query time with Alluxio
Project:
• Offload HDFS with separate clusters
of Presto and Spark
Problem:
• HDFS cluster is compute and
network bound
• Performance is inconsistent
JD.com |
$70B e-commerce retailer
Hadoop Offload Use Case
Alluxio solution:
• Alluxio offloads the network I/O as
well as the compute
Result:
• Teams can run additional workloads
without taxing the existing HDFS
cluster
3000 Node HDFS
PRESTO
Separate Compute
ALLUXIO
Datacenter
SPARK
3000 Node HDFS
PRESTO
Separate Compute
Datacenter
SPARK
https://www.slideshare.net/Alluxio/alluxio-in-jd
Performance Evaluation
• Yellow line - Stable query time with Alluxio
• < 1sec after first query (cold read)
• Green line – JD Presto without Alluxio : > 10sec
Alluxio
MasterZookeeper
/ RAFT
Standby
Master
Alluxio
Worker
Alluxio
Worker
Alluxio Reference Architecture
…
…
Application
Application
Under Store 1
Under Store 2
Read data in Alluxio, on same node as client
Alluxio
Worker
RAM / SSD / HDD
Memory Speed Read of Data
Application
Alluxio
Client
Alluxio
Master
Read data not in Alluxio
RAM / SSD / HDD
Network / Disk Speed Read of
Data
Application
Alluxio
Client
Alluxio
Master
Alluxio
WorkerUnder Store
Write data only to Alluxio on same node as client
Alluxio
Worker
RAM / SSD / HDD
Memory Speed Write of Data
Application
Alluxio
Client
Alluxio
Master
Write data to Alluxio and Under Store synchronously
RAM / SSD / HDD
Network / Disk Speed Write of
Data
Application
Alluxio
Client
Alluxio
Master
Alluxio
Worker
Under Store
Alluxio 2.0 & Coming in 2.1 Release
§ Alluxio 2.0: Released in July
§ Metadata scales to 1 bln file or more (based on rocksdb)
§ Self-managed Metadata service based on Quorum
§ Async writes, distributed load
§ Many more: https://www.alluxio.io/download/releases/alluxio-2-0-0-release/
§ Alluxio 2.1: Scheduled in Sept
§ A Presto-Alluxio Connector with Iceberg Integration
§ Use Alluxio as a caching layer without modifying HMS
Next steps - Try it out!
• Getting Started
• Try 10 Minutes Alluxio & Presto Tutorial on Laptop
• Try 10 Minutes Alluxio & Presto Tutorial on AWS
• Tops 5 Performance tips running Presto on Alluxio
Questions or Suggestions? Engage with us at alluxio.io/slack!
Questions
Slides will be available at slack channel (https://alluxio.io/slack)

Building Fast SQL Analytics on Anything with Presto, Alluxio

  • 1.
    Building Fast SQLAnalytics on Anything with Presto,Alluxio Bin Fan | Founding Engineer @ Alluxio 2019/08/20
  • 2.
  • 3.
    Alluxio is Open-SourceData Orchestration Data Orchestration for the Cloud Java File API HDFS Interface S3 Interface REST APIPOSIX Interface HDFS Driver GCS Driver S3 Driver Azure Driver
  • 4.
    The Alluxio Story Originatedas Tachyon project, at UC Berkley AMPLab by Ph.D. student Haoyuan (H.Y.) Li - now Alluxio CTO2013 2015 Open Source project established & company to commercialize Alluxio founded Goal: Orchestrate Data at Memory Speed for the Cloud for data driven apps such as Big Data Analytics, ML and AI. 20192018 2019 Top 10 Big Data 2019 Top 10 Cloud Software
  • 5.
    Fast-growing Open SourceCommunity 4000+ Github Stars1000+ Contributors Join the community on Slack alluxio.io/slack Apache 2.0 Licensed Contribute to source code github.com/alluxio/alluxio
  • 6.
    Consumer Travel & Transportation Telco& Media Healthcare Community Across Industries Learn more TechnologyFinancial Services Retail & Entertainment Data & Analytics Services
  • 7.
    Data Locality viaIntelligent Multi-tiering § Local performance from remote data using multi-tier storage RAM SSD HDD Hot Warm Cold Read & Write Buffering Transparent to App Policies for pinning, promotion/demotion, TTL 9/13/19 7
  • 8.
    Spark Presto Bash Tensorflow Java ~$ cat /mnt/alluxio/myInput DataAccessibility via popular APIs > rdd = sc.textFile(“alluxio://master:19998/myInput”) > CREATE SCHEMA hive.web > WITH (location = 'alluxio://master:19998/my-table/') ~$ python classify_image.py --model_dir /mnt/fuse/imagenet/ FileSystem fs = FileSystem.Factory.get(); FileInStream in = fs.openFile(new AlluxioURI("/myInput"));
  • 9.
    Data Abstraction viaUnified Namespace Enables effective data management across different Under Store $ ./bin/alluxio fs mount /Data s3://bucket/directory
  • 10.
  • 11.
    Typical Use Cases CloudAnalytics Caching Get in-memory data access for Spark, Presto, or any analytics framework on Cloud storage Hybrid Cloud Analytics Get in-memory data access for Spark, Presto, or any analytics framework on Cloud storage
  • 12.
    Deployment Approaches Spark Alluxio Storage Co-locate AlluxioWorkers with Spark for optimal I/O performance Any Cloud Same instance / container Spark Alluxio Storage Deploy Alluxio as standalone cluster between Spark and Storage Any Cloud Same data center / region Presto
  • 13.
    Use Case |On-premise Caching for Presto HDFS § Large query variance during peak hours before § Alluxio brings data local to Presto to reduce the latency during peak hours NetEase Games Leading Online Game Company in China https://www.alluxio.io/blog/presto-on-alluxio-how-netease- games-leveraged-alluxio-to-boost-ad-hoc-sql-on-hdfs/ Presto HDFS Presto Alluxio
  • 14.
    Architecture: Colocate Alluxiowith Presto • Black/Red line – Large Query variance without Alluxio • Green line - Stable query time with Alluxio
  • 15.
    Project: • Offload HDFSwith separate clusters of Presto and Spark Problem: • HDFS cluster is compute and network bound • Performance is inconsistent JD.com | $70B e-commerce retailer Hadoop Offload Use Case Alluxio solution: • Alluxio offloads the network I/O as well as the compute Result: • Teams can run additional workloads without taxing the existing HDFS cluster 3000 Node HDFS PRESTO Separate Compute ALLUXIO Datacenter SPARK 3000 Node HDFS PRESTO Separate Compute Datacenter SPARK https://www.slideshare.net/Alluxio/alluxio-in-jd
  • 16.
    Performance Evaluation • Yellowline - Stable query time with Alluxio • < 1sec after first query (cold read) • Green line – JD Presto without Alluxio : > 10sec
  • 17.
    Alluxio MasterZookeeper / RAFT Standby Master Alluxio Worker Alluxio Worker Alluxio ReferenceArchitecture … … Application Application Under Store 1 Under Store 2
  • 18.
    Read data inAlluxio, on same node as client Alluxio Worker RAM / SSD / HDD Memory Speed Read of Data Application Alluxio Client Alluxio Master
  • 19.
    Read data notin Alluxio RAM / SSD / HDD Network / Disk Speed Read of Data Application Alluxio Client Alluxio Master Alluxio WorkerUnder Store
  • 20.
    Write data onlyto Alluxio on same node as client Alluxio Worker RAM / SSD / HDD Memory Speed Write of Data Application Alluxio Client Alluxio Master
  • 21.
    Write data toAlluxio and Under Store synchronously RAM / SSD / HDD Network / Disk Speed Write of Data Application Alluxio Client Alluxio Master Alluxio Worker Under Store
  • 22.
    Alluxio 2.0 &Coming in 2.1 Release § Alluxio 2.0: Released in July § Metadata scales to 1 bln file or more (based on rocksdb) § Self-managed Metadata service based on Quorum § Async writes, distributed load § Many more: https://www.alluxio.io/download/releases/alluxio-2-0-0-release/ § Alluxio 2.1: Scheduled in Sept § A Presto-Alluxio Connector with Iceberg Integration § Use Alluxio as a caching layer without modifying HMS
  • 23.
    Next steps -Try it out! • Getting Started • Try 10 Minutes Alluxio & Presto Tutorial on Laptop • Try 10 Minutes Alluxio & Presto Tutorial on AWS • Tops 5 Performance tips running Presto on Alluxio Questions or Suggestions? Engage with us at alluxio.io/slack!
  • 24.
    Questions Slides will beavailable at slack channel (https://alluxio.io/slack)