2019/08/26 Office Hour
Website | www.alluxio.io
Q&A | https://alluxio.io/slack
Building a Cloud Native Stack with EMR Spark, Alluxio, and S3
Bin Fan, Nakkul Sreenivas
AWS S3: The Default Data Lake on AWS
▪ Why we love it
▪ Cheap
▪ Highly available
▪ Fully managed
▪ Really large scale
▪ Still, limitations and differences:
▪ Slow object listing
▪ Expensive rename
▪ Throughput throttling
▪ Variable performance
▪ No data locality on computation
▪ No user-managed cache
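To make the "expensive rename" point concrete, here is a minimal sketch, assuming a toy in-memory model of a flat key-value object store. The `ToyObjectStore` class and its methods are hypothetical and are not an S3 API. Because such a store has no real directories, renaming a "directory" turns into one copy and one delete for every object under the prefix.

```python
class ToyObjectStore:
    """Hypothetical flat key -> bytes store standing in for an object store."""

    def __init__(self):
        self.objects = {}
        self.ops = 0  # counts per-object copy and delete operations

    def put(self, key, data):
        self.objects[key] = data

    def list_prefix(self, prefix):
        # Listing scans all keys; there is no directory index to consult.
        return sorted(k for k in self.objects if k.startswith(prefix))

    def rename_prefix(self, src, dst):
        # A "directory rename" costs one copy plus one delete per object.
        for key in self.list_prefix(src):
            self.objects[dst + key[len(src):]] = self.objects[key]
            self.ops += 1  # copy
            del self.objects[key]
            self.ops += 1  # delete
        return self.ops


store = ToyObjectStore()
for i in range(1000):
    store.put(f"tmp/job/part-{i:05d}", b"x")

ops = store.rename_prefix("tmp/job/", "out/job/")
print(ops)  # grows linearly with the number of objects under the prefix
```

A file system with a real directory tree can rename the same directory as a single metadata update, which is the gap the "familiar FS semantics" bullet later in the deck refers to.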
Alluxio is Open-Source Data Orchestration
Data Orchestration for the Cloud
Application interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Under store drivers: HDFS Driver, GCS Driver, S3 Driver, Azure Driver
Why put Alluxio in AWS
▪ Provide better or consistent performance
▪ Add a data caching tier to S3: cache Hot data/Metadata
▪ Familiar FS semantics: listing, rename
▪ Keep data local to applications like Spark
▪ Compatible with other existing services like Hadoop, Hive, Presto
▪ Mount multiple data sources into the namespace
▪ Files/Objects in different storage systems: GCS, Azure, HDFS
▪ Objects in other S3 buckets
The Alluxio Story
2013: Originated as the Tachyon project at UC Berkeley AMPLab by Ph.D. student Haoyuan (H.Y.) Li, now Alluxio CTO
2015: Open source project established and company to commercialize Alluxio founded
2018
2019
Goal: Orchestrate data at memory speed for the cloud, for data-driven apps such as big data analytics, ML, and AI.
2019 Top 10 Big Data
2019 Top 10 Cloud Software
Fast-growing Open Source Community
4000+ GitHub Stars
1000+ Contributors
Join the community on Slack
(FAQ for this office hour)
alluxio.io/slack
Apache 2.0 Licensed
Contribute to source code
github.com/alluxio/alluxio
Data Locality via Intelligent Multi-tiering
▪ Local performance from remote data using multi-tier storage
▪ Tiers: RAM (hot), SSD (warm), HDD (cold)
▪ Read & write buffering, transparent to the app
▪ Policies for pinning, promotion/demotion, TTL
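The promotion/demotion idea can be sketched as follows. This is a toy model under stated assumptions, not Alluxio's implementation: the `ToyTieredStore` class, its block-count capacities, and the choice of LRU eviction with promote-on-read are all hypothetical and chosen only to show how hot data drifts toward RAM while cold data sinks toward HDD.

```python
from collections import OrderedDict


class ToyTieredStore:
    """Hypothetical model: ordered tiers, LRU demotion, promote on read."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), fastest tier first
        self.names = [name for name, _ in capacities]
        self.limits = [limit for _, limit in capacities]
        self.tiers = [OrderedDict() for _ in capacities]

    def _insert(self, level, block_id, data):
        tier = self.tiers[level]
        tier[block_id] = data
        tier.move_to_end(block_id)  # most recently used goes last
        if len(tier) > self.limits[level]:
            victim, victim_data = tier.popitem(last=False)  # evict the LRU block
            if level + 1 < len(self.tiers):
                self._insert(level + 1, victim, victim_data)  # demote one tier down
            # otherwise the block leaves the lowest tier entirely

    def _remove(self, block_id):
        for tier in self.tiers:
            if block_id in tier:
                return tier.pop(block_id)
        return None

    def write(self, block_id, data):
        self._remove(block_id)
        self._insert(0, block_id, data)  # new data lands in the top tier

    def read(self, block_id):
        data = self._remove(block_id)
        if data is not None:
            self._insert(0, block_id, data)  # promote back to the top tier
        return data

    def tier_of(self, block_id):
        for name, tier in zip(self.names, self.tiers):
            if block_id in tier:
                return name
        return None


store = ToyTieredStore([("RAM", 2), ("SSD", 2), ("HDD", 2)])
for block in ["a", "b", "c"]:
    store.write(block, block.encode())

before = store.tier_of("a")  # "a" was demoted when "c" arrived
store.read("a")              # reading "a" promotes it and demotes "b"
after = store.tier_of("a")
print(before, after, store.tier_of("b"))
```

Pinning and TTL from the slide are left out to keep the sketch short; pinning would exempt a block from eviction, and TTL would drop it after a deadline.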
Data Accessibility via popular APIs
Spark
> rdd = sc.textFile("alluxio://master:19998/myInput")
Presto
> CREATE SCHEMA hive.web
> WITH (location = 'alluxio://master:19998/my-table/')
Bash
~$ cat /mnt/alluxio/myInput
TensorFlow
~$ python classify_image.py --model_dir /mnt/fuse/imagenet/
Java
FileSystem fs = FileSystem.Factory.get();
FileInStream in = fs.openFile(new AlluxioURI("/myInput"));
Data Abstraction via Unified Namespace
Enables effective data management across different Under Stores
$ ./bin/alluxio fs mount /Data s3://bucket/directory
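One way to picture what the mount command sets up is a table that maps Alluxio paths to under store URIs, resolved by longest matching mount point. The sketch below assumes exactly that and nothing more: `ToyMountTable` is a hypothetical class, and the `hdfs://namenode:8020/alluxio` root under store is an invented example value.

```python
class ToyMountTable:
    """Hypothetical mount table: Alluxio path prefix -> under store URI."""

    def __init__(self, root_ufs):
        self.mounts = {"/": root_ufs.rstrip("/")}

    def mount(self, alluxio_path, ufs_uri):
        point = alluxio_path.rstrip("/") or "/"
        self.mounts[point] = ufs_uri.rstrip("/")

    def resolve(self, path):
        # Pick the longest mount point that matches on whole path components.
        best = "/"
        for point in self.mounts:
            if point == "/":
                continue
            if path == point or path.startswith(point + "/"):
                if len(point) > len(best):
                    best = point
        rest = path[len(best):].lstrip("/")
        base = self.mounts[best]
        return f"{base}/{rest}" if rest else base


table = ToyMountTable("hdfs://namenode:8020/alluxio")  # assumed root under store
table.mount("/Data", "s3://bucket/directory")          # mirrors the slide's mount

print(table.resolve("/Data/2019/events.parquet"))
print(table.resolve("/logs/app.log"))
```

Applications keep using a single `/Data/...` path while the table decides which storage system actually holds the bytes, which is what lets several buckets or storage systems appear in one namespace.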
Typical Alluxio Use Cases
• Cloud Analytics Caching
Get in-memory data access for Spark, Presto,
or any analytics framework on Cloud storage
• Hybrid Cloud Analytics
Get in-memory data access for Spark, Presto,
or any analytics framework on Cloud storage
Deployment Approaches
▪ Co-locate Alluxio Workers with Spark for optimal I/O performance
▪ Spark and Alluxio on the same instance, on top of AWS S3
▪ Deploy Alluxio as standalone cluster between Spark and Storage
▪ Spark and Presto, then Alluxio, then AWS S3, in the same data center / region
Alluxio-EMR Prerequisites and Design Considerations
▪ IAM Account with the default EMR Roles
▪ S3 Bucket to host Bootstrap script and to act as a UFS
▪ Key Pair for EC2
▪ AWS CLI
▪ Leverage AWS Glue/RDS to persist Hive Metastore State
▪ Bootstrap Scripts
Alluxio EMR Service Integration: Bootstrap Actions
▪ EMR provides hooks into the main configuration files for Hadoop services:
▪ hive-site.xml, core-site.xml, hadoop-env.sh, hive.properties
▪ Bootstrap Actions
▪ Up to 16 shell scripts specified by the user
▪ Run before Hadoop service installation
▪ Shutdown actions are offered as well
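As a rough illustration of how a bootstrap action is attached at cluster creation, the sketch below assembles an `aws emr create-cluster` command line without running it. The bucket name, key pair name, cluster name, script path `bootstrap/alluxio-emr.sh`, the single script argument, the release label, and the instance settings are all placeholder assumptions; only the CLI option names come from the AWS CLI.

```python
import shlex


def build_create_cluster_cmd(bucket, key_name, instance_count=3):
    """Assemble (but do not run) an EMR create-cluster command line."""
    # Hypothetical locations inside the user's own bucket.
    bootstrap_script = f"s3://{bucket}/bootstrap/alluxio-emr.sh"
    ufs_root = f"s3://{bucket}/alluxio-ufs/"
    return [
        "aws", "emr", "create-cluster",
        "--name", "alluxio-spark-demo",           # placeholder cluster name
        "--release-label", "emr-5.25.0",          # illustrative release
        "--applications", "Name=Spark", "Name=Hive",
        "--instance-type", "m5.xlarge",           # illustrative instance type
        "--instance-count", str(instance_count),
        "--use-default-roles",                    # the default EMR roles prerequisite
        "--ec2-attributes", f"KeyName={key_name}",
        # The bootstrap action: a script in S3 plus its arguments.
        "--bootstrap-actions", f"Path={bootstrap_script},Args=[{ufs_root}]",
    ]


cmd = build_create_cluster_cmd("my-bucket", "my-key")
print(shlex.join(cmd))
```

Building the argument list first and printing it makes the command easy to review before handing it to a shell or to `subprocess.run`.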
Demo
Alluxio Reference Architecture
▪ Applications
▪ Alluxio Master and Standby Master (Zookeeper / RAFT)
▪ Alluxio Workers
▪ Under Store 1, Under Store 2
Read data in Alluxio, on same node as client
▪ Components: Application with Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD)
▪ Memory Speed Read of Data
Read data not in Alluxio
▪ Components: Application with Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD), Under Store
▪ Network / Disk Speed Read of Data
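The two read paths above can be sketched as a read-through cache. This is a simplified model, assuming a single worker and a dict standing in for the under store; `ToyWorker` and its fields are hypothetical and ignore the master's metadata lookup entirely.

```python
class ToyWorker:
    """Hypothetical worker: serves cached data, falls back to the under store."""

    def __init__(self, under_store):
        self.under_store = under_store  # dict standing in for S3 or HDFS
        self.cache = {}                 # stands in for RAM / SSD / HDD storage
        self.ufs_reads = 0

    def read(self, path):
        if path in self.cache:
            # Data already in Alluxio: served locally.
            return self.cache[path], "cache"
        # Data not in Alluxio: fetch over the network, then keep a copy.
        data = self.under_store[path]
        self.ufs_reads += 1
        self.cache[path] = data
        return data, "under store"


worker = ToyWorker({"/myInput": b"hello"})
first = worker.read("/myInput")
second = worker.read("/myInput")
print(first[1], second[1], worker.ufs_reads)
```

Only the first read pays the under store cost; repeated reads of hot data stay on the worker, which is where the performance consistency claimed earlier in the deck comes from.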
Write data only to Alluxio on same node as client
▪ Components: Application with Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD)
▪ Memory Speed Write of Data
Write data to Alluxio and Under Store synchronously
▪ Components: Application with Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD), Under Store
▪ Network / Disk Speed Write of Data
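The two write paths above trade speed against durability, and a small sketch makes the difference visible. The model below is an assumption-laden toy: `ToyWriter` is hypothetical, and the labels borrow Alluxio's `MUST_CACHE` and `CACHE_THROUGH` write type names, which the slides themselves do not mention.

```python
class ToyWriter:
    """Hypothetical client-side view of the two write paths."""

    def __init__(self):
        self.worker = {}       # stands in for local worker storage
        self.under_store = {}  # stands in for S3 or HDFS

    def write(self, path, data, write_type="MUST_CACHE"):
        if write_type not in ("MUST_CACHE", "CACHE_THROUGH"):
            raise ValueError(f"unsupported write type: {write_type}")
        # Both paths place the data in Alluxio worker storage.
        self.worker[path] = data
        if write_type == "CACHE_THROUGH":
            # Synchronous persist: the write returns only after this step.
            self.under_store[path] = data

    def is_persisted(self, path):
        return path in self.under_store


writer = ToyWriter()
writer.write("/scratch/tmp.bin", b"fast", write_type="MUST_CACHE")
writer.write("/results/out.bin", b"durable", write_type="CACHE_THROUGH")
print(writer.is_persisted("/scratch/tmp.bin"), writer.is_persisted("/results/out.bin"))
```

Cache-only writes suit temporary or reproducible data, since losing the worker loses the data, while synchronous writes pay under store latency in exchange for durability.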
Alluxio 2.0 & Coming in 2.1 Release
▪ Alluxio 2.0: Released in July
▪ Metadata scales to 1 billion files or more (based on RocksDB)
▪ Self-managed Metadata service based on Quorum
▪ Async writes, distributed load
▪ Many more: https://www.alluxio.io/download/releases/alluxio-2-0-0-release/
▪ Alluxio 2.1: Scheduled in Sept
▪ A Presto-Alluxio Connector with Iceberg Integration
▪ Use Alluxio as a caching layer without modifying HMS
Next steps - Try it out!
• Getting Started
• Spark Performance Tuning Tips
• Accelerate Spark and Hive Jobs on AWS S3: Use case from Bazaarvoice
• Spark + Alluxio: Tencent Use Case
Questions or Suggestions? Engage with us at alluxio.io/slack!
Questions
Slides will be available on the Slack channel (https://alluxio.io/slack)
