Accelerate Spark workloads on S3 with Alluxio

Accelerate Spark workloads on S3
Dipti Borkar | Product @ Alluxio

Becoming increasingly popular
▪ HDFS complex and not created for
cloud environments
▪ S3 is very easy to use, simple API and
cost effective
Analytics on the cloud object storage
Big data frameworks on the
public cloud
SparkSparkSparkSpark

Running Spark on S3 – Different Options
#1 Native Apache Spark install on
AWS EC2
+
#2 Bundled into services like
AWS EMR
#3 Managed services

Challenges with Analytics on S3
Big data frameworks on the
public cloud
Challenges with S3
▪ Not built for interactive analytics
▪ Expensive metadata operations like list, rename
▪ Eventual consistency
▪ Performance inconsistent

#2 Data Access options on AWS EMR
Presto Hive
HDFS EMRF
S
Instances
Presto Hive
HDFS EMRF
S

Using Spark with HDFS on AWS EMR
Presto Hive
HDFS EMRF
S
Instances
Presto Hive
HDFS
Manual distcp / No SyncManual distcp / No Sync
EMRF
S

Using EMRFS on AWS EMR
Presto Hive
HDFS EMRF
S
Instances
Presto Hive
HDFS EMRF
S
No data
caching
No data
caching

▪ Provides a data caching layer for Spark
▪ Provides strong consistency for for metadata
operations and faster performance
▪ Provides API compatibility with HDFS & S3
▪ S3 is eventually consistent making it hard to
predict query results
▪ Allows for data outside of S3 to be analyzed
as well
Spark workloads on S3 with Alluxio
Compute caching for Spark on
S3
Accelerate big data
frameworks on the public
cloud
Same instance
/ container
Alluxio
Spark
AlluxioAlluxio
Spark
Alluxio
SparkSpark

Presto Hive
Instances
Metadata &
Data cache
Presto Hive
Metadata &
Data cache
HDFS HDFSEMRF
S
EMRF
S
Compute-driven
Continuous sync
Compute-driven
Continuous sync
Using Alluxio with AWS EMR

Alluxio for Spark
• Data sharing between jobs
• Data resilience during application crashes
• Consolidate memory usage and alleviate GC
issues
10
Alluxio for Spark

In-Memory
Storage
block 1
block 3
In-Memory
Storage
block 1
block 3
block 2
block 4
storage engine &
execution engine
same process
Data Sharing Between Jobs
Inter-process sharing slowed down by network I/O
11
Data sharing between jobs

Data Sharing Between Jobs
block 1
block 3
block 2
block 4
HDFS
disk
block 1
block 3
block 2
block 4 In-Memory
block 1
block 3 block 4
storage &
execution engine
separated
Inter-process sharing can happen at memory speed
12
Data Sharing Between JobsData sharing between jobs

Data Resilience during Crashes
In-Memory Storage
block 1
block 3
block 1
block 3
block 2
block 4
storage engine &
execution engine
same process
Process crash requires network I/O to re-read the data
13
Data Sharing Between JobsData resilience during crashes

Crash
In-Memory Storage
block 1
block 3
block 1
block 3
block 2
block 4
storage engine &
execution engine
same process
14
Data Sharing Between JobsData resilience during crashes

block 1
block 3
block 2
block 4
Crash
storage engine &
execution engine
same process
15
Data Resilience during CrashesData Sharing Between JobsData resilience during crashes

storage &
execution engine
separated
HDFS
disk
block 1
block 3
block 2
block 4 In-Memory
block 1
block 3 block 4
Process crash only needs memory I/O to re-read the data
16

Crash
storage &
execution engine
separated
Process crash only needs memory I/O to re-read the data
HDFS
disk
block 1
block 3
block 2
block 4 In-Memory
block 1
block 3 block 4
17

The Alluxio Story
Originated as Tachyon project, at the UC Berkley’s AMP Lab
by then Ph.D. student & now Alluxio CTO, Haoyuan (H.Y.)
Li.
2014
2015
Open Source project established & company to
commercialize Alluxio founded
Goal: Orchestrate Data at Memory Speed for the
Cloud for data driven apps such as Big Data Analytics, ML and
AI.
2018 20192018

Data Ecosystem - Beta Data Ecosystem 1.0
COMPUTE
STORAGE STORAGE
COMPUTE

Data Orchestration for the Cloud
Java File API HDFS Interface S3 Interface REST APIPOSIX Interface
HDFS Driver Swift Driver S3 Driver NFS Driver
Independent scaling of compute & storage

Data Elasticity
with a unified
namespace
Abstract data silos & storage
systems to independently scale
data on-demand with compute
Run Spark, Hive, Presto, ML
workloads on your data
located anywhere
Accelerate big data
workloads with transparent
tiered local data
Data Accessibility
for popular APIs &
API translation
Data Locality
with Intelligent
Multi-tiering
Alluxio – Key innovations

Data Locality with Intelligent Multi-tiering
Local performance from remote data using multi-tier storage
Hot War
m
Cold
RAM SSD HDD
Read & Write
Buffering
Transparent to App
Policies for pinning,
promotion/demotion, TTL

Data Accessibility via popular APIs and API Translation
Convert from Client-side Interface to native Storage Interface
Java File API HDFS Interface S3 Interface REST APIPOSIX Interface
HDFS Driver Swift DriverS3 Driver NFS Driver

Data Elasticity via Unified Namespace
Enables effective data management across different Under Store
- Uses Mounting with Transparent Naming

Unified Namespace: Global Data Accessibility
Transparent access to understorage makes all enterprise data
available locally
SUPPORTS
• HDFS
• NFS
• OpenStack
• Ceph
• Amazon S3
• Azure
• Google Cloud
IT OPS FRIENDLY
• Storage mounted into Alluxio
by central IT
• Security in Alluxio mirrors
source data
• Authentication through
LDAP/AD
• Wireline encryption
HDFS #1
Object Store
NFS
HDFS #2

AlluxioAlluxioAlluxio
▪ Accessing data over WAN too slow
▪ Copying data to compute cloud time
consuming and complex
▪ Using another storage system like S3
means expensive application changes
▪ Using S3 via HDFS connector leads
to extremely low performance
Challenges with Hybrid Cloud & Alluxio Solution
HDFS for Hybrid
Cloud
Alluxio
Burst big data workloads in
hybrid cloud environments
Same instance
/ container
Solution Benefits
▪ Same performance as local
▪ Same end-user experience
▪ 100% of I/O is offloaded

Challenges with supporting more frameworks & Alluxio Solution
▪ Running new frameworks on existing an
HDFS cluster can dramatically affect
performance of existing workloads
▪ In a disaggregate environment, copying
data to multiple compute clouds time
consuming and error prone
▪ Migrating applications for new storage
systems is complex & time consuming
▪ Storing and managing multiple copies of
the data becomes expensive
Support more
frameworks
Any object store or HDFS
Same data
center / region
Presto
Enable big data on object stores
across single or multiple clouds
or
Spark
Alluxio Alluxio

Spark
Alluxio
Orchestrate data frameworks
on the public cloud
Any public /
private cloud
PrestoHive
Multi-cloud access for your Spark Workloads

Alluxio
MasterZookeeper
/ RAFT
Standby
Master
WA
N
Alluxio
Client
Alluxio
Client
Alluxio
Worker
RAM / SSD / HDD
Alluxio
Worker
RAM / SSD / HDD
Alluxio Reference Architecture
…
…
Applicatio
n
Applicatio
n
Under Store 1
Under Store
2

Demo: Bootstrapping Alluxio with AWS EMR
aws emr create-cluster
--release-label ${RELEASE_LABEL}
--instance-count ${NUM_INSTANCES}
--instance-type ${INSTANCE_TYPE}
--applications Name=Presto Name=Hive Name=Spark
--name "${CLUSTER_NAME}"
--bootstrap-actions
Path=${BOOTSTRAP_PATH},Args=[${ALLUXIO_DOWNLOAD_URL},${ROOT_UFS_
URI}, ${ADDITIONAL_PROPERTIES}]
--configurations
file:///Users/diptiborkar/Downloads/alluxio-emr.json
--ec2-attributes KeyName=${KEY_PAIR}

Bootstrapping Alluxio with AWS EMR
RELEASE_LABEL="emr-5.23.0"
NUM_INSTANCES=5
INSTANCE_TYPE="m4.xlarge"
CLUSTER_NAME="emr-v12"
BOOTSTRAP_PATH="s3://dipti-alx-2019/emr/alluxio-emr.sh"
ALLUXIO_DOWNLOAD_URL="https://downloads.alluxio.io/downloads/fil
es/2.0.0/alluxio-2.0.0-RC3-bin.tar.gz"
ROOT_UFS_URI="s3a://dipti-alx-2019/emr/ufs/"
ADDITIONAL_PROPERTIES="alluxio.underfs.s3.owner.id.to.username.m
apping=${S3_ID}=hadoop;alluxio.user.file.writetype.default=ASYNC
_THROUGH"

Enterprises moving towards independent compute & storage
Learn more

Incredible Open Source Momentum with growing community
1000+ contributors &
growing
4000+ Git Stars
Apache 2.0 Licensed
Hundreds of thousands
of downloads
Join the conversation on
Slack

Questions?
Join the Alluxio Community
http://alluxio.io/ | @alluxio

Accelerate Spark workloads on S3 with Alluxio

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Accelerate Spark workloads on S3 with Alluxio

Similar to Accelerate Spark workloads on S3 with Alluxio (20)

More from Alluxio, Inc.

More from Alluxio, Inc. (20)

Recently uploaded

Recently uploaded (20)

Accelerate Spark workloads on S3 with Alluxio