Getting Started with Alluxio + Spark for Fast Data Analytics
David Zhu @ Alluxio
2021/10/12
Introduction
• David Zhu
• Core Maintainer @ Alluxio
• PhD in CS @ UC Berkeley AMP lab
• Email: david@alluxio.com
• Performance / Job Service / Community
• Find me on Alluxio community slack!
https://alluxio-community.slack.com/
Outline
• Overview
• Running Spark on Alluxio
• Data Locality
• Use Cases
The Alluxio Story
2013: Originated as the Tachyon project at the UC Berkeley AMPLab, created by Ph.D. student Haoyuan (H.Y.) Li, now Alluxio CEO
2015: Open source project established & company founded to commercialize Alluxio
Goal: Orchestrate data at memory speed for the cloud, for data-driven apps such as big data analytics, ML, and AI
[Timeline graphic: 2013–2021]
Fast-growing Open Source Community
5000+ GitHub Stars
1100+ Contributors
Join the community on Slack
alluxio.io/slack
Apache 2.0 Licensed
Contribute to source code
github.com/alluxio/alluxio
COMPANIES USING ALLUXIO
[Logo wall of users across Internet, Public Cloud Providers, E-Commerce, Technology, Financial Services, Telco & Media, and other sectors]
Alluxio is Open-Source Data Orchestration
Data orchestration for the cloud:
• Application interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
• Storage drivers: HDFS, GCS, S3, Azure
Overview
Why Use Alluxio with Spark?
• Improve I/O with better data locality
Improve I/O with Data Locality
Reading data from remote storage: Spark fetches /file1 … /file4 over the network.
Improve I/O with Data Locality
With Alluxio, input data is cached closer to Spark on demand: HDFS keeps blocks 1–4 on disk, while hot files (/file1, /file3) are served from Alluxio's in-memory tier.
Visualizing the Stack
Mem: FAST, 10^4 – 10^5 MB/s — used often
SSD: MODERATE, 10^3 – 10^4 MB/s — used in limited amounts
HDD: SLOW, 10^2 – 10^3 MB/s — used only when necessary
Why Use Alluxio with Spark?
• Improve I/O with better data locality
• Enable data sharing between Spark jobs
Data Sharing Between Jobs
A pipeline of multiple jobs writes intermediate data (/file1, /file2) to external storage.
Inter-process sharing is slowed down by network bandwidth.
Data Sharing Between Jobs
With Alluxio, jobs exchange data by writing it to the closer, faster Alluxio layer: HDFS keeps blocks 1–4 on disk, while /file1 and /file2 stay in memory.
Inter-process sharing can happen at memory speed; see the sketch below.
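A minimal sketch of the pattern (assuming an Alluxio master at master:19998; the paths and filtering logic are illustrative):

// Job 1: write intermediate results to Alluxio instead of remote storage.
val intermediate = sc.textFile("alluxio://master:19998/input/logs")
  .filter(_.contains("ERROR"))
intermediate.saveAsTextFile("alluxio://master:19998/tmp/errors")

// Job 2 (a separate Spark application): read the shared data back,
// served at memory speed from the Alluxio cache.
val errors = sc.textFile("alluxio://master:19998/tmp/errors")
println(errors.count())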
Why Use Alluxio with Spark?
• Improve I/O with better data locality
• Enable data sharing between Spark jobs
• Checkpoint data for resilience
Checkpoint
Without Alluxio, an RDD lives in Spark's in-memory storage: the storage engine and the execution engine share the same process, so a crash takes the cached data down with it.
Recovering from a process crash requires network I/O to re-read the data from remote storage.
Checkpoint
With Alluxio, the storage engine and the execution engine are separated: the checkpointed file (/file) sits in Alluxio's in-memory tier, backed by HDFS blocks 1–4 on disk.
After a process crash, only memory I/O is needed to re-read the data. A sketch of this setup follows.
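A minimal sketch of checkpointing into Alluxio (the checkpoint directory and input path are illustrative):

// Point Spark's checkpoint directory at Alluxio so checkpointed data
// survives a process crash without a round trip to remote storage.
sc.setCheckpointDir("alluxio://master:19998/checkpoints")

val rdd = sc.textFile("alluxio://master:19998/input/data")
  .map(_.toUpperCase)
rdd.checkpoint() // marked for checkpointing; materialized on the next action
rdd.count()      // triggers the computation and the checkpoint write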
Running Spark with Alluxio
API Selection
• Access data directly through the FileSystem API, changing only the scheme to alluxio://
  – Minimal code change
  – No need to change application logic
• Example:
  – val file = sc.textFile("s3a://my-bucket/myFile")
  – val file = sc.textFile("alluxio://master:19998/myFile")
Setup
• Spark works with the Alluxio client out of the box
• Put in spark/conf/spark-defaults.conf:
spark.driver.extraClassPath /<PATH_TO_ALLUXIO>/client/alluxio-2.6.1-client.jar
spark.executor.extraClassPath /<PATH_TO_ALLUXIO>/client/alluxio-2.6.1-client.jar
• More advanced settings:
https://docs.alluxio.io/os/user/stable/en/compute/Spark.html
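The same classpath settings can also be passed per job on the command line (a sketch; the jar path must match your installation, and MyApp / myapp.jar are placeholders):

spark-submit \
  --conf spark.driver.extraClassPath=/<PATH_TO_ALLUXIO>/client/alluxio-2.6.1-client.jar \
  --conf spark.executor.extraClassPath=/<PATH_TO_ALLUXIO>/client/alluxio-2.6.1-client.jar \
  --class MyApp myapp.jar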
Example of Spark RDDs
Writing to Alluxio
rdd.saveAsTextFile("alluxio://master:19998/myPath")
rdd.saveAsObjectFile("alluxio://master:19998/myPath")
Reading from Alluxio
val rdd = sc.textFile("alluxio://master:19998/myPath")
val rdd = sc.objectFile("alluxio://master:19998/myPath")
Example of Spark DataFrames
Writing to Alluxio
df.write.parquet("alluxio://master:19998/myPath")
Reading from Alluxio (note: DataFrame reads go through the SparkSession, not the SparkContext)
val df = spark.read.parquet("alluxio://master:19998/myPath")
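Putting it together, a minimal self-contained job might look like this (a sketch; the master hostname and path are illustrative):

import org.apache.spark.sql.SparkSession

object AlluxioParquetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AlluxioParquetExample")
      .getOrCreate()

    // Write a DataFrame into Alluxio as parquet...
    spark.range(0, 1000).toDF("id")
      .write.parquet("alluxio://master:19998/myPath")

    // ...and read it back, served from the Alluxio cache.
    val cached = spark.read.parquet("alluxio://master:19998/myPath")
    println(cached.count())

    spark.stop()
  }
}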
Spark + Alluxio: Data Locality
DATA LOCALITY WITH SCALE-OUT WORKERS
Local performance for remote data with intelligent multi-tiering:
• Hot / Warm / Cold data tiered across RAM / SSD / HDD
• Read & write buffering, transparent to the app
• Policies for pinning, promotion/demotion, TTL
[Diagram: Model Training, Big Data ETL, and Big Data Query workloads spanning on-premises and public cloud storage]
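These tiers map to the worker's tiered-storage settings; a sketch of an alluxio-site.properties for a two-tier (MEM + SSD) worker, with illustrative paths and quotas:

# Two storage tiers per worker: RAM first, SSD as overflow.
alluxio.worker.tieredstore.levels=2
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
alluxio.worker.tieredstore.level0.dirs.quota=16GB
alluxio.worker.tieredstore.level1.alias=SSD
alluxio.worker.tieredstore.level1.dirs.path=/mnt/ssd
alluxio.worker.tieredstore.level1.dirs.quota=100GB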
METADATA LOCALITY WITH SCALABLE MASTERS
Synchronization of changes across clusters: when the file at a path (e.g. /file1) is mutated in the under store, metadata synchronization lets the Alluxio master replace the old file at that path with the new one.
[Diagram: Alluxio master, backed by RocksDB, synchronizing metadata between on-premises and public cloud clusters running Model Training, Big Data ETL, and Big Data Query over RAM/SSD tiers]
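How aggressively metadata is re-synced with the under store is configurable; for example (a sketch, the interval value is illustrative):

# Re-check the under store for changes at most once per minute;
# -1 never syncs after the first load, 0 syncs on every access.
alluxio.user.file.metadata.sync.interval=1min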
Spark Workflow on Remote Storage (Without Alluxio)
Cluster manager (e.g. YARN/Mesos), Spark Context in the application, Spark executors on worker nodes, data in s3://data/:
1) Run the Spark job
2) The cluster manager allocates executors
3) Worker nodes launch executors and tasks
4) Executors access the remote data and compute
Takeaway: remote data, no locality
Step 1: Schedule Compute to the Data Cache Location
With Alluxio workers on HostA (196.0.0.7, caching block1 and block2) and HostB (196.0.0.8):
(1) The Spark Context, via the Alluxio client, asks the Alluxio masters: where is block1?
(2) The masters answer: block1 is served at [HostA]
(3) The cluster manager allocates the executor on [HostA]
● The Alluxio client implements an HDFS-compatible API that carries block location info
● The Alluxio masters keep track of, and serve, the list of worker nodes that currently hold the cache (see the sketch below)
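Because the client speaks the HDFS-compatible API, those block locations can be inspected with the same standard Hadoop FileSystem calls that Spark's scheduler relies on for locality hints (a sketch; the hostname and file path are illustrative):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Open the Alluxio filesystem through the Hadoop-compatible API.
val fs = FileSystem.get(new URI("alluxio://master:19998/"), new Configuration())
val status = fs.getFileStatus(new Path("/myFile"))

// Each BlockLocation lists the Alluxio workers currently caching that block.
val locations = fs.getFileBlockLocations(status, 0, status.getLen)
locations.foreach(loc => println(loc.getHosts.mkString(", ")))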
Step 4: Find the Local Alluxio Worker for Efficient Data Exchange
● The Spark executor finds its local Alluxio worker (HostA: 196.0.0.7, holding block1) by hostname comparison
● The Spark executor then talks to the local Alluxio worker using either short-circuit access via the local file system (e.g., /mnt/ramdisk/) or a local domain socket (e.g., /opt/domain), instead of going over the network to s3://data/
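Short-circuit access works out of the box when client and worker report the same hostname; domain-socket access is enabled with worker properties along these lines (a sketch; the socket path mirrors the /opt/domain example above):

# Serve co-located clients over a Unix domain socket instead of TCP.
alluxio.worker.data.server.domain.socket.address=/opt/domain
alluxio.worker.data.server.domain.socket.as.uuid=true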
Recap: Spark Architecture with Alluxio
1) Run the Spark job
2.1) The application's Alluxio client asks the Alluxio masters where the data cache is
2.2) The cluster manager (e.g. YARN/Mesos) allocates executors
3) Worker nodes launch executors and tasks
4) Executors access Alluxio for data (via the co-located Alluxio worker) and compute, with s3://data/ as the under store
Step 2 helps Spark schedule compute to the data cache; step 4 lets Spark access the local Alluxio cache
Experiments
Spark 2.2.0 + Alluxio 1.6.0, single worker (Amazon r3.2xlarge)
Comparing reads of cached parquet files from a new Spark context: for a 50 GB DataFrame stored in S3, reading through Alluxio yields a 6x – 8x speedup.
Use Cases
Typical Use Cases
• Hybrid Cloud Analytics / Cloud Analytics Caching: get in-memory data access for Spark, Presto, or any analytics framework on cloud storage
Elastic Model Training: Leading Hedge Fund
[Architecture: SPARK over HDFS on-premises, bursting to SPARK + HDFS in the public cloud]
Challenge – Algorithmic trading at a top data-driven hedge fund, with model training in the cloud for bursty workloads. Data access was slow, driving up compute cost and lowering modeler productivity.
Solution – With Alluxio, data access is 10-30x faster.
Impact – Increased efficiency in training ML algorithms, lower compute cost, and higher modeler productivity, resulting in a 14-day ROI on Alluxio.
Machine Learning Case Study
[Architecture: SPARK over storage, before and after adding Alluxio]
Challenge – Gain an end-to-end view of the business over a large volume of data. Queries were slow / not interactive, resulting in operational inefficiency.
Solution – ETL data from storage into Alluxio.
Impact – Faster time to market: "Now we don't have to work Sundays."
https://dzone.com/articles/Accelerate-In-Memory-Processing-with-Spark-from-Hours-to-Seconds-With-Tachyon
Cloud Analytics
Challenge – Queries were slow / not interactive, resulting in operational inefficiency.
Solution – Use Alluxio as a read cache for queries against S3.
Impact – 6x - 11x query performance and a more scalable analytics infrastructure.
https://www.alluxio.io/blog/accelerate-spark-and-hive-jobs-on-aws-s3-by-10x-with-alluxio-tiered-storage/
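A typical setup mounts the S3 bucket into the Alluxio namespace so queries hit the cache transparently (a sketch; the mount point and bucket name are illustrative, and credentials are configured separately):

# Mount an S3 bucket under the Alluxio namespace...
./bin/alluxio fs mount /s3 s3://my-bucket/data/

# ...then query it through Alluxio from Spark:
# spark.read.parquet("alluxio://master:19998/s3/table")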
Spark + Alluxio @ Boss直聘
https://www.alluxio.io/resources/videos/alluxio-k8s-cloud-native-ai-environment-bosszp-chinese/
Target
● Use Spark to process data
● Train models on top of the processed data
Previous solution
● Spark/Flink + Ceph + model training
Problems
● Writing temporary files into Ceph put high pressure on Ceph
● Read/write pressure on Ceph could not be controlled, making the cluster unstable
Solution with Alluxio
Spark/Flink + Alluxio + Ceph + Alluxio + model training
● Alluxio supports multiple data sources and multiple model training frameworks
● The read/write rate from Alluxio to Ceph can be controlled
● Multiple independent Alluxio clusters support multi-tenancy, customized configuration, and access control
References
- Spark Performance Tuning Tips
- Accelerate Spark and Hive Jobs on AWS S3:
Use case from Bazaarvoice
- Spark + Alluxio: Tencent News Use Case
Why Use Alluxio with Spark?
• Improve I/O with better data locality
• Enable data sharing between Spark jobs
• Checkpoint data for resilience
Talk to us
david@alluxio.com
https://alluxio-community.slack.com/
We are hiring
