Getting Started with Alluxio + Spark for Fast Data Analytics
David Zhu @ Alluxio
2021/10/12
Introduction
• David Zhu
• Core Maintainer @ Alluxio
• PhD in CS @ UC Berkeley AMP lab
• Email: david@alluxio.com
• Performance / Job Service / Community
• Find me on Alluxio community slack!
https://alluxio-community.slack.com/
Outline
• Overview
• Running Spark on Alluxio
• Data Locality
• Use Cases
The Alluxio Story
2013: Originated as the Tachyon project at the UC Berkeley AMPLab, created by Ph.D. student Haoyuan (H.Y.) Li, now Alluxio CEO
2015: Open source project established & company founded to commercialize Alluxio
Goal: Orchestrate data at memory speed for the cloud, for data-driven apps such as big data analytics, ML, and AI
[Timeline graphic: 2013–2021]
Fast-growing Open Source Community
5000+ GitHub Stars
1100+ Contributors
Join the community on Slack
alluxio.io/slack
Apache 2.0 Licensed
Contribute to source code
github.com/alluxio/alluxio
COMPANIES USING ALLUXIO
[Logo wall of users across Internet, Public Cloud Providers, E-Commerce, Technology, Financial Services, Telco & Media, and other sectors]
Alluxio is Open-Source Data Orchestration
Data orchestration for the cloud:
• Application interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
• Storage drivers: HDFS, GCS, S3, Azure
Overview
Why Use Alluxio with Spark?
• Improve I/O with better data locality
Improve I/O with Data Locality
Reading data from remote storage: Spark fetches /file1 … /file4 over the network.
Improve I/O with Data Locality
With Alluxio, input data is cached closer to Spark on demand: HDFS keeps blocks 1–4 on disk, while hot files (/file1, /file3) are served from Alluxio's in-memory tier.
Visualizing the Stack
Mem: FAST, 10^4 – 10^5 MB/s — used often
SSD: MODERATE, 10^3 – 10^4 MB/s — used in limited amounts
HDD: SLOW, 10^2 – 10^3 MB/s — used only when necessary
Why Use Alluxio with Spark?
• Improve I/O with better data locality
• Enable data sharing between Spark jobs
Data Sharing Between Jobs
A pipeline of multiple jobs writes intermediate data (/file1, /file2) to external storage.
Inter-process sharing is slowed down by network bandwidth.
Data Sharing Between Jobs
With Alluxio, jobs exchange data by writing it to the closer, faster Alluxio layer: HDFS keeps blocks 1–4 on disk, while /file1 and /file2 stay in memory.
Inter-process sharing can happen at memory speed; see the sketch below.
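A minimal sketch of the pattern (assuming an Alluxio master at master:19998; the paths and filtering logic are illustrative):

// Job 1: write intermediate results to Alluxio instead of remote storage.
val intermediate = sc.textFile("alluxio://master:19998/input/logs")
  .filter(_.contains("ERROR"))
intermediate.saveAsTextFile("alluxio://master:19998/tmp/errors")

// Job 2 (a separate Spark application): read the shared data back,
// served at memory speed from the Alluxio cache.
val errors = sc.textFile("alluxio://master:19998/tmp/errors")
println(errors.count())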
Why Use Alluxio with Spark?
• Improve I/O with better data locality
• Enable data sharing between Spark jobs
• Checkpoint data for resilience
Checkpoint
Without Alluxio, an RDD lives in Spark's in-memory storage: the storage engine and the execution engine share the same process, so a crash takes the cached data down with it.
Recovering from a process crash requires network I/O to re-read the data from remote storage.
Checkpoint
With Alluxio, the storage engine and the execution engine are separated: the checkpointed file (/file) sits in Alluxio's in-memory tier, backed by HDFS blocks 1–4 on disk.
After a process crash, only memory I/O is needed to re-read the data. A sketch of this setup follows.
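A minimal sketch of checkpointing into Alluxio (the checkpoint directory and input path are illustrative):

// Point Spark's checkpoint directory at Alluxio so checkpointed data
// survives a process crash without a round trip to remote storage.
sc.setCheckpointDir("alluxio://master:19998/checkpoints")

val rdd = sc.textFile("alluxio://master:19998/input/data")
  .map(_.toUpperCase)
rdd.checkpoint() // marked for checkpointing; materialized on the next action
rdd.count()      // triggers the computation and the checkpoint write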
Running Spark with Alluxio
API Selection
• Access data directly through the FileSystem API, changing only the scheme to alluxio://
  – Minimal code change
  – No need to change application logic
• Example:
  – val file = sc.textFile("s3a://my-bucket/myFile")
  – val file = sc.textFile("alluxio://master:19998/myFile")
Setup
• Spark works with the Alluxio client out of the box
• Put in spark/conf/spark-defaults.conf:
spark.driver.extraClassPath /<PATH_TO_ALLUXIO>/client/alluxio-2.6.1-client.jar
spark.executor.extraClassPath /<PATH_TO_ALLUXIO>/client/alluxio-2.6.1-client.jar
• More advanced settings:
https://docs.alluxio.io/os/user/stable/en/compute/Spark.html
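The same classpath settings can also be passed per job on the command line (a sketch; the jar path must match your installation, and MyApp / myapp.jar are placeholders):

spark-submit \
  --conf spark.driver.extraClassPath=/<PATH_TO_ALLUXIO>/client/alluxio-2.6.1-client.jar \
  --conf spark.executor.extraClassPath=/<PATH_TO_ALLUXIO>/client/alluxio-2.6.1-client.jar \
  --class MyApp myapp.jar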
Example of Spark RDDs
Writing to Alluxio
rdd.saveAsTextFile("alluxio://master:19998/myPath")
rdd.saveAsObjectFile("alluxio://master:19998/myPath")
Reading from Alluxio
val rdd = sc.textFile("alluxio://master:19998/myPath")
val rdd = sc.objectFile("alluxio://master:19998/myPath")
Example of Spark DataFrames
Writing to Alluxio
df.write.parquet("alluxio://master:19998/myPath")
Reading from Alluxio (note: DataFrame reads go through the SparkSession, not the SparkContext)
val df = spark.read.parquet("alluxio://master:19998/myPath")
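Putting it together, a minimal self-contained job might look like this (a sketch; the master hostname and path are illustrative):

import org.apache.spark.sql.SparkSession

object AlluxioParquetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AlluxioParquetExample")
      .getOrCreate()

    // Write a DataFrame into Alluxio as parquet...
    spark.range(0, 1000).toDF("id")
      .write.parquet("alluxio://master:19998/myPath")

    // ...and read it back, served from the Alluxio cache.
    val cached = spark.read.parquet("alluxio://master:19998/myPath")
    println(cached.count())

    spark.stop()
  }
}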
Spark + Alluxio: Data Locality
DATA LOCALITY WITH SCALE-OUT WORKERS
Local performance for remote data with intelligent multi-tiering:
• Hot / Warm / Cold data tiered across RAM / SSD / HDD
• Read & write buffering, transparent to the app
• Policies for pinning, promotion/demotion, TTL
[Diagram: Model Training, Big Data ETL, and Big Data Query workloads spanning on-premises and public cloud storage]
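These tiers map to the worker's tiered-storage settings; a sketch of an alluxio-site.properties for a two-tier (MEM + SSD) worker, with illustrative paths and quotas:

# Two storage tiers per worker: RAM first, SSD as overflow.
alluxio.worker.tieredstore.levels=2
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
alluxio.worker.tieredstore.level0.dirs.quota=16GB
alluxio.worker.tieredstore.level1.alias=SSD
alluxio.worker.tieredstore.level1.dirs.path=/mnt/ssd
alluxio.worker.tieredstore.level1.dirs.quota=100GB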
METADATA LOCALITY WITH SCALABLE MASTERS
Synchronization of changes across clusters: when the file at a path (e.g. /file1) is mutated in the under store, metadata synchronization lets the Alluxio master replace the old file at that path with the new one.
[Diagram: Alluxio master, backed by RocksDB, synchronizing metadata between on-premises and public cloud clusters running Model Training, Big Data ETL, and Big Data Query over RAM/SSD tiers]
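How aggressively metadata is re-synced with the under store is configurable; for example (a sketch, the interval value is illustrative):

# Re-check the under store for changes at most once per minute;
# -1 never syncs after the first load, 0 syncs on every access.
alluxio.user.file.metadata.sync.interval=1min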
Spark Workflow on Remote Storage (Without Alluxio)
Cluster manager (e.g. YARN/Mesos), Spark Context in the application, Spark executors on worker nodes, data in s3://data/:
1) Run the Spark job
2) The cluster manager allocates executors
3) Worker nodes launch executors and tasks
4) Executors access the remote data and compute
Takeaway: remote data, no locality
Step 1: Schedule Compute to the Data Cache Location
With Alluxio workers on HostA (196.0.0.7, caching block1 and block2) and HostB (196.0.0.8):
(1) The Spark Context, via the Alluxio client, asks the Alluxio masters: where is block1?
(2) The masters answer: block1 is served at [HostA]
(3) The cluster manager allocates the executor on [HostA]
● The Alluxio client implements an HDFS-compatible API that carries block location info
● The Alluxio masters keep track of, and serve, the list of worker nodes that currently hold the cache (see the sketch below)
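Because the client speaks the HDFS-compatible API, those block locations can be inspected with the same standard Hadoop FileSystem calls that Spark's scheduler relies on for locality hints (a sketch; the hostname and file path are illustrative):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Open the Alluxio filesystem through the Hadoop-compatible API.
val fs = FileSystem.get(new URI("alluxio://master:19998/"), new Configuration())
val status = fs.getFileStatus(new Path("/myFile"))

// Each BlockLocation lists the Alluxio workers currently caching that block.
val locations = fs.getFileBlockLocations(status, 0, status.getLen)
locations.foreach(loc => println(loc.getHosts.mkString(", ")))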
Step 4: Find the Local Alluxio Worker for Efficient Data Exchange
● The Spark executor finds its local Alluxio worker (HostA: 196.0.0.7, holding block1) by hostname comparison
● The Spark executor then talks to the local Alluxio worker using either short-circuit access via the local file system (e.g., /mnt/ramdisk/) or a local domain socket (e.g., /opt/domain), instead of going over the network to s3://data/
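Short-circuit access works out of the box when client and worker report the same hostname; domain-socket access is enabled with worker properties along these lines (a sketch; the socket path mirrors the /opt/domain example above):

# Serve co-located clients over a Unix domain socket instead of TCP.
alluxio.worker.data.server.domain.socket.address=/opt/domain
alluxio.worker.data.server.domain.socket.as.uuid=true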
Recap: Spark Architecture with Alluxio
1) Run the Spark job
2.1) The application's Alluxio client asks the Alluxio masters where the data cache is
2.2) The cluster manager (e.g. YARN/Mesos) allocates executors
3) Worker nodes launch executors and tasks
4) Executors access Alluxio for data (via the co-located Alluxio worker) and compute, with s3://data/ as the under store
Step 2 helps Spark schedule compute to the data cache; step 4 lets Spark access the local Alluxio cache
Experiments
Spark 2.2.0 + Alluxio 1.6.0, single worker (Amazon r3.2xlarge)
Comparing reads of cached parquet files from a new Spark context: for a 50 GB DataFrame stored in S3, reading through Alluxio yields a 6x – 8x speedup.
Use Cases
Typical Use Cases
• Hybrid Cloud Analytics / Cloud Analytics Caching: get in-memory data access for Spark, Presto, or any analytics framework on cloud storage
Elastic Model Training: Leading Hedge Fund
[Architecture: SPARK over HDFS on-premises, bursting to SPARK + HDFS in the public cloud]
Challenge – Algorithmic trading at a top data-driven hedge fund, with model training in the cloud for bursty workloads. Data access was slow, driving up compute cost and lowering modeler productivity.
Solution – With Alluxio, data access is 10-30x faster.
Impact – Increased efficiency in training ML algorithms, lower compute cost, and higher modeler productivity, resulting in a 14-day ROI on Alluxio.
Machine Learning Case Study
[Architecture: SPARK over storage, before and after adding Alluxio]
Challenge – Gain an end-to-end view of the business over a large volume of data. Queries were slow / not interactive, resulting in operational inefficiency.
Solution – ETL data from storage into Alluxio.
Impact – Faster time to market: "Now we don't have to work Sundays."
https://dzone.com/articles/Accelerate-In-Memory-Processing-with-Spark-from-Hours-to-Seconds-With-Tachyon
Cloud Analytics
Challenge – Queries were slow / not interactive, resulting in operational inefficiency.
Solution – Use Alluxio as a read cache for queries against S3.
Impact – 6x - 11x query performance and a more scalable analytics infrastructure.
https://www.alluxio.io/blog/accelerate-spark-and-hive-jobs-on-aws-s3-by-10x-with-alluxio-tiered-storage/
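A typical setup mounts the S3 bucket into the Alluxio namespace so queries hit the cache transparently (a sketch; the mount point and bucket name are illustrative, and credentials are configured separately):

# Mount an S3 bucket under the Alluxio namespace...
./bin/alluxio fs mount /s3 s3://my-bucket/data/

# ...then query it through Alluxio from Spark:
# spark.read.parquet("alluxio://master:19998/s3/table")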
Spark + Alluxio @ Boss直聘
https://www.alluxio.io/resources/videos/alluxio-k8s-cloud-native-ai-environment-bosszp-chinese/
Target
● Use Spark to process data
● Train models on top of the processed data
Previous solution
● Spark/Flink + Ceph + model training
Problems
● Writing temporary files into Ceph put high pressure on Ceph
● Read/write pressure on Ceph could not be controlled, making the cluster unstable
Solution with Alluxio
Spark/Flink + Alluxio + Ceph + Alluxio + model training
● Alluxio supports multiple data sources and multiple model training frameworks
● The read/write rate from Alluxio to Ceph can be controlled
● Multiple independent Alluxio clusters support multi-tenancy, customized configuration, and access control
References
- Spark Performance Tuning Tips
- Accelerate Spark and Hive Jobs on AWS S3:
Use case from Bazaarvoice
- Spark + Alluxio: Tencent News Use Case
Why Use Alluxio with Spark?
• Improve I/O with better data locality
• Enable data sharing between Spark jobs
• Checkpoint data for resilience
Talk to us
david@alluxio.com
https://alluxio-community.slack.com/
We are hiring
