Spark Pipelines in the Cloud with Alluxio by Bin Fan

Spark Pipelines in the
Cloud with Alluxio
Bin Fan, Alluxio, Inc.
BIG DATA DAY LA – Aug 2017

Outline
Alluxio Overview
Alluxio + Spark Use Cases
Using Spark with Alluxio
Performance Evaluation
Demo
1
2
3
4
5
©2017 Alluxio, Inc. All Rights Reserved 2

Data Ecosystem Yesterday
• One Compute
Framework
• Single Storage
System
• Co-located

Data Ecosystem Today
• Many Compute
Frameworks
• Multiple Storage
Systems
• Most not co-located

Data Ecosystem Issues
• Each application
manage multiple data
sources
• Add/Removing data
sources require
application changes
• Storage optimizations
requires application
change
• Lower performance
due to lack of locality

Data Ecosystem with Alluxio
• Apps only talk to
Alluxio
• Simple
Add/Remove
• No App Changes
• Highest
performance in
Memory
• No Lock in
Native File System
Hadoop
Compatible File
System
Native Key-Value
Interface
Fuse Compatible
File System
HDFS Interface
Amazon S3
Interface
Swift Interface
GlusterFS
Interface

Next Gen Analytics with Alluxio
Native File System
Hadoop
Compatible File
System
Native Key-Value
Interface
Fuse Compatible
File System
HDFS Interface
Amazon S3
Interface
Swift Interface
GlusterFS
Interface
Apps, Data & Storage
at Mem Speed
 Big Data/IoT
 AI/ML
 Deep Learning
 Cloud Migration
 Multi Platform
 Autonomous

Fastest Growing Big Data
Open Source Projects
Fastest Growing open-
source project in the big
data ecosystem
Running in large
production clusters
500+ Contributors from
100+ organizations
0
100
200
300
400
500
0 10 20 30 40 45
NumberofContributors
Github Open Source Contributors by Month
Alluxio
Spark
Kafka
Redis
HDFS
Cassandra
Hive

Outline
Alluxio Overview
Demo
1
2
3
4
5

Big Data Case Study –
1 010/11/2017 ©2017 Alluxio, Inc. All Rights Reserved
Challenge –
Gain end to end view of
business with large volume of
data
Queries were slow / not
interactive, resulting in
operational inefficiency
SPARK
TERADATA
SPARK
TERADATA
Solution –
ETL Data from Teradata to
Alluxio
Impact –
Faster Time to Market – “Now
we don’t have to work Sundays”
http://bit.ly/2oMx95W

1110/11/2017 ©2017 Alluxio, Inc. All Rights Reserved
Challenge –
data
SPARK
Baidu File System
SPARK
Baidu File System
Solution –
With Alluxio, data queries are
30X faster
Impact –
Higher operational efficiency
http://bit.ly/2pDHS3O

Challenge –
data for $5B Travel Site
SPARK
HDFS
Solution –
With Alluxio, 300x improvement
in performance
Impact –
Increased revenue from
immediate response to user
behavior
Use case: http://bit.ly/2pDJdrq
CEPH
HDFS CEPH
FLINK SPARK FLINK
©2017 Alluxio, Inc. All Rights Reserved 1 2

Machine Learning Case Study –
1 310/11/2017 ©2017 Alluxio, Inc. All Rights Reserved
Challenge –
Disparate Data both on-prem
and Cloud. Heterogeneous
types of data.
Scaling of Exabyte size data.
Slow due to disk based
approach.
SPARK
HDFS
SPARK
MINIO
Solution –
Using Alluxio to prevent I/O
bottlenecks
Impact –
Orders of magnitude higher
performance than before.
http://bit.ly/2p18ds3
MESOS

Outline
Alluxio Overview
Demo
1
2
3
4
5

Consolidating Memory
Storage Engine &
Execution Engine
Same Process
• Two copies of data in memory – double the memory used
• Inter-process Sharing Slowed Down by Network / Disk I/O
Spark Compute
Spark
Storage
block 1
block 3
HDFS / Amazon S3
block 1
block 3
block 2
block 4
Spark Compute
Spark
Storage
block 1
block 3

Consolidating Memory
Storage Engine &
Execution Engine
Different process
• Half the memory used
• Inter-process Sharing Happens at Memory Speed
Spark Compute
Spark Storage
HDFS / Amazon S3
block 1
block 3
block 2
block 4
HDFS
disk
block 1
block 3
block 2
block 4
Alluxio
block 1
block 3 block 4
Spark Compute
Spark Storage

Data Resilience During Crash
Spark Compute
Spark Storage
block 1
block 3
HDFS / Amazon S3
block 1
block 3
block 2
block 4
Storage Engine &
Execution Engine
Same Process

CRASH
Spark Storage
block 1
block 3
HDFS / Amazon S3
block 1
block 3
block 2
block 4
• Process Crash Requires Network and/or Disk I/O to Re-read Data
Storage Engine &
Execution Engine
Same Process

CRASH
HDFS / Amazon S3
block 1
block 3
block 2
block 4
Storage Engine &
Execution Engine
Same Process
• Process Crash Requires Network and/or Disk I/O to Re-read Data

Spark Compute
Spark Storage
HDFS / Amazon S3
block 1
block 3
block 2
block 4
HDFS
disk
block 1
block 3
block 2
block 4
Alluxio
block 1
block 3 block 4
Storage Engine &
Execution Engine
Different process

• Process Crash – Data is Re-read at Memory Speed
HDFS / Amazon S3
block 1
block 3
block 2
block 4
HDFS
disk
block 1
block 3
block 2
block 4
Alluxio
block 1
block 3 block 4
CRASH Storage Engine &
Execution Engine
Different process

Accessing Alluxio Data From
Spark
Writing Data Write to an Alluxio file
Reading Data Read from an Alluxio file

Code Example for Spark RDDs
Writing RDD to Alluxio
rdd.saveAsTextFile(alluxioPath)
rdd.saveAsObjectFile(alluxioPath)
Reading RDD from Alluxio
rdd = sc.textFile(alluxioPath)
rdd = sc.objectFile(alluxioPath)

Code Example for Spark
DataFrames
Writing to Alluxio df.write.parquet(alluxioPath)
Reading from Alluxio df = sc.read.parquet(alluxioPath)

Outline
Alluxio Overview
Demo
1
2
3
4
5

Experiments
Spark 2.0.0 + Alluxio 1.2.0
Single worker: Amazon r3.2xlarge
Comparisons:
Alluxio
Spark Storage Level: MEMORY_ONLY
Spark Storage Level: MEMORY_ONLY_SER
Spark Storage Level: DISK_ONLY

0
50
100
150
200
250
0 5 10 15 20 25 30 35 40 45 50
Time[seconds]
RDD Size [GB]
Alluxio (textFile) Alluxio (objectFile) DISK_ONLY
MEMORY_ONLY_SER MEMORY_ONLY
Reading Cached RDD

0 100 200 300 400 500 600 700 800
Alluxio
(textFile)
Alluxio
(objectFile)
No Alluxio
Time [seconds]
7x speedup
16x speedup
New Context: Read 50 GB RDD
(S3)

Reading Cached DataFrame
(parquet)
0
50
100
150
200
250
0 5 10 15 20 25 30 35 40 45 50
Time[seconds]
DataFrame Size [GB]
Alluxio (textFile) MEMORY_ONLY_SER MEMORY_ONLY

New Context: Read 50 GB DataFrame
(S3)
0 250 500 750 1000 1250 1500 1750
Alluxio
No Alluxio
Time [seconds]
10x average speedup, 17x peak speedup

Outline
Alluxio Overview
Demo
1
2
3
4
5

Demo Environment
Spark
Alluxio

Conclusion
Easy to use Alluxio with Spark
Predictable and improved performance
Easily connect to various storages

Thank you!
Gene Pang Cheng Chang
gene@alluxio.com cc@alluxio.com
Twitter: @unityxx Twitter: @uronce
Twitter.com/alluxio
Linkedin.com/alluxio
Website
www.alluxio.com
E-mail
info@alluxio.com
@
Social Media

Spark Pipelines in the Cloud with Alluxio by Bin Fan

More Related Content

What's hot

Similar to Spark Pipelines in the Cloud with Alluxio by Bin Fan

More from Data Con LA

Recently uploaded

Spark Pipelines in the Cloud with Alluxio by Bin Fan

Editor's Notes