Best Practices for Using Alluxio with Spark
- 1. Best Practices for Using Alluxio with Spark
Gene Pang, Alluxio, Inc.
Spark Summit EU - October 2017
- 2. About Me
• Gene Pang
• Software engineer @ Alluxio, Inc.
• Alluxio open source PMC member
• Ph.D. from AMPLab @ UC Berkeley
• Worked at Google before UC Berkeley
• Twitter: @unityxx
• Github: @gpang
©2017 Alluxio, Inc. All Rights Reserved
- 5. Data Ecosystem Today
• Many Compute Frameworks
• Multiple Storage Systems
• Most not co-located
- 6. Data Ecosystem Issues
• Each application manages multiple data sources
• Adding/removing data sources requires application changes
• Storage optimizations require application changes
• Lower performance due to lack of locality
- 7. Data Ecosystem with Alluxio
• Apps only talk to Alluxio
• Simple Add/Remove
• No App Changes
• Memory Performance
[Diagram: application-facing APIs (Native File System, Hadoop Compatible File System, Native Key-Value Interface, Fuse Compatible File System); storage-facing interfaces (HDFS Interface, Amazon S3 Interface, Swift Interface, GlusterFS Interface)]
- 8. Next Gen Analytics with Alluxio
[Diagram: application-facing APIs (Native File System, Hadoop Compatible File System, Native Key-Value Interface, Fuse Compatible File System); storage-facing interfaces (HDFS Interface, Amazon S3 Interface, Swift Interface, GlusterFS Interface)]
Apps, Data & Storage at Memory Speed
✓ Big Data/IoT
✓ AI/ML
✓ Deep Learning
✓ Cloud Migration
✓ Multi Platform
✓ Autonomous
- 9. Fastest Growing Big Data Open Source Projects
Fastest growing open-source project in the big data ecosystem
Running in large production clusters
600+ contributors from 100+ organizations
[Chart: GitHub Open Source Contributors by Month; y-axis: Number of Contributors; series: Alluxio, Spark, Kafka, Redis, HDFS, Cassandra, Hive]
- 11. Big Data Case Study –
Challenge –
Gain end to end view of business with large volume of data
Queries were slow / not interactive, resulting in operational inefficiency
Solution –
ETL Data from Teradata to Alluxio
Impact –
Faster Time to Market – “Now we don’t have to work Sundays”
http://bit.ly/2oMx95W
[Diagram: SPARK, TERADATA]
- 12. Big Data Case Study –
Challenge –
Gain end to end view of business with large volume of data
Queries were slow / not interactive, resulting in operational inefficiency
Solution –
With Alluxio, data queries are 30X faster
Impact –
Higher operational efficiency
http://bit.ly/2pDHS3O
[Diagram: SPARK, Baidu File System]
- 13. Big Data Case Study –
Challenge –
Gain end to end view of business with large volume of data for $5B Travel Site
Queries were slow / not interactive, resulting in operational inefficiency
Solution –
With Alluxio, 300x improvement in performance
Impact –
Increased revenue from immediate response to user behavior
Use case: http://bit.ly/2pDJdrq
[Diagram: SPARK, FLINK, HDFS, CEPH]
- 14. Machine Learning Case Study –
Challenge –
Disparate data both on-prem and in the cloud. Heterogeneous types of data.
Scaling of exabyte-size data.
Slow due to disk-based approach.
Solution –
Using Alluxio to prevent I/O bottlenecks
Impact –
Orders of magnitude higher performance than before.
http://bit.ly/2p18ds3
[Diagram: SPARK, MESOS, HDFS, MINIO]
- 16. Alluxio Architecture
[Diagram: Application with Alluxio Client; Alluxio Master; Alluxio Workers; Storage systems]
- 17. Alluxio Client
Applications interact with Alluxio via the Alluxio client
• Java Native Alluxio Filesystem Client
  • Alluxio-specific operations like [un]pin, [un]mount, [un]set TTL
• HDFS-Compatible Filesystem Client
  • No code change necessary
• S3 API
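As a minimal sketch of what "no code change" can look like with the HDFS-compatible client: the application keeps calling the same file APIs, and only the path it is given becomes an alluxio:// URI. The helper name `alluxio_uri`, the host `alluxio-master`, and the HDFS path are hypothetical. Port 19998 is assumed here as the Alluxio master port; a given deployment may use a different one.

```python
def alluxio_uri(path, master_host="alluxio-master", master_port=19998):
    """Build an alluxio:// URI for an absolute path in the Alluxio namespace.

    master_host is a placeholder, and master_port is an assumption about
    the target deployment.
    """
    if not path.startswith("/"):
        raise ValueError(f"expected an absolute path, got {path!r}")
    return f"alluxio://{master_host}:{master_port}{path}"


# The application logic is unchanged; only the URI handed to it differs.
hdfs_path = "hdfs://namenode:8020/data/events"  # hypothetical existing path
alluxio_path = alluxio_uri("/data/events")      # same data, served via Alluxio
print(alluxio_path)
```

Keeping the URI construction in one place means switching an application between direct storage access and Alluxio is a configuration change rather than a code change.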
- 18. Alluxio Master
Master is responsible for managing metadata
• Filesystem namespace metadata
• Blocks / workers metadata
Primary master writes journal for durable operations
• Secondary masters replay journal entries
- 19. Alluxio Worker
Worker is responsible for managing block data
Worker stores block data on various storage media
• HDD, SSD, Memory
Reads and writes data to underlying storage systems
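A toy model of the worker's role, assuming a simple read path: look for the block in the storage media the worker manages, and fall back to the underlying storage system on a miss. The class name, the tier order, and the "cache in memory on miss" policy are illustrative assumptions, not a description of Alluxio's actual block management.

```python
class ToyWorker:
    """Stores block data on several media and talks to an under store."""

    TIERS = ("MEM", "SSD", "HDD")  # searched in this order

    def __init__(self, under_store):
        self.under_store = under_store  # dict standing in for HDFS / S3
        self.tiers = {name: {} for name in self.TIERS}
        self.under_store_reads = 0

    def read_block(self, block_id):
        """Return (data, where_it_was_found)."""
        for name in self.TIERS:
            if block_id in self.tiers[name]:
                return self.tiers[name][block_id], name
        # Miss: read from the underlying storage and keep a copy in memory.
        data = self.under_store[block_id]
        self.under_store_reads += 1
        self.tiers["MEM"][block_id] = data
        return data, "UNDER_STORE"

    def write_block(self, block_id, data, tier="MEM", persist=True):
        """Store a block on one medium, optionally writing it through."""
        self.tiers[tier][block_id] = data
        if persist:
            self.under_store[block_id] = data


worker = ToyWorker(under_store={"block 1": b"abc"})
first = worker.read_block("block 1")   # served from the under store
second = worker.read_block("block 1")  # served from the worker's memory
print(first[1], second[1])
```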
- 21. Sharing Data via Memory
Storage Engine & Execution Engine Same Process
• Two copies of data in memory – double the memory used
• Sharing Slowed Down by Network / Disk I/O
[Diagram: two Spark processes, each with Spark Compute and Spark Storage holding block 1 and block 3; HDFS / Amazon S3: blocks 1 to 4]
- 22. Sharing Data via Memory
Storage Engine & Execution Engine Different process
• Half the memory used
• Sharing Data at Memory Speed
[Diagram: two Spark processes, each with Spark Compute and Spark Storage; Alluxio: block 1, block 3, block 4; HDFS / Amazon S3: blocks 1 to 4; HDFS disk: blocks 1 to 4]
- 23. Data Resilience During Crash
Storage Engine & Execution Engine Same Process
[Diagram: Spark Compute; Spark Storage: block 1, block 3; HDFS / Amazon S3: blocks 1 to 4]
- 24. Data Resilience During Crash
Storage Engine & Execution Engine Same Process
• Process Crash Requires Network and/or Disk I/O to Re-read Data
[Diagram: CRASH; Spark Storage: block 1, block 3; HDFS / Amazon S3: blocks 1 to 4]
- 25. Data Resilience During Crash
Storage Engine & Execution Engine Same Process
• Process Crash Requires Network and/or Disk I/O to Re-read Data
[Diagram: CRASH; HDFS / Amazon S3: blocks 1 to 4]
- 26. Data Resilience During Crash
Storage Engine & Execution Engine Different process
[Diagram: Spark Compute; Spark Storage; Alluxio: block 1, block 3, block 4; HDFS / Amazon S3: blocks 1 to 4; HDFS disk: blocks 1 to 4]
- 27. Data Resilience During Crash
Storage Engine & Execution Engine Different process
• Process Crash – Data is Re-read at Memory Speed
[Diagram: CRASH; Alluxio: block 1, block 3, block 4; HDFS / Amazon S3: blocks 1 to 4; HDFS disk: blocks 1 to 4]
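The difference between the two crash scenarios can be sketched with a toy simulation. It counts how many reads reach the underlying storage after a job restarts, first with caching only inside the job process and then with a separate cache process that outlives the job. All class and function names are hypothetical, and the separate cache is only a stand-in for Alluxio.

```python
class UnderStore:
    """Stands in for HDFS / Amazon S3 and counts reads."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.reads = 0

    def read(self, block_id):
        self.reads += 1
        return self.blocks[block_id]


class SeparateCache:
    """Stands in for a cache running in its own process."""

    def __init__(self, under_store):
        self.under_store = under_store
        self.blocks = {}

    def read(self, block_id):
        if block_id not in self.blocks:
            self.blocks[block_id] = self.under_store.read(block_id)
        return self.blocks[block_id]


class Job:
    """A compute job whose in-process cache is lost when it crashes."""

    def __init__(self, source):
        self.source = source
        self.cache = {}

    def read(self, block_id):
        if block_id not in self.cache:
            self.cache[block_id] = self.source.read(block_id)
        return self.cache[block_id]


def under_store_reads_after_crash(use_separate_cache):
    store = UnderStore({"block 1": b"a", "block 3": b"c"})
    source = SeparateCache(store) if use_separate_cache else store
    job = Job(source)
    for block_id in list(store.blocks):
        job.read(block_id)
    before = store.reads
    job = Job(source)  # simulated crash: the job restarts with an empty cache
    for block_id in list(store.blocks):
        job.read(block_id)
    return store.reads - before


print(under_store_reads_after_crash(use_separate_cache=False))
print(under_store_reads_after_crash(use_separate_cache=True))
```

With the cache in the same process, every block has to be fetched from the underlying storage again after the restart; with a separate cache, the restarted job is served without touching it.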
- 28. Accessing Alluxio Data From Spark
Writing Data: Write to an Alluxio file
Reading Data: Read from an Alluxio file
- 29. Code Example for Spark RDDs
Writing RDD to Alluxio
rdd.saveAsTextFile(alluxioPath)
rdd.saveAsObjectFile(alluxioPath)
Reading RDD from Alluxio
rdd = sc.textFile(alluxioPath)
rdd = sc.objectFile(alluxioPath)
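A slightly fuller PySpark sketch of the same round trip, under some assumptions: the Alluxio client jar is on Spark's classpath, an Alluxio master is reachable at the hypothetical address `alluxio://localhost:19998`, and the output paths do not exist yet. `saveAsObjectFile` and `objectFile` belong to the Scala/Java RDD API; the sketch uses the PySpark counterparts `saveAsPickleFile` and `pickleFile`. The helper names are illustrative, and pyspark is imported lazily inside `main()` so the helpers can be read and tested without a cluster.

```python
ALLUXIO_ROOT = "alluxio://localhost:19998"  # hypothetical master address


def save_and_reload_text(sc, rdd, path):
    """Write an RDD of strings as text files, then read it back."""
    rdd.saveAsTextFile(path)  # fails if the output path already exists
    return sc.textFile(path)


def save_and_reload_objects(sc, rdd, path):
    """Write an RDD of Python objects with pickle, then read it back."""
    rdd.saveAsPickleFile(path)
    return sc.pickleFile(path)


def main():
    # Imported here so the helpers above do not require pyspark to load.
    from pyspark import SparkContext

    sc = SparkContext(appName="alluxio-rdd-sketch")
    try:
        lines = sc.parallelize(["a", "b", "c"])
        reloaded = save_and_reload_text(sc, lines, ALLUXIO_ROOT + "/tmp/rdd_text")
        print(reloaded.count())

        pairs = sc.parallelize([(1, "a"), (2, "b")])
        restored = save_and_reload_objects(sc, pairs, ALLUXIO_ROOT + "/tmp/rdd_pickle")
        print(restored.collect())
    finally:
        sc.stop()
```

Call `main()` from a job submitted to a cluster that can reach Alluxio; the only Alluxio-specific part is the URI scheme in the path.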
- 30. Code Example for Spark DataFrames
Writing to Alluxio
df.write.parquet(alluxioPath)
Reading from Alluxio
df = spark.read.parquet(alluxioPath)
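A minimal PySpark sketch of the DataFrame round trip, assuming a `SparkSession` (available since Spark 2.0), the Alluxio client jar on Spark's classpath, and an Alluxio master at the hypothetical address `alluxio://localhost:19998`. The helper name and the sample data are illustrative. The `overwrite` mode is a choice made here so the sketch can be re-run; the slide does not specify one.

```python
ALLUXIO_ROOT = "alluxio://localhost:19998"  # hypothetical master address


def write_then_read_parquet(spark, df, path):
    """Write a DataFrame to `path` as Parquet and read it back."""
    df.write.mode("overwrite").parquet(path)  # replaces any existing output
    return spark.read.parquet(path)


def main():
    # Imported here so the helper above does not require pyspark to load.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("alluxio-df-sketch").getOrCreate()
    try:
        df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
        reloaded = write_then_read_parquet(spark, df, ALLUXIO_ROOT + "/tmp/df_parquet")
        reloaded.show()
    finally:
        spark.stop()
```

Note that the reader hangs off the session (`spark.read`), not the `SparkContext`.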
- 31. Deploying Alluxio with Spark
[Diagram: Spark, Alluxio, Storage stack, shown twice]
Deploy Alluxio between Spark and Storage
Colocate Alluxio Workers with Spark for optimal I/O performance
- 33. Experiments
Spark 2.2.0 + Alluxio 1.6.0
Single worker: Amazon r3.2xlarge
Compare reading cached parquet files
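A comparison like this can be timed with a small harness. The sketch below is not the benchmark code behind the talk; the function names and the choice of the median over several runs are assumptions. Because Spark evaluates lazily, the timed callable should end in an action such as `count()`.

```python
import statistics
import time


def median_seconds(fn, repeats=5):
    """Run fn() `repeats` times and return the median wall-clock time."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Hypothetical usage with a SparkSession named `spark` and two paths:
#   cached = median_seconds(lambda: spark.read.parquet(alluxio_path).count())
#   direct = median_seconds(lambda: spark.read.parquet(s3_path).count())
#   print(speedup(direct, cached))
```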
- 35. New Context: 50 GB DataFrame (S3)
6x – 8x speedup
- 36. Conclusion
Easy to use Alluxio with Spark
Alluxio enables improved I/O performance
Easily interact with various storage systems with Alluxio