EFFECTIVE SPARK WITH ALLUXIO
Spark Summit EU
Jiri Simsa, Alluxio, Inc.
OUTLINE
• Alluxio Overview
• Alluxio + Spark Use Cases
• Using Alluxio with Spark
• Performance Evaluation
2
3
BIG DATA ECOSYSTEM YESTERDAY
BIG DATA ECOSYSTEM TODAY
…
…
3
…
…
BIG DATA ECOSYSTEM ISSUES
3
BIG DATA ECOSYSTEM WITH ALLUXIO
…
…
FUSE Compatible File SystemHadoop Compatible File System Native Key-Value InterfaceNative File System
GlusterFS InterfaceAmazon S3 Interface Swift InterfaceHDFS Interface
3
BIG DATA ECOSYSTEM WITH ALLUXIO
…
…
FUSE Compatible File SystemHadoop Compatible File System Native Key-Value InterfaceNative File System
Unifying Data at Memory Speed
GlusterFS InterfaceAmazon S3 Interface Swift InterfaceHDFS Interface
3
WHY ALLUXIO
4
Co-located compute and data with memory-speed access to data
Virtualized across different storage systems under a unified namespace
Scale-out architecture
File system API, software only
5
Unification
New workflows across any
data in any storage system
Orders of magnitude
improvement in run time
Choice in compute and
storage – grow each
independently, buy only
what is needed
Performance Flexibility
BENEFITS
6
Popular Open Source Projects’ Growth
Spark Kafka Cassandra HDFS
FASTEST GROWING BIG DATA PROJECTS
6
• Fastest growing
open-source project
in the big data
ecosystem
• 300+ contributors
from 100+
organizations
• Running in large
production clusters
Popular Open Source Projects’ Growth
Alluxio
Spark Kafka Cassandra HDFS
FASTEST GROWING BIG DATA PROJECTS
OUTLINE
• Alluxio Overview
• Alluxio + Spark Use Cases
• Using Alluxio with Spark
• Performance Evaluation
7
ACCELERATE I/O TO/FROM REMOTE STORAGE
Baidu’s PMs and analysts run 

interactive queries to gain insights 

into their products and business
• 200+ nodes deployment
• 2+ petabytes of storage
• Mix of memory + HDD
ALLUXIO
Baidu File System
8
ACCELERATE I/O TO/FROM REMOTE STORAGE
The performance was amazing. With Spark SQL
alone, it took 100-150 seconds to finish a query;
using Alluxio, where data may hit local or
remote Alluxio nodes, it took 10-15 seconds.
- Baidu
RESULTS
• Data queries are now 30x faster with Alluxio
• Alluxio cluster runs stably, providing over 50TB
of RAM space
• By using Alluxio, batch queries usually lasting
over 15 minutes were transformed into an
interactive query taking less than 30 seconds
Baidu’s PMs and analysts run 

interactive queries to gain insights 

into their products and business
• 200+ nodes deployment
• 2+ petabytes of storage
• Mix of memory + HDD
ALLUXIO
Baidu File System
8
SHARE DATA ACROSS JOBS AT MEMORY SPEED
Barclays uses query and machine
learning to train models for risk
management
• 6 node deployment
• 1TB of storage
• Memory only
ALLUXIO
Relational Database
9
SHARE DATA ACROSS JOBS AT MEMORY SPEED
Thanks to Alluxio, we now have the raw data
immediately available at every iteration and
we can skip the costs of loading in terms of time
waiting, network traffic, and RDBMS activity.
- Barclays
RESULTS
• Barclays workflow iteration time decreased
from hours to seconds
• Alluxio enabled workflows that were
impossible before
• By keeping data only in memory, the I/O cost
of loading and storing in Alluxio is now on the
order of seconds
Barclays uses query and machine
learning to train models for risk
management
• 6 node deployment
• 1TB of storage
• Memory only
ALLUXIO
Relational Database
9
MANAGE DATA ACROSS STORAGE SYSTEMS
• 200+ nodes deployment
• 6 billion logs (4.5 TB) daily
• Mix of Memory + HDD
ALLUXIO
Qunar uses real-time machine
learning for their website ads.
10
MANAGE DATA ACROSS STORAGE SYSTEMS
We’ve been running Alluxio in production for
over 9 months, Alluxio’s unified namespace
enable different applications and frameworks
to easily interact with data from different
storage systems.
- Qunar
RESULTS
• Data sharing among Spark Streaming, Spark
batch and Flink jobs provide efficient data
sharing
• Improved the performance of their system with
15x – 300x speedups
• Tiered storage feature manages storage
resources including memory and HDD
• 200+ nodes deployment
• 6 billion logs (4.5 TB) daily
• Mix of Memory + HDD
ALLUXIO
Qunar uses real-time machine
learning for their website ads.
10
OUTLINE
• Alluxio Overview
• Alluxio + Spark Use Cases
• Using Alluxio with Spark
• Performance Evaluation
11
CONSOLIDATING MEMORY
12
Storage Engine &
Execution Engine
Same Process
• Two copies of data in memory – double the memory used
• Inter-process Sharing Slowed Down by Network / Disk I/O
Spark Compute
Spark
Storage
block 1
block 3
HDFS / Amazon S3
block 1
block 3
block 2
block 4
Spark Compute
Spark
Storage
block 1
block 3
CONSOLIDATING MEMORY
13
Storage Engine &
Execution Engine
Different process
• Half the memory used
• Inter-process Sharing Happens at Memory Speed
Spark Compute
Spark Storage
HDFS / Amazon S3
block 1
block 3
block 2
block 4
HDFS
disk
block 1
block 3
block 2
block 4
Alluxio
block 1
block 3 block 4
Spark Compute
Spark Storage
DATA RESILIENCE DURING CRASH
14
Spark Compute
Spark Storage
block 1
block 3
HDFS / Amazon S3
block 1
block 3
block 2
block 4
Storage Engine &
Execution Engine
Same Process
DATA RESILIENCE DURING CRASH
15
CRASH
Spark Storage
block 1
block 3
HDFS / Amazon S3
block 1
block 3
block 2
block 4
• Process Crash Requires Network and/or Disk I/O to Re-read
Data
Storage Engine &
Execution Engine
Same Process
DATA RESILIENCE DURING CRASH
16
CRASH
HDFS / Amazon S3
block 1
block 3
block 2
block 4
Storage Engine &
Execution Engine
Same Process
• Process Crash Requires Network and/or Disk I/O to Re-read
Data
DATA RESILIENCE DURING CRASH
17
Spark Compute
Spark Storage
HDFS / Amazon S3
block 1
block 3
block 2
block 4
HDFS
disk
block 1
block 3
block 2
block 4
Alluxio
block 1
block 3 block 4
Storage Engine &
Execution Engine
Different process
DATA RESILIENCE DURING CRASH
18
• Process Crash – Data is Re-read at Memory Speed
HDFS / Amazon S3
block 1
block 3
block 2
block 4
HDFS
disk
block 1
block 3
block 2
block 4
Alluxio
block 1
block 3 block 4
CRASH Storage Engine &
Execution Engine
Different process
ACCESSING ALLUXIO DATA FROM SPARK
19
Writing Data Write to an Alluxio file
Reading Data Read from an Alluxio file
CODE EXAMPLE FOR SPARK RDDS
20
Writing RDD
to Alluxio
rdd.saveAsTextFile(alluxioPath)
rdd.saveAsObjectFile(alluxioPath)
Reading RDD
from Alluxio
rdd = sc.textFile(alluxioPath)
rdd = sc.objectFile(alluxioPath)
CODE EXAMPLE FOR SPARK DATAFRAMES
21
Writing DataFrame
to Alluxio
df.write.parquet(alluxioPath)
Reading DataFrame
from Alluxio
df = sc.read.parquet(alluxioPath)
OUTLINE
• Alluxio Overview
• Alluxio + Spark Use Cases
• Using Alluxio with Spark
• Performance Evaluation
22
ENVIRONMENT
Spark 2.0.0 + Alluxio 1.2.0
Single r3.2xlarge instance (61GB RAM)
Comparisons:
• Alluxio
• Spark Storage Level: MEMORY_ONLY
• Spark Storage Level: MEMORY_ONLY_SER
• Spark Storage Level: DISK_ONLY
23
Reading Cached RDDTime[seconds]
0
75
150
225
300
RDD Size [GB]
0 13 25 38 50
Alluxio (textFile) Alluxio (objectFile) DISK_ONLY
MEMORY_ONLY_SER MEMORY_ONLY
24
25
New Context: Read 50 GB RDD (SSD)
No Alluxio
Alluxio
(objectFile)
Alluxio
(textFile)
Time [seconds]
0 55 110 165 220
2x speedup
4x speedup
26
New Context: Read 50 GB RDD (S3)
No Alluxio
Alluxio
(objectFile)
Alluxio
(textFile)
Time [seconds]
0 200 400 600 800
7x speedup
16x speedup
27
Reading CACHED DATAFRAME (parquet)Time[seconds]
0
62.5
125
187.5
250
DataFrame Size [GB]
0 13 25 38 50
Alluxio (textFile) DISK_ONLY MEMORY_ONLY_SER
MEMORY_ONLY
28
New Context: Read 50 GB DATAFRAME
(SSD)
No Alluxio
Alluxio
Time [seconds]
0 60 120 180 240
2.5x speedup
29
New CONTEXT: Read 50 GB DataFrame
(S3)
No Alluxio
Alluxio
Time [seconds]
0 250 500 750 1000 1250 1500 1750
10x speedup (avg), 17x speedup (peak)
CONCLUSION
30
• Easy to use with Spark
• Better Performance
• Predictable Performance
Contact: jiri@alluxio.com
Twitter: @jsimsa
Websites: www.alluxio.com and www.alluxio.org
Thank you!
31

Spark Summit EU talk by Jiri Simsa

  • 1.
    EFFECTIVE SPARK WITHALLUXIO Spark Summit EU Jiri Simsa, Alluxio, Inc.
  • 2.
    OUTLINE • Alluxio Overview •Alluxio + Spark Use Cases • Using Alluxio with Spark • Performance Evaluation 2
  • 3.
  • 4.
    BIG DATA ECOSYSTEMTODAY … … 3
  • 5.
  • 6.
    BIG DATA ECOSYSTEMWITH ALLUXIO … … FUSE Compatible File SystemHadoop Compatible File System Native Key-Value InterfaceNative File System GlusterFS InterfaceAmazon S3 Interface Swift InterfaceHDFS Interface 3
  • 7.
    BIG DATA ECOSYSTEMWITH ALLUXIO … … FUSE Compatible File SystemHadoop Compatible File System Native Key-Value InterfaceNative File System Unifying Data at Memory Speed GlusterFS InterfaceAmazon S3 Interface Swift InterfaceHDFS Interface 3
  • 8.
    WHY ALLUXIO 4 Co-located computeand data with memory-speed access to data Virtualized across different storage systems under a unified namespace Scale-out architecture File system API, software only
  • 9.
    5 Unification New workflows acrossany data in any storage system Orders of magnitude improvement in run time Choice in compute and storage – grow each independently, buy only what is needed Performance Flexibility BENEFITS
  • 10.
    6 Popular Open SourceProjects’ Growth Spark Kafka Cassandra HDFS FASTEST GROWING BIG DATA PROJECTS
  • 11.
    6 • Fastest growing open-sourceproject in the big data ecosystem • 300+ contributors from 100+ organizations • Running in large production clusters Popular Open Source Projects’ Growth Alluxio Spark Kafka Cassandra HDFS FASTEST GROWING BIG DATA PROJECTS
  • 12.
    OUTLINE • Alluxio Overview •Alluxio + Spark Use Cases • Using Alluxio with Spark • Performance Evaluation 7
  • 13.
    ACCELERATE I/O TO/FROMREMOTE STORAGE Baidu’s PMs and analysts run 
 interactive queries to gain insights 
 into their products and business • 200+ nodes deployment • 2+ petabytes of storage • Mix of memory + HDD ALLUXIO Baidu File System 8
  • 14.
    ACCELERATE I/O TO/FROMREMOTE STORAGE The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds. - Baidu RESULTS • Data queries are now 30x faster with Alluxio • Alluxio cluster runs stably, providing over 50TB of RAM space • By using Alluxio, batch queries usually lasting over 15 minutes were transformed into an interactive query taking less than 30 seconds Baidu’s PMs and analysts run 
 interactive queries to gain insights 
 into their products and business • 200+ nodes deployment • 2+ petabytes of storage • Mix of memory + HDD ALLUXIO Baidu File System 8
  • 15.
    SHARE DATA ACROSSJOBS AT MEMORY SPEED Barclays uses query and machine learning to train models for risk management • 6 node deployment • 1TB of storage • Memory only ALLUXIO Relational Database 9
  • 16.
    SHARE DATA ACROSSJOBS AT MEMORY SPEED Thanks to Alluxio, we now have the raw data immediately available at every iteration and we can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity. - Barclays RESULTS • Barclays workflow iteration time decreased from hours to seconds • Alluxio enabled workflows that were impossible before • By keeping data only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds Barclays uses query and machine learning to train models for risk management • 6 node deployment • 1TB of storage • Memory only ALLUXIO Relational Database 9
  • 17.
    MANAGE DATA ACROSSSTORAGE SYSTEMS • 200+ nodes deployment • 6 billion logs (4.5 TB) daily • Mix of Memory + HDD ALLUXIO Qunar uses real-time machine learning for their website ads. 10
  • 18.
    MANAGE DATA ACROSSSTORAGE SYSTEMS We’ve been running Alluxio in production for over 9 months, Alluxio’s unified namespace enable different applications and frameworks to easily interact with data from different storage systems. - Qunar RESULTS • Data sharing among Spark Streaming, Spark batch and Flink jobs provide efficient data sharing • Improved the performance of their system with 15x – 300x speedups • Tiered storage feature manages storage resources including memory and HDD • 200+ nodes deployment • 6 billion logs (4.5 TB) daily • Mix of Memory + HDD ALLUXIO Qunar uses real-time machine learning for their website ads. 10
  • 19.
    OUTLINE • Alluxio Overview •Alluxio + Spark Use Cases • Using Alluxio with Spark • Performance Evaluation 11
  • 20.
    CONSOLIDATING MEMORY 12 Storage Engine& Execution Engine Same Process • Two copies of data in memory – double the memory used • Inter-process Sharing Slowed Down by Network / Disk I/O Spark Compute Spark Storage block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 Spark Compute Spark Storage block 1 block 3
  • 21.
    CONSOLIDATING MEMORY 13 Storage Engine& Execution Engine Different process • Half the memory used • Inter-process Sharing Happens at Memory Speed Spark Compute Spark Storage HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block 1 block 3 block 2 block 4 Alluxio block 1 block 3 block 4 Spark Compute Spark Storage
  • 22.
    DATA RESILIENCE DURINGCRASH 14 Spark Compute Spark Storage block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 Storage Engine & Execution Engine Same Process
  • 23.
    DATA RESILIENCE DURINGCRASH 15 CRASH Spark Storage block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 • Process Crash Requires Network and/or Disk I/O to Re-read Data Storage Engine & Execution Engine Same Process
  • 24.
    DATA RESILIENCE DURINGCRASH 16 CRASH HDFS / Amazon S3 block 1 block 3 block 2 block 4 Storage Engine & Execution Engine Same Process • Process Crash Requires Network and/or Disk I/O to Re-read Data
  • 25.
    DATA RESILIENCE DURINGCRASH 17 Spark Compute Spark Storage HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block 1 block 3 block 2 block 4 Alluxio block 1 block 3 block 4 Storage Engine & Execution Engine Different process
  • 26.
    DATA RESILIENCE DURINGCRASH 18 • Process Crash – Data is Re-read at Memory Speed HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block 1 block 3 block 2 block 4 Alluxio block 1 block 3 block 4 CRASH Storage Engine & Execution Engine Different process
  • 27.
    ACCESSING ALLUXIO DATAFROM SPARK 19 Writing Data Write to an Alluxio file Reading Data Read from an Alluxio file
  • 28.
    CODE EXAMPLE FORSPARK RDDS 20 Writing RDD to Alluxio rdd.saveAsTextFile(alluxioPath) rdd.saveAsObjectFile(alluxioPath) Reading RDD from Alluxio rdd = sc.textFile(alluxioPath) rdd = sc.objectFile(alluxioPath)
  • 29.
    CODE EXAMPLE FORSPARK DATAFRAMES 21 Writing DataFrame to Alluxio df.write.parquet(alluxioPath) Reading DataFrame from Alluxio df = sc.read.parquet(alluxioPath)
  • 30.
    OUTLINE • Alluxio Overview •Alluxio + Spark Use Cases • Using Alluxio with Spark • Performance Evaluation 22
  • 31.
    ENVIRONMENT Spark 2.0.0 +Alluxio 1.2.0 Single r3.2xlarge instance (61GB RAM) Comparisons: • Alluxio • Spark Storage Level: MEMORY_ONLY • Spark Storage Level: MEMORY_ONLY_SER • Spark Storage Level: DISK_ONLY 23
  • 32.
    Reading Cached RDDTime[seconds] 0 75 150 225 300 RDDSize [GB] 0 13 25 38 50 Alluxio (textFile) Alluxio (objectFile) DISK_ONLY MEMORY_ONLY_SER MEMORY_ONLY 24
  • 33.
    25 New Context: Read50 GB RDD (SSD) No Alluxio Alluxio (objectFile) Alluxio (textFile) Time [seconds] 0 55 110 165 220 2x speedup 4x speedup
  • 34.
    26 New Context: Read50 GB RDD (S3) No Alluxio Alluxio (objectFile) Alluxio (textFile) Time [seconds] 0 200 400 600 800 7x speedup 16x speedup
  • 35.
    27 Reading CACHED DATAFRAME(parquet)Time[seconds] 0 62.5 125 187.5 250 DataFrame Size [GB] 0 13 25 38 50 Alluxio (textFile) DISK_ONLY MEMORY_ONLY_SER MEMORY_ONLY
  • 36.
    28 New Context: Read50 GB DATAFRAME (SSD) No Alluxio Alluxio Time [seconds] 0 60 120 180 240 2.5x speedup
  • 37.
    29 New CONTEXT: Read50 GB DataFrame (S3) No Alluxio Alluxio Time [seconds] 0 250 500 750 1000 1250 1500 1750 10x speedup (avg), 17x speedup (peak)
  • 38.
    CONCLUSION 30 • Easy touse with Spark • Better Performance • Predictable Performance
  • 39.
    Contact: jiri@alluxio.com Twitter: @jsimsa Websites:www.alluxio.com and www.alluxio.org Thank you! 31