2. Becoming increasingly popular
▪ HDFS is complex and was not created for cloud environments
▪ S3 is easy to use, has a simple API, and is cost effective
Analytics on cloud object storage
Big data frameworks on the public cloud: Spark
3. Running Spark on S3 – Different Options
#1 Native Apache Spark install on AWS EC2
#2 Bundled into services like AWS EMR
#3 Managed services
4. Challenges with Analytics on S3
Big data frameworks on the public cloud: Spark
Challenges with S3
▪ Not built for interactive analytics
▪ Expensive metadata operations like list, rename
▪ Eventual consistency
▪ Inconsistent performance
5. #2 Data Access options on AWS EMR
Diagram: EMR instances, each with Presto and Hive over HDFS and EMRFS
6. Using Spark with HDFS on AWS EMR
Diagram: EMR instances, each with Presto and Hive over HDFS and EMRFS
Manual distcp / No sync
7. Using EMRFS on AWS EMR
Diagram: EMR instances, each with Presto and Hive over HDFS and EMRFS
No data caching
8. ▪ Provides a data caching layer for Spark
▪ Provides strong consistency for metadata operations and faster performance
▪ Provides API compatibility with HDFS & S3
▪ S3 is eventually consistent, making it hard to predict query results
▪ Allows data outside of S3 to be analyzed as well
Spark workloads on S3 with Alluxio
Compute caching for Spark on S3
Accelerate big data frameworks on the public cloud
Diagram: Alluxio and Spark on the same instance / container
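The caching-layer idea on this slide can be illustrated with a minimal read-through cache. This is a sketch under stated assumptions, not how Alluxio is implemented: `SlowObjectStore` and `ReadThroughCache` are hypothetical names, and the store is a plain in-memory dictionary standing in for S3.

```python
class SlowObjectStore:
    """Stand-in for a remote object store; counts the reads that reach it."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.remote_reads = 0

    def get(self, key):
        self.remote_reads += 1
        return self._objects[key]


class ReadThroughCache:
    """Serves repeated reads locally and goes to the store only on a miss."""

    def __init__(self, store):
        self._store = store
        self._cache = {}

    def get(self, key):
        if key not in self._cache:
            # Cache miss: fetch once from the remote store and keep a copy.
            self._cache[key] = self._store.get(key)
        return self._cache[key]


# Hypothetical object key, used only for illustration.
store = SlowObjectStore({"s3://bucket/part-0": b"a,b,c"})
cache = ReadThroughCache(store)
first = cache.get("s3://bucket/part-0")   # miss: one remote read
second = cache.get("s3://bucket/part-0")  # hit: served from the cache
```

Only the first read pays the remote cost; later reads of the same key never leave the compute side.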
9. Using Alluxio with AWS EMR
Diagram: EMR instances, each with Presto and Hive over a metadata & data cache, backed by HDFS and EMRFS
Compute-driven / Continuous sync
10. Alluxio for Spark
• Data sharing between jobs
• Data resilience during application crashes
• Consolidate memory usage and alleviate GC issues
11. Data Sharing Between Jobs
Diagram: storage engine & execution engine in the same process, each with its own in-memory storage of data blocks
Inter-process sharing slowed down by network I/O
12. Data Sharing Between Jobs
Diagram: storage & execution engine separated; blocks 1-4 on HDFS / disk, with blocks also held in a shared in-memory layer
Inter-process sharing can happen at memory speed
13. Data Resilience during Crashes
Diagram: storage engine & execution engine in the same process, with data blocks held in in-memory storage
Process crash requires network I/O to re-read the data
14. Data Resilience during Crashes
Diagram: the process crashes; storage engine & execution engine are in the same process, with data blocks held in in-memory storage
Process crash requires network I/O to re-read the data
15. Data Resilience during Crashes
Diagram: after the crash, the in-memory storage is lost along with the process (storage engine & execution engine in the same process)
Process crash requires network I/O to re-read the data
16. Data Resilience during Crashes
Diagram: storage & execution engine separated; blocks 1-4 on HDFS / disk, with blocks also held in in-memory storage outside the compute process
Process crash only needs memory I/O to re-read the data
17. Data Resilience during Crashes
Diagram: the process crashes; storage & execution engine are separated, so the in-memory blocks remain available
Process crash only needs memory I/O to re-read the data
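The difference between keeping cached blocks inside the job process and keeping them in a separate layer can be sketched as follows. This is an illustrative model only: `UnderStore`, `BlockCache`, `Job`, and `network_reads_after_restart` are hypothetical names, and the "crash" is simulated by discarding the first job and starting a second one.

```python
class UnderStore:
    """Stand-in for remote storage; every read counts as network I/O."""

    def __init__(self, blocks):
        self._blocks = dict(blocks)
        self.network_reads = 0

    def read(self, block_id):
        self.network_reads += 1
        return self._blocks[block_id]


class BlockCache:
    """In-memory block cache that falls back to the under store on a miss."""

    def __init__(self, under_store):
        self._under = under_store
        self._blocks = {}

    def read(self, block_id):
        if block_id not in self._blocks:
            self._blocks[block_id] = self._under.read(block_id)
        return self._blocks[block_id]


class Job:
    """A job that reads blocks through whatever cache it is given."""

    def __init__(self, cache):
        self._cache = cache

    def run(self, block_ids):
        return [self._cache.read(b) for b in block_ids]


def network_reads_after_restart(shared_cache):
    under = UnderStore({1: b"b1", 2: b"b2", 3: b"b3", 4: b"b4"})
    external = BlockCache(under)  # lives outside any job

    # First job: uses the external cache, or a private one that dies with it.
    Job(external if shared_cache else BlockCache(under)).run([1, 3])

    # Simulated crash: the first job and any private cache are gone.
    # The restarted job reads the same blocks again.
    Job(external if shared_cache else BlockCache(under)).run([1, 3])
    return under.network_reads


same_process = network_reads_after_restart(shared_cache=False)
separated = network_reads_after_restart(shared_cache=True)
```

With a private cache the restarted job goes back over the network for every block; with the separated cache the blocks are fetched once and reused, which is also what lets two different jobs share data.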
18. The Alluxio Story
Originated as the Tachyon project at UC Berkeley's AMPLab, by then Ph.D. student & now Alluxio CTO, Haoyuan (H.Y.) Li.
2014
2015
Open source project established & company to commercialize Alluxio founded
Goal: Orchestrate data at memory speed for the cloud, for data-driven apps such as big data analytics, ML and AI.
2018
2019
19. Data Ecosystem - Beta / Data Ecosystem 1.0
Diagram: each ecosystem shown as a COMPUTE layer and a STORAGE layer
20. Data Orchestration for the Cloud
Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
HDFS Driver, Swift Driver, S3 Driver, NFS Driver
Independent scaling of compute & storage
21. Alluxio – Key innovations
▪ Data Locality with Intelligent Multi-tiering: Accelerate big data workloads with transparent tiered local data
▪ Data Accessibility for popular APIs & API translation: Run Spark, Hive, Presto, ML workloads on your data located anywhere
▪ Data Elasticity with a unified namespace: Abstract data silos & storage systems to independently scale data on-demand with compute
22. Data Locality with Intelligent Multi-tiering
Local performance from remote data using multi-tier storage
Hot (RAM), Warm (SSD), Cold (HDD)
Read & write buffering, transparent to app
Policies for pinning, promotion/demotion, TTL
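The promotion/demotion policy named above can be sketched with a small tiered store. This is a simplified illustration, not Alluxio's actual policy engine: `TieredStore` is a hypothetical class, capacity is counted in blocks, least-recently-used blocks are demoted, and pinning and TTL are left out.

```python
from collections import OrderedDict


class TieredStore:
    """Blocks enter the fastest tier; overflow is demoted tier by tier."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), fastest tier first.
        self._tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def put(self, block_id, data):
        for _, _, blocks in self._tiers:
            blocks.pop(block_id, None)  # drop any stale copy
        self._insert(0, block_id, data)

    def get(self, block_id):
        for _, _, blocks in self._tiers:
            if block_id in blocks:
                data = blocks.pop(block_id)
                self._insert(0, block_id, data)  # promote on access
                return data
        return None

    def tier_of(self, block_id):
        for name, _, blocks in self._tiers:
            if block_id in blocks:
                return name
        return None

    def _insert(self, level, block_id, data):
        if level >= len(self._tiers):
            return  # fell off the coldest tier: evicted
        _, cap, blocks = self._tiers[level]
        blocks[block_id] = data
        if len(blocks) > cap:
            # Demote the least recently used block to the next tier down.
            victim, victim_data = blocks.popitem(last=False)
            self._insert(level + 1, victim, victim_data)


store = TieredStore([("RAM", 1), ("SSD", 1), ("HDD", 1)])
for block in ("a", "b", "c"):
    store.put(block, block.encode())
before = {b: store.tier_of(b) for b in ("a", "b", "c")}
store.get("a")  # reading the cold block promotes it back to RAM
after = {b: store.tier_of(b) for b in ("a", "b", "c")}
```

The application only calls `put` and `get`; which tier a block sits in is decided by the store, which is the sense in which tiering is transparent to the app.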
23. Data Accessibility via popular APIs and API Translation
Convert from Client-side Interface to native Storage Interface
Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
HDFS Driver, Swift Driver, S3 Driver, NFS Driver
24. Data Elasticity via Unified Namespace
Enables effective data management across different under stores
- Uses mounting with transparent naming
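Mounting with transparent naming can be pictured as a mount table that maps one logical path space onto several under stores. The sketch below is an assumption about the general technique, not Alluxio's code: `MountTable` is a hypothetical class, the URIs are made up for illustration, and resolution uses a simple longest-prefix match.

```python
class MountTable:
    """Maps logical paths in one namespace to URIs in mounted under stores."""

    def __init__(self):
        self._mounts = {}

    def mount(self, logical_path, under_uri):
        key = logical_path.rstrip("/") or "/"
        self._mounts[key] = under_uri.rstrip("/")

    def resolve(self, path):
        best = None
        for mount_point in self._mounts:
            prefix = "/" if mount_point == "/" else mount_point + "/"
            if path == mount_point or path.startswith(prefix):
                # Prefer the most specific (longest) matching mount point.
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount covers {path}")
        rest = path[len(best):].lstrip("/")
        base = self._mounts[best]
        return f"{base}/{rest}" if rest else base


# Hypothetical mounts, for illustration only.
table = MountTable()
table.mount("/", "hdfs://namenode:8020/data")
table.mount("/logs", "s3://my-bucket/logs")

s3_uri = table.resolve("/logs/2019/a.txt")
hdfs_uri = table.resolve("/tables/t1")
```

Applications keep using the logical paths; moving `/logs` to a different under store would only change the mount, not the paths the jobs read.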
25. Unified Namespace: Global Data Accessibility
Transparent access to under storage makes all enterprise data available locally
SUPPORTS
• HDFS
• NFS
• OpenStack
• Ceph
• Amazon S3
• Azure
• Google Cloud
IT OPS FRIENDLY
• Storage mounted into Alluxio by central IT
• Security in Alluxio mirrors source data
• Authentication through LDAP/AD
• Wireline encryption
Diagram: HDFS #1, Object Store, NFS and HDFS #2 mounted into one namespace
26. Challenges with Hybrid Cloud & Alluxio Solution
▪ Accessing data over WAN is too slow
▪ Copying data to the compute cloud is time consuming and complex
▪ Using another storage system like S3 means expensive application changes
▪ Using S3 via the HDFS connector leads to extremely low performance
HDFS for Hybrid Cloud
Burst big data workloads in hybrid cloud environments
Diagram: Spark and Alluxio on the same instance / container
Solution Benefits
▪ Same performance as local
▪ Same end-user experience
▪ 100% of I/O is offloaded
27. Challenges with supporting more frameworks & Alluxio Solution
▪ Running new frameworks on an existing HDFS cluster can dramatically affect performance of existing workloads
▪ In a disaggregated environment, copying data to multiple compute clouds is time consuming and error prone
▪ Migrating applications to new storage systems is complex & time consuming
▪ Storing and managing multiple copies of the data becomes expensive
Support more frameworks
Enable big data on object stores across single or multiple clouds
Diagram: Presto or Spark with Alluxio, in the same data center / region as any object store or HDFS
33. Incredible Open Source Momentum with growing community
1000+ contributors & growing
4000+ Git Stars
Apache 2.0 Licensed
Hundreds of thousands of downloads
Join the conversation on Slack