- Alluxio provides a data caching layer for analytics frameworks like Spark running on AWS EMR, addressing challenges of using S3 directly like inconsistent performance and expensive metadata operations.
- It mounts S3 as a unified filesystem and caches frequently used data in memory across workers for faster queries while continuously syncing data to S3.
- Alluxio's multi-tier storage enables data to be accessed locally from remote locations like S3 using intelligent policies to promote and demote data between memory, SSDs and disks.
2. AWS S3: The Default Data Lake on AWS
Why build your data lake on S3?
Highly available
Simple API
Fully managed
Really large scale
Cost-effective
Integration with many different services and frameworks
3. Analytics with AWS EMR on S3
[Diagram: Presto and Hive running on EMR instances, with HDFS and EMRFS as the storage layers]
4. HDFS on AWS EMR
[Diagram: Presto and Hive running on EMR instances over HDFS, next to EMRFS; labeled "Manual distcp / No Sync"]
5. Using EMRFS on AWS EMR
[Diagram: Presto and Hive running on EMR instances, with HDFS and EMRFS; labeled "No data caching"]
6. Challenges with Analytics on S3
Big data frameworks on the public cloud
[Diagram: multiple Spark instances]
Challenges with S3
§ Expensive metadata operations like list, rename
§ Eventual consistency in some cases
§ Performance inconsistent / throttling / limits
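To illustrate why a rename counts as an expensive metadata operation on an object store, here is a minimal sketch. It assumes a flat key space where a "directory" is only a key prefix and a rename is done as a copy plus a delete per object. `ToyObjectStore` and its request counter are hypothetical and stand in for a real store; they are not an S3 client.

```python
class ToyObjectStore:
    """Flat key/value store; 'directories' are only key prefixes (assumption of this sketch)."""

    def __init__(self):
        self.objects = {}
        self.requests = 0  # number of simulated API calls

    def put(self, key, data):
        self.requests += 1
        self.objects[key] = data

    def list(self, prefix):
        # A listing filters keys by prefix; there is no directory entry to consult.
        self.requests += 1
        return sorted(k for k in self.objects if k.startswith(prefix))

    def rename_prefix(self, src, dst):
        # No atomic rename: every object is copied to its new key, then the old key is deleted.
        for key in self.list(src):
            self.requests += 2  # one copy + one delete
            self.objects[dst + key[len(src):]] = self.objects.pop(key)


store = ToyObjectStore()
for i in range(100):
    store.put(f"tmp/part-{i:03d}", b"x")

before = store.requests
store.rename_prefix("tmp/", "final/")
rename_requests = store.requests - before
print(rename_requests)
```

In this toy model the cost of renaming a prefix grows with the number of objects under it, whereas a filesystem with real directories can rename one with a single metadata update.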
7. Spark workloads on S3 with Alluxio
Compute caching for Spark on S3
Accelerate big data frameworks on the public cloud
§ Provides a data caching layer for Spark
§ Provides strong consistency for metadata operations and faster performance
§ Provides API compatibility with HDFS & S3
§ S3 is eventually consistent, making it hard to predict query results
§ Allows data outside of S3, such as remote HDFS, to be analyzed as well
[Diagram: Spark and Alluxio running together in the same instance / container]
8. Data Orchestration for the Cloud
Client interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Storage drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
Independent scaling of compute & storage
9. Using Alluxio with AWS EMR
[Diagram: Presto and Hive running on EMR instances over a "Metadata & Data cache" layer, above HDFS and EMRFS; labeled "Compute-driven" and "Continuous sync"]
10. Alluxio – Data Caching for faster compute
[Animated diagram: frameworks (Spark, Presto, Hive, TensorFlow) send data requests to an instance with RAM, SSD and Disk tiers, backed by Bucket Trades and Bucket Customers]
Requests shown: Read file /trades/us; Read file /trades/us again; Read file /trades/top
Access to the buckets: variable latency with throttling
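The read pattern on this slide (read /trades/us, read it again, read /trades/top) can be sketched as a read-through cache. This is a minimal illustration, not Alluxio's implementation: `SlowStore` and `ReadThroughCache` are hypothetical names, and the remote store is simulated with a dictionary and a read counter.

```python
class SlowStore:
    """Stand-in for a remote object store; counts how often it is actually read."""

    def __init__(self, files):
        self.files = files
        self.reads = 0

    def read(self, path):
        self.reads += 1
        return self.files[path]


class ReadThroughCache:
    def __init__(self, store):
        self.store = store
        self.cache = {}

    def read(self, path):
        if path not in self.cache:      # miss: fetch from the remote store once
            self.cache[path] = self.store.read(path)
        return self.cache[path]         # hit: served locally


remote = SlowStore({"/trades/us": b"us-data", "/trades/top": b"top-data"})
cached = ReadThroughCache(remote)
cached.read("/trades/us")
cached.read("/trades/us")    # "again": served from the cache
cached.read("/trades/top")
print(remote.reads)
```

Only the first read of each file reaches the simulated remote store, so repeated reads avoid its variable latency.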
11. Alluxio - Intelligent Tiering for resource efficiency
[Animated diagram: frameworks (Spark, Presto, Hive, TensorFlow) send data requests to an instance with RAM, SSD and Disk tiers, backed by Bucket Trades and Bucket Customers]
Request shown: Read file /customers/145
Out of memory: data moved to another tier
Access to the buckets: variable latency with throttling
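The "out of memory, data moved to another tier" step can be sketched as follows. This is a toy model under stated assumptions: tier capacity is counted in files rather than bytes, the least recently used file is the one demoted, and `TieredCache` is a hypothetical class rather than an Alluxio API.

```python
from collections import OrderedDict


class TieredCache:
    """Ordered tiers, fastest first; each tier holds at most `capacity` files."""

    def __init__(self, capacities):
        # e.g. {"RAM": 1, "SSD": 2, "Disk": 4}; insertion order defines tier order
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities.items()]

    def put(self, path, data, level=0):
        if level >= len(self.tiers):
            return  # fell off the last tier; the backing store still holds the data
        _, cap, files = self.tiers[level]
        files[path] = data
        files.move_to_end(path)
        if len(files) > cap:
            # Tier is full: demote the least recently used file to the next tier.
            old_path, old_data = files.popitem(last=False)
            self.put(old_path, old_data, level + 1)

    def tier_of(self, path):
        for name, _, files in self.tiers:
            if path in files:
                return name
        return None


tiered = TieredCache({"RAM": 1, "SSD": 2, "Disk": 4})
tiered.put("/trades/us", b"...")
tiered.put("/customers/145", b"...")   # RAM is full, so /trades/us is demoted
print(tiered.tier_of("/trades/us"), tiered.tier_of("/customers/145"))
```

The demoted file stays available from a slower local tier instead of being dropped and re-fetched from the bucket.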
12. Alluxio – Policy-driven Data Management
[Diagram: frameworks (Spark, Presto, Hive, TensorFlow) write New Trades to an instance with RAM, SSD and Disk tiers]
Policy defined: Move data > 90 days old to S3 Standard
Policy interval: Every day
Policy applied every day
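A policy such as "move data > 90 days old", evaluated once per interval, amounts to selecting files by age. The sketch below shows only that selection step, assuming each file has a last-modified timestamp; the function name, the file paths and the dates are hypothetical, and the actual move is left out.

```python
from datetime import datetime, timedelta


def select_for_move(files, now, max_age=timedelta(days=90)):
    """Return paths whose last-modified time is more than `max_age` before `now`."""
    return sorted(path for path, mtime in files.items() if now - mtime > max_age)


now = datetime(2020, 1, 1)
files = {
    "/trades/2019-06": now - timedelta(days=200),
    "/trades/2019-10": now - timedelta(days=91),
    "/trades/2019-12": now - timedelta(days=10),
}
to_move = select_for_move(files, now)
print(to_move)
```

Running the selection once per policy interval (every day on this slide) would pick up files as they cross the age threshold.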
13. Flexible APIs to Interact with data in Alluxio
Spark
> rdd = sc.textFile("alluxio://localhost:19998/myInput")
Presto
CREATE SCHEMA hive.web
WITH (location = 'alluxio://master:port/my-table/')
POSIX
$ cat /mnt/alluxio/myInput
Java
FileSystem fs = FileSystem.Factory.get();
FileInStream in = fs.openFile(new AlluxioURI("/myInput"));
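The Spark and POSIX examples on this slide address the same file through two interfaces. The sketch below shows how the two names could correspond, assuming the root of the Alluxio namespace is exposed at the mount point /mnt/alluxio (inferred from the `cat` example, not stated on the slide). `to_mount_path` is a hypothetical helper, not part of any Alluxio client.

```python
from urllib.parse import urlparse
import posixpath


def to_mount_path(alluxio_uri, mount_point="/mnt/alluxio"):
    """Map an alluxio:// URI to the matching path under a POSIX mount point."""
    parsed = urlparse(alluxio_uri)
    if parsed.scheme != "alluxio":
        raise ValueError(f"not an alluxio URI: {alluxio_uri}")
    # The host:port part names the master; only the path carries over to the mount.
    return posixpath.join(mount_point, parsed.path.lstrip("/"))


print(to_mount_path("alluxio://localhost:19998/myInput"))
```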
16. Alluxio for Spark
• Data sharing between jobs
• Data resilience during application crashes
• Consolidate memory usage and alleviate GC issues
Example: Alluxio for Spark
17. Data Sharing Between Jobs
[Diagram: blocks 1 to 4, with block 1 and block 3 held in in-memory storage; storage engine & execution engine in the same process]
Inter-process sharing slowed down by network I/O
18. Data Sharing Between Jobs
[Diagram: blocks 1 to 4 on HDFS / disk, with blocks also held in memory; storage & execution engine separated]
Inter-process sharing can happen at memory speed
19. Data Resilience during Crashes
[Diagram: blocks 1 to 4, with block 1 and block 3 held in in-memory storage; storage engine & execution engine in the same process]
Process crash requires network I/O to re-read the data
20. Alluxio – Key innovations
• Data Locality with Intelligent Multi-tiering: Accelerate big data workloads with transparent tiered local data
• Data Accessibility for popular APIs & API translation: Run Spark, Hive, Presto, ML workloads on your data located anywhere
• Data Elasticity with a unified namespace: Abstract data silos & storage systems to independently scale data on-demand with compute
22. Data Locality with Intelligent Multi-tiering
Local performance from remote data using multi-tier storage
Hot: RAM
Warm: SSD
Cold: HDD
Read & Write Buffering
Transparent to App
Policies for pinning, promotion/demotion, TTL
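The three policy types named here (pinning, promotion/demotion, TTL) can be sketched with a toy two-tier model. The semantics are assumptions for illustration: a read promotes a file to RAM, a demotion pass moves every unpinned file back to SSD, and a file whose TTL has passed is dropped from the cache. `PolicyCache` is hypothetical and time is a plain number of seconds.

```python
class PolicyCache:
    """Two tiers ("RAM", "SSD") with pinning, promotion on read, and TTL expiry."""

    def __init__(self):
        self.entries = {}  # path -> {"tier", "expires_at", "pinned"}

    def add(self, path, now, ttl=None, pinned=False, tier="SSD"):
        expires_at = None if ttl is None else now + ttl
        self.entries[path] = {"tier": tier, "expires_at": expires_at, "pinned": pinned}

    def read(self, path):
        self.entries[path]["tier"] = "RAM"   # promotion: accessed data moves to the fast tier

    def demote_unpinned(self):
        for entry in self.entries.values():
            if not entry["pinned"]:
                entry["tier"] = "SSD"        # demotion: pinned data stays where it is

    def expire(self, now):
        expired = [p for p, e in self.entries.items()
                   if e["expires_at"] is not None and e["expires_at"] <= now]
        for path in expired:
            del self.entries[path]           # TTL reached: drop the cached copy
        return expired


pc = PolicyCache()
pc.add("/hot", now=0, pinned=True)
pc.add("/scratch", now=0, ttl=60)
pc.read("/hot")
pc.read("/scratch")
pc.demote_unpinned()
tiers = {path: entry["tier"] for path, entry in pc.entries.items()}
expired = pc.expire(now=61)
print(tiers, expired)
```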
23. Data Accessibility via popular APIs and API Translation
Convert from Client-side Interface to native Storage Interface
Client interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Storage drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
24. Data Elasticity via Unified Namespace
Enables effective data management across different under stores
- Uses mounting with transparent naming
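Mounting with transparent naming can be pictured as a mount table that translates a path in the single namespace into a location in whichever under store is mounted there. The sketch below assumes longest-prefix matching; `MountTable`, the bucket name and the HDFS address are hypothetical and chosen only for illustration.

```python
class MountTable:
    """Maps paths in one namespace to URIs in the mounted under stores."""

    def __init__(self):
        self.mounts = {}

    def mount(self, mount_point, under_uri):
        self.mounts[mount_point.rstrip("/")] = under_uri.rstrip("/")

    def resolve(self, path):
        # Longest matching mount point wins, so nested mounts resolve to the inner store.
        for mount_point in sorted(self.mounts, key=len, reverse=True):
            if path == mount_point or path.startswith(mount_point + "/"):
                return self.mounts[mount_point] + path[len(mount_point):]
        raise KeyError(f"no mount covers {path}")


ns = MountTable()
ns.mount("/trades", "s3://trades-bucket")            # hypothetical bucket name
ns.mount("/archive", "hdfs://namenode:8020/data")    # hypothetical HDFS address
print(ns.resolve("/trades/us"))
print(ns.resolve("/archive/2019/q1"))
```

An application only ever sees paths like /trades/us; which storage system serves them is decided by the mount table.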
25. Unified Namespace: Global Data Accessibility
Transparent access to understorage makes all enterprise data available locally
SUPPORTS
• HDFS
• NFS
• OpenStack
• Ceph
• Amazon S3
• Azure
• Google Cloud
IT OPS FRIENDLY
• Storage mounted into Alluxio by central IT
• Security in Alluxio mirrors source data
• Authentication through LDAP/AD
• Wireline encryption
[Diagram: HDFS #1, Object Store, NFS and HDFS #2 mounted into a single namespace]
26. Incredible Open Source Momentum with growing community
1000+ contributors & growing
4000+ Git Stars
Apache 2.0 Licensed
Hundreds of thousands of downloads
Join the conversation on Slack: alluxio.io/slack