This document discusses modernizing a data platform for analytics and AI across single, hybrid, or multi-cloud environments using Alluxio. It describes Alluxio's key features like data locality, metadata locality, asynchronous data operations, and policy-driven data management that enable consistent performance, portability, and cost savings. Examples are provided of how Alluxio can be used to transition from on-premises HDFS to object storage to hybrid cloud and multi-cloud configurations.
Modernizing Your Data Platform for Analytics and AI in the Hybrid Cloud Era
1. Modernizing Your Data Platform
for Analytics and AI
Across a single cloud, hybrid cloud or multi-cloud
2. About Me
ALLUXIO 2
Adit Madan
Product Management, Alluxio, Inc.
PMC member, Alluxio Open Source Project
MS from Carnegie Mellon University
BS from Indian Institute of Technology - Delhi
3. DATA STACK JOURNEY AND INNOVATION PATHS
Co-located
Co-located compute & HDFS on the same cluster (MR / Hive, HDFS)
▪ Typically compute-bound clusters over 100% capacity
▪ Compute & I/O need to be scaled together even when not needed
Disaggregated
Disaggregated compute & HDFS on separate clusters (Hive, HDFS)
▪ Compute & I/O can be scaled independently, but I/O is still needed on HDFS, which is expensive
HDFS for Hybrid Cloud
Burst HDFS data in the cloud, public or private
Support more frameworks
Support Presto, Spark, Tensorflow, PyTorch across DCs without app changes
Transition to Object store
Enable & accelerate big data on object stores
6. BRING DATA CLOSER TO COMPUTE ACROSS SILOS
Access-based data movement for compute and storage spread across environments
[Diagram labels: REGION A, REGION B, PRIVATE DATA CENTERS, DATACENTER 1, DATACENTER 2, Amazon EMR, Cloud Dataproc, Kubernetes Engine, Compute Engine, Hive]
7. DATA ORCHESTRATION WITHIN A SINGLE DATACENTER OR CLOUD REGION
CASE 01: CLOUD
Consistent SLAs, Performance, and Cost Savings on cloud storage
[Diagram labels: PUBLIC CLOUD, Tensorflow, Alluxio]
CASE 02: ON PREM
Speed-up analytics on on-prem object stores
[Diagram labels: ON PREMISE, Spark, Alluxio]
8. DATA ORCHESTRATION ACROSS DATACENTERS
CASE 03: HYBRID
Burst compute to a public cloud and gradually migrate
[Diagram labels: Hive, Alluxio, PUBLIC CLOUD, ON PREMISE]
CASE 04: HYBRID
Hybrid Cloud Gateway to utilize on-prem compute for data in the cloud
[Diagram labels: Alluxio, Pytorch, PUBLIC CLOUD, ON PREMISE]
CASE 05: MULTI-DATACENTER
Cross Datacenter Access without changing Ingest Pipeline across regions
[Diagram labels: Presto, Alluxio, DATACENTER 1, DATACENTER 2, INGESTION]
9. Alluxio - Key Innovations
EFFICIENT ACCESS & DATA LOCALITY
Performance acceleration with efficient representation and caching of data close to compute
ENVIRONMENT AGNOSTIC DATA MANAGEMENT
Orchestrate a data platform with agility across regions for private, hybrid or multi-cloud with policy-based data management
UNIFY DATA LAKES
Support multiple APIs for analytics and AI with storage abstraction and streamlined data movement across the pipeline
10. EXAMPLE JOURNEY 01
On-premises HDFS to Object Storage to Hybrid Cloud
Step 1 - Unified namespace: Mount HDFS and object storage into a common Alluxio cluster
Step 2 - Object store analytics: Caching layer to speed up Presto and Spark jobs
Step 3 - Hybrid-cloud: Burst compute to a single public cloud first. Run managed Hadoop, K8s and cloud native AI for model training
Step 4 - Multi-cloud: Replicate setup on AWS to Google Cloud. Choose the right tool for the job, regardless of the cloud provider
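The unified namespace idea (mounting HDFS and object storage into one cluster) can be pictured as a mount table that maps logical paths to backing-store URIs. The sketch below is only an illustration under stated assumptions: the `MountTable` class, the longest-prefix rule, and the host and bucket names are hypothetical and are not Alluxio's actual API.

```python
from typing import Dict


class MountTable:
    """Hypothetical mount table: logical path prefix -> backing-store URI."""

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}

    def mount(self, logical_prefix: str, backing_uri: str) -> None:
        # Normalise trailing slashes so prefixes compare cleanly.
        prefix = logical_prefix.rstrip("/") or "/"
        self._mounts[prefix] = backing_uri.rstrip("/")

    def resolve(self, logical_path: str) -> str:
        """Translate a logical path using the longest matching mount prefix."""
        best = None
        for prefix in self._mounts:
            matches = (
                prefix == "/"
                or logical_path == prefix
                or logical_path.startswith(prefix + "/")
            )
            if matches and (best is None or len(prefix) > len(best)):
                best = prefix
        if best is None:
            raise KeyError(f"no mount covers {logical_path}")
        remainder = logical_path if best == "/" else logical_path[len(best):]
        return self._mounts[best] + remainder


# Assumed example endpoints; replace with real ones.
table = MountTable()
table.mount("/warehouse", "hdfs://namenode:8020/warehouse")
table.mount("/archive", "s3://example-bucket/archive")
print(table.resolve("/warehouse/sales/part-0"))
print(table.resolve("/archive/2020/part-0"))
```

Applications address only the logical paths, so the backing store behind a prefix can change without touching application code.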
11. EXAMPLE JOURNEY 01
On-premises object storage as the source of truth
[Diagram labels: REGION A, REGION B, PRIVATE DATA CENTERS, DATACENTER 2, Amazon EMR, Cloud Dataproc, Kubernetes Engine, Compute Engine, INGESTION, ETL, Hive]
12. EXAMPLE JOURNEY 02
Hybrid Cloud to Multi Cloud
Step 1 - Burst analytics in the cloud: Presto and Alluxio in the cloud accessing on-prem HDFS and cloud storage
Step 2 - High availability & modernization: Data replication across HDFS clusters in different data centers
Step 3 - Multi-cloud with Azure and AWS: Storage abstraction regardless of cloud provider
Why data orchestration?
Efficient data caching and representation: With data pinning capability for cache control
Seamless data synchronization: For always active data across the data pipeline
Multi-cloud fabric: Abstraction for infrastructure spanning multiple data centers and clouds
13. EXAMPLE JOURNEY 02
Global data platform for analytics & AI built on data and container orchestration
[Diagram labels: REGION A, REGION B, PRIVATE DATA CENTERS, DATACENTER 1, DATACENTER 2, Hive, INGESTION, ETL]
15. DATA LOCALITY
Local performance for remote data with intelligent multi-tiering
Hot (RAM), Warm (SSD), Cold (HDD)
Read & Write Buffering
Transparent to App
Policies for pinning, promotion/demotion, TTL
[Diagram labels: On-premises, Public Cloud, Model Training, Big Data ETL, Big Data Query]
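The tiering policies named on this slide (pinning, promotion/demotion, TTL) can be illustrated with a toy cache. This is a minimal sketch under assumptions and not Alluxio's implementation: the `TieredCache` class, the fixed per-tier capacity, the rule that an access promotes a block straight to the top tier, and the rule that pinned blocks are never demoted are all hypothetical choices made for illustration.

```python
import time
from typing import Callable, Dict, List, Optional, Set

TIERS: List[str] = ["RAM", "SSD", "HDD"]  # hot, warm, cold


class TieredCache:
    """Toy multi-tier cache tracking only which tier each block lives in."""

    def __init__(self, capacity_per_tier: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity_per_tier
        self.clock = clock
        self.tier_of: Dict[str, int] = {}
        self.last_access: Dict[str, float] = {}
        self.expires_at: Dict[str, float] = {}
        self.pinned: Set[str] = set()

    def _drop(self, key: str) -> None:
        self.tier_of.pop(key, None)
        self.last_access.pop(key, None)
        self.expires_at.pop(key, None)
        self.pinned.discard(key)

    def _make_room(self, tier: int) -> None:
        """Demote least recently used unpinned blocks until the tier has a free slot."""
        while True:
            members = [k for k, t in self.tier_of.items() if t == tier]
            if len(members) < self.capacity:
                return
            candidates = [k for k in members if k not in self.pinned]
            if not candidates:
                raise RuntimeError(f"{TIERS[tier]} holds only pinned blocks")
            victim = min(candidates, key=lambda k: self.last_access[k])
            if tier + 1 < len(TIERS):
                self._make_room(tier + 1)        # cascade demotion downwards
                self.tier_of[victim] = tier + 1
            else:
                self._drop(victim)               # evicted from the coldest tier

    def _admit_to_top(self, key: str) -> None:
        previous = self.tier_of.pop(key, None)
        try:
            self._make_room(0)
        except RuntimeError:
            if previous is not None:
                self.tier_of[key] = previous     # leave the block where it was
            raise
        self.tier_of[key] = 0
        self.last_access[key] = self.clock()

    def put(self, key: str, ttl: Optional[float] = None, pin: bool = False) -> None:
        self._admit_to_top(key)
        if ttl is not None:
            self.expires_at[key] = self.clock() + ttl
        if pin:
            self.pinned.add(key)

    def access(self, key: str) -> Optional[str]:
        """Return the tier the block was found in, or None if absent or expired."""
        if key not in self.tier_of:
            return None
        if self.clock() >= self.expires_at.get(key, float("inf")):
            self._drop(key)                      # TTL elapsed
            return None
        found_in = TIERS[self.tier_of[key]]
        try:
            self._admit_to_top(key)              # promotion on access
        except RuntimeError:
            pass                                 # top tier fully pinned; stay put
        return found_in
```

A real system would also move the bytes between media; here only the placement decisions are modelled, which is the part the policies control.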
16. METADATA LOCALITY
Synchronization of changes across clusters
Policies for pinning, promotion/demotion, TTL
[Diagram labels: Mutation, Old File at path /file1, New File at path /file1, Metadata Synchronization, Alluxio Master, On-premises, Public Cloud, Model Training, Big Data ETL, Big Data Query]
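One way to picture metadata synchronization after a mutation at a path such as /file1 is a comparison between cached metadata and a fresh listing from the storage system. The sketch below is an assumption-laden illustration: `FileMeta`, `sync_metadata`, and the use of length plus modification time as a fingerprint are hypothetical and are not taken from Alluxio's code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FileMeta:
    length: int     # file size in bytes
    mtime: float    # last-modified timestamp reported by the storage system


def sync_metadata(cached: Dict[str, FileMeta],
                  storage: Dict[str, FileMeta]) -> List[str]:
    """Bring `cached` in line with `storage`.

    Returns the paths whose cached data is stale (changed or deleted in
    storage) and would need to be invalidated.
    """
    stale: List[str] = []
    for path, meta in storage.items():
        if path in cached and cached[path] != meta:
            stale.append(path)              # same path, different content
        cached[path] = meta                 # add new or refresh existing entry
    for path in list(cached):
        if path not in storage:
            del cached[path]                # removed from the storage system
            stale.append(path)
    return sorted(stale)


cached = {"/file1": FileMeta(100, 1.0), "/gone": FileMeta(5, 1.0)}
storage = {"/file1": FileMeta(120, 2.0), "/new": FileMeta(7, 2.0)}
print(sync_metadata(cached, storage))
```

Keeping this comparison close to the compute cluster means applications see the new file at /file1 without listing the remote store on every request.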
17. ASYNCHRONOUS DATA OPERATIONS
Data pre-loading and fast durable writes
Distributed Load (Preload Cache)
Alluxio Data Orchestration and Control Service
File A, File B, File C: (3 replicas, 3 blocks) / file
File A: (1 replica, 3 blocks)
Fast Durable Write (Async write)
Alluxio Data Orchestration and Control Service
File D: (3 replicas, 3 blocks) / file
File D: (3 replicas, 3 blocks until HDFS write completed), then (1 replica, 3 blocks)
Tmp files not written to HDFS: .staging, .tmp
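The fast durable write flow can be simulated in a few lines: a write lands in the cache with extra replicas, is persisted to HDFS later, and then drops back to a single replica, while temporary files are never persisted. This is a simplified, single-threaded sketch under assumptions; `AsyncWriteCache`, `persist_pending`, and the choice of one replica for temporary files are hypothetical, and a dictionary stands in for HDFS.

```python
from collections import deque
from typing import Deque, Dict

TMP_MARKERS = (".staging", ".tmp")  # markers taken from the slide


def is_temporary(path: str) -> bool:
    """True if any path component ends with a temporary-file marker."""
    return any(part.endswith(m) for part in path.split("/") for m in TMP_MARKERS)


class AsyncWriteCache:
    def __init__(self, write_replicas: int = 3, steady_replicas: int = 1) -> None:
        self.write_replicas = write_replicas
        self.steady_replicas = steady_replicas
        self.data: Dict[str, bytes] = {}
        self.replicas: Dict[str, int] = {}
        self.under_store: Dict[str, bytes] = {}   # stand-in for HDFS
        self._pending: Deque[str] = deque()

    def write(self, path: str, payload: bytes) -> None:
        """Return as soon as the cache holds the data; persistence happens later."""
        self.data[path] = payload
        if is_temporary(path):
            # Assumption: temporary files stay cache-only with one replica.
            self.replicas[path] = self.steady_replicas
            return
        self.replicas[path] = self.write_replicas  # durability via replication
        self._pending.append(path)

    def persist_pending(self) -> int:
        """Flush queued files to the under store, then shrink their replication."""
        persisted = 0
        while self._pending:
            path = self._pending.popleft()
            if path in self.data:
                self.under_store[path] = self.data[path]
                self.replicas[path] = self.steady_replicas
                persisted += 1
        return persisted
```

In a real deployment `persist_pending` would run in the background; calling it explicitly here keeps the example deterministic.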
18. POLICY DRIVEN DATA MANAGEMENT
Unified namespace for live data migration
[Diagram labels: hdfs://host:port/directory/, Reports, Sales]
• Single Alluxio path backed by multiple storage systems
• Example policy: Migrate data older than 7 days from HDFS to S3
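The example policy above (migrate data older than 7 days from HDFS to S3) amounts to an age filter over file entries. The following sketch is illustrative only: `FileEntry`, `apply_age_policy`, and the "strictly older than" comparison are assumptions, and changing the `store` field stands in for the actual copy and delete.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class FileEntry:
    path: str            # logical path, unchanged by migration
    store: str           # backing store currently holding the data
    modified: datetime   # last modification time


def apply_age_policy(files: List[FileEntry], now: datetime,
                     max_age: timedelta = timedelta(days=7),
                     source: str = "hdfs", target: str = "s3") -> List[str]:
    """Move entries older than `max_age` from `source` to `target`.

    Returns the logical paths that were migrated.
    """
    migrated: List[str] = []
    for entry in files:
        if entry.store == source and now - entry.modified > max_age:
            entry.store = target     # stand-in for copy to target, delete from source
            migrated.append(entry.path)
    return migrated


now = datetime(2024, 1, 15)
files = [
    FileEntry("/Reports/q4.parquet", "hdfs", datetime(2024, 1, 1)),
    FileEntry("/Sales/today.parquet", "hdfs", datetime(2024, 1, 14)),
]
print(apply_age_policy(files, now))
```

Because the logical path stays the same, readers of `/Reports/q4.parquet` are unaffected by the move between storage systems.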
19. SEAMLESS CATALOG DEFINITIONS
No table redefinitions required using “Transparent URI”
Example Scenario
I. Initial state
A. Data in HDFS
B. Hive Metastore table definitions pointing to HDFS
II. Compute cluster with Alluxio
A. Catalog points to Hive Metastore
B. Alluxio intercepts Presto calls to HDFS
III. Query execution
A. Accesses to HDFS are served by Alluxio
B. No manual data copies or application re-writes
[Diagram labels, keyed to steps I, II and III: Presto Catalog, Hive Connector, hdfs://ns/table, Presto, Alluxio, Public Cloud, On-premises, Hive Metastore, HDFS]
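The scenario above can be sketched as a client-side shim: the query engine keeps using the hdfs:// location stored in the Hive Metastore, and the shim serves it from a cache when it can. This is a hedged illustration, not the actual "Transparent URI" mechanism; `TransparentClient`, its `read` method, and the in-memory dictionary standing in for HDFS are hypothetical.

```python
from typing import Callable, Dict


class TransparentClient:
    """Serves reads for unchanged hdfs:// URIs through a local cache."""

    def __init__(self, read_from_hdfs: Callable[[str], bytes]) -> None:
        self._read_from_hdfs = read_from_hdfs
        self._cache: Dict[str, bytes] = {}
        self.remote_reads = 0

    def read(self, uri: str) -> bytes:
        if not uri.startswith("hdfs://"):
            raise ValueError(f"unsupported scheme in {uri}")
        if uri not in self._cache:
            # Cache miss: fetch once from the remote HDFS cluster.
            self._cache[uri] = self._read_from_hdfs(uri)
            self.remote_reads += 1
        return self._cache[uri]


# A dictionary stands in for the on-premises HDFS cluster.
fake_hdfs = {"hdfs://ns/table/part-0": b"rows"}
client = TransparentClient(lambda uri: fake_hdfs[uri])

# The table location is exactly what the metastore holds; no redefinition needed.
client.read("hdfs://ns/table/part-0")
client.read("hdfs://ns/table/part-0")
print(client.remote_reads)
```

The point of the design is that the catalog entry never changes: only the component that answers the hdfs:// request does.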
20. Open Source Started From UC Berkeley AMPLab
1000+ contributors & growing
4000+ Git Stars
Apache 2.0 Licensed
Million+ Downloads
GitHub’s Top 100 Most Valuable Repositories Out of 96 Million
#9 Most critical open source Java projects (Google OpenSSF)
Join the conversation on Slack: slackin.alluxio.io
21. COMPANIES USING ALLUXIO
INTERNET
PUBLIC CLOUD PROVIDERS
GENERAL
E-COMMERCE
OTHERS
TECHNOLOGY FINANCIAL SERVICES
TELCO & MEDIA