Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds

Bin Fan (binfan@alluxio.com)
Founding Engineer, VP of Open Source @ Alluxio

About Me
Bin Fan (https://www.linkedin.com/in/bin-fan/)
● Founding Engineer, VP Open Source @ Alluxio
● Alluxio PMC Co-Chair, Presto TSC/committer
● Email: binfan@alluxio.com
● PhD in CS @ Carnegie Mellon University

Alluxio Overview
● Originally a research project (Tachyon) at the UC Berkeley AMPLab, led by then-PhD student Haoyuan Li (Alluxio founder and CEO)
● Backed by top VCs (e.g., Andreessen Horowitz), with $70M raised in total; Series C ($50M) announced in 2021
● Deployed in production at large scale at Facebook, Uber, Microsoft, Tencent, TikTok, and others
● More than 1,200 contributors on GitHub. In 2021, more than 40% of commits on GitHub were contributed by community users
● The 9th most critical Java-based open-source project on GitHub, as ranked by Google/OpenSSF [1]
[1] Google Comes Up With A Metric For Gauging Critical Open-Source Projects

Alluxio (Tachyon) back in 2015
[Screenshot of the Tachyon talk at AMPLab in 2015; visible slide titles: What is Tachyon, Stack, Release Growth]

Alluxio (Tachyon) in 2015
[Diagram labels: Spark Task 1, Spark Task 2; RDD; Tachyon (in-memory) holding block 1, block 2, block 3, block 4; HDFS (disk); HDFS / Amazon S3]

What's Different Today
Topology
● On-prem Hadoop → cloud-native, multi- or hybrid-cloud, multi-datacenter
Computation
● MR/Spark → Spark, Presto, Hive, TensorFlow, PyTorch, ...
● More mature frameworks (e.g., less frequent OOM)
Data access pattern
● Sequential reads (e.g., scanning) on unstructured files → ad-hoc reads into structured/columnar data
● Hundreds to thousands of big files → millions of small files

The Evolution from Hadoop to the Cloud-native Era
Data Storage
● On-prem, colocated HDFS → S3 and other object stores (possibly across regions such as us-east and us-west), with legacy on-prem HDFS still in service
Resource/Job Orchestration
● YARN → K8s
○ Lost focus on data locality

Unprecedented Complexity of Data Platforms
Data Trend | Complex Platform
New compute and storage tech created every 3-8 years | On-premise, cloud, hybrid, and multi-cloud environments all have different environment properties
More data generated every day, and stored in data silos | Data copies, synchronization costs
More people and teams need to access and leverage these data | Multiple APIs necessitate integration and application rewrites

Inefficient Manual Copy Across Data Centers, Regions, Clouds
[Diagram labels: Region A, Region B; private data centers (Datacenter 1, Datacenter 2); Amazon EMR, Cloud Dataproc, Kubernetes Engine, Compute Engine, Hive; error-prone and network-intensive data copies]

Strong Market Demand For Simplification
EFFICIENT ACCESS & DATA MANAGEMENT
● Acceleration & auto-tiering of remote data sources
ENVIRONMENT AGNOSTICITY
● Agility across regions for private, hybrid or multi-cloud
UNIFICATION OF DATA LAKES
● Serve analytics & AI from multiple data locations

Analytics & AI in the Hybrid & Multi-Cloud Era

SOLUTION
Multi-Cloud Ready Analytics & AI Platform
● No-copy data access across silos, agnostic to compute engine
● Foundation of a heterogeneous data platform across geos
[Diagram labels: Region A, Region B, GKE; Datacenter 1, Datacenter 2, HMS]

Open Source Started From UC Berkeley AMPLab in 2014
● 1,200+ contributors & growing
● 9,000+ Slack community members
● Top 10 most critical Java-based open-source project
● One of GitHub’s top 100 most valuable repositories, out of 96 million
● Join the conversation on Slack: alluxio.io/slack

COMPANIES USING ALLUXIO
[Logo wall grouped by sector: Internet, Public Cloud Providers, General, E-Commerce, Technology, Financial Services, Telco & Media, Others]

Case Studies
Examples to eliminate data copies

Top Online Travel Platform: Unify Data Lake Across Multiple Geo Regions in AWS
Problems Encountered
● Data silos caused by different brands/teams ingesting data dispersed across multiple regions in AWS
● Central analytics queries across data silos suffered from poor user experience and long time to insight
● Manual replication resulted in inefficiencies, operational overhead, and expensive S3 egress costs
Alluxio’s Solution
● Unify data silos without the need to copy or move data
Results Achieved
● Enhanced user experience with consistent, high-performance analytics, reducing time to insight
● 50% reduced cost per query
Federate Data Lakes without Replication & Serve Various Compute Engines
[Diagram labels: Team A, Team B, Team C; main region: central analytics; us-west-1, us-east-1, us-east-2, us-west-2; Hive; Mounted]

Large Scale Analytics within a Single Cloud
PUBLIC CLOUD
Real-time responses & analysis, while saving costs on S3 storage
Problems Encountered
● Newly introduced chatbot to better manage communication with gamers globally
● Presto engine performs a huge number of queries to support instantaneous responses
● Urgently looking for a new solution to slash costs without losing performance
Results Achieved
● 2-4x average performance improvement
● 7x key query speed-up
● At least 50% cost saving

Example Journeys
A Typical Customer Journey

EXAMPLE JOURNEY 01
On-premises HDFS to Object Storage to Hybrid Cloud
1. Unified namespace: Mount HDFS and object storage into a common Alluxio cluster
2. Object store analytics: Caching layer to speed up Presto and Spark jobs
3. Hybrid-cloud: Burst compute to a single public cloud first; run analytics on K8s and cloud-native AI for model training
4. Multi-cloud: Replicate the setup on AWS to Google Cloud; choose the right tool for the job, regardless of the cloud provider

EXAMPLE JOURNEY 01
On-premises object storage as the source of truth
[Diagram labels: Region A, Region B (multiple instances); Amazon EMR, Cloud Dataproc, Kubernetes Engine, Compute Engine; private data centers: datacenter with ingestion ETL and Hive]

Architecture Overview
Enable a Hybrid Data Lake

ARCHITECTURE
[Diagram: Alluxio Clients reach the Alluxio Master (with Standby Masters, coordinated by consensus) over the control path and Alluxio Workers (RAM / SSD / HDD) over the data path; Under Store 1 and Under Store 2 sit across the WAN]

DATA LOCALITY WITH SCALE-OUT WORKERS
Local performance for remote data with intelligent multi-tiering
● Tiers: Hot (RAM), Warm (SSD), Cold (HDD)
● Read & write buffering, transparent to the app
● Policies for pinning, promotion/demotion, TTL
[Diagram labels: On-premises, Public Cloud; Model Training, Big Data ETL, Big Data Query]
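The tiering idea on this slide can be illustrated with a toy model. This is only a sketch, not Alluxio code: the class name TieredCache, the block-count capacities, and the rule "promote on read, demote the least recently used block" are assumptions made for illustration.

```python
from collections import OrderedDict


class TieredCache:
    """Toy multi-tier cache. Tier 0 is the hottest (e.g. RAM), the last tier the coldest (e.g. HDD)."""

    def __init__(self, capacities):
        # capacities: max number of blocks per tier (a hypothetical unit)
        self.capacities = list(capacities)
        # One dict per tier, ordered from least to most recently used
        self.tiers = [OrderedDict() for _ in self.capacities]

    def _insert(self, level, key, value):
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)
        if len(tier) > self.capacities[level]:
            # Tier is over capacity: take its least recently used block
            old_key, old_value = tier.popitem(last=False)
            if level + 1 < len(self.tiers):
                self._insert(level + 1, old_key, old_value)  # demotion
            # Otherwise the block falls out of the coldest tier and is dropped

    def put(self, key, value):
        # Remove any stale copy, then write into the hottest tier
        for tier in self.tiers:
            tier.pop(key, None)
        self._insert(0, key, value)

    def get(self, key):
        for tier in self.tiers:
            if key in tier:
                value = tier.pop(key)
                self._insert(0, key, value)  # promotion on access
                return value
        return None  # cache miss

    def tier_of(self, key):
        for level, tier in enumerate(self.tiers):
            if key in tier:
                return level
        return None


cache = TieredCache([1, 1, 1])  # one block each for RAM, SSD, HDD
for name in ("a", "b", "c"):
    cache.put(name, name.upper())
```

Reading a cold block moves it back to the hottest tier and pushes the others down, which is the promotion/demotion behavior named on the slide. Pinning and TTL are left out to keep the sketch short.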

METADATA LOCALITY WITH SCALABLE MASTERS
Synchronization of changes across clusters
[Diagram labels: Mutation (old file at path /file1 -> new file at path /file1); Metadata Synchronization; Alluxio Master (RocksDB); policies for pinning, promotion/demotion, TTL; RAM, SSD; On-premises, Public Cloud; Model Training, Big Data ETL, Big Data Query]
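One way to picture metadata synchronization is a fingerprint comparison: when the file at /file1 is replaced in the underlying store, its fingerprint changes and the cached entry is refreshed. The sketch below is an assumption-laden illustration, not Alluxio's implementation; Fingerprint, MetadataCache, and the choice of (size, mtime) as the fingerprint are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    """Hypothetical change marker for a file in the underlying store."""
    size: int
    mtime: float


class MetadataCache:
    def __init__(self):
        self.entries = {}      # path -> last known Fingerprint
        self.invalidated = []  # paths whose cached data had to be dropped

    def sync(self, path, under_store_fp):
        """Compare with the under store; return True if the entry changed."""
        cached = self.entries.get(path)
        if cached == under_store_fp:
            return False  # nothing changed, keep serving cached data
        if cached is not None:
            # The file was mutated outside the cache: stale data must go
            self.invalidated.append(path)
        self.entries[path] = under_store_fp
        return True


meta = MetadataCache()
first = meta.sync("/file1", Fingerprint(size=10, mtime=1.0))   # first sighting
second = meta.sync("/file1", Fingerprint(size=10, mtime=1.0))  # unchanged
third = meta.sync("/file1", Fingerprint(size=12, mtime=2.0))   # mutated
```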

SEAMLESS CATALOG DEFINITIONS
No table redefinitions required using “Transparent URI”
Example Scenario
I. Initial state
A. Data in HDFS
B. Hive Metastore table definitions pointing to HDFS
II. Compute cluster with Alluxio
A. Catalog points to Hive Metastore
B. Alluxio intercepts Presto calls to HDFS
III. Query execution
A. Accesses to HDFS are served by Alluxio
B. No manual data copies or application re-writes
[Diagram labels: Public Cloud: Presto catalog, Hive connector, Presto, Alluxio; On-premises: Hive Metastore, HDFS; table location hdfs://ns/table; steps I, II, III as listed above]
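The scenario above relies on the table location staying hdfs://ns/table while the handler behind that URI changes. The sketch below models this with a small registry keyed by scheme and authority. It is not Presto or Alluxio code; FileSystemRegistry, read_from_hdfs, and make_intercepting_handler are hypothetical names.

```python
from urllib.parse import urlsplit


def read_from_hdfs(uri):
    # Placeholder for a direct read from the original store
    return f"bytes of {uri}"


def make_intercepting_handler(backend, served):
    """Wrap a backend so that every access passes through the intercepting layer."""
    def handler(uri):
        served.append(uri)   # the intercepting layer sees the request first
        return backend(uri)  # a real layer would serve from cache when possible
    return handler


class FileSystemRegistry:
    """Maps (scheme, authority) to whatever handler should serve that URI."""

    def __init__(self):
        self.handlers = {}

    def register(self, scheme, authority, handler):
        self.handlers[(scheme, authority)] = handler

    def open(self, uri):
        parts = urlsplit(uri)
        return self.handlers[(parts.scheme, parts.netloc)](uri)


table_location = "hdfs://ns/table"  # unchanged catalog entry

registry = FileSystemRegistry()
registry.register("hdfs", "ns", read_from_hdfs)  # I. initial state
before = registry.open(table_location)

served_by_layer = []
registry.register("hdfs", "ns",                  # II. interception enabled
                  make_intercepting_handler(read_from_hdfs, served_by_layer))
after = registry.open(table_location)            # III. same URI, new path
```

The catalog string never changes; only the registration does, which is why no table redefinition is needed in this model.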

DEPLOYMENT APPROACHES
● Co-locate Alluxio Workers with compute for optimal I/O performance
[Diagram: Spark and Alluxio in the same cluster; Storage in a remote cluster]
● Deploy Alluxio as a standalone cluster between compute and storage
[Diagram: Spark and Presto with Alluxio in the same data center / region; Storage in a remote cluster]
Long-running Instances | Ephemeral | Elastic

UNIFIED NAMESPACE
With Replication & Live Data Migration Capabilities
hdfs://host:port/directory/
Reports Sales
• Single Alluxio path backed by multiple storage systems
• Example policy: Migrate data older than 7 days from HDFS to S3
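A mount table with longest-prefix matching is one simple way to picture a single namespace backed by several storage systems. The sketch below is illustrative only: MountTable is a hypothetical class, and the under-store URIs (hdfs://namenode:8020/reports, s3://example-bucket/sales) are made-up examples.

```python
class MountTable:
    """Toy namespace: logical path prefix -> under-store URI prefix."""

    def __init__(self):
        self.mounts = {}

    def mount(self, path, ufs_uri):
        # Normalize trailing slashes so prefixes compare cleanly
        self.mounts[path.rstrip("/") or "/"] = ufs_uri.rstrip("/")

    def resolve(self, path):
        best = None
        for prefix in self.mounts:
            base = prefix.rstrip("/")
            if path == prefix or path.startswith(base + "/"):
                # Prefer the most specific (longest) mount point
                if best is None or len(prefix) > len(best):
                    best = prefix
        if best is None:
            raise KeyError(f"no mount covers {path}")
        suffix = path[len(best.rstrip("/")):]
        return self.mounts[best] + suffix


table = MountTable()
table.mount("/Reports", "hdfs://namenode:8020/reports")  # hypothetical HDFS location
table.mount("/Sales", "s3://example-bucket/sales/")      # hypothetical S3 location
```

Migrating /Sales to another store would only change the mount entry; applications keep using the same logical path.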

POLICY DRIVEN DATA MANAGEMENT
Decouple logical file system namespace from physical storage
Alluxio Master: Alluxio Policy Engine
Example Policy: Move files older than 90 days from HDFS to S3
Application: Apps access the same path regardless of where the actual data is stored
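The example policy can be written as a small selection rule over file metadata. This is a sketch under stated assumptions, not the Alluxio Policy Engine: FileInfo, apply_age_policy, and the store labels "hdfs" and "s3" are hypothetical, and the move is modeled by flipping a field rather than copying bytes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileInfo:
    path: str            # logical path seen by applications (never changes)
    store: str           # where the bytes currently live: "hdfs" or "s3"
    modified: datetime


def apply_age_policy(files, now, max_age_days=90, source="hdfs", target="s3"):
    """Move files strictly older than max_age_days from source to target."""
    cutoff = now - timedelta(days=max_age_days)
    moved = []
    for f in files:
        if f.store == source and f.modified < cutoff:
            f.store = target  # a real system would copy the data, then switch
            moved.append(f.path)
    return moved


files = [
    FileInfo("/Sales/2021-06.parquet", "hdfs", datetime(2021, 6, 1)),
    FileInfo("/Sales/2021-12.parquet", "hdfs", datetime(2021, 12, 15)),
    FileInfo("/Sales/2020-01.parquet", "s3", datetime(2020, 1, 1)),
]
moved = apply_age_policy(files, now=datetime(2022, 1, 1))
```

Only the physical location changes; each file's logical path stays the same, matching the "apps access the same path" point on the slide.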

ML/DL
Training & Data Pre-processing

I/O Challenges in ML/DL
What's Different
● Training data often consists of a massive number of small files (billions of 100KB photos)
● The size of training data keeps growing and can exceed individual server capacity
● Training jobs are highly concurrent and require high I/O to keep GPUs utilized

Using Alluxio for DL
Distributed Caching
- Only fetch data on cache miss
- No need to copy data before use
[Diagram: Training Instances access Alluxio Servers through POSIX]
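The "fetch only on cache miss" behavior can be sketched as a read-through cache spread over several servers. This is a simplified model, not Alluxio's data path: DistributedCache, CacheServer, the MD5-based placement, and the fetch_from_storage callback are all assumptions made for illustration.

```python
import hashlib


class CacheServer:
    def __init__(self):
        self.store = {}  # path -> cached bytes


class DistributedCache:
    """Toy read-through cache: each path is owned by exactly one server."""

    def __init__(self, num_servers, fetch_from_storage):
        self.servers = [CacheServer() for _ in range(num_servers)]
        self.fetch_from_storage = fetch_from_storage
        self.misses = 0

    def _server_for(self, path):
        # Stable hash so every client picks the same server for a path
        digest = hashlib.md5(path.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.servers)
        return self.servers[index]

    def read(self, path):
        server = self._server_for(path)
        if path not in server.store:
            # Cache miss: go to the underlying storage once, then keep the data
            self.misses += 1
            server.store[path] = self.fetch_from_storage(path)
        return server.store[path]


# Hypothetical remote storage holding small training files
remote = {f"/train/img_{i}.jpg": f"pixels-{i}".encode() for i in range(5)}
cache = DistributedCache(num_servers=3, fetch_from_storage=remote.__getitem__)

for epoch in range(2):  # two passes over the data
    for path in remote:
        cache.read(path)
```

Nothing is copied ahead of time: the first epoch pulls each file once, and later epochs are served entirely from the cache servers.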

Using Alluxio for DL
Distributed Caching
● Consistent performance
● Direct access to data
● Low latency and high throughput
● High GPU utilization rate

Large Scale Deep Learning
TOPOLOGY: ON-PREMISES
Alluxio’s Solution
MOMO (NASDAQ: MOMO) runs thousands of Alluxio nodes across multiple Alluxio clusters, managing more than 100 TB of data for search and training:
● Supports multiple storage & compute frameworks
● Accelerates compute & training tasks
● Reduces metadata and data overhead
Model Training using PyTorch + Alluxio + Ceph
● 2 billion small files
● Reduces metadata & data interactions with Ceph to improve performance
https://www.alluxio.io/resources/videos/ml-and-query-acceleration-at-momo-with-alluxio-chinese/

Q&A
Social Media: Twitter.com/alluxio, Linkedin.com/alluxio
Website: www.alluxio.io
Slack: https://alluxio.io/slack

USE CASE: HYBRID CLOUD
Hybrid Cloud Storage Gateway with Compute On-prem

USE CASE: ALL IN DATACENTER
Multiple Analytics Clusters On-premises
• 6+ Alluxio clusters in production
• Largest Alluxio cluster = 1000 nodes*
• 2.5x performance improvement for I/O-intensive queries & 1.2x on average
• 30% reduction in query failures due to timeouts
* Largest single Alluxio cluster across any customer is 3000 nodes
[Diagram labels: Spark SQL, Impala, Alluxio]

USE CASE: MULTI DATACENTER
Cross Datacenter Access without changing Ingest Pipeline
[Diagram labels: Trino, Alluxio, Hive; Datacenter 1, Datacenter 2; remote data]
RESULTS
• Ad-hoc SQL workloads in a secondary DC as analyst headcount reached 1800 people
• Leverage a 220+ node Alluxio cluster for compute resources outside the primary DC

USE CASE: ALL IN CLOUD
• 40%+ reduction in AI training time & cost
• Data prefetching using asynchronous loading
• 200 GPU instances with 4x NVIDIA V100
• Alluxio uses CPU cores and NVMe

Compute in GCP with Data On-premises
• 2x performance improvement for range queries
• Improved Concurrency and Pinning
• Elastic compute for up to 2x cost savings
• Auto-scaling of Alluxio workers

Burst Compute to AWS with Storage On-prem
• 4+ Alluxio clusters on-prem with synchronization requirements
• Alluxio as the only way to read or write to the on-prem data lake
• Compute stack cloned in a second CSP
[Diagram labels: Amazon EMR, Cloud Dataproc, Kubernetes Engine, Compute Engine; private datacenter with ingestion ETL and Hive]

USE CASE: ALL IN CLOUD
Distributed Deep Learning
Shared Previously
• 40%+ reduction in training stage time & cost over direct access to cloud storage
What's New in 2.7
• Optimal resource utilization with NVIDIA Data Loading Library (DALI) + Alluxio
• 8-12x performance improvement in data loading and preprocessing stages
• I/O and training can now execute in parallel, eliminating serialization delays caused by the copy-to-local approach

USER STORY: HYBRID CLOUD
WeRide uses Alluxio as a Hybrid Cloud Storage Gateway
• Network egress cost savings with cross-region access over data copy-based solutions
• Multiple locations with GPU clusters access a centralized data lake in AWS for autonomous driving training
• Terabytes of data generated daily from simulations & test drives, shared across regions
[Diagram labels: Alluxio; on-premises (GPU training); public cloud]

GPU Accelerated Analytics

Alluxio and RAPIDS Accelerator for Apache Spark
A Collaboration between Alluxio and NVIDIA
Integration of RAPIDS on GPUs for compute acceleration and Alluxio for data acceleration
70% better ROI for GPU-based processing compared to CPUs
1.9x better performance for a decision support workload
[Diagram labels: Cloud Dataproc, Spark, GPU-enabled cluster]

BENCHMARKING RESULTS
90 NVIDIA DECISION SUPPORT QUERIES
CPU Config: Master: n1-standard-4, Worker: 4 x n1-standard-32 (128 cores, 480GB RAM), Cloud Costs: $7.82/hr
GPU Config: Master: n1-standard-4, Worker: 4 x n1-standard-32 (128 cores, 480GB RAM + 16 x T4), Cloud Costs: $13.41/hr
