Architect a Heterogeneous Data Platform Across Clusters,
Regions, and Clouds
Bin Fan (binfan@alluxio.com)
Founding Engineer, VP of Open Source @ Alluxio
ALLUXIO 2
About Me
Bin Fan (https://www.linkedin.com/in/bin-fan/)
● Founding Engineer, VP Open Source @ Alluxio
● Alluxio PMC Co-Chair, Presto TSC/committer
● Email: binfan@alluxio.com
● PhD in CS @ Carnegie Mellon University
● Originally a research project (Tachyon) at UC Berkeley AMPLab, led by then-PhD student
Haoyuan Li (Alluxio founder and CEO)
● Backed by top VCs (e.g., Andreessen Horowitz) with $70M raised in total; Series C ($50M)
announced in 2021
● Deployed in production at large scale at Facebook, Uber, Microsoft, Tencent, TikTok, and others
● More than 1,200 contributors on GitHub. In 2021, more than 40% of the commits on GitHub were
contributed by community users
● Ranked the 9th most critical Java-based open-source project on GitHub by Google/OpenSSF [1]
Alluxio Overview
[1] Google Comes Up With A Metric For Gauging Critical Open-Source Projects
Alluxio (Tachyon) back in 2015
Screenshot of Tachyon talk at AMPLab back in 2015
What is Tachyon Stack Release Growth
Screenshot of the Tachyon talk at an AMPLab event
Alluxio (Tachyon) in 2015
[Diagram labels: Spark Task 1, Spark Task 2; RDD; Tachyon (in-memory); blocks 1 to 4; HDFS (disk); HDFS / Amazon S3]
Topology
● On-prem Hadoop → Cloud-native, Multi- or Hybrid-cloud,
Multi-datacenter
Computation
● MR/Spark → Spark, Presto, Hive, TensorFlow, PyTorch, ...
● More mature frameworks (e.g., less frequent OOM)
Data access pattern
● Sequential-read (e.g., scanning) on unstructured files → Ad-hoc
read into structured/columnar data
● Hundreds to thousands of big files → millions of small files
What's Different Today
Data Storage
● On-prem, co-located HDFS → S3 and other object stores
(possibly across regions such as us-east and us-west),
with legacy on-prem HDFS still in service
Resource/Job Orchestration
● YARN → K8s
○ Lost focus on data locality
The Evolution from Hadoop to Cloud-native Era
Unprecedented Complexity of Data Platforms
Data Trend → Complex Platform
● New compute and storage tech created every 3-8 years → On-premises, cloud, hybrid, and multi-cloud environments all have different properties
● More data generated every day and stored in data silos → Data copies, synchronization costs
● More people and teams need to access and leverage this data → Multiple APIs necessitate integration and application rewrites
Inefficient Manual Copy Across Data Centers, Regions, Clouds
[Diagram labels: Region A and Region B (Amazon EMR, Cloud Dataproc, Kubernetes Engine, Compute Engine, Hive); private data centers (Datacenter 1, Datacenter 2); error-prone and network-intensive data copies]
Strong Market Demand For Simplification
● Efficient access & data management: acceleration and auto-tiering of remote data sources
● Environment agnosticity: agility across regions for private, hybrid, or multi-cloud
● Unification of data lakes: serve analytics & AI from multiple data locations
Analytics & AI
in the Hybrid & Multi-Cloud Era
No-copy data access across silos
agnostic to compute engine
Foundation of a heterogeneous data
platform across geos
SOLUTION
Multi-Cloud Ready Analytics & AI Platform
[Diagram labels: Region A, Region B, GKE, Datacenter 1, Datacenter 2, HMS]
Open Source Started From UC Berkeley AMPLab in 2014
● Join the conversation on Slack: alluxio.io/slack
● 1,200+ contributors and growing
● 9,000+ Slack community members
● Top 10 most critical Java-based open-source project
● One of GitHub’s top 100 most valuable repositories, out of 96 million
COMPANIES USING ALLUXIO
[Logo wall by sector: Internet, Public Cloud Providers, General, E-commerce, Technology, Financial Services, Telco & Media, Others]
Examples to eliminate data copies
Case Studies
Top Online Travel Platform: Unify Data Lake Across Multiple Geo Regions in AWS
Problems Encountered
● Data silos caused by different brands/teams ingesting data dispersed across multiple AWS regions
● Central analytics queries across data silos suffered from poor user experience and long time to insight
● Manual replication resulted in inefficiencies, operational overhead, and expensive S3 egress costs
Alluxio’s Solution
● Unify data silos without the need to copy or move data
● Federate data lakes without replication and serve various compute engines
Results Achieved
● Enhanced user experience with consistent, high-performance analytics, reducing time to insight
● 50% reduced cost per query
[Diagram labels: Team A, Team B, Team C; main region: central analytics; us-west-1, us-east-1, us-east-2, us-west-2; Hive; mounted]
Real-time responses & analysis, while
saving costs on S3 storage
Problems Encountered
● Newly introduced chatbot to better manage communication with gamers globally
● Presto engine performs a huge number of queries to support instantaneous responses
● Urgently looking for a new solution to slash costs without losing performance
Alluxio’s Solution
[Diagram not recoverable]
Results Achieved
● 2-4x average performance improvement
● 7x key query speed-up
● At least 50% cost saving
PUBLIC CLOUD
Large Scale Analytics within a Single Cloud
A Typical Customer Journey
Example Journeys
1. Unified namespace: mount HDFS and object storage into a common Alluxio cluster
2. Object store analytics: caching layer to speed up Presto and Spark jobs
3. Hybrid-cloud: burst compute to a single public cloud first; run analytics on K8s and cloud-native AI for model training
4. Multi-cloud: replicate the setup on AWS to Google Cloud; choose the right tool for the job, regardless of the cloud provider
EXAMPLE JOURNEY 01
On-premises HDFS to Object Storage to Hybrid Cloud
EXAMPLE JOURNEY 01
On-premises object storage as the source of truth
[Diagram labels: Region A and Region B, multiple instances (Amazon EMR, Cloud Dataproc, Kubernetes Engine, Compute Engine); private data centers; datacenter with ingestion, ETL, and Hive]
Enable a Hybrid Data Lake
Architecture Overview
ARCHITECTURE
[Diagram: Alluxio clients reach the Alluxio master (with standby masters, coordinated by consensus) over the control path and the Alluxio workers (RAM / SSD / HDD) over the data path; the workers connect across the WAN to Under Store 1 and Under Store 2]
DATA LOCALITY WITH SCALE-OUT WORKERS
Local performance for remote data with intelligent multi-tiering
● Tiers: hot (RAM), warm (SSD), cold (HDD)
● Read & write buffering, transparent to the application
● Policies for pinning, promotion/demotion, and TTL
[Diagram labels: on-premises; public cloud; model training, big data ETL, big data query]
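The tiering behavior named on this slide (hot/warm/cold tiers, pinning, promotion/demotion, TTL) can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not Alluxio's implementation: the class `TieredCache`, its method names, and the per-tier block-count capacities are all hypothetical, and eviction here is plain least-recently-used.

```python
import time
from collections import OrderedDict

TIERS = ["RAM", "SSD", "HDD"]  # hottest to coldest, as on the slide


class TieredCache:
    """Toy multi-tier block cache (illustrative only, not Alluxio code)."""

    def __init__(self, capacity, ttl=None, clock=time.monotonic):
        self.capacity = dict(capacity)   # max number of blocks per tier
        self.ttl = ttl                   # seconds, or None for no expiry
        self.clock = clock
        self.pinned = set()
        # One dict per tier, ordered from least to most recently used.
        self.tiers = {name: OrderedDict() for name in TIERS}

    def pin(self, block_id):
        """Pinned blocks are never demoted or evicted by capacity pressure."""
        self.pinned.add(block_id)

    def tier_of(self, block_id):
        return next((t for t in TIERS if block_id in self.tiers[t]), None)

    def put(self, block_id, data):
        self._place(block_id, data, self.clock(), "RAM")

    def get(self, block_id):
        tier = self.tier_of(block_id)
        if tier is None:
            return None
        data, stored_at = self.tiers[tier][block_id]
        if self.ttl is not None and self.clock() - stored_at > self.ttl:
            del self.tiers[tier][block_id]             # expired by TTL
            return None
        self._place(block_id, data, stored_at, "RAM")  # promote on access
        return data

    def _place(self, block_id, data, stored_at, tier):
        for blocks in self.tiers.values():
            blocks.pop(block_id, None)
        self.tiers[tier][block_id] = (data, stored_at)
        self._demote_overflow(tier)

    def _demote_overflow(self, tier):
        blocks = self.tiers[tier]
        while len(blocks) > self.capacity[tier]:
            victim = next((b for b in blocks if b not in self.pinned), None)
            if victim is None:
                return                                 # only pinned blocks remain
            data, stored_at = blocks.pop(victim)
            colder = TIERS.index(tier) + 1
            if colder < len(TIERS):
                self._place(victim, data, stored_at, TIERS[colder])
            # otherwise the block falls off the coldest tier (evicted)
```

With one block of capacity per tier, writing blocks a, b, and c in turn leaves c in RAM, b on SSD, and a on HDD; reading a afterwards promotes it back to RAM and pushes the others down one tier.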
METADATA LOCALITY WITH SCALABLE MASTERS
Synchronization of changes across clusters
[Diagram labels: mutation (old file at path /file1 replaced by new file at path /file1); metadata synchronization; Alluxio master; RocksDB; policies for pinning, promotion/demotion, TTL; on-premises; public cloud; model training, big data ETL, big data query; RAM, SSD]
SEAMLESS CATALOG DEFINITIONS
No table redefinitions required using “Transparent URI”
Example Scenario
I. Initial state
A. Data in HDFS
B. Hive Metastore table definitions pointing to HDFS
II. Compute cluster with Alluxio
A. Catalog points to Hive Metastore
B. Alluxio intercepts Presto calls to HDFS
III. Query execution
A. Accesses to HDFS are served by Alluxio
B. No manual data copies or application re-writes
[Diagram labels: public cloud: Presto catalog, Hive connector, hdfs://ns/table, Presto, Alluxio; on-premises: Hive Metastore, HDFS; steps I, II, and III as listed in the scenario]
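The idea behind the scenario (table definitions keep their hdfs:// locations while reads are served through Alluxio) can be sketched as a routing decision on the URI. This is a simplified illustration, not the product's mechanism: the function `route`, the `MOUNTS` table, and the `/mnt/hdfs/` path are hypothetical names chosen for the example.

```python
# Hypothetical mount table: under-store URI prefix -> Alluxio path prefix.
MOUNTS = {
    "hdfs://ns/": "/mnt/hdfs/",
}


def route(uri, mounts=MOUNTS):
    """Decide how a table location taken from the catalog would be served.

    Returns ("alluxio", path) when a mounted prefix covers the URI,
    otherwise ("direct", uri). The catalog entry itself is never rewritten.
    """
    # Try longer prefixes first so the most specific mount wins.
    for prefix in sorted(mounts, key=len, reverse=True):
        if uri.startswith(prefix):
            return ("alluxio", mounts[prefix] + uri[len(prefix):])
    return ("direct", uri)
```

The point of the design is that the Hive Metastore keeps returning `hdfs://ns/table`; only the client-side lookup changes, which is why no table redefinition is needed.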
DEPLOYMENT APPROACHES
● Co-locate Alluxio workers with compute for optimal I/O performance (Spark and Alluxio in the same cluster; storage in a remote cluster)
● Deploy Alluxio as a standalone cluster between compute and storage (Spark or Presto, with Alluxio in the same data center / region; storage in a remote cluster)
[Diagram labels: long-running instances, ephemeral, elastic]
UNIFIED NAMESPACE
With Replication & Live Data Migration Capabilities
hdfs://host:port/directory/
Reports Sales
• Single Alluxio path backed by multiple storage systems
• Example policy: Migrate data older than 7 days from HDFS to S3
POLICY DRIVEN DATA MANAGEMENT
Decouple logical file system namespace from physical storage
● Example policy: move files older than 90 days from HDFS to S3
● Applications access the same path regardless of where the actual data is stored
[Diagram labels: Application, Alluxio Master, Alluxio Policy Engine]
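The example policy can be expressed as a small rule over file metadata. This is a hedged sketch rather than the policy engine's real interface: `FileEntry`, `apply_age_policy`, and the field names are hypothetical, and the "move" only flips a label where a real system would copy the bytes before switching the backing store.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileEntry:
    logical_path: str   # what applications see; never changes
    store: str          # where the bytes currently live, e.g. "HDFS" or "S3"
    modified: datetime


def apply_age_policy(files, now, max_age=timedelta(days=90),
                     source="HDFS", target="S3"):
    """Move files older than max_age from source to target.

    Returns the logical paths that were migrated.
    """
    moved = []
    for entry in files:
        if entry.store == source and now - entry.modified > max_age:
            entry.store = target          # logical_path stays the same
            moved.append(entry.logical_path)
    return moved
```

Because only `store` changes, an application that reads `/warehouse/old` before and after the migration uses the same path, which is the decoupling the slide describes.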
Training & Data Pre-processing
ML/DL
I/O Challenges in ML/DL
● Training data often consists of a massive number of small files (billions of 100 KB photos)
● The size of training data keeps growing and can exceed individual server capacity
● Training jobs are highly concurrent and require high I/O throughput to keep GPUs utilized
What's Different
Using Alluxio for DL
Distributed Caching
- Only fetch data on a cache miss
- No need to copy data before use
[Diagram labels: training instances access Alluxio servers through POSIX]
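The two bullets describe read-through caching: data is pulled from the backing store the first time it is read, rather than copied in bulk before training starts. A minimal single-process sketch follows, assuming a hypothetical `ReadThroughCache` class and a caller-supplied `fetch_from_store` function; a distributed cache would spread entries across servers and bound memory.

```python
class ReadThroughCache:
    """Toy read-through cache: fetch from the store only on a cache miss."""

    def __init__(self, fetch_from_store):
        self._fetch = fetch_from_store   # callable: path -> bytes
        self._cache = {}
        self.misses = 0

    def read(self, path):
        if path not in self._cache:
            # Cache miss: go to the (slow, remote) store exactly once.
            self.misses += 1
            self._cache[path] = self._fetch(path)
        return self._cache[path]
```

A training loop that reads the same files every epoch pays the remote fetch only in the first epoch; later epochs are served from the cache.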
Using Alluxio for DL
Distributed Caching
● Consistent performance
● Direct access to data
● Low latency and high throughput
● High GPU utilization rate
MOMO (NASDAQ: MOMO)
runs thousands of Alluxio nodes across multiple Alluxio clusters,
managing more than 100 TB of data for search and training:
● Support multiple storage & compute frameworks.
● Accelerate compute & training tasks
● Reduce the metadata and data overhead
Model Training using PyTorch + Alluxio + Ceph
● 2 billion small files
● Reduce metadata & data interactions with Ceph to improve performance
https://www.alluxio.io/resources/videos/ml-and-query-acceleration-at-momo-with-alluxio-chinese/
Large Scale Deep Learning
TOPOLOGY: ON-PREMISES
Alluxio’s Solution
Q&A
● Website: www.alluxio.io
● Slack: https://alluxio.io/slack
● Social media: Twitter.com/alluxio, Linkedin.com/alluxio
Hybrid Cloud Storage Gateway with Compute On-prem
USE CASE: HYBRID CLOUD
• 6+ Alluxio Clusters in production
• Largest Alluxio Cluster = 1000 nodes*
• 2.5x performance improvement for I/O-intensive
queries and 1.2x on average
• 30% reduction in query failures due to timeouts
* Largest single Alluxio cluster across any customer is 3000 nodes
Multiple Analytics Clusters On-premises
USE CASE: ALL IN DATACENTER
[Diagram labels: Spark SQL, Impala, Alluxio]
Cross Datacenter Access without changing Ingest Pipeline
USE CASE: MULTI DATACENTER
[Diagram labels: Datacenter 1, Datacenter 2, Trino, Alluxio, Hive, remote data]
RESULTS
• Ad-hoc SQL workloads run in a secondary DC as analyst
headcount reached 1,800 people
• Leverage a 220+ node Alluxio cluster for compute resources
outside the primary DC
• 40%+ reduction in AI training time & cost
• Data prefetching using asynchronous loading
• 200 GPU instances with 4x NVIDIA V100
• Alluxio uses CPU cores and NVMe
Large Scale Deep Learning
USE CASE: ALL IN CLOUD
Compute in GCP with Data On-premises
USE CASE: HYBRID CLOUD
• 2x performance improvement for range queries
• Improved Concurrency and Pinning
• Elastic compute for up to 2x cost savings
• Auto-scaling of Alluxio workers
Burst Compute to AWS with Storage On-prem
USE CASE: HYBRID CLOUD
• 4+ Alluxio clusters on-prem with
synchronization requirements
• Alluxio as the only way to read or
write to on-prem data lake
• Compute stack cloned in second CSP
[Diagram labels: Amazon EMR, Cloud Dataproc, Kubernetes Engine, Compute Engine; private datacenter with ingestion, ETL, and Hive]
Shared Previously
• 40%+ reduction in training stage time & cost
over direct access to cloud storage
What's New in 2.7
• Optimal resource utilization with NVIDIA Data
Loading Library (DALI) + Alluxio
• 8-12x performance improvement in data loading
and preprocessing stages
• I/O and training can now execute in parallel,
eliminating serialization delays caused by the
copy-to-local approach
Large Scale Deep Learning
USE CASE: ALL IN CLOUD
[Diagram label: Distributed Deep Learning]
WeRide uses Alluxio as a Hybrid Cloud Storage Gateway
USER STORY: HYBRID CLOUD
[Diagram labels: Alluxio, on-premises, public cloud]
• Network egress cost savings with cross-region access over
data copy-based solutions
• Multiple locations with GPU clusters access a centralized
data lake in AWS to train autonomous-driving models
• Terabytes of data generated daily from simulations & test
drives shared across regions
GPU training
GPU Accelerated Analytics
Alluxio and RAPIDS Accelerator for Apache Spark
A Collaboration between Alluxio and NVIDIA
Integration of RAPIDS on GPUs for compute acceleration and Alluxio for data acceleration
70% better ROI for GPU-based processing compared to CPUs
1.9x better performance for a decision support workload
[Diagram labels: Cloud Dataproc, Spark, GPU-enabled cluster]
BENCHMARKING RESULTS
90 NVIDIA DECISION SUPPORT QUERIES
CPU Config: Master: n1-standard-4, Worker: 4 x n1-standard-32 (128 cores, 480GB RAM), Cloud Costs: $7.82/hr
GPU Config: Master: n1-standard-4, Worker: 4 x n1-standard-32 (128 cores, 480GB RAM + 16 x T4), Cloud Costs: $13.41/hr
