DATA ORCHESTRATION SUMMIT
Alluxio Use Cases and Future Directions
Bin Fan - Founding Engineer, VP of Open Source @ Alluxio
Calvin Jia - Founding Engineer @ Alluxio
Data Orchestration for
Analytics & AI in the Cloud
A DATA ORCHESTRATION APPROACH
Available:
Agenda
• Alluxio Use Cases
• Future Directions
• Community Collaborations
DATA ORCHESTRATION SUMMIT 2020
Alluxio Use Cases
Companies Using Alluxio
Single Cloud & On-Prem Use Cases
USE CASE 01: CLOUD
Consistent SLAs, Performance, and
Cost Savings on cloud storage
(Diagram: TensorFlow and Alluxio, PUBLIC CLOUD)
USE CASE 02: ON PREM
Speed up analytics on on-prem
object stores
(Diagram: Spark and Alluxio, ON PREMISE)
CHALLENGES WITH CLOUD STORAGE
USE CASE 01: CLOUD
Inefficient access to cloud storage
• Performance is variable and consistent SLAs are hard to achieve
• Metadata operations are expensive & slow down workloads
• Embedded caching solutions are ineffective for ephemeral
workloads & clusters
(Diagram: TensorFlow and Alluxio)
SOLUTION
USE CASE 01: CLOUD
Consistent SLAs, Performance &
Cost Savings on cloud storage
• 40%+ reduction in AI training time & cost
• 2-8x performance with Analytics engines
• Eliminate storage access cost to cut total cost by up to 50%
• Reduce latency spikes by up to 6x using data pre-loading &
consistent performance guarantees
• Optional off-cluster caching for ephemeral workloads
(Diagram: TensorFlow and Alluxio)
CHALLENGES WITH ON-PREM OBJECT STORES
USE CASE 02: ON PREM
Slow transition to object storage
• Performance for analytics & AI workloads can be very poor
• No native support for popular frameworks
• Expensive metadata operations further reduce performance
(Diagram: Spark and Alluxio)
SOLUTION
USE CASE 02: ON PREM
Speed up analytics & AI on
on-prem object stores
• Improved performance over co-located HDFS with the
flexibility of segregated storage
• Support for multiple APIs
• No changes to the end-user experience
• Enable cheap storage at a fraction of the cost
(Diagram: Spark and Alluxio, SAME REGION)
Hybrid Cloud & Multi-Datacenter
USE CASE 03: HYBRID
Burst compute to a public cloud
and gradually migrate
(Diagram: Hive and Alluxio, PUBLIC CLOUD and ON PREMISE)
USE CASE 04: HYBRID
Hybrid Cloud Gateway to utilize
on-prem compute for data in the cloud
(Diagram: Alluxio and PyTorch, PUBLIC CLOUD and ON PREMISE)
USE CASE 05: MULTI-DATACENTER
Cross Datacenter Access without
changing Ingest Pipeline across regions
(Diagram: Presto and Alluxio, DATACENTER 1 and DATACENTER 2, INGESTION)
CHALLENGES WITH HYBRID CLOUD BURSTING
USE CASE 03: HYBRID
Migrating Analytics or AI to the
Cloud is Hard
• Repeated data access across the corporate network to a public
cloud is not feasible
• Copying data to cloud storage is time consuming and complex
• Using a cloud storage system like S3 means expensive
application changes and low performance
(Diagram: Hive and Alluxio)
SOLUTION
USE CASE 03: HYBRID
Burst Compute to a Public Cloud
and Gradually Migrate
• Performance as if data is on the cloud compute cluster
• 100% of I/O is offloaded from on-premises
• No changes to end-user experience and security model
• Common data fabric with only logical data copies
• Utilization of elastic cloud compute for up to 4x cost savings
(Diagram: Hive and Alluxio, SAME REGION)
Alluxio @ Walmart
• Zero-Copy
○ No new copies of data in the cloud
• High Performance
○ Data caching accelerates queries
• Lower Costs
○ One source of truth for data avoids
additional storage
CHALLENGES WITH HYBRID CLOUD STORAGE
USE CASE 04: HYBRID
Accessing Cloud Storage from a
Private Datacenter
• No unified view for cloud and on-prem storage
• Prohibitively high network egress costs
• Inability to utilize compute on-premises for data generated
in the cloud
• Inadequate performance for analytics and AI
(Diagram: PyTorch, ON PREMISE and PUBLIC CLOUD)
SOLUTION
USE CASE 04: HYBRID
Hybrid Cloud Storage Gateway for
data in the cloud
• Performance as if data is on the on-prem compute cluster
• Intelligent distributed caching for reads & writes
• Network cost savings of up to 80% by eliminating replication
• No changes to the end-user experience with flexible APIs and
security model on cloud storage
(Diagram: Alluxio and PyTorch, ON PREMISE and PUBLIC CLOUD)
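One way to picture the unified view across cloud and on-prem storage from use case 04 is a mount table that maps logical paths onto different underlying stores. The sketch below is a minimal illustration only, not Alluxio's implementation or API: the `MountTable` class, its method names, and the example URIs are all hypothetical, and it assumes longest-prefix matching on the logical path.

```python
from typing import Dict, Optional


class MountTable:
    """Hypothetical mount table: logical path prefixes map to storage URIs."""

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}

    def mount(self, logical_prefix: str, storage_uri: str) -> None:
        # Normalize so "/cloud/" and "/cloud" are the same mount point.
        prefix = logical_prefix.rstrip("/") or "/"
        self._mounts[prefix] = storage_uri.rstrip("/")

    def resolve(self, logical_path: str) -> str:
        # Pick the longest mounted prefix that covers the logical path.
        best: Optional[str] = None
        for prefix in self._mounts:
            covers = (
                logical_path == prefix
                or logical_path.startswith(prefix.rstrip("/") + "/")
            )
            if covers and (best is None or len(prefix) > len(best)):
                best = prefix
        if best is None:
            raise KeyError(f"no mount point covers {logical_path}")
        # Append the part of the path below the mount point to the store URI.
        relative = logical_path[len(best):].lstrip("/")
        uri = self._mounts[best]
        return f"{uri}/{relative}" if relative else uri
```

For example, after `table.mount("/", "hdfs://onprem-namenode:8020/")` and `table.mount("/cloud", "s3://example-bucket/warehouse")` (both URIs are made up), an application reads `/cloud/events/part-0.parquet` and `/logs/app.log` through one namespace without knowing which store holds which path.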
CHALLENGES WITH SUPPORTING SATELLITE CLUSTERS
ACROSS DATA CENTERS
USE CASE 05: MULTI DATACENTER
Utilization of compute resources
across datacenters
• Orchestrating data to compute clusters in another data center is
manual and time consuming
• Storing and managing multiple copies of the data is expensive
with unnecessary network traffic for replication
• Running replication frameworks on an overloaded storage
cluster dramatically impacts performance of existing workloads
(Diagram: Presto, Alluxio, and Hive across DATACENTER 1 and DATACENTER 2)
SOLUTION
USE CASE 05: MULTI DATACENTER
Cross Datacenter Access without
changing Ingest Pipeline
• No redundant data copies across datacenters
• Elimination of complex data synchronization
• 3-6x performance compared to remote data access across regions
• Self-service data infrastructure across business units
(Diagram: Presto, Alluxio, and Hive across DATACENTER 1 and DATACENTER 2)
Alluxio @ Adobe
PROBLEM
The primary DC's large Hadoop cluster is out of space, and ad hoc SQL
workloads are growing exponentially as analyst headcount has reached
1,800 people.
SOLUTION
Leverage compute resources outside of the primary on-prem DC for
multiple analytical frameworks.
RESULTS
● 80% less network usage
● More stable infrastructure
● Lower costs
● Results come in faster
● Easier to scale
● Ability to handle new analysts with no impact, and improve response times
● Self-service for end-users
(Diagram label: REMOTE DATA)
Alluxio & Data Analytics
• Data Analytics runs on Data Lakes
• Data Lakes are designed for data storage, not access
• Alluxio is the Data Orchestration layer which bridges the
compute and data layers
○ If the Data Lake is remote
○ If the Data Lake is overloaded
○ If the Data Lake has variable latency
○ If the Data Lake has low performance
○ If the Data Lake doesn’t support the same semantics
○ ...
Growing Workloads
Alluxio & AI w/ K8s
• Machine Learning & AI run on Data Lakes
• Compared to Data Analytics, AI workloads have different
characteristics, but a similar mismatch between compute
and storage
Alluxio & AI - Better Together
• Access Pattern - Repeated access on a dataset
• Dataset - Many small files
• Preferred API - POSIX Filesystem
• Workload Regularity - Predictable, bulk access
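The access pattern listed above (repeated passes over a dataset of many small files) is the case where a cache between compute and storage pays off most. The following is a rough sketch under stated assumptions, not Alluxio code: `ReadThroughCache`, `simulate_epochs`, the 100-byte file size, and the file paths are all hypothetical, and a simple LRU policy stands in for whatever eviction policy a real deployment uses.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Hypothetical LRU read-through cache keyed by file path."""

    def __init__(self, fetch: Callable[[str], bytes], capacity_bytes: int) -> None:
        self._fetch = fetch              # slow read from the remote store
        self._capacity = capacity_bytes  # total bytes the cache may hold
        self._used = 0
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, path: str) -> bytes:
        if path in self._entries:
            self._entries.move_to_end(path)  # mark as most recently used
            self.hits += 1
            return self._entries[path]
        self.misses += 1
        data = self._fetch(path)
        self._admit(path, data)
        return data

    def _admit(self, path: str, data: bytes) -> None:
        if len(data) > self._capacity:
            return  # object is larger than the whole cache, skip caching
        while self._used + len(data) > self._capacity:
            _, evicted = self._entries.popitem(last=False)  # evict LRU entry
            self._used -= len(evicted)
        self._entries[path] = data
        self._used += len(data)


def simulate_epochs(num_files: int, epochs: int, capacity_bytes: int) -> ReadThroughCache:
    """Read the same small files once per epoch, as a training loop would."""

    def fetch(path: str) -> bytes:
        return b"x" * 100  # stand-in for a 100-byte remote object

    cache = ReadThroughCache(fetch, capacity_bytes)
    for _ in range(epochs):
        for i in range(num_files):
            cache.read(f"/dataset/sample_{i}.bin")
    return cache
```

With 10 files, 3 epochs, and room for the whole dataset (`simulate_epochs(10, 3, 1000)`), only the first epoch goes to the remote store. If the cache holds only half the dataset, a sequential scan under LRU evicts each file before it is read again, so sizing the cache to the working set (or pre-loading it) matters for this predictable, bulk access pattern.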
Powered by the Community
• Future directions and growing workloads for Alluxio are
greatly influenced by our community! Thank you!
Community Collaborations
Alluxio Open Source Project Stats
Latest stable release: 2.4.1
Total number of contributors: 1092
1013 more commits since v2.1.0 (Nov 2019, 1st Summit)
5100+ Slack users (alluxio.io/slack)
Fast Growing User Slack Channel
alluxio.io/slack
Production Deployments at Scale
● Top-tier cell phone provider
○ 3000+ Alluxio servers in a single cluster
● Top-tier social network company
○ 10,000+ concurrent Alluxio clients
○ 10+ PB of data managed
Special Interest Groups in Ecosystem
● SIG in Machine Learning/K8s on Alluxio
■ Regular Community R&D meetings
■ Re-implemented JNI-based FUSE integration
■ Performance optimizations for small files, RPCs
● A new SIG kicked off in Presto on Alluxio
Experimental Two-week Release Cycle
● Previous release cadence: quarterly
● New experimental release schedule:
○ every two weeks
○ starting early December!
● What does it bring to the Alluxio community?
○ deliver features and bug fixes faster
Join the Alluxio Community!
alluxio.io/slack Alluxio-Global-Online-Meetup/
