Alluxio – Virtual Unified File System
Li Haoyuan – Founder and CEO at Alluxio
haoyuan@alluxio.com
The Global Datasphere will grow from 33 ZB in 2018 to 175 ZB by 2025.
China’s Datasphere is expected to grow 30% on average over the next 7 years and will be the largest Datasphere of all regions by 2025.
Source: IDC White Paper – #US44413318
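As a quick sanity check on the figures above, growth from 33 ZB to 175 ZB over 7 years works out to a compound rate of roughly 27% per year, which can be set beside the 30% average quoted for China if that figure is read as an annual rate. The sketch below assumes constant year-over-year growth, which the IDC paper does not necessarily claim; the function name `cagr` is illustrative only.

```python
# Compound annual growth rate implied by the IDC figures quoted on the slide.
start_zb, end_zb = 33.0, 175.0   # Global Datasphere in 2018 and 2025
years = 2025 - 2018              # 7 years


def cagr(start, end, years):
    """Return the constant yearly growth rate that turns `start` into `end`."""
    return (end / start) ** (1.0 / years) - 1.0


global_rate = cagr(start_zb, end_zb, years)
print(f"Implied global growth: {global_rate:.1%} per year")
```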
We are in the era where data is your biggest asset
Extracting maximum value from your data
The Data Ecosystem Evolution
Data Ecosystem Beta vs. Data Ecosystem 1.0
[Diagram: COMPUTE and STORAGE layers in each ecosystem]
Data Ecosystem 1.0 – The Challenges
[Diagram labels: STORAGE, COMPUTE]
• Complex
• Low performance
• Expensive
3 big trends driving the need for a new architecture
• Separation of Compute & Storage
• Hybrid & Multi-cloud environments
• Self-service data across the enterprise
The Data Architecture for the Digital Future
Core requirements of the 2.0 data ecosystem
• Unified
• Memory-first
• Native APIs
• Multi-hybrid cloud
A Virtual Unified File System
Client-side interfaces: Java File API, HDFS Interface, S3 Interface, REST API, FUSE Interface
Storage drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
Key Innovations of the Virtual Unified File System
• Unified Namespace: Bring all files into a single interface
• API Translation: Interact with data using any API
• Intelligent Multi-tiering: Accelerate & tier data transparently
Unified Namespace: Global Data Accessibility
FUSE Interface makes all enterprise data available locally
SUPPORTS
• HDFS
• NFS
• OpenStack
• Ceph
• Amazon S3
• Azure
• Google Cloud
IT OPS FRIENDLY
• Storage mounted into Alluxio by central IT
• Security in Alluxio mirrors source data
• Authentication through LDAP/AD
• Wireline encryption
[Diagram labels: HDFS #1, Object Store, NFS, HDFS #2]
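The idea of mounting several storage systems into one namespace can be sketched with a toy mount table. This is an illustration only, not Alluxio's implementation or API: the class `MountTable`, its methods, and the host and bucket names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MountTable:
    """Toy mount table: maps namespace prefixes to under-storage URIs."""
    mounts: dict = field(default_factory=dict)

    def mount(self, namespace_path: str, storage_uri: str) -> None:
        # Normalize trailing slashes so prefix matching stays simple.
        self.mounts[namespace_path.rstrip("/")] = storage_uri.rstrip("/")

    def resolve(self, path: str) -> str:
        # Longest-prefix match, so nested mounts win over their parents.
        for prefix in sorted(self.mounts, key=len, reverse=True):
            if path == prefix or path.startswith(prefix + "/"):
                return self.mounts[prefix] + path[len(prefix):]
        raise FileNotFoundError(f"no mount covers {path}")


# Hypothetical storage endpoints, mirroring the diagram labels above.
table = MountTable()
table.mount("/hdfs1", "hdfs://namenode1:8020/data")
table.mount("/objects", "s3://example-bucket/lake")
table.mount("/nfs", "file:///mnt/nfs")

print(table.resolve("/objects/sales/2018.parquet"))
print(table.resolve("/hdfs1/logs/app.log"))
```

Applications see only paths such as `/objects/...`; which storage system sits behind a prefix is decided once, at mount time.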
Server-side API Translation: From legacy to modern
Convert from Client-side Interface to native Storage Interface
Client-side interfaces: Java File API, HDFS Interface, S3 Interface, REST API, FUSE Interface
Storage drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
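A minimal sketch of the translation idea: different client-side interfaces are thin facades over one driver interface, so data written through one API can be read through another. All class and method names here are hypothetical, and the in-memory driver is only a stand-in for a real HDFS, Swift, S3, or NFS driver.

```python
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Hypothetical driver interface; one subclass per storage system."""

    @abstractmethod
    def read(self, key: str) -> bytes: ...

    @abstractmethod
    def write(self, key: str, data: bytes) -> None: ...


class InMemoryDriver(StorageDriver):
    """Stand-in for a real storage driver, backed by a dict."""

    def __init__(self):
        self._blobs = {}

    def read(self, key: str) -> bytes:
        return self._blobs[key]

    def write(self, key: str, data: bytes) -> None:
        self._blobs[key] = bytes(data)


class FileStyleInterface:
    """Client-side facade with path-based calls."""

    def __init__(self, driver: StorageDriver):
        self._driver = driver

    def write_file(self, path: str, data: bytes) -> None:
        self._driver.write(path.lstrip("/"), data)

    def read_file(self, path: str) -> bytes:
        return self._driver.read(path.lstrip("/"))


class ObjectStyleInterface:
    """Client-side facade with bucket/key calls."""

    def __init__(self, driver: StorageDriver):
        self._driver = driver

    def put_object(self, bucket: str, key: str, body: bytes) -> None:
        self._driver.write(f"{bucket}/{key}", body)

    def get_object(self, bucket: str, key: str) -> bytes:
        return self._driver.read(f"{bucket}/{key}")


driver = InMemoryDriver()
files = FileStyleInterface(driver)
objects = ObjectStyleInterface(driver)

files.write_file("/logs/app.txt", b"hello")
print(objects.get_object("logs", "app.txt"))  # same bytes, read via the other API
```

Because both facades map onto the same driver keys, swapping `InMemoryDriver` for another driver would not require changes in client code.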
Intelligent Multi-tiering: Get high-value data faster
Local performance from remote data using multi-tier storage
Hot: RAM
Warm: SSD
Cold: HDD
Read & Write Buffering
Transparent to App
Policies for pinning, promotion/demotion, TTL
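The policies named above (pinning, promotion/demotion, TTL) can be illustrated with a toy three-tier cache. This is a sketch under simplifying assumptions, not Alluxio's tiering logic: capacities are counted in entries rather than bytes, demotion is least-recently-used, and the class `TieredStore` with all of its methods is hypothetical.

```python
import time
from collections import OrderedDict


class TieredStore:
    """Toy three-tier cache: level 0 = hot (RAM), 1 = warm (SSD), 2 = cold (HDD)."""

    def __init__(self, capacities=(2, 4, 8), clock=time.monotonic):
        # One LRU-ordered dict per tier, fastest tier first.
        self.tiers = [OrderedDict() for _ in capacities]
        self.capacities = list(capacities)
        self.pinned = set()
        self.expiry = {}
        self.clock = clock

    def pin(self, key):
        """Pinned keys are never chosen for demotion."""
        self.pinned.add(key)

    def put(self, key, value, ttl=None):
        self._remove(key)
        self.expiry.pop(key, None)
        if ttl is not None:
            self.expiry[key] = self.clock() + ttl
        self._insert(0, key, value)

    def get(self, key):
        if key in self.expiry and self.clock() >= self.expiry[key]:
            self._remove(key)              # TTL elapsed: drop from every tier
            self.expiry.pop(key, None)
            return None
        for tier in self.tiers:
            if key in tier:
                value = tier.pop(key)
                self._insert(0, key, value)  # promote to the hot tier
                return value
        return None

    def tier_of(self, key):
        for level, tier in enumerate(self.tiers):
            if key in tier:
                return level
        return None

    def _remove(self, key):
        for tier in self.tiers:
            tier.pop(key, None)

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            self.expiry.pop(key, None)     # fell off the coldest tier: evicted
            return
        tier = self.tiers[level]
        tier[key] = value
        while len(tier) > self.capacities[level]:
            # Demote the least recently used entry that is not pinned.
            victim = next((k for k in tier if k not in self.pinned), None)
            if victim is None:
                break                      # everything is pinned; allow overflow
            self._insert(level + 1, victim, tier.pop(victim))


store = TieredStore(capacities=(2, 2, 2))
for name in ("a", "b", "c"):
    store.put(name, name.upper())
print(store.tier_of("a"))   # 1: "a" was demoted to the warm tier
store.get("a")
print(store.tier_of("a"))   # 0: reading "a" promoted it back to the hot tier
```

The application only calls `put` and `get`; movement between tiers happens behind those calls, which is the sense in which tiering is transparent to the app.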
Real world Use cases
Popular Technical Use Cases
Categories: Virtual Data Lake; Machine Learning Productivity; Self-service data across hybrid cloud
• Accelerate batch, micro-batch & streaming jobs
• Slowly transition to lower cost object stores
• Run in hybrid cloud environment with compute in the cloud
• Accelerate ML jobs running on object stores or file systems
• Provide consistent performance to data scientists
• Provide unified interface to access all data
• Accelerate & tier data transparently across storage tiers
• Co-locate remote data with compute for performance
100+ Known Production Deployments
Massive clusters deployed, many with 500+ nodes
Financial Services Case Study
Machine Learning Use Case
Challenge – Gain an end-to-end view of the business with a large volume of data. Queries were slow / not interactive, resulting in operational inefficiency.
Solution – ETL data from Teradata to Alluxio.
Impact – Faster time to market: “Now we don’t have to work Sundays”
[Diagram labels: SPARK, TERADATA]
Retail Case Study
Customer Analytics Use Case
Challenge – Bottleneck in trend analysis of mission-critical daily sales and inventory management. Queries were slow / not interactive, resulting in operational inefficiency.
Solution – With Alluxio, data queries are 10X faster.
Impact – Higher operational efficiency
[Diagram labels: SPARK, HDFS]
Telecom Case Study
Customer 360 Insights
Challenge – Desired a central view of consumer information in near real time for proactive support. Many HDFS clusters, different distributions, many incompatible versions. On-prem & cloud. Integration through heavy ETL.
Solution – Alluxio integrates data into a central catalog for fast access to consumer interaction records.
Impact – Reduced integration time. Faster data speed & freshness.
[Diagram labels: HADOOP, ML, ETL; HDFS on HDP, CDH and MAPR]
Machine Learning / Deep Learning – Maximizes GPU investment:
• Self-serve data access for data scientists
• Rapid integration of new data sources
• Improved memory management & performance
Incredible Open Source Momentum with growing community
• 920+ contributors & growing
• 3760+ GitHub stars
• Apache 2.0 Licensed
• Hundreds of thousands of downloads
Download Alluxio today @ www.alluxio.org
Thank You
Join the Alluxio Community
www.alluxio.org | www.alluxio.com | Twitter: @alluxio
