Alluxio: Unify Data at Memory Speed
• Yupeng Fu
2
Outline
1. Why we built Alluxio
2. Alluxio’s innovations
3. Alluxio’s Architecture
4. What’s new in 1.7.0
3
Data Ecosystem Develops
• One Compute Framework
• Single Storage System
• Co-located
ETL
ETL
ETL
4
Data Ecosystem Explodes
…
• Many Compute Frameworks
• Many Storage Systems
• Most not co-located
…
5
Data Ecosystem Issues
• Each app manages multiple data sources
• Data source changes require global updates
• Storage optimizations require app changes
• Poor performance due to lack of locality
…
…
6
Data Ecosystem Challenges
1 Speed & Complexity
• Integration and interoperability issues (on prem, hybrid, cloud)
• Many departments & groups
2 Data Freshness
• Cross-network movement is slow
• Copies create lag
• Data quality suffers with copies
3 Cost
• Many-to-many integrations are expensive
• Data duplication
4 Security & Governance
• Data security & governance is increasingly complex
Heavy integrations create painful organizational drag
7
Data Ecosystem with Alluxio
• Apps only talk to Alluxio
• Simple Add/Remove
• No App Changes
• Highest performance in Memory
Java File API, HDFS Interface, Amazon S3 Interface, REST Web Service
HDFS Interface, Amazon S3 Interface, Swift Interface, NFS Interface
…
…
8
Alluxio Design Principles
1 Big Data & Machine Learning
• Interoperability with leading projects
• Large scale data sets
• High IO
2 Data Sharing
• Don’t own the data
• Multiple apps sharing common data
• Data stored in multiple, hybrid systems
3 High Speed Data Access
• Remote data
• Hot/warm/cold data
• Temporary data
• Read/write support
4 Enterprise Class
• Distributed architecture
• Commodity hardware
• Service-oriented
• High availability
• Security
9
Outline
1. Why we built Alluxio
2. Alluxio’s innovations
3. Alluxio’s Architecture
4. What’s new in 1.7.0
5. Demo
10
Alluxio Innovation:
Server-side API Translation
Convert from Client-side Interface to Native Storage Interface
HDFS Interface / S3 Interface
HDFS Interface, S3A Interface, Swift Interface, Google Cloud Interface
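The translation idea on this slide can be sketched in a few lines of Python. This is an illustration only, not Alluxio's implementation: the class names (`UnderStore`, `InMemoryObjectStore`, `InMemoryFileStore`, `TranslationServer`) are hypothetical, and both back ends are in-memory stand-ins for real storage systems. The point is that the client calls one interface and the server-side adapter decides how to express the request natively.

```python
from abc import ABC, abstractmethod
from typing import Dict


class UnderStore(ABC):
    """Hypothetical adapter for one native storage interface."""

    @abstractmethod
    def read(self, key: str) -> bytes:
        ...

    @abstractmethod
    def write(self, key: str, data: bytes) -> None:
        ...


class InMemoryObjectStore(UnderStore):
    """Stand-in for an object store: flat keys, no real directories."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def read(self, key: str) -> bytes:
        return self._objects[key]

    def write(self, key: str, data: bytes) -> None:
        self._objects[key] = data


class InMemoryFileStore(UnderStore):
    """Stand-in for a hierarchical store: nested directories."""

    def __init__(self) -> None:
        self._root: dict = {}

    def read(self, key: str) -> bytes:
        node = self._root
        for part in key.strip("/").split("/"):
            node = node[part]
        return node

    def write(self, key: str, data: bytes) -> None:
        parts = key.strip("/").split("/")
        node = self._root
        for part in parts[:-1]:
            node = node.setdefault(part, {})  # create directories on demand
        node[parts[-1]] = data


class TranslationServer:
    """Accepts file-style calls and forwards them to whichever store backs it."""

    def __init__(self, store: UnderStore) -> None:
        self._store = store

    def create(self, path: str, data: bytes) -> None:
        self._store.write(path, data)

    def open_and_read(self, path: str) -> bytes:
        return self._store.read(path)


# The same client calls work against either back end.
for backend in (InMemoryObjectStore(), InMemoryFileStore()):
    server = TranslationServer(backend)
    server.create("/logs/day1", b"hello")
    assert server.open_and_read("/logs/day1") == b"hello"
```

Because the adapter lives on the server side in this sketch, adding a new storage system means adding one adapter class rather than changing every client.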
11
Alluxio Innovation:
Server-side API Translation
Convert between different versions of HDFS
HDFS 2.7 Interface
HDP 2.4 Interface, CDH 5.6 Interface, MAPR 5.2 Interface
12
Alluxio Innovation:
Unified Namespace
Enables effective data management across different Under Stores
Uses Mounting with Transparent Naming
13
Alluxio Innovation:
Unified Namespace
Create a catalog of available data sources for Data Scientists
/finance/customer-transactions/
/finance/vendor-transactions/
/operations/device-logs/
/operations/phone-call-recordings/
/operations/check-images/
/research/us-economic-data/
/research/intl-economic-data/
/marketing/advertising-dataset/
/marketing/marketing-funnel-dataset/
alluxio://
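Mounting with transparent naming can be pictured as a table that maps paths in the single alluxio:// namespace to locations in different under stores. The sketch below is a hypothetical `MountTable`, not Alluxio's code, and the under-store URIs (`hdfs://finance-cluster/data`, `s3a://example-bucket/us`) are made up for illustration. It resolves a path by picking the longest mount point that covers it.

```python
from typing import Dict


class MountTable:
    """Hypothetical mount table: namespace path prefix -> under-store URI."""

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}

    def mount(self, alluxio_path: str, ufs_uri: str) -> None:
        self._mounts[alluxio_path.rstrip("/")] = ufs_uri.rstrip("/")

    def resolve(self, alluxio_path: str) -> str:
        path = alluxio_path.rstrip("/")
        best = None
        for mount_point in self._mounts:
            covers = path == mount_point or path.startswith(mount_point + "/")
            # The longest matching mount point wins, so nested mounts work.
            if covers and (best is None or len(mount_point) > len(best)):
                best = mount_point
        if best is None:
            raise KeyError(f"no mount covers {alluxio_path}")
        return self._mounts[best] + path[len(best):]


table = MountTable()
table.mount("/finance", "hdfs://finance-cluster/data")           # assumed URI
table.mount("/research/us-economic-data", "s3a://example-bucket/us")  # assumed URI

print(table.resolve("/finance/customer-transactions/2018.csv"))
print(table.resolve("/research/us-economic-data/gdp.csv"))
```

Applications only ever see the left-hand paths, so a data source can move to a different under store by changing one mount entry.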
14
Alluxio Innovation:
Intelligent Cache
Local performance from remote data using native multi-tier storage
RAM (Hot)
SSD (Warm)
HDD (Cold)
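A minimal sketch of multi-tier caching follows, under simplifying assumptions that are not from the slides: capacities are counted in blocks rather than bytes, each tier evicts its least recently used block, evicted blocks move one tier down, and any access promotes a block back to the top tier. The `TieredCache` class is hypothetical, and Alluxio's actual eviction and promotion policies may differ.

```python
from collections import OrderedDict
from typing import Dict, List, Optional, Tuple


class TieredCache:
    """Hypothetical hot/warm/cold cache; capacity is a block count per tier."""

    def __init__(self, tiers: List[Tuple[str, int]]) -> None:
        self._names = [name for name, _ in tiers]
        self._capacity = dict(tiers)
        self._blocks: Dict[str, OrderedDict] = {n: OrderedDict() for n in self._names}

    def put(self, block_id: str, data: bytes) -> None:
        self._remove(block_id)
        self._insert(block_id, data, level=0)

    def get(self, block_id: str) -> Optional[bytes]:
        data = self._remove(block_id)
        if data is not None:
            self._insert(block_id, data, level=0)  # promote on access
        return data

    def tier_of(self, block_id: str) -> Optional[str]:
        for name in self._names:
            if block_id in self._blocks[name]:
                return name
        return None

    def _remove(self, block_id: str) -> Optional[bytes]:
        for tier in self._blocks.values():
            if block_id in tier:
                return tier.pop(block_id)
        return None

    def _insert(self, block_id: str, data: bytes, level: int) -> None:
        if level >= len(self._names):
            return  # fell out of the coldest tier; only the under store has it now
        name = self._names[level]
        tier = self._blocks[name]
        tier[block_id] = data
        if len(tier) > self._capacity[name]:
            victim_id, victim = tier.popitem(last=False)  # least recently used
            self._insert(victim_id, victim, level + 1)


cache = TieredCache([("RAM", 1), ("SSD", 1), ("HDD", 1)])
for block in ("a", "b", "c"):
    cache.put(block, block.encode())
print(cache.tier_of("c"), cache.tier_of("b"), cache.tier_of("a"))  # RAM SSD HDD
cache.get("a")
print(cache.tier_of("a"))  # RAM
```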
15
Where to use Alluxio
Finding high-fit Alluxio use-cases
Compute Zone
Standalone or managed with Mesos or Yarn
Storage in Different Availability Zone
Either on-prem or cloud
Alluxio is installed with or near compute to unify data stores, stage remote data, and improve system performance.
Spark TensorFlow Presto
HDFS
Guidelines
☐ Cloud deployment
☐ Compute separated from storage
☐ I/O or network latency exists
☐ Unification of many storage systems
☐ Applications sharing long-lived data
More checks result in higher-fit applications
16
100+ known production deployments
AND MORE!
17
Machine Learning Case Study
Challenge –
Slow training of a model for algorithmic trading at a $46B data-driven hedge fund.
Data access was slow, costing them $$ in compute cost and lowering modeler productivity.
Solution –
With Alluxio, data access is 10-30X faster.
Impact –
Increased efficiency in training the ML algorithm, lowered compute cost, and increased modeler productivity, resulting in a 14-day ROI on Alluxio.
Diagram labels: SPARK, HDFS, MESOS, Public Internet
18
Consumer Intelligence Use Case – Top 3 Telco
Challenge –
Desired a central view of consumer information in near real time for proactive support.
Many HDFS deployments, different distributions, many incompatible versions. On-prem & cloud. Integration through heavy ETL.
Solution –
Alluxio integrates data into a central catalog for fast access to consumer interaction records.
Impact –
Reduced integration time
Faster data speed & freshness
Diagram labels: HADOOP, ML, ETL, HDP HDFS, CDH HDFS, MAPR HDFS, HDFS
21
Alluxio Architecture
Alluxio Master
Zookeeper
Standby Master
Alluxio Worker (RAM / SSD / HDD)
Alluxio Worker (RAM / SSD / HDD)
Under Store
Under Store
Alluxio Client
Application
Control Path
Data Path
Alluxio Master
- Master responsible for managing metadata
- Secondary masters used for journal checkpoints and fault tolerance
- Performs distributed storage metadata operations
22
2017 Alluxio, Inc. All Rights Reserved
Primary Master: File System Metadata, Block Metadata, Worker Metadata, RPC Service
Journal Storage
Secondary Master: File System Metadata, Block Metadata, Worker Metadata
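The relationship between the primary master, the journal, and a secondary master can be sketched as follows. This is a simplified illustration with hypothetical names (`PrimaryMaster`, `SecondaryMaster`, `apply_entry`) and a made-up journal entry format. It assumes the journal is a shared, append-only list, and it is not Alluxio's actual journal protocol.

```python
from typing import Dict, List, Tuple

# Hypothetical journal entry: (operation, path, under-store location)
JournalEntry = Tuple[str, str, str]


def apply_entry(state: Dict[str, str], entry: JournalEntry) -> None:
    """Apply one journaled metadata change to an in-memory state."""
    op, path, value = entry
    if op == "create":
        state[path] = value
    elif op == "delete":
        state.pop(path, None)
    else:
        raise ValueError(f"unknown journal operation: {op}")


class PrimaryMaster:
    """Serves metadata operations; journals each change before applying it."""

    def __init__(self, journal: List[JournalEntry]) -> None:
        self.metadata: Dict[str, str] = {}
        self._journal = journal

    def create(self, path: str, location: str) -> None:
        self._commit(("create", path, location))

    def delete(self, path: str) -> None:
        self._commit(("delete", path, ""))

    def _commit(self, entry: JournalEntry) -> None:
        self._journal.append(entry)        # durable record first
        apply_entry(self.metadata, entry)  # then the in-memory state


class SecondaryMaster:
    """Replays the journal to keep a checkpoint it could take over with."""

    def __init__(self, journal: List[JournalEntry]) -> None:
        self.checkpoint: Dict[str, str] = {}
        self._journal = journal
        self._applied = 0

    def catch_up(self) -> int:
        for entry in self._journal[self._applied:]:
            apply_entry(self.checkpoint, entry)
        self._applied = len(self._journal)
        return self._applied


journal: List[JournalEntry] = []
primary = PrimaryMaster(journal)
secondary = SecondaryMaster(journal)
primary.create("/a", "hdfs://example/a")
primary.create("/b", "s3a://example/b")
primary.delete("/a")
secondary.catch_up()
assert secondary.checkpoint == primary.metadata
```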
Alluxio Worker
- Worker responsible for managing block data
- Each worker manages metadata for the block data it stores
- Workers store block data on various local storage mediums
- Performs distributed storage data operations
23
Block Metadata
RPC Service
Data Transfer Service
RAM
SSD
HDD
Under Storage
Data Flow In Alluxio
Applications read and write data via the Alluxio Client
Ideally, Alluxio is deployed on the same nodes as compute, so the Alluxio Client and Alluxio Workers are on the same node
Different Read Scenarios
• Read data in Alluxio, on same node as client
• Read data in Alluxio, not on same node as client
• Read data not in Alluxio
Different Write Scenarios
• Write data only to Alluxio
• Write data to Alluxio and Under Store synchronously
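The three read scenarios listed above amount to a small decision based on where a block currently lives. The sketch below is illustrative: the function `choose_read_source` and the shape of the `block_locations` map are assumptions for this example, and the enum values reuse the speed labels from the slides that follow.

```python
from enum import Enum
from typing import Dict, Set


class ReadSource(Enum):
    LOCAL_WORKER = "memory speed"          # data in Alluxio, same node as client
    REMOTE_WORKER = "network speed"        # data in Alluxio, on another node
    UNDER_STORE = "network / disk speed"   # data not in Alluxio


def choose_read_source(
    block_id: str,
    client_host: str,
    block_locations: Dict[str, Set[str]],
) -> ReadSource:
    """Pick the read path from the hosts whose workers hold the block."""
    hosts = block_locations.get(block_id, set())
    if client_host in hosts:
        return ReadSource.LOCAL_WORKER
    if hosts:
        return ReadSource.REMOTE_WORKER
    return ReadSource.UNDER_STORE


# Hypothetical cluster state: block id -> hosts whose worker stores it.
locations = {"block-1": {"node-a"}, "block-2": {"node-b"}}
for block in ("block-1", "block-2", "block-3"):
    source = choose_read_source(block, "node-a", locations)
    print(block, source.name, source.value)
```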
25
Read data in Alluxio, on same node as client
Memory Speed Read of Data
Application
Alluxio Client
Alluxio Master
Alluxio Worker (RAM / SSD / HDD)
26
Read data not in Alluxio + Caching
Network / Disk Speed Read of Data
Application
Alluxio Client
Alluxio Master
Alluxio Worker (RAM / SSD / HDD)
Under Store
27
Read data in Alluxio, not on same node as client + Caching
Network Speed Read of Data
Application
Alluxio Client
Alluxio Master
Alluxio Worker (RAM / SSD / HDD)
Alluxio Worker (RAM / SSD / HDD)
28
Write data only to Alluxio on same node as client
Memory Speed Write of Data
Application
Alluxio Client
Alluxio Master
Alluxio Worker (RAM / SSD / HDD)
29
Write data to Alluxio and Under Store synchronously
Network / Disk Speed Write of Data
Application
Alluxio Client
Alluxio Master
Alluxio Worker (RAM / SSD / HDD)
Under Store
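The two write scenarios differ only in whether the write also reaches the under store before it returns. A minimal sketch, with hypothetical names (`WriteMode`, `write_block`) and plain dictionaries standing in for worker storage and the under store:

```python
from enum import Enum
from typing import Dict


class WriteMode(Enum):
    ALLUXIO_ONLY = "alluxio only"                        # memory-speed write
    ALLUXIO_AND_UNDER_STORE = "alluxio and under store"  # network / disk speed


def write_block(
    block_id: str,
    data: bytes,
    mode: WriteMode,
    worker_storage: Dict[str, bytes],
    under_store: Dict[str, bytes],
) -> None:
    """Write to the worker; in synchronous mode also persist before returning."""
    worker_storage[block_id] = data
    if mode is WriteMode.ALLUXIO_AND_UNDER_STORE:
        # Synchronous: the caller gets control back only after the
        # under store has the data as well.
        under_store[block_id] = data


worker: Dict[str, bytes] = {}
ufs: Dict[str, bytes] = {}
write_block("tmp", b"scratch", WriteMode.ALLUXIO_ONLY, worker, ufs)
write_block("result", b"final", WriteMode.ALLUXIO_AND_UNDER_STORE, worker, ufs)
print(sorted(worker), sorted(ufs))
```

The trade-off in this sketch is the usual one: the first mode is fastest but the data exists only in Alluxio, while the second pays under-store latency in exchange for persistence.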
32
New features in 1.7.0
Async caching
Kubernetes integration
Tiered locality
Under store synchronization
FUSE improvement
33
Partial Caching
Alluxio Worker
Alluxio Client
Machine A
Application
Under Storage
RAM
block
34
Async Partial Caching
Alluxio Worker
Alluxio Client
Machine A
Application
Under Storage
RAM
block
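One way to read the two diagrams, stated here as an assumption rather than as the documented behavior: when a client reads only part of a block that is not yet in Alluxio, the worker serves the requested range from under storage right away and caches the full block in the background, off the read path. The sketch below uses a hypothetical `Worker` class with a single background thread, and a dictionary as the under storage.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict


class Worker:
    """Hypothetical worker that caches whole blocks asynchronously."""

    def __init__(self, under_storage: Dict[str, bytes]) -> None:
        self._ufs = under_storage
        self._ram: Dict[str, bytes] = {}
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending: Dict[str, Future] = {}

    def read(self, block_id: str, offset: int, length: int) -> bytes:
        cached = self._ram.get(block_id)
        if cached is not None:
            return cached[offset:offset + length]
        # Partial read: return only the requested range from under storage.
        data = self._ufs[block_id][offset:offset + length]
        # Schedule full-block caching once, without blocking this read.
        if block_id not in self._pending:
            self._pending[block_id] = self._pool.submit(self._cache_block, block_id)
        return data

    def _cache_block(self, block_id: str) -> None:
        self._ram[block_id] = self._ufs[block_id]

    def wait_for_caching(self) -> None:
        for future in list(self._pending.values()):
            future.result()
        self._pending.clear()

    def is_cached(self, block_id: str) -> bool:
        return block_id in self._ram

    def close(self) -> None:
        self._pool.shutdown(wait=True)


worker = Worker({"block": b"0123456789"})
print(worker.read("block", 2, 3))   # served from under storage
worker.wait_for_caching()
print(worker.is_cached("block"))    # the whole block is now in RAM
worker.close()
```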
Social Media: Twitter.com/alluxio, Linkedin.com/alluxio
Website: www.alluxio.com
E-mail: info@alluxio.com
41
• Thank you!
• Yupeng Fu yupeng@alluxio.com
• GitHub: yupeng9
• WeChat: richbird9
Social Media: Twitter.com/alluxio, Linkedin.com/alluxio
Website: www.alluxio.org
E-mail: info@alluxio.com

More Related Content

What's hot

SANsymphony-v r9.0.2
SANsymphony-v r9.0.2SANsymphony-v r9.0.2
SANsymphony-v r9.0.2mashenta
 
Block Level Storage Vs File Level Storage
Block Level Storage Vs File Level StorageBlock Level Storage Vs File Level Storage
Block Level Storage Vs File Level StoragePradeep Jagan
 
Exchange Recovery
Exchange RecoveryExchange Recovery
Exchange Recoveryedbmark
 
Introduction to SQLite in Adobe AIR
Introduction to SQLite in Adobe AIRIntroduction to SQLite in Adobe AIR
Introduction to SQLite in Adobe AIRPeter Elst
 
EMC XtremIO storage array 4.0 and VMware vSphere 6.0: Scaling mixed-database ...
EMC XtremIO storage array 4.0 and VMware vSphere 6.0: Scaling mixed-database ...EMC XtremIO storage array 4.0 and VMware vSphere 6.0: Scaling mixed-database ...
EMC XtremIO storage array 4.0 and VMware vSphere 6.0: Scaling mixed-database ...Principled Technologies
 
Create non-cdb (traditional) oracle database 12c on windows
Create non-cdb (traditional) oracle database 12c on windowsCreate non-cdb (traditional) oracle database 12c on windows
Create non-cdb (traditional) oracle database 12c on windowsBiju Thomas
 
Ahsay Backup Solution for Business End Users
Ahsay Backup Solution for Business End UsersAhsay Backup Solution for Business End Users
Ahsay Backup Solution for Business End UsersAh Say
 
Windows server 2012 R2 private cloud virtualization and storage
Windows server 2012 R2 private cloud virtualization and storageWindows server 2012 R2 private cloud virtualization and storage
Windows server 2012 R2 private cloud virtualization and storageSathishkumar A
 
Improving Memory Utilization of Spark Jobs Using Alluxio
Improving Memory Utilization of Spark Jobs Using AlluxioImproving Memory Utilization of Spark Jobs Using Alluxio
Improving Memory Utilization of Spark Jobs Using AlluxioAlluxio, Inc.
 

What's hot (11)

SANsymphony-v r9.0.2
SANsymphony-v r9.0.2SANsymphony-v r9.0.2
SANsymphony-v r9.0.2
 
Block Level Storage Vs File Level Storage
Block Level Storage Vs File Level StorageBlock Level Storage Vs File Level Storage
Block Level Storage Vs File Level Storage
 
Exchange Recovery
Exchange RecoveryExchange Recovery
Exchange Recovery
 
Introduction to SQLite in Adobe AIR
Introduction to SQLite in Adobe AIRIntroduction to SQLite in Adobe AIR
Introduction to SQLite in Adobe AIR
 
EMC XtremIO storage array 4.0 and VMware vSphere 6.0: Scaling mixed-database ...
EMC XtremIO storage array 4.0 and VMware vSphere 6.0: Scaling mixed-database ...EMC XtremIO storage array 4.0 and VMware vSphere 6.0: Scaling mixed-database ...
EMC XtremIO storage array 4.0 and VMware vSphere 6.0: Scaling mixed-database ...
 
Create non-cdb (traditional) oracle database 12c on windows
Create non-cdb (traditional) oracle database 12c on windowsCreate non-cdb (traditional) oracle database 12c on windows
Create non-cdb (traditional) oracle database 12c on windows
 
group project
group projectgroup project
group project
 
Ahsay Backup Solution for Business End Users
Ahsay Backup Solution for Business End UsersAhsay Backup Solution for Business End Users
Ahsay Backup Solution for Business End Users
 
Windows server 2012 R2 private cloud virtualization and storage
Windows server 2012 R2 private cloud virtualization and storageWindows server 2012 R2 private cloud virtualization and storage
Windows server 2012 R2 private cloud virtualization and storage
 
Improving Memory Utilization of Spark Jobs Using Alluxio
Improving Memory Utilization of Spark Jobs Using AlluxioImproving Memory Utilization of Spark Jobs Using Alluxio
Improving Memory Utilization of Spark Jobs Using Alluxio
 
Solution Brief HPE StoreOnce backup with Veeam
Solution Brief HPE StoreOnce backup with VeeamSolution Brief HPE StoreOnce backup with Veeam
Solution Brief HPE StoreOnce backup with Veeam
 

Similar to Alluxio: Unify Data at Memory Speed

Achieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloadsAchieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloadsAlluxio, Inc.
 
Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkAlluxio, Inc.
 
Alluxio @ Uber Seattle Meetup
Alluxio @ Uber Seattle MeetupAlluxio @ Uber Seattle Meetup
Alluxio @ Uber Seattle MeetupAlluxio, Inc.
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAlluxio, Inc.
 
EMC Isilon Multitenancy for Hadoop Big Data Analytics
EMC Isilon Multitenancy for Hadoop Big Data AnalyticsEMC Isilon Multitenancy for Hadoop Big Data Analytics
EMC Isilon Multitenancy for Hadoop Big Data AnalyticsEMC
 
Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio Alluxio, Inc.
 
Open Source Data Orchestration for AI, Big Data, and Cloud
Open Source Data Orchestration for AI, Big Data, and CloudOpen Source Data Orchestration for AI, Big Data, and Cloud
Open Source Data Orchestration for AI, Big Data, and CloudAlluxio, Inc.
 
Workload Centric Scale-Out Storage for Next Generation Datacenter
Workload Centric Scale-Out Storage for Next Generation DatacenterWorkload Centric Scale-Out Storage for Next Generation Datacenter
Workload Centric Scale-Out Storage for Next Generation DatacenterCloudian
 
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & MoreMeetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & MoreAlluxio, Inc.
 
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio, Inc.
 
Unify Data at Memory Speed
Unify Data at Memory SpeedUnify Data at Memory Speed
Unify Data at Memory SpeedAlluxio, Inc.
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAlluxio, Inc.
 
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsGetting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsAlluxio, Inc.
 
Storage As A Service (StAAS)
Storage As A Service (StAAS)Storage As A Service (StAAS)
Storage As A Service (StAAS)Shreyans Jain
 
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloads
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloadsAlluxio 2.0 Deep Dive – Simplifying data access for cloud workloads
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloadsAlluxio, Inc.
 
TechDay - Toronto 2016 - Hyperconvergence and OpenNebula
TechDay - Toronto 2016 - Hyperconvergence and OpenNebulaTechDay - Toronto 2016 - Hyperconvergence and OpenNebula
TechDay - Toronto 2016 - Hyperconvergence and OpenNebulaOpenNebula Project
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecasesudhakara st
 
Simplified Data Preparation for Machine Learning in Hybrid and Multi Clouds
Simplified Data Preparation for Machine Learning in Hybrid and Multi CloudsSimplified Data Preparation for Machine Learning in Hybrid and Multi Clouds
Simplified Data Preparation for Machine Learning in Hybrid and Multi CloudsAlluxio, Inc.
 
Fluid Data Storage:Driving Flexibility in the Data Center
Fluid Data Storage:Driving Flexibility in the Data Center Fluid Data Storage:Driving Flexibility in the Data Center
Fluid Data Storage:Driving Flexibility in the Data Center Kingfin Enterprises Limited
 

Similar to Alluxio: Unify Data at Memory Speed (20)

Achieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloadsAchieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloads
 
Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with Spark
 
Alluxio @ Uber Seattle Meetup
Alluxio @ Uber Seattle MeetupAlluxio @ Uber Seattle Meetup
Alluxio @ Uber Seattle Meetup
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud World
 
EMC Isilon Multitenancy for Hadoop Big Data Analytics
EMC Isilon Multitenancy for Hadoop Big Data AnalyticsEMC Isilon Multitenancy for Hadoop Big Data Analytics
EMC Isilon Multitenancy for Hadoop Big Data Analytics
 
Data EcoSystem 2.0
Data EcoSystem 2.0Data EcoSystem 2.0
Data EcoSystem 2.0
 
Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio
 
Open Source Data Orchestration for AI, Big Data, and Cloud
Open Source Data Orchestration for AI, Big Data, and CloudOpen Source Data Orchestration for AI, Big Data, and Cloud
Open Source Data Orchestration for AI, Big Data, and Cloud
 
Workload Centric Scale-Out Storage for Next Generation Datacenter
Workload Centric Scale-Out Storage for Next Generation DatacenterWorkload Centric Scale-Out Storage for Next Generation Datacenter
Workload Centric Scale-Out Storage for Next Generation Datacenter
 
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & MoreMeetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
 
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
 
Unify Data at Memory Speed
Unify Data at Memory SpeedUnify Data at Memory Speed
Unify Data at Memory Speed
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
 
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsGetting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
 
Storage As A Service (StAAS)
Storage As A Service (StAAS)Storage As A Service (StAAS)
Storage As A Service (StAAS)
 
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloads
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloadsAlluxio 2.0 Deep Dive – Simplifying data access for cloud workloads
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloads
 
TechDay - Toronto 2016 - Hyperconvergence and OpenNebula
TechDay - Toronto 2016 - Hyperconvergence and OpenNebulaTechDay - Toronto 2016 - Hyperconvergence and OpenNebula
TechDay - Toronto 2016 - Hyperconvergence and OpenNebula
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
Simplified Data Preparation for Machine Learning in Hybrid and Multi Clouds
Simplified Data Preparation for Machine Learning in Hybrid and Multi CloudsSimplified Data Preparation for Machine Learning in Hybrid and Multi Clouds
Simplified Data Preparation for Machine Learning in Hybrid and Multi Clouds
 
Fluid Data Storage:Driving Flexibility in the Data Center
Fluid Data Storage:Driving Flexibility in the Data Center Fluid Data Storage:Driving Flexibility in the Data Center
Fluid Data Storage:Driving Flexibility in the Data Center
 

More from Alluxio, Inc.

Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-CloudAlluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-CloudAlluxio, Inc.
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Optimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioOptimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioAlluxio, Inc.
 
Speed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingSpeed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingAlluxio, Inc.
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleAlluxio, Inc.
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLAlluxio, Inc.
 
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio, Inc.
 
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...Alluxio, Inc.
 
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionData Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionAlluxio, Inc.
 
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeData Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeAlluxio, Inc.
 
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudData Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudAlluxio, Inc.
 
Data Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet ReaderData Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet ReaderAlluxio, Inc.
 
Data Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage EvolutionData Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage EvolutionAlluxio, Inc.
 
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio, Inc.
 
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...Alluxio, Inc.
 
AI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI EraAI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI EraAlluxio, Inc.
 
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...Alluxio, Inc.
 
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...Alluxio, Inc.
 
AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ MetaAI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ MetaAlluxio, Inc.
 
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber ScaleAI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber ScaleAlluxio, Inc.
 

More from Alluxio, Inc. (20)

Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-CloudAlluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Optimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioOptimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with Alluxio
 
Speed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingSpeed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio Caching
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
 
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
 
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
 
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionData Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
 
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeData Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
 
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudData Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
 
Data Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet ReaderData Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet Reader
 
Data Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage EvolutionData Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage Evolution
 
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
 
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
 
AI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI EraAI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI Era
 
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
 
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...
 
AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ MetaAI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
 
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber ScaleAI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
 



Alluxio: Unify Data at Memory Speed

  • 1. Alluxio: Unify Data at Memory Speed • Yupeng Fu
  • 2. Outline: (1) Why we built Alluxio (2) Alluxio’s innovations (3) Alluxio’s Architecture (4) What’s new in 1.7.0
  • 3. Data Ecosystem Develops • One Compute Framework • Single Storage System • Co-located (diagram label: ETL)
  • 4. Data Ecosystem Explodes • Many Compute Frameworks • Many Storage Systems • Most not co-located
  • 5. Data Ecosystem Issues • Each app manages multiple data sources • Data source changes require global updates • Storage optimizations require app changes • Poor performance due to lack of locality
  • 6. Data Ecosystem Challenges: (1) Speed & Complexity • Integration and interoperability issues (on prem, hybrid, cloud) • Many departments & groups (2) Data Freshness • Cross-network movement is slow • Copies create lag • Data quality suffers with copies (3) Cost • Many-to-many integrations are expensive • Data duplication (4) Security & Governance • Data security & governance is increasingly complex. Heavy integrations create painful organizational drag.
  • 7. Data Ecosystem with Alluxio • Apps only talk to Alluxio • Simple Add/Remove • No App Changes • Highest performance in Memory. Interfaces shown in the diagram: Java File API, HDFS Interface, Amazon S3 Interface, REST Web Service; HDFS Interface, Amazon S3 Interface, Swift Interface, NFS Interface, …
  • 8. Alluxio Design Principles: (1) Big Data & Machine Learning • Interoperability with leading projects • Large scale data sets • High IO (2) Data Sharing • Don’t own the data • Multiple apps sharing common data • Data stored in multiple, hybrid systems (3) High Speed Data Access • Remote data • Hot/warm/cold data • Temporary data • Read/write support (4) Enterprise Class • Distributed architecture • Commodity hardware • Service-oriented • High availability • Security
  • 9. Outline: (1) Why we built Alluxio (2) Alluxio’s innovations (3) Alluxio’s Architecture (4) What’s new in 1.7.0 (5) Demo
  • 10. Alluxio Innovation: Server-side API Translation. Convert from Client-side Interface to Native Storage Interface. Client side: HDFS Interface / S3 Interface. Storage side: HDFS Interface, S3A Interface, Swift Interface, Google Cloud Interface
  • 11. Alluxio Innovation: Server-side API Translation. Convert between different versions of HDFS: HDFS 2.7 Interface, HDP 2.4 Interface, CDH 5.6 Interface, MAPR 5.2 Interface
  • 12. Alluxio Innovation: Unified Namespace. Enables effective data management across different Under Stores. Uses Mounting with Transparent Naming
  • 13. Alluxio Innovation: Unified Namespace. Create a catalog of available data sources for Data Scientists (scheme shown: alluxio://): /finance/customer-transactions/ /finance/vendor-transactions/ /operations/device-logs/ /operations/phone-call-recordings/ /operations/check-images/ /research/us-economic-data/ /research/intl-economic-data/ /marketing/advertising-dataset/ /marketing/marketing-funnel-dataset/
  • 14. Alluxio Innovation: Intelligent Cache. Local performance from remote data using native multi-tier storage: RAM (hot), SSD (warm), HDD (cold)
  • 15. Where to use Alluxio: Finding high-fit Alluxio use-cases. Compute Zone: Standalone or managed with Mesos or Yarn. Storage in Different Availability Zone: Either on-prem or cloud. Alluxio is installed with or near compute to unify data stores, stage remote data, and improve system performance. Diagram labels: Spark, Tensorflow, Presto, HDFS. Guidelines: • Cloud deployment • Compute separated from storage • I/O or network latency exists • Unification of many storage systems • Applications sharing long lived data. More checks result in higher fit applications
  • 16. 100+ known production deployments AND MORE!
  • 17. Machine Learning Case Study. Challenge: Slow training of a model for algorithmic trading at a $46B data-driven hedge fund. Data access was slow, costing them $$ in compute cost and lower modeler productivity. Solution: With Alluxio, data access is 10-30X faster. Impact: Increased efficiency in training the ML algorithm, lowered compute cost and increased modeler productivity, resulting in a 14-day ROI for Alluxio. Diagram labels: Spark, HDFS, Mesos, Public Internet
  • 18. Consumer Intelligence Use Case: Top 3 Telco. Challenge: Desired a central view of consumer information in near real time for proactive support. Many HDFS, different distributions, many incompatible versions. On-prem & cloud. Integration through heavy ETL. Solution: Alluxio integrates data into a central catalog for fast access to consumer interaction records. Impact: Reduced integration time; faster data speed & freshness. Diagram labels: Hadoop, ML, ETL, HDFS, HDP HDFS, CDH HDFS, MAPR HDFS
  • 19. Outline: (1) Why we built Alluxio (2) Alluxio’s innovations (3) Alluxio’s Architecture (4) What’s new in 1.7.0 (5) Demo
  • 20. Alluxio Architecture. Diagram labels: Alluxio Master, Zookeeper, Standby Master, Alluxio Worker (RAM / SSD / HDD), Alluxio Worker (RAM / SSD / HDD), Under Store, Under Store, Application, Alluxio Client. Legend: Control Path, Data Path
  • 21. Alluxio Master - Master responsible for managing metadata - Secondary masters used for journal checkpoints and fault tolerance - Performs distributed storage metadata operations. Diagram labels: Primary Master (File System Metadata, Block Metadata, Worker Metadata, RPC Service), Journal Storage, Secondary Master (File System Metadata, Block Metadata, Worker Metadata). 2017 Alluxio, Inc. All Rights Reserved
  • 22. Alluxio Worker - Worker responsible for managing block data - Each worker manages metadata for the block data it stores - Workers store block data on various local storage mediums - Performs distributed storage data operations. Diagram labels: Block Metadata, RPC Service, Data Transfer Service, RAM, SSD, HDD, Under Storage
  • 23. Data Flow In Alluxio. Applications Read/Write data via the Alluxio Client. Ideally, Alluxio deployed on same nodes as compute so Alluxio Client and Alluxio Workers on same node. Different Read Scenarios: • Read data in Alluxio, on same node as client • Read data in Alluxio, not on same node as client • Read data not in Alluxio. Different Write Scenarios: • Write data only to Alluxio • Write data to Alluxio and Under Store synchronously
  • 24. Read data in Alluxio, on same node as client: Memory Speed Read of Data. Diagram labels: Application, Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD)
  • 25. Read data not in Alluxio + Caching: Network / Disk Speed Read of Data. Diagram labels: Application, Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD), Under Store
  • 26. Read data in Alluxio, not on same node as client + Caching: Network Speed Read of Data. Diagram labels: Application, Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD), Alluxio Worker (RAM / SSD / HDD)
  • 27. Write data only to Alluxio on same node as client: Memory Speed Write of Data. Diagram labels: Application, Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD)
  • 28. Write data to Alluxio and Under Store synchronously: Network / Disk Speed Write of Data. Diagram labels: Application, Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD), Under Store
  • 29. Outline: (1) Why we built Alluxio (2) Alluxio’s innovations (3) Alluxio’s Architecture (4) What’s new in 1.7.0 (5) Demo
  • 30. New features in 1.7.0: • Async caching • Kubernetes integration • Tiered locality • Under store synchronization • FUSE improvement
  • 31. Partial Caching. Diagram labels: Machine A, Application, Alluxio Client, Alluxio Worker, RAM, block, Under Storage
  • 32. Async Partial Caching. Diagram labels: Machine A, Application, Alluxio Client, Alluxio Worker, RAM, block, Under Storage
  • 33. Thank you! • Yupeng Fu yupeng@alluxio.com • Github: yupeng9 • wechat: richbird9 • Social Media: Twitter.com/alluxio, Linkedin.com/alluxio • Website: www.alluxio.org (also shown as www.alluxio.com) • E-mail: info@alluxio.com
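
The read and write scenarios on slides 23 to 28 can be pictured with a small toy model. This is only an illustrative sketch, not Alluxio's client or worker code: `ToyCluster`, `Worker`, the node names, and the block IDs are all hypothetical, and whether a real deployment caches a block on the reading node depends on its configuration.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical stand-in for an Alluxio worker: a block cache on one node."""
    node: str
    blocks: dict = field(default_factory=dict)


class ToyCluster:
    """Toy model of the slide scenarios; not the real Alluxio implementation."""

    def __init__(self, nodes, under_store):
        self.workers = {n: Worker(n) for n in nodes}
        self.under_store = dict(under_store)  # stands in for HDFS, S3, ...

    def read(self, client_node, block_id):
        local = self.workers[client_node]
        # Scenario 1: block is cached on the client's own node.
        if block_id in local.blocks:
            return local.blocks[block_id], "local"
        # Scenario 2: block is cached on another worker; fetch over the
        # network and keep a local copy (the "+ Caching" on the slide).
        for worker in self.workers.values():
            if block_id in worker.blocks:
                data = worker.blocks[block_id]
                local.blocks[block_id] = data
                return data, "remote"
        # Scenario 3: block is not in the cache at all; read it from the
        # under store and cache it locally for later reads.
        data = self.under_store[block_id]
        local.blocks[block_id] = data
        return data, "under_store"

    def write(self, client_node, block_id, data, through=False):
        # Always write to the local worker; with through=True the data is
        # also persisted to the under store before returning.
        self.workers[client_node].blocks[block_id] = data
        if through:
            self.under_store[block_id] = data


cluster = ToyCluster(["node-a", "node-b"], {"blk-1": b"payload"})
print(cluster.read("node-a", "blk-1")[1])  # first read comes from the under store
print(cluster.read("node-a", "blk-1")[1])  # second read is served locally
print(cluster.read("node-b", "blk-1")[1])  # another node reads from a remote worker
```

The point of the model is the ordering of the lookups: local worker first, then other workers, then the under store, which is why repeated reads of the same data get faster.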

Editor's Notes

  1. A technology ecosystem developed around so-called big data, with Hadoop being one of its most important parts. The idea was to allow large clusters of cheap computers to mine huge volumes of low-value data to extract valuable insights. To do that, Hadoop was designed to make compute and storage tightly coupled.
  2. As enterprises matured and more big data systems developed, they adopted new technologies across more datacenters. Compute began to be separated from storage to take advantage of lower cost hosting options.
  3. Enterprises now need to manage a bird’s nest of integrations, and as a result development and operations are complicated. Performance also suffers as large volumes of data cross slow pipes that connect networks.
  4. As folks in the trenches know, Gartner says 60% of data projects will fail in 2017. That’s largely due to the complexity, cost, bottlenecks and governance surrounding these projects.
  5. Alluxio is a virtual distributed file system that unifies data access between storage and compute, and offers memory-speed performance when working on remote data. How does it do that? Our software connects with dozens of the leading storage platforms and hosting providers like Amazon, Google and Microsoft. Alluxio unifies all your storage systems into a single global namespace that Spark, Presto, MapReduce, and other frameworks can access. Because most big data processes are run on only a subset of data, teams have the ability to place the most recent data into Alluxio memory. As data flows through the system, Alluxio’s intelligent cache can place other frequently accessed data into memory. And Alluxio is read/write, so entire processing pipelines can be managed through the system. Benefits: Application development and data science are simpler, because Alluxio abstracts the complexity of storage. Architecture is more flexible: separate compute from storage, improve interoperability and choose lower-cost storage. Data access speed across networks improves, commonly 2-10x.
  6. Alluxio is deeply informed by our unique design principles: work with large-scale applications; don’t own the data; don’t hurt performance, and try to address common performance issues like remote data access and write throughput; make sure it can operate within the largest enterprises. These principles informed our 3 enabling innovations…
  7. While an infrastructure doesn’t need to meet all the following guidelines to be a high-fit application, meeting more of these guidelines will lead to a higher fit
  8. We have been very fortunate to have been rapidly adopted by some of the largest and most respected technology organizations in the world.
  9. We’re working with one early adopter of technology, Two Sigma, which is a leading tech-focused hedge fund with over $40 billion AUM. This team had a real problem in training their models across their 10,000 node Spark cluster. Their Spark nodes were in AWS and their source data was in their local storage. Because so much data was being transported across the relatively slow public network, they saw massive network bottlenecks. This meant they could only run their process twice a day. When they added Alluxio, they were able to increase that to 8-10 cycles per day, which meant big money for the hedge fund. As they’ve matured in the cloud, they’ve realized they’ve locked themselves into S3, and S3 prices are very unpredictable, so they’ve wanted to take advantage of lower cost cloud providers like Google. With Alluxio, they have the ability to lift and shift their nodes without impacting their application layer, taking advantage of more favorable pricing.
  10. One of the largest telcos in the US uses Alluxio as a virtual data lake – providing a single unified access layer across all their data systems. This enables them to integrate new systems more quickly and improves the data freshness and responsiveness of applications built on top of the data lake.
  11. We ran this experiment when Spark 2.0 came out, with the latest version of Alluxio at the time. Since then we have kept track of improvements to Spark and Alluxio, but have not seen enough of a difference to rerun this performance evaluation. We compare several of Spark’s storage levels against storing the data in Alluxio instead of in Spark.
  12. Let me explain what this graph is presenting. We first tested with a cached RDD, meaning the data was already in Spark or Alluxio, and we varied the size of the RDD and recorded how much time it took to run a scan on the RDD. The disk-only line in green is, as expected, much worse than all the others, which use memory. A more interesting observation is the performance difference between the Alluxio text file and object file. Text file is almost strictly better because of the object serialization overhead: the file itself is all plaintext, so using an object file was not helpful. I would recommend using text file when possible. Using Spark’s memory only is best for small files, but performance drops abruptly once the data cannot be completely cached in memory. This also happens for mem_only_ser, the purple line, but much later because of the small size of serialized objects. Alluxio scales linearly throughout the test and actually outperforms Spark caches for a single task after 32 GB or so.
  13. The previous comparison was if the data was on SSD, which is still a relatively fast storage. If we instead put the data in S3, we see a much larger speed up. This is more representative of architectures because this allows compute and storage to be decoupled. In this case, we see 16x speed up even in this simple job, which is similar to some of the performance use cases I previously presented.
  14. We also did a similar test with a Parquet file using the DataFrame API. Here we did a simple aggregation which would access all the rows. The behavior is the same as before: using Spark’s native caching has an abrupt turning point where the performance degrades. These tests are run with default Spark configurations, which they suggest not to change. If we optimize for additional storage, we can move the point of bend, but it would still be present.
  15. This is the average of 7 runs. The range for S3 is 1132.765125; the range for Alluxio is 10.5890684. We also ran the same example against S3, and the variation in the test was fairly large. This is similar to what some users see in their storage, due to the storage itself, their workload, or sometimes both. Alluxio performance is much more consistent and provided on average a 10x and up to a 17x performance improvement.
  16. Easy to use with Spark.
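
The single global namespace described in note 5, which the slides say uses mounting with transparent naming, can be pictured as a mount table that maps a path prefix to an under store. The sketch below is a simplified illustration under stated assumptions, not Alluxio's implementation: `MountTable` is a hypothetical class, and the host names and bucket name are made up for the example. Only the namespace paths come from the slides.

```python
class MountTable:
    """Hypothetical longest-prefix mount table for a unified namespace."""

    def __init__(self):
        self._mounts = {}  # namespace path -> under-store URI

    def mount(self, namespace_path, ufs_uri):
        # Normalize trailing slashes; the root mount stays "/".
        key = namespace_path.rstrip("/") or "/"
        self._mounts[key] = ufs_uri.rstrip("/")

    def resolve(self, path):
        """Translate a namespace path into the under-store URI behind it."""
        best = None
        for mount_point in self._mounts:
            prefix = "/" if mount_point == "/" else mount_point + "/"
            if path == mount_point or path.startswith(prefix):
                # Prefer the most specific (longest) matching mount point.
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount covers {path}")
        rest = path[len(best):].lstrip("/")
        base = self._mounts[best]
        return f"{base}/{rest}" if rest else base


table = MountTable()
# Assumed under stores; hosts and bucket are illustrative only.
table.mount("/finance", "hdfs://namenode-1:8020/fin")
table.mount("/research", "s3a://example-research-bucket/data")

print(table.resolve("/finance/customer-transactions/part-0"))
print(table.resolve("/research/us-economic-data"))
```

Applications keep using one stable path per data set, so moving a data set to a different storage system only changes the mount entry, which is the "no app changes" property claimed on slide 7.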