The Architecture of Decoupling Compute
and Storage with Alluxio 
Haoyuan Li, Alluxio Inc.
December 2017 @ Strata Singapore
Confidential © Alluxio, Inc. All Rights Reserved.
About Me
•  Haoyuan (H.Y.) Li 
•  Founder and CEO at Alluxio, Inc.
•  Created Alluxio (originally named Tachyon) at UC Berkeley AMPLab as a Ph.D. candidate
•  Google, Cornell University, Peking University
Decoupling Compute From Storage
Benefits –
Different compute and storage hardware requirements
Scale compute and storage resources independently
Traditional filers/SANs and cost-effective object stores (Amazon S3, Google GCS, Microsoft Azure Blob Store) are inherently decoupled
Fast-evolving big data ecosystem
Challenges –
Accessing data requires remote I/O
Remote I/O
Spark
Amazon S3
Every data operation requires data transfer, sometimes over the WAN
High latency, network throughput
Data Operations with Alluxio
Spark
Amazon S3
Alluxio
Spark to Alluxio: low latency, memory throughput
Alluxio to Amazon S3: high latency, network throughput
Keeping data in Alluxio accelerates data access
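The effect of keeping data close to compute can be sketched with a small read-through cache. This is only an illustration, not Alluxio's implementation: `RemoteStore`, `ReadThroughCache`, and the object key are hypothetical names, and the remote store is simulated in memory.

```python
import time


class RemoteStore:
    """Hypothetical stand-in for a remote object store such as Amazon S3."""

    def __init__(self, objects, latency_s=0.0):
        self.objects = dict(objects)
        self.latency_s = latency_s
        self.reads = 0  # how many reads actually went to the remote store

    def read(self, key):
        time.sleep(self.latency_s)  # simulated network round trip
        self.reads += 1
        return self.objects[key]


class ReadThroughCache:
    """Keeps every object it has fetched in local memory."""

    def __init__(self, store):
        self.store = store
        self.memory = {}

    def read(self, key):
        if key not in self.memory:
            # Cold read: pay the remote I/O cost once.
            self.memory[key] = self.store.read(key)
        # Warm read: served from memory, no remote I/O.
        return self.memory[key]


store = RemoteStore({"s3://bucket/part-0": b"rows"})
cache = ReadThroughCache(store)
first = cache.read("s3://bucket/part-0")   # goes to the remote store
second = cache.read("s3://bucket/part-0")  # served from memory
```

Only the first read touches the remote store; repeated reads of the same data are served locally, which is the behaviour the slide attributes to keeping data in Alluxio.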
Data Ecosystem Develops
•  One Compute Framework
•  Single Storage System
•  Co-located
Diagram labels: ETL
Data Ecosystem Explodes
•  Many Compute Frameworks
•  Many Storage Systems
•  Most not co-located
Data Ecosystem Issues
•  Each app manages multiple data sources
•  Data source changes require global updates
•  Storage optimizations require app changes
•  Poor performance due to lack of locality
Data Ecosystem with Alluxio
•  Apps only talk to Alluxio
•  Simple Add/Remove
•  No App Changes
•  Highest performance in memory
•  No lock-in
Application-facing interfaces: Native File System, Hadoop Compatible File System, Amazon S3 Interface, REST Web Service
Storage-facing interfaces: HDFS Interface, Amazon S3 Interface, Swift Interface, NFS Interface
Use Case – Multi-Cluster Federated Query
History
Started at UC Berkeley AMPLab in summer 2012
Originally named Tachyon
Rebranded to Alluxio in early 2016
Open sourced in 2013
Apache License 2.0
Latest stable release: Alluxio 1.6.1 (Nov 2017)
Fastest Growing Big Data Open Source Project
Fastest-growing open-source project in the big data ecosystem
Running the world’s largest production clusters
600+ contributors from 100+ organizations
[Chart: Open Source Contributors by Month (GitHub). Y-axis: number of contributors. X-axis: month. Series: Alluxio, Spark, Kafka, Redis, HDFS, Cassandra, Hive]
What’s Alluxio
Memory-Speed Virtual Distributed Storage
Memory-Speed: memory-speed access to data.
Virtual: virtualized across different storage types under a unified namespace.
Distributed: scale-out architecture.
Storage: non-persistent data-storage software.
Alluxio Innovation:
Unified Namespace
Enables effective data management across different Under Stores
Uses Mounting with Transparent Naming
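Mounting with transparent naming can be pictured as a table that maps paths in one namespace onto locations in different under stores. The sketch below is a simplified model, not Alluxio's code: `MountTable`, the bucket name, and the namenode address are hypothetical, and the longest-prefix rule is an assumption made for the illustration.

```python
class MountTable:
    """Maps paths in a single namespace to URIs in different under stores."""

    def __init__(self):
        self._mounts = {}  # namespace path prefix -> under store URI

    def mount(self, namespace_path, ufs_uri):
        self._mounts[namespace_path.rstrip("/")] = ufs_uri.rstrip("/")

    def resolve(self, path):
        # The longest mount point that covers the path wins, so nested
        # mounts override their parents.
        best = None
        for prefix in self._mounts:
            if path == prefix or path.startswith(prefix + "/"):
                if best is None or len(prefix) > len(best):
                    best = prefix
        if best is None:
            raise KeyError(f"no mount point covers {path}")
        return self._mounts[best] + path[len(best):]


table = MountTable()
table.mount("/finance", "s3a://finance-bucket/data")
table.mount("/operations/device-logs", "hdfs://namenode:8020/logs/devices")

finance_uri = table.resolve("/finance/customer-transactions/2017.csv")
logs_uri = table.resolve("/operations/device-logs/day1")
```

Applications keep using the namespace path; only the mount table knows that one directory lives in an object store and another in HDFS.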
Alluxio Innovation:
Unified Namespace
Create a catalog of available data sources for Data Scientists
alluxio://
/finance/customer-transactions/
/finance/vendor-transactions/
/operations/device-logs/
/operations/phone-call-recordings/
/operations/check-images/
/research/us-economic-data/
/research/intl-economic-data/
/marketing/advertising-dataset/
/marketing/marketing-funnel-dataset/
Alluxio Innovation:
Server-side API Translation
Convert from Client-side Interface to native Storage Interface
Client-side interface: HDFS Interface
Native storage interfaces: HDFS Interface, S3A Interface, Swift Interface, Google Cloud Interface
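One way to picture server-side API translation is a layer that accepts a single file-style read call and converts it into whatever call each storage system natively expects. The sketch below is a toy model under stated assumptions: `FakeHdfs`, `FakeObjectStore`, `TranslationLayer`, and their method names are all hypothetical and do not correspond to real client libraries.

```python
class FakeHdfs:
    """Hypothetical file-system style store: addressed by absolute path."""

    def __init__(self, files):
        self._files = files

    def open_file(self, path):
        return self._files[path]


class FakeObjectStore:
    """Hypothetical object store: addressed by (bucket, key)."""

    def __init__(self, objects):
        self._objects = objects

    def get_object(self, bucket, key):
        return self._objects[(bucket, key)]


def read_via_hdfs(store, location):
    # The location is already an absolute path.
    return store.open_file(location)


def read_via_object_store(store, location):
    # Split "bucket/key/with/slashes" into the two arguments the store expects.
    bucket, _, key = location.partition("/")
    return store.get_object(bucket, key)


class TranslationLayer:
    """Accepts one read(uri) call and routes it to the right native call."""

    def __init__(self):
        self._routes = {}  # URI scheme -> (store, translator function)

    def register(self, scheme, store, translator):
        self._routes[scheme] = (store, translator)

    def read(self, uri):
        scheme, _, location = uri.partition("://")
        store, translator = self._routes[scheme]
        return translator(store, location)


layer = TranslationLayer()
layer.register("hdfs", FakeHdfs({"/logs/a.txt": b"from hdfs"}), read_via_hdfs)
layer.register(
    "s3a",
    FakeObjectStore({("bucket", "logs/b.txt"): b"from s3"}),
    read_via_object_store,
)

hdfs_bytes = layer.read("hdfs:///logs/a.txt")  # authority omitted for brevity
s3_bytes = layer.read("s3a://bucket/logs/b.txt")
```

The caller uses the same `read` call for both stores; the per-store differences stay on the server side of the layer.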
Alluxio Innovation:
Intelligent Cache
Local performance from remote data using multi-tier storage
RAM (hot), SSD (warm), HDD (cold)
Read & Write Buffering
Transparent to App
Policies for pinning, promotion/demotion, TTL
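The pinning and promotion/demotion policies can be illustrated with a toy tiered cache. This is a minimal sketch, not Alluxio's eviction logic: `TieredCache` is a hypothetical class, capacities are counted in blocks, least-recently-used demotion is an assumption, and TTL is left out to keep the example short.

```python
from collections import OrderedDict


class TieredCache:
    """Toy multi-tier block cache: fastest tier first, LRU demotion between tiers."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), ordered fastest to slowest.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]
        self.pinned = set()

    def pin(self, key):
        """Pinned blocks are demoted only if a tier holds nothing but pinned blocks."""
        self.pinned.add(key)

    def put(self, key, value):
        """Write a block into the fastest tier, replacing any older copy."""
        for _, _, blocks in self.tiers:
            blocks.pop(key, None)
        self._insert(0, key, value)

    def get(self, key):
        """Return the block and promote it to the fastest tier, or None on a miss."""
        for _, _, blocks in self.tiers:
            if key in blocks:
                value = blocks.pop(key)
                self._insert(0, key, value)
                return value
        return None

    def tier_of(self, key):
        for name, _, blocks in self.tiers:
            if key in blocks:
                return name
        return None

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return  # demoted past the slowest tier: the block leaves the cache
        _, cap, blocks = self.tiers[level]
        blocks[key] = value  # newest entry sits at the end of the OrderedDict
        while len(blocks) > cap:
            # Demote the least recently used unpinned block; fall back to the
            # oldest block if everything in this tier is pinned.
            victim = next((k for k in blocks if k not in self.pinned), next(iter(blocks)))
            self._insert(level + 1, victim, blocks.pop(victim))


cache = TieredCache([("RAM", 2), ("SSD", 2), ("HDD", 2)])
cache.pin("a")
for name in ("a", "b", "c"):
    cache.put(name, name.upper())
placement_after_writes = {k: cache.tier_of(k) for k in "abc"}
value = cache.get("b")  # reading b promotes it to RAM and demotes c
placement_after_read = {k: cache.tier_of(k) for k in "abc"}
```

Writing three blocks into a two-block RAM tier demotes the unpinned block `b` to SSD while the pinned block `a` stays put; reading `b` later promotes it again and pushes `c` down instead.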
Alluxio Benefits
Unification: new workflows across any data in any storage system
Performance: orders of magnitude improvement in run time
Flexibility: choice in compute and storage – grow each independently, buy only what is needed
100+ known production deployments
AND MORE!
Big Data Case Study –
Challenge –
Gain an end-to-end view of the business with a large volume of data
Queries were slow / not interactive, resulting in operational inefficiency
Solution –
ETL data from Teradata to Alluxio
Impact –
Faster time to market – “Now we don’t have to work Sundays”
Use case: http://bit.ly/2oMx95W
Diagram labels: Spark, Teradata
Big Data Case Study – Top 3 Retailer
Challenge –
Bottleneck in trend analysis of mission-critical daily sales and inventory management
Queries were slow / not interactive, resulting in operational inefficiency
Solution –
With Alluxio, data queries are 10X faster
Impact –
Higher operational efficiency
Use case: http://bit.ly/2ook8Nh
Diagram labels: Spark, HDFS
Consumer Intelligence Use Case – Top 3 Telco
Challenge –
Desired a central view of consumer information in near real time for proactive support.
Many HDFS clusters with different distributions and many incompatible versions, on-prem and in the cloud. Integration through heavy ETL.
Solution –
Alluxio integrates data into a central catalog for fast access to consumer interaction records.
Impact –
Reduced integration time
Faster data speed & freshness
Diagram labels: ML, Hadoop, ETL, HDFS (HDP, CDH, MapR)
Big Data Case Study –
Challenge –
Gain an end-to-end view of the business with a large volume of data
Queries were slow / not interactive, resulting in operational inefficiency
Solution –
With Alluxio, data queries are 30X faster
Impact –
Higher operational efficiency
Use case: http://bit.ly/2pDHS3O
Diagram labels: Spark, Baidu File System
Big Data Case Study –
Challenge –
Gain an end-to-end view of the business with data distributed across geographies
Data ETL was not possible due to regulatory concerns
Solution –
With Alluxio, data can be accessed without storing it locally
Impact –
Higher operational efficiency and resolved regulatory concerns
Diagram labels: Spark, HDFS
Big Data Case Study –
Challenge –
Gain an end-to-end view of the business with a large volume of data for a $5B travel site
Queries were slow / not interactive, resulting in operational inefficiency
Solution –
With Alluxio, 300x improvement in performance
Impact –
Increased revenue from immediate response to user behavior
Use case: http://bit.ly/2pDJdrq
Diagram labels: Spark, Flink, HDFS, Ceph
Machine Learning Case Study –
Challenge –
Disparate data both on-prem and in the cloud. Heterogeneous types of data.
Scaling to exabyte-size data. Slow due to a disk-based approach.
Solution –
Using Alluxio to prevent I/O bottlenecks
Impact –
Orders of magnitude higher performance than before.
Use case: http://bit.ly/2p18ds3
Diagram labels: Spark, HDFS, Minio, Mesos
Visualizing the Stack with Alluxio
Application
FAST (10^4 - 10^5 MB/s): Alluxio MEM, used often
MODERATE (10^3 - 10^4 MB/s): Alluxio SSD/HDD, limited use
SLOW (10^2 - 10^3 MB/s): Remote Storage, only when necessary
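A quick calculation shows why it matters how often each layer is hit. The sketch below reads the slide's figures as powers of ten and takes the lower bound of each range (10^4, 10^3, and 10^2 MB/s); that reading, the function name `effective_throughput`, and the 80/15/5 split of bytes across layers are assumptions chosen for illustration, not measurements.

```python
def effective_throughput(tiers):
    """Return overall MB/s for a list of (fraction_of_bytes, tier_mb_per_s).

    Time per megabyte is the fraction-weighted sum of each tier's time per
    megabyte, so the overall rate is a weighted harmonic mean.
    """
    total = sum(fraction for fraction, _ in tiers)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    seconds_per_mb = sum(fraction / rate for fraction, rate in tiers)
    return 1.0 / seconds_per_mb


# Hypothetical access mix: most bytes from memory, few from remote storage.
mixed = effective_throughput([(0.80, 1e4), (0.15, 1e3), (0.05, 1e2)])
# Baseline: every byte comes from remote storage.
remote_only = effective_throughput([(1.0, 1e2)])
speedup = mixed / remote_only
```

With these assumed numbers the mixed case works out to roughly 1,370 MB/s against 100 MB/s for remote-only access. The 5% of bytes still read remotely account for most of the remaining time, which is why the slide pushes remote reads to "only when necessary".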
1.6.0/1.6.1 Release Highlights
Ecosystem Integrations
•  S3 client
•  Python client
Performance Improvements
•  Avoid unnecessary read when closing a file with partial caching on
•  Improve data distribution when using DeterministicHashPolicy
Usability Improvements
•  Audit logging
•  Third-party UFS connector management
•  Dynamically adjusting log levels
•  Remote logging
And many more!
http://www.alluxio.org/releases
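The general idea behind a deterministic hash placement policy, as the name DeterministicHashPolicy suggests, is that the same block always maps to the same worker. The sketch below illustrates that idea only; it is not Alluxio's implementation, and `pick_worker`, the worker names, and the use of SHA-256 are assumptions.

```python
import hashlib
from collections import Counter


def pick_worker(block_id, workers):
    """Map a block to a worker so that every caller picks the same one."""
    ordered = sorted(workers)  # order must not depend on how the list was built
    digest = hashlib.sha256(str(block_id).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(ordered)
    return ordered[index]


WORKERS = ["worker-1", "worker-2", "worker-3", "worker-4"]  # hypothetical names

# The same block resolves to the same worker regardless of list order.
same_choice = pick_worker(42, WORKERS) == pick_worker(42, list(reversed(WORKERS)))

# Spread of 1000 block ids across the workers.
counts = Counter(pick_worker(block_id, WORKERS) for block_id in range(1000))
```

A stable hash is used instead of Python's built-in `hash()` because string hashing is randomized per process, which would break determinism across clients.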
Website: www.alluxio.com
E-mail: info@alluxio.com
Social Media: Twitter.com/alluxio, Linkedin.com/alluxio
Contact: haoyuan@alluxio.com
Websites: www.alluxio.com and www.alluxio.org
Demo:
Spark + Alluxio + S3 https://youtu.be/QVtxDpA-jis
Alluxio Unified Namespace https://youtu.be/lIXpNK4VxqE
