Alluxio (formerly Tachyon)
Open Source Memory Speed
Virtual Distributed Storage System
Gene Pang @ Alluxio, Inc.
July 9, 2016 @ Big Data Day LA
About Me
•  Software Engineer @ Alluxio, Inc.
•  One of the core maintainers of Alluxio Open Source Project
•  Ph.D. @ AMPLab, UC Berkeley
•  Worked at Google before UC Berkeley
•  Twitter: @unityxx
About Alluxio, Inc.
•  Founded by creators and top committers of Alluxio open
source project (formerly named Tachyon)
•  Series A by Andreessen Horowitz
•  http://www.alluxio.com
•  We are hiring!
What I’ll be Covering
•  Brief overview of Alluxio
•  Motivation for Alluxio
•  Alluxio Use Cases
What Is Alluxio?
Alluxio
Open Source
Memory Speed
Virtual
Distributed Storage System
•  Open Source. One of the fastest growing project
communities
•  Memory Speed. Memory-centric architecture designed for
memory I/O
•  Virtual. Unified Namespace abstracts storage from
applications
•  Distributed. Designed to scale out with commodity
hardware
What Does That Mean?
Alluxio Ecosystem
•  Flexibility. Unified namespace enables new workloads
across storage systems
•  Agility. Quickly adapt to frameworks and storage systems of
your choice
•  Performance. Architecture supports fast, memory-speed
access to data
•  Cost. Grow storage and compute resources independently
Alluxio Benefits
Any application can access any data from
any storage at memory speed
Alluxio is Open Source
•  Started at UC Berkeley AMPLab, Summer 2012
–  The same lab that produced Apache Mesos and Apache
Spark
•  Open sourced as Tachyon, April 2013
–  Apache License 2.0
–  Renamed to Alluxio in February 2016
–  Latest Release: Version 1.1.1 (July 2016)

The Beginnings
•  Over 250 Contributors
•  3x growth over the last year
Contributor Growth
Alluxio Open Source Community
Over 3x increase
from 1 year ago!
Contributors and Users
Alluxio is Memory Speed
Why Use Memory for Storage?
•  RAM throughput increasing exponentially
•  Disk throughput increasing slowly
•  Memory-locality key to interactive response times
Why Memory? Performance Trend
•  DRAM is becoming inexpensive (source: jcmit.com)
Why Memory? Price Trend
What if memory capacity is
still not enough?
Alluxio Manages Tiered Storage
MEM
SSD
HDD
Faster
Higher Capacity
Configurable Storage Tiers
MEM only
MEM + HDD
SSD only
Pluggable Tier Management Policies
Evict stale data to slower tier
Promote hot data to faster tier
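The evict and promote behavior described on this slide can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Alluxio's implementation: the `TieredStore` class, its method names, the block-count capacities, and the choice of LRU as the eviction policy are all hypothetical and chosen for illustration.

```python
from collections import OrderedDict


class TieredStore:
    """Toy tiered store (hypothetical). Tier 0 is fastest; later tiers are slower and larger."""

    def __init__(self, tiers):
        # tiers: list of (name, capacity_in_blocks), ordered fastest first.
        self.names = [name for name, _ in tiers]
        self.capacities = [capacity for _, capacity in tiers]
        # One LRU-ordered dict per tier: the least recently used block comes first.
        self.blocks = [OrderedDict() for _ in tiers]

    def _insert(self, level, block_id, data):
        tier = self.blocks[level]
        tier[block_id] = data
        tier.move_to_end(block_id)
        if len(tier) > self.capacities[level]:
            # Evict the stalest block to the next slower tier, if one exists.
            stale_id, stale_data = tier.popitem(last=False)
            if level + 1 < len(self.blocks):
                self._insert(level + 1, stale_id, stale_data)
            # Otherwise the block simply leaves this toy store.

    def write(self, block_id, data):
        # Drop any older copy so each block lives in exactly one tier.
        for tier in self.blocks:
            tier.pop(block_id, None)
        self._insert(0, block_id, data)

    def read(self, block_id):
        for tier in self.blocks:
            if block_id in tier:
                data = tier.pop(block_id)
                # Promote hot data to the fastest tier on access.
                self._insert(0, block_id, data)
                return data
        return None

    def location(self, block_id):
        # Return the name of the tier currently holding the block, or None.
        for name, tier in zip(self.names, self.blocks):
            if block_id in tier:
                return name
        return None


store = TieredStore([("MEM", 2), ("SSD", 2), ("HDD", 4)])
for block in ["a", "b", "c"]:
    store.write(block, block.upper())
print(store.location("a"))  # "a" was the stalest block when MEM filled up
store.read("a")
print(store.location("a"))  # reading "a" promoted it back to the fastest tier
```

A pluggable policy would replace the hard-coded LRU choice in `_insert` with a strategy object; the sketch keeps a single policy to stay short.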
Alluxio is a 
Virtual Distributed Storage
System
The Big Data Ecosystem Today
This is Problematic
•  Costly Ecosystem Integrations
•  Costly ETL and Data Duplication
•  Data Silos
•  Long Cycle from Data to Value
What are the Problems?
Alluxio Unifies Access to Data
How to use Alluxio?
•  Accelerate access to remote storage
•  Share data across jobs/applications at memory speed
•  Transparently manage data across different storage systems
Alluxio Common Use Cases
Accelerating Access to
Remote Storage
Remote I/O to Data
Spark
Amazon S3
every data operation requires data transfer, sometimes over the WAN
high latency, network throughput
Local I/O with Alluxio
Spark
Amazon S3
Alluxio
low latency, memory throughput
high latency, network throughput
Keeping data in Alluxio accelerates data access
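The idea on this slide, paying the remote transfer once and serving later reads from local memory, can be sketched as a read-through cache. This is a minimal illustration and not Alluxio's client API: `RemoteStore`, `ReadThroughCache`, and the object key are hypothetical names, and the remote store is simulated with an in-process dictionary.

```python
class RemoteStore:
    """Stand-in for a remote object store such as Amazon S3 (hypothetical)."""

    def __init__(self, objects):
        self.objects = dict(objects)
        self.reads = 0  # counts simulated remote transfers

    def get(self, key):
        self.reads += 1
        return self.objects[key]


class ReadThroughCache:
    """Keeps data in local memory after the first remote read (hypothetical)."""

    def __init__(self, remote):
        self.remote = remote
        self.memory = {}

    def get(self, key):
        if key not in self.memory:
            # First access: pay the high-latency remote transfer once.
            self.memory[key] = self.remote.get(key)
        # Later accesses are served from memory without touching the remote store.
        return self.memory[key]


remote = RemoteStore({"logs/part-0": b"example bytes"})
cache = ReadThroughCache(remote)
cache.get("logs/part-0")
cache.get("logs/part-0")
print(remote.reads)  # number of remote transfers after two reads of the same key
```

The sketch ignores capacity limits and invalidation, both of which a real cache in front of remote storage would need.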
Sharing Data at
Memory Speed
Sharing Data Slowly
Spark
Amazon S3
MapReduce
Flink
Network I/O
Disk I/O
I/O slows down sharing
Sharing Data at Memory Speed with Alluxio
Spark
Amazon S3
MapReduce
Flink
Alluxio
Share data via memory
Managing Data Across
Different Storage Systems
Simple World
Application 1
HDFS
Adding a Storage System
Application 1
HDFS
Amazon S3
Adding a Storage System
Application 1
Google GCS
HDFS
Amazon S3
Adding an Application
Application 1
Google GCS
HDFS
Amazon S3
Application 2
Adding an Application
Application 1
Google GCS
HDFS
Amazon S3
Application 2
Application 3
complex, inflexible
With Alluxio
Application 1
HDFS
Alluxio
New Storage Systems and Applications
Application 1
Google GCS
HDFS
Amazon S3
Application 2
Application 3
Alluxio
Flexible, simple
no application changes, new mount point
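The "no application changes, new mount point" idea can be sketched as a mount table that maps paths in one namespace onto different under storage systems. This is a simplified illustration, not Alluxio's API: the `UnifiedNamespace` class, its methods, and every URI below (namenode address, bucket names) are hypothetical.

```python
class UnifiedNamespace:
    """Toy mount table (hypothetical): one namespace over several storage systems."""

    def __init__(self):
        self.mounts = {}  # mount point -> under storage URI prefix

    def mount(self, mount_point, ufs_uri):
        # Adding a storage system is a new mount entry; applications are untouched.
        self.mounts[mount_point.rstrip("/")] = ufs_uri.rstrip("/")

    def resolve(self, path):
        # Pick the longest mount point that covers the path.
        best = None
        for mount_point in self.mounts:
            if path == mount_point or path.startswith(mount_point + "/"):
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount point covers {path}")
        return self.mounts[best] + path[len(best):]


ns = UnifiedNamespace()
ns.mount("/hdfs", "hdfs://namenode:9000/data")   # hypothetical HDFS cluster
ns.mount("/s3", "s3a://example-bucket")          # hypothetical S3 bucket
ns.mount("/gcs", "gs://example-bucket")          # hypothetical GCS bucket
print(ns.resolve("/s3/logs/part-0"))
```

An application keeps using paths such as `/s3/logs/part-0`; which storage system sits behind the path is decided by the mount table alone.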
Alluxio in the Wild!
Use Case
•  Framework: Spark SQL
•  Under Storage: Baidu’s File System
•  Storage Media: MEM + HDD
•  200+ nodes deployment
•  2PB+ managed space
Use Case
•  Framework: Spark
•  Storage Media: MEM
•  Improvement from Hours to Seconds
Use Case
•  Framework: Spark Streaming + Flink Streaming + Spark +
Flink
•  Under Storage: Multiple HDFS clusters
•  Storage Media: MEM + HDD
•  200+ nodes deployment
•  Alluxio enables previously impossible jobs to finish
•  300x Performance Improvement during peak load
•  Alluxio Project: www.alluxio.org
•  Alluxio, Inc: www.alluxio.com
•  Development: www.github.com/Alluxio/alluxio
•  Meet Friends: www.meetup.com/Alluxio
•  Email: gene@alluxio.com
To Get More Information

Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Alluxio (formerly Tachyon): An Open Source Memory Speed Virtual Distributed Storage - Gene Pang, Software Engineer, Alluxio