Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloads

Alluxio 2.0
Dipti Borkar | Product @ Alluxio

The Alluxio Story
Originated as Tachyon project, at the UC Berkley’s AMP Lab
by then Ph.D. student & now Alluxio CTO, Haoyuan (H.Y.) Li.
2014
2015
Open Source project established & company to
commercialize Alluxio founded
Goal: Orchestrate Data at Memory Speed for the Cloud
for data driven apps such as Big Data Analytics, ML and AI.
2018 20192018

4 big trends driving the need for a new architecture
Separation of
Compute &
Storage
Hybrid – Multi
cloud
environments
Self-service
data across the
enterprise
Rise
of the object
store

Data Ecosystem - Beta Data Ecosystem 1.0
COMPUTE
STORAGE STORAGE
COMPUTE

Co-located
Big data journey and innovation options for enterprises
Co-located
compute & HDFS
on the same cluster
Disaggregated
compute & HDFS
on the same cluster
MR / Hive
HDFS
Hive
HDFS
Disaggregated
Burst HDFS data in
the cloud,
public or private
Support Presto, Spark
and other computes
without app changes
Enable & accelerate
big data on
object stores
Transition to Object store
HDFS for Hybrid Cloud
Support more frameworks
§ Typically compute-bound
clusters over 100% capacity
§ Compute & I/O need to be
scaled together even when
not needed
§ Compute & I/O can be
scaled independently but
I/O still needed on HDFS
which is expensive
1 2
3
4
5

Data Orchestration for the Cloud
Java File API HDFS Interface S3 Interface REST APIPOSIX Interface
HDFS Driver Swift Driver S3 Driver NFS Driver
Independent scaling of compute & storage

Data Elasticity
with a unified
namespace
Abstract data silos & storage
systems to independently scale
data on-demand with compute
Run Spark, Hive, Presto, ML
workloads on your data
located anywhere
Accelerate big data
workloads with transparent
tiered local data
Data Accessibility
for popular APIs &
API translation
Data Locality
with Intelligent
Multi-tiering
Alluxio – Key innovations

Alluxio 2.0 - Powering multi-cloud analytics & AI
Breakthrough
data orchestration
for multi-cloud
Amazon AWS
integration
Compute optimized
data access for
cloud analytics
Architectural
foundations using
Open Source

Alluxio
MasterZookeeper /
RAFT
Standby
Master
WAN
Alluxio
Client
Alluxio
Client
Alluxio
Worker
RAM / SSD / HDD
Alluxio
Worker
RAM / SSD / HDD
Alluxio Reference Architecture
…
…
Application
Application
Under Store 1
Under Store 2

Breakthrough
data orchestration
for multi-cloud
§ Policy-driven data management
§ Cross cloud storage efficient
data movement via job service
§ Improved administration of data
access policies

Data orchestration for Multi-cloud
Most
recent
Oldest
RAM SSD HDD
Read & Write Buffering
Transparent to App
Policies for pinning,
promotion/demotion,TTL
Policy-driven data management
Continuous and
automated data
movement based
on date created
policy

Spark
Alluxio
Any public /
private cloud
PrestoHive
Data orchestration for Multi-cloud
Cross cloud storage efficient data movement
Efficient and simplified
data movement across
storage systems

§ POSIX support for AI and
Machine Learning
§ Compute-focused Cluster
Partitioning
§ Integration with external data
sources over REST
Compute optimized
data access for
cloud analytics

Data Orchestration for the Cloud
HDFS Driver Swift Driver S3 Driver NFS Driver
POSIX API for AI & Machine Learning

Spark
Any public /
private cloud
PrestoHive
Compute optimized data access for cloud analytics
Compute-focused Cluster Partitioning
Single Alluxio Cluster
partitioned to prevent
pollution of datasets
for each compute job

Compute optimized data access for cloud analytics
Integration with external data sources over REST
Alluxio
Spark
AlluxioAlluxio
Spark
Alluxio
SparkSpark
data.gov
Hive
Alluxio
Hive
Alluxio
Hive
Alluxio
Tensorflow
Alluxio
data.gov

Amazon AWS
integration
§ AWS Elastic Map Reduce (EMR)
service integration
§ Availability of Alluxio Amazon
Machine Image

Architectural
foundations using
Open Source
§ RocksDB for tiering of metadata
of data managed by Alluxio to
scale to billions of files
§ GRPC for efficient
communication within the
cluster to scale to 1000s of
nodes

§ S3 performance is variable and consistent
query SLAs are hard to achieve
§ S3 metadata operations are expensive
making workloads run longer
§ S3 egress costs add up making the
solution expensive
§ S3 is eventually consistent making it hard
to predict query results
Challenges with running Big Data workloads on S3 & Alluxio Solution
Compute caching for S3 Accelerate big data frameworks
on the public cloud
Same instance
/ container
Alluxio
Spark
AlluxioAlluxio
Spark
Alluxio
SparkSpark

Hive
Alluxio
Hive
Alluxio
Hive
Alluxio
§ Accessing data over WAN too slow
§ Copying data to compute cloud time
consuming and complex
§ Using another storage system like S3
means expensive application changes
§ Using S3 via HDFS connector leads
to extremely low performance
Challenges with Hybrid Cloud & Alluxio Solution
HDFS for Hybrid Cloud
Hive
Alluxio
Burst big data workloads in
hybrid cloud environments
Same instance
/ container
Solution Benefits
§ Same performance as local
§ Same end-user experience
§ 100% of I/O is offloaded

Challenges with supporting more frameworks & Alluxio Solution
§ Running new frameworks on existing an
HDFS cluster can dramatically affect
performance of existing workloads
§ In a disaggregate environment, copying data
to multiple compute clouds time
consuming and error prone
§ Migrating applications for new storage
systems is complex & time consuming
§ Storing and managing multiple copies of
the data becomes expensive
Support more frameworks
Any object store or HDFS
Same data
center / region
Presto
Enable big data on object stores
across single or multiple clouds
or
Spark
Alluxio Alluxio

Alluxio
Presto
Alluxio
Presto
Challenges running Big Data on Object Stores & Alluxio Solution
§ Object stores performance for big
data workloads can be very poor
§ No native support for popular
frameworks
§ Expensive metadata operations
reduce performance even more
§ No support for hybrid environments
directly
Transition to Object store
Dramatically speed-up big data
on object stores on premise
Same container
/ machine
or or
Solution Benefits
§ Same performance as HDFS
§ Uses HDFS APIs
§ Same end-user experience
§ Storage at fraction of the
cost of HDFS
Alluxio
Presto
Alluxio
Presto

DATA ORCHESTRATION
SPARK
HDFS
SPARK
HDFS
Public Cloud
Public Cloud
§ Compute scales elastically independent of storage
§ Faster time to insights with seamless data
orchestration
§ Accelerated workloads with memory-first data
approach
Two Sigma
Fastest growing big hedge fund managing $46 billion for investors
Use case | Cloud bursting on-premise data

Use case | Data orchestration for agility
DATA ORCHESTRATION
SPARK
HDFS
SPARK
Kubernetes
OBJECT HBASE
ETLSPARK
HDFS OBJECT HBASE
§ Single namespace to access & address all data
§ Data local to compute accelerates workloads
China Unicom
Leading ChineseTelco serving 320 million subscribers

Incredible Open Source Momentum with growing community
1000+ contributors &
growing
4000+ Git Stars
Apache 2.0 Licensed
Hundreds of thousands
of downloads
Join the conversation on Slack
alluxio.org/slack

Questions?
Join the Alluxio Community
www.alluxio.io| @alluxio

Data Locality with Intelligent Multi-tiering
Local performance from remote data using multi-tier storage
Hot Warm Cold
RAM SSD HDD
Read & Write Buffering
Transparent to App
Policies for pinning,
promotion/demotion,TTL

Data Accessibility via popular APIs and API Translation
Convert from Client-side Interface to native Storage Interface
HDFS Driver Swift DriverS3 Driver NFS Driver

Data Elasticity via Unified Namespace
Enables effective data management across different Under Store
- Uses Mounting withTransparent Naming

Unified Namespace: Global Data Accessibility
Transparent access to understorage makes all enterprise data available
locally
SUPPORTS
• HDFS
• NFS
• OpenStack
• Ceph
• Amazon S3
• Azure
• Google Cloud
IT OPS FRIENDLY
• Storage mounted into Alluxio
by central IT
• Security in Alluxio mirrors
source data
• Authentication through
LDAP/AD
• Wireline encryption
HDFS #1
Object Store
NFS
HDFS #2

Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloads

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloads

Similar to Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloads (20)

More from Alluxio, Inc.

More from Alluxio, Inc. (20)

Recently uploaded

Recently uploaded (20)

Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloads