Achieving Separation of Compute and Storage in a Cloud World

Dipti Borkar | Vice President, Product | Alluxio

Today’s Speaker
Dipti Borkar, VP of Product at Alluxio
Dipti has over 15 years of experience in data and database technologies across relational and non-relational data. Prior to Alluxio, Dipti was VP of Product Marketing at Kinetica and Couchbase. At Couchbase she held several leadership positions, including Head of Global Technical Sales and Head of Product Management.
Earlier in her career, Dipti managed development teams at IBM DB2, where she started as a database software engineer.
Dipti holds an M.S. in Computer Science from UC San Diego and an MBA from the Haas School of Business at UC Berkeley.

Agenda
Why storage-independent compute?
Alluxio Technology Overview
Real-world Use Cases

From mainframes to Big Data
Moving from tightly integrated to loosely integrated architectures
Application, processing, data storage and hardware: all-in-one, tightly coupled
Client-server architecture drives application separation. Processing and data storage still tightly coupled
Data growth drives distributed MPP architectures, but processing and data storage still tightly coupled
Further data growth drives distributed file system architecture. Processing and data storage co-located but loosely coupled

The Big Data Ecosystem
Co-located compute and storage for big data workloads
§ More defined and loosely coupled compute layer compared with relational databases
§ But compute / data processing still runs on the same nodes where the data is stored: MapReduce runs on HDFS across the cluster
§ Compute layer and storage layer must be scaled out by the same factor
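A rough illustration of the last bullet, as a minimal sketch. The per-node capacities and workload numbers are made up for the example and do not come from the talk. When every node bundles a fixed amount of CPU and disk, cluster size is set by whichever resource is scarcer, and the other one sits idle.

```python
import math

# Hypothetical per-node capacities, chosen only for illustration.
CORES_PER_NODE = 16
TB_PER_NODE = 20


def coupled_nodes(cores_needed: int, tb_needed: int) -> int:
    """Nodes required when every node bundles compute and storage."""
    by_compute = math.ceil(cores_needed / CORES_PER_NODE)
    by_storage = math.ceil(tb_needed / TB_PER_NODE)
    return max(by_compute, by_storage)


def idle_cores(cores_needed: int, tb_needed: int) -> int:
    """Cores that are provisioned but not needed in the coupled layout."""
    provisioned = coupled_nodes(cores_needed, tb_needed) * CORES_PER_NODE
    return provisioned - cores_needed


if __name__ == "__main__":
    # Data grows 4x while compute demand stays flat.
    for tb in (200, 800):
        print(tb, "TB:", coupled_nodes(160, tb), "nodes,",
              idle_cores(160, tb), "idle cores")
```

With independent scaling, the storage tier could grow on its own and the compute tier would stay sized for the 160 cores actually in use.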

Mega trends driving the need for a new architecture
CLOUD | DATA

The Big Data Ecosystem Explodes
Moving from tightly integrated to loosely integrated architectures
STORAGE
COMPUTE

Why independently scale compute and storage for data-driven
applications?
Compute is CPU bound: flexible compute scaling based on application demands
Storage is I/O bound: flexible storage scaling based on data growth patterns

Why independently scale compute and storage for data-driven
applications?
Reduced data duplication by using same storage for multiple compute frameworks
Leverage cheaper and newer storage like object stores for big data / AI workloads
Orchestrate & automate compute for greater operational efficiency
Protect & control your data on premises and leverage public cloud for compute

An independently scaling Big Data Stack?
[Diagram labels: COMPUTE, STORAGE]

The challenges of independent scaling for data-driven workloads
Data Locality: Data is no longer local to compute, so workload processing time increases, particularly in hybrid cloud deployments
Data Accessibility: Data is in multiple storage systems in multiple locations. Highly complex when all compute frameworks talk to all storage systems
Data Abstraction: Data can still only be accessed using the specific storage system APIs

Truly independent scaling of the data stack
A new layer emerges between Compute & Storage
Data Locality | Data Accessibility | Data Abstraction

The Alluxio Story
Project started as Tachyon at UC Berkeley’s AMPLab by
then Ph.D. student & now Alluxio CEO, Haoyuan (H.Y.) Li.
2014
2015
Open Source project established & company to
commercialize Alluxio founded
Goal: Unify Data at Memory Speed for data driven
applications such as Big Data Analytics, ML and AI.
2018
Top 10 Hottest Data Storage Startup

Virtual Unified File System
Java File API | HDFS Interface | S3 Interface | REST API | FUSE Interface
HDFS Driver | Swift Driver | S3 Driver | NFS Driver

Key Innovations of the Virtual Unified File System
Unified Namespace: Bring all files into a single interface
API Translation: Interact with data using any API
Intelligent Multi-tiering: Accelerate & tier data transparently

Unified Namespace
Enables effective data management across different Under Stores
- Uses Mounting with Transparent Naming

Unified Namespace: Global Data Accessibility
Transparent access to under storage makes all enterprise data available locally
SUPPORTS
• HDFS
• NFS
• OpenStack
• Ceph
• Amazon S3
• Azure
• Google Cloud
IT OPS FRIENDLY
• Storage mounted into Alluxio by central IT
• Security in Alluxio mirrors source data
• Authentication through LDAP/AD
• Wireline encryption
[Diagram labels: HDFS #1, Object Store, NFS, HDFS #2]
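The mounting idea can be sketched as a small mount table that maps paths in one namespace onto under-store URIs. This is an illustrative toy, not Alluxio's implementation: the `MountTable` class, the mount points, the host name, and the bucket name are all hypothetical.

```python
from typing import Dict, Optional


class MountTable:
    """Toy mount table: maps paths in one namespace to under-store URIs."""

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}

    def mount(self, mount_point: str, under_store_uri: str) -> None:
        # Normalize so "/objects/" and "/objects" are the same mount point.
        self._mounts[mount_point.rstrip("/") or "/"] = under_store_uri.rstrip("/")

    def resolve(self, path: str) -> str:
        """Return the under-store URI for a path, using the longest matching mount."""
        best: Optional[str] = None
        for point in self._mounts:
            prefix = point if point.endswith("/") else point + "/"
            if path == point or path.startswith(prefix):
                if best is None or len(point) > len(best):
                    best = point
        if best is None:
            raise KeyError(f"no mount covers {path}")
        rest = path[len(best):].lstrip("/")
        base = self._mounts[best]
        return f"{base}/{rest}" if rest else base


if __name__ == "__main__":
    table = MountTable()
    # Hypothetical under stores, for illustration only.
    table.mount("/hdfs1", "hdfs://namenode1:8020/warehouse")
    table.mount("/objects", "s3://example-bucket/raw")
    print(table.resolve("/objects/2019/events.parquet"))
    print(table.resolve("/hdfs1"))
```

Applications see only the left-hand paths; which storage system sits behind each one is a mounting decision made centrally.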

Server-side API Translation: From legacy to modern
Convert from Client-side Interface to native Storage Interface
Java File API | HDFS Interface | S3 Interface | REST API | FUSE Interface
HDFS Driver | Swift Driver | S3 Driver | NFS Driver
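One way to picture the translation is an adapter: a single client-facing interface on top, interchangeable storage drivers underneath. The sketch below uses in-memory stand-ins; every class and method name is hypothetical and none of it is Alluxio's API.

```python
from abc import ABC, abstractmethod
from typing import Dict


class UnderStoreDriver(ABC):
    """Hypothetical driver contract that each storage backend implements."""

    @abstractmethod
    def put(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, path: str) -> bytes: ...


class FakeObjectStoreDriver(UnderStoreDriver):
    """Stands in for an object store: flat keys, no directories."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def put(self, path: str, data: bytes) -> None:
        self.objects[path.lstrip("/")] = data

    def get(self, path: str) -> bytes:
        return self.objects[path.lstrip("/")]


class FakeFileSystemDriver(UnderStoreDriver):
    """Stands in for a file system: nested directories as nested dicts."""

    def __init__(self) -> None:
        self.root: dict = {}

    def put(self, path: str, data: bytes) -> None:
        *dirs, name = path.strip("/").split("/")
        node = self.root
        for d in dirs:
            node = node.setdefault(d, {})
        node[name] = data

    def get(self, path: str) -> bytes:
        node = self.root
        for part in path.strip("/").split("/"):
            node = node[part]
        return node


class FileApi:
    """One client-facing, file-style API regardless of the driver behind it."""

    def __init__(self, driver: UnderStoreDriver) -> None:
        self.driver = driver

    def write_file(self, path: str, text: str) -> None:
        self.driver.put(path, text.encode("utf-8"))

    def read_file(self, path: str) -> str:
        return self.driver.get(path).decode("utf-8")


if __name__ == "__main__":
    for driver in (FakeObjectStoreDriver(), FakeFileSystemDriver()):
        api = FileApi(driver)
        api.write_file("/a/b.txt", "hello")
        print(type(driver).__name__, api.read_file("/a/b.txt"))
```

The application code in the loop is identical for both backends, which is the point of doing the translation on the server side.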

Intelligent Multi-tiering: Get high-value data faster
Local performance from remote data using multi-tier storage
Hot | Warm | Cold
RAM | SSD | HDD
Read & Write Buffering
Transparent to App
Policies for pinning, promotion/demotion, TTL
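A minimal sketch of pinning and promotion/demotion, assuming a least-recently-used policy and tiers sized in blocks. The policy, the `TieredStore` class, and the capacities are assumptions for illustration; TTL is left out, and Alluxio's actual policies are configurable and more involved.

```python
from collections import OrderedDict
from typing import Dict, Optional, Set


class TieredStore:
    """Toy multi-tier block store. Tier order (fastest first) follows dict order."""

    def __init__(self, capacities: Dict[str, int]) -> None:
        self.capacities = dict(capacities)
        self.tiers: Dict[str, OrderedDict] = {n: OrderedDict() for n in capacities}
        self.pinned: Set[str] = set()

    def pin(self, key: str) -> None:
        """Pinned blocks are never chosen for demotion."""
        self.pinned.add(key)

    def where(self, key: str) -> Optional[str]:
        for name, blocks in self.tiers.items():
            if key in blocks:
                return name
        return None

    def _insert(self, level: int, key: str, value: bytes) -> None:
        names = list(self.tiers)
        tier = self.tiers[names[level]]
        tier[key] = value
        tier.move_to_end(key)  # most recently used goes last
        if len(tier) <= self.capacities[names[level]]:
            return
        # Over capacity: demote the least recently used unpinned block.
        victim = next((k for k in tier if k not in self.pinned), None)
        if victim is None:
            raise RuntimeError("tier holds only pinned blocks")
        victim_value = tier.pop(victim)
        if level + 1 < len(names):
            self._insert(level + 1, victim, victim_value)
        # Past the last tier the block leaves the cache entirely.

    def put(self, key: str, value: bytes) -> None:
        current = self.where(key)
        if current is not None:
            del self.tiers[current][key]
        self._insert(0, key, value)

    def get(self, key: str) -> Optional[bytes]:
        current = self.where(key)
        if current is None:
            return None
        value = self.tiers[current].pop(key)
        self._insert(0, key, value)  # promotion: reads move data to the top tier
        return value


if __name__ == "__main__":
    store = TieredStore({"RAM": 2, "SSD": 2, "HDD": 2})
    for name in ("a", "b", "c"):
        store.put(name, b"data")
    print(store.where("a"))  # demoted once RAM filled up
    store.get("a")
    print(store.where("a"))  # promoted back by the read
```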

Alluxio Reference Architecture
[Diagram: Applications, each with an Alluxio Client; Alluxio Master and Standby Master (Zookeeper / RAFT); Alluxio Workers (RAM / SSD / HDD); Under Store 1 and Under Store 2; WAN]

Data Flow In Alluxio
1. Applications Read/Write data via the Alluxio Client
2. Read Scenarios
• Data not in Alluxio (i.e. first time, or no cache)
• Data on same node as client
• Data on different node from client
3. Write Scenarios
• Write only to Alluxio
• Write only to Under Store
• Write synchronously to Alluxio and Under Store
• Write to Alluxio and asynchronously write to Under Store
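The four write scenarios can be sketched with two dictionaries standing in for Alluxio storage and the Under Store. The `WriteMode` names and the `ToyWriter` class are hypothetical labels for this example, not the names used by the Alluxio client.

```python
from enum import Enum
from typing import Dict, List


class WriteMode(Enum):
    """Hypothetical labels for the four write scenarios listed above."""
    ALLUXIO_ONLY = 1
    UNDER_STORE_ONLY = 2
    SYNC_BOTH = 3
    ASYNC_BOTH = 4


class ToyWriter:
    def __init__(self) -> None:
        self.cache: Dict[str, bytes] = {}        # stands in for Alluxio storage
        self.under_store: Dict[str, bytes] = {}  # stands in for the Under Store
        self.pending: List[str] = []             # paths waiting for async persist

    def write(self, path: str, data: bytes, mode: WriteMode) -> None:
        if mode is not WriteMode.UNDER_STORE_ONLY:
            self.cache[path] = data
        if mode in (WriteMode.UNDER_STORE_ONLY, WriteMode.SYNC_BOTH):
            self.under_store[path] = data
        if mode is WriteMode.ASYNC_BOTH:
            # The write returns now; persistence happens later in flush().
            self.pending.append(path)

    def flush(self) -> int:
        """Persist queued paths to the Under Store; returns how many were written."""
        count = 0
        while self.pending:
            path = self.pending.pop(0)
            self.under_store[path] = self.cache[path]
            count += 1
        return count


if __name__ == "__main__":
    w = ToyWriter()
    w.write("/tmp/a", b"1", WriteMode.ASYNC_BOTH)
    print("/tmp/a" in w.under_store)  # not yet persisted
    w.flush()
    print("/tmp/a" in w.under_store)  # persisted after the flush
```

The trade-off is between write latency and durability: a cache-only write is fastest but is lost if the cache is, while a synchronous write to both pays the Under Store's latency up front.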

Read data in Alluxio, on same node as client
[Diagram: Application, Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD)]
Memory Speed Read of Data
Read data not in Alluxio
[Diagram: Application, Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD), Under Store]
Network / Disk Speed Read of Data
Write data only to Alluxio on same node as client
[Diagram: Application, Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD)]
Memory Speed Write of Data
Write data to Alluxio and Under Store synchronously
[Diagram: Application, Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD), Under Store]
Network / Disk Speed Write of Data
Metadata Path: Familiar Semantics
• Alluxio also provides local metadata to the compute layer
• Listing / renaming on object store can be expensive
• Alluxio speeds up these operations
• Alluxio loads and manages metadata in master
• Apps can continue assuming HDFS-like semantics
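Why listing can be expensive on an object store, and how holding metadata in memory helps, can be sketched as follows. The toy object store has only flat keys, so a directory listing is a prefix scan; the cache remembers the result. Both classes are hypothetical, and the sketch ignores invalidation when the store changes.

```python
from typing import Dict, Iterable, List


class CountingObjectStore:
    """Flat key space; counts how many prefix scans are issued against it."""

    def __init__(self, keys: Iterable[str]) -> None:
        self.keys = set(keys)
        self.list_calls = 0

    def list_prefix(self, prefix: str) -> List[str]:
        self.list_calls += 1
        return sorted(k for k in self.keys if k.startswith(prefix))


class MetadataCache:
    """Keeps directory listings in memory so repeat listings skip the store."""

    def __init__(self, store: CountingObjectStore) -> None:
        self.store = store
        self._listings: Dict[str, List[str]] = {}

    def list_dir(self, directory: str) -> List[str]:
        prefix = directory.strip("/") + "/"
        if prefix not in self._listings:
            children = set()
            for key in self.store.list_prefix(prefix):
                # Keep only the first path component below the directory.
                children.add(key[len(prefix):].split("/", 1)[0])
            self._listings[prefix] = sorted(children)
        return self._listings[prefix]


if __name__ == "__main__":
    store = CountingObjectStore(
        ["logs/2019/a.log", "logs/2019/b.log", "logs/readme.txt", "other/x"]
    )
    meta = MetadataCache(store)
    print(meta.list_dir("/logs"))
    print(meta.list_dir("/logs"))
    print("scans issued:", store.list_calls)
```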

Popular Technical Use Cases
Virtual Data Lake | Machine Learning Productivity | Self-service data across hybrid cloud
§ Accelerate batch, micro-batch & streaming jobs
§ Slowly transition to lower cost object stores
§ Run in hybrid cloud environment with compute in the cloud
§ Accelerate ML jobs running on object stores or file systems
§ Provide consistent performance to data scientists
§ Provide unified interface to access all data
§ Accelerate & tier data transparently across storage tiers
§ Co-locate remote data with compute for performance

China Unicom
Challenge
Desired a central view of business data across multiple systems for big data workloads
Solution
Alluxio integrates data across multiple storage systems to be accessed by Spark in a hybrid environment
Impact
Significantly faster workloads and faster innovation

Machine Learning Case Study
Challenge: Gain end-to-end view of business with large volume of data while complying with regional data regulations
Solution: ETL data from Teradata to Alluxio
Impact: Faster time to market. “Now we don’t have to work Sundays”
Use case: http://bit.ly/2oMx95W
[Diagram labels: SPARK, TERADATA (shown twice)]

Analytics Use Case – Top Retailer
Challenge: Bottleneck in trend analysis of mission-critical daily sales and inventory management. Queries were slow / not interactive, resulting in operational inefficiency
Solution: With Alluxio, data queries are 10X faster
Impact: Higher operational efficiency
Use case: http://bit.ly/2ook8Nh
[Diagram labels: SPARK, HDFS (shown twice)]

Incredible Open Source Momentum with growing community
900+ contributors & growing
3760+ GitHub stars
Apache 2.0 Licensed
Hundreds of thousands of downloads

Thank You
Questions? Email me: dipti@alluxio.com
Join the Alluxio Community
www.alluxio.org | www.alluxio.com | Twitter: @Alluxio | Slack
