Unify Data at Memory Speed
Bin Fan | Founding Engineer&VP of Open Source| Alluxio
binfan@Alluxio.com
Agenda
Why storage-independent compute?
Alluxio Overview
Demo: Spark + Alluxio + S3
Real-world Use Cases
From mainframes to Big Data
Moving from tightly integrated to loosely integrated architectures
Application, processing, data
storage and hardware -
All-in-one tightly coupled
Client server architecture
drives application separation.
Processing and data storage
still tightly coupled
Data growth drives
distributed MPP architectures
but processing and data
storage still tightly coupled
Further data growth drives
distributed file system architecture.
Processing and data storage co-
located but loosely coupled
1970s 1980s 2000s 2010s
The Big Data Ecosystem Explodes
Moving from tightly integrated to loosely integrated architectures
STORAGE
COMPUTE
Why independently scale compute and storage for data-driven
applications?
Flexible compute scaling based
on application demands
Flexible storage scaling based
on data growth patterns
Compute is CPU bound Storage is I/O bound
Why independently scale compute and storage for data-driven
applications?
Flexibility of using hardware /
instances that match the workload
CPU / GPU + IO + Memory S3
Leverage cheaper and newer
storage like object stores for
big data / AI workloads
Orchestrate & automate
compute for greater
operational efficiency
Protect & control your data on
premises and leverage public
cloud for compute
The challenges of independent scaling for data-driven workloads
Data Locality
Data Accessibility
Data Abstraction
Data is no more local to compute and
workload processing time will increase
particularly in hybrid cloud deployments
Data is in multiple storage systems in multiple
locations. Highly complex when all compute
frameworks talk to all storage systems
Data can still only be accessed using the
specific storage system APIs
Virtual Unified File System
Java File API HDFS Interface S3 Interface REST APIFUSE Interface
HDFS Driver Swift Driver S3 Driver NFS Driver
Alluxio Key Innovations
Unified
Namespace
Bring all files into a
single interface
Interact with data
using any API
Accelerate & tier
data transparently
API
Translation
Intelligent
Multi-tiering
Key Innovations of theVirtual Unified File System
Data Locality via Intelligent Multi-tiering
Local performance from remote data using multi-tier storage
Hot Warm Cold
RAM SSD HDD
Read & Write Buffering
Transparent to App
Policies for pinning,
promotion/demotion,TTL
Data Accessibility via Server-side API Translation
Convert from Client-side Interface to native Storage Interface
Java File API HDFS Interface S3 Interface REST APIFUSE Interface
HDFS Driver Swift DriverS3 Driver NFS Driver
Data Abstraction via Unified Namespace
Enables effective data management across different Under Store
- Uses Mounting withTransparent Naming
Unified Namespace: Global Data Accessibility
Transparent access to understorage makes all enterprise data available
locally
SUPPORTS
• HDFS
• NFS
• OpenStack
• Ceph
• Amazon S3
• Azure
• Google Cloud
IT OPS FRIENDLY
• Storage mounted into Alluxio
by central IT
• Security in Alluxio mirrors
source data
• Authentication through
LDAP/AD
• Wireline encryption
HDFS #1
Object Store
NFS
HDFS #2
Demo: Spark + Alluxio + S3
Deployment Approaches
Spark
Alluxio
Storage
Co-locate Alluxio Workers with Spark for
optimal I/O performance
Any Cloud
Same instance
/ container
Spark
Alluxio
Storage
Deploy Alluxio as standalone cluster
between Spark and Storage
Any Cloud
Same data
center / region
Presto
Alluxio
Master
Zookeeper /
RAFT
Standby
Master
Under Store 1
Under Store 2
WANAlluxio
Client
Application Object Store
Alluxio
Client
Application
Alluxio
Worker
RAM / SSD / HDD
Alluxio
Worker
RAM / SSD / HDD
Alluxio Reference Architecture
Data Flow In Alluxio
Read Scenarios
• Data not in Alluxio (i.e. first
time, or no cache)
• Data on same node as client
• Data on different node from
client
19
Write Scenarios
• Write only to Alluxio
• Write only to Under Store
• Write synchronously to Alluxio and
Under Store
• Write to Alluxio and
asynchronously write to Under
Store
• Write to Alluxio and replicate to N
other workers
• Write to Alluxio and async write to
multiple Under stores
Application have great flexibility to read / write data with many options
Real world Use cases
• Huya: 1300s (100% Hadoop nodes)
• Sogou: 1000s (60% Hadoop nodes)
• Momo: 850 nodes
• JD: 600 nodes
• Tencent: 400 nodes
• Huawei Cloud: 150 nodes
• China Unicom: 100 nodes
• …
Truly independent scaling of the data stack
Alluxio Production Deployments
Virtual
Data Lake
§ Accelerate batch, micro-
batch & streaming jobs
§ Slowly transition to
lower cost object stores
§ Run in hybrid cloud
environment with
compute in the cloud
§ Accelerate ML jobs
running on object stores
or file systems
§ Provide consistent
performance to data
scientists
§ Provide unified interface
to access all data
§ Accelerate & tier data
transparently across
storage tiers
§ Co-locate remote data
with compute for
performance
Machine Learning
Productivity
Self-service data
across hybrid cloud
Popular Technical Use Cases
Elastic Model Training
SPARK
HDFS
SPARK
HDFS
Challenge –
Algorithmic trading in $46B data
driven Hedge Fund. Model
training in cloud for bursty
workloads
Data access was slow, costing
them $$ in compute cost and
lower modeler productivity
Solution –
With Alluxio, data access are 10-
30X faster
Impact –
Increased efficiency on training of
ML algorithm, lowered compute cost
and increased modeler productivity,
resulting in 14 day ROI of Alluxio
Public Cloud
Public Cloud
Leading Hedge Fund
Analytics Use Case – Top Retailer
Challenge –
Bottleneck in Trend Analysis of
mission critical daily sales and
inventory management
Queries were slow / not interactive,
resulting in operational inefficiency
Solution –
With Alluxio, data queries are 10X
faster
Impact –
Higher operational efficiency
Use case: http://bit.ly/2ook8Nh
SPARK
HDFS
SPARK
HDFS
Incredible Open Source Momentum with growing community
900+ contributors &
growing
3900+ Git Stars
Apache 2.0 Licensed
Hundreds of thousands
of downloads
https://www.alluxio.org/slack
ThankYou
Join the Alluxio Community
www.alluxio.org | www.alluxio.com | Twitter: @Alluxio
Read data in Alluxio, on same node as client
27
Alluxio
Worker
RAM / SSD / HDD
Memory Speed Read of Data
Application
Alluxio
Client
Alluxio
Master
Read data not in Alluxio
28
RAM / SSD / HDD
Network / Disk Speed Read of
Data
Application
Alluxio
Client
Alluxio
Master
Alluxio
WorkerUnder Store
Write data only to Alluxio on same node as
client
29
Alluxio
Worker
RAM / SSD / HDD
Memory Speed Write of Data
Application
Alluxio
Client
Alluxio
Master
Write data to Alluxio and Under Store
synchronously
30
RAM / SSD / HDD
Network / Disk Speed Write of
Data
Application
Alluxio
Client
Alluxio
Master
Alluxio
Worker
Under Store
Metadata Path
Metadata Path: Familiar Semantics
• Alluxio also provides a local metadata to the compute
• Listing / renaming on object store can be expensive
• Alluxio speeds up these operations
• Alluxio loads and manages metadata in master
• Apps can continue assuming HDFS-like semantics
3232
Demo: Spark + Alluxio + S3
Deploying Alluxio with Spark
Spark
Alluxio
Storage
Spark
Alluxio
Storage
Co-locate Alluxio Workers with Spark for
optimal I/O performance
Deploy Alluxio as standalone cluster
between Spark and Storage
Demo: Alluxio with Spark
Spark
AWS S3
Spark
Alluxio
AWS S3
With Alluxio: Spark collocated with
Alluxio on S3
Baseline: Spark on S3

Alluxio @ Uber Seattle Meetup

  • 1.
    Unify Data atMemory Speed Bin Fan | Founding Engineer&VP of Open Source| Alluxio binfan@Alluxio.com
  • 2.
    Agenda Why storage-independent compute? AlluxioOverview Demo: Spark + Alluxio + S3 Real-world Use Cases
  • 3.
    From mainframes toBig Data Moving from tightly integrated to loosely integrated architectures Application, processing, data storage and hardware - All-in-one tightly coupled Client server architecture drives application separation. Processing and data storage still tightly coupled Data growth drives distributed MPP architectures but processing and data storage still tightly coupled Further data growth drives distributed file system architecture. Processing and data storage co- located but loosely coupled 1970s 1980s 2000s 2010s
  • 4.
    The Big DataEcosystem Explodes Moving from tightly integrated to loosely integrated architectures STORAGE COMPUTE
  • 5.
    Why independently scalecompute and storage for data-driven applications? Flexible compute scaling based on application demands Flexible storage scaling based on data growth patterns Compute is CPU bound Storage is I/O bound
  • 6.
    Why independently scalecompute and storage for data-driven applications? Flexibility of using hardware / instances that match the workload CPU / GPU + IO + Memory S3 Leverage cheaper and newer storage like object stores for big data / AI workloads Orchestrate & automate compute for greater operational efficiency Protect & control your data on premises and leverage public cloud for compute
  • 7.
    The challenges ofindependent scaling for data-driven workloads Data Locality Data Accessibility Data Abstraction Data is no more local to compute and workload processing time will increase particularly in hybrid cloud deployments Data is in multiple storage systems in multiple locations. Highly complex when all compute frameworks talk to all storage systems Data can still only be accessed using the specific storage system APIs
  • 8.
    Virtual Unified FileSystem Java File API HDFS Interface S3 Interface REST APIFUSE Interface HDFS Driver Swift Driver S3 Driver NFS Driver
  • 9.
  • 10.
    Unified Namespace Bring all filesinto a single interface Interact with data using any API Accelerate & tier data transparently API Translation Intelligent Multi-tiering Key Innovations of theVirtual Unified File System
  • 11.
    Data Locality viaIntelligent Multi-tiering Local performance from remote data using multi-tier storage Hot Warm Cold RAM SSD HDD Read & Write Buffering Transparent to App Policies for pinning, promotion/demotion,TTL
  • 12.
    Data Accessibility viaServer-side API Translation Convert from Client-side Interface to native Storage Interface Java File API HDFS Interface S3 Interface REST APIFUSE Interface HDFS Driver Swift DriverS3 Driver NFS Driver
  • 13.
    Data Abstraction viaUnified Namespace Enables effective data management across different Under Store - Uses Mounting withTransparent Naming
  • 14.
    Unified Namespace: GlobalData Accessibility Transparent access to understorage makes all enterprise data available locally SUPPORTS • HDFS • NFS • OpenStack • Ceph • Amazon S3 • Azure • Google Cloud IT OPS FRIENDLY • Storage mounted into Alluxio by central IT • Security in Alluxio mirrors source data • Authentication through LDAP/AD • Wireline encryption HDFS #1 Object Store NFS HDFS #2
  • 15.
    Demo: Spark +Alluxio + S3
  • 16.
    Deployment Approaches Spark Alluxio Storage Co-locate AlluxioWorkers with Spark for optimal I/O performance Any Cloud Same instance / container Spark Alluxio Storage Deploy Alluxio as standalone cluster between Spark and Storage Any Cloud Same data center / region Presto
  • 17.
    Alluxio Master Zookeeper / RAFT Standby Master Under Store1 Under Store 2 WANAlluxio Client Application Object Store Alluxio Client Application Alluxio Worker RAM / SSD / HDD Alluxio Worker RAM / SSD / HDD Alluxio Reference Architecture
  • 18.
    Data Flow InAlluxio Read Scenarios • Data not in Alluxio (i.e. first time, or no cache) • Data on same node as client • Data on different node from client 19 Write Scenarios • Write only to Alluxio • Write only to Under Store • Write synchronously to Alluxio and Under Store • Write to Alluxio and asynchronously write to Under Store • Write to Alluxio and replicate to N other workers • Write to Alluxio and async write to multiple Under stores Application have great flexibility to read / write data with many options
  • 19.
  • 20.
    • Huya: 1300s(100% Hadoop nodes) • Sogou: 1000s (60% Hadoop nodes) • Momo: 850 nodes • JD: 600 nodes • Tencent: 400 nodes • Huawei Cloud: 150 nodes • China Unicom: 100 nodes • … Truly independent scaling of the data stack Alluxio Production Deployments
  • 21.
    Virtual Data Lake § Acceleratebatch, micro- batch & streaming jobs § Slowly transition to lower cost object stores § Run in hybrid cloud environment with compute in the cloud § Accelerate ML jobs running on object stores or file systems § Provide consistent performance to data scientists § Provide unified interface to access all data § Accelerate & tier data transparently across storage tiers § Co-locate remote data with compute for performance Machine Learning Productivity Self-service data across hybrid cloud Popular Technical Use Cases
  • 22.
    Elastic Model Training SPARK HDFS SPARK HDFS Challenge– Algorithmic trading in $46B data driven Hedge Fund. Model training in cloud for bursty workloads Data access was slow, costing them $$ in compute cost and lower modeler productivity Solution – With Alluxio, data access are 10- 30X faster Impact – Increased efficiency on training of ML algorithm, lowered compute cost and increased modeler productivity, resulting in 14 day ROI of Alluxio Public Cloud Public Cloud Leading Hedge Fund
  • 23.
    Analytics Use Case– Top Retailer Challenge – Bottleneck in Trend Analysis of mission critical daily sales and inventory management Queries were slow / not interactive, resulting in operational inefficiency Solution – With Alluxio, data queries are 10X faster Impact – Higher operational efficiency Use case: http://bit.ly/2ook8Nh SPARK HDFS SPARK HDFS
  • 24.
    Incredible Open SourceMomentum with growing community 900+ contributors & growing 3900+ Git Stars Apache 2.0 Licensed Hundreds of thousands of downloads https://www.alluxio.org/slack
  • 25.
    ThankYou Join the AlluxioCommunity www.alluxio.org | www.alluxio.com | Twitter: @Alluxio
  • 26.
    Read data inAlluxio, on same node as client 27 Alluxio Worker RAM / SSD / HDD Memory Speed Read of Data Application Alluxio Client Alluxio Master
  • 27.
    Read data notin Alluxio 28 RAM / SSD / HDD Network / Disk Speed Read of Data Application Alluxio Client Alluxio Master Alluxio WorkerUnder Store
  • 28.
    Write data onlyto Alluxio on same node as client 29 Alluxio Worker RAM / SSD / HDD Memory Speed Write of Data Application Alluxio Client Alluxio Master
  • 29.
    Write data toAlluxio and Under Store synchronously 30 RAM / SSD / HDD Network / Disk Speed Write of Data Application Alluxio Client Alluxio Master Alluxio Worker Under Store
  • 30.
  • 31.
    Metadata Path: FamiliarSemantics • Alluxio also provides a local metadata to the compute • Listing / renaming on object store can be expensive • Alluxio speeds up these operations • Alluxio loads and manages metadata in master • Apps can continue assuming HDFS-like semantics 3232
  • 32.
    Demo: Spark +Alluxio + S3
  • 33.
    Deploying Alluxio withSpark Spark Alluxio Storage Spark Alluxio Storage Co-locate Alluxio Workers with Spark for optimal I/O performance Deploy Alluxio as standalone cluster between Spark and Storage
  • 34.
    Demo: Alluxio withSpark Spark AWS S3 Spark Alluxio AWS S3 With Alluxio: Spark collocated with Alluxio on S3 Baseline: Spark on S3