Alluxio Data Orchestration Platform for the Cloud

Data Orchestration Platform for the Cloud
Dipti Borkar | VP, Products | Alluxio

The Alluxio Story
Originated as the Tachyon project at UC Berkeley’s AMPLab,
started by then-Ph.D. student and now Alluxio CTO Haoyuan (H.Y.) Li.
2015
Open source project established & company founded to commercialize Alluxio
Goal: Orchestrate data for the cloud for data-driven apps such as Big Data Analytics, ML and AI.
Focus: Accelerating modern app frameworks running on HDFS/S3-based data lakes or warehouses
Hot top 10 Big Data (2020)
Impact 50 (2019)
Trend-setting product (2019)

Alluxio: Data-Driven Innovation Across Industries
• Consumer
• Travel & Transportation
• Telco & Media
• Technology
• Financial Services
• Retail & Entertainment
• Data & Analytics Services

4 big trends driving the need for a new architecture
• Separation of Compute & Storage
• Hybrid – Multi cloud environments
• Self-service data across the enterprise
• Rise of the object store

Data Orchestration for the Cloud
Interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
Enable innovation with any framework running on data stored anywhere
Users: Data Analyst, Data Engineer, Storage Ops, Data Scientist, Lines of Business

Alluxio Data Orchestration for the Cloud
• Structured Data Catalog
• Intelligent Caching
• Data Transformation
• Data Management
• Global Namespace

Data Orchestration for the Cloud
Cross-platform Security & Governance
Authentication
Kerberos, Delegation token, LDAP, AD
Authorization
FS security model, AWS IAM model, Ranger integration
Encryption
On the wire with TLS, at rest with client-side encryption
Audit Logging
Track accesses to all data

Public Cloud IaaS
Spark Presto Hive TensorFlow
Alluxio Data Orchestration and Control Service
Alluxio enables compute!
Alluxio Cloud Data Orchestration
Problem: Object stores have inconsistent performance for analytics and AI workloads
§ SLAs are hard to achieve
§ S3 metadata operations are expensive
§ Storage costs for copied data add up, making the solution expensive
§ S3 is eventually consistent, making it hard to predict query results
Solution: Consistent High Performance
• Performance increases range from 1.5X to 10X
• Data orchestration of multiple S3 buckets
• AWS EMR & Google Dataproc integrations
• Fewer copies of data means lower costs

Takeaways
• Nearly 2x reduction in query time for small-range queries
• Much more concurrency with Alluxio
• This means ½ the compute cost, or 2x the capacity with the same environment

Using Alluxio with AWS EMR
[Diagram: EMR instances running Presto and Hive, each with an Alluxio metadata & data cache, on top of HDFS and EMRFS. Labels: compute-driven, continuous sync.]

Using Alluxio with Google Dataproc
[Diagram: a Google Dataproc cluster running Presto and Hive, each with an Alluxio metadata & data cache, on top of Google Cloud Store. Labels: compute-driven, continuous sync.]
A single-command initialization action brings up Alluxio in Dataproc
Alluxio Initialization Action - https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/alluxio

Enterprise Cloud Compute & Storage is Great… but Data got left behind
Compute: 2–5 mins, elastic
Storage: 2–5 mins, elastic
Data: 2–4 weeks, not elastic
Steps to get a dataset: request data, request review, find dataset, code script/job, run ETL jobs, grant permissions

Goal: Enable data workloads in the cloud on existing on-prem data
Restrictions
§ Data cannot be persisted in a public cloud
§ Additional I/O capacity cannot be added to existing Hadoop infrastructure
§ On-prem level security needs to be maintained
§ Network bandwidth utilization needs to be minimal
Alternatives
§ Lift and Shift
§ Data copy by workload
§ “Zero-copy” Bursting

“Zero-copy” bursting to scale to the cloud
Problem: HDFS cluster is compute-bound & complex to maintain
[Diagram: Spark, Presto, Hive, and TensorFlow on AWS public cloud IaaS, with connectivity to an on-premises datacenter.]
Barrier 1: Prohibitive network latency and bandwidth limits
• Makes hybrid analytics infeasible
Barrier 2: Copying data to cloud
• Difficult to maintain copies
• Data security and governance
• Costs of another silo
Step 1: Hybrid Cloud for Burst Compute Capacity
• Orchestrates compute access to on-prem data
• Working set of data, not FULL set of data
• Local performance
• Scales elastically
• On-Prem Cluster Offload (both Compute & I/O)
Step 2: Online Migration of Data Per Policy
• Flexible timing to migrate, with fewer dependencies
• Instead of a hard switch-over, migrate at your own pace
• Moves the data per policy, e.g. last 7 days

Step 3: Multicloud On-Demand Data Platform
[Diagram: Alluxio Cloud Data Orchestration spanning AWS, GCP, Azure, and the datacenter.]
• Orchestrates compute access to on-prem and cloud data
• High performance
• Scales elastically

Data Locality with Intelligent Multi-tiering
Local performance from remote data using multi-tier storage
Hot: RAM
Warm: SSD
Cold: HDD
Read & Write Buffering
Transparent to App
Policies for pinning, promotion/demotion, TTL

Alluxio Reference Architecture
[Diagram: applications, each with an Alluxio client, talk to an Alluxio master (with a standby master coordinated through Zookeeper / RAFT) and to Alluxio workers backed by RAM / SSD / HDD. The workers reach Under Store 1 and Under Store 2 over the WAN.]

Feature Highlight: Data Caching for faster compute
[Diagram: a framework sends data requests through Alluxio tiers (RAM, SSD, disk) backed by the buckets Trades and Customers. "Read file /trades/us" and "Read file /trades/top" first reach the bucket with variable latency and throttling; "Read file /trades/us again" and repeated reads of /trades/top are served from the cache.]
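The caching behaviour on this slide can be illustrated with a minimal read-through cache. This is only a sketch of the idea, not Alluxio code: the `ReadThroughCache` class, the dictionary standing in for the bucket, and the file contents are all hypothetical.

```python
class ReadThroughCache:
    """Serve repeated reads locally; go to the under store only on a miss."""

    def __init__(self, under_store):
        # Hypothetical stand-in for a remote bucket: maps path -> bytes.
        self.under_store = under_store
        self.cache = {}
        self.remote_reads = 0

    def read(self, path):
        if path in self.cache:
            return self.cache[path]       # cache hit: no remote request
        self.remote_reads += 1
        data = self.under_store[path]     # cache miss: variable-latency remote read
        self.cache[path] = data
        return data


under_store = {"/trades/us": b"us trades", "/trades/top": b"top trades"}
cache = ReadThroughCache(under_store)
cache.read("/trades/us")    # first read reaches the under store
cache.read("/trades/us")    # second read is served from the cache
cache.read("/trades/top")   # a different file misses once
print(cache.remote_reads)   # 2
```

Only the first read of each file pays the remote latency; every later read of the same path stays local.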

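“Zero-copy” bursting under the hood
[Diagram: a framework sends data requests through Alluxio (RAM) backed by a Trades directory and a Customers directory. "Read file /trades/us" reaches the directory with variable latency and throttling; "Read file /trades/us again" is served from RAM.]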

Feature Highlight: Intelligent Tiering for resource efficiency
[Diagram: a framework sends data requests through Alluxio tiers (RAM, SSD, disk) backed by the buckets Trades and Customers, which respond with variable latency and throttling. "Read file /customers/145" finds RAM out of memory, so data is moved to another tier.]
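The "out of memory, data moved to another tier" step can be sketched as a small multi-tier store that demotes the least recently used block when a tier is full and promotes a block back to the top tier on access. This is an illustrative assumption about one possible policy, not Alluxio's actual eviction logic; `TieredStore` and the capacities are hypothetical.

```python
from collections import OrderedDict


class TieredStore:
    """Toy multi-tier store: LRU demotion when a tier is full, promotion on read."""

    def __init__(self, capacities):
        # capacities: list of (tier name, max blocks), fastest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def _insert(self, key, value, level):
        if level >= len(self.tiers):
            return  # dropped from the last tier; it remains in the under store
        _, cap, blocks = self.tiers[level]
        blocks[key] = value
        if len(blocks) > cap:
            # Tier is over capacity: demote its least recently used block.
            old_key, old_value = blocks.popitem(last=False)
            self._insert(old_key, old_value, level + 1)

    def put(self, key, value):
        for _, _, blocks in self.tiers:
            blocks.pop(key, None)  # avoid duplicates across tiers
        self._insert(key, value, 0)

    def get(self, key):
        for _, _, blocks in self.tiers:
            if key in blocks:
                value = blocks.pop(key)
                self._insert(key, value, 0)  # promote to the fastest tier
                return value
        return None

    def tier_of(self, key):
        for name, _, blocks in self.tiers:
            if key in blocks:
                return name
        return None


store = TieredStore([("RAM", 2), ("SSD", 2), ("HDD", 2)])
store.put("/trades/us", b"a")
store.put("/trades/top", b"b")
store.put("/customers/145", b"c")    # RAM is full, so /trades/us is demoted
print(store.tier_of("/trades/us"))   # SSD
store.get("/trades/us")              # reading it promotes it back to RAM
print(store.tier_of("/trades/us"))   # RAM
print(store.tier_of("/trades/top"))  # SSD
```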

Feature Highlight: Policy-driven Data Management
[Diagram: a framework writes new trades into Alluxio tiers (RAM, SSD, disk).]
Policy defined: Move data > 90 days old to
S3 Standard
Policy interval: Every day
Policy applied every day
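The selection step of such an age-based policy can be sketched as follows. This is a hedged illustration only: `select_for_move`, the file names, and the timestamps are hypothetical, and the actual move between storage systems is left out.

```python
from datetime import datetime, timedelta


def select_for_move(files, now, max_age_days=90):
    """Return the paths whose last-modified time is older than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(path for path, modified in files.items() if modified < cutoff)


# Hypothetical catalog of files and their last-modified times.
now = datetime(2020, 1, 1)
files = {
    "/trades/2019-06-01": datetime(2019, 6, 1),
    "/trades/2019-09-30": datetime(2019, 9, 30),
    "/trades/2019-12-15": datetime(2019, 12, 15),
}
print(select_for_move(files, now))
# ['/trades/2019-06-01', '/trades/2019-09-30']
```

Running this selection once per policy interval (every day on the slide) yields the set of files the policy would move on that day.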

Alluxio Structured Data Management Preview
[Diagram: Presto, with a Hive Connector and an Alluxio Connector, sits on top of the Alluxio Caching Service, Alluxio Catalog Service, and Alluxio Transformation Service, which in turn sit on top of the Hive Metastore and Storage.]

Alluxio Catalog Service
[Diagram: Alluxio Catalog Service on top of a Hive Under Database backed by the Hive Metastore.]
Functionality
• Manages metadata for structured data
• Abstracts other database catalogs as Under Database (UDB)
Benefits
• Schema-aware optimizations
• Simple deployment

How to Use Alluxio Catalog CLI
alluxio table attachdb <udb type> <udb uri> <udb db name>
associate an Alluxio database with a UDB database
alluxio table detachdb <db name>
remove the association from “attach”
alluxio table ls [<db name> [<table name>]]
display information in the catalog
alluxio table sync <db name>
synchronize the Alluxio catalog with the UDB metadata
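For scripting, the four commands above can be assembled as argument lists. The sketch below only builds the command lines in the forms shown on this slide and does not run them; the helper names are hypothetical, and the UDB type, URI, and database name in the example are placeholder values, not taken from the talk.

```python
def attachdb(udb_type, udb_uri, udb_db_name):
    """alluxio table attachdb <udb type> <udb uri> <udb db name>"""
    return ["alluxio", "table", "attachdb", udb_type, udb_uri, udb_db_name]


def detachdb(db_name):
    """alluxio table detachdb <db name>"""
    return ["alluxio", "table", "detachdb", db_name]


def ls(db_name=None, table_name=None):
    """alluxio table ls [<db name> [<table name>]]"""
    cmd = ["alluxio", "table", "ls"]
    if db_name is not None:
        cmd.append(db_name)
        if table_name is not None:  # a table name is only valid after a db name
            cmd.append(table_name)
    return cmd


def sync(db_name):
    """alluxio table sync <db name>"""
    return ["alluxio", "table", "sync", db_name]


# Placeholder values for illustration only.
print(" ".join(attachdb("hive", "thrift://metastore-host:9083", "default")))
print(" ".join(ls("default", "store_sales")))
print(" ".join(sync("default")))
```

Each list could be handed to `subprocess.run` on a machine where the `alluxio` binary is on the path.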

Alluxio Presto Connector
Tighter integration with Presto
New plugin based on the Presto Hive connector
Available in Alluxio 2.1.0 distribution
Future: Merge connector into Presto codebase

Transformation Service
Transform data to be compute-optimized independent of the storage format
• Coalesce
• Format conversion (CSV, Parquet)
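To make "coalesce" concrete, the sketch below merges many small CSV parts into fewer, larger ones using only the Python standard library. It illustrates the concept and is not how the Alluxio Transformation Service is implemented; `coalesce_csv` and the sample data are hypothetical, and format conversion to Parquet is omitted because it would need a third-party library.

```python
import csv
import io


def coalesce_csv(parts, max_rows_per_file):
    """Merge small CSV texts that share a header into fewer, larger CSV texts."""
    header = None
    rows = []
    for text in parts:
        reader = csv.reader(io.StringIO(text))
        part_header = next(reader)
        if header is None:
            header = part_header
        elif part_header != header:
            raise ValueError("all parts must share the same header")
        rows.extend(reader)

    outputs = []
    for start in range(0, len(rows), max_rows_per_file):
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows[start:start + max_rows_per_file])
        outputs.append(buf.getvalue())
    return outputs


# Four one-row files become two two-row files.
parts = ["id,qty\n1,10\n", "id,qty\n2,20\n", "id,qty\n3,30\n", "id,qty\n4,40\n"]
merged = coalesce_csv(parts, max_rows_per_file=2)
print(len(merged))  # 2
```

Fewer, larger files mean fewer open and metadata operations per query, which is the motivation for coalescing before compute reads the data.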

How to Use Transformation CLI
alluxio table transform <db name> <table name>
initiate a transformation on a table
alluxio table transformStatus <transform id>
display the status for a transformation

Example
2 isolated AWS 10-node clusters
Presto + Hive Metastore + S3 Data
Presto + Alluxio + Hive Metastore + S3 Data
TPC-DS dataset on S3
CSV format
~10,000 files

Example Results
Attached existing Hive database into Alluxio Catalog
Alluxio Catalog served table metadata for Presto
Transformed store_sales by coalescing and converting CSV to Parquet
Presto without Alluxio: 20s
Alluxio Transformations: 7s
Alluxio Transformations with Caching: 3s
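The speedups implied by these three timings can be worked out directly. The snippet is plain arithmetic on the figures from the slide; the variable and function names are illustrative.

```python
def speedup(baseline_seconds, improved_seconds):
    """How many times faster the improved run is than the baseline."""
    return baseline_seconds / improved_seconds


presto_only = 20             # Presto without Alluxio
transformed = 7              # Alluxio transformations
transformed_and_cached = 3   # Alluxio transformations with caching

print(round(speedup(presto_only, transformed), 1))             # 2.9
print(round(speedup(presto_only, transformed_and_cached), 1))  # 6.7
```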

Alluxio – Key innovations
Data Locality with Intelligent Multi-tiering
Accelerate big data workloads with transparent tiered local data
Data Accessibility for popular APIs & API translation
Run Spark, Hive, Presto, ML workloads on your data located anywhere
Data Elasticity with a unified namespace
Abstract data silos & storage systems to independently scale data on-demand with compute

Use Cases Data Orchestration Enables
• Burst big data workloads in hybrid cloud environments (Hive and Alluxio in the same instance / container in the cloud, data on premise)
• Accelerate big data frameworks on the public cloud (Spark and Alluxio in the same instance / container)
• Dramatically speed up big data on object stores on premise (Presto and Alluxio in the same container / machine)

Challenges with running Big Data workloads on S3 & Alluxio Solution
§ S3 performance is variable, and consistent query SLAs are hard to achieve
§ S3 metadata operations are expensive, making workloads run longer
§ S3 egress costs add up, making the solution expensive
§ S3 is eventually consistent, making it hard to predict query results
Compute caching for S3
Accelerate big data frameworks on the public cloud
[Diagram: Spark and Alluxio in the same instance / container.]

Challenges with Hybrid Cloud & Alluxio Solution
§ Accessing data over WAN is too slow
§ Copying data to the compute cloud is time-consuming and complex
§ Using another storage system like S3 means expensive application changes
§ Using S3 via the HDFS connector leads to extremely low performance
HDFS for Hybrid Cloud
Burst big data workloads in hybrid cloud environments
[Diagram: Hive and Alluxio in the same instance / container in the cloud, with data on premise.]
Solution Benefits
§ Same performance as local
§ Same end-user experience
§ 100% of I/O is offloaded

Challenges with supporting more frameworks & Alluxio Solution
§ Running new frameworks on an existing HDFS cluster can dramatically affect performance of existing workloads
§ In a disaggregated environment, copying data to multiple compute clouds is time-consuming and error-prone
§ Migrating applications for new storage systems is complex & time-consuming
§ Storing and managing multiple copies of the data becomes expensive
Support more frameworks
Enable big data on object stores across single or multiple clouds
[Diagram: Presto or Spark, each with Alluxio, in the same data center / region as any object store or HDFS.]

Challenges running Big Data on Object Stores & Alluxio Solution
§ Object store performance for big data workloads can be very poor
§ No native support for popular frameworks
§ Expensive metadata operations reduce performance even more
§ No direct support for hybrid environments
Transition to Object store
Dramatically speed up big data on object stores on premise
[Diagram: Presto and Alluxio on-premise in the same container / machine.]
Solution Benefits
§ Same performance as HDFS
§ Uses HDFS APIs
§ Same end-user experience
§ Storage at a fraction of the cost of HDFS
