Accelerate Analytics and ML in the Hybrid Cloud Era

Peter Behrakis and Alex Ma - Alluxio

Agenda
• Market
• Alluxio Vision
• What is Data Orchestration
• How can Alluxio help you?

Data Lakes and Silos Abound
Enterprises have organically created a legacy of data silos through short-term-focused projects and mergers & acquisitions.
▪ Data lakes and critical data are often in a silo and challenging to access
▪ Consolidation of data lakes and silos is expensive and slow to complete
▪ Compute is everywhere
[Diagram labels: Teradata, POSIX file, S3 object, HDFS 1, HDFS 2; internal apps; public clouds]

4 Big Trends Driving the Need for a New Architecture
▪ Separation of compute & storage
▪ Hybrid and multi-cloud environments
▪ Self-service data across the enterprise
▪ Rise of the object store

Market Summary
▪ Data volume, velocity and variety are avalanching: data doubles every two years*
▪ The business knows that data analytics/ML models allow them to compete effectively*
▪ Object is becoming the new data lake
▪ The enterprise is a multi-site, multi-cloud world and will remain so for some time
▪ Technical leadership wants the agility to run applications anywhere
▪ IT wants to offer a cloud-like experience to their users
▪ Technical organizations struggle to keep up with data ingest and business demands
* “The Fourth Industrial Revolution”, by Klaus Schwab

Alluxio’s Vision
"Orchestrate data for analytics and machine learning to enable
companies to grow and be agile regardless of where their data
and compute are located."
Quick-start cloud adoption that optimizes cost and yields 2X to 6X analytics acceleration for:
● Fraud protection
● Research for treatments for diseases like COVID-19
● Uptime for all industrial and digital technologies we depend on

What is Data Orchestration?
A platform that brings your data closer to compute across
clusters, regions and clouds.

Alluxio
Companies use Alluxio to:
• Get research, analytics and ML results that matter to the business 2X to 6X faster, using Alluxio advanced caching technology for multi-site/hybrid cloud
• Gain the agility to use different compute or storage with no programming, through API translation (Hadoop to cloud or on-prem S3)
• Dramatically lower OpEx by eliminating data management and egress costs, through the Alluxio unified namespace, API translation and policy-driven data movement
• Drop into existing on-prem and cloud environments with zero programming

COMPANIES USING ALLUXIO
Categories: Internet, Public Cloud Providers, General, E-Commerce, Technology, Financial Services, Telco & Media, Others

Hybrid Data Lake with Alluxio
A Data Orchestration Approach

Alluxio Common Use Cases
▪ Burst big data workloads in hybrid cloud environments (same instance / container)
▪ Accelerate big data frameworks on the public cloud (same instance / container)
▪ Dramatically speed up big data on object stores on premises (same container / machine)
[Diagram labels: Alluxio paired with Presto, Hive, Spark and TensorFlow in each use case]

Our Solution: “Zero-Copy Burst”
Problem: HDFS cluster is compute-bound & complex to maintain
Barrier 1: Prohibitive network latency and bandwidth limits
• Makes hybrid analytics unfeasible
Barrier 2: Copying data to cloud
• Difficult to maintain copies
• Data security and governance
• Costs of another silo
Step 1: Hybrid Cloud for Burst Compute Capacity
• Offload on-prem cluster (both compute & I/O)
• Manage working set, not FULL set of data
• Local performance
• Automatic synchronization with on-prem changes
Step 2: Online Migration of Data Per Policy
• Flexible timing to migrate, with fewer dependencies
• Instead of a hard switchover, migrate at own pace
• Moves the data per policy, e.g. last 7 days
[Diagram labels: Google Cloud Platform (Spark, Presto, Hive, TensorFlow; Alluxio Data Orchestration and Control Service; GCS); Connectivity; On Premises Datacenter (Spark, Presto, Hive, TensorFlow; Alluxio Data Orchestration and Control Service)]

Alluxio at Walmart
Architectural Components
● 2x performance for range queries
● High concurrency with Alluxio
● Cost reduction: half the compute costs, or 2x compute capacity for the same environment
● Auto-scaling to maintain a minimum number of Alluxio workers

Alluxio at Adobe
Cross Data Center Access
PROBLEM
Primary DC with a large Hadoop cluster is out of space; ad hoc SQL workloads are growing exponentially as analyst headcount has reached 1,800 people.
SOLUTION
Leverage compute resources outside of the primary on-prem DC for multiple analytical frameworks.
REMOTE DATA RESULTS
● 80% less network usage
● More stable infrastructure
● Lower costs
● Results come in faster
● Easier to scale
● Ability to handle new analysts with no impact and to improve response times
● Self-service for end-users

Alluxio at Electronic Arts (EA)
Single Cloud with AWS
● Up to 6x performance when handling a large number of small files
● Elastic compute to reduce infrastructure costs
● Reduce S3 costs by eliminating S3 access operations

Machine Learning - Alibaba
● 97% of the theoretical upper limit of training performance
● 30,000 images/second with Alluxio vs. 13,478 images/second with SSD
● 41% cost savings

Core Features
Enable a Hybrid Data Lake

Alluxio - Key Innovations
▪ Unified Namespace: bring all files and objects into a single interface
▪ API Translation: interact with data using any API
▪ Intelligent Caching, Multi-tiering: accelerate & tier data transparently

Data Accessibility (via popular APIs and API Translation)
Convert from Client-side Interface to native Storage Interface
Client-side interfaces: Java File API, HDFS Interface, S3 Interface, REST API, FUSE Interface
Storage drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
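The translation idea on this slide can be sketched as a thin routing layer: the application calls one interface, and the layer picks a storage driver from the URI scheme. This is only an illustration under assumptions. `TranslationLayer`, `InMemoryDriver` and the example URIs are hypothetical names, not Alluxio's API, and the drivers here keep data in memory rather than talking to real storage.

```python
from urllib.parse import urlparse


class InMemoryDriver:
    """Hypothetical stand-in for a storage driver (HDFS, S3, ...)."""

    def __init__(self, name):
        self.name = name
        self.blobs = {}

    def write(self, key, data):
        self.blobs[key] = data

    def read(self, key):
        return self.blobs[key]


class TranslationLayer:
    """Routes one client-side interface to per-scheme storage drivers."""

    def __init__(self):
        self.drivers = {}

    def register(self, scheme, driver):
        self.drivers[scheme] = driver

    def _route(self, uri):
        parsed = urlparse(uri)
        driver = self.drivers.get(parsed.scheme)
        if driver is None:
            raise ValueError(f"no driver registered for scheme {parsed.scheme!r}")
        # The key is everything after "scheme://".
        return driver, parsed.netloc + parsed.path

    def write(self, uri, data):
        driver, key = self._route(uri)
        driver.write(key, data)

    def read(self, uri):
        driver, key = self._route(uri)
        return driver.read(key)


layer = TranslationLayer()
hdfs = InMemoryDriver("hdfs")
s3 = InMemoryDriver("s3")
layer.register("hdfs", hdfs)
layer.register("s3", s3)

# The application uses the same two calls regardless of the backing store.
layer.write("hdfs://namenode/reports/q1.csv", b"a,b\n1,2\n")
layer.write("s3://example-bucket/sales/jan.csv", b"x,y\n3,4\n")
```

The application code never names a driver, which is the property the slide describes as switching compute or storage with no programming.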

Data Locality with Intelligent Multi-tiering
Local Performance from remote data using multi-tier storage
Tiers: Hot (RAM), Warm (SSD), Cold (HDD)
▪ Read & write buffering
▪ Transparent to app
▪ Policies for pinning, promotion/demotion, TTL
[Diagram labels: On-premises, Public Cloud]
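A minimal sketch of promotion and demotion across hot, warm and cold tiers, assuming a simple least-recently-used rule and tier capacities counted in blocks. The class `TieredCache` and its methods are hypothetical and are not Alluxio's implementation; pinning and TTL are left out to keep the example short.

```python
from collections import OrderedDict

_MISSING = object()  # sentinel so that None can be stored as a value


class TieredCache:
    """Toy multi-tier cache: reads promote a block, overflow demotes the LRU block."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), hottest tier first
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def _remove(self, key):
        for _, _, store in self.tiers:
            if key in store:
                return store.pop(key)
        return _MISSING

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return  # fell off the coldest tier: dropped from the cache
        _, cap, store = self.tiers[level]
        store[key] = value
        if len(store) > cap:
            # Demote the least recently used block to the next tier down.
            old_key, old_value = store.popitem(last=False)
            self._insert(level + 1, old_key, old_value)

    def put(self, key, value):
        self._remove(key)
        self._insert(0, key, value)

    def get(self, key):
        value = self._remove(key)
        if value is _MISSING:
            return None  # cache miss; a real system would read the under store
        self._insert(0, key, value)  # promote to the hottest tier
        return value

    def tier_of(self, key):
        for name, _, store in self.tiers:
            if key in store:
                return name
        return None


cache = TieredCache([("RAM", 1), ("SSD", 1), ("HDD", 1)])
for block in ("a", "b", "c"):
    cache.put(block, f"data-{block}")
```

After the three writes, the newest block sits in RAM and the oldest in HDD; reading the oldest block moves it back to RAM and pushes the others down one tier.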

Unified Namespace
Migrate Data to Cloud Storage based on Access Policies
hdfs://host:port/directory/
Reports Sales
• Single Alluxio path backed by multiple storage systems
• Example policy: Migrate data older than 7 days from HDFS to S3
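One way to picture a single path backed by multiple storage systems is a mount table resolved by longest matching prefix. The sketch below is an assumption-laden illustration: `MountTable`, the mount points and the `s3://example-bucket` URI are hypothetical, and `hdfs://host:port/directory` is kept as the placeholder from the slide.

```python
class MountTable:
    """Maps paths in one namespace to URIs in the underlying stores."""

    def __init__(self):
        self.mounts = {}

    def mount(self, path, store_uri):
        # Normalize so that "/" stays "/" and "/Sales/" becomes "/Sales".
        self.mounts[path.rstrip("/") or "/"] = store_uri.rstrip("/")

    def resolve(self, path):
        best = None
        for prefix in self.mounts:
            matches = path == prefix or path.startswith(prefix.rstrip("/") + "/")
            if matches and (best is None or len(prefix) > len(best)):
                best = prefix  # keep the longest matching mount point
        if best is None:
            raise KeyError(f"no mount covers {path!r}")
        rest = path[len(best):].lstrip("/")
        base = self.mounts[best]
        return f"{base}/{rest}" if rest else base


table = MountTable()
table.mount("/", "hdfs://host:port/directory")   # placeholder URI from the slide
table.mount("/Sales", "s3://example-bucket/sales")  # hypothetical bucket

reports_uri = table.resolve("/Reports/q1.csv")
sales_uri = table.resolve("/Sales/2020/jan.parquet")
```

Here `/Reports` falls through to the HDFS root mount while `/Sales` resolves to the object store, so callers see one directory tree.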

Policy Driven Data Migration
Migrate Data to Cloud Storage based on Access Policies
hdfs://host:port/directory/
Reports Sales
• Single Alluxio path backed by multiple storage systems
• Example policy: Migrate data older than 7 days from HDFS to S3
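The example policy can be read as a filter over file modification times. The sketch below assumes a plain dictionary of path to last-modified timestamp; `select_for_migration` and the sample paths are hypothetical, and the actual copy from HDFS to S3 is not shown.

```python
from datetime import datetime, timedelta


def select_for_migration(files, now, max_age_days=7):
    """Return paths whose last modification is older than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(path for path, modified in files.items() if modified < cutoff)


now = datetime(2020, 6, 15, 12, 0, 0)
files = {
    "/Reports/2020-06-01.csv": datetime(2020, 6, 1, 9, 0, 0),    # 14 days old
    "/Reports/2020-06-10.csv": datetime(2020, 6, 10, 9, 0, 0),   # 5 days old
    "/Sales/2020-05-20.csv": datetime(2020, 5, 20, 9, 0, 0),     # 26 days old
}

to_migrate = select_for_migration(files, now)
```

A scheduler could run such a selection periodically and hand the resulting paths to whatever performs the move, which matches the slide's point that migration happens per policy rather than as a hard switchover.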

Reference Architecture
[Diagram labels: Alluxio Master and Standby Master (Zookeeper / RAFT); Alluxio Clients; Alluxio Workers, each with RAM / SSD / HDD; WAN; Under Store 1 and Under Store 2; Control Path and Data Path]

Alluxio Catalog Service
Functionality
▪ Manages metadata for structured data
▪ Abstracts other database catalogs as Under Database (UDB)
Benefits
▪ Schema-aware optimizations
▪ Simple deployment
[Diagram labels: Hive Metastore, Hive Under Database]

Transformation Service
Transform data to be compute-optimized independent of the storage format
▪ Coalesce
▪ Format conversion (CSV to Parquet)
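To make the coalesce step concrete, here is a small sketch that merges several CSV fragments sharing one header into a single CSV, using only the Python standard library. `coalesce_csv` and the sample data are hypothetical; this is not how the Alluxio service is implemented, and the conversion to Parquet is omitted because it needs a third-party library.

```python
import csv
import io


def coalesce_csv(parts):
    """Merge CSV texts that share the same header row into one CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for text in parts:
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            continue  # skip empty fragments
        if header is None:
            header = rows[0]
            writer.writerow(header)  # write the header once
        elif rows[0] != header:
            raise ValueError("fragments must share the same header")
        writer.writerows(rows[1:])
    return out.getvalue()


parts = ["id,amount\n1,10\n", "id,amount\n2,20\n3,30\n"]
merged = coalesce_csv(parts)
```

Fewer, larger files mean fewer open and list operations for the query engine, which is the motivation for coalescing before format conversion.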

Example Results
▪ Attached existing Hive database into Alluxio Catalog
▪ Alluxio Catalog served table metadata for Presto
▪ Transformed store_sales by coalescing and converting CSV to Parquet
Query time:
▪ Presto without Alluxio: 20s
▪ Alluxio Transformations: 7s
▪ Alluxio Transformations with caching: 3s
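The three timings from this slide translate into speedup factors with simple division. The snippet below only restates the slide's numbers; the function name `speedup` is illustrative.

```python
def speedup(baseline_seconds, improved_seconds):
    """Ratio of baseline run time to improved run time."""
    return baseline_seconds / improved_seconds


results = {
    "Presto without Alluxio": 20,
    "Alluxio Transformations": 7,
    "Alluxio Transformations with caching": 3,
}

baseline = results["Presto without Alluxio"]
for label, seconds in results.items():
    print(f"{label}: {seconds}s, {speedup(baseline, seconds):.1f}x")
```

Relative to the 20s baseline, transformations alone give roughly 2.9x and transformations with caching roughly 6.7x.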

How can Alluxio help you?
• Did you learn what Alluxio Data Orchestration is?
• Do you have a use case Alluxio can accelerate?
For follow-up questions and to discuss your situation, please contact Peter at
peter@alluxio.com
