Using “zero-copy” hybrid bursting on remote data to solve data lake analytics capacity and performance problems.
Data scientists want answers on demand. But in today's enterprise architectures, the reality is that most data remains on-prem, despite the promise of cloud-based analytics. Moving all that data to the cloud has typically not been possible for many reasons, including cost, latency, and technical difficulty. So, what if there were a technology that could connect these on-prem environments to any major cloud platform, enabling high-powered computing without the need to move massive amounts of data?
Join us for this webinar, where Alex Ma of Alluxio, an open-source data orchestration platform, will discuss how a data orchestration approach offers a solution for connecting traditional on-prem data centers and cloud data lakes with other clouds and data centers. With Alluxio's "zero-copy" burst solution, companies can bridge remote data centers and data lakes with computing frameworks in other locations, enabling them to offload compute and leverage the flexibility, scalability, and power of the cloud for their remote data.
2. Why Hybrid Cloud?
Hybrid Cloud Drivers
▪ Time to production: when you need compute capacity, expand your cloud footprint with significantly lower lag compared with provisioning on-prem
▪ Leverage cloud flexibility for bursty workloads: reduce overload on existing infrastructure by moving ephemeral workloads, or Hadoop workloads with unpredictable resource utilization, to the cloud
▪ Intermediate step before migrating to the cloud: lower the risk of a full cloud data migration by starting with compute in the cloud and data on-prem; a full migration can take years
3. Approaches to Hybrid Cloud

Data copy by workload
● Simple tools available, such as distCp
● Works for workloads with easily identifiable datasets
Issues
● Datasets for many workloads cannot always be identified easily
● Significantly more data may be transferred than the workload requires
● Additional copies are very hard to sync back with the master data

Lift and Shift
● Migration may seem easier as no application re-architecture is needed
Issues
● If workloads are not made cloud-native and elastic, infrastructure cost can skyrocket
● If an on-prem data copy needs to be maintained, syncing cloud and on-prem data can be hard
● Performance can be dramatically impacted by cloud storage limitations

Compute-driven Data Caching
● Data is pulled into the cloud based on compute requests
● Data is cached locally to reduce I/O on remote clusters and is automatically synced
Issues
● Less helpful for workloads that don't read a data set more than once
4. Problem: HDFS cluster is compute-bound & complex to maintain

[Architecture diagram: Spark, Presto, Hive, and TensorFlow frameworks running on the Alluxio Data Orchestration and Control Service, both in AWS public cloud IaaS and in the on-premises datacenter, with connectivity between the two environments.]

Barrier 1: Prohibitive network latency and bandwidth limits
• Makes hybrid analytics unfeasible
Barrier 2: Copying data to the cloud
• Difficult to maintain copies
• Data security and governance
• Costs of another silo
Solution: "Zero-copy" bursting to scale to the cloud
Step 1: Hybrid Cloud for Burst Compute Capacity
• Orchestrates compute access to on-prem data
• Working set of data, not the FULL set of data
• Local performance
• Scales elastically
• On-prem cluster offload (both compute & I/O)
Step 2: Online Migration of Data Per Policy
• Flexible timing to migrate, with fewer dependencies
• Migrate at your own pace instead of a hard switch-over
• Moves data per policy – e.g., the last 7 days
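The "last 7 days" migration policy mentioned above can be sketched as a simple filter over file modification times. The function and file names below are illustrative, not part of Alluxio's policy engine.

```python
import datetime

def select_files_to_migrate(files, now, max_age_days=7):
    """Sketch of a 'last 7 days' migration policy: pick only files whose
    modification time falls within the policy window."""
    cutoff = now - datetime.timedelta(days=max_age_days)
    return [path for path, mtime in files.items() if mtime >= cutoff]

# Usage: only the recent file is selected for migration.
now = datetime.datetime(2021, 6, 15)
files = {
    "/logs/day1.log": datetime.datetime(2021, 6, 14),  # 1 day old -> migrate
    "/logs/old.log": datetime.datetime(2021, 5, 1),    # too old -> stays on-prem
}
print(select_files_to_migrate(files, now))  # -> ['/logs/day1.log']
```

Running such a policy on a schedule is what allows migration to proceed gradually instead of as a hard switch-over.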
6. Alluxio at DBS
• Unified Namespace: mount HDFS from other platforms into a common Alluxio cluster
• Object store analytics: caching layer for hot data to speed up Presto and Spark jobs
• Hybrid cloud bursting: extend the Alluxio cluster into an AWS VPC; run EMR for model training and bring the results back on-prem
8. Walmart Use Case
Why Walmart chose Alluxio's "Zero-Copy" burst solution:
• No requirement to persist data into the cloud
• Improved query performance and no network hops on recurrent queries
• Lower costs without the need for creating data copies
9. ● No need to re-configure two data centers ● No large-scale investment ● Ability to handle new analysts with no impact on response times
11. Alluxio – Key Innovations
• Data Elasticity with a unified namespace: abstract data silos & storage systems to independently scale data on-demand with compute
• Data Accessibility for popular APIs & API translation: run Spark, Hive, Presto, and ML workloads on your data located anywhere
• Data Locality with Intelligent Multi-tiering: accelerate big data workloads with transparent tiered local data
12. Data Locality with Intelligent Multi-tiering
Local performance from remote data using multi-tier storage: Hot (RAM), Warm (SSD), Cold (HDD)
• Read & write buffering, transparent to the application
• Policies for pinning, promotion/demotion, and TTL
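The tiering policies above can be sketched in a few lines: promote on access, demote cold data, never move pinned data, and evict entries past their TTL. Everything here is an illustrative toy model, not Alluxio's implementation.

```python
import time

class TieredStore:
    """Sketch of multi-tier storage policies: promotion on access,
    demotion for cold data, pinning, and TTL-based eviction."""

    TIERS = ["RAM", "SSD", "HDD"]  # hot -> warm -> cold

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.entries = {}  # path -> {"tier", "last_access", "pinned"}

    def write(self, path, pinned=False):
        self.entries[path] = {"tier": "RAM",
                              "last_access": time.time(),
                              "pinned": pinned}

    def read(self, path):
        entry = self.entries[path]
        entry["last_access"] = time.time()
        entry["tier"] = "RAM"  # promotion: accessed data moves to the hot tier
        return entry

    def demote(self, path):
        entry = self.entries[path]
        if entry["pinned"]:
            return  # pinning policy: pinned data stays in its tier
        idx = self.TIERS.index(entry["tier"])
        if idx < len(self.TIERS) - 1:
            entry["tier"] = self.TIERS[idx + 1]

    def evict_expired(self):
        now = time.time()
        expired = [p for p, e in self.entries.items()
                   if not e["pinned"] and now - e["last_access"] > self.ttl]
        for path in expired:
            del self.entries[path]

# Usage: cold data drifts down the tiers; pinned data does not move.
store = TieredStore(ttl_seconds=3600)
store.write("/hot/table", pinned=True)
store.write("/cold/log")
store.demote("/hot/table")   # ignored: pinned
store.demote("/cold/log")    # RAM -> SSD
store.demote("/cold/log")    # SSD -> HDD
print(store.entries["/hot/table"]["tier"], store.entries["/cold/log"]["tier"])
# -> RAM HDD
```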
13. New Technologies: Persistent Memory
Persistent Memory (PMEM):
• Represents a new class of memory and storage technology architected specifically for data center usage
• Combines high capacity, affordability, and persistence
RDMA (Remote Direct Memory Access):
• Accessing (i.e., reading from or writing to) memory on a remote machine without interrupting the processing of the CPU(s) on that system
• Zero-copy – applications perform data transfers without involving the network software stack; data is sent and received directly to and from application buffers without being copied between network layers
• Kernel bypass – applications perform data transfers directly from user space, with no context switches
• No CPU involvement – applications can access remote memory without consuming any CPU on the remote machine
Picture source: https://software.intel.com/en-us/blogs/2018/10/30/intel-optane-dc-persistent-memory-a-major-advance-in-memory-and-storage-architecture
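RDMA itself requires specialized NICs, but the zero-copy idea above – handing off a reference to a buffer instead of copying its bytes – can be illustrated in ordinary Python with `memoryview`, which exposes a slice of a buffer without copying the underlying storage. This is only an analogy, not RDMA code.

```python
# Zero-copy in miniature: a memoryview slice shares the underlying buffer,
# so no bytes are copied when the sub-range is handed off.
buf = bytearray(b"payload-from-remote-memory")
view = memoryview(buf)[0:7]        # no copy: refers to buf's own storage
buf[0:7] = b"PAYLOAD"              # mutate the buffer in place
print(view.tobytes())              # -> b'PAYLOAD' (the view sees the change)
```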
14. Persistent Memory Operating Modes
[Diagram: a Cascade Lake CPU with two integrated memory controllers (IMCs) driving a mix of DDR4 DRAM and DCPMM DIMMs; the DIMM population is shown as an example only.]
DCPMM characteristics:
• DIMM capacity: 128, 256, or 512 GB
• Speed: 2666 MT/sec
• Capacity per CPU: 3 TB (not including DRAM)
• DDR4 electrical & physical interface
• Close to DRAM latency, with cache-line-size access
Mode 1 – MEMORY mode (DCPMM as a large volatile memory pool, with DRAM as cache):
• Large memory at lower cost
Mode 2 – APP DIRECT mode (flexible, usage-specific partitions forming a non-volatile memory pool; storage over App Direct):
• Low-latency persistent memory
• Fast direct-attach storage
• Persistent data for rapid recovery
15. Alluxio DCPMM Tier Architecture
Alluxio PMEM tier:
• A new PMEM tier introduced to provide higher performance at lower cost
• Large capacity → cache more data
• Higher performance compared with NVMe SSD
• Leverages the PMDK library to bypass filesystem overhead and context switches
• Delivers dedicated SLAs to mission-critical applications
[Diagram: applications use the Alluxio Client to talk to the Alluxio Master and Alluxio Worker; the worker tiers data across DRAM, DCPMM, SSD, and HDD, backed by under storage.]
16. Data Accessibility via Popular APIs and API Translation
Converts from a client-side interface to the native storage interface
• Client-side interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
• Storage drivers: HDFS Driver, S3 Driver, Swift Driver, NFS Driver
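The client-to-storage translation above can be sketched as a dispatch on the URI scheme: one logical call fans out to whichever under-store driver owns that scheme. Driver names and strings here are illustrative only.

```python
# Sketch of API translation: one logical open_file() call is routed to the
# storage driver that handles the URI's scheme (hypothetical drivers).
def open_file(uri):
    drivers = {
        "hdfs": lambda u: f"HDFS driver opens {u}",
        "s3":   lambda u: f"S3 driver opens {u}",
        "nfs":  lambda u: f"NFS driver opens {u}",
    }
    scheme = uri.split("://", 1)[0]
    return drivers[scheme](uri)

print(open_file("s3://bucket/key"))   # -> S3 driver opens s3://bucket/key
```

The application keeps one interface; only the driver behind it changes per storage system.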
17. Unified Namespace: Global Data Accessibility
Transparent access to under storage makes all enterprise data available locally.
SUPPORTS
• HDFS
• NFS
• OpenStack
• Ceph
• Amazon S3
• Azure
• Google Cloud
IT OPS FRIENDLY
• Storage mounted into Alluxio by central IT
• Security in Alluxio mirrors source data
• Authentication through LDAP/AD
• Wireline encryption
[Diagram: HDFS #1, HDFS #2, an object store, and NFS all mounted into a single Alluxio namespace.]
18. Data Elasticity via Unified Namespace
Enables effective data management across different under stores, using mounting with transparent naming.
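Mounting with transparent naming can be sketched as a prefix table: a path in the logical namespace resolves to a URI in whichever under store is mounted at that prefix. The class, mount points, and URIs below are illustrative, not Alluxio's API.

```python
class UnifiedNamespace:
    """Sketch of mounting with transparent naming: paths in one logical
    namespace resolve to URIs in different under stores."""

    def __init__(self):
        self.mounts = {}  # logical path prefix -> under-store URI prefix

    def mount(self, prefix, under_store_uri):
        self.mounts[prefix] = under_store_uri

    def resolve(self, path):
        # longest matching mount point wins, so nested mounts behave sanely
        for prefix in sorted(self.mounts, key=len, reverse=True):
            if path.startswith(prefix):
                return self.mounts[prefix] + path[len(prefix):]
        raise KeyError(path)

# Usage: one namespace spanning an HDFS cluster and an S3 bucket.
ns = UnifiedNamespace()
ns.mount("/data/hdfs", "hdfs://namenode:9000/warehouse")
ns.mount("/data/s3", "s3://bucket/datasets")
print(ns.resolve("/data/hdfs/reports/q1.parquet"))
# -> hdfs://namenode:9000/warehouse/reports/q1.parquet
```

Applications address only the logical path; the under store behind a prefix can change without touching application code.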
19. Policy-Driven Under File System Migration
hdfs://host:port/directory/
[Diagram: the mounted directory contains Reports and Sales subdirectories.]
21. APIs to Interact with Data in Alluxio
Applications have great flexibility to read and write data, with many options:

Spark:
> rdd = sc.textFile("alluxio://localhost:19998/myInput")

Presto:
CREATE SCHEMA hive.web
WITH (location = 'alluxio://master:port/my-table/')

POSIX:
$ cat /mnt/alluxio/myInput

Java:
FileSystem fs = FileSystem.Factory.get();
FileInStream in = fs.openFile(new AlluxioURI("/myInput"));
23. Next steps - Try it out!
•Getting Started - http://bit.ly/3396r9I
•Running Alluxio on Docker - http://bit.ly/2MLcDPw
•Running Alluxio on AWS EMR - http://bit.ly/2OI5HoO
•Running the Alluxio/Presto Sandbox - http://bit.ly/2OJ3FoC
•Spark and Alluxio in 5 minutes - http://bit.ly/2KC35Uu