Accelerating Queries on Cloud Data Lakes
Alex Ma - Director, Solutions Engineering
May 2020
Why Hybrid Cloud?
▪ Time to production
▪ When you need compute capacity, expand cloud footprint with
significantly lower lag, compared with provisioning on-prem
▪ Leverage cloud flexibility for bursty workloads
▪ Reduce load on existing infrastructure by moving ephemeral workloads, or
workloads with unpredictable resource utilization, off the Hadoop cluster
▪ Intermediate step before migrating to the cloud
▪ Lower the risk of a full cloud data migration by starting with compute in the
cloud and data on-prem. A full migration can take years.
Hybrid Cloud Drivers
Approaches to Hybrid Cloud

Lift and Shift
● Migration may seem easier as no application re-architecture is needed
Issues
● If workloads are not made cloud-native and elastic, infrastructure cost can skyrocket
● If an on-prem data copy needs to be maintained, syncing cloud and on-prem data can be hard
● Performance can be dramatically impacted due to cloud storage limitations

Data copy by workload
● Simple tools are available, such as DistCp
● Works for workloads with easily identifiable datasets
Issues
● Datasets for many workloads cannot always be identified easily
● Significantly more data may be transferred than the workload requires
● Additional copies are very hard to sync back with the master data
(A hedged DistCp sketch follows this list.)

Compute-driven Data Caching
● Data is pulled into the cloud based on compute requests
● Data is cached locally to reduce I/O on remote clusters and is automatically synced
Issues
● Less helpful for workloads that don’t read a dataset more than once
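To make the "data copy by workload" approach concrete, here is a minimal DistCp sketch for copying a single dataset from on-prem HDFS into S3; the NameNode host, paths, and bucket name are hypothetical placeholders:

# Copy only the dataset a given workload needs from on-prem HDFS to S3
# (host, paths, and bucket are hypothetical)
hadoop distcp \
  hdfs://namenode.on-prem.example:8020/warehouse/sales/2020-05 \
  s3a://analytics-bucket/warehouse/sales/2020-05

Even a scoped copy like this starts to drift from the source as soon as the on-prem data changes, which is exactly the sync problem called out above.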
Problem: HDFS cluster is compute-bound & complex to maintain
[Architecture diagram: Spark, Presto, Hive, and TensorFlow run on the Alluxio Data Orchestration and Control Service both in the AWS public cloud (IaaS) and in the on-premises datacenter, with connectivity between the two environments.]
Barrier 1: Prohibitive network
latency and bandwidth limits
• Makes hybrid analytics unfeasible
Barrier 2: Copying data to cloud
• Difficult to maintain copies
• Data security and governance
• Costs of another silo
Solution: “Zero-copy” bursting to scale to the cloud
Step 1: Hybrid Cloud for Burst Compute Capacity
• Orchestrates compute access to on-prem data
• Working set of data, not the FULL set of data
• Local performance
• Scales elastically
• Offloads the on-prem cluster (both compute & I/O)
Step 2: Online Migration of Data Per Policy
• Flexible timing to migrate, with fewer dependencies
• Instead of a hard switch-over, migrate at your own pace
• Moves the data per policy – e.g. the last 7 days
(A mount sketch for Step 1 follows below.)
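As a minimal sketch of Step 1, an Alluxio cluster running in the cloud can mount the on-prem HDFS namespace so that cloud compute reads through Alluxio and only the working set is cached; this assumes a running Alluxio 2.x deployment, and the NameNode host and paths are hypothetical:

# Mount the on-prem HDFS namespace into the cloud-side Alluxio cluster (host/path hypothetical)
./bin/alluxio fs mount /on-prem hdfs://namenode.on-prem.example:8020/
# The first access pulls data over the WAN and caches it; repeat reads are served locally
./bin/alluxio fs ls /on-prem/warehouse/sales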
Production Examples
Alluxio at DBS
Unified Namespace
● Mount HDFS from other platforms into a common Alluxio cluster
Object store analytics
● Caching layer for hot data to speed up Presto and Spark jobs
Hybrid cloud bursting
● Extend the Alluxio cluster into an AWS VPC
● Run EMR for model training and bring the results back on-prem
(A hedged EMR sketch follows this slide.)
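As a rough illustration of the bursting pattern (not DBS's actual setup), an EMR cluster can be launched with Spark plus a bootstrap action that installs the Alluxio client; the cluster sizing, bucket, and bootstrap script name below are hypothetical:

# Launch an EMR cluster for model training; alluxio-emr-bootstrap.sh is a hypothetical install script
aws emr create-cluster \
  --name "training-burst" \
  --release-label emr-5.29.0 \
  --applications Name=Spark \
  --instance-type r5.2xlarge --instance-count 5 \
  --use-default-roles \
  --bootstrap-actions Path=s3://my-bucket/alluxio-emr-bootstrap.sh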
High Level Architecture
Walmart Use case
Why Walmart chose Alluxio’s
“Zero-Copy” burst solution:
• No requirement to
persist data into the cloud
• Improved query
performance and no
network hops on recurrent
queries 
• Lower costs without the
need for creating data copies
● No need to re-configure 2 data centers
● No large-scale investment
● Ability to handle new analysts without impacting or increasing response times
Alluxio Overview
Alluxio – Key innovations

Data Elasticity with a unified namespace
Abstract data silos & storage systems to independently scale data on demand with compute

Data Accessibility for popular APIs & API translation
Run Spark, Hive, Presto, and ML workloads on your data located anywhere

Data Locality with Intelligent Multi-tiering
Accelerate big data workloads with transparent tiered local data
Data Locality with Intelligent Multi-tiering
Local performance from remote data using multi-tier storage
Hot → RAM, Warm → SSD, Cold → HDD
● Read & write buffering, transparent to the application
● Policies for pinning, promotion/demotion, and TTL
(A tier-configuration sketch follows this slide.)
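As a minimal sketch of the tiering setup (assuming an Alluxio 2.x worker; the local paths and quotas are hypothetical), the tiers are mapped to local media in conf/alluxio-site.properties, and pinning/TTL policies are then applied per path:

# Define three storage tiers on a worker (paths and quotas are hypothetical)
cat >> conf/alluxio-site.properties <<'EOF'
alluxio.worker.tieredstore.levels=3
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
alluxio.worker.tieredstore.level0.dirs.quota=16GB
alluxio.worker.tieredstore.level1.alias=SSD
alluxio.worker.tieredstore.level1.dirs.path=/mnt/ssd
alluxio.worker.tieredstore.level1.dirs.quota=500GB
alluxio.worker.tieredstore.level2.alias=HDD
alluxio.worker.tieredstore.level2.dirs.path=/mnt/hdd
alluxio.worker.tieredstore.level2.dirs.quota=4TB
EOF
# Pin a hot table so it is never evicted, and expire scratch data after one day (TTL in milliseconds)
./bin/alluxio fs pin /warehouse/hot_table
./bin/alluxio fs setTtl /tmp/scratch 86400000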
New Technologies: Persistent Memory
Persistent Memory:
• PMEM represents a new class of memory and storage technology
architected specifically for data center usage
• Combination of high-capacity, affordability and persistence. 
RDMA: Remote Direct Memory Access
• Accessing (i.e. reading from or writing to) memory on
a remote machine without interrupting the processing
of the CPU(s) on that system.
• Zero-copy - applications perform data transfer without involving the
network software stack; data is sent and received directly to and from
application buffers without being copied between the network layers.
• Kernel bypass - applications perform data
transfer directly from userspace, no context
switches.
• No CPU involvement - applications can access
remote memory without consuming any CPU in
the remote machine.
Picture source: https://software.intel.com/en-us/blogs/2018/10/30/intel-optane-dc-persistent-memory-a-major-advance-in-memory-and-storage-architecture
Persistent Memory Operation Modes
DCPMM on a Cascade Lake platform (DIMMs attached to the CPU's integrated memory controllers):
• DIMM capacity: 128, 256, or 512 GB
• Speed: 2666 MT/sec
• Capacity per CPU: 3 TB (not including DRAM)
• DDR4 electrical & physical interface
• Close to DRAM latency
• Cache-line-size access
• DDR4 DRAM* sits alongside DCPMM* and is used either as regular DRAM or as a DRAM cache

1. MEMORY mode
● Large memory at lower cost

2. APP DIRECT mode
● Flexible, usage-specific partitions over a non-volatile memory pool
● Storage over App Direct
● Low-latency persistent memory
● Fast direct-attach storage
● Persistent data for rapid recovery

* DIMM population shown as an example only.
Alluxio DCPMM Tier architecture
Alluxio PMEM tier
• A new PMEM tier introduced to provide higher performance at lower cost
• Large capacity -> cache more data
• Higher performance compared with NVMe SSD
• Leverages the PMDK library to bypass filesystem overhead and context switches
• Delivers a dedicated SLA to mission-critical applications
[Diagram: applications use the Alluxio Client to reach the Alluxio Master and Alluxio Workers; each worker tiers data across DRAM, DCPMM, SSD, and HDD on top of the under storage.]
Data Accessibility via popular APIs and API
Translation
Convert from Client-side Interface to native Storage Interface
Client-side interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Storage drivers: HDFS Driver, S3 Driver, Swift Driver, NFS Driver
Unified Namespace: Global Data
Accessibility
Transparent access to under storage makes all enterprise data
available locally
SUPPORTS
• HDFS
• NFS
• OpenStack
• Ceph
• Amazon S3
• Azure
• Google Cloud
IT OPS FRIENDLY
• Storage mounted into Alluxio
by central IT
• Security in Alluxio mirrors
source data
• Authentication through
LDAP/AD
• Wireline encryption
Example mounted under stores: HDFS #1, HDFS #2, Object Store, NFS
(A mount sketch follows this slide.)
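As a minimal sketch of mounting several under stores into one Alluxio namespace (the hosts, bucket name, and credentials below are hypothetical placeholders):

# Mount a second HDFS cluster and an S3 bucket under the same Alluxio namespace
./bin/alluxio fs mount /mnt/hdfs2 hdfs://namenode2.example:8020/data
./bin/alluxio fs mount \
  --option aws.accessKeyId=<ACCESS_KEY_ID> \
  --option aws.secretKey=<SECRET_KEY> \
  /mnt/s3 s3://analytics-bucket/datasets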
Data Elasticity via Unified Namespace
Enables effective data management across different under stores
- Uses Mounting with Transparent Naming
Policy-Driven Under File System Migration
hdfs://host:port/directory/ (with, e.g., Reports and Sales directories underneath)
Alluxio Reference Architecture
[Diagram: applications use the Alluxio Client to read and write through Alluxio Workers (RAM / SSD / HDD); an Alluxio Master, with a Standby Master coordinated via Zookeeper / RAFT, manages metadata; Alluxio connects over the WAN to one or more under stores (Under Store 1, Under Store 2, …).]
APIs to Interact with Data in Alluxio
Applications have great flexibility to read / write data, with many options:

Spark
> rdd = sc.textFile("alluxio://localhost:19998/myInput")

Presto
CREATE SCHEMA hive.web
WITH (location = 'alluxio://master:port/my-table/')

POSIX
$ cat /mnt/alluxio/myInput

Java
FileSystem fs = FileSystem.Factory.get();
FileInStream in = fs.openFile(new AlluxioURI("/myInput"));

(A FUSE-mount note follows this slide.)
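The POSIX example assumes the Alluxio namespace has been mounted as a local filesystem via FUSE; a minimal sketch, assuming an Alluxio 2.x install and a hypothetical mount point:

# Expose the Alluxio root at /mnt/alluxio via FUSE, then read a file with standard tools
./integration/fuse/bin/alluxio-fuse mount /mnt/alluxio /
cat /mnt/alluxio/myInput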
Questions?
Next steps - Try it out!
•Getting Started - http://bit.ly/3396r9I
•Running Alluxio on Docker - http://bit.ly/2MLcDPw
•Running Alluxio on AWS EMR - http://bit.ly/2OI5HoO
•Running the Alluxio/Presto Sandbox - http://bit.ly/2OJ3FoC
•Spark and Alluxio in 5 minutes - http://bit.ly/2KC35Uu
