Accelerating Queries on Cloud Data Lakes
Alex Ma - Director, Solutions Engineering
May 2020
Why Hybrid Cloud?
▪ Time to production
▪ When you need compute capacity, expand cloud footprint with
significantly lower lag, compared with provisioning on-prem
▪ Leverage cloud flexibility for bursty workloads
▪ Reduce load on existing infrastructure by moving ephemeral workloads, or
workloads with unpredictable resource utilization, off the Hadoop cluster
▪ Intermediate step before migrating to the cloud
▪ Lower the risk of a full cloud data migration by starting with compute in the
cloud and data on-prem. A full migration can take years.
Hybrid Cloud Drivers
Approaches to Hybrid Cloud

Lift and Shift
● Migration may seem easier as no application re-architecture is needed
Issues
● If workloads are not made cloud-native and elastic, infrastructure cost can skyrocket
● If an on-prem data copy needs to be maintained, syncing cloud and on-prem data can be hard
● Performance can be dramatically impacted due to cloud storage limitations

Data copy by workload
● Simple tools are available, such as DistCp
● Works for workloads with easily identifiable datasets
Issues
● Datasets for many workloads cannot always be identified easily
● Significantly more data may be transferred than the workload requires
● Additional copies are very hard to sync back with the master data
(A hedged DistCp sketch follows this list.)

Compute-driven Data Caching
● Data is pulled into the cloud based on compute requests
● Data is cached locally to reduce I/O on remote clusters and is automatically synced
Issues
● Less helpful for workloads that don’t read a dataset more than once
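To make the "data copy by workload" approach concrete, here is a minimal DistCp sketch for copying a single dataset from on-prem HDFS into S3; the NameNode host, paths, and bucket name are hypothetical placeholders:

# Copy only the dataset a given workload needs from on-prem HDFS to S3
# (host, paths, and bucket are hypothetical)
hadoop distcp \
  hdfs://namenode.on-prem.example:8020/warehouse/sales/2020-05 \
  s3a://analytics-bucket/warehouse/sales/2020-05

Even a scoped copy like this starts to drift from the source as soon as the on-prem data changes, which is exactly the sync problem called out above.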
Problem: HDFS cluster is compute-bound & complex to maintain
[Architecture diagram: Spark, Presto, Hive, and TensorFlow run on the Alluxio Data Orchestration and Control Service both in the AWS public cloud (IaaS) and in the on-premises datacenter, with connectivity between the two environments.]
Barrier 1: Prohibitive network
latency and bandwidth limits
• Makes hybrid analytics unfeasible
Barrier 2: Copying data to cloud
• Difficult to maintain copies
• Data security and governance
• Costs of another silo
Solution: “Zero-copy” bursting to scale to the cloud
Step 1: Hybrid Cloud for Burst Compute Capacity
• Orchestrates compute access to on-prem data
• Working set of data, not the FULL set of data
• Local performance
• Scales elastically
• Offloads the on-prem cluster (both compute & I/O)
Step 2: Online Migration of Data Per Policy
• Flexible timing to migrate, with fewer dependencies
• Instead of a hard switch-over, migrate at your own pace
• Moves the data per policy – e.g. the last 7 days
(A mount sketch for Step 1 follows below.)
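As a minimal sketch of Step 1, an Alluxio cluster running in the cloud can mount the on-prem HDFS namespace so that cloud compute reads through Alluxio and only the working set is cached; this assumes a running Alluxio 2.x deployment, and the NameNode host and paths are hypothetical:

# Mount the on-prem HDFS namespace into the cloud-side Alluxio cluster (host/path hypothetical)
./bin/alluxio fs mount /on-prem hdfs://namenode.on-prem.example:8020/
# The first access pulls data over the WAN and caches it; repeat reads are served locally
./bin/alluxio fs ls /on-prem/warehouse/sales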
Production Examples
Alluxio at DBS
Unified Namespace
● Mount HDFS from other platforms into a common Alluxio cluster
Object store analytics
● Caching layer for hot data to speed up Presto and Spark jobs
Hybrid cloud bursting
● Extend the Alluxio cluster into an AWS VPC
● Run EMR for model training and bring the results back on-prem
(A hedged EMR sketch follows this slide.)
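As a rough illustration of the bursting pattern (not DBS's actual setup), an EMR cluster can be launched with Spark plus a bootstrap action that installs the Alluxio client; the cluster sizing, bucket, and bootstrap script name below are hypothetical:

# Launch an EMR cluster for model training; alluxio-emr-bootstrap.sh is a hypothetical install script
aws emr create-cluster \
  --name "training-burst" \
  --release-label emr-5.29.0 \
  --applications Name=Spark \
  --instance-type r5.2xlarge --instance-count 5 \
  --use-default-roles \
  --bootstrap-actions Path=s3://my-bucket/alluxio-emr-bootstrap.sh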
High Level Architecture
Walmart Use case
Why Walmart chose Alluxio’s
“Zero-Copy” burst solution:
• No requirement to
persist data into the cloud
• Improved query
performance and no
network hops on recurrent
queries 
• Lower costs without the
need for creating data copies
● No need to re-configure 2 data centers
● No large-scale investment
● Ability to handle new analysts without impacting or increasing response times
Alluxio Overview
Alluxio – Key innovations

Data Elasticity with a unified namespace
Abstract data silos & storage systems to independently scale data on demand with compute

Data Accessibility for popular APIs & API translation
Run Spark, Hive, Presto, and ML workloads on your data located anywhere

Data Locality with Intelligent Multi-tiering
Accelerate big data workloads with transparent tiered local data
Data Locality with Intelligent Multi-tiering
Local performance from remote data using multi-tier storage
Hot → RAM, Warm → SSD, Cold → HDD
● Read & write buffering, transparent to the application
● Policies for pinning, promotion/demotion, and TTL
(A tier-configuration sketch follows this slide.)
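As a minimal sketch of the tiering setup (assuming an Alluxio 2.x worker; the local paths and quotas are hypothetical), the tiers are mapped to local media in conf/alluxio-site.properties, and pinning/TTL policies are then applied per path:

# Define three storage tiers on a worker (paths and quotas are hypothetical)
cat >> conf/alluxio-site.properties <<'EOF'
alluxio.worker.tieredstore.levels=3
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
alluxio.worker.tieredstore.level0.dirs.quota=16GB
alluxio.worker.tieredstore.level1.alias=SSD
alluxio.worker.tieredstore.level1.dirs.path=/mnt/ssd
alluxio.worker.tieredstore.level1.dirs.quota=500GB
alluxio.worker.tieredstore.level2.alias=HDD
alluxio.worker.tieredstore.level2.dirs.path=/mnt/hdd
alluxio.worker.tieredstore.level2.dirs.quota=4TB
EOF
# Pin a hot table so it is never evicted, and expire scratch data after one day (TTL in milliseconds)
./bin/alluxio fs pin /warehouse/hot_table
./bin/alluxio fs setTtl /tmp/scratch 86400000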
New Technologies: Persistent Memory
Persistent Memory:
• PMEM represents a new class of memory and storage technology
architected specifically for data center usage
• Combination of high-capacity, affordability and persistence. 
RDMA: Remote Direct Memory Access
• Accessing (i.e. reading from or writing to) memory on
a remote machine without interrupting the processing
of the CPU(s) on that system.
• Zero-copy - applications perform data transfer without involving the
network software stack; data is sent and received directly to and from
application buffers without being copied between the network layers.
• Kernel bypass - applications perform data
transfer directly from userspace, no context
switches.
• No CPU involvement - applications can access
remote memory without consuming any CPU in
the remote machine.
Picture source: https://software.intel.com/en-us/blogs/2018/10/30/intel-optane-dc-persistent-memory-a-major-advance-in-memory-and-storage-architecture
Persistent Memory Operation Modes
DCPMM on a Cascade Lake platform (DIMMs attached to the CPU's integrated memory controllers):
• DIMM capacity: 128, 256, or 512 GB
• Speed: 2666 MT/sec
• Capacity per CPU: 3 TB (not including DRAM)
• DDR4 electrical & physical interface
• Close to DRAM latency
• Cache-line-size access
• DDR4 DRAM* sits alongside DCPMM* and is used either as regular DRAM or as a DRAM cache

1. MEMORY mode
● Large memory at lower cost

2. APP DIRECT mode
● Flexible, usage-specific partitions over a non-volatile memory pool
● Storage over App Direct
● Low-latency persistent memory
● Fast direct-attach storage
● Persistent data for rapid recovery

* DIMM population shown as an example only.
Alluxio DCPMM Tier architecture
Alluxio PMEM tier
• A new PMEM tier introduced to provide higher performance at lower cost
• Large capacity -> cache more data
• Higher performance compared with NVMe SSD
• Leverages the PMDK library to bypass filesystem overhead and context switches
• Delivers a dedicated SLA to mission-critical applications
[Diagram: applications use the Alluxio Client to reach the Alluxio Master and Alluxio Workers; each worker tiers data across DRAM, DCPMM, SSD, and HDD on top of the under storage.]
Data Accessibility via popular APIs and API
Translation
Convert from Client-side Interface to native Storage Interface
Client-side interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Storage drivers: HDFS Driver, S3 Driver, Swift Driver, NFS Driver
Unified Namespace: Global Data
Accessibility
Transparent access to under storage makes all enterprise data
available locally
SUPPORTS
• HDFS
• NFS
• OpenStack
• Ceph
• Amazon S3
• Azure
• Google Cloud
IT OPS FRIENDLY
• Storage mounted into Alluxio
by central IT
• Security in Alluxio mirrors
source data
• Authentication through
LDAP/AD
• Wireline encryption
Example mounted under stores: HDFS #1, HDFS #2, Object Store, NFS
(A mount sketch follows this slide.)
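As a minimal sketch of mounting several under stores into one Alluxio namespace (the hosts, bucket name, and credentials below are hypothetical placeholders):

# Mount a second HDFS cluster and an S3 bucket under the same Alluxio namespace
./bin/alluxio fs mount /mnt/hdfs2 hdfs://namenode2.example:8020/data
./bin/alluxio fs mount \
  --option aws.accessKeyId=<ACCESS_KEY_ID> \
  --option aws.secretKey=<SECRET_KEY> \
  /mnt/s3 s3://analytics-bucket/datasets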
Data Elasticity via Unified Namespace
Enables effective data management across different under stores
- Uses Mounting with Transparent Naming
Policy-Driven Under File System Migration
hdfs://host:port/directory/ (with, e.g., Reports and Sales directories underneath)
Alluxio Reference Architecture
[Diagram: applications use the Alluxio Client to read and write through Alluxio Workers (RAM / SSD / HDD); an Alluxio Master, with a Standby Master coordinated via Zookeeper / RAFT, manages metadata; Alluxio connects over the WAN to one or more under stores (Under Store 1, Under Store 2, …).]
APIs to Interact with Data in Alluxio
Applications have great flexibility to read / write data, with many options:

Spark
> rdd = sc.textFile("alluxio://localhost:19998/myInput")

Presto
CREATE SCHEMA hive.web
WITH (location = 'alluxio://master:port/my-table/')

POSIX
$ cat /mnt/alluxio/myInput

Java
FileSystem fs = FileSystem.Factory.get();
FileInStream in = fs.openFile(new AlluxioURI("/myInput"));

(A FUSE-mount note follows this slide.)
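The POSIX example assumes the Alluxio namespace has been mounted as a local filesystem via FUSE; a minimal sketch, assuming an Alluxio 2.x install and a hypothetical mount point:

# Expose the Alluxio root at /mnt/alluxio via FUSE, then read a file with standard tools
./integration/fuse/bin/alluxio-fuse mount /mnt/alluxio /
cat /mnt/alluxio/myInput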
Questions?
Next steps - Try it out!
•Getting Started - http://bit.ly/3396r9I
•Running Alluxio on Docker - http://bit.ly/2MLcDPw
•Running Alluxio on AWS EMR - http://bit.ly/2OI5HoO
•Running the Alluxio/Presto Sandbox - http://bit.ly/2OJ3FoC
•Spark and Alluxio in 5 minutes - http://bit.ly/2KC35Uu
