This document discusses using a hybrid cloud approach with data orchestration to enable analytics workloads on data stored both on-premises and in the cloud. It outlines reasons for a hybrid approach including reducing time to production and leveraging cloud flexibility. It then describes alternatives like lift-and-shift or compute-driven approaches and their issues. Finally, it introduces a data orchestration platform that can cache and tier data intelligently while enabling analytics frameworks to access both on-premises and cloud-based data with low latency.
2. Why Hybrid Cloud?
§ Time to production
§ When you need compute capacity, expand your cloud footprint with
significantly lower lag than provisioning on-prem hardware
§ Leverage cloud flexibility for bursty workloads
§ Reduce load on existing infrastructure by moving ephemeral workloads, or
workloads with unpredictable resource utilization, off Hadoop
§ Intermediate step before migrating to the cloud
§ Lower the risk of a full cloud data migration by starting with compute in the
cloud and data on-prem. A full migration can take years.
Hybrid Cloud Drivers
3. Goal: Enable data workloads in the cloud on existing
on-prem data
Possible Requirements
§ Data cannot be persisted in a public cloud
§ Additional I/O capacity cannot be added to existing Hadoop infrastructure
§ On-prem level security needs to be maintained
§ Network bandwidth utilization needs to be minimal
Alternatives
§ Lift and Shift
§ Data copy by workload
§ Compute-driven Data Caching
4. Approaches to Hybrid Cloud
Data copy by workload
§ Simple tools available, like distCp
§ Works for workloads with easily identifiable datasets
Issues
§ Datasets for many workloads cannot always be identified easily
§ Significantly more data may be transferred than the workload requires
§ Additional copies are very hard to sync back with master data
Lift and Shift
§ Migration may seem easier, as no application re-architecture is needed
Issues
§ If workloads are not made cloud-native and elastic, infrastructure cost can skyrocket
§ If an on-prem data copy needs to be maintained, syncing cloud and on-prem data can be hard
§ Performance can be dramatically impacted due to cloud storage limitations
Compute-driven Data Caching
§ Data is pulled into the cloud based on compute requests
§ Data is cached locally to reduce I/O on remote clusters and is automatically synced
Issues
§ Less helpful for workloads that don't read a data set more than once
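The "data copy by workload" approach above typically relies on a bulk-copy tool such as distCp. As a minimal sketch, the copy step could be scripted like this; the source and destination paths are illustrative, and this assumes a Hadoop client is available on the PATH:

```python
import subprocess

def build_distcp_cmd(src, dst, overwrite=False):
    """Build a `hadoop distcp` command line copying src to dst."""
    cmd = ["hadoop", "distcp"]
    if overwrite:
        cmd.append("-overwrite")   # replace existing files at the destination
    cmd += [src, dst]
    return cmd

def run_copy(src, dst):
    # Issues the copy; requires a configured Hadoop client (illustration only).
    return subprocess.run(build_distcp_cmd(src, dst)).returncode

# Example command for copying an on-prem dataset into a cloud bucket:
print(build_distcp_cmd("hdfs://namenode/warehouse/trades", "s3a://my-bucket/trades"))
```

Note that this is exactly where the listed issues bite: the script must know the dataset's path up front, and nothing here syncs later changes back to the master copy.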
6. Data Orchestration for the Cloud
APIs: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Storage drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
Enable innovation with any frameworks running on data stored anywhere
Personas: Data Analyst, Data Engineer, Data Scientist, Storage Ops, Lines of Business
8. Problem: HDFS cluster is compute-bound & complex to maintain
Architecture: an on-prem datacenter (Spark, Presto, Hive, TensorFlow over the Alluxio Data Orchestration and Control Service) with connectivity to AWS public cloud IaaS running the same stack.
Barrier 1: Prohibitive network latency and bandwidth limits
• Makes hybrid analytics unfeasible
Barrier 2: Copying data to the cloud
• Difficult to maintain copies
• Data security and governance
• Costs of another silo
Step 1: Hybrid Cloud for Burst Compute Capacity
• Orchestrates compute access to on-prem data
• Moves the working set of data, not the FULL set of data
• Local performance
• Scales elastically
• Offloads the on-prem cluster (both compute & I/O)
Step 2: Online Migration of Data Per Policy
• Flexible timing to migrate, with fewer dependencies
• Instead of a hard switch-over, migrate at your own pace
• Moves data per policy – e.g. the last 7 days
"Zero-copy" bursting to the cloud
9. Using Alluxio with AWS EMR
Presto and Hive run on EMR instances with an Alluxio metadata & data cache in front of on-prem HDFS and EMRFS; data movement is compute-driven, with continuous sync back to the source.
10. Feature Highlight: Data Caching for faster compute
Frameworks (Spark, Presto, Hive, TensorFlow) issue data requests such as "read file /trades/us" against remote buckets (Trades, Customers), which respond with variable latency and throttling. After the first read, repeat reads of /trades/us and subsequent reads of /trades/top are served from the RAM cache instead of the remote store.
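The caching behavior on this slide can be sketched as a read-through cache: the first read of a path goes to remote storage, and repeats are served locally. The `fetch` callback and paths are illustrative stand-ins for the remote bucket:

```python
class ReadThroughCache:
    """Minimal read-through cache: first read hits remote storage, repeats are local."""
    def __init__(self, fetch):
        self.fetch = fetch          # callback that reads from the remote store
        self.cache = {}             # path -> bytes held locally (the "RAM" tier)
        self.remote_reads = 0       # how many reads actually left the cluster

    def read(self, path):
        if path not in self.cache:  # cache miss: pay the remote latency once
            self.remote_reads += 1
            self.cache[path] = self.fetch(path)
        return self.cache[path]     # cache hit: served at local speed

# Illustrative remote bucket contents
remote = {"/trades/us": b"us-trades", "/trades/top": b"top-trades"}
cache = ReadThroughCache(remote.__getitem__)
cache.read("/trades/us")
cache.read("/trades/us")            # second read never touches the remote store
```

This is why the approach is "less helpful for workloads that don't read a data set more than once": a single read pays full remote cost either way.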
11. Feature Highlight: Intelligent Tiering for resource efficiency
Frameworks (Spark, Presto, Hive, TensorFlow) read against remote buckets with variable latency and throttling. Cached data is spread across RAM, SSD, and disk tiers; when RAM runs out of memory (e.g. on a read of /customers/145), data is moved to another tier rather than dropped.
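The tiering idea above can be sketched as a bounded hot tier that demotes its least-recently-used entries to a lower tier instead of evicting them outright. The capacity, paths, and two-tier split are simplifying assumptions:

```python
from collections import OrderedDict

class TieredStore:
    """Bounded hot (RAM) tier; overflow is demoted to a lower (SSD) tier, not dropped."""
    def __init__(self, ram_capacity):
        self.ram = OrderedDict()   # hot tier, least recently used first
        self.ssd = {}              # lower tier
        self.ram_capacity = ram_capacity

    def put(self, path, data):
        self.ram[path] = data
        self.ram.move_to_end(path)                     # mark as most recently used
        while len(self.ram) > self.ram_capacity:
            cold, blob = self.ram.popitem(last=False)  # out of memory: pick coldest
            self.ssd[cold] = blob                      # moved to another tier

    def get(self, path):
        if path in self.ram:
            self.ram.move_to_end(path)                 # refresh recency on hit
            return self.ram[path]
        return self.ssd[path]                          # still available, just slower

store = TieredStore(ram_capacity=2)
store.put("/trades/us", b"a")
store.put("/trades/top", b"b")
store.put("/customers/145", b"c")   # RAM full: oldest entry demoted to SSD
```

The resource-efficiency point is that demoted data stays readable, so a later access costs an SSD read rather than a throttled remote fetch.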
12. Feature Highlight: Policy-driven Data Management
New trades land in the RAM/SSD/disk tiers backed by HDFS. A defined policy moves data more than 90 days old to S3 Standard; the policy interval is every day, and the policy is applied each day.
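A "move data > 90 days old" policy boils down to partitioning files by age on each daily run. A minimal sketch, assuming a hypothetical listing of path to last-modified time (epoch seconds); the actual mover would then copy each selected file to S3 and delete it from HDFS:

```python
import time

def apply_age_policy(files, max_age_days=90, now=None):
    """Split files into (keep, move): move everything older than max_age_days.

    files: dict mapping path -> last-modified time in epoch seconds.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400          # 86400 seconds per day
    move = sorted(p for p, mtime in files.items() if mtime < cutoff)
    keep = sorted(p for p in files if p not in set(move))
    return keep, move

# Illustrative run with a fixed clock: day 100, 90-day policy
day = 86400
files = {"/trades/new": 100 * day, "/trades/old": 5 * day}
keep, move = apply_age_policy(files, max_age_days=90, now=100 * day)
```

Running this on the stated daily interval keeps the hot tiers small while cold data migrates to cheaper storage at its own pace.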
13. APIs to Interact with Data in Alluxio
Applications have great flexibility to read / write data with many options.
Spark:
> rdd = sc.textFile("alluxio://localhost:19998/myInput")
Presto:
CREATE SCHEMA hive.web
WITH (location = 'alluxio://master:port/my-table/')
POSIX:
$ cat /mnt/alluxio/myInput
Java:
FileSystem fs = FileSystem.Factory.get();
FileInStream in = fs.openFile(new AlluxioURI("/myInput"));
16. About DBS
• Headquartered in Singapore
• Largest bank in Southeast Asia
• Present in 18 markets globally, including 6 priority markets:
Singapore, Hong Kong, China, India, Indonesia and Taiwan
• We have a very cool digiBank app
• And lots and lots of data systems
17. Evolution of Data Platforms at DBS
Generation 1
• Boxed data
• Monolithic/Closed Systems
• Proprietary HW/SW
• Data for Targeted Use Cases
• Stack: Teradata, Informatica, SAS
Generation 2
• Big Data Explosion
• Hadoop Data Lakes
• Commodity HW and Hadoop Ecosystem
• Compute tied to Storage
• Stack: Teradata, Informatica, SAS, Hadoop
Generation 3
• Data Democratization
• Cloud Native platform… Hybrid! Multi!
• Open Source Engines
• Burst compute in the cloud with data on-prem for compliance
• AI/ML Centric
• Stack: on-prem engines over HDFS and Object Store, plus AWS engines
18. Challenges
1. Data Lake built on local Object Store
• Expensive rename operation
• Object listing is slow
• Variable performance
• Data locality is gone
2. Multiple Data Silos
3. Limited on-premises compute capacity
• Legacy ITIL processes for infra provisioning
• No dynamic scale out/in
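The "expensive rename" challenge comes from object stores having a flat key space: there are no real directories, so renaming a prefix of N objects means N server-side copies plus N deletes, unlike a filesystem's constant-time rename. A toy simulation of that cost model (keys and contents are illustrative):

```python
class FakeObjectStore:
    """Flat key space: 'renaming a directory' is really copy + delete per object."""
    def __init__(self, objects):
        self.objects = dict(objects)   # key -> bytes
        self.ops = 0                   # server-side copy/delete requests issued

    def rename_prefix(self, old, new):
        for key in [k for k in self.objects if k.startswith(old)]:
            self.objects[new + key[len(old):]] = self.objects[key]  # copy object
            del self.objects[key]                                   # delete original
            self.ops += 2              # two requests per object, however large

# "Renaming" one partition of a table touches every object under the prefix
store = FakeObjectStore({"tbl/p=1/a": b"x", "tbl/p=1/b": b"y", "tbl/p=2/c": b"z"})
store.rename_prefix("tbl/p=1/", "tbl/p=9/")
```

This is why engines that stage output in a temporary prefix and rename it into place (a cheap pattern on HDFS) slow down badly on a local object store.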
19. Alluxio at DBS
Unified Namespace
• Mount HDFS from other platforms into a common Alluxio cluster
Object Store Analytics
• Caching layer for hot data to speed up Presto and Spark jobs
Hybrid Cloud Bursting
• Extend the Alluxio cluster into an AWS VPC
• Run EMR for model training and bring the results back on-prem
21. Burst Processing into the Cloud
The Use Case
• Call Center project: millions of calls annually
• Why do our customers call us? What do they do before picking up the phone?
• Reconstruct the customer journey
• Predict the reason for the call
The Challenges
• Transcript quality
• Need lots of compute: >30TB of clickstream, transaction, customer, and product data; >20TB of audio files
• Need dynamic compute for training and analysis
• Data needs to reside on-prem for compliance