From limited Hadoop compute capacity to increased data scientist efficiency

Data Orchestration Platform for the Cloud
September 2019

Why Hybrid Cloud?
§ Time to production
§ When you need compute capacity, expand your cloud footprint with
significantly less lead time than provisioning on-prem
§ Leverage cloud flexibility for bursty workloads
§ Reduce load on existing Hadoop infrastructure by moving ephemeral
workloads, or workloads with unpredictable resource utilization, to the cloud
§ Intermediate step before migrating to the cloud
§ Lower the risk of a full cloud data migration by starting with compute in the
cloud and data on-prem. A full migration can take years.
Hybrid Cloud Drivers

Goal: Enable data workloads in the cloud on existing
on-prem data
Possible Requirements
§ Data cannot be persisted in a public cloud
§ Additional I/O capacity cannot be added to existing Hadoop infrastructure
§ On-prem level security needs to be maintained
§ Network bandwidth utilization needs to be minimal
Approaches to Hybrid Cloud
Alternatives: Lift and Shift, Data copy by workload, Compute-driven Data Caching

Lift and Shift
§ Migration may seem easier as no application re-architecture is needed
Issues
§ If workloads are not made cloud-native and elastic, infrastructure cost can skyrocket
§ If an on-prem data copy needs to be maintained, syncing cloud and on-prem data can be hard
§ Performance can be dramatically impacted due to cloud storage limitations

Data copy by workload
§ Simple tools are available, like DistCp
§ Works for workloads with easily identifiable datasets
Issues
§ Datasets for many workloads cannot always be identified easily
§ Significantly more data may be transferred than the workload requires
§ Additional copies are very hard to sync back with master data

Compute-driven Data Caching
§ Data is pulled into the cloud based on compute requests
§ Data is cached locally to reduce I/O on remote clusters and is automatically synced
Issues
§ Less helpful for workloads that don’t read a data set more than once
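
The trade-off between copying a dataset per workload and compute-driven caching can be illustrated with a small sketch. The file names, sizes, and access trace below are made up for illustration and are not from the slides. The sketch only shows that on-demand caching transfers what the workload actually reads, and that a single-pass workload gains nothing from the cache.

```python
# Hypothetical dataset: file name -> size in GB (illustrative numbers only).
dataset = {"part-0": 10, "part-1": 10, "part-2": 10, "part-3": 10}

def copy_by_workload_gb(dataset):
    """Copying the dataset up front transfers every file once."""
    return sum(dataset.values())

def compute_driven_gb(dataset, trace):
    """On-demand caching transfers a file only the first time it is read."""
    cached = set()
    transferred = 0
    for name in trace:
        if name not in cached:
            transferred += dataset[name]
            cached.add(name)
    return transferred

def uncached_remote_gb(dataset, trace):
    """Without a cache, every read goes over the network."""
    return sum(dataset[name] for name in trace)

trace = ["part-0", "part-1", "part-0", "part-0", "part-1"]
print(copy_by_workload_gb(dataset))        # 40
print(compute_driven_gb(dataset, trace))   # 20
print(uncached_remote_gb(dataset, trace))  # 50
```

For a trace that touches each file exactly once, `compute_driven_gb` and `uncached_remote_gb` return the same number, which is the "less helpful" case listed under Issues.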

Data Orchestration for the Cloud

Interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
Enable innovation with any frameworks
running on data stored anywhere
Data Analyst
Data Engineer
Storage Ops
Data Scientist
Lines of Business


Problem: HDFS cluster is compute-bound & complex to maintain
[Diagram: an AWS public cloud IaaS environment and an on-premises datacenter, linked by network connectivity. Each side runs Spark, Presto, Hive and TensorFlow on top of the Alluxio Data Orchestration and Control Service.]
Barrier 1: Prohibitive network latency
and bandwidth limits
• Makes hybrid analytics unfeasible
Barrier 2: Copying data to cloud
• Difficult to maintain copies
• Data security and governance
• Costs of another silo
Step 1: Hybrid Cloud for Burst Compute Capacity
• Orchestrates compute access to on-prem data
• Working set of data, not FULL set of data
• Local performance
• Scales elastically
• On-Prem Cluster Offload (both Compute & I/O)
Step 2: Online Migration of Data Per Policy
• Flexible timing to migrate, with fewer dependencies
• Instead of a hard switchover, migrate at your own pace
• Moves the data per policy – e.g. last 7 days
“Zero-copy” bursting to the cloud

Using Alluxio with AWS EMR
[Diagram: two EMR deployments, each running Presto and Hive on instances with an Alluxio metadata & data cache. Storage labels: HDFS and EMRFS. Data access is compute-driven, with continuous sync.]

Feature Highlight: Data Caching for faster compute
[Diagram: a framework sends data requests through an Alluxio RAM tier backed by two buckets, Trades and Customers. The first reads of /trades/us and /trades/top go to the bucket, with variable latency and throttling. Reading /trades/us and /trades/top again is served from RAM.]
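
The caching behaviour in this slide can be sketched as a small read-through cache. This is a minimal illustration and not Alluxio's implementation: the class name, the LRU eviction choice, and the bucket contents are all assumptions made for the example.

```python
from collections import OrderedDict

class ReadThroughCache:
    """Tiny LRU read-through cache; a stand-in for a RAM tier, not Alluxio's code."""

    def __init__(self, fetch_remote, capacity):
        self.fetch_remote = fetch_remote   # callable: path -> bytes
        self.capacity = capacity           # max number of cached files
        self.entries = OrderedDict()
        self.remote_reads = 0

    def read(self, path):
        if path in self.entries:
            self.entries.move_to_end(path)     # mark as most recently used
            return self.entries[path]
        self.remote_reads += 1                 # cache miss: go to the bucket
        data = self.fetch_remote(path)
        self.entries[path] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used
        return data

# Hypothetical bucket contents.
bucket = {"/trades/us": b"us-trades", "/trades/top": b"top-trades"}
cache = ReadThroughCache(bucket.__getitem__, capacity=2)

for path in ["/trades/us", "/trades/us", "/trades/top", "/trades/top"]:
    cache.read(path)
print(cache.remote_reads)  # 2
```

Only the first read of each file reaches the bucket. The repeated reads are served locally, which is where the latency benefit comes from.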

Feature Highlight: Intelligent Tiering for resource efficiency
[Diagram: a framework sends data requests through Alluxio storage tiers (RAM, SSD, Disk) backed by two buckets, Trades and Customers, which have variable latency and throttling. Reading /customers/145 when RAM is out of memory causes existing data to be moved to another tier.]
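
One way to picture "data moved to another tier" is a toy store in which a full tier pushes its oldest block down to the next tier. This is a sketch under simplifying assumptions (capacity counted in blocks, oldest-first demotion, hypothetical class and method names) and does not describe Alluxio's actual eviction logic.

```python
from collections import OrderedDict

class TieredStore:
    """Toy tiered store: when a tier is full, its oldest block moves down a tier."""

    def __init__(self, capacities):
        # capacities: ordered mapping of tier name -> max number of blocks
        self.capacities = capacities
        self.tiers = {name: OrderedDict() for name in capacities}

    def put(self, key, value, tier_index=0):
        names = list(self.capacities)
        if tier_index >= len(names):
            return  # fell off the last tier: block is dropped from the cache
        name = names[tier_index]
        tier = self.tiers[name]
        tier[key] = value
        if len(tier) > self.capacities[name]:
            old_key, old_value = tier.popitem(last=False)  # oldest block
            self.put(old_key, old_value, tier_index + 1)   # demote it

    def locate(self, key):
        for name, tier in self.tiers.items():
            if key in tier:
                return name
        return None

store = TieredStore({"RAM": 1, "SSD": 1, "Disk": 1})
store.put("/trades/us", b"...")
store.put("/customers/145", b"...")    # RAM is full, so /trades/us moves to SSD
print(store.locate("/customers/145"))  # RAM
print(store.locate("/trades/us"))      # SSD
```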

Feature Highlight: Policy-driven Data Management
[Diagram: a framework on Alluxio storage tiers (RAM, SSD, Disk). Labels: New Trades, HDFS, S3 Standard.]
§ Policy defined: move data > 90 days old to S3 Standard
§ Policy interval: every day (the policy is applied every day)
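
The selection step of such a policy can be sketched as a daily check that picks out files older than 90 days. The function name, file paths, and dates below are hypothetical. The actual move to S3 Standard is left out, since the slides do not show how it is triggered.

```python
from datetime import datetime, timedelta

def files_to_move(files, now, max_age=timedelta(days=90)):
    """Return paths whose last-modified time is more than max_age before now."""
    return [path for path, modified in files.items() if now - modified > max_age]

# Hypothetical file listing: path -> last-modified time.
now = datetime(2019, 9, 1)
files = {
    "/trades/2019-08-30": datetime(2019, 8, 30),
    "/trades/2019-01-15": datetime(2019, 1, 15),
}
print(files_to_move(files, now))  # ['/trades/2019-01-15']
```

Running this check once per day corresponds to the "every day" policy interval on the slide.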

APIs to Interact with data in Alluxio
Applications have great flexibility to read / write data, with many options
Spark
> rdd = sc.textFile("alluxio://localhost:19998/myInput")
Presto
CREATE SCHEMA hive.web
WITH (location = 'alluxio://master:port/my-table/')
POSIX
$ cat /mnt/alluxio/myInput
Java
FileSystem fs = FileSystem.Factory.get();
FileInStream in = fs.openFile(new AlluxioURI("/myInput"));
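
The Spark and POSIX examples above address the same file in two ways. The sketch below uses only Python's standard library to split the `alluxio://` URI into its parts. The `/mnt/alluxio` mount point is taken from the `cat` example and is assumed here to expose the root of the Alluxio namespace.

```python
from urllib.parse import urlparse

uri = urlparse("alluxio://localhost:19998/myInput")
print(uri.scheme)    # alluxio
print(uri.hostname)  # localhost  (Alluxio master host)
print(uri.port)      # 19998      (Alluxio master port)
print(uri.path)      # /myInput   (path inside the Alluxio namespace)

# Assumption: the namespace root is mounted at /mnt/alluxio.
posix_path = "/mnt/alluxio" + uri.path
print(posix_path)    # /mnt/alluxio/myInput
```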

Alluxio Reference Architecture
[Diagram components: applications with Alluxio clients; an Alluxio master and a standby master coordinated through ZooKeeper / RAFT; Alluxio workers with RAM / SSD / HDD storage; a WAN link; Under Store 1 and Under Store 2.]

About DBS
• Headquartered in Singapore
• Largest bank in Southeast Asia
• Present in 18 markets globally, including 6 priority markets: Singapore, Hong Kong, China, India, Indonesia and Taiwan
• We have a very cool digiBank app
• And lots, lots, lots of data systems

Evolution of Data Platforms at DBS
Generation 1
• Boxed data
• Monolithic/Closed Systems
• Proprietary HW/SW
• Data for Targeted Use Cases
Generation 2
• Big Data Explosion
• Hadoop Data Lakes
• Commodity HW and Hadoop Ecosystem
• Compute tied to Storage
Generation 3
• Data Democratization
• Cloud Native platform… Hybrid! Multi!
• Open Source Engines
• Burst compute in the cloud with data on-prem for compliance
• AI/ML Centric
[Diagram labels: Teradata, Informatica, SAS; Hadoop; AWS Engines, Onprem Engines, HDFS, Object Store]

Challenges
1. Data Lake built on local Object Store
• Expensive rename operation
• Object listing is slow
• Variable performance
• Data locality is gone
2. Multiple Data Silos
3. Limited on-premises compute capacity
• Legacy ITIL processes for infra provisioning
• No dynamic scale out/in
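
The "expensive rename" challenge follows from how flat object stores generally work: there are no real directories, so renaming a "directory" means copying and deleting every object under the prefix. The sketch below emulates this with a plain dictionary. The key names and object count are illustrative, and a real store would also pay network and request costs per object.

```python
def rename_prefix(store, src, dst):
    """Emulate a directory rename on a flat key-value object store.

    Returns the number of objects that had to be copied and deleted.
    """
    moved = 0
    for key in [k for k in store if k.startswith(src)]:
        store[dst + key[len(src):]] = store[key]  # copy to the new key
        del store[key]                            # delete the old key
        moved += 1
    return moved

# Hypothetical job output: 1000 part files under a temporary prefix.
store = {f"tmp/job1/part-{i}": b"data" for i in range(1000)}
print(rename_prefix(store, "tmp/job1/", "final/job1/"))  # 1000
```

On HDFS the same rename is a single metadata operation, which is why jobs that commit output by renaming a temporary directory tend to slow down on object stores.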

Alluxio at DBS
Unified Namespace
• Mount HDFS from other platforms into a common Alluxio cluster
Object store Analytics
• Caching layer for hot data to speed up Presto and Spark jobs
Hybrid cloud bursting
• Extend Alluxio cluster into AWS VPC
• Run EMR for model training and bring the results back to on-prem
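
The unified namespace idea can be sketched as a mount table that maps paths in one logical tree to locations in different under stores. This is a simplified illustration, not Alluxio's code. The mount points, host names, and bucket name are hypothetical.

```python
def resolve(mounts, alluxio_path):
    """Map an Alluxio path to an under-store URI using the longest matching mount point."""
    best = None
    for mount_point in mounts:
        prefix = mount_point.rstrip("/") + "/"
        if alluxio_path == mount_point or alluxio_path.startswith(prefix):
            if best is None or len(mount_point) > len(best):
                best = mount_point
    if best is None:
        raise KeyError(alluxio_path)
    # Replace the mount point with the under-store location it maps to.
    return mounts[best].rstrip("/") + alluxio_path[len(best.rstrip("/")):]

# Hypothetical mount table: Alluxio mount point -> under-store URI.
mounts = {
    "/finance": "hdfs://finance-nn:8020/data",
    "/risk": "hdfs://risk-nn:8020/warehouse",
    "/lake": "s3://example-bucket/lake",
}
print(resolve(mounts, "/risk/positions/2019"))
# hdfs://risk-nn:8020/warehouse/positions/2019
```

Applications only see paths such as `/risk/positions/2019`, so the storage behind a mount point can change without touching job code.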

Burst processing into the Cloud
The Use Case
• Call Center project
• Millions of calls annually
• Why do our customers call us?
• What do they do before picking up the phone?
• Reconstruct customer journey
• Predict the reason for the call
The Challenges
• Transcript quality
• Need lots of compute
• > 30 TB of clickstream, transaction, customer, and product data
• > 20 TB of audio files
• Need dynamic compute for training and analysis
• Data needs to reside on-prem for compliance

High Level Architecture
[Architecture diagram; surviving label: Data Orchestration, shown twice]

Environment Topology
[Diagram: US-East-1 and EU-West-1 regions linked at 65 ms latency and 100 Mbps bandwidth. Labels: Alluxio, Presto, Hive.]
Testing Environment
Amazon EC2
§ M4.XL instances
• 4x vCPU
• 16 GB RAM
• EBS
• High network performance
§ EMR 5.23.0
§ Hadoop 2.8.5
§ Alluxio 2.0
US Cluster Topology
§ 1x node running Alluxio Master and Presto Coordinator
§ 10x nodes running:
• Alluxio Client
• Alluxio Worker
• Presto Worker
§ US Alluxio cluster configured to mount:
• EU HDFS
• US S3
§ Tables live in EU HDFS
EU Cluster Topology
§ EU HDFS cluster
• 1x NameNode
• 8x DataNode
§ 40 GB compressed dataset (100 GB uncompressed)
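
A back-of-the-envelope calculation shows why repeated remote reads are costly in this topology. The sketch assumes the 100 Mbps link is fully available to the transfer, reads "mbps" as megabits per second, and uses decimal units. Latency and protocol overhead are ignored, so real transfers would take longer.

```python
def transfer_seconds(size_gb, link_mbps):
    """Time to move size_gb gigabytes over a link of link_mbps megabits per second."""
    bits = size_gb * 8 * 1000**3          # decimal gigabytes -> bits
    return bits / (link_mbps * 1000**2)   # bits / (bits per second)

print(transfer_seconds(40, 100) / 60)   # compressed dataset, in minutes
print(transfer_seconds(100, 100) / 60)  # uncompressed dataset, in minutes
```

Under these assumptions, pulling the 40 GB compressed dataset across regions takes roughly 53 minutes per full pass, so a query that re-reads it from a local cache avoids most of that cost.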

DATA ORCHESTRATION SUMMIT
November 7, 2019 | Computer History Museum | Mountain View, CA
Organized by
Register Here! Discount Code: “WELCOME”

Questions?
