Accelerating Analytics with EMR on your S3 Data Lake

Dipti Borkar | Product | dipti@alluxio.com

AWS S3: The Default Data Lake on AWS
Why build your data lake on S3?
Highly available
Simple API
Fully managed
Really large scale
Cost-effective
Integration with many different services and frameworks

Analytics with AWS EMR on S3
[Diagram: EMR instances, each running Presto and Hive on top of HDFS and EMRFS]

HDFS on AWS EMR
[Diagram: EMR instances, each running Presto and Hive on top of HDFS and EMRFS; label: Manual distcp / No sync]

Using EMRFS on AWS EMR
[Diagram: EMR instances, each running Presto and Hive on top of HDFS and EMRFS; label: No data caching]

Challenges with Analytics on S3
Big data frameworks on the public cloud
[Diagram: four Spark instances]
Challenges with S3
• Expensive metadata operations like list, rename
• Eventual consistency in some cases
• Inconsistent performance / throttling / limits
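
To make the "expensive rename" point concrete, here is a minimal sketch of why renaming a directory is costly on a flat object store: there are no real directories, so a rename becomes one copy plus one delete per object. `FlatObjectStore` and `rename_prefix` are hypothetical names for a toy in-memory model, not the S3 API, and the request counter only approximates real request costs.

```python
class FlatObjectStore:
    """Toy key-value object store: keys are full paths, no real directories."""

    def __init__(self):
        self.objects = {}
        self.requests = 0  # counts simulated API calls

    def put(self, key, data):
        self.requests += 1
        self.objects[key] = data

    def list_prefix(self, prefix):
        self.requests += 1
        return sorted(k for k in self.objects if k.startswith(prefix))

    def copy(self, src, dst):
        self.requests += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.requests += 1
        del self.objects[key]


def rename_prefix(store, src_prefix, dst_prefix):
    """Emulate a directory rename: one copy and one delete per object."""
    for key in store.list_prefix(src_prefix):
        new_key = dst_prefix + key[len(src_prefix):]
        store.copy(key, new_key)
        store.delete(key)


store = FlatObjectStore()
for i in range(100):
    store.put(f"tmp/job1/part-{i:05d}", b"rows")

store.requests = 0  # only count the rename itself
rename_prefix(store, "tmp/job1/", "warehouse/trades/")
rename_requests = store.requests
print(rename_requests)
```

In this toy model the cost grows linearly with the number of objects under the prefix, whereas a file system with real directories can rename in a single metadata operation.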

Spark workloads on S3 with Alluxio
Compute caching for Spark on S3
Accelerate big data frameworks on the public cloud
• Provides a data caching layer for Spark
• Provides strong consistency for metadata operations and faster performance
• Provides API compatibility with HDFS & S3
• S3 is eventually consistent, making it hard to predict query results
• Allows data outside of S3, such as remote HDFS, to be analyzed as well
[Diagram: Spark and Alluxio running together in the same instance / container, repeated across the cluster]

Data Orchestration for the Cloud
Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
HDFS Driver, Swift Driver, S3 Driver, NFS Driver
Independent scaling of compute & storage

Using Alluxio with AWS EMR
[Diagram: EMR instances, each running Presto and Hive with a metadata & data cache in front of HDFS and EMRFS; label: Compute-driven continuous sync]

Alluxio – Data Caching for faster compute
[Diagram: frameworks (Spark, Presto, Hive, TensorFlow) on an instance send data requests through RAM, SSD, and disk tiers to the S3 buckets Trades and Customers. Labels: Read file /trades/us; Read file /trades/us again; Read file /trades/top; Variable latency with throttling]
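
The caching behavior on this slide can be sketched as a read-through cache: the first read of a file goes to the bucket, and a repeated read is served locally. This is a minimal illustration under stated assumptions, not Alluxio's implementation; `ReadThroughCache` and `fetch_from_bucket` are hypothetical names, and LRU eviction is an assumption chosen for simplicity.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Serve repeated reads locally; fall back to the remote store on a miss."""

    def __init__(self, fetch, capacity):
        self.fetch = fetch          # function that reads from the remote store
        self.capacity = capacity    # maximum number of cached files
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, path):
        if path in self.entries:
            self.entries.move_to_end(path)  # mark as most recently used
            self.hits += 1
            return self.entries[path]
        self.misses += 1
        data = self.fetch(path)
        self.entries[path] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return data


remote_reads = []


def fetch_from_bucket(path):
    """Stand-in for a slow, throttled read from the object store."""
    remote_reads.append(path)
    return f"contents of {path}".encode()


cache = ReadThroughCache(fetch_from_bucket, capacity=2)
cache.read("/trades/us")        # miss: goes to the bucket
cache.read("/trades/us")        # hit: served from the cache
cache.read("/trades/top")       # miss: goes to the bucket
print(remote_reads, cache.hits, cache.misses)
```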

Alluxio - Intelligent Tiering for resource efficiency
[Diagram: a framework on an instance sends data requests through RAM, SSD, and disk tiers to the S3 buckets Trades and Customers. Labels: Read file /customers/145; Out of memory; Data moved to another tier; Variable latency with throttling]
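
The "out of memory, data moved to another tier" step can be sketched as a chain of fixed-size tiers where a full tier demotes its oldest block to the next one down. This is a toy model under stated assumptions: `TieredStore` is a hypothetical name, capacities are counted in blocks, and the oldest-first demotion rule is an assumption rather than Alluxio's actual eviction policy.

```python
from collections import OrderedDict


class TieredStore:
    """Toy multi-tier store: a full tier demotes its oldest block downward."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), fastest tier first
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def put(self, block_id, data, level=0):
        if level >= len(self.tiers):
            return  # dropped from the last tier; the under store still has it
        _, cap, blocks = self.tiers[level]
        blocks[block_id] = data
        blocks.move_to_end(block_id)
        if len(blocks) > cap:
            old_id, old_data = blocks.popitem(last=False)
            self.put(old_id, old_data, level + 1)  # demote to the next tier

    def tier_of(self, block_id):
        for name, _, blocks in self.tiers:
            if block_id in blocks:
                return name
        return None


store = TieredStore([("RAM", 1), ("SSD", 1), ("HDD", 1)])
store.put("/trades/us", b"...")
store.put("/customers/145", b"...")   # RAM is full: /trades/us moves to SSD
print(store.tier_of("/customers/145"), store.tier_of("/trades/us"))
```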

Alluxio – Policy-driven Data Management
Policy defined: move data > 90 days old to S3 Standard
Policy interval: every day
Policy applied every day
[Diagram: a framework on an instance with RAM, SSD, and disk tiers; label: New Trades]
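
The policy on this slide (move data more than 90 days old, evaluated once a day) can be sketched as a simple age filter. `select_for_migration` and the sample file names are hypothetical; how the policy is actually declared and executed in Alluxio is not shown in the slides, so this only illustrates the selection rule.

```python
from datetime import datetime, timedelta


def select_for_migration(files, now, max_age_days=90):
    """Return paths whose last-modified time is more than max_age_days ago."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(path for path, modified in files.items() if modified < cutoff)


# Hypothetical catalog of files and their last-modified timestamps.
now = datetime(2019, 7, 1)
files = {
    "/trades/2019-01-15": datetime(2019, 1, 15),
    "/trades/2019-03-20": datetime(2019, 3, 20),
    "/trades/2019-06-30": datetime(2019, 6, 30),
}

# A daily scheduler would call this once per policy interval.
to_move = select_for_migration(files, now)
print(to_move)
```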

Flexible APIs to Interact with data in Alluxio
Spark
> rdd = sc.textFile("alluxio://localhost:19998/myInput")
Presto
CREATE SCHEMA hive.web
WITH (location = 'alluxio://master:port/my-table/')
POSIX
$ cat /mnt/alluxio/myInput
Java
FileSystem fs = FileSystem.Factory.get();
FileInStream in = fs.openFile(new AlluxioURI("/myInput"));

Demo: Bootstrapping Alluxio with AWS EMR
aws emr create-cluster \
  --release-label ${RELEASE_LABEL} \
  --instance-count ${NUM_INSTANCES} \
  --custom-ami-id ami-0a53794238d399ab6 \
  --instance-type ${INSTANCE_TYPE} \
  --applications Name=Presto Name=Hive Name=Spark \
  --name "${CLUSTER_NAME}" \
  --bootstrap-actions Path=${BOOTSTRAP_PATH},Args=[${ROOT_UFS_URI},-p,${ADDITIONAL_PROPERTIES},-s,","] \
  --configurations https://alluxio-public.s3.amazonaws.com/emr/2.0.1/alluxio-emr.json \
  --ec2-attributes KeyName=${KEY_PAIR}

Bootstrapping Alluxio with AWS EMR
RELEASE_LABEL="emr-5.23.0"
NUM_INSTANCES=5
INSTANCE_TYPE="m4.xlarge"
CLUSTER_NAME="emr-v12"
BOOTSTRAP_PATH="s3://dipti-alx-2019/emr/alluxio-emr.sh"
ALLUXIO_DOWNLOAD_URL="https://downloads.alluxio.io/downloads/files/2.0.0/alluxio-2.0.0-RC3-bin.tar.gz"
ROOT_UFS_URI="s3a://dipti-alx-2019/emr/ufs/"
ADDITIONAL_PROPERTIES="alluxio.underfs.s3.owner.id.to.username.mapping=${S3_ID}=hadoop;alluxio.user.file.writetype.default=ASYNC_THROUGH"
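
The ADDITIONAL_PROPERTIES value above is a list of key=value pairs joined by ";". A small helper can build that string and reject values that would collide with the delimiter. This is a sketch only: `join_properties` is a hypothetical helper, and `S3_ID_PLACEHOLDER` stands in for the `${S3_ID}` value, which the slides do not define.

```python
def join_properties(props, delimiter=";"):
    """Join key=value pairs into one delimited string for the bootstrap script."""
    for key, value in props.items():
        if delimiter in key or delimiter in str(value):
            raise ValueError(f"delimiter {delimiter!r} appears in {key}={value}")
    return delimiter.join(f"{key}={value}" for key, value in props.items())


# S3_ID_PLACEHOLDER is an assumption standing in for the undefined ${S3_ID}.
additional_properties = join_properties({
    "alluxio.underfs.s3.owner.id.to.username.mapping": "S3_ID_PLACEHOLDER=hadoop",
    "alluxio.user.file.writetype.default": "ASYNC_THROUGH",
})
print(additional_properties)
```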

Alluxio for Spark
• Data sharing between jobs
• Data resilience during application crashes
• Consolidate memory usage and alleviate GC issues
Example: Alluxio for Spark

Data Sharing Between Jobs
[Diagram: two jobs, each holding its blocks in its own in-memory storage; storage engine & execution engine in the same process]
Inter-process sharing slowed down by network I/O

Data Sharing Between Jobs
[Diagram: blocks held in memory outside the job processes, backed by HDFS / disk; storage & execution engine separated]
Inter-process sharing can happen at memory speed

Data Resilience during Crashes
[Diagram: blocks held in in-memory storage inside the job process; storage engine & execution engine in the same process]
Process crash requires network I/O to re-read the data

Alluxio – Key innovations
Data Locality with Intelligent Multi-tiering
Accelerate big data workloads with transparent tiered local data
Data Accessibility for popular APIs & API translation
Run Spark, Hive, Presto, ML workloads on your data located anywhere
Data Elasticity with a unified namespace
Abstract data silos & storage systems to independently scale data on-demand with compute

Alluxio Reference Architecture
[Diagram: applications use Alluxio clients to reach an Alluxio master (with a standby master, coordinated through Zookeeper / RAFT) and Alluxio workers backed by RAM / SSD / HDD; Under Store 1 and Under Store 2 sit across the WAN]

Data Locality with Intelligent Multi-tiering
Local performance from remote data using multi-tier storage
Hot (RAM), Warm (SSD), Cold (HDD)
Read & Write Buffering
Transparent to App
Policies for pinning, promotion/demotion, TTL

Data Accessibility via popular APIs and API Translation
Convert from Client-side Interface to native Storage Interface
Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
HDFS Driver, Swift Driver, S3 Driver, NFS Driver

Data Elasticity via Unified Namespace
Enables effective data management across different under stores
- Uses Mounting with Transparent Naming
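
Mounting with transparent naming can be sketched as a mount table that maps a path in the unified namespace to a location in whichever under store is mounted at the longest matching prefix. `MountTable`, the bucket name, and the HDFS address are hypothetical; this illustrates the idea and is not Alluxio's implementation.

```python
class MountTable:
    """Toy unified namespace: resolve a path via the longest matching mount."""

    def __init__(self):
        self.mounts = {}

    def mount(self, namespace_path, ufs_uri):
        # Normalize so "/archive/" and "/archive" are the same mount point.
        self.mounts[namespace_path.rstrip("/") or "/"] = ufs_uri.rstrip("/")

    def resolve(self, path):
        best = None
        for mount_point in self.mounts:
            prefix = mount_point if mount_point == "/" else mount_point + "/"
            if path == mount_point or path.startswith(prefix):
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount covers {path}")
        relative = path[len(best):].lstrip("/")
        ufs_uri = self.mounts[best]
        return f"{ufs_uri}/{relative}" if relative else ufs_uri


table = MountTable()
table.mount("/", "s3a://example-bucket/root")           # hypothetical bucket
table.mount("/archive", "hdfs://namenode:8020/data")    # hypothetical cluster

print(table.resolve("/trades/us"))
print(table.resolve("/archive/2018/q1"))
```

Applications keep using one path scheme while the table decides which under store actually holds the data.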

Unified Namespace: Global Data Accessibility
Transparent access to understorage makes all enterprise data available locally
SUPPORTS
• HDFS
• NFS
• OpenStack
• Ceph
• Amazon S3
• Azure
• Google Cloud
IT OPS FRIENDLY
• Storage mounted into Alluxio by central IT
• Security in Alluxio mirrors source data
• Authentication through LDAP/AD
• Wireline encryption
[Diagram: under stores HDFS #1, Object Store, NFS, and HDFS #2 mounted into one namespace]

Incredible Open Source Momentum with growing community
1000+ contributors & growing
4000+ Git Stars
Apache 2.0 Licensed
Hundreds of thousands of downloads
Join the conversation on Slack
alluxio.io/slack

Questions?
Join the Alluxio Community
http://alluxio.io/ | @alluxio
