Accelerating analytics with EMR
on your S3 data lake
Dipti Borkar | Product | dipti@alluxio.com
AWS S3: The Default Data Lake on AWS
Why build your data lake on S3?
Highly available
Simple API
Fully managed
Really large scale
Cost-effective
Integration with many different services and frameworks
Analytics with AWS EMR on S3
[Diagram: EMR instances, each running Presto and Hive on top of HDFS and EMRFS]
HDFS on AWS EMR
[Diagram: EMR instances running Presto and Hive on HDFS, alongside EMRFS; labels: Manual distcp / No sync]
Using EMRFS on AWS EMR
[Diagram: EMR instances running Presto and Hive with HDFS and EMRFS; label: No data caching]
Challenges with Analytics on S3
Big data frameworks on the public cloud
[Diagram: multiple Spark instances]
Challenges with S3
• Expensive metadata operations like list, rename
• Eventual consistency in some cases
• Performance inconsistent / throttling / limits
Spark workloads on S3 with Alluxio
Compute caching for Spark on S3
Accelerate big data frameworks on the public cloud
• Provides a data caching layer for Spark
• Provides strong consistency for metadata operations and faster performance
• Provides API compatibility with HDFS & S3
• S3 is eventually consistent, making it hard to predict query results
• Allows data outside of S3 to be analyzed as well, such as remote HDFS
[Diagram: Alluxio and Spark running in the same instance / container]
Data Orchestration for the Cloud
Client interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Under store drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
Independent scaling of compute & storage
Using Alluxio with AWS EMR
[Diagram: EMR instances running Presto and Hive over a metadata & data cache, backed by HDFS and EMRFS; labels: Compute-driven, Continuous sync]
Alluxio – Data Caching for faster compute
[Diagram: frameworks (Spark, Presto, Hive, TensorFlow) on an instance with RAM, SSD, and Disk tiers send data requests for Bucket Trades and Bucket Customers. Sequence: Read file /trades/us, Read file /trades/us again, Read file /trades/top. Label: Variable latency with throttling]
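The caching behavior on this slide can be illustrated with a toy read-through cache. This is only a sketch of the general idea, not Alluxio's implementation: the `ReadThroughCache` class, the `fetch` function, and the in-memory `store` dictionary standing in for S3 are all hypothetical names introduced here.

```python
class ReadThroughCache:
    """Toy read-through cache in front of a slow store (a stand-in for S3)."""

    def __init__(self, fetch_from_store):
        self._fetch = fetch_from_store  # called only on a cache miss
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def read(self, path):
        if path in self._cache:
            self.hits += 1
            return self._cache[path]
        # Miss: go to the backing store once, then keep a local copy.
        self.misses += 1
        data = self._fetch(path)
        self._cache[path] = data
        return data


# Hypothetical object store contents, keyed by path.
store = {"/trades/us": b"us trades", "/trades/top": b"top trades"}
store_reads = []  # records every request that actually reaches the store


def fetch(path):
    store_reads.append(path)
    return store[path]


cache = ReadThroughCache(fetch)
for path in ["/trades/us", "/trades/us", "/trades/top", "/trades/top"]:
    cache.read(path)

print(cache.hits, cache.misses, store_reads)
```

Only the first read of each path reaches the backing store; the repeated reads are served locally, which is where the variable latency and throttling of the remote store would be avoided.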
Alluxio - Intelligent Tiering for resource efficiency
[Diagram: frameworks (Spark, Presto, Hive, TensorFlow) on an instance with RAM, SSD, and Disk tiers send data requests for Bucket Trades and Bucket Customers. Read file /customers/145; RAM is out of memory, so data is moved to another tier. Label: Variable latency with throttling]
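A rough way to picture "data moved to another tier" is a chain of bounded tiers where a full tier pushes its oldest block down one level. The sketch below assumes least-recently-inserted eviction and per-tier block counts; the slide does not say which eviction policy Alluxio uses, and `TieredStore` is a hypothetical name.

```python
from collections import OrderedDict


class TieredStore:
    """Toy multi-tier store: a full tier demotes its oldest block downward."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), fastest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def put(self, key, value, level=0):
        if level >= len(self.tiers):
            # Past the last tier the block is simply dropped from the cache;
            # in this sketch the backing store is assumed to still hold it.
            return
        _, cap, blocks = self.tiers[level]
        blocks[key] = value
        blocks.move_to_end(key)
        if len(blocks) > cap:
            old_key, old_value = blocks.popitem(last=False)  # oldest entry
            self.put(old_key, old_value, level + 1)

    def tier_of(self, key):
        for name, _, blocks in self.tiers:
            if key in blocks:
                return name
        return None


tiers = TieredStore([("RAM", 1), ("SSD", 1), ("Disk", 1)])
tiers.put("/trades/us", b"...")
tiers.put("/customers/145", b"...")  # RAM is full, /trades/us moves to SSD
tiers.put("/trades/top", b"...")     # cascades: 145 to SSD, /trades/us to Disk

print(tiers.tier_of("/trades/top"), tiers.tier_of("/customers/145"), tiers.tier_of("/trades/us"))
```

The point of the design is that running out of memory demotes data to a slower local tier instead of discarding it, so a later read still avoids a trip to the remote store.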
Alluxio – Policy-driven Data Management
[Diagram: frameworks (Spark, Presto, Hive, TensorFlow) on an instance with RAM, SSD, and Disk tiers; new trades arrive]
Policy defined: Move data > 90 days old to S3 Standard
Policy interval: Every day
Policy applied every day
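The selection step of such a policy can be sketched as a plain age filter that would run once per policy interval. The function name `files_to_move`, the file listing, and the fixed `now` timestamp are illustrative assumptions; how Alluxio actually defines and evaluates policies is not shown on the slide.

```python
from datetime import datetime, timedelta


def files_to_move(files, now, max_age=timedelta(days=90)):
    """Return paths whose last-modified time is more than max_age before now."""
    return [path for path, modified in files.items() if now - modified > max_age]


# Hypothetical listing: path -> last-modified timestamp.
files = {
    "/trades/2019-01": datetime(2019, 1, 15),
    "/trades/2019-06": datetime(2019, 6, 20),
}

# One evaluation of the policy, as it would run at each daily interval.
selected = files_to_move(files, now=datetime(2019, 7, 1))
print(selected)
```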
Flexible APIs to Interact with data in Alluxio
Spark
> rdd = sc.textFile("alluxio://localhost:19998/myInput")
Presto
CREATE SCHEMA hive.web
WITH (location = 'alluxio://master:port/my-table/')
POSIX
$ cat /mnt/alluxio/myInput
Java
FileSystem fs = FileSystem.Factory.get();
FileInStream in = fs.openFile(new AlluxioURI("/myInput"));
Demo: Bootstrapping Alluxio with AWS EMR
aws emr create-cluster \
  --release-label ${RELEASE_LABEL} \
  --instance-count ${NUM_INSTANCES} \
  --custom-ami-id ami-0a53794238d399ab6 \
  --instance-type ${INSTANCE_TYPE} \
  --applications Name=Presto Name=Hive Name=Spark \
  --name "${CLUSTER_NAME}" \
  --bootstrap-actions \
  Path=${BOOTSTRAP_PATH},Args=[${ROOT_UFS_URI},-p,${ADDITIONAL_PROPERTIES},-s,","] \
  --configurations https://alluxio-public.s3.amazonaws.com/emr/2.0.1/alluxio-emr.json \
  --ec2-attributes KeyName=${KEY_PAIR}
Bootstrapping Alluxio with AWS EMR
RELEASE_LABEL="emr-5.23.0"
NUM_INSTANCES=5
INSTANCE_TYPE="m4.xlarge"
CLUSTER_NAME="emr-v12"
BOOTSTRAP_PATH="s3://dipti-alx-2019/emr/alluxio-emr.sh"
ALLUXIO_DOWNLOAD_URL="https://downloads.alluxio.io/downloads/files/2.0.0/alluxio-2.0.0-RC3-bin.tar.gz"
ROOT_UFS_URI="s3a://dipti-alx-2019/emr/ufs/"
ADDITIONAL_PROPERTIES="alluxio.underfs.s3.owner.id.to.username.mapping=${S3_ID}=hadoop;alluxio.user.file.writetype.default=ASYNC_THROUGH"
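The ADDITIONAL_PROPERTIES value above is a list of key=value pairs joined with ";". When the list grows, it may be easier to build the string from a mapping than to edit it by hand. The helper below is a hypothetical convenience, not part of Alluxio or the AWS CLI, and `EXAMPLE_S3_ID` is a placeholder for the real canonical ID.

```python
def join_properties(props, delimiter=";"):
    """Join a mapping of property names to values into one delimited string."""
    return delimiter.join(f"{key}={value}" for key, value in props.items())


# Same two properties as in the slide; EXAMPLE_S3_ID is a placeholder value.
props = {
    "alluxio.underfs.s3.owner.id.to.username.mapping": "EXAMPLE_S3_ID=hadoop",
    "alluxio.user.file.writetype.default": "ASYNC_THROUGH",
}

additional_properties = join_properties(props)
print(additional_properties)
```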
Alluxio for Spark
• Data sharing between jobs
• Data resilience during application crashes
• Consolidate memory usage and alleviate GC issues
Example: Alluxio for Spark
Data Sharing Between Jobs
[Diagram: processes with in-memory storage holding blocks 1 to 4; storage engine & execution engine in the same process]
Inter-process sharing slowed down by network I/O
Data Sharing Between Jobs
[Diagram: blocks 1 to 4 on HDFS / disk and in memory; storage & execution engine separated]
Inter-process sharing can happen at memory speed
Data Resilience during Crashes
[Diagram: in-memory storage holding blocks 1 to 4; storage engine & execution engine in the same process]
Process crash requires network I/O to re-read the data
Alluxio – Key innovations
Data Locality with Intelligent Multi-tiering
Accelerate big data workloads with transparent tiered local data
Data Accessibility for popular APIs & API translation
Run Spark, Hive, Presto, ML workloads on your data located anywhere
Data Elasticity with a unified namespace
Abstract data silos & storage systems to independently scale data on-demand with compute
Alluxio Reference Architecture
[Diagram: applications with Alluxio clients; an Alluxio master and a standby master (Zookeeper / RAFT); Alluxio workers with RAM / SSD / HDD; Under Store 1 and Under Store 2 reached over the WAN]
Data Locality with Intelligent Multi-tiering
Local performance from remote data using multi-tier storage
Hot: RAM
Warm: SSD
Cold: HDD
Read & Write Buffering
Transparent to App
Policies for pinning, promotion/demotion, TTL
Data Accessibility via popular APIs and API Translation
Convert from Client-side Interface to native Storage Interface
Client interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Under store drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
Data Elasticity via Unified Namespace
Enables effective data management across different Under Stores
- Uses Mounting with Transparent Naming
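Mounting with transparent naming can be pictured as a table that maps paths in one namespace onto URIs in different under stores, with the longest matching mount point winning. The sketch below is a simplified model under that assumption; `MountTable`, the bucket name, and the HDFS address are hypothetical and this is not Alluxio's actual mount logic.

```python
class MountTable:
    """Toy unified namespace: resolve a path to an under-store URI."""

    def __init__(self):
        self._mounts = {}

    def mount(self, mount_point, ufs_uri):
        # Normalize so "/hdfs1/" and "/hdfs1" are the same mount point.
        self._mounts[mount_point.rstrip("/") or "/"] = ufs_uri.rstrip("/")

    def resolve(self, path):
        best = None
        for mount_point in self._mounts:
            prefix = "" if mount_point == "/" else mount_point
            if path == mount_point or path.startswith(prefix + "/"):
                # Prefer the most specific (longest) mount point.
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(path)
        prefix_len = 0 if best == "/" else len(best)
        return self._mounts[best] + path[prefix_len:]


# Hypothetical under stores mounted into one namespace.
table = MountTable()
table.mount("/", "s3a://example-bucket/ufs/")
table.mount("/hdfs1", "hdfs://namenode1:8020/data")

print(table.resolve("/trades/us"))
print(table.resolve("/hdfs1/logs/a"))
```

Applications only ever see paths such as /trades/us or /hdfs1/logs/a; which storage system actually holds the data is decided by the mount table.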
Unified Namespace: Global Data Accessibility
Transparent access to under storage makes all enterprise data available locally
SUPPORTS
• HDFS
• NFS
• OpenStack
• Ceph
• Amazon S3
• Azure
• Google Cloud
IT OPS FRIENDLY
• Storage mounted into Alluxio by central IT
• Security in Alluxio mirrors source data
• Authentication through LDAP/AD
• Wireline encryption
[Diagram: mounted under stores HDFS #1, Object Store, NFS, HDFS #2]
Incredible Open Source Momentum with growing community
1000+ contributors & growing
4000+ Git Stars
Apache 2.0 Licensed
Hundreds of thousands of downloads
Join the conversation on Slack
alluxio.io/slack
Questions?
Join the Alluxio Community
http://alluxio.io/ | @alluxio
