Amazon Elastic MapReduce (EMR) is one of the largest Hadoop operators in the world. Since its launch five years ago, our customers have launched more than 15 million Hadoop clusters on EMR. In this webinar, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long- and short-lived clusters, and other Amazon EMR architectural patterns. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost efficient.
6. EMRFS and HDFS for data management, with analytics languages on Amazon EMR; Amazon S3 and Amazon DynamoDB alongside. (Architecture diagram)
7. The same architecture with Amazon RDS added. (Architecture diagram)
8. The same architecture with Amazon Redshift and AWS Data Pipeline added. (Architecture diagram)
9. Amazon EMR Introduction
Launch clusters of any size in a matter of minutes
Use a variety of instance sizes that match your workload
Don’t get stuck with hardware
Don’t deal with capacity planning
Run multiple clusters with different sizes, specs and node types
11. Elastic MapReduce & Amazon S3
EMR has an optimised driver for Amazon S3
64 MB range-offset reads to increase performance
EMR Consistent View further increases performance and addresses consistency
S3 cost: $0.03/GB, with volume-based price tiering
12. Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Observations from AWS
15. Pattern #1: Transient Clusters
Cluster lives for the duration of the job
Shut down the cluster when the job is done
Data persists on Amazon S3
Input & output data on Amazon S3
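As a rough sketch of launching a transient cluster (the cluster name, AMI version, bucket, jar path and step arguments below are placeholders, and you may need additional options such as roles or a key pair), the AWS CLI can create a cluster that terminates itself once its steps finish:
aws emr create-cluster --name "nightly-etl" --ami-version 3.3.1 \
  --instance-type m1.xlarge --instance-count 5 --auto-terminate \
  --steps Type=CUSTOM_JAR,Name=etl,Jar=s3://mybucket/jobs/etl.jar,Args=[s3://mybucket/input,s3://mybucket/output]
The --auto-terminate flag is what makes the cluster transient: input and output live on Amazon S3, so nothing is lost when the cluster shuts down.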
16. Benefits of Transient Clusters
1. Control your cost
2. Minimum maintenance
• Cluster goes away when job is done
3. Practice cloud architecture
• Pay for what you use
• Data processing as a workflow
17. Alive Clusters
Very similar to traditional Hadoop deployments
Cluster stays around after the job is done
Data persistence models: Amazon S3; Amazon S3 copied to HDFS; HDFS with Amazon S3 as backup
18. Alive Clusters
Always keep data safe on Amazon S3 even if you’re using HDFS for primary storage
Get in the habit of shutting down your cluster and starting a new one, once a week or month
Design your data processing workflow to account for failure
You can use workflow management tools such as AWS Data Pipeline
20. Core Nodes
Run TaskTrackers (compute)
Run DataNode (HDFS)
(Diagram: master and core instance groups with HDFS, within an Amazon EMR cluster)
21. Core Nodes
Can add core nodes
More HDFS space
More CPU/memory
(Diagram: additional core nodes with HDFS join the core instance group)
22. Core Nodes
Can’t remove core nodes because of HDFS
(Diagram: master and core instance groups within the Amazon EMR cluster)
23. Amazon EMR Task Nodes
Run TaskTrackers
No HDFS
Reads from core node HDFS
(Diagram: a task instance group alongside the master and core instance groups within the Amazon EMR cluster)
24. Amazon EMR Task Nodes
Can add task nodes
(Diagram: additional nodes join the task instance group)
25. Amazon EMR Task Nodes
More CPU power
More memory
(Diagram: the enlarged task instance group within the Amazon EMR cluster)
26. Amazon EMR Task Nodes
You can remove task nodes when processing is completed
(Diagram: task nodes leave the task instance group)
27. Amazon EMR Task Nodes
(Diagram continued: the cluster after the task nodes are removed)
28. Task Node Use-Cases
Speed up job processing using the Spot market
Run task nodes on the Spot market
Get a discount on the hourly price
Nodes can come and go without interruption to your cluster
When you need extra horsepower for a short amount of time
Example: need to pull a large amount of data from Amazon S3
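For illustration only (the job flow ID, instance type, count and bid price are placeholders, and the flag names should be verified against the CLI version you use), task nodes can be added to a running cluster as a Spot instance group with the classic CLI used elsewhere in this deck:
./elastic-mapreduce --jobflow j-XXXXXXXXXXXXX \
  --add-instance-group task --instance-type m1.xlarge \
  --instance-count 10 --bid-price 0.10
The AWS CLI offers the equivalent aws emr add-instance-groups command. Because these are task nodes, losing them to the Spot market does not put HDFS data at risk.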
30. Option 1: Amazon S3 as HDFS
Use Amazon S3 as your permanent data store
HDFS for temporary storage of data between jobs
No additional step to copy data to HDFS
(Diagram: task and core instance groups in the Amazon EMR cluster reading from and writing to Amazon S3)
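As a minimal sketch (the bucket names and examples jar path are hypothetical), running a job directly against S3 is just a matter of using s3:// URIs for input and output; no copy step to HDFS is involved:
hadoop jar /home/hadoop/hadoop-examples.jar wordcount \
  s3://mybucket/input/ s3://mybucket/wordcount-output/
Hadoop streams the input from S3 to the mappers, using HDFS only as temporary space for intermediate data.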
31. Benefits: Amazon S3 as HDFS
Ability to shut down your cluster
HUGE Benefit!!
Use Amazon S3 as your durable storage
11 9s of durability
32. Benefits: Amazon S3 as HDFS
No need to scale HDFS
Capacity
Replication for durability
Amazon S3 scales with your data
Both in IOPS and data storage
33. Benefits: Amazon S3 as HDFS
Ability to share data between multiple clusters
Hard to do with HDFS
(Diagram: multiple EMR clusters sharing data through the same Amazon S3 bucket)
34. Benefits: Amazon S3 as HDFS
Take advantage of Amazon S3 features
Amazon S3 Server Side Encryption
Amazon S3 Lifecycle Policies
Amazon S3 versioning to protect against corruption
Build elastic clusters
Add nodes to read from Amazon S3
Remove nodes with data safe on Amazon S3
35. EMR Consistent View
Provides a ‘consistent view’ of data on S3 within a cluster
Ensures that all files created by a step are available to subsequent steps
Index of data from S3, managed by DynamoDB
Configurable retry and metastore
New Hadoop config file: emrfs-site.xml
fs.s3.consistent* system properties
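A rough sketch of the kind of settings involved, assuming the property names are verified against the EMR documentation for your AMI version (the table name shown is just a commonly used default):
emrfs-site.xml:
  fs.s3.consistent = true
  fs.s3.consistent.retryCount = 5
  fs.s3.consistent.retryPeriodSeconds = 10
  fs.s3.consistent.metadata.tableName = EmrFSMetadata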
37. EMR Consistent View
Manage data in EMRFS using the emrfs client:
emrfs
describe-metadata, set-metadata-capacity, delete-metadata, create-metadata, list-metadata-stores - work with metadata stores
diff - show what in a bucket is missing from the index
delete - remove index entries
sync - ensure that the index is in sync with a bucket
import - import bucket items into the index
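For example (the bucket and prefix are placeholders), a typical check-and-repair sequence from the master node looks like:
emrfs diff s3://mybucket/input/        # show where the metadata and the bucket disagree
emrfs sync s3://mybucket/input/        # bring the metadata in line with the bucket contents
emrfs describe-metadata                # inspect the backing DynamoDB table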
38. What About Data Locality?
Run your job in the same region as your Amazon S3 bucket
Amazon EMR nodes have high-speed connectivity to Amazon S3
If your job is CPU/memory-bound, locality doesn’t make a huge difference
39. Amazon S3 provides near-linear scalability
(Chart: S3 streaming performance, GB/second vs. reader connections)
100 VMs: 9.6 GB/s, $26/hr
350 VMs: 28.7 GB/s, $90/hr
34 secs per terabyte
40. When HDFS is a Better Choice…
Iterative workloads
If you’re processing the same dataset more than once
Disk I/O intensive workloads
42. Option 2: Optimise for Latency with HDFS
2. Launch Amazon EMR and copy data to HDFS with S3DistCp
43. Option 2: Optimise for Latency with HDFS
3. Start processing data on HDFS
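A sketch of that copy step, in the same style as the S3DistCp examples later in this deck (the job flow ID and bucket are placeholders):
./elastic-mapreduce --jobflow j-XXXXXXXXXXXXX --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://mybucket/input,--dest,hdfs:///input'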
44. Benefits: HDFS instead of S3
Better pattern for I/O-intensive workloads
Amazon S3 as system of record
Durability
Scalability
Cost
Features: lifecycle policy, security
45. Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Observations from AWS
46. Amazon EMR Nodes and Size
Use m1.small instances for functional testing
Use xlarge or larger nodes for production workloads
Use CC2/C3 for memory- and CPU-intensive jobs
HS1, HI1, I2 instances for HDFS workloads
Prefer a smaller cluster of larger nodes
48. Instance Resource Allocation
• Hadoop 1 - static number of mappers/reducers configured for the cluster nodes
• Hadoop 2 - variable number of Hadoop applications based on file splits and available memory
• Useful to understand old vs. new sizing
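To make the contrast concrete, these are the kinds of properties involved (the values are illustrative, not recommendations): Hadoop 1 fixes slot counts per node, while Hadoop 2/YARN derives the number of concurrent containers from memory settings.
Hadoop 1 (mapred-site.xml):
  mapred.tasktracker.map.tasks.maximum    = 8
  mapred.tasktracker.reduce.tasks.maximum = 3
Hadoop 2 (yarn-site.xml / mapred-site.xml):
  yarn.nodemanager.resource.memory-mb = 12288
  mapreduce.map.memory.mb             = 1536
  mapreduce.reduce.memory.mb          = 3072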
50. Cluster Sizing Calculation
1. Estimate the number of tasks your job requires.
2. Pick an instance and note down the number of tasks it can run in parallel.
3. Pick some sample data files to run a test workload; the number of sample files should be the same as the number from step #2.
4. Run an Amazon EMR cluster with a single core node and process your sample files from step #3. Note down the amount of time taken to process your sample files.
51. Cluster Sizing Calculation
Estimated number of nodes =
(Total tasks * Time to process sample files) / (Instance task capacity * Desired processing time)
52. Example: Cluster Sizing Calculation
1. Estimate the number of tasks your job requires
150
2. Pick an instance and note down the number of tasks it can run in parallel
m1.xlarge, with a task capacity of 8 per instance
53. Example: Cluster Sizing Calculation
3. Pick some sample data files to run a test workload; the number of sample files should be the same as the number from step #2.
8 files selected for our sample test
54. Example: Cluster Sizing Calculation
4. Run an Amazon EMR cluster with a single core node and process your sample files from step #3. Note down the amount of time taken to process your sample files.
3 min to process 8 files
55. Cluster Sizing Calculation
Estimated number of nodes =
(Total tasks for your job * Time to process sample files) / (Per-instance task capacity * Desired processing time)
= (150 * 3 min) / (8 * 5 min) = 450 / 40 ≈ 11 m1.xlarge
56. File Best Practices
Avoid small files at all costs (smaller than 100 MB)
Use compression
58. Dealing with Small Files
Use S3DistCp to combine smaller files together
S3DistCp takes a pattern and a target file size and combines smaller input files into larger ones
./elastic-mapreduce --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://myawsbucket/cf,--dest,hdfs:///local,--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,--targetSize,128'
59. Compression
Always compress data files on Amazon S3
Reduces bandwidth between Amazon S3 and Amazon EMR
Speeds up your job
Compress task output
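As an illustration (the codec choices are just examples), final output and intermediate map output compression can be switched on with the standard Hadoop 2 properties; Hadoop 1 uses the older mapred.output.compress* and mapred.compress.map.output names:
  mapreduce.output.fileoutputformat.compress       = true
  mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.GzipCodec
  mapreduce.map.output.compress                    = true
  mapreduce.map.output.compress.codec              = org.apache.hadoop.io.compress.SnappyCodec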
60. Compression
Compression types:
Some are fast but offer less space reduction
Some are space-efficient but slower
Some are splittable and some are not

Algorithm | % Space Remaining | Encoding Speed | Decoding Speed
GZIP      | 13%               | 21 MB/s        | 118 MB/s
LZO       | 20%               | 135 MB/s       | 410 MB/s
Snappy    | 22%               | 172 MB/s       | 409 MB/s
61. Changing Compression Type
You may decide to change compression type
Use S3DistCp to change the compression type of your files
Example:
./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://myawsbucket/cf,--dest,hdfs:///local,--outputCodec,lzo'
62. Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Observations from AWS
63. M1/C1 Instance Families
Heavily used by EMR customers
However, HDFS utilisation is typically very low
M3/C3 offer better performance per dollar
66. Orc vs Parquet
File formats designed for SQL/data warehousing on Hadoop
Columnar file formats
Compress well
High row count, low cardinality
67. ORC File Format
Optimised Row Columnar format
Zlib or Snappy external compression
250 MB stripe of 1 column and index
Run-length or dictionary encoding
1 output file per container task
68. Parquet File Format
Gzip or Snappy external compression
Array data structures
Limited data type support for Hive
Batch creation
1 GB files
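As a small sketch (the table and column names are made up, and format availability depends on the Hive version on your AMI), creating the same table in both formats from the Hive CLI looks like:
hive -e "CREATE TABLE logs_orc (user_id STRING, ts BIGINT) STORED AS ORC;"
hive -e "CREATE TABLE logs_parquet (user_id STRING, ts BIGINT) STORED AS PARQUET;"
Which one compresses and queries better for your data is workload-dependent, hence the advice on the next slide: test, test, test.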
69. Orc vs Parquet
Depends on the Tool you are using
Consider Future Architecture & Requirements
Test Test Test
70. In Summary
• Practice cloud architecture with transient clusters
• Use S3 as the system of record for durability
• Use task nodes on Spot for increased performance and lower cost
• Move to new instance families for better performance per dollar
• Exciting developments around columnar file formats
Editor's Notes
EMR is a managed Hadoop offering that takes the burden of deploying and maintaining Hadoop clusters away from developers. EMR uses the Apache Hadoop MapReduce engine and integrates with a variety of different tools.
Transient clusters are the type of clusters that are only around for the duration of the job. Once the job is done, the cluster shuts down. This model of running Hadoop clusters is very different from traditional Hadoop deployments. With a traditional Hadoop deployment, the cluster stays up and running regardless of whether there are any jobs for it to process, mostly because the Hadoop cluster also hosts HDFS storage. With HDFS storage, you don’t have the luxury of shutting down the cluster. With transient EMR clusters, your data persists on S3, meaning that you don’t use HDFS to store your data. Once data is safe and secure on S3, you have the ability to shut down the cluster after your job is done, knowing that you won’t lose data.
There are many reasons you want to use EMR transient clusters.
Cost: Shutting down resources you don’t need is the best path towards optimizing your workload for cost efficiency. Don’t pay for what you’re not using. At AWS that’s all we talk about.
2) No maintenance. Obviously, if you’re shutting down the cluster, then you’re only maintaining the cluster for the duration of your job. This helps tremendously in reducing the headache of maintaining long-running clusters.
3) Practice cloud architecture best practices. It’ll also make you a better cloud practitioner. Again, you’re getting in the habit of paying for what you’re using, which is great. You’ll also get into the habit of thinking of your data processing as a workflow where resources come and go as needed.
I hear this a lot: EMR is only good for short-lived/transient clusters, so do I need to run my own Hadoop on EC2 if I need long-running clusters? That’s not true at all. EMR can be designed for longer-running clusters. You would still want to keep your data on S3 for durability; or you can copy data from S3 to HDFS first; or you can use HDFS as your primary storage and use S3 as the backup data store.
Notice that we don’t remove S3 from our design. It’s important to keep your data safe on S3 just in case you experience cluster failures. In fact, I want you to always plan for failure. Just because you’re running a long-running cluster doesn’t mean you won’t see failures. So architecting your data processing workflow to be able to deal with cluster failures is super important. One way to do that is to use a workflow management tool such as AWS Data Pipeline.
EMR has three node types: one master, which runs the NameNode and JobTracker, plus core nodes and task nodes, which are two different types of slave nodes. Let’s review the two different slave node types.
We’ll start with core nodes. Core nodes run TaskTracker and DataNode. Core nodes are very similar to traditional Hadoop slave nodes. They can process data with mappers and reducers and can also store data with HDFS via the DataNode.
However, once you add core nodes to your cluster, you can’t remove them later. That’s the only caveat with core nodes. And there’s a good reason for that: because core nodes hold HDFS data, removing nodes from the cluster can potentially cause data loss.
Task nodes are a bit different from what you usually see in traditional Hadoop deployments. Task nodes run the TaskTracker only. They don’t run the DataNode, which means no HDFS data is stored on task nodes.
Similar to core nodes, you can increase/expand your cluster’s TaskNode capacity by adding more nodes.
example
But unlike core nodes, TaskNodes can be removed from the cluster. And you can probably guess why. Because TaskNodes don’t hold any HDFS data. So you’re free to add/remove them at any given time.
example
Use Tasknodes to speed up your data processing using Spot market. Tasknodes are a great use-case for spot instances. Remember that Tasknodes can be added/removed easily. That ability gives you the peace of mind to use Tasknodes for spot market. And if at some point your spot instance gets taken away because the price went up too much, your cluster can withstand losing nodes.
In the next few slides, we’ll talk about data persistence models with EMR. The first pattern is S3 as HDFS. With this data persistence model, data gets stored on S3. HDFS does not play any role in storing data; as a matter of fact, HDFS is only there for temporary storage. Another common thing I hear is that storing data on S3 instead of HDFS slows my job down a lot because data has to get copied to HDFS/disk first before processing starts. That’s incorrect. If you tell Hadoop that your data is on S3, Hadoop reads directly from S3 and streams data to the mappers without touching the disk. To be completely correct, data does touch HDFS when it has to shuffle from mappers to reducers, but as I mentioned, HDFS acts as the temp space and nothing more.
One of the biggest and most important benefits of using S3 instead of HDFS is the fact that you can shut down your cluster when your job is done, knowing that your data is safe on S3. That’s a huge plus! Can you imagine doing that with traditional Hadoop deployments? I haven’t come across anyone who could easily do that.
The other important benefit of S3 instead of HDFS is avoiding the HDFS capacity game. You’re dealing with big data problems, which means you have a ton of data coming in. The last thing you want to do is play the guessing game of how much HDFS space you need. With S3 you don’t have to play that game; S3 scales with your data both in terms of I/O and storage space. With HDFS you have to provision 3x the space you want to account for durability. With S3, the space you pay for includes replication and everything else behind the scenes to make your data durable.
Imagine you’re hosting data on HDFS and another team in your company asks to get access to your data. What would you do?
You can either copy data from HDFS to some other storage. Now if you’re dealing with a large amount of data, say 400 TB or 1 PB, this is going to be a nightmare.
Or you can give them access to your cluster to run job. But man it would suck if they run a large job and take over the entire cluster.
With data on S3, you can share data between multiple jobs in parallel without sacrificing storage or cluster resources. S3 can scale with as many jobs as you want it to.
And everything else comes free with S3: features such as SSE, lifecycle policies, etc. And again, keep in mind that S3 as the storage is the main reason why we can build elastic clusters where nodes get added and removed dynamically without any data loss.
With this pattern, you still store your data on S3 and use S3 as your primary storage, but for processing, data gets copied to HDFS first. Copying data to HDFS can be done with the S3DistCp tool provided by the EMR team. S3DistCp is very similar to the DistCp tool that comes with Hadoop for distributed copy jobs, i.e. copying data between clusters. However, S3DistCp was written with S3 in mind, meaning that it can perform much better than DistCp.
Use-case for this slide
Use-case for this slide
The benefit of this pattern is getting better I/O if we’re dealing with I/O-intensive workloads. Or, as mentioned previously, if data needs to be processed multiple times, copying data to HDFS first is a more optimized approach. And because we’re not using HDFS as the primary storage, we can still take advantage of S3 features.
Do not use smaller nodes for production workloads unless you’re 100% sure you know what you’re doing. The majority of jobs I’ve seen require more CPU and memory than the smaller instances have to offer, and most of the time that causes job failures if the cluster is not fine-tuned. Instead of spending time fine-tuning small nodes, get a larger node and run your workload with peace of mind. Anything m1.xlarge and larger is a good candidate: m1.xlarge, c1.xlarge, m2.4xlarge and all cluster compute instances are good choices.
This is my fav question: given this much data, how many nodes do I need?
Avoid small files at all costs. Small files can cause a lot of issues. I call them the termites of big data.
The reason small files are trouble is that each file, as discussed previously, eventually becomes a mapper, and each mapper is a Java JVM. Smaller files cause a Java JVM to get spawned up, but as soon as the mapper is up and running, the content of the file is processed in a short amount of time and the mapper goes away. That’s a waste of CPU and memory. Ideally you’d like your mapper to do as much processing as possible before shutting down.