"Amazon EMR provides a managed framework that makes it easy, cost-effective, and secure to run data processing frameworks such as Apache Hadoop, Apache Spark, and Presto on AWS. In this session, you learn the key design principles behind running these frameworks on the cloud and the feature set that Amazon EMR offers. We discuss the benefits of decoupling compute and storage, and strategies to take advantage of the scale and parallelism that the cloud offers while lowering costs. Additionally, you hear from AOL's Senior Software Engineer on how they used these strategies to migrate their Hadoop workloads to the AWS cloud, and lessons learned along the way.
In this session, you learn the benefits of decoupling storage and compute and allowing them to scale independently; how to run Hadoop, Spark, Presto, and other supported Hadoop applications on Amazon EMR; how to use Amazon S3 as a persistent data store and process data directly from Amazon S3; deployment strategies and how to avoid common mistakes when deploying at scale; and how to use Spot Instances to scale your transient infrastructure effectively."
2. What to Expect from the Session
• Technical introduction to Amazon EMR
• Basic tenets
• Amazon EMR feature set
• Real-life experience of moving a 2-PB, on-premises
Hadoop cluster to the AWS cloud
• Not a technical introduction to Apache Spark, Apache
Hadoop, or other frameworks
3. Amazon EMR
• Managed platform
• MapReduce, Apache Spark, Presto
• Launch a cluster in minutes
• Open source distribution and MapR
distribution
• Leverage the elasticity of the cloud
• Baked in security features
• Pay by the hour and save with Spot
• Flexibility to customize
4. Make it easy, secure, and cost-effective to run data-processing frameworks on the AWS cloud
5. What Do I Need to Build a Cluster?
1. Choose instances
2. Choose your software
3. Choose your access method
6. An Example EMR Cluster
• Master node – r3.2xlarge: runs the NameNode (HDFS) and ResourceManager (YARN)
• Slave group (Core) – c3.2xlarge: runs HDFS (DataNode) and YARN (NodeManager)
• Slave group (Task) – m3.xlarge: runs YARN (NodeManager)
• Slave group (Task) – m3.2xlarge (EC2 Spot): runs YARN (NodeManager)
7. Choice of Multiple Instances
• CPU – c3 family, cc1.4xlarge, cc2.8xlarge: machine learning
• Memory – m2 family, r3 family: in-memory (Spark & Presto)
• Disk/IO – d2 family, i2 family: large HDFS
• General – m1 family, m3 family: batch processing
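As a sketch, the workload-to-family mapping on this slide can be captured in a small lookup. The dictionary keys are made-up labels for the slide's four workload types, and the families are the 2015-era ones shown here; check current EC2 generations before relying on them:

```python
# Hypothetical helper mapping the slide's workload types to the instance
# families it suggests for them (2015-era families; illustrative only).
WORKLOAD_TO_FAMILIES = {
    "machine_learning": ["c3", "cc1.4xlarge", "cc2.8xlarge"],  # CPU-bound
    "in_memory": ["m2", "r3"],          # Spark & Presto working sets in RAM
    "large_hdfs": ["d2", "i2"],         # disk/IO heavy
    "batch_processing": ["m1", "m3"],   # general purpose
}

def suggest_families(workload: str) -> list[str]:
    """Return the instance families suggested for a workload type."""
    try:
        return WORKLOAD_TO_FAMILIES[workload]
    except KeyError:
        raise ValueError(f"unknown workload: {workload!r}")
```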
15. You Are Up and Running!
Information about the software you are
running, logs and features
16. You Are Up and Running!
Infrastructure for this cluster
17. You Are Up and Running!
Security Groups and Roles
18. Use the CLI
aws emr create-cluster \
  --release-label emr-4.0.0 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge
Or use your favorite SDK
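For example, the same cluster can be described through the AWS SDK for Python (boto3). This is a minimal sketch: the cluster name is a placeholder, the default EMR service roles are assumed to exist in the account, and actually calling run_job_flow requires credentials and incurs charges.

```python
def build_run_job_flow_request(name="my-cluster"):
    """Request body mirroring the CLI example: 1 master + 2 core nodes."""
    return {
        "Name": name,  # placeholder name
        "ReleaseLabel": "emr-4.0.0",
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceCount": 1,
                 "InstanceType": "m3.xlarge", "Market": "ON_DEMAND"},
                {"InstanceRole": "CORE", "InstanceCount": 2,
                 "InstanceType": "m3.xlarge", "Market": "ON_DEMAND"},
            ],
            # keep the cluster alive after any submitted steps finish
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # assumed default roles; create them first if they don't exist
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

def create_cluster(emr_client, **kwargs):
    # e.g. emr_client = boto3.client("emr"); needs credentials and costs money
    return emr_client.run_job_flow(**build_run_job_flow_request(**kwargs))
```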
34. Amazon S3 is Your Persistent Data Store
11 9’s of durability
$0.03 / GB / month in US-East
Lifecycle policies
Versioning
Distributed by default
(Diagram: Amazon EMR ↔ EMRFS ↔ Amazon S3)
35. The Amazon EMR File System (EMRFS)
• Allows you to leverage Amazon S3 as a file-system
• Streams data directly from Amazon S3
• Uses HDFS for intermediates
• Better read/write performance and error handling than
open source components
• Consistent view – consistency for read after write
• Support for encryption
• Fast listing of objects
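Consistent view is switched on through an emrfs-site configuration classification. A hedged sketch of the JSON one might pass to create-cluster via --configurations; the retry values here are illustrative, not recommendations:

```python
import json

# EMR 4.x "configuration classification" entry enabling EMRFS consistent
# view; retryCount/retryPeriodSeconds values below are illustrative.
EMRFS_CONSISTENT_VIEW = {
    "Classification": "emrfs-site",
    "Properties": {
        "fs.s3.consistent": "true",
        "fs.s3.consistent.retryCount": "5",
        "fs.s3.consistent.retryPeriodSeconds": "10",
    },
}

# The --configurations flag takes a JSON list of such entries:
print(json.dumps([EMRFS_CONSISTENT_VIEW], indent=2))
```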
36. Going from HDFS to Amazon S3
CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 'samples/pig-apache/input/'
37. Going from HDFS to Amazon S3
CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 's3://elasticmapreduce.samples/pig-apache/input/'
41. Or You Can Use AWS Data Pipeline
Input data → use Amazon EMR to transform unstructured data to structured → push to Amazon S3 → ingest into Amazon Redshift
51. Spot Integration with Amazon EMR
• Can provision instances from the Spot market
• Replaces a Spot Instance in case of interruption
• Impact of interruption
• Master node – Can lose the cluster
• Core node – Can lose intermediate data
• Task nodes – Jobs will restart on other nodes (application
dependent)
52. Scale up with Spot Instances
10-node cluster running for 14 hours
Cost = 1.0 * 10 * 14 = $140
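The arithmetic above can be sketched as a small cost function. The $0.50 Spot price and the assumption that doubling the node count halves the runtime are illustrative only; actual Spot prices and job scaling vary:

```python
# Back-of-the-envelope cluster cost, billed per node-hour.
def cluster_cost(on_demand_nodes, spot_nodes, hours,
                 on_demand_price=1.00, spot_price=0.50):
    """Total dollars for a cluster of on-demand + Spot nodes."""
    return hours * (on_demand_nodes * on_demand_price +
                    spot_nodes * spot_price)

# Baseline from the slide: 10 on-demand nodes for 14 hours.
baseline = cluster_cost(10, 0, 14)   # -> 140.0

# Hypothetical scale-up: add 10 Spot nodes and (assuming linear
# scaling) finish in half the time.
scaled = cluster_cost(10, 10, 7)     # -> 105.0

print(baseline, scaled)  # prints: 140.0 105.0
```

Under these assumptions the job finishes in half the time and costs less, which is the core argument for Spot task nodes.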
61. Benefit 3: Logical Separation of Jobs
Run separate clusters for separate workloads – a production cluster for Hive, Pig, and Cascading jobs, and an ad hoc Presto cluster – with both reading the same data from Amazon S3
62. Benefit 4: Disaster Recovery Built In
Clusters 1 and 2 run in one Availability Zone and clusters 3 and 4 in another, all sharing the same persistent data in Amazon S3
63. Amazon S3 as a Data Lake
Nate Sammons, Principal Architect – NASDAQ
Reference – AWS Big Data Blog
64. Recap
Rapid provisioning of clusters
Hadoop, Spark, Presto, and other applications
Standard open-source packaging
Decouple storage and compute and scale them independently
Resize clusters to manage demand
Save costs with Spot instances
65. How AOL Inc. moved a 2 PB Hadoop
cluster to the AWS cloud
Gaurav Agrawal
Senior Software Engineer, AOL Inc.
AWS Certified Associate Solutions Architect
71. Web Console and CLI
Web Console for Training
Setup IAM for users
AWS Services Options
S3 Data upload
EMR Creation & Steps
Try & Test multiple approaches
CLI is your friend!
73. Copy Existing Data to S3
• Environment-level buckets – Dev, QA, Production, Analyst:
bucket-dev, bucket-qa, bucket-prod, bucket-analyst
• Project-level buckets – Code, Data, Log, Extract, and Control:
bucket-prod-code, bucket-prod-data, bucket-prod-log, bucket-prod-extract, bucket-prod-control
• Compressed Snappy data to GZIP – multi-platform support, best compression, lowest storage cost, low cost for data out
• Result: 76% less storage, 70K saving/year
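As a toy illustration of the storage-savings idea, gzip in Python; the 76% figure on the slide depends entirely on the actual data, and repetitive log-like text compresses especially well:

```python
import gzip

def pct_saved(raw: bytes) -> float:
    """Percent of storage saved by gzip-compressing the given bytes."""
    compressed = gzip.compress(raw)
    return 100.0 * (1 - len(compressed) / len(raw))

# Made-up, highly repetitive access-log sample.
log_lines = b"2015-10-08 GET /index.html 200\n" * 10_000
print(f"{pct_saved(log_lines):.0f}% less storage")
```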
75. EMR Design Options
• Transient vs. persistent cluster
• Amazon S3 vs. local HDFS
• Elastic vs. static cluster
• On-Demand vs. Reserved vs. Spot
• Core nodes vs. task nodes
84. Elasticity
Why be Elastic?
True Cloud Architecture
Spot is an Open Market
Scale Horizontally
Our limit: 3,000 EC2 instances per Region
Multiple Regions
Multiple Instance Types
85. Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission – CLI
• Monitoring
• Elasticity
• Cost Management & BCDR
86. Cost Management & BCDR
Multi Region Deployment
Best AZ for pricing
Design for failure
Global. BC-DR.
87. Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission – CLI
• Monitoring
• Elasticity
• Cost Management & BCDR
• Optimization
88. Optimization
• Data management – partition data on S3; S3 versioning/lifecycle
• How many nodes? – based on data volume and the complete hour of pricing
• Hadoop run-time params – memory tuning; compress map & reduce output; combine-splits input format
• Security
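The "partition data on S3" point can be sketched as a date-partitioned key layout, so queries only scan the partitions they need. The bucket and table names below are made up:

```python
from datetime import date, timedelta

def partition_key(bucket: str, table: str, day: date) -> str:
    """Build a Hive-style date-partitioned S3 prefix for one day of data."""
    return (f"s3://{bucket}/{table}/"
            f"year={day.year}/month={day.month:02d}/day={day.day:02d}/")

# Example: three consecutive daily partitions (names are hypothetical).
start = date(2015, 10, 1)
for i in range(3):
    print(partition_key("bucket-prod-data", "clickstream",
                        start + timedelta(days=i)))
```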
89. Score Card
Feature AWS
Pay for what you use ✔
Decouple Storage and Compute ✔
True Cloud Architecture ✔
Self Service Model ✔
Elastic & Scalable ✔
Global Infrastructure. BCDR. ✔
Quick & Easy Deployments ✔
Redshift External Tables on S3 ?
More languages for Lambda ?
90. AWS vs. In-House Cost
Cost comparison chart (source: AOL & AWS billing tool): in-house ≈ 4x per month vs. AWS 1x per month
** In-house cluster includes storage, power, and network cost.
91. AWS vs. In-House Cost
1/4th the cost of in-house Hadoop infrastructure
Data Platforms, AOL Inc. – 10/8/2015
93. Best Practices & Suggestions
• Tag all resources
• Infrastructure as code; command line interface
• JSON as configuration files
• IAM roles and policies
• Use of application ID
• Enable CloudTrail
• S3 lifecycle management; S3 versioning
• Separate code/data/logs buckets
• Keyless EMR clusters
• Hybrid model
• Enable debugging
• Create multiple CLI profiles
• Multi-factor authentication
• CloudWatch billing alarms
• Spot EC2 instances
• SNS notifications for failures
• Loosely coupled apps
• Scale horizontally