- Two companies faced challenges processing big data on-premises, including high fixed costs, slow deployment, lack of scalability, and outages impacting production.
- Amazon Elastic MapReduce (EMR) provides a managed Hadoop service that allows companies to launch clusters within minutes in the AWS cloud at lower costs by using elastic and scalable infrastructure.
- AOL moved their 2PB on-premises Hadoop cluster to EMR, reducing costs by 4x while gaining automatic scaling and high availability across availability zones. EMR addressed their challenges and allowed faster restatement of historical data.
Want to get ramped up on how to use Amazon's big data web services and launch your first big data application on AWS? Join us on our journey as we discuss reference architecture, design patterns, and best practices for assembling technologies to meet your big data challenges. We will also build a big data application in real-time using Amazon EMR, Amazon Redshift, Amazon Kinesis, Amazon DynamoDB, and Amazon S3.
Announcing Amazon Lightsail - January 2017 AWS Online Tech Talks | Amazon Web Services
Sometimes you just need to spin up a virtual server, install your LAMP stack or web app, and go. No complex configurations - just a few clicks and a simple, low price. Amazon Lightsail is the easiest way to launch and manage a virtual private server with AWS. Get everything you need to jump start your application - compute, storage, and networking - starting at $5/month.
Learning Objectives:
• Learn about the capabilities and features of Amazon Lightsail
• Learn about the benefits of Amazon Lightsail
• Learn about the different use cases
• Learn how to get started using Amazon Lightsail
• Spin up your first VPS using pre-configured images
• Manage your Lightsail server, SSH keys, SSL certs, domains from the dashboard
• Create and manage snapshots
AWS re:Invent 2016: Get Technically Inspired by Container-Powered Migrations ... | Amazon Web Services
This session is a technical journey through application migration and refactoring using containerized technologies. Flux7 recently worked with Rent-a-Center to migrate their Hybris deployment from their data center to AWS; hear how they used Amazon ECS, the new Application Load Balancer, and Auto Scaling to meet the customer's business objectives.
Design, Deploy, and Optimize SQL Server on AWS - AWS Online Tech Talks | Amazon Web Services
Enterprises are quickly moving database workloads like SQL Server to the cloud, but with so many options, the best approach isn’t always obvious. You can exercise full control of your SQL Server workloads by running them on Amazon EC2 instances, or leverage Amazon RDS for a fully managed database experience. This session will go deep on best practices and considerations for running SQL Server on AWS. We will cover best practices for deploying SQL Server, how to choose between Amazon EC2 and Amazon RDS, and ways to optimize the performance of your SQL Server deployment for different application types. We will also review in detail how to provision and monitor your SQL Server databases, and how to manage scalability, performance, availability, security, and backup and recovery, in both Amazon RDS and Amazon EC2.
Creating Your Virtual Data Center: VPC Fundamentals and Connectivity Options | Amazon Web Services
In this session, we will walk through the fundamentals of Amazon Virtual Private Cloud (VPC). First, we will cover build-out and design fundamentals for VPC, including picking your IP space, subnetting, routing, security, NAT, and much more. We will then transition into different approaches and use cases for optionally connecting your VPC to your physical data center with VPN or AWS Direct Connect. This mid-level architecture discussion is aimed at architects, network administrators, and technology decision-makers interested in understanding the building blocks AWS makes available with VPC and how you can connect this with your offices and current data center footprint.
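The subnetting exercise described above can be sketched with Python's standard `ipaddress` module. The 10.0.0.0/16 VPC CIDR and the /24 subnet size below are illustrative assumptions, not values from the session:

```python
import ipaddress

# Hypothetical VPC CIDR block; AWS VPCs accept prefixes from /16 to /28.
vpc_cidr = ipaddress.ip_network("10.0.0.0/16")

# Carve the VPC into /24 subnets, e.g. one public and one private
# subnet per Availability Zone.
subnets = list(vpc_cidr.subnets(new_prefix=24))

print(len(subnets))   # 256 possible /24 subnets in a /16
print(subnets[0])     # 10.0.0.0/24
print(subnets[1])     # 10.0.1.0/24

# AWS reserves 5 addresses in every subnet (network address, VPC
# router, DNS, future use, broadcast), so a /24 yields 256 - 5
# usable host addresses.
usable_hosts = subnets[0].num_addresses - 5
print(usable_hosts)   # 251
```

Planning the address space this way up front matters because a VPC's primary CIDR cannot be shrunk after creation.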
AWS Databases
·Database models (SQL vs. NoSQL)
·Amazon Relational Database Service (RDS) concepts, including database instances, security groups, and parameter and option groups
·Amazon DynamoDB concepts, including data model and supported operations
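The DynamoDB concepts listed above can be sketched with plain Python structures; the "Sessions" table, its attributes, and the toy `get_item` helper are hypothetical illustrations, not the real DynamoDB API:

```python
# A DynamoDB table is schemaless apart from its primary key.
# Hypothetical "Sessions" table: partition (HASH) key + sort (RANGE) key.
table_key_schema = {
    "TableName": "Sessions",  # illustrative name
    "KeySchema": [
        {"AttributeName": "EventId", "KeyType": "HASH"},     # partition key
        {"AttributeName": "StartTime", "KeyType": "RANGE"},  # sort key
    ],
}

# Items in the same table may carry different attributes,
# unlike rows in a relational (RDS) table.
item_a = {"EventId": "reinvent-2016", "StartTime": "09:00", "Room": "A"}
item_b = {"EventId": "reinvent-2016", "StartTime": "10:30",
          "Speaker": "Jane Doe", "Level": 200}

# Core supported operations are item-level: PutItem, GetItem,
# UpdateItem, DeleteItem, plus Query (within one partition key)
# and Scan (whole table).
def get_item(items, event_id, start_time):
    """Toy stand-in for GetItem: exact match on the full primary key."""
    for item in items:
        if item["EventId"] == event_id and item["StartTime"] == start_time:
            return item
    return None

print(get_item([item_a, item_b], "reinvent-2016", "10:30")["Level"])  # 200
```

The key takeaway mirrors the SQL-vs-NoSQL contrast above: DynamoDB trades flexible ad-hoc joins for predictable key-based access at any scale.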
AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310) | Amazon Web Services
Batch computing is a common way for developers, scientists, and engineers to run a series of jobs on a large pool of shared compute resources, such as servers, virtual machines, and containers. Amazon ECS makes it easy to run and manage Docker-enabled applications across a cluster of Amazon EC2 instances. In this session, we will show you how to run batch jobs using Amazon ECS together with other AWS services, such as AWS Lambda and Amazon SQS. We will see how you can leverage Amazon EC2 Spot Instances to power your ECS cluster and easily scale your batch workloads. You'll hear from Mapbox on how they use ECS to power their entire batch processing architecture to collect and process over 100 million miles of sensor data per day that they use for powering their maps. Mapbox will also discuss how they optimize their batch processing framework on ECS using Spot Instances and demo their open source framework that will help you get up and running with ECS in minutes.
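The ECS-plus-queue pattern described above, with containers pulling jobs from a shared queue, can be sketched locally. Here Python's standard `queue.Queue` stands in for Amazon SQS, and the job payloads are invented for illustration:

```python
import json
import queue

# Stand-in for an SQS queue; in the real pattern each ECS task
# would poll sqs.receive_message() in a loop instead.
job_queue = queue.Queue()

# Enqueue some hypothetical batch jobs (e.g. map tiles to process).
for tile_id in ["tile-001", "tile-002", "tile-003"]:
    job_queue.put(json.dumps({"tile": tile_id}))

def worker(q):
    """One containerized worker: drain the queue, process each job."""
    processed = []
    while not q.empty():
        job = json.loads(q.get())
        processed.append(job["tile"].upper())  # placeholder "work"
        q.task_done()                          # ~ SQS DeleteMessage
    return processed

print(worker(job_queue))  # ['TILE-001', 'TILE-002', 'TILE-003']
```

Because each worker only ever asks the queue for its next job, you can add or remove containers (for example, as Spot capacity comes and goes) without coordinating the workers themselves.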
Configuration Management with AWS OpsWorks for Chef Automate | Amazon Web Services
AWS OpsWorks for Chef Automate provides a fully managed Chef server and suite of automation tools that give you workflow automation for continuous deployment, automated testing for compliance and security, and a user interface that gives you visibility into your nodes and their status. The Chef server gives you full stack automation by handling operational tasks such as software and operating system configurations, package installations, database setups, and more. The Chef server centrally stores your configuration tasks and provides them to each node in your compute environment at any scale, from a few nodes to thousands of nodes. OpsWorks for Chef Automate is completely compatible with tooling and cookbooks from the Chef community and automatically registers new nodes with your Chef server.
Evolution of Geospatial Workloads on AWS - AWS PS Summit Canberra | Amazon Web Services
Geospatial workloads are often amongst the first to move to AWS in government. This session will cover some common topics in GIS, including optimizing for license costs, leveraging native cloud capabilities, and running GIS "desktop" software on the AWS cloud.
Speaker: Herman Coomans, Solutions Architect, Amazon Web Services
Level: 200
Learn how Amazon Redshift, our fully managed, petabyte-scale data warehouse, can help you quickly and cost-effectively analyze all of your data using your existing business intelligence tools. Get an introduction to how Amazon Redshift uses massively parallel processing, scale-out architecture, and columnar direct-attached storage to minimize I/O time and maximize performance. Learn how you can gain deeper business insights and save money and time by migrating to Amazon Redshift. Take away strategies for migrating from on-premises data warehousing solutions, tuning schema and queries, and utilizing third party solutions.
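The columnar-storage point above is easy to illustrate: an aggregate over one column touches far fewer values when data is laid out by column rather than by row. The table and numbers below are invented for the sketch:

```python
# Toy illustration of why columnar storage cuts I/O for analytics.
rows = [
    {"user_id": i, "region": "eu", "revenue": i * 1.5, "clicks": i % 7}
    for i in range(1000)
]

# Row-oriented scan: every attribute of every row is read to
# answer "total revenue", even the three columns we don't need.
values_read_row_store = sum(len(r) for r in rows)       # 4000

# Column-oriented scan: only the 'revenue' column is read.
revenue_column = [r["revenue"] for r in rows]
values_read_column_store = len(revenue_column)          # 1000

total = sum(revenue_column)
print(values_read_row_store, values_read_column_store)  # 4000 1000
```

With a 4-column table the column store reads a quarter of the values; real warehouse tables often have dozens of columns, which is where the I/O savings (plus per-column compression) become dramatic.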
Amazon Aurora is a MySQL-compatible database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases. This session introduces you to Amazon Aurora, explains common use cases for the service, and helps you get started with building your first Amazon Aurora–powered application.
AWS APAC Webinar Week - Launching Your First Big Data Project on AWS | Amazon Web Services
Want to get ramped up on how to use Amazon's big data services and launch your first big data application on AWS?
Join us on a journey as we build a big data application in real-time using Amazon EMR, Amazon Redshift, Amazon Kinesis, Amazon DynamoDB, and Amazon S3.
In this session we review architecture design patterns for big data solutions on AWS, and give you access to everything you need so that you can rebuild and customize the application yourself.
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR | Amazon Web Services
Amazon EMR is a managed service that lets you process and analyze extremely large data sets using the latest versions of over 15 open-source frameworks in the Apache Hadoop and Spark ecosystems. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost-efficient. Finally, we dive into some of our recent launches to keep you current on our latest features. This session will feature Asurion, a provider of device protection and support services for over 280 million smartphones and other consumer electronics devices.
Amazon Aurora is a MySQL-compatible database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases. This session introduces you to Amazon Aurora, explains common use cases for the service, and helps you get started with building your first Amazon Aurora–powered application.
Wild Rydes (www.wildrydes.com) needs your help! With fresh funding from its seed investors, Wild Rydes is seeking to build the world’s greatest mobile/VR/AR unicorn transportation system. The scrappy startup needs a first-class webpage to begin marketing to new users and to begin its plans for global domination. Join us to help Wild Rydes build a website using a serverless architecture. You’ll build a scalable website using services like AWS Lambda, Amazon API Gateway, Amazon DynamoDB, and Amazon S3. Join this workshop to hop on the rocket ship!
To complete this workshop, you'll need:
• Your laptop
• AWS Account
• AWS Command Line Interface
• Google Chrome
• git
• Text Editor
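A serverless backend like the one built in this workshop typically centers on a Lambda function behind API Gateway. The following is a minimal, hypothetical handler sketch: the event shape follows API Gateway's proxy integration, but the route, body fields, and defaults are invented, and it runs locally without an AWS account:

```python
import json

def handler(event, context):
    """Hypothetical Lambda handler for an API Gateway proxy integration.
    Echoes back a ride request; the actual workshop code would write
    the request to DynamoDB instead."""
    body = json.loads(event.get("body") or "{}")
    unicorn = body.get("unicorn", "Shadowfax")  # invented default
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Ride requested on {unicorn}"}),
    }

# Local invocation with a fake API Gateway event (context is unused).
event = {"body": json.dumps({"unicorn": "Bucephalus"})}
response = handler(event, None)
print(response["statusCode"])                   # 200
print(json.loads(response["body"])["message"])  # Ride requested on Bucephalus
```

Returning the `statusCode`/`headers`/`body` dict is what lets API Gateway translate the function's result into an HTTP response for the static site hosted on S3.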
Amazon EC2 changes the economics of computing and provides you with complete control of your computing resources. It is designed to make web-scale cloud computing easier for developers. In this session, we will take you on a journey, starting with the basics of key management and security groups and ending with an explanation of Auto Scaling and how you can use it to match capacity and costs to demand using dynamic policies. We will also discuss tools and best practices that will help you build failure resilient applications that take advantage of the scale and robustness of AWS regions.
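The dynamic-policy idea above, matching capacity to demand, boils down to a small decision function. This sketch loosely models a step-scaling policy; the 70%/30% CPU thresholds and step sizes are illustrative assumptions (real policies are driven by CloudWatch alarms, not inline checks):

```python
def desired_capacity(current, cpu_percent, minimum=2, maximum=10):
    """Toy step-scaling decision, loosely modeled on a dynamic
    Auto Scaling policy. Thresholds and step sizes are invented."""
    if cpu_percent > 70:
        current += 2  # scale out aggressively under load
    elif cpu_percent < 30:
        current -= 1  # scale in gently when idle
    # The group never leaves its configured min/max bounds.
    return max(minimum, min(maximum, current))

print(desired_capacity(4, cpu_percent=85))   # 6
print(desired_capacity(4, cpu_percent=20))   # 3
print(desired_capacity(10, cpu_percent=90))  # 10 (capped at maximum)
```

Scaling out in larger steps than you scale in, as sketched here, is a common way to absorb spikes quickly while avoiding capacity flapping.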
Amazon EC2 allows you to bid for and run spare EC2 capacity, known as Spot instances, in a dynamically priced market. On average, customers save 80% to 90% compared to On Demand prices by using Spot instances. Achieving these savings has historically required time and effort to find the best deals while managing compute capacity as supply and demand fluctuate.
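The 80% to 90% savings figure above is easy to sanity-check with simple arithmetic; the hourly prices used here are invented examples, not real EC2 rates:

```python
def spot_savings(on_demand_price, spot_price):
    """Percentage saved by running on Spot instead of On-Demand."""
    return (1 - spot_price / on_demand_price) * 100

# Hypothetical hourly prices, for illustration only.
on_demand = 0.50
spot = 0.08

print(round(spot_savings(on_demand, spot), 1))  # 84.0
```

A Spot price at roughly one-tenth to one-fifth of the On-Demand rate produces exactly the 80-90% savings range the text describes.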
Amazon Aurora New Features - September 2016 Webinar Series | Amazon Web Services
Amazon Aurora is a fully managed MySQL-compatible database with high-end commercial database features and performance at one-tenth the cost. Since launching Aurora a year ago we have added many new capabilities and features. Some of these features include encryption, database snapshot sharing, enhanced monitoring, cross-region replication, S3 binary snapshot ingestion and customized failover priority. In this session we'll demonstrate how these features work and discuss how you can make the best use of them.
Learning Objectives:
• Learn about the newly added features of Aurora
• Learn how to use those features
• Learn when and why to use those features
Who Should Attend:
• IT Managers, DBAs, Enterprise and Solution Architects, DevOps Engineers, and Developers
AWS re:Invent 2016: Busting the Myth of Vendor Lock-In: How D2L Embraced the... | Amazon Web Services
When D2L first moved to the cloud, we were concerned about being locked-in to one cloud provider. We were compelled to explore the opportunities of the cloud, so we overcame our perceived risk, and turned it into an opportunity by self-rolling tools and avoiding AWS native services. In this session, you learn how D2L tried to bypass the lock but eventually embraced it and opened the cage. Avoiding AWS native tooling and pure lifts of enterprise architecture caused a drastic inflation of costs. Learn how we shifted away from a self-rolled "lift" into an efficient and effective "shift" while prioritizing cost, client safety, AND speed of development. Learn from D2L's successes and missteps, and convert your own enterprise systems into the cloud both through native cloud births and enterprise conversions. This session discusses D2L’s use of Amazon EC2 (with a guest appearance by Reserved Instances), Elastic Load Balancing, Amazon EBS, Amazon DynamoDB, Amazon S3, AWS CloudFormation, AWS CloudTrail, Amazon CloudFront, AWS Marketplace, Amazon Route 53, AWS Elastic Beanstalk, and Amazon ElastiCache.
Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sec... | Amazon Web Services
Every day, the computing power of high-performance computing (HPC) clusters helps scientists make breakthroughs, such as proving the existence of gravitational waves and screening new compounds for new drugs. Yet building HPC clusters is out of reach for most organizations, due to the upfront hardware costs and ongoing operational expenses. Now the speed of innovation is only bound by your imagination, not your budget. Researchers can run one cluster for 10,000 hours or 10,000 clusters for one hour anytime, from anywhere, and both cost the same in the cloud. And with the availability of Public Data Sets in Amazon S3, petabyte scale data is instantly accessible in the cloud. Attend and learn how to build HPC clusters on the fly, leverage Amazon’s Spot market pricing to minimize the cost of HPC jobs, and scale HPC jobs on a small budget, using all the same tools you use today, and a few new ones too.
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec... | Amazon Web Services
Amazon S3 is the central data hub for Netflix's big data ecosystem. We currently have over 1.5 billion objects and 60+ PB of data stored in S3. As we ingest, transform, transport, and visualize data, we find this data naturally weaving in and out of S3. Amazon S3 provides us the flexibility to use an interoperable set of big data processing tools like Spark, Presto, Hive, and Pig. It serves as the hub for transporting data to additional data stores / engines like Teradata, Redshift, and Druid, as well as exporting data to reporting tools like Microstrategy and Tableau. Over time, we have built an ecosystem of services and tools to manage our data on S3. We have a federated metadata catalog service that keeps track of all our data. We have a set of data lifecycle management tools that expire data based on business rules and compliance. We also have a portal that allows users to see the cost and size of their data footprint. In this talk, we’ll dive into these major uses of S3, as well as many smaller cases, where S3 smoothly addresses an important data infrastructure need. We will also provide solutions and methodologies on how you can build your own S3 big data hub.
AWS re:Invent 2016: Design, Deploy, and Optimize Microsoft SharePoint on AWS ... | Amazon Web Services
AWS can help you rapidly deploy and scale your Microsoft SharePoint environment to help you collaborate more efficiently and cost-effectively. This session reviews architectural considerations for building a SharePoint deployment on AWS, best practices to ensure optimal performance, how to leverage multiple Availability Zones for high availability and disaster recovery, and how to integrate with Microsoft Active Directory. We will also look at new Quick Start guides, AWS CloudFormation templates, and other tools that dramatically reduce the time to deployment.
Matthew Bishop - A Quick Introduction to AWS Elastic MapReduce | huguk
Matt will take a look at the EMR interface, and explore what additional value EMR provides for creating and managing Hadoop clusters.
Matt is a principal technologist in the technical IT training team at QA. He is a Microsoft Certified Trainer and an authorized Amazon trainer, focusing on creating and delivering courses about cloud services, service-oriented architectures and enterprise application integration.
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Amazon Web Services
Big data technologies let you work with any velocity, volume, or variety of data in a highly productive environment. Join the General Manager of Amazon EMR, Peter Sirota, to learn how to scale your analytics, use Hadoop with Amazon EMR, write queries with Hive, develop real world data flows with Pig, and understand the operational needs of a production data platform.
My slides from the re:Invent Recap Conferences.
The AWS Well-Architected Framework enables customers to understand best practices around security, reliability, performance, and cost optimisation when building systems on AWS. This approach helps customers make informed decisions and weigh the pros and cons of application design patterns for the cloud. In this session, you'll learn how to follow AWS guidelines and best practices. By developing a strategy based on Amazon Web Services's Well-Architected Framework, you will be able to significantly increase the frequency of code deployments and reduce deployment times. As a result, you will be able to deliver more scalable, dynamic and resilient applications.
A brief presentation of both the male and female reproductive systems, covering anatomy and physiology, outlined sequentially and in an easy-to-understand way.
AWS re:Invent 2016: Scaling Up to Your First 10 Million Users (ARC201)Amazon Web Services
Cloud computing gives you a number of advantages, such as the ability to scale your web application or website on demand. If you have a new web application and want to use cloud computing, you might be asking yourself, "Where do I start?" Join us in this session to understand best practices for scaling your resources from zero to millions of users. We show you how to best combine different AWS services, how to make smarter decisions for architecting your application, and how to scale your infrastructure in the cloud.
(BDT208) A Technical Introduction to Amazon Elastic MapReduceAmazon Web Services
Amazon EMR provides a managed framework which makes it easy, cost-effective, and secure to run data processing frameworks such as Apache Hadoop, Apache Spark, and Presto on AWS. In this session, you learn the key design principles behind running these frameworks on the cloud and the feature set that Amazon EMR offers. We discuss the benefits of decoupling compute and storage and strategies to take advantage of the scale and the parallelism that the cloud offers, while lowering costs. Additionally, you hear from AOL’s Senior Software Engineer on how they used these strategies to migrate their Hadoop workloads to the AWS cloud and lessons learned along the way.
In this session, you learn the benefits of decoupling storage and compute and allowing them to scale independently; how to run Hadoop, Spark, Presto, and other supported Hadoop applications on Amazon EMR; how to use Amazon S3 as a persistent data store and process data directly from Amazon S3; deployment strategies and how to avoid common mistakes when deploying at scale; and how to use Spot Instances to scale your transient infrastructure effectively.
Introduction to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of Spot EC2 instances to reduce costs, and other Amazon EMR architectural best practices.
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best PracticesAmazon Web Services
Amazon Elastic MapReduce (EMR) is one of the largest Hadoop operators in the world. Since its launch five years ago, our customers have launched more than 15 million Hadoop clusters inside of EMR. In this webinar, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters and other Amazon EMR architectural patterns. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost efficient.
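The dynamic scaling described above can be sketched as the payload such a resize action sends to the EMR API (built here as a plain dict rather than a live boto3 call; the cluster and instance-group IDs are placeholders):

```python
# Sketch: resizing an EMR instance group, as described above.
# The dict mirrors the parameters boto3's EMR client accepts in
# modify_instance_groups(); "j-XXXX" and "ig-XXXX" are placeholder IDs.

def build_resize_request(cluster_id, group_id, new_count):
    """Build the payload for scaling an instance group up or down."""
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [
            {"InstanceGroupId": group_id, "InstanceCount": new_count},
        ],
    }

# Scale a task group out to 20 nodes for a heavy job, then back in to 2.
scale_up = build_resize_request("j-XXXX", "ig-XXXX", 20)
scale_down = build_resize_request("j-XXXX", "ig-XXXX", 2)
```

With boto3, this payload would be passed as `emr.modify_instance_groups(**scale_up)`; task groups are the usual target for resizing because they hold no HDFS data.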
Amazon Elastic MapReduce (Amazon EMR) is a web service that allows you to easily and securely provision and manage your Hadoop clusters. In this talk, we will introduce you to Amazon EMR design patterns, such as using various data stores like Amazon S3, how to take advantage of both transient and active clusters, and how to work with other Amazon EMR architectural patterns. We will dive deep on how to dynamically scale your cluster and address the ways you can fine-tune your cluster. We will discuss bootstrapping Hadoop applications from our partner ecosystem that you can use natively with Amazon EMR. Lastly, we will share best practices on how to keep your Amazon EMR cluster cost-effective.
If you are interested to know more about AWS Chicago Summit, please use the following to register: http://amzn.to/1RooPPL
Many AWS customers store vast amounts of data in Amazon S3, a low cost, scalable, and durable object store; Amazon DynamoDB, a NoSQL database; or Amazon Kinesis, a real time data stream processing service. With large datasets in various AWS services, how do you derive value from this information in a cost-effective way? Using Amazon Elastic MapReduce (Amazon EMR) with applications in the Apache Hadoop ecosystem, you can directly interact with data in each of these storage services for scalable analytics workloads or ad hoc queries. You can quickly and easily launch an Amazon EMR cluster from the AWS Management Console, and scale your cluster to match the compute and memory resources needed for your workflow, independent from the storage capacity used in your AWS storage services. The webinar will accelerate your use of Amazon EMR by showing you how to create and monitor Amazon EMR clusters, and provide several use cases and architectures for using Amazon EMR with different AWS data stores.
Learning Objectives: • Recognize when to use Amazon EMR • Understand the steps required to set up and monitor an Amazon EMR cluster • Architect applications that effectively use Amazon EMR • Understand how to use HUE for ad hoc query of data in Amazon S3
Who Should Attend: • Developers, LOB owners, Continuous Integration & Continuous Delivery (CICD) practitioners
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...Amazon Web Services
Learn how to set up a highly scalable, robust, and secure Hadoop platform using Amazon EMR. We'll perform a demonstration using a 100-node Amazon EMR cluster and take you through the best practices and performance tuning required for different workloads to ensure they are production ready.
Speaker: Amo Abeyaratne, Big Data Consultant, Amazon Web Services
Featured Customer - Ambidata
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Web Services
Amazon Elastic MapReduce is one of the largest Hadoop operators in the world. Since its launch four years ago, our customers have launched more than 5.5 million Hadoop clusters. In this talk, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters and other Amazon EMR architectural patterns. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost efficient.
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch ServiceAmazon Web Services
Everything generates logs. Applications, infrastructure, security ... everything. Keeping track of the flood of log data is a big challenge, yet critical to your ability to understand your systems and troubleshoot (or prevent) issues. In this session, we will use both Amazon CloudWatch and application logs to show you how to build an end-to-end log analytics solution. First, we cover how to configure an Amazon Elasticsearch Service domain and ingest data into it using Amazon Kinesis Firehose, demonstrating how easy it is to transform data with Firehose. We look at best practices for choosing instance types, storage options, shard counts, and index rotations based on the throughput of incoming data and configure a secure analytics environment. We demonstrate how to set up a Kibana dashboard and build custom dashboard widgets. Finally, we dive deep into the Elasticsearch query DSL and review approaches for generating custom, ad-hoc reports.
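For a flavor of the Elasticsearch query DSL mentioned at the end, here is a minimal aggregation body for one such ad-hoc report, built as a plain Python dict; the field names `@timestamp` and `loglevel` are assumptions about the log schema:

```python
# Sketch: an Elasticsearch query DSL body that counts log entries per
# level over a recent window -- the kind of ad-hoc report described above.
# Field names ("@timestamp", "loglevel") are assumed, not taken from the talk.

def log_level_histogram_query(hours=1):
    return {
        "size": 0,  # we only want aggregation buckets, not raw hits
        "query": {
            "range": {"@timestamp": {"gte": f"now-{hours}h", "lte": "now"}}
        },
        "aggs": {
            "levels": {"terms": {"field": "loglevel", "size": 10}}
        },
    }
```

This body would be POSTed to the domain's `/<index>/_search` endpoint; Kibana builds equivalent DSL under the hood for its visualizations.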
Amazon Elastic MapReduce is one of the largest Hadoop operators in the world. Since its launch five years ago, AWS customers have launched more than 5.5 million Hadoop clusters.
In this talk, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters and other Amazon EMR architectural patterns. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost efficient.
Speakers:
Ian Meyers, AWS Solutions Architect
Ian McDonald, IT Director, SwiftKey
Learn more about the tools, techniques and technologies for working productively with data at any scale. This session will introduce the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Redshift, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big DataAmazon Web Services
Analyzing large data sets requires significant compute and storage capacity that can vary in size based on the amount of input data and the analysis required. This characteristic of big data workloads is ideally suited to the pay-as-you-go cloud model, where applications can easily scale up and down based on demand. Learn how Amazon S3 can help scale your big data platform. Hear from Redfin and Twitter about how they build their big data platforms on AWS and how they use S3 as an integral piece of their big data platforms.
Data processing and analysis is where big data is most often consumed - driving business intelligence (BI) use cases that discover and report on meaningful patterns in the data. In this session, we will discuss options for processing, analyzing and visualizing data. We will also look at partner solutions and BI-enabling services from AWS. Attendees will learn about optimal approaches for stream processing, batch processing and Interactive analytics. AWS services to be covered include: Amazon Machine Learning, Elastic MapReduce (EMR), and Redshift.
Amazon EMR is one of the largest Hadoop operators in the world. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We will also share best practices to keep your Amazon EMR cluster cost-efficient. Finally, we dive into some of our recent launches to keep you current on our latest features.
Companies around the world are moving their applications to the cloud as fast as they can in order to become more flexible and reduce costs. Some applications, however, must remain in on-premises data centers, whether because of low-latency or local data-processing requirements. AWS Outposts brings fully managed cloud services and infrastructure to any data center. The same API, whether through the graphical console, command line, or SDK, regardless of whether the application runs in the cloud or on an AWS Outpost, lets you take full advantage of the hybrid-cloud model without compromise. In this webinar we introduce how AWS Outposts works, along with use cases from real customer deployments.
AWS CZSK Webinar - Migrating Desktops and Applications to the AWS Cloud with Amazon Work...Vladimir Simek
Extended support for the Windows 7 operating system ended in mid-January 2020. Many organizations face the decision of whether to invest in their existing infrastructure or instead give their users a more flexible and modern solution, available from anywhere and on any device. Moving desktops and applications to the AWS cloud offers improved security, scalability, flexibility, and higher performance. In this webinar we provide an overview of Amazon WorkSpaces and Amazon AppStream 2.0 and show you how easy it is to get started with them.
Serverless on AWS: Architectural Patterns and Best PracticesVladimir Simek
When speaking about serverless on AWS, most people think about AWS Lambda. But there is more to it than that. AWS provides a set of fully managed services that you can use to build and run serverless applications. Serverless applications don’t require provisioning, maintaining, and administering servers for backend components such as compute, databases, storage, stream processing, message queuing, and more. You also no longer need to worry about ensuring application fault tolerance and availability. Instead, AWS handles all of these capabilities for you. This allows you to focus on product innovation while enjoying faster time-to-market.
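A minimal sketch of that model: a Lambda-style Python handler containing only application code, with provisioning, scaling, and availability left to the platform. The event shape imitates an API Gateway proxy payload and is illustrative:

```python
import json

# Sketch of an AWS Lambda handler: there are no servers to provision --
# the platform invokes this function per request. The event shape mimics
# an API Gateway proxy payload and is illustrative only.

def handler(event, context=None):
    name = (event.get("queryStringParameters") or {}).get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```

In a real deployment this function would be referenced as the Lambda handler (e.g. `module.handler`) and fronted by API Gateway; locally it is just a plain function you can call with a sample event.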
As the cloud lowered the cost of storing and processing data and a new generation of applications emerged, new requirements arose for databases. These applications need databases that can store terabytes or petabytes of data and new data types, respond in milliseconds, and handle millions of requests per second from millions of users anywhere in the world. To support such requirements you need both relational and non-relational databases, designed to meet the specific needs of your applications.
If you want to learn more about which database systems you can use on AWS for your applications, join our next AWS Czech-Slovak webinar. We will demonstrate various database solutions on AWS, describe use cases and best practices, and show several demos.
Premiere: 09/07/2019
AWS CZSK Webinar 2019.05: How to Protect Your Web Applications from DDoS AttacksVladimir Simek
DDoS and other web attacks (XSS, SQL injection) against your infrastructure can negatively affect the availability of your applications, compromise their security, and increase your costs. If you are interested in protecting web applications, watch the next episode of our Czech-Slovak AWS webinar to learn more about recommended practices and how to use Amazon CloudFront, AWS WAF, AWS Firewall Manager, and AWS Shield.
Czech-Slovak AWS Webinar 07 - Cost Optimization in AWSVladimir Simek
The wide range of services and pricing options that AWS offers gives you the flexibility to manage costs effectively while keeping the performance and capacity your business requires. With the AWS cloud you can easily manage your resources, take advantage of Reserved Instances, and use powerful cost-management tools to track your spending.
AWS Czech-Slovak Webinar 03: Developing on AWSVladimir Simek
Amazon Web Services provides a highly reliable, scalable, and low-cost cloud platform used by hundreds of thousands of companies in 190 countries around the world. Startups, small and medium-sized businesses, large enterprises, and public-sector customers have access to building blocks for rapidly developing applications in response to changing business requirements. Whether you want to build web or mobile applications, on traditional servers or in containers, AWS puts many tools in developers' hands that help them build and deploy applications simply, quickly, and at low cost.
Technical dive to how gaming companies use AWS to make sure they can deliver faster and better games to their users. We will talk about game studios like Rovio, Ubisoft, EA, Supercell, Zynga.
Artificial Intelligence (Machine Learning) on AWS: How to StartVladimir Simek
Amazon has been investing deeply in artificial intelligence (AI) for over 20 years. Machine learning (ML) algorithms drive many of its internal systems. It is also core to the capabilities Amazon's customers experience – from the path optimization in the fulfillment centers, and Amazon.com’s recommendations engine, to Echo powered by Alexa, drone initiative Prime Air, and the new retail experience Amazon Go. This is just the beginning. Amazon's mission is to share learnings and ML capabilities as fully managed services, and put them into the hands of every developer and data scientist.
If you are interested in how you can develop ML-based smart applications on the AWS platform, and want to see a couple of cool demos, join us for the next AWS meetup. AWS Solutions Architect Vladimir Simek will present the full AWS portfolio for AI and ML - from virtual servers suited to training deep learning models up to fully managed API-based services.
Accelerate Enterprise Software Engineering with PlatformlessWSO2
Key takeaways:
Challenges of building platforms and the benefits of platformless.
Key principles of platformless, including API-first, cloud-native middleware, platform engineering, and developer experience.
How Choreo enables the platformless experience.
How key concepts like application architecture, domain-driven design, zero trust, and cell-based architecture are inherently a part of Choreo.
Demo of an end-to-end app built and deployed on Choreo.
In software engineering, the right architecture is essential for robust, scalable platforms. Wix has undergone a pivotal shift from event sourcing to a CRUD-based model for its microservices. This talk will chart the course of this pivotal journey.
Event sourcing, which records state changes as immutable events, provided robust auditing and "time travel" debugging for Wix Stores' microservices. Despite its benefits, the complexity it introduced in state management slowed development. Wix responded by adopting a simpler, unified CRUD model. This talk will explore the challenges of event sourcing and the advantages of Wix's new "CRUD on steroids" approach, which streamlines API integration and domain event management while preserving data integrity and system resilience.
Participants will gain valuable insights into Wix's strategies for ensuring atomicity in database updates and event production, as well as caching, materialization, and performance optimization techniques within a distributed system.
Join us to discover how Wix has mastered the art of balancing simplicity and extensibility, and learn how the re-adoption of the modest CRUD has turbocharged their development velocity, resilience, and scalability in a high-growth environment.
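The trade-off described above can be made concrete with a toy contrast (all names illustrative): event sourcing derives current state by replaying immutable events, while CRUD stores the state directly:

```python
# Toy contrast of the two models discussed above. Names are illustrative,
# not taken from Wix's actual codebase.

# Event sourcing: state is derived by folding over immutable events,
# which gives auditing and "time travel" debugging for free.
def replay(events):
    state = {"quantity": 0}
    for event in events:
        if event["type"] == "ItemAdded":
            state["quantity"] += event["qty"]
        elif event["type"] == "ItemRemoved":
            state["quantity"] -= event["qty"]
    return state

events = [
    {"type": "ItemAdded", "qty": 3},
    {"type": "ItemRemoved", "qty": 1},
]
derived = replay(events)

# CRUD: the current state is the record itself, updated in place --
# simpler to manage, but the change history must be kept separately.
crud_row = {"quantity": 2}

assert derived == crud_row
```

The complexity the talk mentions lives in everything around `replay`: versioning event schemas, snapshotting long streams, and keeping projections in sync, which is what a "CRUD on steroids" model avoids.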
Navigating the Metaverse: A Journey into Virtual EvolutionDonna Lenk
Join us for an exploration of the Metaverse's evolution, where innovation meets imagination. Discover new dimensions of virtual events, engage with thought-provoking discussions, and witness the transformative power of digital realms.
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTier1 app
Even though at the surface level ‘java.lang.OutOfMemoryError’ appears to be one single error, under the hood there are 9 types of OutOfMemoryError. Each type has different causes, diagnosis approaches, and solutions. This session equips you with the knowledge, tools, and techniques needed to troubleshoot and conquer OutOfMemoryError in all its forms, ensuring smoother, more efficient Java applications.
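The JVM distinguishes these flavors by the message appended to the error; a small illustrative sketch that classifies a few of the well-known ones (this covers only a subset of the nine types):

```python
# Sketch: classifying common java.lang.OutOfMemoryError messages to their
# usual cause. Only a few well-known flavors are covered; the remedies are
# general rules of thumb, not a substitute for heap-dump analysis.

OOM_CAUSES = {
    "Java heap space": "live objects exceed -Xmx; take and analyze a heap dump",
    "GC overhead limit exceeded": "GC runs constantly but reclaims little memory",
    "Metaspace": "too many loaded classes; raise -XX:MaxMetaspaceSize or fix a classloader leak",
    "unable to create new native thread": "OS-level thread limit reached; reduce thread count or stack size",
    "Requested array size exceeds VM limit": "attempted to allocate an array larger than the VM allows",
}

def classify_oom(message):
    """Map an OutOfMemoryError message to its flavor and usual cause."""
    for flavor, cause in OOM_CAUSES.items():
        if flavor in message:
            return flavor, cause
    return "unknown", "inspect GC logs and a heap dump"
```

Pairing this kind of triage with `-XX:+HeapDumpOnOutOfMemoryError` is a common practice, so a dump is already available when the error fires.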
Into the Box Keynote Day 2: Unveiling amazing updates and announcements for modern CFML developers! Get ready for exciting releases and updates on Ortus tools and products. Stay tuned for cutting-edge innovations designed to boost your productivity.
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Anthony Dahanne
Buildpacks have been around for more than 10 years! At first, they were used to detect and build an application before deploying it to certain PaaS platforms. Then, with their latest generation, Cloud Native Buildpacks (a CNCF incubating project), we became able to build Docker (OCI) images. Are they a good alternative to the Dockerfile? What are the Paketo buildpacks? Which communities back them, and how?
Come find out in this ignite session.
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisGlobus
JASMIN is the UK’s high-performance data analysis platform for environmental science, operated by STFC on behalf of the UK Natural Environment Research Council (NERC). In addition to its role in hosting the CEDA Archive (NERC’s long-term repository for climate, atmospheric science & Earth observation data in the UK), JASMIN provides a collaborative platform to a community of around 2,000 scientists in the UK and beyond, providing nearly 400 environmental science projects with working space, compute resources and tools to facilitate their work. High-performance data transfer into and out of JASMIN has always been a key feature, with many scientists bringing model outputs from supercomputers elsewhere in the UK, to analyse against observational or other model data in the CEDA Archive. A growing number of JASMIN users are now realising the benefits of using the Globus service to provide reliable and efficient data movement and other tasks in this and other contexts. Further use cases involve long-distance (intercontinental) transfers to and from JASMIN, and collecting results from a mobile atmospheric radar system, pushing data to JASMIN via a lightweight Globus deployment. We provide details of how Globus fits into our current infrastructure, our experience of the recent migration to GCSv5.4, and of our interest in developing use of the wider ecosystem of Globus services for the benefit of our user community.
Top Nidhi Software Solution Free Downloadvrstrong314
This presentation emphasizes the importance of data security and legal compliance for Nidhi companies in India. It highlights how online Nidhi software solutions, like Vector Nidhi Software, offer advanced features tailored to these needs. Key aspects include encryption, access controls, and audit trails to ensure data security. The software complies with regulatory guidelines from the MCA and RBI and adheres to Nidhi Rules, 2014. With customizable, user-friendly interfaces and real-time features, these Nidhi software solutions enhance efficiency, support growth, and provide exceptional member services. The presentation concludes with contact information for further inquiries.
How Recreation Management Software Can Streamline Your Operations.pptxwottaspaceseo
Recreation management software streamlines operations by automating key tasks such as scheduling, registration, and payment processing, reducing manual workload and errors. It provides centralized management of facilities, classes, and events, ensuring efficient resource allocation and facility usage. The software offers user-friendly online portals for easy access to bookings and program information, enhancing customer experience. Real-time reporting and data analytics deliver insights into attendance and preferences, aiding in strategic decision-making. Additionally, effective communication tools keep participants and staff informed with timely updates. Overall, recreation management software enhances efficiency, improves service delivery, and boosts customer satisfaction.
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns
Unlocking Business Potential: Tailored Technology Solutions by Prosigns
Discover how Prosigns, a leading technology solutions provider, partners with businesses to drive innovation and success. Our presentation showcases our comprehensive range of services, including custom software development, web and mobile app development, AI & ML solutions, blockchain integration, DevOps services, and Microsoft Dynamics 365 support.
Custom Software Development: Prosigns specializes in creating bespoke software solutions that cater to your unique business needs. Our team of experts works closely with you to understand your requirements and deliver tailor-made software that enhances efficiency and drives growth.
Web and Mobile App Development: From responsive websites to intuitive mobile applications, Prosigns develops cutting-edge solutions that engage users and deliver seamless experiences across devices.
AI & ML Solutions: Harnessing the power of Artificial Intelligence and Machine Learning, Prosigns provides smart solutions that automate processes, provide valuable insights, and drive informed decision-making.
Blockchain Integration: Prosigns offers comprehensive blockchain solutions, including development, integration, and consulting services, enabling businesses to leverage blockchain technology for enhanced security, transparency, and efficiency.
DevOps Services: Prosigns' DevOps services streamline development and operations processes, ensuring faster and more reliable software delivery through automation and continuous integration.
Microsoft Dynamics 365 Support: Prosigns provides comprehensive support and maintenance services for Microsoft Dynamics 365, ensuring your system is always up-to-date, secure, and running smoothly.
Learn how our collaborative approach and dedication to excellence help businesses achieve their goals and stay ahead in today's digital landscape. From concept to deployment, Prosigns is your trusted partner for transforming ideas into reality and unlocking the full potential of your business.
Join us on a journey of innovation and growth. Let's partner for success with Prosigns.
Cyaniclab : Software Development Agency Portfolio.pdfCyanic lab
CyanicLab, an offshore custom software development company based in Sweden, India, and Finland, is your go-to partner for startup development and innovative web design solutions. Our expert team specializes in crafting cutting-edge software tailored to meet the unique needs of startups and established enterprises alike. From conceptualization to execution, we offer comprehensive services including web and mobile app development, UI/UX design, and ongoing software maintenance. Ready to elevate your business? Contact CyanicLab today and let us propel your vision to success with our top-notch IT solutions.
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar
The European Union Agency for Law Enforcement Cooperation (Europol) has suffered an alleged data breach after a notorious threat actor claimed to have exfiltrated data from its systems. Infamous data leaker IntelBroker posted on the even more infamous BreachForums hacking forum, saying that Europol suffered a data breach this month.
The alleged breach affected Europol agencies CCSE, EC3, Europol Platform for Experts, Law Enforcement Forum, and SIRIUS. Infiltration of these entities can disrupt ongoing investigations and compromise sensitive intelligence shared among international law enforcement agencies.
However, this is neither the first nor the last activity of IntelBroker. We have compiled for you what happened in the last few days. To track such hacker activities on dark web sources like hacker forums, private Telegram channels, and other hidden platforms where cyber threats often originate, you can check SOCRadar’s Dark Web News.
Stay Informed on Threat Actors’ Activity on the Dark Web with SOCRadar!
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteGoogle
Enhancing Project Management Efficiency: Leveraging AI Tools like ChatGPT.pdfJay Das
With the advent of artificial intelligence (AI) tools, project management processes are undergoing a transformative shift. By using tools like ChatGPT and Bard, organizations can empower their leaders and managers to plan, execute, and monitor projects more effectively.
2. Agenda
• Two different companies – 2 stories
• Challenges with Big Data on premises
• Technical introduction to Amazon EMR
• Amazon EMR features and benefits
• Use case of AOL – moving a 2 PB on-premises Hadoop cluster to the AWS cloud
• Short demos
4. • In 2007, the New York Times decided to create a digital
archive on the web – all articles from 1851–1922
• 11 million articles (4 TB of data) composed of:
• 405,000 large TIFF images
• 405,000 XML files
• 3.3 million SGML files
• Used Amazon EC2 and Hadoop to process the data
7. (Undisclosed international company) –
subsidiary in France
• In 2014, decided to run a POC on Big Data analytics
• What was their first step?
They invested €7M in purchasing servers
8. “Want to increase innovation?
Lower the cost of failure.”
Joi Ito, Director of MIT Media Lab
9. How many big ticket
technology ideas can
your budget tolerate?
10. (Big) Data for Competitive Advantage
Customer segmentation
Marketing spend optimization
Financial modeling & forecasting
Ad targeting & real-time bidding
Clickstream analysis
Fraud detection
Security threat detection
11. Challenges with In-House Infrastructure
• Fixed cost
• Slow deployment cycle
• Always on
• Self-serve upgrades
• Static: not scalable
• Outages impact production
• Storage and compute coupled
13. Amazon EMR
• Managed platform
• MapReduce, Apache Spark, Presto
• Launch a cluster in minutes
• Open source distribution and MapR distribution
• Leverage the elasticity of the cloud
• Baked in security features
• Pay by the hour and save with Spot
• Flexibility to customize
14. Make it easy, secure, and cost-effective to run
data-processing frameworks on the AWS cloud
15. What Do I Need to Build a Cluster?
1. Choose instances
2. Choose your software
3. Choose your access method
16. Choice of Multiple Instances
• CPU: c3 family, cc1.4xlarge, cc2.8xlarge – machine learning
• Memory: m2 family, r3 family – in-memory (Spark & Presto)
• Disk/IO: d2 family, i2 family – large HDFS
• General: m1 family, m3 family – batch processing
24. You Are Up and Running!
Information about the software you are running, logs, and features
25. You Are Up and Running!
Infrastructure for this cluster
26. You Are Up and Running!
Security Groups and Roles
27. Use the CLI
aws emr create-cluster \
  --release-label emr-4.0.0 \
  --instance-groups \
  InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
  InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge
Or use your favorite SDK
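As an illustration of the "favorite SDK" route, the same cluster request can be expressed for boto3's EMR client, whose run_job_flow call mirrors `aws emr create-cluster`. The cluster name and IAM role names below are illustrative defaults, not values from the deck:

```python
def build_cluster_request():
    """Parameters for boto3's emr.run_job_flow, mirroring the CLI example:
    one m3.xlarge master and two m3.xlarge core nodes on emr-4.0.0."""
    return {
        "Name": "demo-cluster",  # hypothetical name
        "ReleaseLabel": "emr-4.0.0",
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceCount": 1,
                 "InstanceType": "m3.xlarge"},
                {"InstanceRole": "CORE", "InstanceCount": 2,
                 "InstanceType": "m3.xlarge"},
            ],
        },
        # Default EMR roles; replace with your own if customized.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

# Launching for real requires boto3 and AWS credentials:
#   import boto3
#   emr = boto3.client("emr")
#   cluster = emr.run_job_flow(**build_cluster_request())
```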
41. Amazon S3 is Your Persistent Data Store
• Designed for 11 9's of durability
• $0.03 / GB / month in Ireland
• Lifecycle policies
• Versioning
• Distributed by default
(Diagram: Amazon EMR connects to Amazon S3 via EMRFS)
42. The Amazon EMR File System (EMRFS)
• Allows you to leverage Amazon S3 as a file-system
• Streams data directly from Amazon S3
• Uses HDFS for intermediates
• Better read/write performance and error handling than open source components
• Consistent view – consistency for read after write
• Support for encryption
• Fast listing of objects
43. Going from HDFS to Amazon S3
CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 'samples/pig-apache/input/'
44. Going from HDFS to Amazon S3
CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 's3://elasticmapreduce.samples/pig-apache/input/'
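The only difference between the two statements is the LOCATION clause, which makes the migration mechanical. A minimal sketch of that rewrite (the helper function is illustrative, not part of the deck; the bucket name is the AWS samples bucket used above):

```python
def s3_location(bucket: str, prefix: str) -> str:
    """Build the s3:// LOCATION string for a Hive external table."""
    return f"s3://{bucket}/{prefix.strip('/')}/"

# The HDFS-relative path from slide 43 becomes the S3 URI from slide 44:
location = s3_location("elasticmapreduce.samples", "pig-apache/input")
print(location)  # s3://elasticmapreduce.samples/pig-apache/input/
```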
54. Spot Integration with Amazon EMR
• Can provision instances from the Spot market
• Impact of interruption
• Master node – Can lose the cluster
• Core node – Can lose intermediate data
• Task nodes – Jobs will restart on other nodes (application dependent)
55. Scale up with Spot Instances
10-node cluster running for 14 hours at $1.00/node/hour
Cost = 1.0 * 10 * 14 = $140
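The same arithmetic shows why adding Spot task nodes can pay off: extra capacity shortens the run, so the on-demand nodes bill for fewer hours. The $0.50 spot price and the assumption that doubling nodes halves the runtime are illustrative, not figures from the deck:

```python
ON_DEMAND_PRICE = 1.00  # $/node/hour, from the slide
SPOT_PRICE = 0.50       # assumed spot price

def cluster_cost(on_demand_nodes, spot_nodes, hours):
    """Total cost of a mixed on-demand + spot cluster run."""
    return (on_demand_nodes * ON_DEMAND_PRICE
            + spot_nodes * SPOT_PRICE) * hours

# Baseline from the slide: 10 on-demand nodes for 14 hours -> $140.
baseline = cluster_cost(10, 0, 14)

# Assumption: 10 extra spot task nodes halve the runtime to 7 hours.
scaled = cluster_cost(10, 10, 7)  # $105: faster AND cheaper

print(baseline, scaled)
```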
65. AOL Data Platforms Architecture 2014
(Diagram: AOL source systems feed an in-house Hadoop cluster; data flows to a database, then to reporting tools and users.)
66. Data Stats & Insights
• Cluster size: 2 PB
• In-house cluster: 100 nodes
• Raw data/day: 2–3 TB
• Data retention: 13–24 months
67. Challenges with In-House Infrastructure
• Fixed cost
• Slow deployment cycle
• Always on
• Self-serve upgrades
• Static: not scalable
• Outages impact production
• Storage and compute coupled
68. AOL Data Platforms Architecture 2015
(Diagram: source systems land data in Amazon S3; Amazon EMR clusters process it, with a watchdog, Amazon SNS notifications, and Amazon IAM for access control; AOL connects over AWS Direct Connect; results flow to a database, reporting tools, and users.)
69. EMR Design Options
• Transient vs. persistent cluster
• Amazon S3 vs. local HDFS
• Elastic vs. static cluster
• On-Demand vs. Reserved vs. Spot
• Core nodes vs. task nodes
70. AWS vs. In-House Cost
(Chart: monthly service cost comparison – In-House ≈ 4x per month vs. AWS ≈ 1x per month)
Source: AOL & AWS Billing Tool
** In-house cluster cost includes storage, power, and network.