Fast real-time approximations using Spark Streaming - huguk
By Kevin Schmidt (Head of Data Science at Mind Candy)
Luis Vicente (Senior Data Engineer at Mind Candy)
For mobile games, constant tweaks are the difference between success and failure. Data and analytics have to be available in real time, but calculating, for example, the uniqueness or newness of a data point requires a list of previously seen data points, which is both memory-intensive and tricky when using real-time stream processing like Spark Streaming. Probabilistic data structures allow these properties to be approximated with a fixed-size memory representation and are very well suited to this kind of stream processing. Getting from the theory of approximation to a useful metric at a low error rate, even for many millions of users, is another story. In our talk we will look at practical ways of achieving this: which approximations we used to select useful metrics, why we picked a specific probabilistic data structure, how we stored it in Cassandra as a time series, and how we implemented it in Spark Streaming.
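As a minimal sketch of the general idea (not the speakers' actual implementation), Spark's built-in approx_count_distinct, which is backed by a HyperLogLog++ sketch, can approximate unique users per window over a stream with fixed memory. The Kafka topic, JSON field names and the console sink below are placeholders; the talk stored its results in Cassandra instead.

```python
# A sketch only: approximate distinct users per 5-minute window with Spark's
# HyperLogLog++-based approx_count_distinct. Topic, fields and sink are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("approx-uniques").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "game-events")              # hypothetical topic
          .load()
          .selectExpr("CAST(value AS STRING) AS json")
          .select(F.get_json_object("json", "$.user_id").alias("user_id"),
                  F.get_json_object("json", "$.event_time").cast("timestamp").alias("event_time")))

uniques = (events
           .withWatermark("event_time", "10 minutes")
           .groupBy(F.window("event_time", "5 minutes"))
           .agg(F.approx_count_distinct("user_id", rsd=0.01).alias("approx_unique_users")))

# The talk wrote results to Cassandra; a console sink keeps the sketch self-contained.
query = (uniques.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```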
Getting Started with Amazon Redshift - AWS July 2016 Webinar Series - Amazon Web Services
Traditional data warehouses become expensive and slow down as the volume of your data grows. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it easy to analyze all of your data using existing business intelligence tools for as low as $1000/TB/year. This webinar will provide an introduction to Amazon Redshift and cover the essentials you need to deploy your data warehouse in the cloud so that you can achieve faster analytics and save costs.
Learning Objectives:
• Get an introduction to Amazon Redshift's massively parallel processing, columnar, scale-out architecture
• Learn how to configure your data warehouse cluster, optimize schema, and load data efficiently (see the loading sketch after these objectives)
• Get an overview of all the latest features including interleaved sorting and user-defined functions
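As a hedged sketch of the "load data efficiently" objective: Redshift ingests fastest through the COPY command reading files in parallel from S3. The endpoint, credentials, table definition, S3 path and IAM role below are placeholders, not values from the webinar.

```python
# Hedged sketch: create a table with distribution/sort keys and bulk-load it from S3
# with COPY. Endpoint, credentials, table, S3 path and IAM role are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical endpoint
    port=5439, dbname="analytics", user="admin", password="REPLACE_ME")

ddl = """
CREATE TABLE IF NOT EXISTS events (
    user_id    BIGINT,
    event_time TIMESTAMP,
    event_type VARCHAR(64)
)
DISTKEY (user_id)      -- spread rows across slices by user
SORTKEY (event_time);  -- sort on disk for time-range scans
"""

copy_sql = """
COPY events
FROM 's3://my-bucket/events/2016/07/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV
GZIP;
"""

with conn, conn.cursor() as cur:
    cur.execute(ddl)
    cur.execute(copy_sql)
conn.close()
```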
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez - DataWorks Summit
Last year at Yahoo, we invested significant effort in scaling and stabilizing Pig on Tez and making it production-ready, and by the end of the year we had retired running Pig jobs on MapReduce. This talk will detail the performance and resource-utilization improvements Yahoo achieved after migrating all Pig jobs to run on Tez.
After the successful migration and the resulting performance improvements, we shifted our focus to the bottlenecks we had identified and to new optimization ideas for making it go even faster. We will go over the new features and work done in Tez to make that happen, such as a custom YARN ShuffleHandler, reworked DAG scheduling order, and serialization changes.
We will also cover exciting new features added to Pig for performance, such as bloom join and bytecode generation. A distributed bloom join that can create multiple bloom filters in parallel was straightforward to implement with the flexibility of Tez DAGs; it vastly improved performance and reduced disk and network utilization for our large joins. Bytecode generation for projection and filtering of records is another big feature we are targeting for Pig 0.17, which will speed up processing by reducing virtual function calls.
Hoodie: How (and Why) We Built an Analytical Datastore on Spark - Vinoth Chandar
This talk explores the problem of ingesting petabytes of data at Uber and why the team ended up building an analytical datastore from scratch using Spark, then discusses the design choices and implementation approaches behind Hoodie, which provides near-real-time data ingestion and querying using Spark and HDFS.
https://spark-summit.org/2017/events/incremental-processing-on-large-analytical-datasets/
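As a rough illustration of the ingestion pattern Hoodie provides, here is a hedged sketch using the present-day Apache Hudi Spark datasource (Hoodie was later open-sourced as Apache Hudi); the paths, table name and field names are placeholders rather than Uber's configuration.

```python
# Hedged sketch with the present-day Apache Hudi Spark datasource (Hoodie became
# Apache Hudi). Paths, table and field names are placeholders, not Uber's setup.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hoodie-upsert")
         .getOrCreate())  # the Hudi Spark bundle must be on the classpath, e.g. via --packages

# A hypothetical batch of changed records to merge into the table.
updates = spark.read.json("hdfs:///incoming/trips/2017-06-06/")

(updates.write
    .format("hudi")
    .option("hoodie.table.name", "trips")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "trip_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.datasource.write.partitionpath.field", "city")
    .mode("append")
    .save("hdfs:///warehouse/trips"))
```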
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C... - Databricks
It is common for consumer Internet companies to start off with popular third-party tools for their analytics needs. Then, as the user base and the company grow, they end up building their own analytics data pipeline and query engine to cope with their data scale, satisfy custom data-enrichment and reporting needs, and achieve high data quality. That’s exactly the path taken at Grammarly, the popular online proofreading service.
In this session, Grammarly will share how they improved business and marketing analytics, previously done with Mixpanel, by building their own in-house analytics engine and application on top of Apache Spark. Chernetsov will touch on several Spark tweaks and gotchas they experienced along the way:
– Outputting data to several storage systems in a single Spark job
– Dealing with the Spark memory model and building a custom spillable data structure for data traversal
– Implementing a custom query language with parser combinators on top of the Spark SQL parser
– A custom query optimizer and analyzer for when you want something that is not exactly SQL
– Flexible-schema storage and querying multi-schema data with schema conflicts
– Custom aggregation functions in Spark SQL (see the sketch below)
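A hedged illustration (not Grammarly's code) of the last point: a "Series to scalar" pandas UDF acts as a custom aggregation function in Spark SQL and can be used in groupBy().agg() like a built-in aggregate.

```python
# Illustrative only (not Grammarly's code): a pandas "Series to scalar" UDF used as
# a custom aggregate in Spark SQL, computing a geometric mean per group.
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("custom-agg").getOrCreate()

@pandas_udf("double")
def geometric_mean(values: pd.Series) -> float:
    # exp(mean(log(x))) avoids overflow from multiplying many values together
    return float(np.exp(np.log(values.astype("float64")).mean()))

df = spark.createDataFrame(
    [("checkout", 120.0), ("checkout", 80.0), ("search", 30.0), ("search", 90.0)],
    ["event", "latency_ms"])

df.groupBy("event").agg(geometric_mean("latency_ms").alias("geo_mean_latency")).show()
```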
Data processing and analysis is where big data is most often consumed, driving business intelligence (BI) use cases that discover and report on meaningful patterns in the data. In this session, we will discuss options for processing, analyzing, and visualizing data. We will also look at partner solutions and BI-enabling services from AWS. Attendees will learn about optimal approaches for stream processing, batch processing, and interactive analytics with AWS services such as Amazon Machine Learning, Elastic MapReduce (EMR), and Redshift.
Created by: Jason Morris, Solutions Architect
(BDT303) Running Spark and Presto on the Netflix Big Data Platform - Amazon Web Services
In this session, we discuss how Spark and Presto complement the Netflix big data platform stack that started with Hadoop, and the use cases that Spark and Presto address. Also, we discuss how we run Spark and Presto on top of the Amazon EMR infrastructure; specifically, how we use Amazon S3 as our data warehouse and how we leverage Amazon EMR as a generic framework for data-processing cluster management.
Data collection and storage is a primary challenge for any big data architecture. This session will focus on the different types of data that customers are handling to drive high-scale workloads on AWS. Our goal is to help you choose the best approach for your workload. We will dive into optimization techniques that improve performance and reduce the cost of data ingestion, and into AWS services including Amazon S3, DynamoDB, and Kinesis.
Created by: Mark Korver, Senior Solutions Architect
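As a concrete flavour of the ingestion side, here is a minimal sketch of writing events to an Amazon Kinesis stream with boto3; the stream name and partition-key scheme are illustrative placeholders.

```python
# Minimal sketch of pushing events into an Amazon Kinesis stream with boto3.
# The stream name and partition-key choice are illustrative placeholders.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def put_event(event: dict) -> None:
    # Partitioning by user_id keeps one user's events ordered within a shard.
    kinesis.put_record(
        StreamName="clickstream",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["user_id"]),
    )

put_event({"user_id": 42, "action": "page_view", "path": "/pricing"})
```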
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy... - Databricks
Building data products requires a Lambda Architecture to bridge batch and streaming processing. AirStream is a framework built on top of Apache Spark that allows users at Airbnb to easily build data products. It has proven that Spark is impactful and useful in production for mission-critical data products.
On the streaming side, hear how AirStream integrates multiple ecosystems with Spark Streaming, such as HBase, Elasticsearch, MySQL, DynamoDB, Memcache and Redis. On the batch side, learn how to apply the same computation logic in Spark over large data sets from Hive and S3. The speakers will also go through a few production use cases, and share several best practices on how to manage Spark jobs in production.
This deck is from a seminar held jointly by Cathay Bank and the AWS User Group in Taiwan. It offers an overview of Amazon EMR and AWS Glue and presents CDK-based management of those services through practical scenarios.
Best Practices for NoSQL Workloads on Amazon EC2 and Amazon EBS - February 20... - Amazon Web Services
Learn how to optimize your NoSQL database on AWS for cost, efficiency, and scale. NoSQL databases are great for modern datasets that require simplicity in design, handle structured and unstructured data, scale horizontally, and offer finer control over availability. With AWS, you have options for running NoSQL on Amazon EC2 with Amazon EBS or on Amazon DynamoDB. This webinar will dive deep into best practices and architectural considerations for designing and managing NoSQL databases like Cassandra, MongoDB, CouchDB, and Aerospike on EC2 and EBS. We will share best practices around instance and volume selection, provide performance tuning hints, and describe cost optimization techniques.
Learning Objectives:
• Learn about common NoSQL database options and use cases for Cassandra, MongoDB, CouchDB, and Aerospike
• Review best practices around architecting on AWS for different NoSQL databases
• Understand the cost vs. performance of different Amazon EC2 instances and Amazon EBS volumes
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ... - Amazon Web Services
An advantage of leveraging Amazon Web Services for your data processing and warehousing use cases is the number of services available to construct complex, automated architectures easily. Using AWS Data Pipeline, Amazon EMR, and Amazon Redshift, we show you how to build a fault-tolerant, highly available, and highly scalable ETL pipeline and data warehouse. Coursera will show how they built their pipeline and share best practices from their architecture.
The Data Pipeline team at Demonware (Activision) deals with routing large amounts of data from various sources to many destinations every day.
Our team always wanted to be able to query processed data for debugging and analytical purposes, but creating large data warehouses was never our priority, since it usually happens downstream.
AWS Athena is a completely serverless query service that doesn't require any infrastructure setup or complex provisioning. We just needed to save some of our data streams to Amazon S3 and define a schema. Just a few simple steps, and in the end we were able to write complex SQL queries against gigabytes of data and get results in seconds.
In this presentation I want to show multiple ways to stream your data to AWS S3, explain some underlying tech, show how to define a schema and finally share some of the best practices we applied.
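A hedged sketch of that workflow with boto3: define a table over data already sitting in S3, then query it with standard SQL. The database, bucket names, columns and SerDe choice are placeholders.

```python
# Hedged sketch of the Athena workflow: define a table over raw data in S3, then
# query it with SQL. Database, buckets, columns and the SerDe choice are placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

def run(sql: str) -> str:
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "pipeline_debug"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    return resp["QueryExecutionId"]

# 1. Define a schema over the stream dumps already in S3 (no data is loaded or moved).
run("""
CREATE EXTERNAL TABLE IF NOT EXISTS events (
    user_id bigint,
    event_type string,
    ts timestamp
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-raw-data/events/'
""")

# 2. Query gigabytes of data with plain SQL; poll get_query_execution for completion.
query_id = run("SELECT event_type, count(*) FROM events GROUP BY event_type")
print("started query", query_id)
```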
Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. You can start small for just $0.25 per hour with no commitment or upfront costs and scale to a petabyte or more for $1,000 per terabyte per year, less than a tenth of most other data warehousing solutions.
See a recording of the webinar based on this presentation here on YouTube: https://youtu.be/GgLKodmL5xE
Masterclass series webinars, including on-demand access to all of this year's recorded webinars: http://aws.amazon.com/campaigns/emea/masterclass/
Journey Through the Cloud webinar series, including on-demand access to all webinars so far this year: http://aws.amazon.com/campaigns/emea/journey/
Amazon DynamoDB is a fully managed NoSQL database service for applications that need consistent, single-digit millisecond latency at any scale. This talk explores DynamoDB capabilities and benefits in detail and discusses how to get the most out of your DynamoDB database. We go over schema design best practices with DynamoDB across multiple use cases, including gaming, AdTech, IoT, and others. We also explore designing efficient indexes, scanning, and querying, and go into detail on a number of recently released features, including JSON document support, Streams, and more.
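A hedged illustration of that schema-design guidance for a gaming use case: a composite primary key (partition key plus sort key) lets a single Query fetch one player's recent scores in order. Table name, attributes and capacity numbers are placeholders.

```python
# Illustrative only: composite-key schema design in DynamoDB for a gaming workload,
# plus a single Query that reads one player's recent scores, newest first.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

table = dynamodb.create_table(
    TableName="GameScores",
    KeySchema=[
        {"AttributeName": "player_id", "KeyType": "HASH"},   # partition key
        {"AttributeName": "game_ts",   "KeyType": "RANGE"},  # sort key
    ],
    AttributeDefinitions=[
        {"AttributeName": "player_id", "AttributeType": "S"},
        {"AttributeName": "game_ts",   "AttributeType": "N"},
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 10},
)
table.wait_until_exists()

# One Query call returns this player's scores after a timestamp, newest first.
resp = table.query(
    KeyConditionExpression=Key("player_id").eq("alice") & Key("game_ts").gt(1400000000),
    ScanIndexForward=False,
)
print(resp["Items"])
```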
We’ll present details about Argus, a time-series monitoring and alerting platform developed at Salesforce to provide insight into the health of infrastructure as an alternative to systems such as Graphite and Seyren.
Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. You can start small for just $0.25 per hour with no commitment or upfront costs and scale to a petabyte or more for $1,000 per terabyte per year, less than a tenth of most other data warehousing solutions.
In this Masterclass presentation we will:
• Explore the architecture and fundamental characteristics of Amazon Redshift
• Show you how to launch Redshift clusters and load data into them (a minimal API sketch follows this list)
• Explain how to use the AWS Console to monitor and manage Redshift clusters
• Help you to discover best practices and other resources to help you get the most from Redshift
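For readers who prefer the API to the console, here is a hedged sketch of launching and checking a Redshift cluster with boto3; the identifiers, node type and credentials are placeholders.

```python
# Hedged sketch of launching and monitoring a Redshift cluster from the API rather
# than the console. Identifiers, node type and credentials are placeholders.
import boto3

redshift = boto3.client("redshift", region_name="eu-west-1")

redshift.create_cluster(
    ClusterIdentifier="masterclass-demo",
    NodeType="dc2.large",
    ClusterType="multi-node",
    NumberOfNodes=2,
    DBName="analytics",
    MasterUsername="admin",
    MasterUserPassword="REPLACE_ME_123",
)

# Poll cluster status; the console shows the same information graphically.
status = redshift.describe_clusters(ClusterIdentifier="masterclass-demo")
print(status["Clusters"][0]["ClusterStatus"])
```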
Watch the recording here: http://youtu.be/-FmCWcxRvXY
Learn more about the tools, techniques and technologies for working productively with data at any scale. This session will introduce the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Redshift, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.
Traditional data warehouses become expensive and slow down as the volume of your data grows. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it easy to analyze all of your data using existing business intelligence tools for 1/10th the traditional cost. This session will provide an introduction to Amazon Redshift and cover the essentials you need to deploy your data warehouse in the cloud so that you can achieve faster analytics and save costs. We’ll also cover the recently announced Redshift Spectrum, which allows you to query unstructured data directly from Amazon S3.
Traditional data warehouses become expensive and slow down as the volume of your data grows. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it easy to analyze all of your data using existing business intelligence tools for 1/10th the traditional cost. This session will provide an introduction to Amazon Redshift and cover the essentials you need to deploy your data warehouse in the cloud so that you can achieve faster analytics and save costs.
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift - Amazon Web Services
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all of your data for a fraction of the cost of traditional data warehouses. In this session, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices for taking advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance. We also discuss how to design optimal schemas, load data efficiently, and use workload management.
Building Analytic Apps for SaaS: “Analytics as a Service” - Amazon Web Services
TIBCO Jaspersoft® for AWS is a business intelligence suite that helps you deliver stunning interactive reports and dashboards inside your app that make it easy for your customers to get answers. Purpose-built for AWS, our reporting and analytics server quickly and easily connects to Amazon Relational Database Service (RDS), Amazon Redshift, and Amazon EMR. It includes ad-hoc reporting, dashboards, data analysis, data visualization, and data blending. In less than 10 minutes, you can be analyzing and reporting on your data. You get a full Cloud BI server starting at less than $1/hour, with no user or data limits and no additional fees.
This webinar deck shows how embeddable analytics with TIBCO Jaspersoft for AWS gives you the power to create the experience your end users demand and how to scale and manage that experience across your customer base with AWS.
AWS June Webinar Series - Getting Started: Amazon Redshift - Amazon Web Services
Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service for less than $1,000 per TB per year. In this presentation, you'll get an overview of Amazon Redshift, including how it uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. Learn how, with just a few clicks in the AWS Management Console, you can set up a fully functional data warehouse that is ready to accept data without learning any new languages and plugs in easily with the existing business intelligence tools and applications you use today. This webinar is ideal for anyone looking to gain deeper insight into their data without the usual challenges of time, cost, and effort.
In this webinar, you will learn how to:
• Understand what Amazon Redshift is and how it works
• Create a data warehouse interactively through the AWS Management Console
• Load data into your new Amazon Redshift data warehouse from S3
Who should attend:
• IT professionals, developers, and line-of-business managers
AWS Webcast - Managing Big Data in the AWS Cloud_20140924 - Amazon Web Services
This presentation deck will cover specific services such as Amazon S3, Kinesis, Redshift, Elastic MapReduce, and DynamoDB, including their features and performance characteristics. It will also cover architectural designs for the optimal use of these services based on dimensions of your data source (structured or unstructured data, volume, item size and transfer rates) and application considerations - for latency, cost and durability. It will also share customer success stories and resources to help you get started.
As part of the Introduction to AWS Workshop Series, see how to scale your website from your first user, right up to a complex architecture to support 10 million users.
Data processing and analysis is where big data is most often consumed, driving business intelligence (BI) use cases that discover and report on meaningful patterns in the data. In this session, we will discuss options for processing, analyzing, and visualizing data. We will also look at partner solutions and BI-enabling services from AWS. Attendees will learn about optimal approaches for stream processing, batch processing, and interactive analytics. AWS services covered include Amazon Machine Learning, Elastic MapReduce (EMR), and Redshift.
In this session, we will introduce Amazon Redshift, a new petabyte-scale data warehouse service. We'll walk through the basics of the Redshift architecture, launch a new cluster, and run SQL queries across a large-scale public dataset. After demonstrating how easy it is to get started with Redshift, we will show how to visualize and query large-scale datasets, running queries, reports, and analytics against millions of rows of records in just a few seconds.
In addition to running databases in Amazon EC2, AWS customers can choose among a variety of managed database services. These services save effort, save time, and unlock new capabilities and economies. In this session, we make it easy to understand how they differ, what they have in common, and how to choose one or more. We explain the fundamentals of Amazon DynamoDB, a fully managed NoSQL database service; Amazon RDS, a relational database service in the cloud; Amazon ElastiCache, a fast, in-memory caching service in the cloud; and Amazon Redshift, a fully managed, petabyte-scale data-warehouse solution that can be surprisingly economical. We will cover how each service might help support your application, how much each service costs, and how to get started. We will also have with us Jeongsang Baek, the VP of Engineering from IGAWorks, Korea’s No.1 mobile business platform, who will walk us through their architecture and share with us the key insights that they gained from using the various AWS database technologies to deliver a reliable, efficient and cost-effective experience.
Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta - huguk
As Hadoop became mainstream, the need to simplify and speed up analytics processes grew rapidly. Data wrangling emerged as a necessary step in any analytical pipeline, and is often considered to be its crux, taking as much as 80% of an analyst's time. In this presentation we will discuss how data wrangling solutions can be leveraged to streamline, strengthen and improve data analytics initiatives on Hadoop, including use cases from Trifacta customers.
Bio: Olivier is EMEA Solutions Lead at Trifacta. He has seven years of experience in analytics, with prior roles as technical lead for business analytics at Splunk and quantitative analyst at Accenture and Aon.
Stephen Taylor is the community manager for Ether Camp. They provide an analysis tool for the Ethereum blockchain, ‘Block Explorer’, and also an ‘Integrated Development Environment’ (IDE) that empowers developers to build, test and deploy applications in a sandbox environment. This November they are launching their second annual hackathon, hack.ether.camp, which aims to deliver a more sustained approach to the hackathon ideology by utilising blockchain technology.
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop - huguk
At Google Cloud Platform, we're combining the Apache Spark and Hadoop ecosystem with our software and hardware innovations. We want to make these awesome tools easier, faster, and more cost-effective, from 3 to 30,000 cores. This presentation will showcase how Google Cloud Platform is innovating with the goal of bringing the Hadoop ecosystem to everyone.
Bio: "I love data because it surrounds us - everything is data. I also love open source software, because it shows what is possible when people come together to solve common problems with technology. While they are awesome on their own, I am passionate about combining the power of open source software with the potential unlimited uses of data. That's why I joined Google. I am a product manager for Google Cloud Platform and manage Cloud Dataproc and Apache Beam (incubating). I've previously spent time hanging out at Disney and Amazon. Beyond Google, love data, amateur radio, Disneyland, photography, running and Legos."
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox... - huguk
This talk will describe his research into using Hadoop to query and manage big geographic datasets, specifically OpenStreetMap(OSM). OSM is an “open-source” map of the world, growing at a large rate, currently around 5TB of data. The talk will introduce OSM, detail some aspects of the research, but also discuss his experiences with using the SpatialHadoop stack on Azure and Google Cloud.
Extracting maximum value from data while protecting consumer privacy. Jason ... - huguk
Big organisations have a wealth of rich customer data which opens up huge new opportunities. However, they have the challenge of how to extract value from this data while protecting the privacy of their individual customers. He will talk about the risks organisations face, and what they should do about it. He will survey the techniques which can be used to make data safe for analysis, and talk briefly about how they are solving this problem at Privitar.
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson - huguk
IBM is developing the Watson Ecosystem to leverage its Developer Cloud, APIs, Content Store and Talent Hub. This is part of IBM's recent announcement of the $1B investment in Watson as a new business unit including Silicon Alley NYC headquarters. For the first time, IBM will open up Watson as a development platform in the Cloud to spur innovation and fuel a new ecosystem of entrepreneurial software app providers who will bring forward a new generation of applications infused with Watson's cognitive computing intelligence.
In this talk about Apache Flink we will touch on three main things, an introductory look at Flink, a look under the hood and a demo.
* In the introduction we will briefly look at the history of Flink and then go on to the API and different use cases. Here we will also see how it can be deployed in practice and what some of the pitfalls in a cluster setting can be.
* In the second section we will look at the streaming execution engine that lies at the heart of Flink. Here we will see what makes it tick and also what distinguishes it from other approaches, such as the mini-batch execution model.
* In the final section we will see a live demo of a fault-tolerant streaming job that performs analysis of the Wikipedia edit stream.
Ufuk Celebi - PMC member at Apache Flink and co-founder and software engineer at data Artisans
Lambda architecture on Spark, Kafka for real-time large scale ML - huguk
Sean Owen – Director of Data Science @Cloudera
Building machine learning models is all well and good, but how do they get productionized into a service? It's a long way from a Python script on a laptop, to a fault-tolerant system that learns continuously, serves thousands of queries per second, and scales to terabytes. The confederation of open source technologies we know as Hadoop now offers data scientists the raw materials from which to assemble an answer: the means to build models but also ingest data and serve queries, at scale.
This short talk will introduce Oryx 2, a blueprint for building this type of service on Hadoop technologies. It will survey the problem and the standard technologies and ideas that Oryx 2 combines: Apache Spark, Kafka, HDFS, the lambda architecture, PMML, REST APIs. The talk will touch on a key use case for this architecture -- recommendation engines.
Today’s reality Hadoop with Spark - How to select the best Data Science approa... - huguk
Martin Oberhuber and Eliano Marques, Senior Data Scientists @Think Big International
In this talk Think Big International Lead Data Scientists will discuss the options that exist today for engineering and data science teams aiming to use big data patterns to solve new business problems. With the enterprise adoption of the Hadoop ecosystem and the emerging momentum of open source projects like Spark it is becoming mandatory to have an approach that solves for business results but remains flexible to adapt and change with the open source market.
Signal Media: Real-Time Media & News Monitoring - huguk
Startup pitch presented by CTO Wesley Hall. Signal Media is a real-time media and news monitoring platform that tracks media outlets. News items are analysed for brand & media monitoring as well as market intelligence.
Startup pitch presented by Aeneas Wiener. Cytora is a real-time geopolitical risk analysis platform that extracts events from open-source intelligence and evaluates these events on their geopolitical impact.
Startup pitch presented by co-founder and CEO Jaco Els. Cubitic offers a predictive analytics platform that allows developers to build custom solutions for analytics and visualisation on top of a machine learning engine.
Startup pitch presented by co-founder and CEO Corentin Guillo. Bird.i is building a platform for up-to-date earth observation data that will bring satellite imagery to the mass market. Providing fresh imagery together with analytics around the forecast of localised demand opens up innovative opportunities in sectors like construction, tourism, real-estate and remote facility monitoring.
Startup pitch presented by co-founders Laure Andrieux and Nic Greenway. Aiseedo applies real-time machine learning, where the model of the world is constantly updated, to build adaptive systems which can be applied to robotics, the Internet of Things and healthcare.
Secrets of Spark's success - Deenar Toraskar, Think Reactive - huguk
This talk will cover the design and implementation decisions that have been key to the success of Apache Spark over competing cluster-computing frameworks. It will delve into the whitepaper behind Spark and cover the design of Spark RDDs, the abstraction that enables the Spark execution engine to be extended to support a wide variety of use cases: Spark SQL, Spark Streaming, MLlib and GraphX. RDDs allow Spark to outperform existing models by up to 100x in multi-pass analytics.
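A small illustration of why that matters for multi-pass analytics (a sketch, not code from the talk): an RDD is parsed once, cached in cluster memory, and then scanned repeatedly without re-reading the input. The HDFS path is a placeholder.

```python
# Sketch of the RDD idea: parse a dataset once, cache the partitions in memory,
# and run several passes over it without re-reading the input file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-multipass").getOrCreate()
sc = spark.sparkContext

points = (sc.textFile("hdfs:///data/points.csv")
            .map(lambda line: [float(x) for x in line.split(",")])
            .cache())                       # lineage is remembered; partitions kept in RAM

n = points.count()                          # pass 1: reads and parses the file, fills the cache
col_sums = points.reduce(lambda a, b: [x + y for x, y in zip(a, b)])      # pass 2: from memory
col_max  = points.reduce(lambda a, b: [max(x, y) for x, y in zip(a, b)])  # pass 3: from memory

print(n, [s / n for s in col_sums], col_max)
```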
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal... - huguk
Technical developments in the area of data warehousing have allowed companies to push their analysis a step further and, therefore, allowed data scientists to deliver more value to business areas. In that session, we will focus on the case of performance marketing at King and demonstrate how we use Hadoop capabilities to exploit user-level data efficiently. That approach results in obtaining a more holistic view in a return-on-investment analysis of TV advertisement.
Hadoop - Looking to the Future by Arun Murthy - huguk
Hadoop - Looking to the Future
By Arun Murthy (Founder of Hortonworks, Creator of YARN)
The Apache Hadoop ecosystem began as just HDFS & MapReduce nearly 10 years ago in 2006.
Very much like the Ship of Theseus (http://en.wikipedia.org/wiki/Ship_of_Theseus), Hadoop has undergone an incredible amount of transformation, from multi-purpose YARN to interactive SQL with Hive/Tez to machine learning with Spark.
Much more lies ahead: whether you want sub-second SQL with Hive or use SSDs/Memory effectively in HDFS or manage Metadata-driven security policies in Ranger, the Hadoop ecosystem in the Apache Software Foundation continues to evolve to meet new challenges and use-cases.
Arun C Murthy has been involved with Apache Hadoop since the beginning of the project - nearly 10 years now. In the beginning he led MapReduce, went on to create YARN and then drove Tez & the Stinger effort to get to interactive & sub-second Hive. Recently he has been very involved in the Metadata and Governance efforts. In between he founded Hortonworks, the first public Hadoop distribution company.
6. Utility computing
On demand, pay as you go, uniform, available.
Service categories: Compute, Storage, Security, Scaling, Database, Networking, Monitoring, Messaging, Workflow, DNS, Load Balancing, Backup, CDN
7. Cloud computing benefits
• No up-front capital expense
• Pay only for what you use
• Self-service infrastructure
• Easily scale up and down
• Improve agility & time-to-market
• Low cost
12. [Chart: number of EC2 instances, 12-20 April 2008. A steady state of ~40 instances scaled to a peak of 5,000 instances in 3 days ("40 servers to 5000 in 3 days") after the launch of a Facebook modification and being "Techcrunched".]
13. Global Infrastructure
[Diagram: AWS platform layers - AWS Global Infrastructure, Compute, Storage, Database, Networking, App Services, Deployment & Administration]
14. Global Infrastructure
Regions: US-EAST (Virginia), US-WEST (N. California), US-WEST (Oregon), GOV CLOUD, SOUTH AMERICA (Sao Paulo), EU-WEST (Ireland), ASIA PAC (Tokyo), ASIA PAC (Singapore), ASIA PAC (Sydney)
16. Customer Needs
• Store any amount of data - without capacity planning
• Perform complex analysis on any data - scale on demand
• Store data securely
• Decrease time to market - build environments quickly
• Reduce costs - reduce capital expenditure
• Enable global reach
18. Simple Storage Service (S3)
Highly scalable object storage for the internet; objects from 1 byte to 5 TB in size; 99.999999999% durability.
Availability: 99.99%
Durability: 99.999999999%
Paradigm: Object store - a web store, not a file system; no single points of failure; eventually consistent
Performance: Very fast
Redundancy: Across Availability Zones
Security: Public key / private key
Pricing: $0.095/GB/month
Typical use case: Write once, read many
Limits: 100 buckets, unlimited storage, 5 TB objects
(Sidebar: Elastic Block Store - high-performance block storage device, 1 GB to 1 TB in size, mounted as drives to instances with snapshot/cloning functionality.)
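A small sketch of what "object store, not a file system" means in practice: whole objects are written and read by key with boto3. Bucket and key names are placeholders.

```python
# Sketch: S3 stores whole objects addressed by bucket + key; there are no
# directories or partial in-place writes. Names below are placeholders.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

with open("part-0000.gz", "rb") as f:
    s3.put_object(Bucket="my-analytics-data",
                  Key="events/2013/07/01/part-0000.gz",
                  Body=f)

obj = s3.get_object(Bucket="my-analytics-data", Key="events/2013/07/01/part-0000.gz")
data = obj["Body"].read()
print(len(data), "bytes read back")
```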
19. Objects in S3
[Chart: total number of objects stored in Amazon S3, Q4 2006 through Q4 2012, growing to 1.3 trillion objects (14 billion, 40 billion, 102 billion, 262 billion, 762 billion along the way). Peak requests: 830,000+ per second.]
20. Glacier
Long-term object archive; extremely low cost per gigabyte; 99.999999999% durability.
Durability: 99.999999999%
Designed for archival: not a file system; vaults & archives; 3-5 hour retrieval time
Paradigm: Archive store
Performance: Configurable - low
Redundancy: Across Availability Zones
Security: Public key / private key
Pricing: $0.011/GB/month
Typical use case: Write once, read infrequently (< 10% per month)
21. Storage Lifecycle Integration
Simple Storage Service: highly scalable object storage, 1 byte to 5 TB in size, 99.999999999% durability.
Glacier: long-term object archive, extremely low cost per gigabyte, 99.999999999% durability.
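A hedged sketch of that lifecycle integration: an S3 bucket lifecycle rule transitions objects to the Glacier storage class after a set number of days. Bucket name, prefix and day counts are placeholders.

```python
# Hedged sketch: lifecycle rule moving objects under a prefix to GLACIER after 30
# days and expiring them after a year. Names and numbers are placeholders.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-analytics-data",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-old-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 365},
        }]
    },
)
```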
23. Database
Relational Database Service (RDS): managed Oracle, MySQL & SQL Server
DynamoDB: managed NoSQL database
Amazon Redshift: massively parallel, petabyte-scale data warehouse
24. Database: Relational Database Service (RDS)
Database-as-a-Service: no need to install or manage database instances
Scalable and fault-tolerant configurations
Integration with Data Pipeline
25. Database: DynamoDB
Provisioned-throughput NoSQL database
Fast, predictable, configurable performance
Fully distributed, fault-tolerant HA architecture
Integration with EMR & Hive
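A small, hedged sketch of the provisioned-throughput model: capacity is set per table and can be changed at runtime, for instance ahead of an expected traffic spike. The table name and numbers are illustrative only.

```python
# Sketch: DynamoDB capacity is a per-table dial that can be adjusted at runtime.
import boto3

ddb = boto3.client("dynamodb", region_name="us-east-1")

# Raise read capacity ahead of an expected traffic spike (placeholder values).
ddb.update_table(
    TableName="GameScores",
    ProvisionedThroughput={"ReadCapacityUnits": 200, "WriteCapacityUnits": 50},
)
```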
26. Database: Amazon Redshift
Managed, massively parallel, petabyte-scale data warehouse
Streaming backup/restore to S3
Extensive security
Scales from 2 TB to 1.6 PB
32. Amazon Data Pipeline
Input Datanode: this could be an S3 bucket, RDS table, EMR Hive table, etc.
Activity: a data aggregation, manipulation, or copy that runs on a user-configured schedule.
Output Datanode: supports all the same data sources as the input datanode, but they don’t have to be the same type.
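As a rough illustration of that input-activity-output structure, here is a hedged, partial shape of a pipeline definition in the AWS Data Pipeline object syntax; the ids, paths and schedule are placeholders, and a real definition needs additional fields (for example a resource to run the activity on).

```python
# Hedged, partial shape of a Data Pipeline definition: input datanode -> scheduled
# activity -> output datanode. Ids, paths and the schedule are placeholders.
pipeline_objects = [
    {"id": "DefaultSchedule", "type": "Schedule",
     "period": "1 day", "startAt": "FIRST_ACTIVATION_DATE_TIME"},

    {"id": "InputData", "type": "S3DataNode",              # input datanode
     "directoryPath": "s3://my-bucket/raw/#{format(@scheduledStartTime, 'YYYY-MM-dd')}"},

    {"id": "DailyCopy", "type": "CopyActivity",            # the scheduled activity
     "schedule": {"ref": "DefaultSchedule"},
     "input": {"ref": "InputData"},
     "output": {"ref": "OutputData"}},

    {"id": "OutputData", "type": "S3DataNode",             # output datanode (need not be S3)
     "directoryPath": "s3://my-bucket/processed/#{format(@scheduledStartTime, 'YYYY-MM-dd')}"},
]
```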
35. Benefits only possible in the Cloud
Benefits compared: pay as you go; lower overall costs; stop guessing capacity; agility / speed / innovation; avoid undifferentiated heavy lifting; go global in minutes.
Cloud: ✔ on all six. "Private Cloud" / on premises: ✗ on all six.
37. Ease of Operation
[Stack diagram: compute infrastructure, networking, local disk, operating system config, Hadoop configuration, HDFS, Hive, Pig, HBase, user-defined software installation]
38. Ease of Operation
[Same stack diagram as slide 37]
Multiple Hadoop distributions: open source & MapR
Clusters launched with 1 command, up in 5 minutes
Hard partitioned per customer on CPU, memory and disk
Dynamic cluster resizing
Available in any of 8 regions around the globe
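A hedged sketch of the "clusters launched with 1 command" point, using boto3's run_job_flow (a newer EMR API than the one this deck dates from); the names, roles, instance sizes and log bucket are placeholders.

```python
# Hedged sketch: launch an EMR cluster with Hive, Pig and HBase in one API call.
# Names, roles, instance types/counts and the log bucket are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-5.36.0",
    Applications=[{"Name": "Hive"}, {"Name": "Pig"}, {"Name": "HBase"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-emr-logs/",
)
print("cluster id:", response["JobFlowId"])
```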
40. Lower TCO
June 2013 study by Accenture Technology Labs, not sponsored or funded by Amazon:
"Accenture assessed the price-performance ratio between bare-metal Hadoop clusters and Hadoop-as-a-Service on Amazon Web Services…[and] revealed that Hadoop-as-a-Service offers better price-performance ratio…"
http://www.accenture.com/us-en/Pages/insight-hadoop-deployment-comparison.aspx
41. Spot 101 - What are Spot Instances
• Spot allows customers to bid on unused EC2 capacity
• The Spot price is based on supply and demand for instance types in an Availability Zone
• Customers are fulfilled when their bid price is higher than the Spot price
• Instances will be interrupted when the Spot price exceeds the bid price
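A hedged sketch of the bidding model described above, using the classic request_spot_instances API; the AMI id, instance type, key name and bid price are placeholders.

```python
# Sketch of a Spot bid: ask for capacity with a maximum hourly price; the request
# is fulfilled while the Spot price stays below the bid. Values are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.request_spot_instances(
    SpotPrice="0.05",                       # maximum price (the "bid") per instance-hour
    InstanceCount=2,
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",  # hypothetical AMI
        "InstanceType": "m5.large",
        "KeyName": "my-key",
    },
)
for req in resp["SpotInstanceRequests"]:
    print(req["SpotInstanceRequestId"], req["State"])
```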