(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

November 13th, 2014 | Las Vegas, NV
Ian Meyers, Amazon Web Services

[Diagram: AWS service stack: Compute, Storage, Database, Networking, Analytics, App Services, Deployment & Administration, built on AWS Global Infrastructure]
Amazon Elastic MapReduce
Managed, elastic Hadoop (1.x & 2.x) cluster
Integrates with Amazon S3, Amazon DynamoDB, Amazon Kinesis and Amazon Redshift
Install Storm, Spark, Presto, Hive, Pig, Impala, & end-user tools automatically
Native support for Spot Instances
Integrated HBase NoSQL database
Amazon EMR

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop
--keyword-config-file: merge values in new config into existing
--keyword-key-value: override values provided
Configuration File Name   Configuration File Keyword   File Name Shortcut   Key-Value Pair Shortcut
core-site.xml             core                         C                    c
hdfs-site.xml             hdfs                         H                    h
mapred-site.xml           mapred                       M                    m
yarn-site.xml             yarn                         Y                    y
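
The key-value shortcuts in the table above can be assembled into an argument list for the configure-hadoop bootstrap action. The following is only a sketch: the helper name `configure_hadoop_args` and the sample property value are illustrative assumptions, not part of the original slides.

```python
# Key-value pair shortcuts from the table above, keyed by configuration file keyword.
KEY_VALUE_SHORTCUTS = {"core": "c", "hdfs": "h", "mapred": "m", "yarn": "y"}


def configure_hadoop_args(overrides):
    """Build an argument list for the configure-hadoop bootstrap action.

    `overrides` maps a configuration file keyword (for example "mapred") to a
    dict of property names and values. Hypothetical helper, for illustration.
    """
    args = []
    for keyword, properties in overrides.items():
        shortcut = KEY_VALUE_SHORTCUTS[keyword]  # raises KeyError for an unknown keyword
        for name, value in properties.items():
            args += [f"-{shortcut}", f"{name}={value}"]
    return args


if __name__ == "__main__":
    # Sample value only; pick a mapper count that suits your instance type.
    example = {"mapred": {"mapred.tasktracker.map.tasks.maximum": 4}}
    print(",".join(configure_hadoop_args(example)))
```

Joining the list with commas gives a string in the comma-separated form that bootstrap action arguments are usually passed in.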

Set number of mappers per task tracker
Useful for small memory footprint map tasks
More work done with a given instance

Set HDFS block size to 1MB
Useful for smaller files when HDFS is used

Reuse mappers
Mapper startup time ~ 2-20 seconds
Useful for tasks with large number of mappers
Mappers must be “clean” after run (relevant for Java)
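
To see why reuse matters for jobs with many mappers, the startup range quoted above can be multiplied out. This is a back-of-the-envelope sketch; the job size of 10,000 mappers is an assumed figure, and the result is cumulative time across all task slots, not wall-clock time.

```python
def startup_overhead_seconds(num_mappers, startup_seconds):
    """Cumulative time spent starting mappers when every task starts a fresh one."""
    return num_mappers * startup_seconds


if __name__ == "__main__":
    num_mappers = 10_000  # assumed job size, not taken from the slides
    low = startup_overhead_seconds(num_mappers, 2)    # lower bound: ~2 s per mapper
    high = startup_overhead_seconds(num_mappers, 20)  # upper bound: ~20 s per mapper
    print(f"{low} to {high} seconds of cumulative mapper startup time")
```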

Configure process heap size and Java opts, and allow for replacing hadoop-user-env.sh
Hadoop 1
Hadoop 2
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons
--args --{namenode}-heap-size=2048,
--{namenode}-opts=-XX:GCTimeRatio=19

[Diagram: EMRFS. Amazon EMR (EMRFS and HDFS) with file data in Amazon S3 and a processed files registry in Amazon DynamoDB]

aws emr add-steps --cluster-id <cluster>
--steps Name=GroupSmallFiles,
Type=CUSTOM_JAR,
Args=files,home/hadoop/lib/emr-s3distcp-1.0.jar,
src,s3://myawsbucket/cf,
dest,hdfs:///local,
groupBy,.*(i-w.log).*,
targetSize,128…

Algorithm   % Space Remaining   Encoding Speed   Decoding Speed
GZIP        13%                 21 MB/s          118 MB/s
LZO         20%                 135 MB/s         410 MB/s
Snappy      22%                 172 MB/s         409 MB/s
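
The trade-off in the table above can be made concrete with a small calculation. This sketch assumes the quoted speeds are measured against the uncompressed data size, which the slide does not state; the function name `estimate` and the 1,000 MB input are illustrative.

```python
# Figures copied from the table above:
# (fraction of space remaining, encoding MB/s, decoding MB/s).
CODECS = {
    "GZIP": (0.13, 21, 118),
    "LZO": (0.20, 135, 410),
    "Snappy": (0.22, 172, 409),
}


def estimate(codec, input_mb):
    """Return (compressed size in MB, encode seconds, decode seconds).

    Assumes throughput is expressed in uncompressed megabytes per second.
    """
    remaining, encode_rate, decode_rate = CODECS[codec]
    return input_mb * remaining, input_mb / encode_rate, input_mb / decode_rate


if __name__ == "__main__":
    for name in CODECS:
        size, encode_s, decode_s = estimate(name, 1000)  # hypothetical 1,000 MB input
        print(f"{name}: {size:.0f} MB, encode {encode_s:.1f} s, decode {decode_s:.1f} s")
```

Under that assumption, GZIP yields the smallest output but takes several times longer to encode than LZO or Snappy.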

[Diagram: Amazon EMR cluster with a task instance group and a core instance group (HDFS), alongside Amazon S3]

Hive 0.13.1
• Support for ORC
• Window functions
• Decimal types
• TRUNCATE command
• Better optimiser (less need for hinting)
Pig 0.12.0
• Streaming UDFs not written in Java
• Native support for Avro
• Native support for Parquet
• Improved data types
Impala 1.1
• In-memory SQL engine
• Support for HBase tables
• Support for Parquet (column-oriented file format)
• Query and interactive shells
HBase 0.94.18
• Database snapshotting
• Improved read caching and seek optimisation
• Improved transactions

Read Data Directly into Hive, Pig, Streaming and Cascading from Kinesis Streams
No Intermediate Data Persistence Required
Simple way to introduce real time sources into Batch Oriented Systems
Multi-Application Support & Automatic Checkpointing
Amazon EMR Integration with Amazon Kinesis

DROP TABLE call_data_records;
CREATE TABLE call_data_records (
  start_time BIGINT,
  end_time BIGINT,
  phone_number STRING,
  carrier STRING,
  recorded_duration BIGINT,
  calculated_duration BIGINT,
  lat DOUBLE,
  long DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ","
STORED BY
'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
TBLPROPERTIES("kinesis.stream.name"="TestAggregatorStream");
Amazon EMR Integration with Amazon Kinesis
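
The table definition above implies that each record in the stream is a single comma-delimited line with eight fields. A minimal sketch of that record layout follows; the `parse_record` helper and the sample line are made up for illustration and do not come from the presentation.

```python
# Column names and Python types mirroring the Hive table definition above.
COLUMNS = [
    ("start_time", int),
    ("end_time", int),
    ("phone_number", str),
    ("carrier", str),
    ("recorded_duration", int),
    ("calculated_duration", int),
    ("lat", float),
    ("long", float),
]


def parse_record(line):
    """Split one comma-delimited record and cast each field to its column type."""
    values = line.rstrip("\n").split(",")
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return {name: cast(value) for (name, cast), value in zip(COLUMNS, values)}


if __name__ == "__main__":
    # Hypothetical sample record, invented for this example.
    sample = "1415836800,1415836860,5550100,ExampleCarrier,60,60,36.17,-115.14"
    print(parse_record(sample))
```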

EC2 Instance                Map Tasks   Reduce Tasks
m1.small                    2           1
m1.large                    3           1
m1.xlarge                   8           3
m2.xlarge                   3           1
m2.2xlarge                  6           2
m2.4xlarge                  14          4
m3.xlarge                   6           1
m3.2xlarge                  12          3
cg1.4xlarge                 12          3
cc2.8xlarge                 24          6
c3.4xlarge                  24          6
hi1.4xlarge                 24          6
hs1.8xlarge                 24          6
cr1.8xlarge & c3.8xlarge    48          12
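
The per-instance figures above scale linearly with node count, so the total number of concurrent tasks in a cluster can be sketched as below. The function name `cluster_capacity` is an assumption made for this example, and the calculation ignores any tuning of the task slot defaults.

```python
# (map tasks, reduce tasks) per instance, copied from the table above.
TASK_CAPACITY = {
    "m1.small": (2, 1),
    "m1.large": (3, 1),
    "m1.xlarge": (8, 3),
    "m2.xlarge": (3, 1),
    "m2.2xlarge": (6, 2),
    "m2.4xlarge": (14, 4),
    "m3.xlarge": (6, 1),
    "m3.2xlarge": (12, 3),
    "cg1.4xlarge": (12, 3),
    "cc2.8xlarge": (24, 6),
    "c3.4xlarge": (24, 6),
    "hi1.4xlarge": (24, 6),
    "hs1.8xlarge": (24, 6),
    "cr1.8xlarge": (48, 12),
    "c3.8xlarge": (48, 12),
}


def cluster_capacity(instance_type, node_count):
    """Return (concurrent map tasks, concurrent reduce tasks) for the cluster."""
    maps, reduces = TASK_CAPACITY[instance_type]
    return maps * node_count, reduces * node_count


if __name__ == "__main__":
    print(cluster_capacity("m1.xlarge", 10))  # ten nodes is an arbitrary example size
```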

[Chart: instance types compared by Memory (GB), Mappers*, Reducers*, CPU (ECU Units), and Local Storage (GB)]

Instance      Cost / Map Task   Cost / Reduce Task
m1.large      $0.08             $0.15
m1.xlarge     $0.06             $0.15
m3.xlarge     $0.04             $0.07
m3.2xlarge    $0.04             $0.07

Instance      Cost / Map Task   Cost / Reduce Task
c1.medium     $0.13             $0.13
c1.xlarge     $0.35             $0.70
c3.xlarge     $0.05             $0.11
c3.2xlarge    $0.05             $0.11

Estimated number of nodes = (Total tasks * Time to process sample files) / (Instance task capacity * Desired processing time)

1. Estimate the number of tasks your job requires
   150
2. Pick an instance and note down the number of tasks it can run in parallel
   m1.xlarge with 8 task capacity per instance
3. Pick some sample data files to run a test workload. The number of sample files should match the task capacity from step 2.
   8 files selected for our sample test
4. Run an Amazon EMR cluster with a single core node and process your sample files from step 3. Note down the time taken to process this dataset.
   3 min to process 8 files

Estimated number of nodes = (Total tasks for your job * Time to process sample files) / (Per instance task capacity * Desired processing time)
= (150 * 3 min) / (8 * 5 min)
≈ 11 m1.xlarge
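
The sizing arithmetic above can be written as a short function. This is a sketch of the formula as stated on the slide; the function and parameter names are chosen for this example, and the decision to round up or down is left to the reader.

```python
import math


def estimate_nodes(total_tasks, sample_minutes, tasks_per_instance, desired_minutes):
    """Estimated node count: (total tasks * sample time) / (task capacity * desired time)."""
    return (total_tasks * sample_minutes) / (tasks_per_instance * desired_minutes)


if __name__ == "__main__":
    # Values from the worked example: 150 tasks, 3 min per sample run,
    # 8 tasks per m1.xlarge, 5 min desired processing time.
    raw = estimate_nodes(150, 3, 8, 5)
    print(f"raw estimate: {raw}")               # the slide rounds this to 11
    print(f"rounded up:   {math.ceil(raw)}")    # conservative choice to meet the target
```

The slide rounds the result to 11 nodes; rounding up instead gives a little headroom if the desired processing time is a hard limit.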

Core instance group
Run TaskTrackers (compute)
Run DataNode (HDFS)
[Diagram: Amazon EMR cluster with a master instance group and a core instance group holding HDFS]

Can add core nodes
More HDFS space
More CPU/memory
[Diagram: Amazon EMR cluster whose core instance group has grown to three HDFS nodes]

Can’t remove core nodes because of HDFS
[Diagram: Amazon EMR cluster with a core instance group of three HDFS nodes]

Task instance group
Run TaskTrackers
No HDFS
Reads from core node HDFS
[Diagram: Amazon EMR cluster with a task instance group next to a core instance group holding HDFS]

Can add task nodes
[Diagram: Amazon EMR cluster with a task instance group and a core instance group holding HDFS]

More CPU power
More memory
[Diagram: Amazon EMR cluster with an enlarged task instance group and a core instance group holding HDFS]

You can remove task nodes when processing is completed
[Diagram: Amazon EMR cluster with a task instance group and a core instance group holding HDFS]
