Amazon Elastic MapReduce (Amazon EMR) is one of the largest Hadoop operators in the world. Since its launch five years ago, AWS customers have launched more than 5.5 million Hadoop clusters. In this talk, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long-lived and short-lived clusters, and other Amazon EMR architectural patterns. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost-efficient.
6. --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop
--keyword-config-file: merge values in new config to existing
--keyword-key-value: override values provided
Configuration File Name | Configuration File Keyword | File Name Shortcut | Key-Value Pair Shortcut
core-site.xml | core | C | c
hdfs-site.xml | hdfs | H | h
mapred-site.xml | mapred | M | m
yarn-site.xml | yarn | Y | y
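As a rough illustration of how the key-value shortcuts in the table could be assembled into bootstrap-action arguments, here is a minimal Python sketch. The helper name `key_value_args`, the leading `-` on each shortcut, and the example value 4 are assumptions made for illustration; `mapred.tasktracker.map.tasks.maximum` is the Hadoop 1 property for map slots per TaskTracker, used only as a sample setting.

```python
# Shortcut letters from the table: (file name shortcut, key-value pair shortcut).
SHORTCUTS = {
    "core": ("C", "c"),
    "hdfs": ("H", "h"),
    "mapred": ("M", "m"),
    "yarn": ("Y", "y"),
}


def key_value_args(keyword, settings):
    """Build ['-m', 'key=value', ...] style arguments for one config keyword.

    Hypothetical helper: the '-' prefix on the shortcut is an assumption.
    """
    if keyword not in SHORTCUTS:
        raise KeyError(f"unknown configuration keyword: {keyword}")
    _, kv_flag = SHORTCUTS[keyword]
    args = []
    for key, value in settings.items():
        args += [f"-{kv_flag}", f"{key}={value}"]
    return args


# Illustrative value only: 4 map slots per TaskTracker.
example = key_value_args("mapred", {"mapred.tasktracker.map.tasks.maximum": 4})
print(",".join(example))
```

The lower-case shortcut is chosen because the table lists it as the key-value form; swapping in the upper-case letter would correspond to passing a whole configuration file instead.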
7. Set number of mappers per task tracker
Useful for small memory footprint map tasks
More work done with a given instance
8. Set HDFS block size to 1MB
Useful for smaller files when HDFS is used
9. Reuse mappers
Mapper startup time ~ 2-20 seconds
Useful for tasks with large number of mappers
Mappers must be “clean” after run (relevant for Java)
10. Configure process heap size, Java opts, and allow for replacing the hadoop-user-env.sh
Hadoop 1
Hadoop 2
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons
--args --{namenode}-heap-size=2048,
--{namenode}-opts=-XX:GCTimeRatio=19
42. Hive 0.13.1
•Support for ORC
•Window functions
•Decimal types
•TRUNCATE command
•Better optimiser (less need for hinting)
Pig 0.12.0
•Streaming UDFs not written in Java
•Native support for Avro
•Native support for Parquet
•Improved data types
Impala 1.1
•In-memory SQL engine
•Support for HBase tables
•Support for Parquet – column-oriented file format
•Query and interactive shells
HBase 0.94.18
•Database Snapshotting
•Improved read caching and seek optimisation
•Improved transactions
43. Read Data Directly into Hive, Pig, Streaming and Cascading from Kinesis Streams
No Intermediate Data Persistence Required
Simple way to introduce real time sources into Batch Oriented Systems
Multi-Application Support & Automatic Checkpointing
Amazon EMR Integration with Amazon Kinesis
44. drop table call_data_records;
CREATE TABLE call_data_records(
  start_time bigint,
  end_time bigint,
  phone_number STRING,
  carrier STRING,
  recorded_duration bigint,
  calculated_duration bigint,
  lat double,
  long double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ","
STORED BY
'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
TBLPROPERTIES("kinesis.stream.name"="TestAggregatorStream");
58. Estimated number of nodes = (Total tasks * Time to process sample files) / (Instance task capacity * Desired processing time)
59. 1. Estimate the number of tasks your job requires
150
2. Pick an instance type and note the number of tasks it can run in parallel
m1.xlarge with 8 task capacity per instance
60. 3. Pick some sample data files to run a test workload. The number of sample files should equal the per-instance task capacity from step 2.
8 files selected for our sample test
61. 4. Run an Amazon EMR cluster with a single core node and process your sample files from step 3. Note the time taken to process this dataset.
3 min to process 8 files
62. Estimated number of nodes = (Total tasks for your job * Time to process sample files) / (Per instance task capacity * Desired processing time)
= (150 * 3 min) / (8 * 5 min)
≈ 11 m1.xlarge
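The sizing arithmetic can be reproduced with a short Python sketch. The function name `estimate_nodes` and its parameter names are hypothetical; the inputs are the figures from the worked example (150 tasks, 3 minutes for the sample run, 8 tasks per m1.xlarge, 5 minutes desired processing time).

```python
def estimate_nodes(total_tasks, sample_minutes, tasks_per_instance, desired_minutes):
    """Return the raw, unrounded node estimate.

    (total tasks * time to process sample files)
    / (per-instance task capacity * desired processing time)
    """
    if tasks_per_instance <= 0 or desired_minutes <= 0:
        raise ValueError("task capacity and desired time must be positive")
    return (total_tasks * sample_minutes) / (tasks_per_instance * desired_minutes)


raw = estimate_nodes(total_tasks=150, sample_minutes=3,
                     tasks_per_instance=8, desired_minutes=5)
print(raw)         # 11.25
print(round(raw))  # 11, the figure reported on the slide
```

The slide rounds 11.25 down to 11 instances. Rounding up with `math.ceil` would give 12, which may be the safer choice if the desired processing time is a hard limit.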
65. Core instance group
Run TaskTrackers (Compute)
Run DataNode (HDFS)
[Diagram: Amazon EMR cluster with a master instance group and a core instance group holding HDFS]
66. Can add core nodes
More HDFS space
More CPU/memory
[Diagram: Amazon EMR cluster with a master instance group and an enlarged core instance group holding HDFS]
67. Can’t remove core nodes because of HDFS
[Diagram: Amazon EMR cluster with a master instance group and a core instance group holding HDFS]
68. Task instance group
Run TaskTrackers
No HDFS
Reads from core node HDFS
[Diagram: Amazon EMR cluster with a master instance group, a core instance group holding HDFS, and a task instance group]
69. Can add task nodes
[Diagram: Amazon EMR cluster with a master instance group, a core instance group holding HDFS, and a task instance group]
70. More CPU power
More memory
[Diagram: Amazon EMR cluster with a master instance group, a core instance group holding HDFS, and an enlarged task instance group]
71. You can remove task nodes when processing is completed
[Diagram: Amazon EMR cluster with a master instance group, a core instance group holding HDFS, and a task instance group]
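The resizing rules for core and task instance groups can be summarised in a small Python sketch: core nodes can be added but not removed because they hold HDFS, while task nodes can be added or removed freely. The `InstanceGroup` class and `validate_resize` function are hypothetical names used only to model the rules; they are not part of any Amazon EMR API.

```python
from dataclasses import dataclass


@dataclass
class InstanceGroup:
    role: str   # "core" or "task" in this sketch
    count: int  # current number of nodes in the group


def validate_resize(group, new_count):
    """Return new_count if the change follows the rules above, else raise."""
    if group.role not in ("core", "task"):
        raise ValueError(f"unsupported role in this sketch: {group.role}")
    if new_count < 0:
        raise ValueError("instance count cannot be negative")
    if group.role == "core" and new_count < group.count:
        # Core nodes run DataNode, so shrinking the group would lose HDFS blocks.
        raise ValueError("core nodes hold HDFS data and cannot be removed")
    return new_count


core = InstanceGroup(role="core", count=2)
task = InstanceGroup(role="task", count=4)
print(validate_resize(core, 3))  # growing the core group is allowed
print(validate_resize(task, 0))  # task nodes can all be removed after processing
```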