Amazon Elastic MapReduce 
Developer Guide 
API Version 2009-11-30
Amazon Elastic MapReduce: Developer Guide 
Amazon Web Services 
Copyright © 2012 Amazon Web Services, Inc. and/or its affiliates. All rights reserved. 
The following are trademarks or registered trademarks of Amazon: Amazon, Amazon.com, Amazon.com 
Design, Amazon DevPay, Amazon EC2, Amazon Web Services Design, AWS, CloudFront, EC2, Elastic 
Compute Cloud, Kindle, and Mechanical Turk. In addition, Amazon.com graphics, logos, page headers, 
button icons, scripts, and service names are trademarks, or trade dress of Amazon in the U.S. and/or other 
countries. Amazon's trademarks and trade dress may not be used in connection with any product or service 
that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner 
that disparages or discredits Amazon. 
All other trademarks not owned by Amazon are the property of their respective owners, who may or may 
not be affiliated with, connected to, or sponsored by Amazon.
Welcome ................................................................................................................................................. 1 
Understand Amazon EMR ...................................................................................................................... 2 
Overview of Amazon EMR ...................................................................................................................... 2 
Architectural Overview of Amazon EMR ....................................................................................... 3 
Elastic MapReduce Features ........................................................................................................ 4 
Amazon EMR Concepts ......................................................................................................................... 6 
Job Flows and Steps ..................................................................................................................... 6 
Hadoop and MapReduce .............................................................................................................. 7 
Associated AWS Product Concepts ...................................................................................................... 11 
Using Amazon EMR ............................................................................................................................. 15 
Setting Up Your Environment to Run a Job Flow .................................................................................. 17 
Create a Job Flow ................................................................................................................................. 23 
How to Create a Streaming Job Flow .......................................................................................... 24 
How to Create a Job Flow Using Hive ......................................................................................... 32 
How to Create a Job Flow Using Pig ........................................................................................... 40 
How to Create a Job Flow Using a Custom JAR ......................................................................... 48 
How to Create a Cascading Job Flow ......................................................................................... 56 
Launch an HBase Cluster on Amazon EMR ............................................................................... 64 
View Job Flow Details ........................................................................................................................... 72 
Terminate a Job Flow ............................................................................................................................ 77 
Customize a Job Flow .......................................................................................................................... 79 
Add Steps to a Job Flow ............................................................................................................. 79 
Wait for Steps to Complete ................................................................................................ 81 
Add More than 256 Steps to a Job Flow ............................................................................ 82 
Bootstrap Actions ........................................................................................................................ 84 
Resizing Running Job Flows ....................................................................................................... 96 
Calling Additional Files and Libraries ........................................................................................ 104 
Using Distributed Cache .................................................................................................. 104 
Running a Script in a Job Flow ........................................................................................ 109 
Connect to the Master Node in an Amazon EMR Job Flow ............................................................... 110 
Connect to the Master Node Using SSH ................................................................................... 111 
Web Interfaces Hosted on the Master Node ............................................................................. 115 
Open an SSH Tunnel to the Master Node ................................................................................. 116 
Configure Foxy Proxy to View Websites Hosted on the Master Node ....................................... 117 
Use Cases .......................................................................................................................................... 122 
Cascading ................................................................................................................................. 122 
Pig ............................................................................................................................................. 126 
Streaming .................................................................................................................................. 129 
Building Binaries Using Amazon EMR ................................................................................................ 131 
Using Tagging ..................................................................................................................................... 136 
Protect a Job Flow from Termination .................................................................................................. 136 
Lower Costs with Spot Instances ........................................................................................................ 141 
Choosing What to Launch as Spot Instances ........................................................................... 142 
Spot Instance Pricing in Amazon EMR ..................................................................................... 144 
Availability Zones and Regions ................................................................................................. 144 
Launching Spot Instances in Job Flows .................................................................................... 145 
Changing the Number of Spot Instances in a Job Flow ............................................................ 151 
Troubleshooting Spot Instances ................................................................................................ 154 
Store Data with HBase ....................................................................................................................... 155 
HBase Job Flow Prerequisites .................................................................................................. 155 
Launch an HBase Cluster on Amazon EMR ............................................................................. 156 
Connect to HBase Using the Command Line ............................................................................ 164 
Back Up and Restore HBase .................................................................................................... 165 
Terminate an HBase Cluster ..................................................................................................... 174 
Configure HBase ....................................................................................................................... 174 
Access HBase Data with Hive ................................................................................................... 178 
View the HBase User Interface ................................................................................................. 180 
View HBase Log Files ............................................................................................................... 180 
Monitor HBase with CloudWatch ............................................................................................... 181 
Monitor HBase with Ganglia ...................................................................................................... 181 
Troubleshooting .................................................................................................................................. 183 
Things to Check When Your Amazon EMR Job Flow Fails ....................................................... 183 
Amazon EMR Logging .............................................................................................................. 187 
Enable Logging and Debugging ................................................................................................ 187 
Use Log Files ............................................................................................................................ 190 
Monitor Hadoop on the Master Node ........................................................................................ 199 
View the Hadoop Web Interfaces .............................................................................................. 200 
Troubleshooting Tips ................................................................................................................. 204 
Monitor Metrics with Amazon CloudWatch ......................................................................................... 209 
Monitor Performance with Ganglia ...................................................................................................... 220 
Distributed Copy Using S3DistCp ....................................................................................................... 227 
Export, Query, and Join Tables in Amazon DynamoDB ...................................................................... 234 
Prerequisites for Integrating Amazon EMR ............................................................................... 235 
Step 1: Create a Key Pair .......................................................................................................... 235 
Step 2: Create a Job Flow ......................................................................................................... 236 
Step 3: SSH into the Master Node ............................................................................................ 241 
Step 4: Set Up a Hive Table to Run Hive Commands ................................................................ 244 
Hive Command Examples for Exporting, Importing, and Querying Data .................................. 248 
Optimizing Performance ............................................................................................................ 255 
Use Third Party Applications With Amazon EMR ............................................................................... 258 
Parse Data with HParser ........................................................................................................... 258 
Using Karmasphere Analytics ................................................................................................... 259 
Launch a Job Flow on the MapR Distribution for Hadoop ......................................................... 260 
Write Amazon EMR Applications ........................................................................................................ 263 
Common Concepts for API Calls ........................................................................................................ 263 
Use SDKs to Call Amazon EMR APIs ................................................................................................ 265 
Using the AWS SDK for Java to Create an Amazon EMR Job Flow ......................................... 266 
Using the AWS SDK for .Net to Create an Amazon EMR Job Flow .......................................... 267 
Using the Java SDK to Sign a Query Request .......................................................................... 267 
Use Query Requests to Call Amazon EMR APIs ............................................................................... 268 
Why Query Requests Are Signed ............................................................................................. 269 
Components of a Query Request in Amazon EMR ................................................................... 269 
How to Generate a Signature for a Query Request in Amazon EMR ........................................ 270 
Configure Amazon EMR ..................................................................................................................... 274 
Configure User Permissions with IAM ................................................................................................ 274 
Set Policy for an IAM User ........................................................................................................ 277 
Configure IAM Roles for Amazon EMR .............................................................................................. 280 
Set Access Permissions on Files Written to Amazon S3 .................................................................... 285 
Using Elastic IP Addresses ................................................................................................................. 287 
Specify the Amazon EMR AMI Version ............................................................................................... 290 
Hadoop Configuration ......................................................................................................................... 299 
Supported Hadoop Versions ..................................................................................................... 300 
Configuration of hadoop-user-env.sh ........................................................................................ 302 
Upgrading to Hadoop 1.0 .......................................................................................................... 302 
Hadoop Version Behavior ................................................................................................ 303 
Hadoop 0.20 Streaming Configuration ...................................................................................... 304 
Hadoop Default Configuration (AMI 1.0) ................................................................................... 304 
Hadoop Configuration (AMI 1.0) ...................................................................................... 304 
HDFS Configuration (AMI 1.0) ......................................................................................... 307 
Task Configuration (AMI 1.0) ........................................................................................... 308 
Intermediate Compression (AMI 1.0) ............................................................................... 311 
Hadoop Memory-Intensive Configuration Settings (AMI 1.0) ................................................... 311 
Hadoop Default Configuration (AMI 2.0 and 2.1) ...................................................................... 314 
Hadoop Configuration (AMI 2.0 and 2.1) ......................................................................... 314 
HDFS Configuration (AMI 2.0 and 2.1) ............................................................................ 318 
Task Configuration (AMI 2.0 and 2.1) .............................................................................. 318 
Intermediate Compression (AMI 2.0 and 2.1) .................................................................. 321 
Hadoop Default Configuration (AMI 2.2) ................................................................................... 322 
Hadoop Configuration (AMI 2.2) ...................................................................................... 322 
HDFS Configuration (AMI 2.2) ......................................................................................... 326 
Task Configuration (AMI 2.2) ........................................................................................... 326 
Intermediate Compression (AMI 2.2) ............................................................................... 329 
Hadoop Default Configuration (AMI 2.3) ................................................................................... 330 
Hadoop Configuration (AMI 2.3) ...................................................................................... 330 
HDFS Configuration (AMI 2.3) ......................................................................................... 334 
Task Configuration (AMI 2.3) ........................................................................................... 334 
Intermediate Compression (AMI 2.3) ............................................................................... 337 
File System Configuration ......................................................................................................... 338 
JSON Configuration Files .......................................................................................................... 340 
Multipart Upload ........................................................................................................................ 343 
Hadoop Data Compression ....................................................................................................... 344 
Setting Permissions on the System Directory ........................................................................... 345 
Hadoop Patches ........................................................................................................................ 346 
Hive Configuration .............................................................................................................................. 348 
Supported Hive Versions ........................................................................................................... 349 
Share Data Between Hive Versions ........................................................................................... 353 
Differences from Apache Hive Defaults .................................................................................... 353 
Interactive and Batch Modes ..................................................................................................... 355 
Creating a Metastore Outside the Hadoop Cluster ................................................................... 357 
Using the Hive JDBC Driver ...................................................................................................... 359 
Additional Features of Hive in Amazon EMR ............................................................................ 362 
Upgrade to Hive 0.8 .................................................................................................................. 368 
Upgrade the Configuration Files ...................................................................................... 368 
Upgrade the Metastore .................................................................................................... 369 
Upgrade to Hive 0.8 (MySQL on the Master Node) ................................................ 369 
Upgrade to Hive 0.8 (MySQL on Amazon RDS) ..................................................... 373 
Pig Configuration ................................................................................................................................ 377 
Supported Pig Versions ............................................................................................................. 377 
Pig Version Details .................................................................................................................... 379 
Performance Tuning ............................................................................................................................ 381 
Running Job Flows on an Amazon VPC ............................................................................................. 381 
Appendix: Compare Job Flow Types ................................................................................................... 389 
Appendix: Amazon EMR Resources ................................................................................................... 391 
Document History ............................................................................................................................... 396 
Glossary ............................................................................................................................................. 393 
Index ................................................................................................................................................... 401 
Welcome 
This is the Amazon Elastic MapReduce (Amazon EMR) Developer Guide. This guide provides a conceptual 
overview of Amazon EMR, an overview of related AWS products, and detailed information on all 
functionality available from Amazon EMR. 
Amazon EMR is a web service that makes it easy to process large amounts of data efficiently. Amazon 
EMR uses Hadoop processing combined with several AWS products to do such tasks as web indexing, 
data mining, log file analysis, machine learning, scientific simulation, and data warehousing. 
How Do I...? 
How Do I? | Relevant Sections 
Decide whether Amazon EMR is right for my needs | Amazon Elastic MapReduce detail page 
Get started with Amazon EMR | Getting Started Guide 
Learn about troubleshooting job flows | Troubleshooting (p. 183) 
Learn how to create a job flow | Create a Job Flow (p. 23) 
Learn about bootstrap actions | Bootstrap Actions (p. 84) 
Learn about Hadoop cluster configuration | Hadoop Configuration (p. 299) 
Learn about the Amazon EMR API | Write Amazon EMR Applications (p. 263) 
Compare different job flow types | Appendix: Compare Job Flow Types (p. 389) 
Understand Amazon EMR 
Topics 
• Overview of Amazon EMR (p. 2) 
• Amazon EMR Concepts (p. 6) 
• Associated AWS Product Concepts (p. 11) 
This introduction to Amazon Elastic MapReduce (Amazon EMR) provides a summary of this web service. 
After reading this section, you should understand the service features, know how Amazon EMR interacts 
with other AWS products, and understand the basic functions of Amazon EMR. 
In this guide, we assume that you have read and completed the instructions described in the Getting 
Started Guide, which provides information on creating your Amazon Elastic MapReduce (Amazon EMR) 
account and credentials. 
You should be familiar with the following: 
• Hadoop. For more information go to http://hadoop.apache.org/core/. 
• Amazon Elastic Compute Cloud (Amazon EC2), Amazon Simple Storage Service (Amazon S3), and 
Amazon SimpleDB. For more information, see the Amazon Elastic Compute Cloud User Guide, the 
Amazon Simple Storage Service Developer Guide, and the Amazon SimpleDB Developer Guide, 
respectively. 
Overview of Amazon EMR 
Amazon Elastic MapReduce (Amazon EMR) is a data analysis tool that simplifies the set-up and 
management of a computer cluster, the source data, and the computational tools that help you implement 
sophisticated data processing jobs quickly. 
Typically, data processing involves performing a series of relatively simple operations on large amounts 
of data. In Amazon EMR, each operation is called a step and a sequence of steps is a job flow. A job flow 
that processes encrypted data might look like the following example. 
Step 1 Decrypt data 
Step 2 Process data 
Step 3 Encrypt data 
Step 4 Save data 
Amazon EMR uses Hadoop to divide the work among the instances in the cluster, track status, and 
combine the individual results into one output. For an overview of Hadoop, see What Is Hadoop? (p. 8). 
Amazon EMR takes care of provisioning a Hadoop cluster, running the job flow, terminating the job flow, 
moving the data between Amazon EC2 and Amazon S3, and optimizing Hadoop. Amazon EMR removes 
most of the cumbersome details of setting up the hardware and networking required by the Hadoop 
cluster, such as monitoring the setup, configuring Hadoop, and executing the job flow. Together, Amazon 
EMR and Hadoop provide all of the power of Hadoop processing with the ease of use, low cost, and 
scalability that Amazon S3 and Amazon EC2 offer. 
Architectural Overview of Amazon EMR 
Amazon Elastic MapReduce (Amazon EMR) works in conjunction with Amazon EC2 to create a Hadoop 
cluster, and with Amazon S3 to store scripts, input data, log files, and output results. The Amazon EMR 
process is outlined in the following figure and table. 
Amazon EMR Process 
1. Upload to Amazon S3 the data you want to process, as well as the mapper and reducer executables 
that process the data, and then send a request to Amazon EMR to start a job flow (a programmatic 
sketch of such a request follows below). 
2. Amazon EMR starts a Hadoop cluster, which loads any specified bootstrap actions and then runs 
Hadoop on each node. 
3. Hadoop executes a job flow by downloading data from Amazon S3 to core and task nodes. 
Alternatively, the data is loaded dynamically at run time by mapper tasks. 
4. Hadoop processes the data and then uploads the results from the cluster to Amazon S3. 
5. The job flow is completed and you retrieve the processed data from Amazon S3. 
For details on mapping legacy job flows to instance groups, see Mapping Legacy Job Flows to Instance 
Groups (p. 102). 
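The request in step 1 can be sent from the Amazon EMR console, the CLI, or programmatically. The 
following is a minimal sketch, using the AWS SDK for Java, of what such a request might look like for a 
streaming job flow. The access keys, bucket names, key pair name, and mapper and reducer scripts are 
placeholders, not values defined in this guide. 

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;
import com.amazonaws.services.elasticmapreduce.util.StreamingStep;

public class StartStreamingJobFlow {
    public static void main(String[] args) {
        // Placeholder credentials; in practice, load these securely.
        AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

        // Streaming step: the mapper, reducer, input, and output all live in Amazon S3.
        StepConfig wordCountStep = new StepConfig()
                .withName("Word count")
                .withActionOnFailure("TERMINATE_JOB_FLOW")
                .withHadoopJarStep(new StreamingStep()
                        .withMapper("s3://mybucket/scripts/mapper.py")
                        .withReducer("s3://mybucket/scripts/reducer.py")
                        .withInputs("s3://mybucket/input/")
                        .withOutput("s3://mybucket/output/")
                        .toHadoopJarStepConfig());

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("My streaming job flow")
                .withLogUri("s3://mybucket/logs/")
                .withSteps(wordCountStep)
                .withInstances(new JobFlowInstancesConfig()
                        .withEc2KeyName("mykeypair")
                        .withHadoopVersion("0.20.205")
                        .withInstanceCount(3)
                        .withMasterInstanceType("m1.small")
                        .withSlaveInstanceType("m1.small"));

        // Amazon EMR provisions the cluster, runs the step, and writes the
        // results and logs back to Amazon S3.
        String jobFlowId = emr.runJobFlow(request).getJobFlowId();
        System.out.println("Started job flow " + jobFlowId);
    }
}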
Elastic MapReduce Features 
Topics 
• Bootstrap Actions (p. 4) 
• Configurable Data Storage (p. 4) 
• Hadoop and Step Logging (p. 5) 
• Hive Support (p. 5) 
• Resizeable Running Job Flows (p. 5) 
• Secure Data (p. 5) 
• Supports Hadoop Methods (p. 5) 
• Multiple Sequential Steps (p. 5) 
The following sections describe the features available in Amazon Elastic MapReduce (Amazon EMR). 
Bootstrap Actions 
A bootstrap action is a mechanism that lets you run a script on Elastic MapReduce cluster nodes before 
Hadoop starts. Bootstrap action scripts are stored in Amazon S3 and passed to Amazon EMR when 
creating a new job flow. Bootstrap action scripts are downloaded from Amazon S3 and executed on each 
node before the job flow is executed. 
By using bootstrap actions, you can install software on the node, modify the default Hadoop site 
configuration, or change the way Java parameters are used to run Hadoop daemons. 
Both predefined and custom bootstrap actions are available. The predefined bootstrap actions include 
Configure Hadoop, Configure Daemons, and Run-if. You can write custom bootstrap actions in any 
language already installed on the job flow instance, such as Ruby, Python, Perl, or bash. 
You can specify a bootstrap action from the command line interface, from the Amazon EMR console, or 
from the Amazon EMR API when starting a job flow. For more information, see Bootstrap Actions (p. 84). 
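As a rough sketch of how this looks programmatically, the fragment below attaches the predefined 
Configure Hadoop bootstrap action to a job flow request using the AWS SDK for Java. It assumes the 
request is then completed and submitted with a normal RunJobFlow call; the specific setting passed in 
the arguments is only an example. 

import com.amazonaws.services.elasticmapreduce.model.BootstrapActionConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.ScriptBootstrapActionConfig;

// Run the predefined Configure Hadoop bootstrap action on every node before
// Hadoop starts. The "-m" argument overrides a mapred-site setting; the key
// and value shown here are illustrative.
BootstrapActionConfig configureHadoop = new BootstrapActionConfig()
        .withName("Configure Hadoop")
        .withScriptBootstrapAction(new ScriptBootstrapActionConfig()
                .withPath("s3://elasticmapreduce/bootstrap-actions/configure-hadoop")
                .withArgs("-m", "mapred.tasktracker.map.tasks.maximum=2"));

// Attach the bootstrap action to the job flow request before calling runJobFlow().
RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("Job flow with a bootstrap action")
        .withBootstrapActions(configureHadoop);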
Configurable Data Storage 
Amazon EMR supports the Hadoop Distributed File System (HDFS). HDFS is fault-tolerant, scalable, and 
easily configurable. The default configuration is already optimized for most job flows. Generally, the 
configuration needs to be changed only for very large clusters. Configuration changes are accomplished 
using bootstrap actions. For more information, see Hadoop Configuration (p. 299). 
Hadoop and Step Logging 
Amazon EMR provides detailed logs you can use to debug both Hadoop and Amazon EMR. For more 
information on how to create logs, view logs, and use them to troubleshoot a job flow, see 
Troubleshooting (p. 183). 
Hive Support 
Amazon Elastic MapReduce (Amazon EMR) supports Apache Hive. Hive is an integrated data warehouse 
infrastructure built on top of Hadoop. It provides tools to simplify data summarization and provides ad 
hoc querying and analysis of large datasets stored in Hadoop files. Hive provides a simple query language 
called Hive QL, which is based on SQL. 
For more information on the supported versions of Hive, see Hive Configuration (p. 348). 
Resizeable Running Job Flows 
The ability to resize a running job flow lets you increase or decrease the number of nodes in a running 
cluster. Core nodes contain the Hadoop Distributed File System (HDFS); after a job flow is running, you 
can increase the number of core nodes. Task nodes also run Hadoop tasks, but do not contain HDFS; 
after a job flow is running, you can both increase and decrease the number of task nodes. For more 
information, see Resizing Running Job Flows (p. 96). 
Secure Data 
Amazon EMR provides an authentication mechanism to ensure that data stored in Amazon S3 is secured 
against unauthorized access. By default, only the AWS Account owner can access the data uploaded to 
Amazon S3. Other users can access the data only if you explicitly edit security permissions. 
You can send data to Amazon S3 using the secure HTTPS protocol. Amazon EMR always uses a secure 
channel to send data between Amazon S3 and Amazon EC2. For added security, you can encrypt your 
data before uploading it to Amazon S3. For more information on AWS security, go to the AWS Security 
Center. 
Supports Hadoop Methods 
Amazon EMR supports job flows based on streaming, Hive, Pig, Custom JAR, and Cascading. Streaming 
enables you to write application logic in any language and to process large amounts of data using the 
Hadoop framework. Hive and Pig offer nonprogramming options with their SQL-like scripting languages. 
Custom JAR files enable you to write Java-based MapReduce functions. Cascading is an API with built-in 
MapReduce support that lets you create complex distributed processes. For more information, see Using 
Amazon EMR (p. 15). 
Multiple Sequential Steps 
Amazon EMR supports job flows with multiple, sequential steps, including the ability to add steps while 
a job flow runs. Individual steps can combine to create more sophisticated job flows. Additionally, you 
can incrementally add steps to a running job flow to help with debugging. For more information, see Add 
Steps to a Job Flow (p. 79). 
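For example, a step can be added to a job flow that is already running. The sketch below uses the AWS 
SDK for Java and assumes an AmazonElasticMapReduce client named emr has already been created; 
the job flow ID, JAR path, and arguments are placeholders. 

import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

// Add one more step to a running job flow (for example, one left in the
// WAITING state). "j-XXXXXXXXXXXXX" stands in for a real job flow ID.
StepConfig extraStep = new StepConfig()
        .withName("Additional processing step")
        .withActionOnFailure("CANCEL_AND_WAIT")
        .withHadoopJarStep(new HadoopJarStepConfig()
                .withJar("s3://mybucket/jars/my-processing.jar")
                .withArgs("s3://mybucket/input/", "s3://mybucket/output/step2/"));

emr.addJobFlowSteps(new AddJobFlowStepsRequest()
        .withJobFlowId("j-XXXXXXXXXXXXX")
        .withSteps(extraStep));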
Amazon EMR Concepts 
Topics 
• Job Flows and Steps (p. 6) 
• Hadoop and MapReduce (p. 7) 
This section describes the concepts and terminology you need to understand and use Amazon Elastic 
MapReduce (Amazon EMR). 
Job Flows and Steps 
A job flow is the series of instructions Amazon Elastic MapReduce (Amazon EMR) uses to process data. 
A job flow contains any number of user-defined steps. A step is any instruction that manipulates the data. 
Steps are executed in the order in which they are defined in the job flow. 
You can track the progress of a job flow by checking its state. The following diagram shows the life cycle 
of a job flow and how each part of the job flow process maps to a particular job flow state. 
A successful Amazon Elastic MapReduce (Amazon EMR) job flow follows this process: Amazon EMR 
first provisions a Hadoop cluster. During this phase, the job flow state is STARTING. Next, any user-defined 
bootstrap actions are run. During this phase, the job flow state is BOOTSTRAPPING. After all bootstrap 
actions are completed, the job flow state is RUNNING. The job flow sequentially runs all job flow steps 
during this phase. After all steps run, the job flow state transitions to SHUTTING_DOWN and the job flow 
shuts down the cluster. All data stored on a cluster node is deleted. Information stored elsewhere, such 
as in your Amazon S3 bucket, persists. Finally, when all job flow activity is complete, the job flow state 
is marked as COMPLETED. 
You can configure a job flow to go into a WAITING state once it completes processing of all steps. A job 
flow in the WAITING state continues running, waiting for you to add steps or manually terminate it. When 
you manually terminate a job flow, the Hadoop cluster shuts down and the job flow state is SHUTTING_DOWN. 
When the job flow activity is complete, the final job flow state is TERMINATED. Creating a WAITING job 
flow is useful when troubleshooting. For more information on troubleshooting, see Debug Job Flows with 
Steps (p. 206). 
Any failure during the job flow process terminates the job flow and shuts down all cluster nodes. Any data 
stored on a cluster node is deleted. The job flow state is marked as FAILED. 
For a complete list of job flow states, see the JobFlowExecutionStatusDetail data type in the Amazon 
Elastic MapReduce (Amazon EMR) API Reference. 
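One way to check these states programmatically is sketched below using the AWS SDK for Java; it 
assumes an AmazonElasticMapReduce client named emr already exists, and the job flow ID is a 
placeholder. 

import com.amazonaws.services.elasticmapreduce.model.DescribeJobFlowsRequest;
import com.amazonaws.services.elasticmapreduce.model.JobFlowDetail;

// Look up the current state of a job flow by its ID.
DescribeJobFlowsRequest describe = new DescribeJobFlowsRequest()
        .withJobFlowIds("j-XXXXXXXXXXXXX");

for (JobFlowDetail jobFlow : emr.describeJobFlows(describe).getJobFlows()) {
    // The state is one of STARTING, BOOTSTRAPPING, RUNNING, WAITING,
    // SHUTTING_DOWN, COMPLETED, TERMINATED, or FAILED.
    String state = jobFlow.getExecutionStatusDetail().getState();
    System.out.println(jobFlow.getJobFlowId() + " is " + state);
}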
You can also track the progress of job flow steps by checking their state. The following diagram shows 
the processing of job flow steps and how each step maps to a particular state. 
A job flow contains one or more steps. Steps are processed in the order in which they are listed in the 
job flow. Steps are run following this sequence: all steps have their state set to PENDING. The first step is 
run, and the step's state is set to RUNNING. When the step is completed, the step's state changes to 
COMPLETED. The next step in the queue is run, and the step's state is set to RUNNING. After each step 
completes, the step's state is set to COMPLETED and the next step in the queue is run. Steps are run until 
there are no more steps, and then processing flow returns to the job flow. 
If a step fails, the step state is FAILED and all remaining steps with a PENDING state are marked as 
CANCELLED. No further steps are run, and processing returns to the job flow. 
Data is normally communicated from one step to the next using files stored on the cluster's Hadoop 
Distributed File System (HDFS). Data stored on HDFS exists only as long as the cluster is running. When 
the cluster is shut down, all data is deleted. The final step in a job flow typically stores the processing 
results in an Amazon S3 bucket. 
For a complete list of step states, see the StepExecutionStatusDetail data type in the Amazon Elastic 
MapReduce (Amazon EMR) API Reference. 
Hadoop and MapReduce 
Topics 
• What Is Hadoop? (p. 8) 
• What Is MapReduce? (p. 8) 
• Instance Groups (p. 9) 
• Supported Hadoop Versions (p. 10) 
• Supported File Systems (p. 10) 
This section explains the roles of Apache Hadoop and MapReduce in Amazon Elastic MapReduce 
(Amazon EMR) and how these two methodologies work together to process data. 
What Is Hadoop? 
Apache Hadoop is an open-source Java software framework that supports massive data processing 
across a cluster of servers. Hadoop uses a programming model called MapReduce that divides a large 
data set into many small fragments. Hadoop distributes a data fragment and a copy of the MapReduce 
executable to each of the slave nodes in a Hadoop cluster. Each slave node runs the MapReduce 
executable on its subset of the data. Hadoop then combines the results from all of the nodes into a finished 
output. Amazon EMR enables you to upload that output into an Amazon S3 bucket you designate. 
For more information about Hadoop, go to http://hadoop.apache.org. 
What Is MapReduce? 
MapReduce is a combination of mapper and reducer executables that work together to process data. 
The mapper executable processes the raw data into key/value pairs, called intermediate results. The 
reducer executable combines the intermediate results, applies additional algorithms, and produces the 
final output, as described in the following process. 
MapReduce Process 
1. Amazon Elastic MapReduce (Amazon EMR) starts your instances in two security groups: one for the 
master node and another for the core and task nodes. 
2. Hadoop breaks a data set into multiple sets if the data set is too large to process quickly on a single 
cluster node. 
3. Hadoop distributes the data files and the MapReduce executable to the core and task nodes of the 
cluster. Hadoop handles machine failures and manages network communication between the master, 
core, and task nodes. In this way, developers do not need to know how to perform distributed 
programming or handle the details of data redundancy and failover. 
4. The mapper function uses an algorithm that you supply to parse the data into key/value pairs. These 
key/value pairs are passed to the reducer. As an example, for a job flow that counts the number of times 
a word appears in a document, the mapper might take each word in a document and assign it a value 
of 1. Each word is a key in this case, and all values are 1. 
5. The reducer function collects the results from all of the mapper functions in the cluster, eliminates 
redundant keys by combining the values of all like keys, performs the designated operation on the values 
for each key, and then outputs the results. Continuing with the previous example, the reducer takes the 
word counts from all of the mapper functions running in the cluster, adds up the number of times each 
word was found, and then outputs that result to Amazon S3. 
You can write the executables in any programming language. Mapper and reducer applications written 
in Java are compiled into a JAR file. Executables written in other programming languages use the Hadoop 
streaming utility to implement the mapper and reducer algorithms. 
The mapper executable reads the input from standard input and the reducer outputs data through standard 
output. By default, each line of input/output represents a record and the first tab on each line of the output 
separates the key and value. 
For more information about MapReduce, go to How Map and Reduce operations are actually carried out 
(http://wiki.apache.org/hadoop/HadoopMapReduce). 
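To make the word-count example above concrete, here is a minimal sketch of a Java mapper and reducer 
compiled into a single JAR, written against the Hadoop 0.20 MapReduce API. The class names, and the 
assumption that the input and output paths are passed as program arguments, are illustrative. 

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in its input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts for each word and writes the total.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output can be HDFS or Amazon S3 (s3n://) paths.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}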
Instance Groups 
Amazon EMR runs a managed version of Apache Hadoop, handling the details of creating the cloud-server 
infrastructure to run the Hadoop cluster. Amazon EMR refers to this cluster as a job flow, and defines the 
concept of instance groups, which are collections of Amazon EC2 instances that perform roles analogous 
to the master and slave nodes of Hadoop. There are three types of instance groups: master, core, and 
task. 
Each Amazon EMR job flow includes one master instance group that contains one master node, a core 
instance group containing one or more core nodes, and an optional task instance group, which can contain 
any number of task nodes. 
If the job flow is run on a single node, then that instance is simultaneously a master and a core node. For 
job flows running on more than one node, one instance is the master node and the remaining are core 
or task nodes. 
For more information about instance groups, see Resizing Running Job Flows (p. 96). 
Master Instance Group 
The master instance group manages the job flow: coordinating the distribution of the MapReduce 
executable and subsets of the raw data, to the core and task instance groups. It also tracks the status of 
each task performed, and monitors the health of the instance groups. To monitor the progress of the job 
flow, you can SSH into the master node as the Hadoop user and either look at the Hadoop log files directly 
or access the user interface that Hadoop publishes to the web server running on the master node. For 
more information, see View Logs Using SSH (p. 197). 
As the job flow progresses, each core and task node processes its data, transfers the data back to Amazon 
S3, and provides status metadata to the master node. 
Note 
The instance controller on the master node uses MySQL. If MySQL becomes unavailable, the 
instance controller will be unable to launch and manage instances. 
Core Instance Group 
The core instance group contains all of the core nodes of a job flow. A core node is an EC2 instance that 
runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). 
Core nodes are managed by the master node. 
The EC2 instances you assign as core nodes are capacity that must be allotted for the entire job flow 
run. Because core nodes store data, you can't remove them from a job flow. However, you can add more 
core nodes to a running job flow. Core nodes run both the DataNode and TaskTracker Hadoop daemons. 
Caution 
Removing HDFS from a running node runs the risk of losing data. 
For more information about core instance groups, see Resizing Running Job Flows (p. 96). 
Task Instance Group 
The task instance group contains all of the task nodes in a job flow. The task instance group is optional. 
You can add it when you start the job flow or add a task instance group to a job flow in progress. 
Task nodes are managed by the master node. While a job flow is running, you can increase and decrease 
the number of task nodes. Because they don't store data and can be added to and removed from a job flow, 
you can use task nodes to manage the EC2 instance capacity your job flow uses, increasing capacity to 
handle peak loads and decreasing it later. Task nodes run only a TaskTracker Hadoop daemon. 
For more information about task instance groups, see Resizing Running Job Flows (p. 96). 
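A rough sketch of resizing a task instance group with the AWS SDK for Java follows; it assumes an 
AmazonElasticMapReduce client named emr already exists, and the instance group ID shown is a 
placeholder. 

import com.amazonaws.services.elasticmapreduce.model.InstanceGroupModifyConfig;
import com.amazonaws.services.elasticmapreduce.model.ModifyInstanceGroupsRequest;

// Grow the task instance group to ten nodes; to shrink it later, send
// another request with a smaller count.
emr.modifyInstanceGroups(new ModifyInstanceGroupsRequest()
        .withInstanceGroups(new InstanceGroupModifyConfig()
                .withInstanceGroupId("ig-XXXXXXXXXXXXX")
                .withInstanceCount(10)));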
Supported Hadoop Versions 
Amazon Elastic MapReduce (Amazon EMR) allows you to choose to run either Hadoop version 0.18, 
Hadoop version 0.20, or Hadoop version 0.20.205. 
For more information on Hadoop configuration, see Hadoop Configuration (p. 299). 
Supported File Systems 
Amazon EMR and Hadoop typically use two or more of the following file systems when processing a job 
flow: 
• Hadoop Distributed File System (HDFS) 
• Amazon S3 Native File System (S3N) 
• Local file system 
• Legacy Amazon S3 Block File System 
HDFS and S3N are the two main file systems used with Amazon EMR. 
HDFS is a distributed, scalable, and portable file system for Hadoop. An advantage of HDFS is data 
awareness between the Hadoop cluster nodes managing the job flows and the Hadoop cluster nodes 
managing the individual steps. For more information on how HDFS works, see 
http://hadoop.apache.org/docs/hdfs/current/hdfs_user_guide.html. 
The Amazon S3 Native File System (S3N) is a file system for reading and writing regular files on Amazon 
S3. The advantage of this file system is that you can access files on Amazon S3 that were written with 
other tools. For information on how Amazon S3 and Hadoop work together, see 
http://wiki.apache.org/hadoop/AmazonS3. 
The local file system refers to a locally connected disk. When a Hadoop cluster is created, each node is 
created from an Amazon EC2 instance which comes with a preconfigured block of preattached disk 
storage called an Amazon EC2 local instance store. Data on instance store volumes persists only during 
the life of the associated Amazon EC2 instance. The amount of this disk storage varies by Amazon EC2 
instance type. It is ideal for temporary storage of information that is continually changing, such as buffers, 
caches, scratch data, and other temporary content. For more information about Amazon EC2 instances, 
see Amazon Elastic Compute Cloud. 
The Amazon S3 Block File System is a legacy file storage system. We strongly discourage the use 
of this system. 
For more information on how to use and configure file systems in Amazon EMR, see File System 
Configuration (p. 338). 
Associated AWS Product Concepts 
Topics 
• Amazon EC2 Concepts (p. 11) 
• Amazon S3 Concepts (p. 14) 
• AWS Identity and Access Management (IAM) (p. 14) 
• Regions (p. 14) 
• Data Storage (p. 14) 
This section describes AWS concepts and terminology you need to understand to use Amazon Elastic 
MapReduce (Amazon EMR) effectively. 
Amazon EC2 Concepts 
Topics 
• Amazon EC2 Instances (p. 11) 
• Reserved Instances (p. 13) 
• Elastic IP Address (p. 13) 
• Amazon EC2 Key Pairs (p. 13) 
The following sections describe Amazon EC2 features used by Amazon EMR. 
Amazon EC2 Instances 
Amazon EMR enables you to choose the number and kind of Amazon EC2 instances that comprise the 
cluster that processes your job flow. Amazon EC2 offers several basic types. 
• Standard—You can use Amazon EC2 standard instances for most applications. 
• High-CPU—These instances have proportionally more CPU resources than memory (RAM) for 
compute-intensive applications. 
• High-Memory—These instances offer large memory sizes for high throughput applications, including 
database and memory caching applications. 
• Cluster Compute—These instances provide proportionally high CPU resources with increased network 
performance. They are well suited for demanding network-bound applications. 
• High Storage—These instances provide proportionally high storage resources. They are well suited 
for data warehouse applications. 
Note 
Amazon EMR does not support micro instances at this time. 
The following table describes all the instance types that Amazon EMR supports. 
Name | RAM (GiB) | Compute Units | Disk Drive (GiB) | Platform (bits) | I/O Performance | Instance Type 
Small (default) | 1.7 | 1 | 150 | 32 | Moderate | m1.small 
Large | 7.5 | 4 | 840 | 64 | High | m1.large 
Extra Large | 15 | 8 | 1680 | 64 | High | m1.xlarge 
High-CPU Medium | 1.7 | 5 | 340 | 32 | Moderate | c1.medium 
High-CPU Extra Large | 7 | 20 | 1680 | 64 | High | c1.xlarge 
High-Memory Extra Large | 17.1 | 6.5 | 420 | 64 | Moderate | m2.xlarge 
High-Memory Double Extra Large | 34.2 | 13 | 850 | 64 | Moderate | m2.2xlarge 
High-Memory Quadruple Extra Large | 68.4 | 26 | 1680 | 64 | High | m2.4xlarge 
Cluster Compute Quadruple Extra Large Instance* | 23 | 33.5 | 1690 | 64 | Very High (10 Gigabit Ethernet) | cc1.4xlarge 
Cluster Compute Eight Extra Large** | 60.5 | 88 | 3360 | 64 | Very High (10 Gigabit Ethernet) | cc2.8xlarge 
High Storage* | 117 | 35 | 49152 | 64 | Very High (10 Gigabit Ethernet) | hs1.8xlarge 
Cluster GPU*** | 23 | 33.5 | 1680 | 64 | Very High (10 Gigabit Ethernet) | cg1.4xlarge 
*Cluster Compute Quadruple Extra Large instances and High Storage instances are supported only in 
the US East (Northern Virginia) Region. 
**Cluster Compute Eight Extra Large instances are only supported in the US East (Northern Virginia), 
US West (Oregon), and EU (Ireland) Regions. 
***Cluster GPU instances have 22 GB, with 1 GB reserved for GPU operation. 
The practical limit of the amount of data you can process depends on the number and type of Amazon 
EC2 instances selected as your cluster nodes, and on the size of your intermediate and final data. This 
is because the input, intermediate, and output data sets reside on the cluster nodes while your job flow 
runs. For example, the maximum amount of data that you can process on a 20-node cluster is 34 TB (20 
Extra Large instances x 1.69 TB of hard disk per Amazon EC2 instance = 34 TB). 
The default maximum number of Amazon EC2 instances you can specify is 20. If you need more instances, 
you can make a formal request. For more information, go to the Request to Increase Amazon EC2 Instance 
Limit Form. 
Related Topics 
• Request additional Amazon EC2 instances 
• Amazon EC2 Instance Types 
• High Performance Computing (HPC) 
Reserved Instances 
Reserved Instances provide guaranteed capacity and are an additional Amazon EC2 pricing option. You 
make a one-time payment for an instance to reserve capacity and reduce hourly usage charges. Reserved 
Instances complement existing Amazon EC2 On-Demand Instances and provide an option to reduce 
computing costs. As with On-Demand Instances, you pay only for the compute capacity that you actually 
consume, and if you don't use an instance, you don't pay usage charges for it. 
To use a Reserved Instance with Amazon EMR, launch your job flow in the same Availability Zone as 
your Reserved Instance. For example, let's say you purchase one m1.small Reserved Instance in US-East. 
If you launch a job flow that uses two m1.small instances in the same Availability Zone in Region US-East, 
one instance is billed at the Reserved Instance rate and the other is billed at the On-Demand rate. If you 
have a sufficient number of available Reserved Instances for the total number of instances you want to 
launch, you are guaranteed capacity. Your Reserved Instances are used before any On-Demand Instances 
are created. 
You can use Reserved Instances by using either the Amazon EMR console, the command line interface 
(CLI), Amazon EMR API actions, or the AWS SDKs. 
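When creating a job flow with the AWS SDK for Java, for example, the Availability Zone can be pinned 
to match a Reserved Instance. This fragment shows only the instance configuration; the zone name is 
a placeholder and the rest of the RunJobFlowRequest is assumed to be filled in as usual. 

import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.PlacementType;

// Launch the job flow instances in the same Availability Zone as the
// Reserved Instance so the reserved capacity can be applied.
JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
        .withPlacement(new PlacementType("us-east-1a"))
        .withInstanceCount(2)
        .withMasterInstanceType("m1.small")
        .withSlaveInstanceType("m1.small");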
Related Topics 
• Amazon EC2 Reserved Instances 
Elastic IP Address 
Elastic IP addresses are static IP addresses designed for dynamic cloud computing. An Elastic IP address 
is associated with your account, not a particular instance. You control the addresses associated with your 
account until you choose to explicitly release them. 
You can associate one Elastic IP address with only one job flow at a time. To ensure our customers are 
efficiently using Elastic IP addresses, we impose a small hourly charge when IP addresses associated 
with your account are not mapped to a job flow or Amazon EC2 instance. When Elastic IP addresses are 
mapped to an instance, they are free of charge. 
For more information about enabling Elastic IP addresses with Amazon EMR, see Using Elastic IP 
Addresses (p. 287). For more information about using IP addresses in AWS, go to the Using Elastic IP 
Addresses section in the Amazon Elastic Compute Cloud User Guide. 
Amazon EC2 Key Pairs 
When Amazon EMR starts an Amazon EC2 instance, it uses a 2048-bit RSA key pair that you have 
named. Amazon EC2 stores the public key, and you store the private key and use it to authenticate 
when you connect to the instance. 
The key pair ensures that only you can access your job flows. When you launch an instance using your 
key pair name, the public key becomes part of the instance metadata. This allows you to access the 
cluster node securely. 
Although specifying the key pair is optional, we strongly recommend that you use key pairs. This key pair 
becomes associated with all of the nodes created to process your job flow. The key pair name creates a 
handle you can use to access the master node in the Hadoop cluster. With the key pair name, you can 
log in to the master node without using a password, enabling you to monitor the progress of your job 
flows. On the master node, you can retrieve detailed job flow processing status and statistics. 
For more information on how to create and use an Amazon EC2 key pair with Amazon EMR, see "Creating 
an Amazon EC2 Key Pair" in the Getting Started Guide. 
Amazon S3 Concepts 
Topics 
• Buckets (p. 14) 
• Multipart Upload (p. 14) 
The following sections describe Amazon S3 features used by Amazon EMR. 
Buckets 
Amazon EMR requires Amazon S3 buckets to hold the input and output data of your Hadoop processing. 
Amazon EMR uses the Amazon S3 Native File System for Hadoop processing. Amazon S3 uses the 
hostname method for accessing data, which places restrictions on bucket names used in Amazon EMR 
job flows. 
For more information on creating Amazon S3 buckets for use with Amazon EMR, see Setting Up Your 
Environment to Run a Job Flow (p. 17). For more information on Amazon S3 buckets, go to Working 
with Amazon S3 Buckets in the Amazon S3 Developer Guide. 
Multipart Upload 
Amazon Elastic MapReduce (Amazon EMR) supports Amazon S3 multipart upload through the AWS 
SDK for Java. Multipart upload lets you upload a single object as a set of parts. You can upload these 
object parts independently and in any order. If transmission of any part fails, you can retransmit that part 
without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles the parts 
and creates the object. 
For more information about enabling multipart uploads with Amazon EMR, see Multipart Upload (p. 343). 
For more information on Amazon S3 multipart uploads, go to Uploading Objects Using Multipart Upload 
in the Amazon S3 Developer Guide. 
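As a general illustration of multipart upload through the AWS SDK for Java (this shows uploading your 
own data to Amazon S3 with the SDK's TransferManager, not the internal mechanism Amazon EMR 
uses), consider the sketch below; the credentials, bucket, key, and file path are placeholders. 

import java.io.File;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.Upload;

public class MultipartUploadExample {
    public static void main(String[] args) throws Exception {
        // TransferManager splits large files into parts, uploads the parts in
        // parallel, and retries individual parts on failure.
        TransferManager tm = new TransferManager(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

        Upload upload = tm.upload("mybucket", "input/large-dataset.gz",
                new File("/local/path/large-dataset.gz"));
        upload.waitForCompletion(); // blocks until all parts are uploaded
        tm.shutdownNow();
    }
}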
AWS Identity and Access Management (IAM) 
Amazon Elastic MapReduce (Amazon EMR) supports AWS Identity and Access Management (IAM) 
policies. IAM is a web service that enables AWS customers to manage users and user permissions. For 
more information about enabling IAM policies with Amazon EMR, see Configure User Permissions with 
IAM (p. 274). For more information on IAM, go to Using IAM in the Using AWS Identity and Access 
Management guide. 
Regions 
You can choose the geographical region where Amazon EC2 creates the cluster to process your data. 
You might choose a region to optimize latency, minimize costs, or address regulatory requirements. 
Setting a region-specific endpoint guarantees where your data resides. For the list of regions and endpoints 
supported by Amazon EMR, go to Regions and Endpoints in the Amazon Web Services General Reference. 
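With the AWS SDK for Java, for instance, the region is chosen by pointing the client at a region-specific 
endpoint. The endpoint shown below is illustrative; confirm the exact hostname in the Regions and 
Endpoints reference. 

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;

// Send Amazon EMR requests to the EU (Ireland) Region instead of the default endpoint.
AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
        new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
emr.setEndpoint("https://elasticmapreduce.eu-west-1.amazonaws.com");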
Data Storage 
Amazon EMR uses Amazon S3 and Amazon SimpleDB data storage systems when processing a job 
flow. For more information about using Amazon S3 with Hadoop, go to 
http://wiki.apache.org/hadoop/AmazonS3. For more information about Amazon SimpleDB, go to the 
Amazon SimpleDB product description page. 
Using Amazon EMR 
Topics 
• Setting Up Your Environment to Run a Job Flow (p. 17) 
• Create a Job Flow (p. 23) 
• View Job Flow Details (p. 72) 
• Terminate a Job Flow (p. 77) 
• Customize a Job Flow (p. 79) 
• Connect to the Master Node in an Amazon EMR Job Flow (p. 110) 
• Use Cases (p. 122) 
• Building Binaries Using Amazon EMR (p. 131) 
• Using Tagging (p. 136) 
• Protect a Job Flow from Termination (p. 136) 
• Lower Costs with Spot Instances (p. 141) 
• Store Data with HBase (p. 155) 
• Troubleshooting (p. 183) 
• Monitor Metrics with Amazon CloudWatch (p. 209) 
• Monitor Performance with Ganglia (p. 220) 
• Distributed Copy Using S3DistCp (p. 227) 
• Export, Import, Query, and Join Tables in Amazon DynamoDB Using Amazon EMR (p. 234) 
• Use Third Party Applications With Amazon EMR (p. 258) 
This section covers the fundamentals of creating, managing, and troubleshooting a job flow using Amazon 
Elastic MapReduce (Amazon EMR). All supported job flow types are described. Information on using the 
Amazon EMR console, the CLI, SDKs, and API is included. 
If you have not signed up to use Amazon EMR, instructions are provided in the Getting Started Guide. 
Tip 
We strongly recommend that you work through the examples in the Getting Started Guide to 
get a basic understanding of Amazon EMR. 
Amazon EMR offers a variety of interfaces, including a console, a command line interface (CLI), a query 
API, AWS SDKs, and libraries. Each interface offers a different balance of ease and functionality. The 
interface you choose depends on your knowledge of Hadoop, your programming skills, and the functionality 
you require: 
• The Amazon EMR console provides a graphical interface from which you can launch Amazon EMR 
job flows and monitor their progress. 
• The CLI provides full access to the Amazon EMR API without requiring a programming 
environment. The Ruby-based Amazon EMR CLI is available for download at Amazon Elastic 
MapReduce Ruby Client (http://aws.amazon.com/developertools/2264). 
• The Amazon EMR API, SDKs, and libraries offer the most flexibility but require a programming 
environment and software development skills. For more information on using the query API to access 
Amazon EMR, see Write Amazon EMR Applications (p. 263) in this guide. The AWS SDKs provide 
support for Java, C#, and .NET. For more information on the AWS SDKs, refer to the list of current 
AWS SDKs. Libraries are available for Perl and PHP. For more information about the Perl and PHP 
libraries, see Sample Code & Libraries (http://aws.amazon.com/code/Elastic-MapReduce).
The Amazon EMR console, the CLI, and the API/SDKs/libraries differ in which of the following functions they support: 
• Create multiple job flows 
• Define bootstrap actions in a job flow 
• View logs for Hadoop jobs, tasks, and task attempts using a graphical interface 
• Implement Hadoop data processing programmatically 
• Monitor job flows in real time 
• Provide verbose job flow details 
• Resize running job flows 
• Run job flows with multiple steps 
• Select version of Hadoop, Hive, and Pig 
• Specify the MapReduce executable in multiple computer languages 
• Specify the number and type of Amazon EC2 instances that process the data 
• Transfer data to and from Amazon S3 automatically 
• Terminate job flows in real time
The following sections describe how to use Amazon Elastic MapReduce (Amazon EMR) with each of the 
interface types. 
Setting Up Your Environment to Run a Job Flow 
This section walks you through how to set up required resources and permissions to run a job flow. The 
tasks that follow show you how to create the resources that your job flow uses to process data. Once 
created, you can reuse these resources for other job flows. Depending on your application, however, it 
may make operational sense to create new resources for each job flow. 
The tasks that must be completed before you create a job flow are as follows: 
1 Choose a Region (p. 17) 
2 Create and Configure an Amazon S3 Bucket (p. 19) 
3 Create an Amazon EC2 Key Pair and PEM File (p. 20) 
4 Modify Your PEM File (p. 21) 
5 For CLI and API users only, Get Security Credentials (p. 21) 
6 For CLI users only, optionally Create a Credentials File (p. 22) 
The following sections provide instructions on how to perform each of the tasks. 
Choose a Region 
AWS enables you to place resources in multiple locations. Locations are composed of Regions and 
Availability Zones within those Regions. Availability Zones are distinct geographical locations that are 
engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency 
network connectivity to other Availability Zones in the same Region. 
All Amazon EC2 Instances, key pairs, security groups, and Amazon Elastic MapReduce (Amazon EMR) 
job flows must be located in the same Region. To optimize performance and reduce latency, all resources 
(such as Amazon S3 buckets) and job flows should be located in the same Region.
For more information about Regions and Availability Zones, go to Using Regions and Availability Zones 
in the Amazon Elastic Compute Cloud User Guide.
Note 
Not all AWS products offer the same support in all Regions. For example, Cluster Compute 
instances are available only in the US-East (Northern Virginia) Region and the Asia Pacific 
(Sydney) region supports only Hadoop 1.0.3 and later. Confirm that you are working in the 
appropriate Region for the resources you want to use. 
You must ensure that you use the same Region for each resource you create. Use the table below to 
identify the correct Region name. 
If your Amazon EMR Region is...    The Amazon EMR CLI and API Region is...    The Amazon S3 Region is...    The Amazon EC2 Region is...
US East (Virginia)                 us-east-1                                  US Standard                   US East (Virginia)
US West (Oregon)                   us-west-2                                  Oregon                        US West (Oregon)
US West (N. California)            us-west-1                                  Northern California           US West (N. California)
EU West (Ireland)                  eu-west-1                                  Ireland                       EU West (Ireland)
Asia Pacific (Singapore)           ap-southeast-1                             Singapore                     Asia Pacific (Singapore)
Asia Pacific (Sydney)              ap-southeast-2                             Sydney                        Asia Pacific (Sydney)
Asia Pacific (Tokyo)               ap-northeast-1                             Tokyo                         Asia Pacific (Tokyo)
South America (Sao Paulo)          sa-east-1                                  Sao Paulo                     South America (Sao Paulo)
Using the Amazon EMR Console to Specify a Region 
To select a region in Amazon EMR 
• From the Amazon EMR console, select the Region from the drop-down list. 
Using the CLI to Specify a Region 
Specify the Region with the --region parameter, as in the following example. If the --region parameter 
is not specified, the job flow is created in the us-east-1 region. 
$ ./elastic-mapreduce --create --alive --stream --input myawsbucket \
--output myawsbucket --log-uri s3n://myawsbucket/logs --region eu-west-1
Tip 
To reduce the number of parameters required each time you issue a command from the CLI, 
you can store information such as Region in your credentials.json file. For more information 
on creating a credentials.json file, see Create a Credentials File (p. 22). 
Using the API to Specify a Region 
To select a region, configure your application to use that Region's endpoint. If you are creating a client 
application using an AWS SDK, you can change the client endpoint by calling setEndpoint, as shown 
in the following example: 
client.setEndpoint("eu-west-1.elasticmapreduce.amazonaws.com"); 
Once your application has specified a region by setting the endpoint, you can set the Availability Zone 
for your job flow's Amazon EC2 instances with a query request that contains an 
Instances.Placement.AvailabilityZone parameter, as in the following example. If you do not 
specify the Availability Zone for your job flow, Amazon EMR launches the job flow instances in the best 
Availability Zone in that region based on system health and available capacity. 
https://elasticmapreduce.amazonaws.com? 
Operation= 
... 
Instances.Placement.AvailabilityZone=eu-west-1a& 
... 
For more information about the parameters in an Amazon EMR request, see API Reference. 
Note 
For more information on specifying Regions from the CLI and API, see Available Region 
Endpoints for the AWS SDKs. 
Create and Configure an Amazon S3 Bucket 
Amazon Elastic MapReduce (Amazon EMR) uses Amazon S3 to store input data, log files, and output 
data. Amazon S3 refers to these storage locations as buckets. To conform with Amazon S3 requirements, 
DNS requirements, and restrictions in the supported data analysis tools, we recommend the following 
guidelines for bucket names. All bucket names must: 
• Be between 3 and 63 characters long 
• Contain only lowercase letters, numbers, or periods (.) 
• Not contain a dash (-) or underscore (_) 
For additional details on valid bucket names, go to Bucket Restrictions and Limitations in the Amazon 
Simple Storage Service Developer Guide. 
This section shows you how to use the AWS Management Console to create and then set permissions 
for an Amazon S3 bucket. However, you can also create and set permissions for an Amazon S3 bucket 
using the Amazon S3 API or the third-party Curl command line tool. For information about Curl, go to 
Amazon S3 Authentication Tool for Curl. For information about using the Amazon S3 API to create and 
configure an Amazon S3 bucket, go to the Amazon Simple Storage Service API Reference. 
Using the AWS Management Console to Create an Amazon 
S3 Bucket 
To create an Amazon S3 bucket 
1. Sign in to the AWS Management Console and open the Amazon S3 console at 
https://console.aws.amazon.com/s3/. 
2. Click Create Bucket. 
The Create a Bucket dialog box opens. 
3. Enter a bucket name, such as mylog-uri. 
This name should be globally unique, and cannot be the same name used by another bucket. 
4. Select the Region for your bucket. To avoid paying cross-region bandwidth charges, create the 
Amazon S3 bucket in the same region as your job flow. 
Refer to Choose a Region (p. 17) for guidance on choosing a Region. 
5. Click Create. 
You created a bucket with the URI s3n://mylog-uri/. 
Note 
If you enable logging in the Create a Bucket wizard, it enables only bucket access logs, not 
Amazon EMR job flow logs. 
Note 
For more information on specifying Region-specific buckets, refer to Buckets and Regions in the 
Amazon Simple Storage Service Developer Guide and Available Region Endpoints for the AWS 
SDKs. 
After you create your bucket you can set the appropriate permissions on it. Typically, you give yourself 
(the owner) read and write access and authenticated users read access. 
Using the AWS Management Console to configure an 
Amazon S3 bucket 
To set permissions on an Amazon S3 bucket 
1. Sign in to the AWS Management Console and open the Amazon S3 console at 
https://console.aws.amazon.com/s3/. 
2. In the Buckets pane, right-click the bucket you just created. 
3. Select Properties. 
4. In the Properties pane, select the Permissions tab. 
5. Click Add more permissions. 
6. Select Authenticated Users in the Grantee field. 
7. To the right of the Grantee drop-down list, select List. 
8. Click Save. 
You have created a bucket and restricted permissions to authenticated users. 
Create an Amazon EC2 Key Pair and PEM File 
Amazon EMR uses an Amazon Elastic Compute Cloud (Amazon EC2) key pair to ensure that you alone 
have access to the instances that you launch. The PEM file associated with this key pair is required to 
ssh directly to the master node of the cluster running your job flow. 
To create an Amazon EC2 key pair 
1. Sign in to the AWS Management Console and open the Amazon EC2 console at 
https://console.aws.amazon.com/ec2/. 
2. From the Amazon EC2 console, select a Region. 
3. In the Navigation pane, click Key Pairs. 
4. On the Key Pairs page, click Create Key Pair. 
5. In the Create Key Pair dialog box, enter a name for your key pair, such as mykeypair. 
6. Click Create. 
7. Save the resulting PEM file in a safe location. 
Your Amazon EC2 key pair and an associated PEM file are created. 
Modify Your PEM File 
Amazon Elastic MapReduce (Amazon EMR) enables you to work interactively with your job flow, allowing 
you to test job flow steps or troubleshoot your cluster environment. To log in directly to the master node 
of your running job flow, you can use ssh or PuTTY. You use your PEM file to authenticate to the master 
node. The PEM file requires a modification that depends on the tool you use on your operating system: 
you use the CLI (ssh) to connect from Linux or UNIX computers, and you use PuTTY to connect from 
Microsoft Windows computers. For more information on how to install the Amazon EMR CLI or how to 
install PuTTY, go to the Getting Started Guide. 
To modify your PEM file 
• Create a local permissions file: 
If you are using...   Do this...

Linux or UNIX
Set the permissions on the PEM file for your Amazon EC2 key pair. For example, if 
you saved the file as mykeypair.pem, the command looks like the following: 
$ chmod og-rwx mykeypair.pem 

Microsoft Windows
a. Download PuTTYgen.exe to your computer from 
http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html. 
b. Launch PuTTYgen. 
c. Click Load. Select the PEM file you created earlier. 
d. Click Open. 
e. Click OK on the PuTTYgen Notice telling you the key was successfully imported. 
f. Click Save private key to save the key in the PPK format. 
g. When PuTTYgen prompts you to save the key without a pass phrase, click Yes. 
h. Enter a name for your PuTTY private key, such as mykeypair.ppk. 
i. Click Save. 
j. Exit the PuTTYgen application. 
Your PEM file is modified to allow you to log in directly to the master node of your running job flow. 
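For example, on Linux or UNIX you might later connect to the master node with a command similar to the following sketch; the host name is a placeholder for the master public DNS name that the console displays for your running job flow, and the connection typically uses the hadoop user account on the master node.

$ ssh -i mykeypair.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com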
Get Security Credentials 
AWS assigns you an Access Key ID and a Secret Access Key to identify you as the sender of your 
request. AWS uses these security credentials to help protect your data. You include your Access Key ID 
in all AWS requests made through the CLI or API. The AWS Management Console provides these security 
credentials automatically. 
Note 
Your Secret Access Key is a shared secret between you and AWS. Keep this key secret. Amazon 
uses this key to bill you for the AWS services you use. Never include your key in your requests 
to AWS and never email your key to anyone, even if an inquiry appears to originate from AWS 
or Amazon.com. No one who legitimately represents Amazon will ever ask you for your Secret 
Access Key. 
To get your Access Key ID and Secret Access Key 
1. Go to the AWS website. 
2. Click My Account to display a list of options. 
3. Click Security Credentials and log in to your AWS Account. Your Access Key ID is displayed in 
the Access Credentials section. Your Secret Access Key remains hidden as a further precaution. 
4. To display your Secret Access Key, click Show in the Your Secret Access Key area. 
You have your Access Key ID and a Secret Access Key to securely identify yourself to AWS. You need 
this information to create a credentials file, as described in the following section. 
Create a Credentials File 
You can use an Amazon EMR credentials file to simplify job flow creation and authentication of requests. 
The credentials file provides information required for many commands. The credentials file is a convenient 
place for you to store command parameters so you don't have to repeatedly enter the information. 
Your credentials are used to calculate the signature value for every request you make. The Amazon EMR 
CLI automatically looks for these credentials in the file credentials.json. You can edit the 
credentials.json file and include your AWS credentials. If you do not have a credentials.json 
file, you must include your credentials in every request you make. 
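For example, without a credentials file you would supply your credentials on each call, along the lines of the following sketch; the --access-id and --private-key options shown are assumed here as the CLI options for passing credentials directly.

$ ./elastic-mapreduce --list --access-id AccessKeyID --private-key PrivateKey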
To create your credentials file 
1. Create a file named credentials.json on your computer. 
2. Add the following lines to your credentials file: 
{ 
"access-id": "AccessKeyID", 
"private-key": "PrivateKey", 
"key-pair": "KeyName", 
"key-pair-file": "location of key pair file", 
"region": "Region", 
"log-uri": "location of bucket on Amazon S3" 
} 
The access-id and private-key are the AWS Access Key ID and Secret Access Key described in Get 
Security Credentials (p. 21). The key-pair and key-pair-file are the Amazon EC2 key pair and the path 
and name of the PEM file you created in Create an Amazon EC2 Key Pair and PEM File (p. 20). The region 
is the Region you selected in Choose a Region (p. 17). The log-uri is the path to the bucket you created 
in Create and Configure an Amazon S3 Bucket (p. 19), using the format s3n://BucketName/FolderName. 
Your credentials.json file is configured. 
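A completed file might look like the following sketch; every value is a placeholder (the access key and secret key shown are the example values AWS publishes in its documentation, not working credentials).

{ 
  "access-id": "AKIAIOSFODNN7EXAMPLE", 
  "private-key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 
  "key-pair": "mykeypair", 
  "key-pair-file": "/home/user/mykeypair.pem", 
  "region": "us-east-1", 
  "log-uri": "s3n://myawsbucket/logs" 
}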
Each of the preceding tasks guided you through the steps to set up the objects and permissions required 
for a job flow. You are now ready to create your job flow. Instructions on how to create a job flow are at 
Create a Job Flow (p. 23). 
Create a Job Flow 
Topics 
• Choose a Job Flow Type (p. 23) 
• Choose Job Flow Interface (p. 24) 
• Identify Data, Scripts, and Log File locations (p. 24) 
• How to Create a Streaming Job Flow (p. 24) 
• How to Create a Job Flow Using Hive (p. 32) 
• How to Create a Job Flow Using Pig (p. 40) 
• How to Create a Job Flow Using a Custom JAR (p. 48) 
• How to Create a Cascading Job Flow (p. 56) 
• Launch an HBase Cluster on Amazon EMR (p. 64) 
This section covers the basics of creating a job flow using Amazon Elastic MapReduce (Amazon EMR). 
You can create a job flow using the Amazon EMR console, by downloading and installing the Command 
Line Interface (CLI), or by creating a query request with the Query API. The interface-specific details for 
using either the Amazon EMR console, the CLI, or the API are covered in the following sections. 
For information about creating the objects and setting the permissions needed to create a job flow, see 
Setting Up Your Environment to Run a Job Flow (p. 17). For information on the job flow process and how 
individual steps are processed, see Job Flows and Steps (p. 6). 
Choose a Job Flow Type 
Choose one of the supported job flow types: your choice of job flow type depends on several factors, 
including the format of the data and your level of programming knowledge. For information on comparing 
the supported job flow types, see Appendix: Compare Job Flow Types (p. 389). 
Choose Job Flow Interface 
Choose the manner in which you want to create your job flow. The description of each job flow type in 
this section includes details on how to create a job flow using the Amazon EMR console, the CLI, or 
Query API. The Amazon EMR console provides a graphical interface to launch Elastic MapReduce job 
flows and monitor their progress. The CLI combines full compatibility with the Elastic MapReduce API 
without requiring a programming environment.The Elastic MapReduce API, AWS SDK, and libraries offer 
the most flexibility, but require a programming environment and software development skills. 
Identify Data, Scripts, and Log File locations 
You need to plan the job flow you want to run and specify where Amazon EMR finds the information. 
Typically, the MapReduce program or script is located in a bucket on Amazon S3. Your job flow input, 
output, and job flow logs are also typically located on Amazon S3. 
Required Amazon S3 buckets must exist before you can create a job flow. You must upload any required 
scripts or data referenced in the job flow to Amazon S3. The following table describes example data, 
scripts, and log file locations. 
Information Example Location on Amazon S3 
script or program s3://myawsbucket/wordcount/wordSplitter.py 
log files s3://myawsbucket/wordcount/logs 
input data s3://myawsbucket/wordcount/input 
output data s3://myawsbucket/wordcount/output 
For information on how to upload objects to Amazon S3, go to Add an Object to Your Bucket in the 
Amazon Simple Storage Service Getting Started Guide. 
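Using the example locations in the preceding table, a streaming job flow created from the CLI might reference them as in the following sketch; myawsbucket and the paths are placeholders for your own bucket, and the sketch assumes you have already uploaded wordSplitter.py and the input data.

$ ./elastic-mapreduce --create --stream \
  --mapper s3://myawsbucket/wordcount/wordSplitter.py \
  --reducer aggregate \
  --input s3://myawsbucket/wordcount/input \
  --output s3://myawsbucket/wordcount/output \
  --log-uri s3://myawsbucket/wordcount/logs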
How to Create a Streaming Job Flow 
This section covers the basics of creating and launching a streaming job flow using Amazon Elastic 
MapReduce (Amazon EMR). You'll step through how to create a streaming job flow using either the 
Amazon EMR console, the CLI, or the Query API. Before you create your job flow, you'll need to create 
objects and permissions; for more information, see Setting Up Your Environment to Run a Job Flow (p. 17). 
A streaming job flow reads input from standard input and then runs a script or executable (called a mapper) 
against each input. The result from each of the inputs is saved locally, typically on a Hadoop Distributed 
File System (HDFS) partition. Once all the input is processed by the mapper, a second script or executable 
(called a reducer) processes the mapper results. The results from the reducer are sent to standard output. 
You can chain together a series of streaming job flows, where the output of one streaming job flow 
becomes the input of another job flow. 
The mapper and the reducer can each be referenced as a file, or you can supply a Java class. You can 
implement the mapper and reducer in any of the supported languages, including Ruby, Perl, Python, 
PHP, or Bash. 
The example that follows is based on the Amazon EMR Word Count Example. This example shows how 
to use Hadoop streaming to count the number of times each word occurs within a text file. In this example, 
the input is located in the Amazon S3 bucket s3n://elasticmapreduce/samples/wordcount/input. 
The mapper is a Python script that counts the number of times a word occurs in each input string and is 
located at s3://elasticmapreduce/samples/wordcount/wordSplitter.py. The reducer references 
a standard Hadoop library package called aggregate. Aggregate provides a special Java class and a list 
of simple aggregators that perform aggregations such as sum, max, and min over a sequence of values. 
The output is saved to an Amazon S3 bucket you created in Setting Up Your Environment to Run a Job 
Flow (p. 17). 
Amazon EMR Console 
This example describes how to use the Amazon EMR console to create a streaming job flow. 
To create a streaming job flow 
1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at 
https://console.aws.amazon.com/elasticmapreduce/. 
2. Click Create New Job Flow. 
3. In the DEFINE JOB FLOW page, do the following: 
a. Enter a name in the Job Flow Name field. This name is optional, and does not need to be 
unique. 
b. Select which version of Hadoop to run on your cluster in Hadoop Version. You can choose to 
run the Amazon distribution of Hadoop or one of two MapR distributions. For more information 
about MapR distributions for Hadoop, see Launch a Job Flow on the MapR Distribution for 
Hadoop (p. 260). 
c. Select Run your own application. 
d. Select Streaming in the drop-down list. 
e. Click Continue. 
4. In the SPECIFY PARAMETERS page, enter values in the boxes using the following table as a guide, 
and then click Continue. 
Field Action 

Input Location* 
Specify the URI where the input data resides in Amazon S3. The value must be in the form BucketName/path. 

Output Location* 
Specify the URI where you want the output stored in Amazon S3. The value must be in the form BucketName/path. 

Mapper* 
Specify either a class name that refers to a mapper class in Hadoop, or a path on Amazon S3 where the mapper executable, such as a Python program, resides. The path value must be in the form BucketName/path/MapperExecutable. 

Reducer* 
Specify either a class name that refers to a reducer class in Hadoop, or a path on Amazon S3 where the reducer executable, such as a Python program, resides. The path value must be in the form BucketName/path/ReducerExecutable. Amazon EMR supports the special aggregate keyword. For more information, go to the Aggregate library supplied by Hadoop. 

Extra Args 
Optionally, enter a list of arguments (space-separated strings) to pass to the Hadoop streaming utility. For example, you can specify additional files to load into the distributed cache. 

* Required parameter 
5. In the CONFIGURE EC2 INSTANCES page, select the type and number of instances, using the 
following table as a guide, and then click Continue. 
Note 
Twenty is the default maximum number of nodes per AWS account. For example, if you 
have two job flows running, the total number of nodes running for both job flows must be 
20 or less. If you need more than 20 nodes, you must submit a request to increase your 
Amazon EC2 instance limit. For more information, go to the Request to Increase Amazon 
EC2 Instance Limit Form. 
Field Action 

Instance Count 
Specify the number of nodes to use in the Hadoop cluster. There is always one master node in each job flow. You can specify the number of core and task nodes. 

Instance Type 
Specify the Amazon EC2 instance types to use as master, core, and task nodes. Valid types are m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, hs1.8xlarge, and cg1.4xlarge. The cc2.8xlarge instance type is only supported in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions. The cc1.4xlarge and hs1.8xlarge instance types are only supported in the US East (Northern Virginia) Region. 

Request Spot Instances 
Specify whether to run master, core, or task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (p. 141). 

* Required parameter 
6. In the ADVANCED OPTIONS page, set additional configuration options, using the following table 
as a guide, and then click Continue. 
Field Action 

Amazon EC2 Key Pair 
Optionally, specify a key pair that you created previously. For more information, see Create an Amazon EC2 Key Pair and PEM File (p. 20). If you do not enter a value in this field, you cannot SSH into the master node. 

Amazon VPC Subnet Id 
Optionally, specify a VPC subnet identifier to launch the job flow in an Amazon VPC. For more information, see Running Job Flows on an Amazon VPC (p. 381). 

Amazon S3 Log Path 
Optionally, specify a path in Amazon S3 to store the Amazon EMR log files. The value must be in the form BucketName/path. If you do not supply a location, Amazon EMR does not log any files. 

Enable Debugging 
Select Yes to store Amazon Elastic MapReduce-generated log files. You must enable debugging at this level if you want to store the log files generated by Amazon EMR. If you select Yes, you must supply an Amazon S3 bucket name where Amazon Elastic MapReduce can upload your log files. For more information, see Troubleshooting (p. 183). 
Important 
You can enable debugging for a job flow only when you initially create the job flow. 

Keep Alive 
Select Yes to cause the job flow to continue running when all processing is completed. 

Termination Protection 
Select Yes to ensure the job flow is not shut down due to accident or error. For more information, see Protect a Job Flow from Termination (p. 136). 

Visible To All IAM Users 
Select Yes to make the job flow visible and accessible to all IAM users on the AWS account. For more information, see Configure User Permissions with IAM (p. 274). 
7. In the BOOTSTRAP ACTIONS page, select Proceed with no Bootstrap Actions, and then click 
Continue. 
For more information about bootstrap actions, see Bootstrap Actions (p. 84). 
8. In the REVIEW page, review the information, edit as necessary to correct any of the values, and 
then click Create Job Flow when the information is correct. 
After you click Create Job Flow, your request is processed; when it succeeds, a message appears. 
9. Click Close. 
The Amazon EMR console shows the new job flow starting. Starting a new job flow may take several 
minutes, depending on the number and type of EC2 instances Amazon EMR is launching and 
configuring. Click the Refresh button for the latest view of the job flow's progress. 
CLI 
This example describes how to use the CLI to create a streaming job flow. Replace myawsbucket with 
your Amazon S3 bucket information. 
To create a job flow 
• Use the information in the following table to create your job flow: 
If you are using...   Enter the following...

Linux or UNIX:

$ ./elastic-mapreduce --create --stream \
  --input s3n://elasticmapreduce/samples/wordcount/input \
  --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
  --reducer aggregate \
  --output s3n://myawsbucket

Microsoft Windows:

c:\ruby elastic-mapreduce --create --stream --input s3n://elasticmapreduce/samples/wordcount/input --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --reducer aggregate --output s3n://myawsbucket
The output looks similar to the following. 
Created jobflow JobFlowID 
By default, this command launches a job flow to run on a single-node cluster using an Amazon EC2 
m1.small instance. Later, when your steps are running correctly on a small set of sample data, you can 
launch job flows to run on multiple nodes. You can specify the number of nodes and the type of instance 
to run with the --num-instances and --instance-type parameters, respectively. 
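For example, the following sketch launches the same word count job flow on a larger cluster; the instance count and instance type shown are arbitrary values chosen for illustration.

$ ./elastic-mapreduce --create --stream \
  --input s3n://elasticmapreduce/samples/wordcount/input \
  --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
  --reducer aggregate \
  --output s3n://myawsbucket \
  --num-instances 4 --instance-type m1.large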
API 
This section describes the Amazon EMR API Query request parameters you need to create a streaming 
job flow. The response includes a <JobFlowID>, which you use in other Amazon EMR operations, such 
as when describing or terminating a job flow. For this reason, it is important to store job flow IDs. 
The Args argument contains location information for your input data, output data, mapper, reducer, and 
cache file, as shown in the following example. 
"Name": "streaming job flow", 
"HadoopJarStep": 
{ 
"Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar", 
"Args": 
[ 
"-input", "s3n://elasticmapreduce/samples/wordcount/input", 
"-output", "s3n://myawsbucket", 
"-mapper", "s3://elasticmapreduce/samples/wordcount/wordSplit 
ter.py", 
"-reducer", "aggregate" 
] 
} 
Note 
All paths are prefixed with their location. The prefix "s3://" refers to the s3n file system. If you 
use HDFS, prepend the path with hdfs:///. Make sure to use three slashes (///), as in 
hdfs:///home/hadoop/sampleInput2/. 
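To show where the step definition fits, the following is a minimal sketch of the other elements a RunJobFlow request typically carries; the instance types, instance count, and key pair name are placeholder values, and the complete set of request members is described in the API Reference.

"Name": "streaming job flow", 
"Instances": 
{ 
    "MasterInstanceType": "m1.small", 
    "SlaveInstanceType": "m1.small", 
    "InstanceCount": "3", 
    "Ec2KeyName": "mykeypair", 
    "KeepJobFlowAliveWhenNoSteps": "false" 
}, 
"Steps": 
[ 
    { 
        "Name": "streaming step", 
        "ActionOnFailure": "TERMINATE_JOB_FLOW", 
        "HadoopJarStep": { ... } 
    } 
]

The HadoopJarStep element is the step definition shown earlier in this section.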
How to Create a Job Flow Using Hive 
This section covers the basics of creating a job flow using Hive in Amazon Elastic MapReduce (Amazon 
EMR). You'll step through how to create a job flow using Hive with either the Amazon EMR console, the 
CLI, or the Query API. Before you create your job flow, you'll need to create objects and permissions; for 
more information, see Setting Up Your Environment to Run a Job Flow (p. 17). 
For advanced information on Hive configuration options, see Hive Configuration (p. 348). 
A job flow using Hive enables you to create a data analysis application using a SQL-like language. The 
example that follows is based on the Amazon EMR sample: Contextual Advertising using Apache Hive 
and Amazon EMR with High Performance Computing instances. This sample describes how to correlate 
customer click data to specific advertisements. 
In this example, the Hive script is located in an Amazon S3 bucket at 
s3n://elasticmapreduce/samples/hive-ads/libs/model-build. All of the data processing 
instructions are located in the Hive script. The script requires additional libraries that are located in an 
Amazon S3 bucket at s3n://elasticmapreduce/samples/hive-ads/libs. The input data is located 
in the Amazon S3 bucket s3n://elasticmapreduce/samples/hive-ads/tables. The output is 
saved to an Amazon S3 bucket you created as part of Setting Up Your Environment to Run a Job 
Flow (p. 17). 
Amazon EMR Console 
This example describes how to use the Amazon EMR console to create a job flow using Hive. 
To create a job flow using Hive 
1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at 
https://console.aws.amazon.com/elasticmapreduce/. 
2. Click Create New Job Flow. 
3. In the DEFINE JOB FLOW page, do the following: 
a. Enter a name in the Job Flow Name field. 
We recommend that you use a descriptive name. It does not need to be unique. 
b. Select which version of Hadoop to run on your cluster in Hadoop Version. You can choose to 
run the Amazon distribution of Hadoop or one of two MapR distributions. For more information 
about MapR distributions for Hadoop, see Launch a Job Flow on the MapR Distribution for 
Hadoop (p. 260). 
c. Select Run your own application. 
d. Select Hive in the drop-down list. 
e. Click Continue. 
4. In the SPECIFY PARAMETERS page, specify whether you want to run the Hive job from a script or 
interactively from the master node. If you are running Hive from a script, enter values in the boxes 
using the following table as a guide. Click Continue. 
Field Action 

Script Location* 
Specify the URI where your script resides in Amazon S3. The value must be in the form BucketName/path/ScriptName. 

Input Location 
Optionally, specify the URI where your input files reside in Amazon S3. The value must be in the form BucketName/path/. If specified, this will be passed to the Hive script as a parameter named INPUT. 

Output Location 
Optionally, specify the URI where you want the output stored in Amazon S3. The value must be in the form BucketName/path. If specified, this will be passed to the Hive script as a parameter named OUTPUT. 

Extra Args 
Optionally, enter a list of arguments (space-separated strings) to pass to Hive. 

* Required parameter 
5. In the CONFIGURE EC2 INSTANCES page, select the type and number of instances, using the 
following table as a guide, and then click Continue. 
Note 
Twenty is the default maximum number of nodes per AWS account. For example, if you 
have two job flows running, the total number of nodes running for both job flows must be 
20 or less. If you need more than 20 nodes, you must submit a request to increase your 
Amazon EC2 instance limit. For more information, go to the Request to Increase Amazon 
EC2 Instance Limit Form. 
Field Action 

Instance Count 
Specify the number of nodes to use in the Hadoop cluster. There is always one master node in each job flow. You can specify the number of core and task nodes. 

Instance Type 
Specify the Amazon EC2 instance types to use as master, core, and task nodes. Valid types are m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, hs1.8xlarge, and cg1.4xlarge. The cc2.8xlarge instance type is only supported in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions. The cc1.4xlarge and hs1.8xlarge instance types are only supported in the US East (Northern Virginia) Region. 

Request Spot Instances 
Specify whether to run master, core, or task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (p. 141). 

* Required parameter 
6. In the ADVANCED OPTIONS page, set additional configuration options, using the following table 
as a guide, and then click Continue. 
Field Action 

Amazon EC2 Key Pair 
Optionally, specify a key pair that you created previously. For more information, see Create an Amazon EC2 Key Pair and PEM File (p. 20). If you do not enter a value in this field, you cannot SSH into the master node. 

Amazon VPC Subnet Id 
Optionally, specify a VPC subnet identifier to launch the job flow in an Amazon VPC. For more information, see Running Job Flows on an Amazon VPC (p. 381). 

Amazon S3 Log Path 
Optionally, specify a path in Amazon S3 to store the Amazon EMR log files. The value must be in the form BucketName/path. If you do not supply a location, Amazon EMR does not log any files. 

Enable Debugging 
Select Yes to store Amazon Elastic MapReduce-generated log files. You must enable debugging at this level if you want to store the log files generated by Amazon EMR. If you select Yes, you must supply an Amazon S3 bucket name where Amazon Elastic MapReduce can upload your log files. For more information, see Troubleshooting (p. 183). 
Important 
You can enable debugging for a job flow only when you initially create the job flow. 

Keep Alive 
Select Yes to cause the job flow to continue running when all processing is completed. 

Termination Protection 
Select Yes to ensure the job flow is not shut down due to accident or error. For more information, see Protect a Job Flow from Termination (p. 136). 

Visible To All IAM Users 
Select Yes to make the job flow visible and accessible to all IAM users on the AWS account. For more information, see Configure User Permissions with IAM (p. 274). 
7. In the BOOTSTRAP ACTIONS page, select Proceed with no Bootstrap Actions, and then click 
Continue. 
For more information about bootstrap actions, see Bootstrap Actions (p. 84). 
8. In the REVIEW page, review the information, edit as necessary to correct any of the values, and 
then click Create Job Flow when the information is correct. 
After you click Create Job Flow, your request is processed; when it succeeds, a message appears. 
9. Click Close. 
The Amazon EMR console shows the new job flow starting. Starting a new job flow may take several 
minutes, depending on the number and type of EC2 instances Amazon EMR is launching and 
configuring. Click the Refresh button for the latest view of the job flow's progress. 
CLI 
This example describes how to use the CLI to create a job flow using Hive. 
To create a job flow using Hive 
• Use the information in the following table to create your job flow: 
If you are using...   Enter the following...

Linux or UNIX:

$ ./elastic-mapreduce --create --name "Test Hive" \
  --hive-script s3n://elasticmapreduce/samples/hive-ads/libs/model-build.q \
  --args "-d","LIBS=s3n://elasticmapreduce/samples/hive-ads/libs", \
  "-d","INPUT=s3n://elasticmapreduce/samples/hive-ads/tables", \
  "-d","OUTPUT=s3n://myawsbucket/hive-ads/output/"

Microsoft Windows:

c:\ruby elastic-mapreduce --create --name "Test Hive" --hive-script s3n://elasticmapreduce/samples/hive-ads/libs/model-build.q --args "-d","LIBS=s3n://elasticmapreduce/samples/hive-ads/libs","-d","INPUT=s3n://elasticmapreduce/samples/hive-ads/tables","-d","OUTPUT=s3n://myawsbucket/hive-ads/output/"
The output looks similar to the following. 
Created job flow JobFlowID 
By default, this command launches a job flow to run on a two-node cluster using Amazon EC2 m1.small 
instances. Later, when your steps are running correctly on a small set of sample data, you can launch job 
flows to run on multiple nodes. You can specify the number of nodes and the type of instance to run with 
the --num-instances and --instance-type parameters, respectively. 
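If you prefer to run Hive interactively from the master node instead of from a script, a sketch of the CLI invocation looks like the following; the --alive and --hive-interactive options are assumed here, and after the job flow starts you connect to the master node with SSH and run the hive command.

$ ./elastic-mapreduce --create --alive --name "Hive interactive" \
  --num-instances 3 --instance-type m1.small \
  --hive-interactive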
API 
This section describes the Amazon EMR API Query request parameters you need to create a job flow 
using Hive. For an explanation of the parameters unique to RunJobFlow, go to RunJobFlow in the Amazon 
Elastic MapReduce (Amazon EMR) API Reference. The response includes a <JobFlowID>, which you 
use in other Amazon EMR operations, such as when describing or terminating a job flow. For this reason, 
it is important to store job flow IDs. 
The Args argument contains location information for your input data, output data, and LIBS, as shown 
in the following example. 
"Name": "Hive job flow", 
"HadoopJarStep": 
{ 
"Jar":"s3://us-west-1.elasticmapreduce/libs/script-runner/script-runner.jar", 
"Args":[ 
"s3://us-west-1.elasticmapreduce/libs/hive/hive-script", 
"--base-path", 
"s3://us-west-1.elasticmapreduce/libs/hive/", 
"--run-hive-script", 
"--args", 
"-f", 
"s3n://elasticmapreduce/samples/hive-ads/libs/model-build.q", 
"-d LIBS=s3n://elasticmapreduce/samples/hive-ads/libs" 
] 
} 
Note 
All paths are prefixed with their location. The prefix "s3://" refers to the s3n file system. If you 
use HDFS, prepend the path with hdfs:///. Make sure to use three slashes (///), as in 
hdfs:///home/hadoop/sampleInput2/. 
How to Create a Job Flow Using Pig 
This section covers the basics of creating a job flow using Pig in Amazon Elastic MapReduce (Amazon 
EMR). You'll step through how to create a job flow using Pig with either the Amazon EMR console, the 
CLI, or the Query API. Before you create your job flow, you'll need to create objects and permissions; for 
more information, see Setting Up Your Environment to Run a Job Flow (p. 17). 
A job flow using Pig takes SQL-like commands written in Pig Latin, and converts those commands into 
Hadoop MapReduce algorithms. The examples that follow are based on the Amazon EMR sample: 
Apache Log Analysis using Pig. The sample evaluates Apache log files and then generates a report 
containing the total bytes transferred, a list of the top 50 IP addresses, a list of the top 50 external referrers, 
and the top 50 search terms using Bing and Google. The Pig script is located in the Amazon S3 bucket 
s3n://elasticmapreduce/samples/pig-apache/do-reports2.pig. Input data is located in the 
Amazon S3 bucket s3n://elasticmapreduce/samples/pig-apache/input. The output is saved 
to an Amazon S3 bucket you created as part of Setting Up Your Environment to Run a Job Flow (p. 17). 
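As a quick orientation before the console steps, the following CLI sketch shows one way to run the same report script as a batch job flow; the --pig-script option and the -p parameter syntax are assumptions to verify for your version of the CLI, and myawsbucket is a placeholder for your own bucket.

$ ./elastic-mapreduce --create --name "Apache Log Reports" \
  --pig-script s3n://elasticmapreduce/samples/pig-apache/do-reports2.pig \
  --args "-p,INPUT=s3n://elasticmapreduce/samples/pig-apache/input,-p,OUTPUT=s3n://myawsbucket/pig-apache/output"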
Amazon EMR Console 
This example describes how to use the Amazon EMR console to create a job flow using Pig. 
To create a job flow using Pig 
1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at 
https://console.aws.amazon.com/elasticmapreduce/. 
2. Click Create New Job Flow. 
3. In the DEFINE JOB FLOW page, enter the following: 
a. Enter a name in the Job Flow Name field. 
We recommend that you use a descriptive name. It does not need to be unique. 
b. Select which version of Hadoop to run on your cluster in Hadoop Version. You can choose to 
run the Amazon distribution of Hadoop or one of two MapR distributions. For more information 
about MapR distributions for Hadoop, see Launch a Job Flow on the MapR Distribution for 
Hadoop (p. 260). 
c. Select Run your own application. 
d. Select Pig Program in the drop-down list. 
e. Click Continue. 
 
hci10_help_sap_en.pdf
hci10_help_sap_en.pdfhci10_help_sap_en.pdf
hci10_help_sap_en.pdf
 
OAF Developer Guide 13.1.3
OAF Developer Guide 13.1.3OAF Developer Guide 13.1.3
OAF Developer Guide 13.1.3
 
Essbase database administrator's guide
Essbase database administrator's guideEssbase database administrator's guide
Essbase database administrator's guide
 
Microsoft Dynamics CRM - Connector Overview
Microsoft Dynamics CRM - Connector OverviewMicrosoft Dynamics CRM - Connector Overview
Microsoft Dynamics CRM - Connector Overview
 

Recently uploaded

"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 

Amazon elastic map reduce

  • 1. Amazon Elastic MapReduce Developer Guide API Version 2009-11-30
  • 2. Amazon Elastic MapReduce Developer Guide Amazon Web Services
  • 3. Amazon Elastic MapReduce Developer Guide Amazon Elastic MapReduce: Developer Guide Amazon Web Services Copyright © 2012 Amazon Web Services, Inc. and/or its affiliates. All rights reserved. The following are trademarks or registered trademarks of Amazon: Amazon, Amazon.com, Amazon.com Design, Amazon DevPay, Amazon EC2, Amazon Web Services Design, AWS, CloudFront, EC2, Elastic Compute Cloud, Kindle, and Mechanical Turk. In addition, Amazon.com graphics, logos, page headers, button icons, scripts, and service names are trademarks, or trade dress of Amazon in the U.S. and/or other countries. Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.
  • 4. Amazon Elastic MapReduce Developer Guide Welcome ................................................................................................................................................. 1 Understand Amazon EMR ...................................................................................................................... 2 Overview of Amazon EMR ...................................................................................................................... 2 Architectural Overview of Amazon EMR ....................................................................................... 3 Elastic MapReduce Features ........................................................................................................ 4 Amazon EMR Concepts ......................................................................................................................... 6 Job Flows and Steps ..................................................................................................................... 6 Hadoop and MapReduce .............................................................................................................. 7 Associated AWS Product Concepts ...................................................................................................... 11 Using Amazon EMR ............................................................................................................................. 15 Setting Up Your Environment to Run a Job Flow .................................................................................. 17 Create a Job Flow ................................................................................................................................. 23 How to Create a Streaming Job Flow .......................................................................................... 24 How to Create a Job Flow Using Hive ......................................................................................... 32 How to Create a Job Flow Using Pig ........................................................................................... 40 How to Create a Job Flow Using a Custom JAR ......................................................................... 48 How to Create a Cascading Job Flow ......................................................................................... 56 Launch an HBase Cluster on Amazon EMR ............................................................................... 64 View Job Flow Details ........................................................................................................................... 72 Terminate a Job Flow ............................................................................................................................ 77 Customize a Job Flow .......................................................................................................................... 79 Add Steps to a Job Flow ............................................................................................................. 79 Wait for Steps to Complete ................................................................................................ 81 Add More than 256 Steps to a Job Flow ............................................................................ 82 Bootstrap Actions ........................................................................................................................ 
84 Resizing Running Job Flows ....................................................................................................... 96 Calling Additional Files and Libraries ........................................................................................ 104 Using Distributed Cache .................................................................................................. 104 Running a Script in a Job Flow ........................................................................................ 109 Connect to the Master Node in an Amazon EMR Job Flow ............................................................... 110 Connect to the Master Node Using SSH ................................................................................... 111 Web Interfaces Hosted on the Master Node ............................................................................. 115 Open an SSH Tunnel to the Master Node ................................................................................. 116 Configure Foxy Proxy to View Websites Hosted on the Master Node ....................................... 117 Use Cases .......................................................................................................................................... 122 Cascading ................................................................................................................................. 122 Pig ............................................................................................................................................. 126 Streaming .................................................................................................................................. 129 Building Binaries Using Amazon EMR ................................................................................................ 131 Using Tagging ..................................................................................................................................... 136 Protect a Job Flow from Termination .................................................................................................. 136 Lower Costs with Spot Instances ........................................................................................................ 141 Choosing What to Launch as Spot Instances ........................................................................... 142 Spot Instance Pricing in Amazon EMR ..................................................................................... 144 Availability Zones and Regions ................................................................................................. 144 Launching Spot Instances in Job Flows .................................................................................... 145 Changing the Number of Spot Instances in a Job Flow ............................................................ 151 Troubleshooting Spot Instances ................................................................................................ 154 Store Data with HBase ....................................................................................................................... 155 HBase Job Flow Prerequisites .................................................................................................. 155 Launch an HBase Cluster on Amazon EMR ............................................................................. 156 Connect to HBase Using the Command Line ............................................................................ 
164 Back Up and Restore HBase .................................................................................................... 165 Terminate an HBase Cluster ..................................................................................................... 174 Configure HBase ....................................................................................................................... 174 Access HBase Data with Hive ................................................................................................... 178 View the HBase User Interface ................................................................................................. 180 View HBase Log Files ............................................................................................................... 180 API Version 2009-11-30 4
  • 5. Amazon Elastic MapReduce Developer Guide Monitor HBase with CloudWatch ............................................................................................... 181 Monitor HBase with Ganglia ...................................................................................................... 181 Troubleshooting .................................................................................................................................. 183 Things to Check When Your Amazon EMR Job Flow Fails ....................................................... 183 Amazon EMR Logging .............................................................................................................. 187 Enable Logging and Debugging ................................................................................................ 187 Use Log Files ............................................................................................................................ 190 Monitor Hadoop on the Master Node ........................................................................................ 199 View the Hadoop Web Interfaces .............................................................................................. 200 Troubleshooting Tips ................................................................................................................. 204 Monitor Metrics with Amazon CloudWatch ......................................................................................... 209 Monitor Performance with Ganglia ...................................................................................................... 220 Distributed Copy Using S3DistCp ....................................................................................................... 227 Export, Query, and Join Tables in Amazon DynamoDB ...................................................................... 234 Prerequisites for Integrating Amazon EMR ............................................................................... 235 Step 1: Create a Key Pair .......................................................................................................... 235 Step 2: Create a Job Flow ......................................................................................................... 236 Step 3: SSH into the Master Node ............................................................................................ 241 Step 4: Set Up a Hive Table to Run Hive Commands ................................................................ 244 Hive Command Examples for Exporting, Importing, and Querying Data .................................. 248 Optimizing Performance ............................................................................................................ 255 Use Third Party Applications With Amazon EMR ............................................................................... 258 Parse Data with HParser ........................................................................................................... 258 Using Karmasphere Analytics ................................................................................................... 259 Launch a Job Flow on the MapR Distribution for Hadoop ......................................................... 260 Write Amazon EMR Applications ........................................................................................................ 
263 Common Concepts for API Calls ........................................................................................................ 263 Use SDKs to Call Amazon EMR APIs ................................................................................................ 265 Using the AWS SDK for Java to Create an Amazon EMR Job Flow ......................................... 266 Using the AWS SDK for .Net to Create an Amazon EMR Job Flow .......................................... 267 Using the Java SDK to Sign a Query Request .......................................................................... 267 Use Query Requests to Call Amazon EMR APIs ............................................................................... 268 Why Query Requests Are Signed ............................................................................................. 269 Components of a Query Request in Amazon EMR ................................................................... 269 How to Generate a Signature for a Query Request in Amazon EMR ........................................ 270 Configure Amazon EMR ..................................................................................................................... 274 Configure User Permissions with IAM ................................................................................................ 274 Set Policy for an IAM User ........................................................................................................ 277 Configure IAM Roles for Amazon EMR .............................................................................................. 280 Set Access Permissions on Files Written to Amazon S3 .................................................................... 285 Using Elastic IP Addresses ................................................................................................................. 287 Specify the Amazon EMR AMI Version ............................................................................................... 290 Hadoop Configuration ......................................................................................................................... 299 Supported Hadoop Versions ..................................................................................................... 300 Configuration of hadoop-user-env.sh ........................................................................................ 302 Upgrading to Hadoop 1.0 .......................................................................................................... 302 Hadoop Version Behavior ................................................................................................ 303 Hadoop 0.20 Streaming Configuration ...................................................................................... 304 Hadoop Default Configuration (AMI 1.0) ................................................................................... 304 Hadoop Configuration (AMI 1.0) ...................................................................................... 304 HDFS Configuration (AMI 1.0) ......................................................................................... 307 Task Configuration (AMI 1.0) ........................................................................................... 308 Intermediate Compression (AMI 1.0) ............................................................................... 
311 Hadoop Memory-Intensive Configuration Settings (AMI 1.0) ................................................... 311 Hadoop Default Configuration (AMI 2.0 and 2.1) ...................................................................... 314 Hadoop Configuration (AMI 2.0 and 2.1) ......................................................................... 314 HDFS Configuration (AMI 2.0 and 2.1) ............................................................................ 318 Task Configuration (AMI 2.0 and 2.1) .............................................................................. 318 API Version 2009-11-30 5
  • 6. Amazon Elastic MapReduce Developer Guide Intermediate Compression (AMI 2.0 and 2.1) .................................................................. 321 Hadoop Default Configuration (AMI 2.2) ................................................................................... 322 Hadoop Configuration (AMI 2.2) ...................................................................................... 322 HDFS Configuration (AMI 2.2) ......................................................................................... 326 Task Configuration (AMI 2.2) ........................................................................................... 326 Intermediate Compression (AMI 2.2) ............................................................................... 329 Hadoop Default Configuration (AMI 2.3) ................................................................................... 330 Hadoop Configuration (AMI 2.3) ...................................................................................... 330 HDFS Configuration (AMI 2.3) ......................................................................................... 334 Task Configuration (AMI 2.3) ........................................................................................... 334 Intermediate Compression (AMI 2.3) ............................................................................... 337 File System Configuration ......................................................................................................... 338 JSON Configuration Files .......................................................................................................... 340 Multipart Upload ........................................................................................................................ 343 Hadoop Data Compression ....................................................................................................... 344 Setting Permissions on the System Directory ........................................................................... 345 Hadoop Patches ........................................................................................................................ 346 Hive Configuration .............................................................................................................................. 348 Supported Hive Versions ........................................................................................................... 349 Share Data Between Hive Versions ........................................................................................... 353 Differences from Apache Hive Defaults .................................................................................... 353 Interactive and Batch Modes ..................................................................................................... 355 Creating a Metastore Outside the Hadoop Cluster ................................................................... 357 Using the Hive JDBC Driver ...................................................................................................... 359 Additional Features of Hive in Amazon EMR ............................................................................ 362 Upgrade to Hive 0.8 .................................................................................................................. 368 Upgrade the Configuration Files ...................................................................................... 
368 Upgrade the Metastore .................................................................................................... 369 Upgrade to Hive 0.8 (MySQL on the Master Node) ................................................ 369 Upgrade to Hive 0.8 (MySQL on Amazon RDS) ..................................................... 373 Pig Configuration ................................................................................................................................ 377 Supported Pig Versions ............................................................................................................. 377 Pig Version Details .................................................................................................................... 379 Performance Tuning ............................................................................................................................ 381 Running Job Flows on an Amazon VPC ............................................................................................. 381 Appendix: Compare Job Flow Types ................................................................................................... 389 Appendix: Amazon EMR Resources ................................................................................................... 391 Document History ............................................................................................................................... 396 Glossary ............................................................................................................................................. 393 Index ................................................................................................................................................... 401 API Version 2009-11-30 6
  • 7. Welcome
    This is the Amazon Elastic MapReduce (Amazon EMR) Developer Guide. This guide provides a conceptual overview of Amazon EMR, an overview of related AWS products, and detailed information on all functionality available from Amazon EMR. Amazon EMR is a web service that makes it easy to process large amounts of data efficiently. Amazon EMR uses Hadoop processing combined with several AWS products to do such tasks as web indexing, data mining, log file analysis, machine learning, scientific simulation, and data warehousing.
    How Do I...?
    Decide whether Amazon EMR is right for my needs: Amazon Elastic MapReduce detail page
    Get started with Amazon EMR: Getting Started Guide
    Learn about troubleshooting job flows: Troubleshooting (p. 183)
    Learn how to create a job flow: Create a Job Flow (p. 23)
    Learn about bootstrap actions: Bootstrap Actions (p. 84)
    Learn about Hadoop cluster configuration: Hadoop Configuration (p. 299)
    Learn about the Amazon EMR API: Write Amazon EMR Applications (p. 263)
    Compare different job flow types: Appendix: Compare Job Flow Types (p. 389)
  • 8. Understand Amazon EMR
    Topics
    • Overview of Amazon EMR (p. 2)
    • Amazon EMR Concepts (p. 6)
    • Associated AWS Product Concepts (p. 11)
    This introduction to Amazon Elastic MapReduce (Amazon EMR) provides a summary of this web service. After reading this section, you should understand the service features, know how Amazon EMR interacts with other AWS products, and understand the basic functions of Amazon EMR. In this guide, we assume that you have read and completed the instructions described in the Getting Started Guide, which provides information on creating your Amazon Elastic MapReduce (Amazon EMR) account and credentials. You should be familiar with the following:
    • Hadoop. For more information, go to http://hadoop.apache.org/core/.
    • Amazon Elastic Compute Cloud (Amazon EC2), Amazon Simple Storage Service (Amazon S3), and Amazon SimpleDB. For more information, see the Amazon Elastic Compute Cloud User Guide, the Amazon Simple Storage Service Developer Guide, and the Amazon SimpleDB Developer Guide, respectively.
    Overview of Amazon EMR
    Amazon Elastic MapReduce (Amazon EMR) is a data analysis tool that simplifies the set-up and management of a computer cluster, the source data, and the computational tools that help you implement sophisticated data processing jobs quickly. Typically, data processing involves performing a series of relatively simple operations on large amounts of data. In Amazon EMR, each operation is called a step and a sequence of steps is a job flow. A job flow that processes encrypted data might look like the following example:
    Step 1: Decrypt data
    Step 2: Process data
    Step 3: Encrypt data
    Step 4: Save data
  • 9. Amazon EMR uses Hadoop to divide up the work among the instances in the cluster, track status, and combine the individual results into one output. For an overview of Hadoop, see What Is Hadoop? (p. 8). Amazon EMR takes care of provisioning a Hadoop cluster, running the job flow, terminating the job flow, moving the data between Amazon EC2 and Amazon S3, and optimizing Hadoop. Amazon EMR removes most of the cumbersome details of setting up the hardware and networking required by the Hadoop cluster, such as monitoring the setup, configuring Hadoop, and executing the job flow. Together, Amazon EMR and Hadoop provide all of the power of Hadoop processing with the ease, low cost, scalability, and power that Amazon S3 and Amazon EC2 offer.
    Architectural Overview of Amazon EMR
    Amazon Elastic MapReduce (Amazon EMR) works in conjunction with Amazon EC2 to create a Hadoop cluster, and with Amazon S3 to store scripts, input data, log files, and output results.
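    Before a job flow starts, the scripts and input data it reads must already be in Amazon S3. The guide's walkthroughs use the console and the command line interface for this; purely as an illustration, the following sketch uploads a mapper script and an input file with the boto 2.x Python library. The bucket name and key paths are hypothetical placeholders.

        # Upload a streaming mapper script and one input file to Amazon S3 (boto 2.x).
        # AWS credentials are read from the environment or the ~/.boto config file.
        import boto
        from boto.s3.key import Key

        conn = boto.connect_s3()
        bucket = conn.create_bucket('my-emr-example-bucket')  # hypothetical bucket name

        # The job flow will read the mapper from this key.
        mapper_key = Key(bucket)
        mapper_key.key = 'wordcount/scripts/mapper.py'
        mapper_key.set_contents_from_filename('mapper.py')

        # One input file under the job flow's input prefix.
        input_key = Key(bucket)
        input_key.key = 'wordcount/input/part-00000'
        input_key.set_contents_from_filename('input.txt')

        print('Uploaded mapper and input to s3://my-emr-example-bucket/wordcount/')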
  • 10. The Amazon EMR process is outlined in the following figure and table.
    Amazon EMR Process
    1. Upload to Amazon S3 the data you want to process, as well as the mapper and reducer executables that process the data, and then send a request to Amazon EMR to start a job flow.
    2. Amazon EMR starts a Hadoop cluster, which loads any specified bootstrap actions and then runs Hadoop on each node.
    3. Hadoop executes a job flow by downloading data from Amazon S3 to core and task nodes. Alternatively, the data is loaded dynamically at run time by mapper tasks.
    4. Hadoop processes the data and then uploads the results from the cluster to Amazon S3.
    5. The job flow is completed and you retrieve the processed data from Amazon S3.
    For details on mapping legacy job flows to instance groups, see Mapping Legacy Job Flows to Instance Groups (p. 102).
    Elastic MapReduce Features
    Topics
    • Bootstrap Actions (p. 4)
    • Configurable Data Storage (p. 4)
    • Hadoop and Step Logging (p. 5)
    • Hive Support (p. 5)
    • Resizeable Running Job Flows (p. 5)
    • Secure Data (p. 5)
    • Supports Hadoop Methods (p. 5)
    • Multiple Sequential Steps (p. 5)
    The following sections describe the features available in Amazon Elastic MapReduce (Amazon EMR).
    Bootstrap Actions
    A bootstrap action is a mechanism that lets you run a script on Elastic MapReduce cluster nodes before Hadoop starts. Bootstrap action scripts are stored in Amazon S3 and passed to Amazon EMR when creating a new job flow. Bootstrap action scripts are downloaded from Amazon S3 and executed on each node before the job flow is executed. By using bootstrap actions, you can install software on the node, modify the default Hadoop site configuration, or change the way Java parameters are used to run Hadoop daemons. Both predefined and custom bootstrap actions are available. The predefined bootstrap actions include Configure Hadoop, Configure Daemons, and Run-if. You can write custom bootstrap actions in any language already installed on the job flow instance, such as Ruby, Python, Perl, or bash. You can specify a bootstrap action from the command line interface, from the Amazon EMR console, or from the Amazon EMR API when starting a job flow. For more information, see Bootstrap Actions (p. 84).
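    As an illustration of the process above, the sketch below starts a streaming job flow that runs one custom bootstrap action before Hadoop starts. It assumes the boto 2.x Python library (the guide's own SDK examples use Java and .NET), and every bucket name, script path, and instance setting shown is a hypothetical placeholder; module and parameter names may differ between boto releases.

        # Start a streaming job flow that runs a custom bootstrap action first (boto 2.x).
        # All names, paths, and sizes below are placeholders, not values from this guide.
        from boto.emr.connection import EmrConnection
        from boto.emr.step import StreamingStep
        from boto.emr.bootstrap_action import BootstrapAction

        conn = EmrConnection()  # credentials from the environment or ~/.boto

        # Custom bootstrap action: a shell script you have uploaded to Amazon S3.
        install_tools = BootstrapAction(
            'Install extra packages',
            's3://my-emr-example-bucket/bootstrap/install-packages.sh',
            [])

        # One streaming step; 'aggregate' is Hadoop streaming's built-in sum reducer.
        wordcount = StreamingStep(
            name='Word count',
            mapper='s3n://my-emr-example-bucket/wordcount/scripts/mapper.py',
            reducer='aggregate',
            input='s3n://my-emr-example-bucket/wordcount/input',
            output='s3n://my-emr-example-bucket/wordcount/output')

        jobflow_id = conn.run_jobflow(
            name='Word count job flow',
            log_uri='s3://my-emr-example-bucket/logs',
            bootstrap_actions=[install_tools],
            steps=[wordcount],
            master_instance_type='m1.small',
            slave_instance_type='m1.small',
            num_instances=3)

        print('Started job flow ' + jobflow_id)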
  • 11. Configurable Data Storage
    Amazon EMR supports the Hadoop Distributed File System (HDFS). HDFS is fault-tolerant, scalable, and easily configurable. The default configuration is already optimized for most job flows. Generally, the configuration needs to be changed only for very large clusters. Configuration changes are accomplished using bootstrap actions. For more information, see Hadoop Configuration (p. 299).
    Hadoop and Step Logging
    Amazon EMR provides detailed logs you can use to debug both Hadoop and Amazon EMR. For more information on how to create logs, view logs, and use them to troubleshoot a job flow, see Troubleshooting (p. 183).
    Hive Support
    Amazon Elastic MapReduce (Amazon EMR) supports Apache Hive. Hive is an integrated data warehouse infrastructure built on top of Hadoop. It provides tools to simplify data summarization and provides ad hoc querying and analysis of large datasets stored in Hadoop files. Hive provides a simple query language called Hive QL, which is based on SQL. For more information on the supported versions of Hive, see Hive Configuration (p. 348).
    Resizeable Running Job Flows
    The ability to resize a running job flow lets you increase or decrease the number of nodes in a running cluster. Core nodes contain the Hadoop Distributed File System (HDFS). After a job flow is running, you can increase the number of core nodes. Task nodes also run Hadoop tasks, but do not contain HDFS. After a job flow is running, you can also increase and decrease the number of task nodes. For more information, see Resizing Running Job Flows (p. 96).
    Secure Data
    Amazon EMR provides an authentication mechanism to ensure that data stored in Amazon S3 is secured against unauthorized access. By default, only the AWS Account owner can access the data uploaded to Amazon S3. Other users can access the data only if you explicitly edit security permissions. You can send data to Amazon S3 using the secure HTTPS protocol. Amazon EMR always uses a secure channel to send data between Amazon S3 and Amazon EC2. For added security, you can encrypt your data before uploading it to Amazon S3. For more information on AWS security, go to the AWS Security Center.
    Supports Hadoop Methods
    Amazon EMR supports job flows based on streaming, Hive, Pig, Custom JAR, and Cascading. Streaming enables you to write application logic in any language and to process large amounts of data using the Hadoop framework. Hive and Pig offer nonprogramming options with their SQL-like scripting languages. Custom JAR files enable you to write Java-based MapReduce functions. Cascading is an API with built-in MapReduce support that lets you create complex distributed processes. For more information, see Using Amazon EMR (p. 15).
    Multiple Sequential Steps
    Amazon EMR supports job flows with multiple, sequential steps, including the ability to add steps while a job flow runs. Individual steps can combine to create more sophisticated job flows. Additionally, you can incrementally add steps to a running job flow to help with debugging. For more information, see Add Steps to a Job Flow (p. 79).
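    Because steps can be added while a job flow runs, a common pattern is to keep a job flow alive in the WAITING state and submit additional steps as new data arrives. A minimal sketch of that pattern, again assuming the boto 2.x Python library and hypothetical S3 paths and job flow ID:

        # Add one more streaming step to a job flow that is already running (boto 2.x).
        # The job flow ID and S3 paths are placeholders.
        from boto.emr.connection import EmrConnection
        from boto.emr.step import StreamingStep

        conn = EmrConnection()
        jobflow_id = 'j-EXAMPLE1234567'  # returned when the job flow was created

        next_step = StreamingStep(
            name='Word count, day 2',
            mapper='s3n://my-emr-example-bucket/wordcount/scripts/mapper.py',
            reducer='aggregate',
            input='s3n://my-emr-example-bucket/wordcount/input-day2',
            output='s3n://my-emr-example-bucket/wordcount/output-day2')

        # The new step is appended to the queue and runs after the existing steps.
        conn.add_jobflow_steps(jobflow_id, [next_step])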
  • 12. Amazon EMR Concepts
    Topics
    • Job Flows and Steps (p. 6)
    • Hadoop and MapReduce (p. 7)
    This section describes the concepts and terminology you need to understand and use Amazon Elastic MapReduce (Amazon EMR).
    Job Flows and Steps
    A job flow is the series of instructions Amazon Elastic MapReduce (Amazon EMR) uses to process data. A job flow contains any number of user-defined steps. A step is any instruction that manipulates the data. Steps are executed in the order in which they are defined in the job flow. You can track the progress of a job flow by checking its state. The following diagram shows the life cycle of a job flow and how each part of the job flow process maps to a particular job flow state.
    A successful Amazon Elastic MapReduce (Amazon EMR) job flow follows this process: Amazon EMR first provisions a Hadoop cluster. During this phase, the job flow state is STARTING. Next, any user-defined bootstrap actions are run. During this phase, the job flow state is BOOTSTRAPPING. After all bootstrap actions are completed, the job flow state is RUNNING. The job flow sequentially runs all job flow steps during this phase. After all steps run, the job flow state transitions to SHUTTING_DOWN and the job flow shuts down the cluster. All data stored on a cluster node is deleted. Information stored elsewhere, such as in your Amazon S3 bucket, persists. Finally, when all job flow activity is complete, the job flow state is marked as COMPLETED.
    You can configure a job flow to go into a WAITING state once it completes processing of all steps. A job flow in the WAITING state continues running, waiting for you to add steps or manually terminate it. When you manually terminate a job flow, the Hadoop cluster shuts down and the job flow state is SHUTTING_DOWN. When the job flow activity is complete, the final job flow state is TERMINATED. Creating a WAITING job flow is useful when troubleshooting. For more information on troubleshooting, see Debug Job Flows with Steps (p. 206).
    Any failure during the job flow process terminates the job flow and shuts down all cluster nodes. Any data stored on a cluster node is deleted. The job flow state is marked as FAILED.
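    The life cycle above can be observed programmatically by describing the job flow until it reaches a terminal state. A minimal sketch assuming the boto 2.x Python library; the job flow ID is a placeholder and attribute names may vary between releases:

        # Poll a job flow until it reaches a terminal state (boto 2.x).
        # 'j-EXAMPLE1234567' is a placeholder job flow ID.
        import time
        from boto.emr.connection import EmrConnection

        TERMINAL_STATES = ('COMPLETED', 'TERMINATED', 'FAILED')

        conn = EmrConnection()
        jobflow_id = 'j-EXAMPLE1234567'

        while True:
            jobflow = conn.describe_jobflow(jobflow_id)
            print('Job flow %s is %s' % (jobflow_id, jobflow.state))
            if jobflow.state in TERMINAL_STATES:
                break
            time.sleep(30)  # STARTING and BOOTSTRAPPING can each last several minutes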
  • 13. For a complete list of job flow states, see the JobFlowExecutionStatusDetail data type in the Amazon Elastic MapReduce (Amazon EMR) API Reference.
    You can also track the progress of job flow steps by checking their state. The following diagram shows the processing of job flow steps and how each step maps to a particular state. A job flow contains one or more steps. Steps are processed in the order in which they are listed in the job flow. Steps are run following this sequence: all steps have their state set to PENDING. The first step is run and the step's state is set to RUNNING. When the step is completed, the step's state changes to COMPLETED. The next step in the queue is run, and the step's state is set to RUNNING. After each step completes, the step's state is set to COMPLETED and the next step in the queue is run. Steps are run until there are no more steps. Processing flow returns to the job flow.
    If a step fails, the step state is FAILED and all remaining steps with a PENDING state are marked as CANCELLED. No further steps are run, and processing returns to the job flow.
    Data is normally communicated from one step to the next using files stored on the cluster's Hadoop Distributed File System (HDFS). Data stored on HDFS exists only as long as the cluster is running. When the cluster is shut down, all data is deleted. The final step in a job flow typically stores the processing results in an Amazon S3 bucket.
    For a complete list of step states, see the StepExecutionStatusDetail data type in the Amazon Elastic MapReduce (Amazon EMR) API Reference.
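    Because HDFS contents disappear when the cluster shuts down, the last step usually writes its results to Amazon S3, where they can be retrieved after the job flow ends. A minimal sketch of that retrieval, assuming the boto 2.x Python library and the hypothetical bucket and output prefix used earlier:

        # Download the part-* files the final step wrote to Amazon S3 (boto 2.x).
        # The bucket name and output prefix are placeholders.
        import os
        import boto

        conn = boto.connect_s3()
        bucket = conn.get_bucket('my-emr-example-bucket')

        if not os.path.isdir('results'):
            os.makedirs('results')

        # Hadoop writes one part-* file per reducer under the output prefix.
        for key in bucket.list(prefix='wordcount/output/'):
            name = os.path.basename(key.name)
            if not name or name == '_SUCCESS':
                continue  # skip prefix placeholders and Hadoop's success marker
            key.get_contents_to_filename(os.path.join('results', name))
            print('Downloaded ' + key.name)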
  • 14. Hadoop and MapReduce
    Topics
    • What Is Hadoop? (p. 8)
    • What Is MapReduce? (p. 8)
    • Instance Groups (p. 9)
    • Supported Hadoop Versions (p. 10)
    • Supported File Systems (p. 10)
    This section explains the roles of Apache Hadoop and MapReduce in Amazon Elastic MapReduce (Amazon EMR) and how these two methodologies work together to process data.
    What Is Hadoop?
    Apache Hadoop is an open-source Java software framework that supports massive data processing across a cluster of servers. Hadoop uses a programming model called MapReduce that divides a large data set into many small fragments. Hadoop distributes a data fragment and a copy of the MapReduce executable to each of the slave nodes in a Hadoop cluster. Each slave node runs the MapReduce executable on its subset of the data. Hadoop then combines the results from all of the nodes into a finished output. Amazon EMR enables you to upload that output into an Amazon S3 bucket you designate. For more information about Hadoop, go to http://hadoop.apache.org.
    What Is MapReduce?
    MapReduce is a combination of mapper and reducer executables that work together to process data. The mapper executable processes the raw data into key/value pairs, called intermediate results. The reducer executable combines the intermediate results, applies additional algorithms, and produces the final output, as described in the following process.
    MapReduce Process
    1. Amazon Elastic MapReduce (Amazon EMR) starts your instances in two security groups: one for the master node and another for the core and task nodes.
    2. Hadoop breaks a data set into multiple sets if the data set is too large to process quickly on a single cluster node.
    3. Hadoop distributes the data files and the MapReduce executable to the core and task nodes of the cluster. Hadoop handles machine failures and manages network communication between the master, core, and task nodes. In this way, developers do not need to know how to perform distributed programming or handle the details of data redundancy and failover.
    4. The mapper function uses an algorithm that you supply to parse the data into key/value pairs. These key/value pairs are passed to the reducer. As an example, for a job flow that counts the number of times a word appears in a document, the mapper might take each word in a document and assign it a value of 1. Each word is a key in this case, and all values are 1.
    5. The reducer function collects the results from all of the mapper functions in the cluster, eliminates redundant keys by combining values of all like keys, then performs the designated operation on all the values for each key, and then outputs the results. Continuing with the previous example, the reducer takes all of the word counts from all of the mapper functions running in a cluster, adds up the number of times each word was found, and then outputs that result to Amazon S3.
    You can write the executables in any programming language. Mapper and reducer applications written in Java are compiled into a JAR file. Executables written in other programming languages use the Hadoop streaming utility to implement the mapper and reducer algorithms.
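    To make the word count example concrete, here is what the two streaming executables could look like in Python. A streaming mapper reads raw lines from standard input and writes tab-separated key/value pairs to standard output; the reducer receives those pairs sorted by key and sums the counts. This is an illustrative sketch, not a sample shipped with Amazon EMR.

        #!/usr/bin/env python
        # wordcount_streaming.py: usable as both halves of a Hadoop streaming job.
        # Run as "wordcount_streaming.py map" for the mapper and
        # "wordcount_streaming.py reduce" for the reducer. Illustrative sketch only.
        import sys

        def do_map():
            # Emit one "word<TAB>1" record per word.
            for line in sys.stdin:
                for word in line.strip().split():
                    sys.stdout.write('%s\t%d\n' % (word.lower(), 1))

        def do_reduce():
            # Hadoop delivers records sorted by key, so counts can be summed in one pass.
            current_word, count = None, 0
            for line in sys.stdin:
                word, value = line.rstrip('\n').split('\t', 1)
                if word != current_word:
                    if current_word is not None:
                        sys.stdout.write('%s\t%d\n' % (current_word, count))
                    current_word, count = word, 0
                count += int(value)
            if current_word is not None:
                sys.stdout.write('%s\t%d\n' % (current_word, count))

        if __name__ == '__main__':
            if len(sys.argv) > 1 and sys.argv[1] == 'map':
                do_map()
            else:
                do_reduce()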
  • 15. The mapper executable reads the input from standard input and the reducer outputs data through standard output. By default, each line of input/output represents a record and the first tab on each line of the output separates the key and value. For more information about MapReduce, go to How Map and Reduce operations are actually carried out (http://wiki.apache.org/hadoop/HadoopMapReduce).
    Instance Groups
    Amazon EMR runs a managed version of Apache Hadoop, handling the details of creating the cloud-server infrastructure to run the Hadoop cluster. Amazon EMR refers to this cluster as a job flow, and defines the concept of instance groups, which are collections of Amazon EC2 instances that perform roles analogous to the master and slave nodes of Hadoop. There are three types of instance groups: master, core, and task.
    Each Amazon EMR job flow includes one master instance group that contains one master node, a core instance group containing one or more core nodes, and an optional task instance group, which can contain any number of task nodes. If the job flow is run on a single node, then that instance is simultaneously a master and a core node. For job flows running on more than one node, one instance is the master node and the remaining are core or task nodes. For more information about instance groups, see Resizing Running Job Flows (p. 96).
    Master Instance Group
    The master instance group manages the job flow, coordinating the distribution of the MapReduce executable and subsets of the raw data to the core and task instance groups. It also tracks the status of each task performed and monitors the health of the instance groups. To monitor the progress of the job flow, you can SSH into the master node as the Hadoop user and either look at the Hadoop log files directly or access the user interface that Hadoop publishes to the web server running on the master node. For more information, see View Logs Using SSH (p. 197). As the job flow progresses, each core and task node processes its data, transfers the data back to Amazon S3, and provides status metadata to the master node.
    Note: The instance controller on the master node uses MySQL. If MySQL becomes unavailable, the instance controller will be unable to launch and manage instances.
    Core Instance Group
    The core instance group contains all of the core nodes of a job flow. A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node. The EC2 instances you assign as core nodes are capacity that must be allotted for the entire job flow run. Because core nodes store data, you can't remove them from a job flow. However, you can add more core nodes to a running job flow. Core nodes run both the DataNode and TaskTracker Hadoop daemons.
    Caution: Removing HDFS from a running node runs the risk of losing data.
    For more information about core instance groups, see Resizing Running Job Flows (p. 96).
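    The master, core, and task roles map directly onto the job flow creation API, which accepts one definition per instance group. A minimal sketch assuming the boto 2.x Python library; the instance counts, types, and names are arbitrary examples and the constructor signature may differ between boto releases.

        # Define the master, core, and task instance groups for a new job flow (boto 2.x).
        # Counts, instance types, and names are arbitrary examples.
        from boto.emr.connection import EmrConnection
        from boto.emr.instance_group import InstanceGroup

        instance_groups = [
            # Exactly one master node coordinates the job flow.
            InstanceGroup(1, 'MASTER', 'm1.small', 'ON_DEMAND', 'Master group'),
            # Core nodes run tasks and hold HDFS; they can be added but not removed.
            InstanceGroup(2, 'CORE', 'm1.small', 'ON_DEMAND', 'Core group'),
            # Task nodes run tasks only, so their count can grow and shrink freely.
            InstanceGroup(2, 'TASK', 'm1.small', 'ON_DEMAND', 'Task group'),
        ]

        conn = EmrConnection()
        jobflow_id = conn.run_jobflow(
            name='Job flow with explicit instance groups',
            log_uri='s3://my-emr-example-bucket/logs',
            instance_groups=instance_groups,
            keep_alive=True)  # stay in the WAITING state so steps can be added later
        print('Started job flow ' + jobflow_id)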
  • 16. Task Instance Group
    The task instance group contains all of the task nodes in a job flow. The task instance group is optional. You can add it when you start the job flow or add a task instance group to a job flow in progress. Task nodes are managed by the master node. While a job flow is running, you can increase and decrease the number of task nodes. Because they don't store data and can be added and removed from a job flow, you can use task nodes to manage the EC2 instance capacity your job flow uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon. For more information about task instance groups, see Resizing Running Job Flows (p. 96).
    Supported Hadoop Versions
    Amazon Elastic MapReduce (Amazon EMR) allows you to choose to run either Hadoop version 0.18, Hadoop version 0.20, or Hadoop version 0.20.205. For more information on Hadoop configuration, see Hadoop Configuration (p. 299).
    Supported File Systems
    Amazon EMR and Hadoop typically use two or more of the following file systems when processing a job flow:
    • Hadoop Distributed File System (HDFS)
    • Amazon S3 Native File System (S3N)
    • Local file system
    • Legacy Amazon S3 Block File System
    HDFS and S3N are the two main file systems used with Amazon EMR. HDFS is a distributed, scalable, and portable file system for Hadoop. An advantage of HDFS is data awareness between the Hadoop cluster nodes managing the job flows and the Hadoop cluster nodes managing the individual steps. For more information on how HDFS works, see http://hadoop.apache.org/docs/hdfs/current/hdfs_user_guide.html.
    The Amazon S3 Native File System (S3N) is a file system for reading and writing regular files on Amazon S3. The advantage of this file system is that you can access files on Amazon S3 that were written with other tools. For information on how Amazon S3 and Hadoop work together, see http://wiki.apache.org/hadoop/AmazonS3.
    The local file system refers to a locally connected disk. When a Hadoop cluster is created, each node is created from an Amazon EC2 instance which comes with a preconfigured block of preattached disk storage called an Amazon EC2 local instance store. Data on instance store volumes persists only during the life of the associated Amazon EC2 instance. The amount of this disk storage varies by Amazon EC2 instance type. It is ideal for temporary storage of information that is continually changing, such as buffers, caches, scratch data, and other temporary content. For more information about Amazon EC2 instances, see Amazon Elastic Compute Cloud.
    The Amazon S3 Block File System is a legacy file storage system. We strongly discourage the use of this system. For more information on how to use and configure file systems in Amazon EMR, see File System Configuration (p. 338).
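    The choice of file system shows up in the URIs a step uses: s3n:// paths read and write the Amazon S3 Native File System, while hdfs:// paths stay on the cluster and disappear when it shuts down. The sketch below chains two streaming steps through HDFS and persists only the final output to Amazon S3; it assumes the boto 2.x Python library, and the hadoop_version argument and all paths are illustrative assumptions.

        # Two-step job flow: step 1 keeps intermediate output on HDFS,
        # step 2 reads it from HDFS and persists the final result to Amazon S3 (boto 2.x).
        # The paths and the Hadoop version shown are illustrative assumptions.
        from boto.emr.connection import EmrConnection
        from boto.emr.step import StreamingStep

        extract = StreamingStep(
            name='Extract',
            mapper='s3n://my-emr-example-bucket/scripts/extract.py',
            reducer='aggregate',
            input='s3n://my-emr-example-bucket/rawlogs',       # read from S3 (S3N)
            output='hdfs:///intermediate/extracted')           # kept on the cluster only

        summarize = StreamingStep(
            name='Summarize',
            mapper='s3n://my-emr-example-bucket/scripts/summarize.py',
            reducer='aggregate',
            input='hdfs:///intermediate/extracted',              # read the HDFS output
            output='s3n://my-emr-example-bucket/reports/daily')  # persist to S3

        conn = EmrConnection()
        conn.run_jobflow(
            name='Two-step job flow',
            log_uri='s3://my-emr-example-bucket/logs',
            hadoop_version='0.20.205',  # one of the versions listed above
            steps=[extract, summarize])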
Associated AWS Product Concepts

Topics
• Amazon EC2 Concepts (p. 11)
• Amazon S3 Concepts (p. 14)
• AWS Identity and Access Management (IAM) (p. 14)
• Regions (p. 14)
• Data Storage (p. 14)

This section describes AWS concepts and terminology you need to understand to use Amazon Elastic MapReduce (Amazon EMR) effectively.

Amazon EC2 Concepts

Topics
• Amazon EC2 Instances (p. 11)
• Reserved Instances (p. 13)
• Elastic IP Address (p. 13)
• Amazon EC2 Key Pairs (p. 13)

The following sections describe Amazon EC2 features used by Amazon EMR.

Amazon EC2 Instances

Amazon EMR enables you to choose the number and kind of Amazon EC2 instances that comprise the cluster that processes your job flow. Amazon EC2 offers several basic types:

• Standard: You can use Amazon EC2 standard instances for most applications.
• High-CPU: These instances have proportionally more CPU resources than memory (RAM) for compute-intensive applications.
• High-Memory: These instances offer large memory sizes for high throughput applications, including database and memory caching applications.
• Cluster Compute: These instances provide proportionally high CPU resources with increased network performance. They are well suited for demanding network-bound applications.
• High Storage: These instances provide proportionally high storage resources. They are well suited for data warehouse applications.

Note
Amazon EMR does not support micro instances at this time.

The following table describes all the instance types that Amazon EMR supports.

Name | RAM (GiB) | Compute Units | Disk Drive (GiB) | Platform (bits) | I/O Performance | Instance Type
Small (default) | 1.7 | 1 | 150 | 32 | Moderate | m1.small
Large | 7.5 | 4 | 840 | 64 | High | m1.large
Extra Large | 15 | 8 | 1680 | 64 | High | m1.xlarge
High-CPU Medium | 1.7 | 5 | 340 | 32 | Moderate | c1.medium
High-CPU Extra Large | 7 | 20 | 1680 | 64 | High | c1.xlarge
High-Memory Extra Large | 17.1 | 6.5 | 420 | 64 | Moderate | m2.xlarge
High-Memory Double Extra Large | 34.2 | 13 | 850 | 64 | Moderate | m2.2xlarge
High-Memory Quadruple Extra Large | 68.4 | 26 | 1680 | 64 | High | m2.4xlarge
Cluster Compute Quadruple Extra Large* | 23 | 33.5 | 1690 | 64 | Very High (10 Gigabit Ethernet) | cc1.4xlarge
Cluster Compute Eight Extra Large** | 60.5 | 88 | 3360 | 64 | Very High (10 Gigabit Ethernet) | cc2.8xlarge
High Storage* | 117 | 35 | 49152 | 64 | Very High (10 Gigabit Ethernet) | hs1.8xlarge
Cluster GPU*** | 23 | 33.5 | 1680 | 64 | Very High (10 Gigabit Ethernet) | cg1.4xlarge

*Cluster Compute Quadruple Extra Large instances and High Storage instances are supported only in the US East (Northern Virginia) Region.
**Cluster Compute Eight Extra Large instances are only supported in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions.
***Cluster GPU instances have 22 GB, with 1 GB reserved for GPU operation.

The practical limit of the amount of data you can process depends on the number and type of Amazon EC2 instances selected as your cluster nodes, and on the size of your intermediate and final data. This is because the input, intermediate, and output data sets reside on the cluster nodes while your job flow runs. For example, the maximum amount of data that you can process on a 20-node cluster is 34 TB (20 Extra Large instances x 1.69 TB of hard disk per Amazon EC2 instance = 34 TB). A short sketch of this calculation appears after the related topics below.

The default maximum number of Amazon EC2 instances you can specify is 20. If you need more instances, you can make a formal request. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

Related Topics
• Request additional Amazon EC2 instances
• Amazon EC2 Instance Types
• High Performance Computing (HPC)
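The cluster-capacity estimate above is simply the node count multiplied by the local disk per instance. The sketch below restates it as a small calculation; the figures are the ones quoted in this guide and the function is illustrative only.

# Illustrative only: estimate how much data a cluster can hold,
# using the guide's rule of thumb (node count x local disk per instance).
def cluster_capacity_tb(num_nodes, disk_per_node_tb):
    return num_nodes * disk_per_node_tb

# 20 Extra Large (m1.xlarge) nodes at roughly 1.69 TB of local disk each
print(cluster_capacity_tb(20, 1.69))   # -> 33.8, about 34 TB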
  • 19. Amazon Elastic MapReduce Developer Guide Amazon EC2 Concepts Reserved Instances Reserved Instances provide guaranteed capacity and are an additional Amazon EC2 pricing option.You make a one-time payment for an instance to reserve capacity and reduce hourly usage charges. Reserved Instances complement existing Amazon EC2 On-Demand Instances and provide an option to reduce computing costs. As with On-Demand Instances, you pay only for the compute capacity that you actually consume, and if you don't use an instance, you don't pay usage charges for it. To use a Reserved Instance with Amazon EMR, launch your job flow in the same Availability Zone as your Reserved Instance. For example, let's say you purchase one m1.small Reserved Instance in US-East. If you launch a job flow that uses two m1.small instances in the same Availability Zone in Region US-East, one instance is billed at the Reserved Instance rate and the other is billed at the On-Demand rate. If you have a sufficient number of available Reserved Instances for the total number of instances you want to launch, you are guaranteed capacity.Your Reserved Instances are used before any On-Demand Instances are created. You can use Reserved Instances by using either the Amazon EMR console, the command line interface (CLI), Amazon EMR API actions, or the AWS SDKs. Related Topics • Amazon EC2 Reserved Instances Elastic IP Address Elastic IP addresses are static IP addresses designed for dynamic cloud computing. An Elastic IP address is associated with your account, not a particular instance.You control the addresses associated with your account until you choose to explicitly release them. You can associate one Elastic IP address with only one job flow at a time. To ensure our customers are efficiently using Elastic IP addresses, we impose a small hourly charge when IP addresses associated with your account are not mapped to a job flow or Amazon EC2 instance. When Elastic IP addresses are mapped to an instance, they are free of charge. For more information about enabling Elastic IP addresses with Amazon EMR, see Using Elastic IP Addresses (p. 287). For more information about using IP addresses in AWS, go to the Using Elastic IP Addresses section in the Amazon Elastic Compute Cloud User Guide. Amazon EC2 Key Pairs When Amazon EMR starts an Amazon EC2 instance, it uses a 2048-bit RSA key pair that you have named. Amazon EC2 stores the public key. Amazon EMR stores the private key and uses the private key to validate all requests. The key pair ensures that only you can access your job flows. When you launch an instance using your key pair name, the public key becomes part of the instance metadata. This allows you to access the cluster node securely. Although specifying the key pair is optional, we strongly recommend that you use key pairs. This key pair becomes associated with all of the nodes created to process your job flow. The key pair name creates a handle you can use to access the master node in the Hadoop cluster. With the key pair name, you can log in to the master node without using a password, enabling you to monitor the progress of your job flows. On the master node, you can retrieve detailed job flow processing status and statistics. For more information on how to create and use an Amazon EC2 key pair with Amazon EMR, see "Creating an Amazon EC2 Key Pair" in the Getting Started Guide. API Version 2009-11-30 13
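Once a job flow is launched with a key pair, the PEM file can be used to open an SSH session to the master node as the hadoop user, for example to monitor progress or inspect logs. The sketch below uses the third-party paramiko library as one way to do this from Python; paramiko is not covered in this guide, the host and file names are placeholders, and the exact calls should be checked against the paramiko version you have installed.

import paramiko

# Placeholders: substitute your master node's public DNS name and your PEM file.
MASTER_DNS = "ec2-xx-xx-xx-xx.compute-1.amazonaws.com"
KEY_FILE = "mykeypair.pem"

key = paramiko.RSAKey.from_private_key_file(KEY_FILE)
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(MASTER_DNS, username="hadoop", pkey=key)

# Run a command on the master node; "hadoop job -list" shows running Hadoop jobs.
stdin, stdout, stderr = client.exec_command("hadoop job -list")
print(stdout.read())
client.close()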
Amazon S3 Concepts

Topics
• Buckets (p. 14)
• Multipart Upload (p. 14)

The following sections describe Amazon S3 features used by Amazon EMR.

Buckets

Amazon EMR requires Amazon S3 buckets to hold the input and output data of your Hadoop processing. Amazon EMR uses the Amazon S3 Native File System for Hadoop processing. Amazon S3 uses the hostname method for accessing data, which places restrictions on the bucket names used in Amazon EMR job flows. For more information on creating Amazon S3 buckets for use with Amazon EMR, see Setting Up Your Environment to Run a Job Flow (p. 17). For more information on Amazon S3 buckets, go to Working with Amazon S3 Buckets in the Amazon S3 Developer Guide.

Multipart Upload

Amazon Elastic MapReduce (Amazon EMR) supports Amazon S3 multipart upload through the AWS SDK for Java. Multipart upload lets you upload a single object as a set of parts. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles the parts and creates the object (a short sketch of this flow appears at the end of this section). For more information about enabling multipart uploads with Amazon EMR, see Multipart Upload (p. 343). For more information on Amazon S3 multipart uploads, go to Uploading Objects Using Multipart Upload in the Amazon S3 Developer Guide.

AWS Identity and Access Management (IAM)

Amazon Elastic MapReduce (Amazon EMR) supports AWS Identity and Access Management (IAM) policies. IAM is a web service that enables AWS customers to manage users and user permissions. For more information about enabling IAM policies with Amazon EMR, see Configure User Permissions with IAM (p. 274). For more information on IAM, go to Using IAM in the Using AWS Identity and Access Management guide.

Regions

You can choose the geographical region where Amazon EC2 creates the cluster to process your data. You might choose a region to optimize latency, minimize costs, or address regulatory requirements. Setting a region-specific endpoint guarantees where your data resides. For the list of regions and endpoints supported by Amazon EMR, go to Regions and Endpoints in the Amazon Web Services General Reference.

Data Storage

Amazon EMR uses the Amazon S3 and Amazon SimpleDB data storage systems when processing a job flow. For more information about using Amazon S3 with Hadoop, go to http://wiki.apache.org/hadoop/AmazonS3. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.
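The multipart upload flow described above (upload parts independently, then let Amazon S3 assemble them) can also be driven from a script. The sketch below assumes the boto 2.x Python library rather than the AWS SDK for Java mentioned in this guide; the bucket and file names are placeholders, and the method names should be verified against your boto version.

import math
import os
import boto

# Placeholders: substitute your own bucket and local file.
BUCKET = "myawsbucket"
LOCAL_FILE = "large-input.gz"
PART_SIZE = 50 * 1024 * 1024  # 50 MB parts

conn = boto.connect_s3()
bucket = conn.get_bucket(BUCKET)
mp = bucket.initiate_multipart_upload(os.path.basename(LOCAL_FILE))

file_size = os.path.getsize(LOCAL_FILE)
part_count = int(math.ceil(file_size / float(PART_SIZE)))
with open(LOCAL_FILE, "rb") as fp:
    for i in range(part_count):
        # Each part can be retried individually if transmission fails.
        part_size = min(PART_SIZE, file_size - i * PART_SIZE)
        mp.upload_part_from_file(fp, part_num=i + 1, size=part_size)

mp.complete_upload()  # Amazon S3 assembles the parts into one object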
  • 21. Amazon Elastic MapReduce Developer Guide Using Amazon EMR Topics • Setting Up Your Environment to Run a Job Flow (p. 17) • Create a Job Flow (p. 23) • View Job Flow Details (p. 72) • Terminate a Job Flow (p. 77) • Customize a Job Flow (p. 79) • Connect to the Master Node in an Amazon EMR Job Flow (p. 110) • Use Cases (p. 122) • Building Binaries Using Amazon EMR (p. 131) • Using Tagging (p. 136) • Protect a Job Flow from Termination (p. 136) • Lower Costs with Spot Instances (p. 141) • Store Data with HBase (p. 155) • Troubleshooting (p. 183) • Monitor Metrics with Amazon CloudWatch (p. 209) • Monitor Performance with Ganglia (p. 220) • Distributed Copy Using S3DistCp (p. 227) • Export, Import, Query, and Join Tables in Amazon DynamoDB Using Amazon EMR (p. 234) • Use Third Party Applications With Amazon EMR (p. 258) This section covers the fundamentals of creating, managing, and troubleshooting a job flow using Amazon Elastic MapReduce (Amazon EMR). All supported job flow types are described. Information on using the Amazon EMR console, the CLI, SDKs, and API is included. If you have not signed up to use Amazon EMR, instructions are provided in the Getting Started Guide. Tip We strongly recommend that you work through the examples in the Getting Started Guide to get a basic understanding of Amazon EMR. Amazon EMR offers a variety of interfaces, including a console, a command line interface (CLI), a query API, AWS SDKs, and libraries. Each interface offers a different balance of ease and functionality. The interface you choose depends on your knowledge of Hadoop, your programming skills, and the functionality you require: API Version 2009-11-30 15
  • 22. Amazon Elastic MapReduce Developer Guide • The Amazon EMR console provides a graphical interface from which you can launch Amazon EMR job flows and monitor their progress. • The CLI combines full compatibility with the Amazon EMR API without requiring a programming environment. The Ruby-based Amazon EMR CLI is available for download at Amazon Elastic MapReduce Ruby Client (http://aws.amazon.com/developertools/2264.) • The Amazon EMR API, SDKs, and libraries offer the most flexibility but require a programming environment and software development skills. For more information on using the query API to access Amazon EMR see Write Amazon EMR Applications (p. 263) in this guide. The AWS SDKs provides support for Java, C#, and .NET. For more information on the AWS SDKs, refer to the list of current AWS SDKs. Libraries are available for Perl and PHP. For more information about the Perl and PHP libraries see Sample Code & Libraries (http://aws.amazon.com/code/Elastic-MapReduce.) The following table compares the functionality of the Amazon EMR interfaces. API/SDK/ Libraries Amazon CLI EMR Console Function Create multiple job flows Define bootstrap actions in a job flow View logs for Hadoop jobs, tasks, and task attempts using a graphical interface Implement Hadoop data processing programmatically Monitor job flows in real time Provide verbose job flow details Resize running job flows Run job flows with multiple steps Select version of Hadoop, Hive, and Pig Specify the MapReduce executable in multiple computer languages Specify the number and type of Amazon Amazon EC2 instances that process the data Transfer data to and from Amazon S3 automatically Terminate job flows in real time The following sections describe how to use Amazon Elastic MapReduce (Amazon EMR) with each of the interface types. API Version 2009-11-30 16
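As one example of the programmatic interfaces mentioned above, the sketch below uses the boto 2.x library (a Python library for AWS; it is only one of the SDK and library options and is not documented in this guide) to list job flows and their current state. Treat the method and attribute names as assumptions to check against your installed boto version.

import boto.emr

# Assumes AWS credentials are available to boto, for example through
# environment variables or a boto configuration file.
conn = boto.emr.connect_to_region("us-east-1")

# List job flows and report their current state.
for flow in conn.describe_jobflows():
    print(flow.jobflowid, flow.name, flow.state)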
Setting Up Your Environment to Run a Job Flow

This section walks you through how to set up the required resources and permissions to run a job flow. The tasks that follow show you how to create the resources that your job flow uses to process data. Once created, you can reuse these resources for other job flows. Depending on your application, however, it may make operational sense to create new resources for each job flow.

The tasks that must be completed before you create a job flow are as follows:

1. Choose a Region (p. 17)
2. Create and Configure an Amazon S3 Bucket (p. 19)
3. Create an Amazon EC2 Key Pair and PEM File (p. 20)
4. Modify Your PEM File (p. 21)
5. For CLI and API users only, Get Security Credentials (p. 21)
6. For CLI users only, optionally Create a Credentials File (p. 22)

The following sections provide instructions on how to perform each of the tasks.

Choose a Region

AWS enables you to place resources in multiple locations. Locations are composed of Regions and Availability Zones within those Regions. Availability Zones are distinct geographical locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low-latency network connectivity to other Availability Zones in the same Region.

All Amazon EC2 instances, key pairs, security groups, and Amazon Elastic MapReduce (Amazon EMR) job flows must be located in the same Region. To optimize performance and reduce latency, all resources (such as Amazon S3 buckets) and job flows should be located in the same Availability Zone. For more information about Regions and Availability Zones, go to Using Regions and Availability Zones in the Amazon Elastic Compute Cloud User Guide.

Note
Not all AWS products offer the same support in all Regions. For example, Cluster Compute instances are available only in the US East (Northern Virginia) Region, and the Asia Pacific (Sydney) Region supports only Hadoop 1.0.3 and later. Confirm that you are working in the appropriate Region for the resources you want to use.

You must ensure that you use the same Region for each resource you create. Use the table below to identify the correct Region name.

If your Amazon EMR Region is... | The Amazon EMR CLI and API Region is... | The Amazon S3 Region is... | The Amazon EC2 Region is...
US East (Virginia) | us-east-1 | US Standard | US East (Virginia)
US West (Oregon) | us-west-2 | Oregon | US West (Oregon)
US West (N. California) | us-west-1 | Northern California | US West (N. California)
EU West (Ireland) | eu-west-1 | Ireland | EU West (Ireland)
Asia Pacific (Singapore) | ap-southeast-1 | Singapore | Asia Pacific (Singapore)
Asia Pacific (Sydney) | ap-southeast-2 | Sydney | Asia Pacific (Sydney)
Asia Pacific (Tokyo) | ap-northeast-1 | Tokyo | Asia Pacific (Tokyo)
South America (Sao Paulo) | sa-east-1 | Sao Paulo | South America (Sao Paulo)

Using the Amazon EMR Console to Specify a Region

To select a region in Amazon EMR
• From the Amazon EMR console, select the Region from the drop-down list.

Using the CLI to Specify a Region

Specify the Region with the --region parameter, as in the following example. If the --region parameter is not specified, the job flow is created in the us-east-1 region.

$ ./elastic-mapreduce --create --alive --stream --input myawsbucket --output myawsbucket --log-uri s3n://myawsbucket/logs --region eu-west-1

Tip
To reduce the number of parameters required each time you issue a command from the CLI, you can store information such as Region in your credentials.json file. For more information on creating a credentials.json file, go to Create a Credentials File (p. 22).
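If you create job flows from a script rather than from the CLI, the Region is typically chosen when the client connection is created (the next section shows the equivalent approach for the query API and AWS SDKs). The sketch below uses the boto 2.x Python library, which is not covered in this guide, so treat the call as an assumption to verify against your installed version.

import boto.emr

# Connect to the Amazon EMR endpoint for a specific Region
# (analogous to passing --region eu-west-1 on the CLI).
conn = boto.emr.connect_to_region("eu-west-1")
print(conn.host)  # the region-specific endpoint the client will use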
  • 25. Amazon Elastic MapReduce Developer Guide Create and Configure an Amazon S3 Bucket Using the API to Specify a Region To select a region, configure your application to use that Region's endpoint. If you are creating a client application using an AWS SDK, you can change the client endpoint by calling setEndpoint, as shown in the following example: client.setEndpoint(“eu-west-1.elasticmapreduce.amazonaws.com”); Once your application has specified a region by setting the endpoint, you can set the Availability Zone for your job flow's Amazon EC2 instances with a query request that contains a Instances.Placement.AvailabilityZone parameter, as in the following example. If you do not specify the Availability Zone for your job flow, Amazon EMR launches the job flow instances in the best Availability Zone in that region based on system health and available capacity. https://elasticmapreduce.amazonaws.com? Operation= ... Instances.Placement.AvailabilityZone=eu-west-1a& ... For more information about the parameters in an Amazon EMR request, see API Reference. Note For more information on specifying Regions from the CLI and API, see Available Region Endpoints for the AWS SDKs . Create and Configure an Amazon S3 Bucket Amazon Elastic MapReduce (Amazon EMR) uses Amazon S3 to store input data, log files, and output data. Amazon S3 refers to these storage locations as buckets.To conform with Amazon S3 requirements, DNS requirements, and restrictions in the supported data analysis tools, we recommend following the following guidelines for bucket names. All bucket names must: • Be between 3 and 63 characters long • Contain only lowercase letters, numbers, or periods (.) • Not contain a dash (-) or underscore (_) For additional details on valid bucket names, go to Bucket Restrictions and Limitations in the Amazon Simple Storage Service Developers Guide. This section shows you how to use the AWS Management Console to create and then set permissions for an Amazon S3 bucket. However, you can also create and set permissions for an Amazon S3 bucket using the Amazon S3 API or the third-party Curl command line tool. For information about Curl, go to Amazon S3 Authentication Tool for Curl. For information about using the Amazon S3 API to create and configure an Amazon S3 bucket, go to the Amazon Simple Storage Service API Reference. Using the AWS Management Console to Create an Amazon S3 Bucket To create an Amazon S3 bucket 1. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/. API Version 2009-11-30 19
  • 26. Amazon Elastic MapReduce Developer Guide Create an Amazon EC2 Key Pair and PEM File 2. Click Create Bucket. The Create a Bucket dialog box opens. 3. Enter a bucket name, such as mylog-uri. This name should be globally unique, and cannot be the same name used by another bucket. 4. Select the Region for your bucket. To avoid paying cross-region bandwidth charges, create the Amazon S3 bucket in the same region as your job flow. Refer to Choose a Region (p. 17) for guidance on choosing a Region. 5. Click Create. You created a bucket with the URI s3n://mylog-uri/. Note If you enable logging in the Create a Bucket wizard, it enables only bucket access logs, not Amazon EMR job flow logs. Note For more information on specifying Region-specific buckets, refer to Buckets and Regions in the Amazon Simple Storage Service Developer Guide and Available Region Endpoints for the AWS SDKs . After you create your bucket you can set the appropriate permissions on it. Typically, you give yourself (the owner) read and write access and authenticated users read access. Using the AWS Management Console to configure an Amazon S3 bucket To set permissions on an Amazon S3 bucket 1. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/. 2. In the Buckets pane, right-click the bucket you just created. 3. Select Properties. 4. In the Properties pane, select the Permissions tab. 5. Click Add more permissions. 6. Select Authenticated Users in the Grantee field. 7. To the right of the Grantee drop-down list, select List. 8. Click Save. You have created a bucket and restricted permissions to authenticated users. Create an Amazon EC2 Key Pair and PEM File Amazon EMR uses an Amazon Elastic Compute Cloud (Amazon EC2) key pair to ensure that you alone have access to the instances that you launch. The PEM file associated with this key pair is required to ssh directly to the master node of the cluster running your job flow. To create an Amazon EC2 key pair 1. Sign in to the AWS Management Console and open the Amazon EC2 console at https://console.aws.amazon.com/ec2/. 2. From the Amazon EC2 console, select a Region. 3. In the Navigation pane, click Key Pairs. API Version 2009-11-30 20
4. On the Key Pairs page, click Create Key Pair.
5. In the Create Key Pair dialog box, enter a name for your key pair, such as mykeypair.
6. Click Create.
7. Save the resulting PEM file in a safe location.

Your Amazon EC2 key pair and an associated PEM file are created.

Modify Your PEM File

Amazon Elastic MapReduce (Amazon EMR) enables you to work interactively with your job flow, allowing you to test job flow steps or troubleshoot your cluster environment. To log in directly to the master node of your running job flow, you can use ssh or PuTTY. You use your PEM file to authenticate to the master node. The PEM file requires a modification based on the tool that supports your operating system: you use the CLI to connect on Linux or UNIX computers, and PuTTY to connect on Microsoft Windows computers. For more information on how to install the Amazon EMR CLI or how to install PuTTY, go to the Getting Started Guide.

To modify your PEM file

• Create a local permissions file, as appropriate for your operating system:

If you are using Linux or UNIX, do this:
Set the permissions on the PEM file for your Amazon EC2 key pair. For example, if you saved the file as mykeypair.pem, the command looks like the following:
$ chmod og-rwx mykeypair.pem

If you are using Microsoft Windows, do this:
a. Download PuTTYgen.exe to your computer from http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.
b. Launch PuTTYgen.
c. Click Load. Select the PEM file you created earlier.
d. Click Open.
e. Click OK on the PuTTYgen Notice telling you the key was successfully imported.
f. Click Save private key to save the key in the PPK format.
g. When PuTTYgen prompts you to save the key without a pass phrase, click Yes.
h. Enter a name for your PuTTY private key, such as mykeypair.ppk.
i. Click Save.
j. Exit the PuTTYgen application.

Your PEM file is now set up so that you can log in directly to the master node of your running job flow.

Get Security Credentials

AWS assigns you an Access Key ID and a Secret Access Key to identify you as the sender of your request. AWS uses these security credentials to help protect your data. You include your Access Key ID
  • 28. Amazon Elastic MapReduce Developer Guide Create a Credentials File in all AWS requests made through the CLI or API.The AWS Management Console provides these security credentials automatically. Note Your Secret Access Key is a shared secret between you and AWS. Keep this key secret. Amazon uses this key to bill you for the AWS services you use. Never include your key in your requests to AWS and never email your key to anyone, even if an inquiry appears to originate from AWS or Amazon.com. No one who legitimately represents Amazon will ever ask you for your Secret Access Key. To get your Access Key ID and Secret Access Key 1. Go to the AWS website. 2. Click My Account to display a list of options. 3. Click Security Credentials and log in to your AWS Account.Your Access Key ID is displayed in the Access Credentials section.Your Secret Access Key remains hidden as a further precaution. 4. To display your Secret Access Key, click Show in the Your Secret Access Key area, as shown in the following figure. You have your Access Key ID and a Secret Access Key to securely identify yourself to AWS.You need this information to create a credentials file, as described in the following section. Create a Credentials File You can use an Amazon EMR credentials file to simplify job flow creation and authentication of requests. The credentials file provides information required for many commands.The credentials file is a convenient place for you to store command parameters so you don't have to repeatedly enter the information. Your credentials are used to calculate the signature value for every request you make.The Amazon EMR CLI automatically looks for these credentials in the file credentials.json. you can edit the credentials.json file and include your AWS credentials. If you do not have a credentials.json file, you must include your credentials in every request you make. To create your credentials file 1. Create a file named credentials.json on your computer. 2. Add the following lines to your credentials file: API Version 2009-11-30 22
  • 29. { Amazon Elastic MapReduce Developer Guide Create a Job Flow "access-id": "AccessKeyID", "private-key": "PrivateKey", "key-pair": "KeyName", "key-pair-file": "location of key pair file", "region": "Region", "log-uri": "location of bucket on Amazon S3" } The access-id and private-key are the AWS Access Key ID and a Secret Access Key described in Get Security Credentials (p. 21). The key-pair and key-pair-file are the Amazon EC2 key pair and the path and name of PEM file you created in Create an Amazon EC2 Key Pair and PEM File (p. 20). The region is the Region you selected in Choose a Region (p. 17).The log-uri is the path to the bucket you created in Create and Configure an Amazon S3 Bucket (p.19) using the format s3n://BucketName/FolderName. Your credentials.json file is configured. Each of the preceding tasks guided you through the steps to set up the objects and permissions required for a job flow.You are now ready to create your job flow. Instructions on how to create a job flow are at Create a Job Flow (p. 23). Create a Job Flow Topics • Choose a Job Flow Type (p. 23) • Choose Job Flow Interface (p. 24) • Identify Data, Scripts, and Log File locations (p. 24) • How to Create a Streaming Job Flow (p. 24) • How to Create a Job Flow Using Hive (p. 32) • How to Create a Job Flow Using Pig (p. 40) • How to Create a Job Flow Using a Custom JAR (p. 48) • How to Create a Cascading Job Flow (p. 56) • Launch an HBase Cluster on Amazon EMR (p. 64) This section covers the basics of creating a job flow using Amazon Elastic MapReduce (Amazon EMR). You can create a job flow using the Amazon EMR console, downloading and installing the Command Line Interface (CLI), or creating a query request with the Query API. The interface-specific details for using either the Amazon EMR console, the CLI, or the API are covered in the following sections. For information about creating the objects and setting the permissions needed to create a job flow see Setting Up Your Environment to Run a Job Flow (p. 17). For information on the job flow process and how individual steps are processed see Job Flows and Steps (p. 6). Choose a Job Flow Type Choose one of the supported job flow types: your choice of job flow type depends on several factors, including the format of the data and your level of programming knowledge. For information on comparing the supported job flow types, see Appendix: Compare Job Flow Types (p. 389). API Version 2009-11-30 23
  • 30. Amazon Elastic MapReduce Developer Guide Choose Job Flow Interface Choose Job Flow Interface Choose the manner in which you want to create your job flow. The description of each job flow type in this section includes details on how to create a job flow using the Amazon EMR console, the CLI, or Query API. The Amazon EMR console provides a graphical interface to launch Elastic MapReduce job flows and monitor their progress. The CLI combines full compatibility with the Elastic MapReduce API without requiring a programming environment.The Elastic MapReduce API, AWS SDK, and libraries offer the most flexibility, but require a programming environment and software development skills. Identify Data, Scripts, and Log File locations You need to plan the job flow you want to run and specify where Amazon EMR finds the information. Typically, the MapReduce program or script is located in a bucket on Amazon S3.Your job flow input, output, and job flow logs are also typically located on Amazon S3. Required Amazon S3 buckets must exist before you can create a job flow.You must upload any required scripts or data referenced in the job flow to Amazon S3. The following table describes example data, scripts, and log file locations. Information Example Location on Amazon S3 script or program s3://myawsbucket/wordcount/wordSplitter.py log files s3://myawsbucket/wordcount/logs input data s3://myawsbucket/wordcount/input output data s3://myawsbucket/wordcount/output For information on how to upload objects to Amazon S3, go to Add an Object to Your Bucket in the Amazon Simple Storage Service Getting Started Guide. How to Create a Streaming Job Flow This section covers the basics of creating and launching a streaming job flow using Amazon Elastic MapReduce (Amazon EMR).You'll step through how to create a streaming job flow using either the Amazon EMR console, the CLI, or the Query API. Before you create your job flow you'll need to create objects and permissions; for more information see Setting Up Your Environment to Run a Job Flow (p. 17). A streaming job flow reads input from standard input and then runs a script or executable (called a mapper) against each input. The result from each of the inputs is saved locally, typically on a Hadoop Distributed File System (HDFS) partition. Once all the input is processed by the mapper, a second script or executable (called a reducer) processes the mapper results.The results from the reducer are sent to standard output. You can chain together a series of streaming job flows, where the output of one streaming job flow becomes the input of another job flow. The mapper and the reducer can each be referenced as a file or you can supply a Java class.You can implement the mapper and reducer in any of the supported languages, including Ruby, Perl, Python, PHP, or Bash. The example that follows is based on the Amazon EMR Word Count Example. This example shows how to use Hadoop streaming to count the number of times each word occurs within a text file. In this example, the input is located in the Amazon S3 bucket s3n://elasticmapreduce/samples/wordcount/input. The mapper is a Python script that counts the number of times a word occurs in each input string and is located at s3://elasticmapreduce/samples/wordcount/wordSplitter.py.The reducer references a standard Hadoop library package called aggregate. Aggregate provides a special Java class and a list API Version 2009-11-30 24
  • 31. Amazon Elastic MapReduce Developer Guide How to Create a Streaming Job Flow of simple aggregators that perform aggregations such as sum, max, and min over a sequence of values. The output is saved to an Amazon S3 bucket you created in Setting Up Your Environment to Run a Job Flow (p. 17). Amazon EMR Console This example describes how to use the Amazon EMR console to create a streaming job flow. To create a streaming job flow 1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/. 2. Click Create New Job Flow. 3. In the DEFINE JOB FLOW page, do the following: a. Enter a name in the Job Flow Name field. This name is optional, and does not need to be unique. b. Select which version of Hadoop to run on your cluster in Hadoop Version.You can choose to run the Amazon distribution of Hadoop or one of two MapR distributions. For more information about MapR distributions for Hadoop, see Launch a Job Flow on the MapR Distribution for Hadoop (p. 260). c. Select Run your own application. d. Select Streaming in the drop-down list. e. Click Continue. API Version 2009-11-30 25
  • 32. Amazon Elastic MapReduce Developer Guide How to Create a Streaming Job Flow 4. In the SPECIFY PARAMETERS page, enter values in the boxes using the following table as a guide, and then click Continue. Field Action Specify the URI where the input data resides in Amazon S3. The value must be in the form BucketName/path. Input Location* Specify the URI where you want the output stored in Amazon S3. The value must be in the form BucketName/path. Output Location* Specify either a class name that refers to a mapper class in Hadoop, or a path on Amazon S3 where the mapper executable, such as a Python program, resides. The path value must be in the form BucketName/path/MapperExecutable. Mapper* Specify either a class name that refers to a reducer class in Hadoop, or a path on Amazon S3 where the reducer executable, such as a Python program, resides. The path value must be in the form BucketName/path/ReducerExecutable. Amazon EMR supports the special aggregate keyword. For more information, go to the Aggregate library supplied by Hadoop. Reducer* Optionally, enter a list of arguments (space-separated strings) to pass to the Hadoop streaming utility. For example, you can specify additional files to load into the distributed cache. Extra Args * Required parameter API Version 2009-11-30 26
  • 33. Amazon Elastic MapReduce Developer Guide How to Create a Streaming Job Flow 5. In the CONFIGURE EC2 INSTANCES page, select the type and number of instances, using the following table as a guide, and then click Continue. Note Twenty is the default maximum number of nodes per AWS account. For example, if you have two job flows running, the total number of nodes running for both job flows must be 20 or less. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form. Field Action Specify the number of nodes to use in the Hadoop cluster. There is always one master node in each job flow.You can specify the number of core and tasks nodes. Instance Count Specify the Amazon EC2 instance types to use as master, core, and task nodes. Valid types are m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, hs1.8xlarge, and cg1.4xlarge. The cc2.8xlarge instance type is only supported in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions. The cc1.4xlarge and hs1.8xlarge instance types are only supported in the US East (Northern Virginia) Region. Instance Type Specify whether to run master, core, or task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (p. 141) Request Spot Instances * Required parameter API Version 2009-11-30 27
  • 34. Amazon Elastic MapReduce Developer Guide How to Create a Streaming Job Flow 6. In the ADVANCED OPTIONS page, set additional configuration options, using the following table as a guide, and then click Continue. Field Action Optionally, specify a key pair that you created previously. For more information, see Create an Amazon EC2 Key Pair and PEM File (p. 20). If you do not enter a value in this field, you cannot SSH into the master node. Amazon EC2 Key Pair Optionally, specify a VPC subnet identifier to launch the job flow in an Amazon VPC. For more information, see Running Job Flows on an Amazon VPC (p. 381). Amazon VPC Subnet Id Optionally, specify a path in Amazon S3 to store the Amazon EMR log files. The value must be in the form BucketName/path. If you do not supply a location, Amazon EMR does not log any files. Amazon S3 Log Path Select Yes to store Amazon Elastic MapReduce-generated log files.You must enable debugging at this level if you want to store the log files generated by Amazon EMR. If you select Yes, you must supply an Amazon S3 bucket name where Amazon Elastic MapReduce can upload your log files. For more information, see Troubleshooting (p. 183). Important You can enable debugging for a job flow only when you initially create the job flow. Enable Debugging Select Yes to cause the job flow to continue running when all processing is completed. Keep Alive API Version 2009-11-30 28
  • 35. Amazon Elastic MapReduce Developer Guide How to Create a Streaming Job Flow Field Action Select Yes to ensure the job flow is not shut down due to accident or error. For more information, see Protect a Job Flow from Termination (p. 136). Termination Protection Select Yes to make the job flow visible and accessible to all IAM users on the AWS account. For more information, see Configure User Permissions with IAM (p. 274). Visible To All IAM Users 7. In the BOOTSTRAP ACTIONS page, select Proceed with no Bootstrap Actions, and then click Continue. For more information about bootstrap actions, see Bootstrap Actions (p. 84). API Version 2009-11-30 29
  • 36. Amazon Elastic MapReduce Developer Guide How to Create a Streaming Job Flow 8. In the REVIEW page, review the information, edit as necessary to correct any of the values, and then click Create Job Flow when the information is correct. After you click Create Job Flow your request is processed; when it succeeds, a message appears. API Version 2009-11-30 30
9. Click Close.

The Amazon EMR console shows the new job flow starting. Starting a new job flow may take several minutes, depending on the number and type of EC2 instances Amazon EMR is launching and configuring. Click the Refresh button for the latest view of the job flow's progress.

CLI

This example describes how to use the CLI to create a streaming job flow. Replace myawsbucket with the name of your Amazon S3 bucket.

To create a job flow

• Enter the command for your operating system:

If you are using Linux or UNIX:
$ ./elastic-mapreduce --create --stream --input s3n://elasticmapreduce/samples/wordcount/input --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --reducer aggregate --output s3n://myawsbucket

If you are using Microsoft Windows:
c:\> ruby elastic-mapreduce --create --stream --input s3n://elasticmapreduce/samples/wordcount/input --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --reducer aggregate --output s3n://myawsbucket

The output looks similar to the following.

Created jobflow JobFlowID
  • 38. Amazon Elastic MapReduce Developer Guide How to Create a Job Flow Using Hive By default, this command launches a job flow to run on a single-node cluster using an Amazon EC2 m1.small instance. Later, when your steps are running correctly on a small set of sample data, you can launch job flows to run on multiple nodes.You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters, respectively. API This section describes the Amazon EMR API Query request parameters you need to create a streaming job flow. The response includes a <JobFlowID>, which you use in other Amazon EMR operations, such as when describing or terminating a job flow. For this reason, it is important to store job flow IDs. The Args argument contains location information for your input data, output data, mapper, reducer, and cache file, as shown in the following example. "Name": "streaming job flow", "HadoopJarStep": { "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar", "Args": [ "-input", "s3n://elasticmapreduce/samples/wordcount/input", "-output", "s3n://myawsbucket", "-mapper", "s3://elasticmapreduce/samples/wordcount/wordSplit ter.py", "-reducer", "aggregate" ] } Note All paths are prefixed with their location. The prefix “s3://” refers to the s3n file system. If you use HDFS, prepend the path with hdfs:///. Make sure to use three slashes (///), as in hdfs:///home/hadoop/sampleInput2/. How to Create a Job Flow Using Hive This section covers the basics of creating a job flow using Hive in Amazon Elastic MapReduce (Amazon EMR).You'll step through how to create a job flow using Hive with either the Amazon EMR console, the CLI, or the Query API. Before you create your job flow you'll need to create objects and permissions; for more information see Setting Up Your Environment to Run a Job Flow (p. 17). For advanced information on Hive configuration options, see Hive Configuration (p. 348). A job flow using Hive enables you to create a data analysis application using a SQL-like language. The example that follows is based on the Amazon EMR sample: Contextual Advertising using Apache Hive and Amazon EMR with High Performance Computing instances. This sample describes how to correlate customer click data to specific advertisements. In this example, the Hive script is located in an Amazon S3 bucket at s3n://elasticmapreduce/samples/hive-ads/libs/model-build. All of the data processing instructions are located in the Hive script. The script requires additional libraries that are located in an Amazon S3 bucket at s3n://elasticmapreduce/samples/hive-ads/libs.The input data is located in the Amazon S3 bucket s3n://elasticmapreduce/samples/hive-ads/tables. The output is saved to an Amazon S3 bucket you created as part of Setting Up Your Environment to Run a Job Flow (p. 17). API Version 2009-11-30 32
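The same word count streaming job flow can also be launched from a script instead of the CLI or a raw query request. The sketch below assumes the boto 2.x Python library (one of several SDK options; it is not documented in this guide), the output bucket is a placeholder, and the argument names should be verified against your boto version.

import boto.emr
from boto.emr.step import StreamingStep

conn = boto.emr.connect_to_region("us-east-1")

# Mirrors the CLI example: wordSplitter.py as the mapper and the built-in
# aggregate package as the reducer.
step = StreamingStep(
    name="Word count",
    mapper="s3://elasticmapreduce/samples/wordcount/wordSplitter.py",
    reducer="aggregate",
    input="s3n://elasticmapreduce/samples/wordcount/input",
    output="s3n://myawsbucket/wordcount/output",
)

jobflow_id = conn.run_jobflow(
    name="Streaming word count",
    log_uri="s3n://myawsbucket/wordcount/logs",
    steps=[step],
    num_instances=1,
    master_instance_type="m1.small",
    slave_instance_type="m1.small",
)
print(jobflow_id)  # store the job flow ID for later describe or terminate calls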
  • 39. Amazon Elastic MapReduce Developer Guide How to Create a Job Flow Using Hive Amazon EMR Console This example describes how to use the Amazon EMR console to create a job flow using Hive. To create a job flow using Hive 1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/. 2. Click Create New Job Flow. 3. In the DEFINE JOB FLOW page, do the following: a. Enter a name in the Job Flow Name field. We recommended you use a descriptive name. It does not need to be unique. b. Select which version of Hadoop to run on your cluster in Hadoop Version.You can choose to run the Amazon distribution of Hadoop or one of two MapR distributions. For more information about MapR distributions for Hadoop, see Launch a Job Flow on the MapR Distribution for Hadoop (p. 260). c. Select Run your own application. d. Select Hive in the drop-down list. e. Click Continue. API Version 2009-11-30 33
  • 40. Amazon Elastic MapReduce Developer Guide How to Create a Job Flow Using Hive 4. In SPECIFY PARAMETERS page, specify whether you want to run the Hive job from a script or interactively from the master node. If you are running Hive from a script, enter values in the boxes using the following table as a guide. Click Continue. Field Action Specify the URI where your script resides in Amazon S3. The value must be in the form BucketName/path/ScriptName. Script Location* Optionally, specify the URI where your input files reside in Amazon S3. The value must be in the form BucketName/path/. If specified, this will be passed to the Hive script as a parameter named INPUT. Input Location Optionally, specify the URI where you want the output stored in Amazon S3. The value must be in the form BucketName/path. If specified, this will be passed to the Hive script as a parameter named OUTPUT. Output Location Extra Args Optionally, enter a list of arguments (space-separated strings) to pass to Hive. * Required parameter API Version 2009-11-30 34
  • 41. Amazon Elastic MapReduce Developer Guide How to Create a Job Flow Using Hive 5. In the CONFIGURE EC2 INSTANCES page, select the type and number of instances, using the following table as a guide, and then click Continue. Note Twenty is the default maximum number of nodes per AWS account. For example, if you have two job flows running, the total number of nodes running for both job flows must be 20 or less. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form. Field Action Specify the number of nodes to use in the Hadoop cluster. There is always one master node in each job flow.You can specify the number of core and tasks nodes. Instance Count Specify the Amazon EC2 instance types to use as master, core, and task nodes. Valid types are m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, hs1.8xlarge, and cg1.4xlarge. The cc2.8xlarge instance type is only supported in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions. The cc1.4xlarge and hs1.8xlarge instance types are only supported in the US East (Northern Virginia) Region. Instance Type Specify whether to run master, core, or task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (p. 141) Request Spot Instances * Required parameter API Version 2009-11-30 35
  • 42. Amazon Elastic MapReduce Developer Guide How to Create a Job Flow Using Hive 6. In the ADVANCED OPTIONS page, set additional configuration options, using the following table as a guide, and then click Continue. Field Action Optionally, specify a key pair that you created previously. For more information, see Create an Amazon EC2 Key Pair and PEM File (p. 20). If you do not enter a value in this field, you cannot SSH into the master node. Amazon EC2 Key Pair Optionally, specify a VPC subnet identifier to launch the job flow in an Amazon VPC. For more information, see Running Job Flows on an Amazon VPC (p. 381). Amazon VPC Subnet Id Optionally, specify a path in Amazon S3 to store the Amazon EMR log files. The value must be in the form BucketName/path. If you do not supply a location, Amazon EMR does not log any files. Amazon S3 Log Path Select Yes to store Amazon Elastic MapReduce-generated log files.You must enable debugging at this level if you want to store the log files generated by Amazon EMR. If you select Yes, you must supply an Amazon S3 bucket name where Amazon Elastic MapReduce can upload your log files. For more information, see Troubleshooting (p. 183). Important You can enable debugging for a job flow only when you initially create the job flow. Enable Debugging Select Yes to cause the job flow to continue running when all processing is completed. Keep Alive API Version 2009-11-30 36
  • 43. Amazon Elastic MapReduce Developer Guide How to Create a Job Flow Using Hive Field Action Select Yes to ensure the job flow is not shut down due to accident or error. For more information, see Protect a Job Flow from Termination (p. 136). Termination Protection Select Yes to make the job flow visible and accessible to all IAM users on the AWS account. For more information, see Configure User Permissions with IAM (p. 274). Visible To All IAM Users 7. In the BOOTSTRAP ACTIONS page, select Proceed with no Bootstrap Actions, and then click Continue. For more information about bootstrap actions, see Bootstrap Actions (p. 84). API Version 2009-11-30 37
  • 44. Amazon Elastic MapReduce Developer Guide How to Create a Job Flow Using Hive 8. In the REVIEW page, review the information, edit as necessary to correct any of the values, and then click Create Job Flow when the information is correct. After you click Create Job Flow your request is processed; when it succeeds, a message appears. API Version 2009-11-30 38
9. Click Close.

The Amazon EMR console shows the new job flow starting. Starting a new job flow may take several minutes, depending on the number and type of EC2 instances Amazon EMR is launching and configuring. Click the Refresh button for the latest view of the job flow's progress.

CLI

This example describes how to use the CLI to create a job flow using Hive.

To create a job flow using Hive

• Enter the command for your operating system:

If you are using Linux or UNIX:
$ ./elastic-mapreduce --create --name "Test Hive" --hive-script s3n://elasticmapreduce/samples/hive-ads/libs/model-build.q --args "-d","LIBS=s3n://elasticmapreduce/samples/hive-ads/libs","-d","INPUT=s3n://elasticmapreduce/samples/hive-ads/tables","-d","OUTPUT=s3n://myawsbucket/hive-ads/output/"

If you are using Microsoft Windows:
c:\> ruby elastic-mapreduce --create --name "Test Hive" --hive-script s3n://elasticmapreduce/samples/hive-ads/libs/model-build.q --args "-d","LIBS=s3n://elasticmapreduce/samples/hive-ads/libs","-d","INPUT=s3n://elasticmapreduce/samples/hive-ads/tables","-d","OUTPUT=s3n://myawsbucket/hive-ads/output/"

The output looks similar to the following.

Created job flow JobFlowID
  • 46. Amazon Elastic MapReduce Developer Guide How to Create a Job Flow Using Pig By default, this command launches a job flow to run on a two-node cluster using an Amazon EC2 m1.small instance. Later, when your steps are running correctly on a small set of sample data, you can launch job flows to run on multiple nodes.You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters, respectively. API This section describes the Amazon EMR API Query request parameters you need to create a job flow using Hive. For an explanation of the parameters unique to RunJobFlow, go to RunJobFlow in the Amazon Elastic MapReduce (Amazon EMR) API Reference. The response includes a <JobFlowID>, which you use in other Amazon EMR operations, such as when describing or terminating a job flow. For this reason, it is important to store job flow IDs. The Args argument contains location information for your input data, output data, and LIBS, as shown in the following example. "Name": "Hive job flow", "HadoopJarStep": { "Jar":"s3://us-west-1.elasticmapreduce/libs/script-runner/script-runner.jar", "Args":[ "s3://us-west-1.elasticmapreduce/libs/hive/hive-script", "--base-path", "s3://us-west-1.elasticmapreduce/libs/hive/", "--run-hive-script", "--args", "-f", "s3n://elasticmapreduce/samples/hive-ads/libs/model-build.q", "-d LIBS=s3n://elasticmapreduce/samples/hive-ads/libs" ] } Note All paths are prefixed with their location. The prefix “s3://” refers to the s3n file system. If you use HDFS, prepend the path with hdfs:///. Make sure to use three slashes (///), as in hdfs:///home/hadoop/sampleInput2/. How to Create a Job Flow Using Pig This section covers the basics of creating a job flow using Pig in Amazon Elastic MapReduce (Amazon EMR).You'll step through how to create a job flow using Pig with either the Amazon EMR console, the CLI, or the Query API. Before you create your job flow you'll need to create objects and permissions; for more information see Setting Up Your Environment to Run a Job Flow (p. 17). A job flow using Pig takes SQL-like commands written in Pig Latin, and converts those commands into Hadoop MapReduce algorithms. The examples that follow are based on the Amazon EMR sample: Apache Log Analysis using Pig. The sample evaluates Apache log files and then generates a report containing the total bytes transferred, a list of the top 50 IP addresses, a list of the top 50 external referrers, and the top 50 search terms using Bing and Google. The Pig script is located in the Amazon S3 bucket s3n://elasticmapreduce/samples/pig-apache/do-reports2.pig. Input data is located in the Amazon S3 bucket s3n://elasticmapreduce/samples/pig-apache/input. The output is saved to an Amazon S3 bucket you created as part of Setting Up Your Environment to Run a Job Flow (p. 17). Amazon EMR Console This example describes how to use the Amazon EMR console to create a job flow using Pig. API Version 2009-11-30 40
  • 47. Amazon Elastic MapReduce Developer Guide How to Create a Job Flow Using Pig To create a job flow using Pig 1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/. 2. Click Create New Job Flow. 3. In the DEFINE JOB FLOW page, enter the following: a. Enter a name in the Job Flow Name field. We recommended you use a descriptive name. It does not need to be unique. b. Select which version of Hadoop to run on your cluster in Hadoop Version.You can choose to run the Amazon distribution of Hadoop or one of two MapR distributions. For more information about MapR distributions for Hadoop, see Launch a Job Flow on the MapR Distribution for Hadoop (p. 260). c. Select Run your own application. d. Select Pig Program in the drop-down list. e. Click Continue. API Version 2009-11-30 41