Amazon Elastic MapReduce Developer Guide
Welcome ................................................................................................................................................. 1
Understand Amazon EMR ...................................................................................................................... 2
Overview of Amazon EMR ...................................................................................................................... 2
Architectural Overview of Amazon EMR ....................................................................................... 3
Elastic MapReduce Features ........................................................................................................ 4
Amazon EMR Concepts ......................................................................................................................... 6
Job Flows and Steps ..................................................................................................................... 6
Hadoop and MapReduce .............................................................................................................. 7
Associated AWS Product Concepts ...................................................................................................... 11
Using Amazon EMR ............................................................................................................................. 15
Setting Up Your Environment to Run a Job Flow .................................................................................. 17
Create a Job Flow ................................................................................................................................. 23
How to Create a Streaming Job Flow .......................................................................................... 24
How to Create a Job Flow Using Hive ......................................................................................... 32
How to Create a Job Flow Using Pig ........................................................................................... 40
How to Create a Job Flow Using a Custom JAR ......................................................................... 48
How to Create a Cascading Job Flow ......................................................................................... 56
Launch an HBase Cluster on Amazon EMR ............................................................................... 64
View Job Flow Details ........................................................................................................................... 72
Terminate a Job Flow ............................................................................................................................ 77
Customize a Job Flow .......................................................................................................................... 79
Add Steps to a Job Flow ............................................................................................................. 79
Wait for Steps to Complete ................................................................................................ 81
Add More than 256 Steps to a Job Flow ............................................................................ 82
Bootstrap Actions ........................................................................................................................ 84
Resizing Running Job Flows ....................................................................................................... 96
Calling Additional Files and Libraries ........................................................................................ 104
Using Distributed Cache .................................................................................................. 104
Running a Script in a Job Flow ........................................................................................ 109
Connect to the Master Node in an Amazon EMR Job Flow ............................................................... 110
Connect to the Master Node Using SSH ................................................................................... 111
Web Interfaces Hosted on the Master Node ............................................................................. 115
Open an SSH Tunnel to the Master Node ................................................................................. 116
Configure Foxy Proxy to View Websites Hosted on the Master Node ....................................... 117
Use Cases .......................................................................................................................................... 122
Cascading ................................................................................................................................. 122
Pig ............................................................................................................................................. 126
Streaming .................................................................................................................................. 129
Building Binaries Using Amazon EMR ................................................................................................ 131
Using Tagging ..................................................................................................................................... 136
Protect a Job Flow from Termination .................................................................................................. 136
Lower Costs with Spot Instances ........................................................................................................ 141
Choosing What to Launch as Spot Instances ........................................................................... 142
Spot Instance Pricing in Amazon EMR ..................................................................................... 144
Availability Zones and Regions ................................................................................................. 144
Launching Spot Instances in Job Flows .................................................................................... 145
Changing the Number of Spot Instances in a Job Flow ............................................................ 151
Troubleshooting Spot Instances ................................................................................................ 154
Store Data with HBase ....................................................................................................................... 155
HBase Job Flow Prerequisites .................................................................................................. 155
Launch an HBase Cluster on Amazon EMR ............................................................................. 156
Connect to HBase Using the Command Line ............................................................................ 164
Back Up and Restore HBase .................................................................................................... 165
Terminate an HBase Cluster ..................................................................................................... 174
Configure HBase ....................................................................................................................... 174
Access HBase Data with Hive ................................................................................................... 178
View the HBase User Interface ................................................................................................. 180
View HBase Log Files ............................................................................................................... 180
Monitor HBase with CloudWatch ............................................................................................... 181
Monitor HBase with Ganglia ...................................................................................................... 181
Troubleshooting .................................................................................................................................. 183
Things to Check When Your Amazon EMR Job Flow Fails ....................................................... 183
Amazon EMR Logging .............................................................................................................. 187
Enable Logging and Debugging ................................................................................................ 187
Use Log Files ............................................................................................................................ 190
Monitor Hadoop on the Master Node ........................................................................................ 199
View the Hadoop Web Interfaces .............................................................................................. 200
Troubleshooting Tips ................................................................................................................. 204
Monitor Metrics with Amazon CloudWatch ......................................................................................... 209
Monitor Performance with Ganglia ...................................................................................................... 220
Distributed Copy Using S3DistCp ....................................................................................................... 227
Export, Query, and Join Tables in Amazon DynamoDB ...................................................................... 234
Prerequisites for Integrating Amazon EMR ............................................................................... 235
Step 1: Create a Key Pair .......................................................................................................... 235
Step 2: Create a Job Flow ......................................................................................................... 236
Step 3: SSH into the Master Node ............................................................................................ 241
Step 4: Set Up a Hive Table to Run Hive Commands ................................................................ 244
Hive Command Examples for Exporting, Importing, and Querying Data .................................. 248
Optimizing Performance ............................................................................................................ 255
Use Third Party Applications With Amazon EMR ............................................................................... 258
Parse Data with HParser ........................................................................................................... 258
Using Karmasphere Analytics ................................................................................................... 259
Launch a Job Flow on the MapR Distribution for Hadoop ......................................................... 260
Write Amazon EMR Applications ........................................................................................................ 263
Common Concepts for API Calls ........................................................................................................ 263
Use SDKs to Call Amazon EMR APIs ................................................................................................ 265
Using the AWS SDK for Java to Create an Amazon EMR Job Flow ......................................... 266
Using the AWS SDK for .NET to Create an Amazon EMR Job Flow .......................................... 267
Using the Java SDK to Sign a Query Request .......................................................................... 267
Use Query Requests to Call Amazon EMR APIs ............................................................................... 268
Why Query Requests Are Signed ............................................................................................. 269
Components of a Query Request in Amazon EMR ................................................................... 269
How to Generate a Signature for a Query Request in Amazon EMR ........................................ 270
Configure Amazon EMR ..................................................................................................................... 274
Configure User Permissions with IAM ................................................................................................ 274
Set Policy for an IAM User ........................................................................................................ 277
Configure IAM Roles for Amazon EMR .............................................................................................. 280
Set Access Permissions on Files Written to Amazon S3 .................................................................... 285
Using Elastic IP Addresses ................................................................................................................. 287
Specify the Amazon EMR AMI Version ............................................................................................... 290
Hadoop Configuration ......................................................................................................................... 299
Supported Hadoop Versions ..................................................................................................... 300
Configuration of hadoop-user-env.sh ........................................................................................ 302
Upgrading to Hadoop 1.0 .......................................................................................................... 302
Hadoop Version Behavior ................................................................................................ 303
Hadoop 0.20 Streaming Configuration ...................................................................................... 304
Hadoop Default Configuration (AMI 1.0) ................................................................................... 304
Hadoop Configuration (AMI 1.0) ...................................................................................... 304
HDFS Configuration (AMI 1.0) ......................................................................................... 307
Task Configuration (AMI 1.0) ........................................................................................... 308
Intermediate Compression (AMI 1.0) ............................................................................... 311
Hadoop Memory-Intensive Configuration Settings (AMI 1.0) ................................................... 311
Hadoop Default Configuration (AMI 2.0 and 2.1) ...................................................................... 314
Hadoop Configuration (AMI 2.0 and 2.1) ......................................................................... 314
HDFS Configuration (AMI 2.0 and 2.1) ............................................................................ 318
Task Configuration (AMI 2.0 and 2.1) .............................................................................. 318
Intermediate Compression (AMI 2.0 and 2.1) .................................................................. 321
Hadoop Default Configuration (AMI 2.2) ................................................................................... 322
Hadoop Configuration (AMI 2.2) ...................................................................................... 322
HDFS Configuration (AMI 2.2) ......................................................................................... 326
Task Configuration (AMI 2.2) ........................................................................................... 326
Intermediate Compression (AMI 2.2) ............................................................................... 329
Hadoop Default Configuration (AMI 2.3) ................................................................................... 330
Hadoop Configuration (AMI 2.3) ...................................................................................... 330
HDFS Configuration (AMI 2.3) ......................................................................................... 334
Task Configuration (AMI 2.3) ........................................................................................... 334
Intermediate Compression (AMI 2.3) ............................................................................... 337
File System Configuration ......................................................................................................... 338
JSON Configuration Files .......................................................................................................... 340
Multipart Upload ........................................................................................................................ 343
Hadoop Data Compression ....................................................................................................... 344
Setting Permissions on the System Directory ........................................................................... 345
Hadoop Patches ........................................................................................................................ 346
Hive Configuration .............................................................................................................................. 348
Supported Hive Versions ........................................................................................................... 349
Share Data Between Hive Versions ........................................................................................... 353
Differences from Apache Hive Defaults .................................................................................... 353
Interactive and Batch Modes ..................................................................................................... 355
Creating a Metastore Outside the Hadoop Cluster ................................................................... 357
Using the Hive JDBC Driver ...................................................................................................... 359
Additional Features of Hive in Amazon EMR ............................................................................ 362
Upgrade to Hive 0.8 .................................................................................................................. 368
Upgrade the Configuration Files ...................................................................................... 368
Upgrade the Metastore .................................................................................................... 369
Upgrade to Hive 0.8 (MySQL on the Master Node) ................................................ 369
Upgrade to Hive 0.8 (MySQL on Amazon RDS) ..................................................... 373
Pig Configuration ................................................................................................................................ 377
Supported Pig Versions ............................................................................................................. 377
Pig Version Details .................................................................................................................... 379
Performance Tuning ............................................................................................................................ 381
Running Job Flows on an Amazon VPC ............................................................................................. 381
Appendix: Compare Job Flow Types ................................................................................................... 389
Appendix: Amazon EMR Resources ................................................................................................... 391
Document History ............................................................................................................................... 396
Glossary ............................................................................................................................................. 393
Index ................................................................................................................................................... 401
Welcome
This is the Amazon Elastic MapReduce (Amazon EMR) Developer Guide. This guide provides a conceptual
overview of Amazon EMR, an overview of related AWS products, and detailed information on all
functionality available from Amazon EMR.
Amazon EMR is a web service that makes it easy to process large amounts of data efficiently. Amazon
EMR uses Hadoop processing combined with several AWS products to do such tasks as web indexing,
data mining, log file analysis, machine learning, scientific simulation, and data warehousing.
How Do I...?
How Do I?                                          Relevant Sections
Decide whether Amazon EMR is right for my needs    Amazon Elastic MapReduce detail page
Get started with Amazon EMR                        Getting Started Guide
Learn about troubleshooting job flows              Troubleshooting (p. 183)
Learn how to create a job flow                     Create a Job Flow (p. 23)
Learn about bootstrap actions                      Bootstrap Actions (p. 84)
Learn about Hadoop cluster configuration           Hadoop Configuration (p. 299)
Learn about the Amazon EMR API                     Write Amazon EMR Applications (p. 263)
Compare different job flow types                   Appendix: Compare Job Flow Types (p. 389)
Understand Amazon EMR
Topics
• Overview of Amazon EMR (p. 2)
• Amazon EMR Concepts (p. 6)
• Associated AWS Product Concepts (p. 11)
This introduction to Amazon Elastic MapReduce (Amazon EMR) provides a summary of this web service.
After reading this section, you should understand the service features, know how Amazon EMR interacts
with other AWS products, and understand the basic functions of Amazon EMR.
In this guide, we assume that you have read and completed the instructions described in the Getting
Started Guide, which provides information on creating your Amazon Elastic MapReduce (Amazon EMR)
account and credentials.
You should be familiar with the following:
• Hadoop. For more information, go to http://hadoop.apache.org/core/.
• Amazon Elastic Compute Cloud (Amazon EC2), Amazon Simple Storage Service (Amazon S3), and
Amazon SimpleDB. For more information, see the Amazon Elastic Compute Cloud User Guide, the
Amazon Simple Storage Service Developer Guide, and the Amazon SimpleDB Developer Guide,
respectively.
Overview of Amazon EMR
Amazon Elastic MapReduce (Amazon EMR) is a data analysis tool that simplifies the set-up and
management of a computer cluster, the source data, and the computational tools that help you implement
sophisticated data processing jobs quickly.
Typically, data processing involves performing a series of relatively simple operations on large amounts
of data. In Amazon EMR, each operation is called a step and a sequence of steps is a job flow. A job flow
that processes encrypted data might look like the following example.
Step 1 Decrypt data
Step 2 Process data
Step 3 Encrypt data
Step 4 Save data
Amazon EMR uses Hadoop to divide the work among the instances in the cluster, track status, and
combine the individual results into one output. For an overview of Hadoop, see What Is Hadoop? (p. 8).
Amazon EMR takes care of provisioning a Hadoop cluster, running the job flow, terminating the job flow,
moving the data between Amazon EC2 and Amazon S3, and optimizing Hadoop. Amazon EMR removes
most of the cumbersome details of setting up the hardware and networking required by the Hadoop
cluster, such as monitoring the setup, configuring Hadoop, and executing the job flow. Together, Amazon
EMR and Hadoop provide all of the power of Hadoop processing with the ease of use, low cost, and
scalability that Amazon S3 and Amazon EC2 offer.
Architectural Overview of Amazon EMR
Amazon Elastic MapReduce (Amazon EMR) works in conjunction with Amazon EC2 to create a Hadoop
cluster, and with Amazon S3 to store scripts, input data, log files, and output results. The Amazon EMR
process is outlined in the following table.
Amazon EMR Process

1. Upload to Amazon S3 the data you want to process, as well as the mapper and reducer executables
   that process the data, and then send a request to Amazon EMR to start a job flow.
2. Amazon EMR starts a Hadoop cluster, which loads any specified bootstrap actions and then runs
   Hadoop on each node.
3. Hadoop executes the job flow by downloading data from Amazon S3 to the core and task nodes.
   Alternatively, the data is loaded dynamically at run time by mapper tasks.
4. Hadoop processes the data and then uploads the results from the cluster to Amazon S3.
5. The job flow is completed and you retrieve the processed data from Amazon S3.
For details on mapping legacy job flows to instance groups, see Mapping Legacy Job Flows to Instance
Groups (p. 102).
Elastic MapReduce Features
Topics
• Bootstrap Actions (p. 4)
• Configurable Data Storage (p. 4)
• Hadoop and Step Logging (p. 5)
• Hive Support (p. 5)
• Resizeable Running Job Flows (p. 5)
• Secure Data (p. 5)
• Supports Hadoop Methods (p. 5)
• Multiple Sequential Steps (p. 5)
The following sections describe the features available in Amazon Elastic MapReduce (Amazon EMR).
Bootstrap Actions
A bootstrap action is a mechanism that lets you run a script on Elastic MapReduce cluster nodes before
Hadoop starts. Bootstrap action scripts are stored in Amazon S3 and passed to Amazon EMR when
creating a new job flow. Bootstrap action scripts are downloaded from Amazon S3 and executed on each
node before the job flow is executed.
By using bootstrap actions, you can install software on the node, modify the default Hadoop site
configuration, or change the way Java parameters are used to run Hadoop daemons.
Both predefined and custom bootstrap actions are available. The predefined bootstrap actions include
Configure Hadoop, Configure Daemons, and Run-if. You can write custom bootstrap actions in any
language already installed on the job flow instance, such as Ruby, Python, Perl, or bash.
You can specify a bootstrap action from the command line interface, from the Amazon EMR console, or
from the Amazon EMR API when starting a job flow. For more information, see Bootstrap Actions (p. 84).
Configurable Data Storage
Amazon EMR supports the Hadoop Distributed File System (HDFS). HDFS is fault-tolerant, scalable, and
easily configurable. The default configuration is already optimized for most job flows. Generally, the
API Version 2009-11-30
4
11. Amazon Elastic MapReduce Developer Guide
Elastic MapReduce Features
configuration needs to be changed only for very large clusters. Configuration changes are accomplished
using bootstrap actions. For more information, see Hadoop Configuration (p. 299).
Hadoop and Step Logging
Amazon EMR provides detailed logs you can use to debug both Hadoop and Amazon EMR. For more
information on how to create logs, view logs, and use them to troubleshoot a job flow, see
Troubleshooting (p. 183).
Hive Support
Amazon Elastic MapReduce (Amazon EMR) supports Apache Hive. Hive is an integrated data warehouse
infrastructure built on top of Hadoop. It provides tools to simplify data summarization and provides ad
hoc querying and analysis of large datasets stored in Hadoop files. Hive provides a simple query language
called Hive QL, which is based on SQL.
For more information on the supported versions of Hive, see Hive Configuration (p. 348).
Resizeable Running Job Flows
The ability to resize a running job flow lets you increase or decrease the number of nodes in a running
cluster. Core nodes contain the Hadoop Distributed File System (HDFS). After a job flow is running, you
can increase the number of core nodes. Task nodes also run your Hadoop, but do not contain HDFS.
After a job flow is running you can also increase and decrease the number of task nodes. For more
information, see Resizing Running Job Flows (p. 96).
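As an illustration of resizing from the CLI, the following is a sketch that assumes the CLI's
--modify-instance-group option; the job flow and instance group IDs are placeholders:

$ ./elastic-mapreduce --jobflow j-1ABCDEFGHIJKL \
    --modify-instance-group ig-ABCDEFGHIJKL --instance-count 8   # both IDs are placeholders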
Secure Data
Amazon EMR provides an authentication mechanism to ensure that data stored in Amazon S3 is secured
against unauthorized access. By default, only the AWS Account owner can access the data uploaded to
Amazon S3. Other users can access the data only if you explicitly edit security permissions.
You can send data to Amazon S3 using the secure HTTPS protocol. Amazon EMR always uses a secure
channel to send data between Amazon S3 and Amazon EC2. For added security, you can encrypt your
data before uploading it to Amazon S3. For more information on AWS security, go to the AWS Security
Center.
Supports Hadoop Methods
Amazon EMR supports job flows based on streaming, Hive, Pig, Custom JAR, and Cascading. Streaming
enables you to write application logic in any language and to process large amounts of data using the
Hadoop framework. Hive and Pig offer nonprogramming options with their SQL-like scripting languages.
Custom JAR files enable you to write Java-based MapReduce functions. Cascading is an API with built-in
MapReduce support that lets you create complex distributed processes. For more information, see Using
Amazon EMR (p. 15).
Multiple Sequential Steps
Amazon EMR supports job flows with multiple, sequential steps, including the ability to add steps while
a job flow runs. Individual steps can combine to create more sophisticated job flows. Additionally, you
can incrementally add steps to a running job flow to help with debugging. For more information, see Add
Steps to a Job Flow (p. 79).
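For example, a step can be added to a running job flow from the CLI. This sketch assumes a placeholder
job flow ID and a custom JAR that you have uploaded to Amazon S3:

$ ./elastic-mapreduce --jobflow j-1ABCDEFGHIJKL \
    --jar s3n://myawsbucket/steps/my-processor.jar \
    --arg s3n://myawsbucket/input --arg s3n://myawsbucket/output
# the job flow ID, bucket, and JAR names are placeholders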
Amazon EMR Concepts
Topics
• Job Flows and Steps (p. 6)
• Hadoop and MapReduce (p. 7)
This section describes the concepts and terminology you need to understand and use Amazon Elastic
MapReduce (Amazon EMR).
Job Flows and Steps
A job flow is the series of instructions Amazon Elastic MapReduce (Amazon EMR) uses to process data.
A job flow contains any number of user-defined steps. A step is any instruction that manipulates the data.
Steps are executed in the order in which they are defined in the job flow.
You can track the progress of a job flow by checking its state. The following diagram shows the life cycle
of a job flow and how each part of the job flow process maps to a particular job flow state.
A successful Amazon Elastic MapReduce (Amazon EMR) job flow follows this process: Amazon EMR
first provisions a Hadoop cluster. During this phase, the job flow state is STARTING. Next, any user-defined
bootstrap actions are run. During this phase, the job flow state is BOOTSTRAPPING. After all bootstrap
actions are completed, the job flow state is RUNNING. The job flow sequentially runs all job flow steps
during this phase. After all steps run, the job flow state transitions to SHUTTING_DOWN and the job flow
shuts down the cluster. All data stored on a cluster node is deleted. Information stored elsewhere, such
as in your Amazon S3 bucket, persists. Finally, when all job flow activity is complete, the job flow state
is marked as COMPLETED.
You can configure a job flow to go into a WAITING state once it completes processing of all steps. A job
flow in the WAITING state continues running, waiting for you to add steps or manually terminate it. When
you manually terminate a job flow, the Hadoop cluster shuts down and the job flow state is SHUTTING_DOWN.
When the job flow activity is complete, the final job flow state is TERMINATED. Creating a WAITING job
flow is useful when troubleshooting. For more information on troubleshooting, see Debug Job Flows with
Steps (p. 206).
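For example, the CLI's --alive flag creates a job flow that enters the WAITING state after its steps
complete instead of terminating; the job flow name shown is illustrative:

$ ./elastic-mapreduce --create --alive --name "Development job flow"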
Any failure during the job flow process terminates the job flow and shuts down all cluster nodes. Any data
stored on a cluster node is deleted. The job flow state is marked as FAILED.
For a complete list of job flow states, see the JobFlowExecutionStatusDetail data type in the Amazon
Elastic MapReduce (Amazon EMR) API Reference.
You can also track the progress of job flow steps by checking their state. The following diagram shows
the processing of job flow steps and how each step maps to a particular state.
A job flow contains one or more steps. Steps are processed in the order in which they are listed in the
job flow. Steps are run following this sequence: all steps have their state set to PENDING. The first step is
run and the step's state is set to RUNNING. When the step is completed, the step's state changes to
COMPLETED. The next step in the queue is run, and the step's state is set to RUNNING. After each step
completes, the step's state is set to COMPLETED and the next step in the queue is run. Steps are run until
there are no more steps. Processing flow returns to the job flow.
If a step fails, the step state is FAILED and all remaining steps with a PENDING state are marked as
CANCELLED. No further steps are run, and processing returns to the job flow.
Data is normally communicated from one step to the next using files stored on the cluster's Hadoop
Distributed File System (HDFS). Data stored on HDFS exists only as long as the cluster is running. When
the cluster is shut down, all data is deleted. The final step in a job flow typically stores the processing
results in an Amazon S3 bucket.
For a complete list of step states, see the StepExecutionStatusDetail data type in the Amazon Elastic
MapReduce (Amazon EMR) API Reference.
Hadoop and MapReduce
Topics
• What Is Hadoop? (p. 8)
• What Is MapReduce? (p. 8)
• Instance Groups (p. 9)
• Supported Hadoop Versions (p. 10)
• Supported File Systems (p. 10)
This section explains the roles of Apache Hadoop and MapReduce in Amazon Elastic MapReduce
(Amazon EMR) and how these two methodologies work together to process data.
What Is Hadoop?
Apache Hadoop is an open-source Java software framework that supports massive data processing
across a cluster of servers. Hadoop uses a programming model called MapReduce that divides a large
data set into many small fragments. Hadoop distributes a data fragment and a copy of the MapReduce
executable to each of the slave nodes in a Hadoop cluster. Each slave node runs the MapReduce
executable on its subset of the data. Hadoop then combines the results from all of the nodes into a finished
output. Amazon EMR enables you to upload that output into an Amazon S3 bucket you designate.
For more information about Hadoop, go to http://hadoop.apache.org.
What Is MapReduce?
MapReduce is a combination of mapper and reducer executables that work together to process data.
The mapper executable processes the raw data into key/value pairs, called intermediate results. The
reducer executable combines the intermediate results, applies additional algorithms, and produces the
final output, as described in the following process.
MapReduce Process

1. Amazon Elastic MapReduce (Amazon EMR) starts your instances in two security groups: one for
   the master node and another for the core and task nodes.
2. Hadoop breaks a data set into multiple sets if the data set is too large to process quickly on a single
   cluster node.
3. Hadoop distributes the data files and the MapReduce executable to the core and task nodes of the
   cluster. Hadoop handles machine failures and manages network communication between the master,
   core, and task nodes. In this way, developers do not need to know how to perform distributed
   programming or handle the details of data redundancy and failover.
4. The mapper function uses an algorithm that you supply to parse the data into key/value pairs. These
   key/value pairs are passed to the reducer. As an example, for a job flow that counts the number of
   times a word appears in a document, the mapper might take each word in a document and assign it
   a value of 1. Each word is a key in this case, and all values are 1.
5. The reducer function collects the results from all of the mapper functions in the cluster, eliminates
   redundant keys by combining values of all like keys, performs the designated operation on all the
   values for each key, and then outputs the results. Continuing with the previous example, the reducer
   takes all of the word counts from all of the mapper functions running in a cluster, adds up the number
   of times each word was found, and then outputs that result to Amazon S3.
You can write the executables in any programming language. Mapper and reducer applications written
in Java are compiled into a JAR file. Executables written in other programming languages use the Hadoop
streaming utility to implement the mapper and reducer algorithms.
The mapper executable reads the input from standard input and the reducer outputs data through standard
output. By default, each line of input/output represents a record and the first tab on each line of the output
separates the key and value.
For more information about MapReduce, go to How Map and Reduce operations are actually carried out
(http://wiki.apache.org/hadoop/HadoopMapReduce).
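As a concrete example, the word-count streaming sample that AWS publishes can be launched from the
CLI roughly as follows; the output bucket name is a placeholder that you replace with your own:

$ ./elastic-mapreduce --create --stream \
    --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
    --reducer aggregate \
    --input s3://elasticmapreduce/samples/wordcount/input \
    --output s3n://myawsbucket/wordcount/output   # replace myawsbucket with your bucket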
Instance Groups
Amazon EMR runs a managed version of Apache Hadoop, handling the details of creating the cloud-server
infrastructure to run the Hadoop cluster. Amazon EMR refers to this cluster as a job flow, and defines the
concept of instance groups, which are collections of Amazon EC2 instances that perform roles analogous
to the master and slave nodes of Hadoop. There are three types of instance groups: master, core, and
task.
Each Amazon EMR job flow includes one master instance group that contains one master node, a core
instance group containing one or more core nodes, and an optional task instance group, which can contain
any number of task nodes.
If the job flow is run on a single node, then that instance is simultaneously a master and a core node. For
job flows running on more than one node, one instance is the master node and the remaining are core
or task nodes.
For more information about instance groups, see Resizing Running Job Flows (p. 96).
Master Instance Group
The master instance group manages the job flow: coordinating the distribution of the MapReduce
executable and subsets of the raw data to the core and task instance groups. It also tracks the status of
each task performed, and monitors the health of the instance groups. To monitor the progress of the job
flow, you can SSH into the master node as the Hadoop user and either look at the Hadoop log files directly
or access the user interface that Hadoop publishes to the web server running on the master node. For
more information, see View Logs Using SSH (p. 197).
As the job flow progresses, each core and task node processes its data, transfers the data back to Amazon
S3, and provides status metadata to the master node.
Note
The instance controller on the master node uses MySQL. If MySQL becomes unavailable, the
instance controller will be unable to launch and manage instances.
Core Instance Group
The core instance group contains all of the core nodes of a job flow. A core node is an EC2 instance that
runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS).
Core nodes are managed by the master node.
The EC2 instances you assign as core nodes are capacity that must be allotted for the entire job flow
run. Because core nodes store data, you can't remove them from a job flow. However, you can add more
core nodes to a running job flow. Core nodes run both the DataNode and TaskTracker Hadoop daemons.
Caution
Removing HDFS from a running node runs the risk of losing data.
For more information about core instance groups, see Resizing Running Job Flows (p. 96).
Task Instance Group
The task instance group contains all of the task nodes in a job flow. The task instance group is optional.
You can add it when you start the job flow or add a task instance group to a job flow in progress.
Task nodes are managed by the master node. While a job flow is running, you can increase and decrease
the number of task nodes. Because they don't store data and can be added and removed from a job flow,
you can use task nodes to manage the EC2 instance capacity your job flow uses, increasing capacity to
handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.
For more information about task instance groups, see Resizing Running Job Flows (p. 96).
Supported Hadoop Versions
Amazon Elastic MapReduce (Amazon EMR) allows you to choose to run either Hadoop version 0.18,
Hadoop version 0.20, or Hadoop version 0.20.205.
For more information on Hadoop configuration, see Hadoop Configuration (p. 299).
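For example, assuming the CLI's --hadoop-version option, you could select a specific version when
creating a job flow:

$ ./elastic-mapreduce --create --alive --hadoop-version 0.20.205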
Supported File Systems
Amazon EMR and Hadoop typically use two or more of the following file systems when processing a job
flow:
• Hadoop Distributed File System (HDFS)
• Amazon S3 Native File System (S3N)
• Local file system
• Legacy Amazon S3 Block File System
HDFS and S3N are the two main file systems used with Amazon EMR.
HDFS is a distributed, scalable, and portable file system for Hadoop. An advantage of HDFS is data
awareness between the Hadoop cluster nodes managing the job flows and the Hadoop cluster nodes
managing the individual steps. For more information on how HDFS works, see
http://hadoop.apache.org/docs/hdfs/current/hdfs_user_guide.html.
The Amazon S3 Native File System (S3N) is a file system for reading and writing regular files on Amazon
S3. The advantage of this file system is that you can access files on Amazon S3 that were written with
other tools. For information on how Amazon S3 and Hadoop work together, see
http://wiki.apache.org/hadoop/AmazonS3.
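For example, once connected to the master node, standard Hadoop tools can address both file systems
through their URI schemes; the bucket name below is a placeholder:

$ hadoop fs -ls s3n://myawsbucket/input/   # read Amazon S3 through S3N
$ hadoop fs -ls hdfs:///                   # read the cluster's HDFS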
The local file system refers to a locally connected disk. When a Hadoop cluster is created, each node is
created from an Amazon EC2 instance, which comes with a preconfigured block of preattached disk
storage called an Amazon EC2 local instance store. Data on instance store volumes persists only during
the life of the associated Amazon EC2 instance. The amount of this disk storage varies by Amazon EC2
instance type. It is ideal for temporary storage of information that is continually changing, such as buffers,
caches, scratch data, and other temporary content. For more information about Amazon EC2 instances,
see Amazon Elastic Compute Cloud.
The Amazon S3 Block File System is a legacy file storage system. We strongly discourage the use
of this system.
For more information on how to use and configure file systems in Amazon EMR, see File System
Configuration (p. 338).
Associated AWS Product Concepts
Topics
• Amazon EC2 Concepts (p. 11)
• Amazon S3 Concepts (p. 14)
• AWS Identity and Access Management (IAM) (p. 14)
• Regions (p. 14)
• Data Storage (p. 14)
This section describes AWS concepts and terminology you need to understand to use Amazon Elastic
MapReduce (Amazon EMR) effectively.
Amazon EC2 Concepts
Topics
• Amazon EC2 Instances (p. 11)
• Reserved Instances (p. 13)
• Elastic IP Address (p. 13)
• Amazon EC2 Key Pairs (p. 13)
The following sections describe Amazon EC2 features used by Amazon EMR.
Amazon EC2 Instances
Amazon EMR enables you to choose the number and kind of Amazon EC2 instances that comprise the
cluster that processes your job flow. Amazon EC2 offers several basic types.
• Standard—You can use Amazon EC2 standard instances for most applications.
• High-CPU—These instances have proportionally more CPU resources than memory (RAM) for
compute-intensive applications.
• High-Memory—These instances offer large memory sizes for high throughput applications, including
database and memory caching applications.
• Cluster Compute—These instances provide proportionally high CPU resources with increased network
performance. They are well suited for demanding network-bound applications.
• High Storage—These instances provide proportionally high storage resources. They are well suited
for data warehouse applications.
Note
Amazon EMR does not support micro instances at this time.
The following table describes all the instance types that Amazon EMR supports.
Name                                     RAM (GiB)  Compute Units  Disk Drive (GiB)  Platform (bits)  I/O Performance                   Instance Type
Small (default)                          1.7        1              150               32               Moderate                          m1.small
Large                                    7.5        4              840               64               High                              m1.large
Extra Large                              15         8              1680              64               High                              m1.xlarge
High-CPU Medium                          1.7        5              340               32               Moderate                          c1.medium
High-CPU Extra Large                     7          20             1680              64               High                              c1.xlarge
High-Memory Extra Large                  17.1       6.5            420               64               Moderate                          m2.xlarge
High-Memory Double Extra Large           34.2       13             850               64               Moderate                          m2.2xlarge
High-Memory Quadruple Extra Large        68.4       26             1680              64               High                              m2.4xlarge
Cluster Compute Quadruple Extra Large*   23         33.5           1690              64               Very High (10 Gigabit Ethernet)   cc1.4xlarge
Cluster Compute Eight Extra Large**      60.5       88             3360              64               Very High (10 Gigabit Ethernet)   cc2.8xlarge
High Storage*                            117        35             49152             64               Very High (10 Gigabit Ethernet)   hs1.8xlarge
Cluster GPU***                           23         33.5           1680              64               Very High (10 Gigabit Ethernet)   cg1.4xlarge
*Cluster Compute Quadruple Extra Large instances and High Storage instances are supported only in
the US East (Northern Virginia) Region.
**Cluster Compute Eight Extra Large instances are only supported in the US East (Northern Virginia),
US West (Oregon), and EU (Ireland) Regions.
***Cluster GPU instances have 22 GB, with 1 GB reserved for GPU operation.
The practical limit of the amount of data you can process depends on the number and type of Amazon
EC2 instances selected as your cluster nodes, and on the size of your intermediate and final data. This
is because the input, intermediate, and output data sets reside on the cluster nodes while your job flow
runs. For example, the maximum amount of data that you can process on a 20-node cluster is 34 TB (20
Extra Large instances x 1.69 TB of hard disk per Amazon EC2 instance = 34 TB).
The default maximum number of Amazon EC2 instances you can specify is 20. If you need more instances,
you can make a formal request. For more information, go to the Request to Increase Amazon EC2 Instance
Limit Form.
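For example, the CLI's --num-instances and --instance-type parameters set the cluster size and node
type at job flow creation:

$ ./elastic-mapreduce --create --alive \
    --num-instances 5 --instance-type m1.large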
Related Topics
• Request additional Amazon EC2 instances
• Amazon EC2 Instance Types
• High Performance Computing (HPC)
Reserved Instances
Reserved Instances provide guaranteed capacity and are an additional Amazon EC2 pricing option. You
make a one-time payment for an instance to reserve capacity and reduce hourly usage charges. Reserved
Instances complement existing Amazon EC2 On-Demand Instances and provide an option to reduce
computing costs. As with On-Demand Instances, you pay only for the compute capacity that you actually
consume, and if you don't use an instance, you don't pay usage charges for it.
To use a Reserved Instance with Amazon EMR, launch your job flow in the same Availability Zone as
your Reserved Instance. For example, let's say you purchase one m1.small Reserved Instance in US-East.
If you launch a job flow that uses two m1.small instances in the same Availability Zone in Region US-East,
one instance is billed at the Reserved Instance rate and the other is billed at the On-Demand rate. If you
have a sufficient number of available Reserved Instances for the total number of instances you want to
launch, you are guaranteed capacity. Your Reserved Instances are used before any On-Demand Instances
are created.
You can use Reserved Instances by using either the Amazon EMR console, the command line interface
(CLI), Amazon EMR API actions, or the AWS SDKs.
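For example, to place a job flow where a reservation was purchased, you can pin it to an Availability
Zone from the CLI; the zone shown is illustrative:

$ ./elastic-mapreduce --create --alive \
    --availability-zone us-east-1a \
    --num-instances 2 --instance-type m1.small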
Related Topics
• Amazon EC2 Reserved Instances
Elastic IP Address
Elastic IP addresses are static IP addresses designed for dynamic cloud computing. An Elastic IP address
is associated with your account, not a particular instance. You control the addresses associated with your
account until you choose to explicitly release them.
You can associate one Elastic IP address with only one job flow at a time. To ensure our customers are
efficiently using Elastic IP addresses, we impose a small hourly charge when IP addresses associated
with your account are not mapped to a job flow or Amazon EC2 instance. When Elastic IP addresses are
mapped to an instance, they are free of charge.
For more information about enabling Elastic IP addresses with Amazon EMR, see Using Elastic IP
Addresses (p. 287). For more information about using IP addresses in AWS, go to the Using Elastic IP
Addresses section in the Amazon Elastic Compute Cloud User Guide.
Amazon EC2 Key Pairs
When Amazon EMR starts an Amazon EC2 instance, it uses a 2048-bit RSA key pair that you have
named. Amazon EC2 stores the public key; you store the private key and use it to authenticate when
you connect to the instance.
The key pair ensures that only you can access your job flows. When you launch an instance using your
key pair name, the public key becomes part of the instance metadata. This allows you to access the
cluster node securely.
Although specifying the key pair is optional, we strongly recommend that you use key pairs. This key pair
becomes associated with all of the nodes created to process your job flow. The key pair name creates a
handle you can use to access the master node in the Hadoop cluster. With the key pair name, you can
log in to the master node without using a password, enabling you to monitor the progress of your job
flows. On the master node, you can retrieve detailed job flow processing status and statistics.
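For example, after launching a job flow with a key pair, you might connect to the master node as follows;
the key file name and public DNS name are placeholders:

$ ssh -i ~/mykeypair.pem hadoop@ec2-XX-XX-XX-XX.compute-1.amazonaws.com   # placeholders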
For more information on how to create and use an Amazon EC2 key pair with Amazon EMR, see "Creating
an Amazon EC2 Key Pair" in the Getting Started Guide.
Amazon S3 Concepts
Topics
• Buckets (p. 14)
• Multipart Upload (p. 14)
The following sections describe Amazon S3 features used by Amazon EMR.
Buckets
Amazon EMR requires Amazon S3 buckets to hold the input and output data of your Hadoop processing.
Amazon EMR uses the Amazon S3 Native File System for Hadoop processing. Amazon S3 uses the
hostname method for accessing data, which places restrictions on bucket names used in Amazon EMR
job flows.
For more information on creating Amazon S3 buckets for use with Amazon EMR, see Setting Up Your
Environment to Run a Job Flow (p. 17). For more information on Amazon S3 buckets, go to Working
with Amazon S3 Buckets in the Amazon S3 Developer Guide.
Multipart Upload
Amazon Elastic MapReduce (Amazon EMR) supports Amazon S3 multipart upload through the AWS
SDK for Java. Multipart upload lets you upload a single object as a set of parts. You can upload these
object parts independently and in any order. If transmission of any part fails, you can retransmit that part
without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles the parts
and creates the object.
For more information about enabling multipart uploads with Amazon EMR, see Multipart Upload (p. 343).
For more information on Amazon S3 multipart uploads, go to Uploading Objects Using Multipart Upload
in the Amazon S3 Developer Guide.
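As a sketch, multipart upload is typically toggled through a core-site property set with the predefined
configure-hadoop bootstrap action; the property name, fs.s3n.multipart.uploads.enabled, is an
assumption here:

$ ./elastic-mapreduce --create --alive \
    --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
    --args "-c,fs.s3n.multipart.uploads.enabled=true"   # assumed property name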
AWS Identity and Access Management (IAM)
Amazon Elastic MapReduce (Amazon EMR) supports AWS Identity and Access Management (IAM)
policies. IAM is a web service that enables AWS customers to manage users and user permissions. For
more information about enabling IAM policies with Amazon EMR, see Configure User Permissions with
IAM (p. 274). For more information on IAM, go to Using IAM in the Using AWS Identity and Access
Management guide.
Regions
You can choose the geographical region where Amazon EC2 creates the cluster to process your data.
You might choose a region to optimize latency, minimize costs, or address regulatory requirements.
Setting a region-specific endpoint guarantees where your data resides. For the list of regions and endpoints
supported by Amazon EMR, go to Regions and Endpoints in the Amazon Web Services General Reference.
Data Storage
Amazon EMR uses Amazon S3 and Amazon SimpleDB data storage systems when processing a job
flow. For more information about using Amazon S3 with Hadoop, go to
http://wiki.apache.org/hadoop/AmazonS3. For more information about Amazon SimpleDB, go to the
Amazon SimpleDB product description page.
Using Amazon EMR
Topics
• Setting Up Your Environment to Run a Job Flow (p. 17)
• Create a Job Flow (p. 23)
• View Job Flow Details (p. 72)
• Terminate a Job Flow (p. 77)
• Customize a Job Flow (p. 79)
• Connect to the Master Node in an Amazon EMR Job Flow (p. 110)
• Use Cases (p. 122)
• Building Binaries Using Amazon EMR (p. 131)
• Using Tagging (p. 136)
• Protect a Job Flow from Termination (p. 136)
• Lower Costs with Spot Instances (p. 141)
• Store Data with HBase (p. 155)
• Troubleshooting (p. 183)
• Monitor Metrics with Amazon CloudWatch (p. 209)
• Monitor Performance with Ganglia (p. 220)
• Distributed Copy Using S3DistCp (p. 227)
• Export, Import, Query, and Join Tables in Amazon DynamoDB Using Amazon EMR (p. 234)
• Use Third Party Applications With Amazon EMR (p. 258)
This section covers the fundamentals of creating, managing, and troubleshooting a job flow using Amazon
Elastic MapReduce (Amazon EMR). All supported job flow types are described. Information on using the
Amazon EMR console, the CLI, SDKs, and API is included.
If you have not signed up to use Amazon EMR, instructions are provided in the Getting Started Guide.
Tip
We strongly recommend that you work through the examples in the Getting Started Guide to
get a basic understanding of Amazon EMR.
Amazon EMR offers a variety of interfaces, including a console, a command line interface (CLI), a query
API, AWS SDKs, and libraries. Each interface offers a different balance of ease and functionality. The
interface you choose depends on your knowledge of Hadoop, your programming skills, and the functionality
you require:
• The Amazon EMR console provides a graphical interface from which you can launch Amazon EMR
job flows and monitor their progress.
• The CLI combines full compatibility with the Amazon EMR API without requiring a programming
environment. The Ruby-based Amazon EMR CLI is available for download at Amazon Elastic
MapReduce Ruby Client (http://aws.amazon.com/developertools/2264).
• The Amazon EMR API, SDKs, and libraries offer the most flexibility but require a programming
environment and software development skills. For more information on using the query API to access
Amazon EMR, see Write Amazon EMR Applications (p. 263) in this guide. The AWS SDKs provide
support for Java, C#, and .NET. For more information on the AWS SDKs, refer to the list of current
AWS SDKs. Libraries are available for Perl and PHP. For more information about the Perl and PHP
libraries, see Sample Code & Libraries (http://aws.amazon.com/code/Elastic-MapReduce).
The Amazon EMR console, the CLI, and the API/SDKs/libraries differ in which of the following functions
they support:

• Create multiple job flows
• Define bootstrap actions in a job flow
• View logs for Hadoop jobs, tasks, and task attempts using a graphical interface
• Implement Hadoop data processing programmatically
• Monitor job flows in real time
• Provide verbose job flow details
• Resize running job flows
• Run job flows with multiple steps
• Select the version of Hadoop, Hive, and Pig
• Specify the MapReduce executable in multiple computer languages
• Specify the number and type of Amazon EC2 instances that process the data
• Transfer data to and from Amazon S3 automatically
• Terminate job flows in real time
The following sections describe how to use Amazon Elastic MapReduce (Amazon EMR) with each of the
interface types.
Setting Up Your Environment to Run a Job Flow
This section walks you through how to set up required resources and permissions to run a job flow. The
tasks that follow show you how to create the resources that your job flow uses to process data. Once
created, you can reuse these resources for other job flows. Depending on your application, however, it
may make operational sense to create new resources for each job flow.
The tasks that must be completed before you create a job flow are as follows:
1 Choose a Region (p. 17)
2 Create and Configure an Amazon S3 Bucket (p. 19)
3 Create an Amazon EC2 Key Pair and PEM File (p. 20)
4 Modify Your PEM File (p. 21)
5 For CLI and API users only, Get Security Credentials (p. 21)
6 For CLI users only, optionally Create a Credentials File (p. 22)
The following sections provide instructions on how to perform each of the tasks.
Choose a Region
AWS enables you to place resources in multiple locations. Locations are composed of Regions and
Availability Zones within those Regions. Availability Zones are distinct locations within a Region that
are engineered to be insulated from failures in other Availability Zones and to provide inexpensive,
low-latency network connectivity to other Availability Zones in the same Region.
All Amazon EC2 instances, key pairs, security groups, and Amazon Elastic MapReduce (Amazon EMR)
job flows must be located in the same Region. To optimize performance and reduce latency, all resources
(such as Amazon S3 buckets) and job flows should be located in the same Region.
For more information about Regions and Availability Zones, go to Using Regions and Availability Zones
in the Amazon Elastic Compute Cloud User Guide.
Note
Not all AWS products offer the same support in all Regions. For example, Cluster Compute
instances are available only in the US-East (Northern Virginia) Region and the Asia Pacific
(Sydney) region supports only Hadoop 1.0.3 and later. Confirm that you are working in the
appropriate Region for the resources you want to use.
You must ensure that you use the same Region for each resource you create. Use the table below to
identify the correct Region name.
Amazon EMR Region          Amazon EMR CLI and API Region   Amazon S3 Region      Amazon EC2 Region
US East (Virginia)         us-east-1                       US Standard           US East (Virginia)
US West (Oregon)           us-west-2                       Oregon                US West (Oregon)
US West (N. California)    us-west-1                       Northern California   US West (N. California)
EU West (Ireland)          eu-west-1                       Ireland               EU West (Ireland)
Asia Pacific (Singapore)   ap-southeast-1                  Singapore             Asia Pacific (Singapore)
Asia Pacific (Sydney)      ap-southeast-2                  Sydney                Asia Pacific (Sydney)
Asia Pacific (Tokyo)       ap-northeast-1                  Tokyo                 Asia Pacific (Tokyo)
South America (Sao Paulo)  sa-east-1                       Sao Paulo             South America (Sao Paulo)
Using the Amazon EMR Console to Specify a Region
To select a region in Amazon EMR
• From the Amazon EMR console, select the Region from the drop-down list.
Using the CLI to Specify a Region
Specify the Region with the --region parameter, as in the following example. If the --region parameter
is not specified, the job flow is created in the us-east-1 Region.
$ ./elastic-mapreduce --create --alive --stream --input s3n://myawsbucket/input
--output s3n://myawsbucket/output --log-uri s3n://myawsbucket/logs --region eu-west-1
Tip
To reduce the number of parameters required each time you issue a command from the CLI,
you can store information such as Region in your credentials.json file. For more information
on creating a credentials.json file, go to Create a Credentials File (p. 22).
Using the API to Specify a Region
To select a region, configure your application to use that Region's endpoint. If you are creating a client
application using an AWS SDK, you can change the client endpoint by calling setEndpoint, as shown
in the following example:
client.setEndpoint("eu-west-1.elasticmapreduce.amazonaws.com");
Once your application has specified a region by setting the endpoint, you can set the Availability Zone
for your job flow's Amazon EC2 instances with a query request that contains an
Instances.Placement.AvailabilityZone parameter, as in the following example. If you do not
specify the Availability Zone for your job flow, Amazon EMR launches the job flow instances in the best
Availability Zone in that region based on system health and available capacity.
https://elasticmapreduce.amazonaws.com?
Operation=
...
Instances.Placement.AvailabilityZone=eu-west-1a&
...
For more information about the parameters in an Amazon EMR request, see API Reference.
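You can also set the Availability Zone when creating a job flow from the CLI. The following is a minimal
sketch; it assumes your version of the Ruby CLI supports the --availability-zone option (run
elastic-mapreduce --help to confirm) and uses placeholder bucket names:

$ ./elastic-mapreduce --create --alive --stream \
    --input s3n://myawsbucket/input \
    --output s3n://myawsbucket/output \
    --region eu-west-1 \
    --availability-zone eu-west-1a   # launch the cluster instances in a specific zone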
Note
For more information on specifying Regions from the CLI and API, see Available Region
Endpoints for the AWS SDKs.
Create and Configure an Amazon S3 Bucket
Amazon Elastic MapReduce (Amazon EMR) uses Amazon S3 to store input data, log files, and output
data. Amazon S3 refers to these storage locations as buckets. To conform with Amazon S3 requirements,
DNS requirements, and restrictions in the supported data analysis tools, we recommend the following
guidelines for bucket names. All bucket names must:
• Be between 3 and 63 characters long
• Contain only lowercase letters, numbers, or periods (.)
• Not contain a dash (-) or underscore (_)
For additional details on valid bucket names, go to Bucket Restrictions and Limitations in the Amazon
Simple Storage Service Developer Guide.
This section shows you how to use the AWS Management Console to create and then set permissions
for an Amazon S3 bucket. However, you can also create and set permissions for an Amazon S3 bucket
using the Amazon S3 API or the third-party Curl command line tool. For information about Curl, go to
Amazon S3 Authentication Tool for Curl. For information about using the Amazon S3 API to create and
configure an Amazon S3 bucket, go to the Amazon Simple Storage Service API Reference.
Using the AWS Management Console to Create an Amazon
S3 Bucket
To create an Amazon S3 bucket
1. Sign in to the AWS Management Console and open the Amazon S3 console at
https://console.aws.amazon.com/s3/.
2. Click Create Bucket.
The Create a Bucket dialog box opens.
3. Enter a bucket name, such as myawsbucket.
This name must be globally unique; you cannot use a name already taken by another bucket.
4. Select the Region for your bucket. To avoid paying cross-region bandwidth charges, create the
Amazon S3 bucket in the same region as your job flow.
Refer to Choose a Region (p. 17) for guidance on choosing a Region.
5. Click Create.
You created a bucket with the URI s3n://myawsbucket/.
Note
If you enable logging in the Create a Bucket wizard, it enables only bucket access logs, not
Amazon EMR job flow logs.
Note
For more information on specifying Region-specific buckets, refer to Buckets and Regions in the
Amazon Simple Storage Service Developer Guide and Available Region Endpoints for the AWS
SDKs.
After you create your bucket you can set the appropriate permissions on it. Typically, you give yourself
(the owner) read and write access and authenticated users read access.
Using the AWS Management Console to Configure an
Amazon S3 Bucket
To set permissions on an Amazon S3 bucket
1. Sign in to the AWS Management Console and open the Amazon S3 console at
https://console.aws.amazon.com/s3/.
2. In the Buckets pane, right-click the bucket you just created.
3. Select Properties.
4. In the Properties pane, select the Permissions tab.
5. Click Add more permissions.
6. Select Authenticated Users in the Grantee field.
7. To the right of the Grantee drop-down list, select List.
8. Click Save.
You have created a bucket and restricted permissions to authenticated users.
Create an Amazon EC2 Key Pair and PEM File
Amazon EMR uses an Amazon Elastic Compute Cloud (Amazon EC2) key pair to ensure that you alone
have access to the instances that you launch. The PEM file associated with this key pair is required to
SSH directly into the master node of the cluster running your job flow.
To create an Amazon EC2 key pair
1. Sign in to the AWS Management Console and open the Amazon EC2 console at
https://console.aws.amazon.com/ec2/.
2. From the Amazon EC2 console, select a Region.
3. In the Navigation pane, click Key Pairs.
4. On the Key Pairs page, click Create Key Pair.
5. In the Create Key Pair dialog box, enter a name for your key pair, such as mykeypair.
6. Click Create.
7. Save the resulting PEM file in a safe location.
Your Amazon EC2 key pair and an associated PEM file are created.
Modify Your PEM File
Amazon Elastic MapReduce (Amazon EMR) enables you to work interactively with your job flow, allowing
you to test job flow steps or troubleshoot your cluster environment. To log in directly to the master node
of your running job flow, you can use SSH or PuTTY. You use your PEM file to authenticate to the master
node, and the PEM file requires a modification based on the tool you use to connect: you use SSH on
Linux or UNIX computers and PuTTY on Microsoft Windows computers. For more information on how
to install the Amazon EMR CLI or PuTTY, go to the Getting Started Guide.
To modify your PEM file
• Create a local permissions file:

If you are using Linux or UNIX, set the permissions on the PEM file for your Amazon EC2 key pair.
For example, if you saved the file as mykeypair.pem, the command looks like the following:

$ chmod og-rwx mykeypair.pem

If you are using Microsoft Windows, do the following:

a. Download PuTTYgen.exe to your computer from
http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.
b. Launch PuTTYgen.
c. Click Load. Select the PEM file you created earlier.
d. Click Open.
e. Click OK on the PuTTYgen Notice telling you the key was successfully imported.
f. Click Save private key to save the key in the PPK format.
g. When PuTTYgen prompts you to save the key without a pass phrase, click Yes.
h. Enter a name for your PuTTY private key, such as mykeypair.ppk.
i. Click Save.
j. Exit the PuTTYgen application.
Your PEM file is modified to allow you to log in directly to the master node of your running job flow.
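With the permissions set (or the PPK file saved), you can connect to the master node. The following is
a minimal sketch for Linux or UNIX; the public DNS name shown is a hypothetical placeholder that you
replace with your master node's actual DNS name from the job flow details:

# connect as the hadoop user, authenticating with your key pair's PEM file
$ ssh -i mykeypair.pem hadoop@ec2-01-001-001-1.compute-1.amazonaws.com

For full instructions, see Connect to the Master Node in an Amazon EMR Job Flow (p. 110).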
Get Security Credentials
AWS assigns you an Access Key ID and a Secret Access Key to identify you as the sender of your
request. AWS uses these security credentials to help protect your data. You include your Access Key ID
in all AWS requests made through the CLI or API. The AWS Management Console provides these security
credentials automatically.
Note
Your Secret Access Key is a shared secret between you and AWS. Keep this key secret. Amazon
uses this key to bill you for the AWS services you use. Never include your key in your requests
to AWS and never email your key to anyone, even if an inquiry appears to originate from AWS
or Amazon.com. No one who legitimately represents Amazon will ever ask you for your Secret
Access Key.
To get your Access Key ID and Secret Access Key
1. Go to the AWS website.
2. Click My Account to display a list of options.
3. Click Security Credentials and log in to your AWS account. Your Access Key ID is displayed in
the Access Credentials section. Your Secret Access Key remains hidden as a further precaution.
4. To display your Secret Access Key, click Show in the Your Secret Access Key area.
You have your Access Key ID and Secret Access Key to securely identify yourself to AWS. You need
this information to create a credentials file, as described in the following section.
Create a Credentials File
You can use an Amazon EMR credentials file to simplify job flow creation and authentication of requests.
The credentials file provides information required for many commands. The credentials file is a convenient
place for you to store command parameters so you don't have to repeatedly enter the information.
Your credentials are used to calculate the signature value for every request you make. The Amazon EMR
CLI automatically looks for these credentials in the file credentials.json. You can edit the
credentials.json file and include your AWS credentials. If you do not have a credentials.json
file, you must include your credentials in every request you make.
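For example, you can pass your credentials directly on the command line. This sketch assumes your
version of the Ruby CLI supports the --access-id and --private-key options (run
elastic-mapreduce --help to confirm):

# list your job flows, authenticating without a credentials.json file
$ ./elastic-mapreduce --list --access-id AccessKeyID --private-key PrivateKey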
To create your credentials file
1. Create a file named credentials.json on your computer.
2. Add the following lines to your credentials file:
{
  "access-id": "AccessKeyID",
  "private-key": "PrivateKey",
  "key-pair": "KeyName",
  "key-pair-file": "location of key pair file",
  "region": "Region",
  "log-uri": "location of bucket on Amazon S3"
}
The access-id and private-key are the AWS Access Key ID and Secret Access Key described in Get
Security Credentials (p. 21). The key-pair and key-pair-file are the Amazon EC2 key pair and the path
and name of the PEM file you created in Create an Amazon EC2 Key Pair and PEM File (p. 20). The region
is the Region you selected in Choose a Region (p. 17). The log-uri is the path to the bucket you created
in Create and Configure an Amazon S3 Bucket (p. 19), using the format s3n://BucketName/FolderName.
Your credentials.json file is configured.
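To confirm that the CLI is picking up the file, you can issue a simple authenticated command. This
sketch assumes credentials.json is in the directory from which you run the CLI; many versions of the
Ruby CLI also accept an explicit path via the -c (or --credentials) option:

# list your job flows using the credentials stored in credentials.json
$ ./elastic-mapreduce --list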
Each of the preceding tasks guided you through the steps to set up the objects and permissions required
for a job flow. You are now ready to create your job flow. For instructions, see Create a Job Flow (p. 23).
Create a Job Flow
Topics
• Choose a Job Flow Type (p. 23)
• Choose Job Flow Interface (p. 24)
• Identify Data, Scripts, and Log File Locations (p. 24)
• How to Create a Streaming Job Flow (p. 24)
• How to Create a Job Flow Using Hive (p. 32)
• How to Create a Job Flow Using Pig (p. 40)
• How to Create a Job Flow Using a Custom JAR (p. 48)
• How to Create a Cascading Job Flow (p. 56)
• Launch an HBase Cluster on Amazon EMR (p. 64)
This section covers the basics of creating a job flow using Amazon Elastic MapReduce (Amazon EMR).
You can create a job flow using the Amazon EMR console, by downloading and installing the Command
Line Interface (CLI), or by creating a query request with the Query API. The interface-specific details for
using the Amazon EMR console, the CLI, or the API are covered in the following sections.
For information about creating the objects and setting the permissions needed to create a job flow, see
Setting Up Your Environment to Run a Job Flow (p. 17). For information on the job flow process and how
individual steps are processed, see Job Flows and Steps (p. 6).
Choose a Job Flow Type
Choose one of the supported job flow types. Your choice of job flow type depends on several factors,
including the format of the data and your level of programming knowledge. For information on comparing
the supported job flow types, see Appendix: Compare Job Flow Types (p. 389).
Choose Job Flow Interface
Choose the manner in which you want to create your job flow. The description of each job flow type in
this section includes details on how to create a job flow using the Amazon EMR console, the CLI, or
Query API. The Amazon EMR console provides a graphical interface to launch Elastic MapReduce job
flows and monitor their progress. The CLI provides full access to the Elastic MapReduce API without
requiring a programming environment. The Elastic MapReduce API, AWS SDKs, and libraries offer
the most flexibility, but require a programming environment and software development skills.
Identify Data, Scripts, and Log File Locations
You need to plan the job flow you want to run and specify where Amazon EMR can find the information.
Typically, the MapReduce program or script is located in a bucket on Amazon S3. Your job flow input,
output, and job flow logs are also typically located on Amazon S3.
Required Amazon S3 buckets must exist before you can create a job flow. You must upload any required
scripts or data referenced in the job flow to Amazon S3. The following table describes example data,
scripts, and log file locations.
Information         Example location on Amazon S3
script or program   s3://myawsbucket/wordcount/wordSplitter.py
log files           s3://myawsbucket/wordcount/logs
input data          s3://myawsbucket/wordcount/input
output data         s3://myawsbucket/wordcount/output
For information on how to upload objects to Amazon S3, go to Add an Object to Your Bucket in the
Amazon Simple Storage Service Getting Started Guide.
How to Create a Streaming Job Flow
This section covers the basics of creating and launching a streaming job flow using Amazon Elastic
MapReduce (Amazon EMR). You'll step through how to create a streaming job flow using the Amazon
EMR console, the CLI, or the Query API. Before you create your job flow, you'll need to create objects
and permissions; for more information, see Setting Up Your Environment to Run a Job Flow (p. 17).
A streaming job flow reads input from standard input and then runs a script or executable (called a mapper)
against each input. The result from each of the inputs is saved locally, typically on a Hadoop Distributed
File System (HDFS) partition. Once all the input is processed by the mapper, a second script or executable
(called a reducer) processes the mapper results. The results from the reducer are sent to standard output.
You can chain together a series of streaming job flows, where the output of one streaming job flow
becomes the input of another job flow.
The mapper and the reducer can each be referenced as a file, or you can supply a Java class. You can
implement the mapper and reducer in any of the supported languages, including Ruby, Perl, Python,
PHP, or Bash.
The example that follows is based on the Amazon EMR Word Count Example. This example shows how
to use Hadoop streaming to count the number of times each word occurs within a text file. In this example,
the input is located in the Amazon S3 bucket s3n://elasticmapreduce/samples/wordcount/input.
The mapper is a Python script that counts the number of times a word occurs in each input string; it is
located at s3://elasticmapreduce/samples/wordcount/wordSplitter.py. The reducer references
a standard Hadoop library package called aggregate. Aggregate provides a special Java class and a list
of simple aggregators that perform aggregations such as sum, max, and min over a sequence of values.
The output is saved to an Amazon S3 bucket you created in Setting Up Your Environment to Run a Job
Flow (p. 17).
Amazon EMR Console
This example describes how to use the Amazon EMR console to create a streaming job flow.
To create a streaming job flow
1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at
https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create New Job Flow.
3. In the DEFINE JOB FLOW page, do the following:
a. Enter a name in the Job Flow Name field. This name is optional, and does not need to be
unique.
b. Select which version of Hadoop to run on your cluster in Hadoop Version. You can choose to
run the Amazon distribution of Hadoop or one of two MapR distributions. For more information
about MapR distributions for Hadoop, see Launch a Job Flow on the MapR Distribution for
Hadoop (p. 260).
c. Select Run your own application.
d. Select Streaming in the drop-down list.
e. Click Continue.
4. In the SPECIFY PARAMETERS page, enter values in the boxes using the following table as a guide,
and then click Continue.
Input Location*
Specify the URI where the input data resides in Amazon S3. The value must be in the form
BucketName/path.

Output Location*
Specify the URI where you want the output stored in Amazon S3. The value must be in the form
BucketName/path.

Mapper*
Specify either a class name that refers to a mapper class in Hadoop, or a path on Amazon S3 where
the mapper executable, such as a Python program, resides. The path value must be in the form
BucketName/path/MapperExecutable.

Reducer*
Specify either a class name that refers to a reducer class in Hadoop, or a path on Amazon S3 where
the reducer executable, such as a Python program, resides. The path value must be in the form
BucketName/path/ReducerExecutable. Amazon EMR supports the special aggregate keyword. For
more information, go to the Aggregate library supplied by Hadoop.

Extra Args
Optionally, enter a list of arguments (space-separated strings) to pass to the Hadoop streaming utility.
For example, you can specify additional files to load into the distributed cache.

* Required parameter
5. In the CONFIGURE EC2 INSTANCES page, select the type and number of instances, using the
following table as a guide, and then click Continue.
Note
Twenty is the default maximum number of nodes per AWS account. For example, if you
have two job flows running, the total number of nodes running for both job flows must be
20 or less. If you need more than 20 nodes, you must submit a request to increase your
Amazon EC2 instance limit. For more information, go to the Request to Increase Amazon
EC2 Instance Limit Form.
Instance Count
Specify the number of nodes to use in the Hadoop cluster. There is always one master node in each
job flow. You can specify the number of core and task nodes.

Instance Type
Specify the Amazon EC2 instance types to use as master, core, and task nodes. Valid types are
m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge,
cc1.4xlarge, hs1.8xlarge, and cg1.4xlarge. The cc2.8xlarge instance type is only supported in the US
East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions. The cc1.4xlarge and hs1.8xlarge
instance types are only supported in the US East (Northern Virginia) Region.

Request Spot Instances
Specify whether to run master, core, or task nodes on Spot Instances. For more information, see Lower
Costs with Spot Instances (p. 141).
6. In the ADVANCED OPTIONS page, set additional configuration options, using the following table
as a guide, and then click Continue.
Amazon EC2 Key Pair
Optionally, specify a key pair that you created previously. For more information, see Create an Amazon
EC2 Key Pair and PEM File (p. 20). If you do not enter a value in this field, you cannot SSH into the
master node.

Amazon VPC Subnet Id
Optionally, specify a VPC subnet identifier to launch the job flow in an Amazon VPC. For more
information, see Running Job Flows on an Amazon VPC (p. 381).

Amazon S3 Log Path
Optionally, specify a path in Amazon S3 to store the Amazon EMR log files. The value must be in the
form BucketName/path. If you do not supply a location, Amazon EMR does not log any files.

Enable Debugging
Select Yes to store Amazon Elastic MapReduce-generated log files. You must enable debugging at
this level if you want to store the log files generated by Amazon EMR. If you select Yes, you must
supply an Amazon S3 bucket name where Amazon Elastic MapReduce can upload your log files. For
more information, see Troubleshooting (p. 183).
Important
You can enable debugging for a job flow only when you initially create the job flow.

Keep Alive
Select Yes to cause the job flow to continue running when all processing is completed.

Termination Protection
Select Yes to ensure the job flow is not shut down due to accident or error. For more information, see
Protect a Job Flow from Termination (p. 136).

Visible To All IAM Users
Select Yes to make the job flow visible and accessible to all IAM users on the AWS account. For more
information, see Configure User Permissions with IAM (p. 274).
7. In the BOOTSTRAP ACTIONS page, select Proceed with no Bootstrap Actions, and then click
Continue.
For more information about bootstrap actions, see Bootstrap Actions (p. 84).
8. In the REVIEW page, review the information, edit as necessary to correct any of the values, and
then click Create Job Flow when the information is correct.
After you click Create Job Flow, your request is processed; when it succeeds, a message appears.
9. Click Close.
The Amazon EMR console shows the new job flow starting. Starting a new job flow may take several
minutes, depending on the number and type of EC2 instances Amazon EMR is launching and
configuring. Click the Refresh button for the latest view of the job flow's progress.
CLI
This example describes how to use the CLI to create a streaming job flow. Replace myawsbucket with
the name of your Amazon S3 bucket.
To create a job flow
• Enter the command appropriate for your operating system:

If you are using Linux or UNIX:

$ ./elastic-mapreduce --create --stream \
    --input s3n://elasticmapreduce/samples/wordcount/input \
    --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
    --reducer aggregate \
    --output s3n://myawsbucket

If you are using Microsoft Windows:

c:\ruby elastic-mapreduce --create --stream --input s3n://elasticmapreduce/samples/wordcount/input --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --reducer aggregate --output s3n://myawsbucket
The output looks similar to the following.
Created jobflow JobFlowID
By default, this command launches a job flow to run on a single-node cluster using an Amazon EC2
m1.small instance. Later, when your steps are running correctly on a small set of sample data, you can
launch job flows to run on multiple nodes. You can specify the number of nodes and the type of instance
to run with the --num-instances and --instance-type parameters, respectively.
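For example, the following sketch launches the same word count job flow on four m1.large instances
(one master node and three core nodes); myawsbucket is a placeholder for your own bucket:

$ ./elastic-mapreduce --create --stream \
    --input s3n://elasticmapreduce/samples/wordcount/input \
    --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
    --reducer aggregate \
    --output s3n://myawsbucket \
    --num-instances 4 --instance-type m1.large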
API
This section describes the Amazon EMR API Query request parameters you need to create a streaming
job flow. The response includes a <JobFlowID>, which you use in other Amazon EMR operations, such
as when describing or terminating a job flow. For this reason, it is important to store job flow IDs.
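For example, with the CLI you can later use a stored job flow ID to inspect or shut down the job flow.
This sketch assumes the standard --describe and --terminate options of the Ruby CLI:

# show the full job flow description
$ ./elastic-mapreduce --describe --jobflow JobFlowID
# shut the job flow down
$ ./elastic-mapreduce --terminate --jobflow JobFlowID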
The Args argument contains location information for your input data, output data, mapper, and reducer,
as shown in the following example.
"Name": "streaming job flow",
"HadoopJarStep":
{
"Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
"Args":
[
"-input", "s3n://elasticmapreduce/samples/wordcount/input",
"-output", "s3n://myawsbucket",
"-mapper", "s3://elasticmapreduce/samples/wordcount/wordSplit
ter.py",
"-reducer", "aggregate"
]
}
Note
All paths are prefixed with their location. The prefix "s3://" refers to the s3n file system. If you
use HDFS, prepend the path with hdfs:///. Make sure to use three slashes (///), as in
hdfs:///home/hadoop/sampleInput2/.
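As an illustrative sketch, a streaming step added to a running job flow (whose earlier step wrote
intermediate data to HDFS) might reference its input as follows; this assumes your CLI version supports
adding steps with --jobflow, and the job flow ID and HDFS path are placeholders:

$ ./elastic-mapreduce --jobflow JobFlowID --stream \
    --input hdfs:///home/hadoop/sampleInput2/ \
    --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
    --reducer aggregate \
    --output s3n://myawsbucket/output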
How to Create a Job Flow Using Hive
This section covers the basics of creating a job flow using Hive in Amazon Elastic MapReduce (Amazon
EMR). You'll step through how to create a job flow using Hive with the Amazon EMR console, the
CLI, or the Query API. Before you create your job flow, you'll need to create objects and permissions;
for more information, see Setting Up Your Environment to Run a Job Flow (p. 17).
For advanced information on Hive configuration options, see Hive Configuration (p. 348).
A job flow using Hive enables you to create a data analysis application using a SQL-like language. The
example that follows is based on the Amazon EMR sample: Contextual Advertising using Apache Hive
and Amazon EMR with High Performance Computing instances. This sample describes how to correlate
customer click data to specific advertisements.
In this example, the Hive script is located in an Amazon S3 bucket at
s3n://elasticmapreduce/samples/hive-ads/libs/model-build. All of the data processing
instructions are located in the Hive script. The script requires additional libraries that are located in an
Amazon S3 bucket at s3n://elasticmapreduce/samples/hive-ads/libs.The input data is located
in the Amazon S3 bucket s3n://elasticmapreduce/samples/hive-ads/tables. The output is
saved to an Amazon S3 bucket you created as part of Setting Up Your Environment to Run a Job
Flow (p. 17).
Amazon EMR Console
This example describes how to use the Amazon EMR console to create a job flow using Hive.
To create a job flow using Hive
1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at
https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create New Job Flow.
3. In the DEFINE JOB FLOW page, do the following:
a. Enter a name in the Job Flow Name field.
We recommend that you use a descriptive name. It does not need to be unique.
b. Select which version of Hadoop to run on your cluster in Hadoop Version. You can choose to
run the Amazon distribution of Hadoop or one of two MapR distributions. For more information
about MapR distributions for Hadoop, see Launch a Job Flow on the MapR Distribution for
Hadoop (p. 260).
c. Select Run your own application.
d. Select Hive in the drop-down list.
e. Click Continue.
4. In the SPECIFY PARAMETERS page, specify whether you want to run the Hive job from a script or
interactively from the master node. If you are running Hive from a script, enter values in the boxes
using the following table as a guide. Click Continue.

Script Location*
Specify the URI where your script resides in Amazon S3. The value must be in the form
BucketName/path/ScriptName.

Input Location
Optionally, specify the URI where your input files reside in Amazon S3. The value must be in the form
BucketName/path/. If specified, this value is passed to the Hive script as a parameter named INPUT.

Output Location
Optionally, specify the URI where you want the output stored in Amazon S3. The value must be in the
form BucketName/path. If specified, this value is passed to the Hive script as a parameter named
OUTPUT.

Extra Args
Optionally, enter a list of arguments (space-separated strings) to pass to Hive.

* Required parameter
5. In the CONFIGURE EC2 INSTANCES page, select the type and number of instances, using the
following table as a guide, and then click Continue.
Note
Twenty is the default maximum number of nodes per AWS account. For example, if you
have two job flows running, the total number of nodes running for both job flows must be
20 or less. If you need more than 20 nodes, you must submit a request to increase your
Amazon EC2 instance limit. For more information, go to the Request to Increase Amazon
EC2 Instance Limit Form.
Instance Count
Specify the number of nodes to use in the Hadoop cluster. There is always one master node in each
job flow. You can specify the number of core and task nodes.

Instance Type
Specify the Amazon EC2 instance types to use as master, core, and task nodes. Valid types are
m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge,
cc1.4xlarge, hs1.8xlarge, and cg1.4xlarge. The cc2.8xlarge instance type is only supported in the US
East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions. The cc1.4xlarge and hs1.8xlarge
instance types are only supported in the US East (Northern Virginia) Region.

Request Spot Instances
Specify whether to run master, core, or task nodes on Spot Instances. For more information, see Lower
Costs with Spot Instances (p. 141).
6. In the ADVANCED OPTIONS page, set additional configuration options, using the following table
as a guide, and then click Continue.
Amazon EC2 Key Pair
Optionally, specify a key pair that you created previously. For more information, see Create an Amazon
EC2 Key Pair and PEM File (p. 20). If you do not enter a value in this field, you cannot SSH into the
master node.

Amazon VPC Subnet Id
Optionally, specify a VPC subnet identifier to launch the job flow in an Amazon VPC. For more
information, see Running Job Flows on an Amazon VPC (p. 381).

Amazon S3 Log Path
Optionally, specify a path in Amazon S3 to store the Amazon EMR log files. The value must be in the
form BucketName/path. If you do not supply a location, Amazon EMR does not log any files.

Enable Debugging
Select Yes to store Amazon Elastic MapReduce-generated log files. You must enable debugging at
this level if you want to store the log files generated by Amazon EMR. If you select Yes, you must
supply an Amazon S3 bucket name where Amazon Elastic MapReduce can upload your log files. For
more information, see Troubleshooting (p. 183).
Important
You can enable debugging for a job flow only when you initially create the job flow.

Keep Alive
Select Yes to cause the job flow to continue running when all processing is completed.

Termination Protection
Select Yes to ensure the job flow is not shut down due to accident or error. For more information, see
Protect a Job Flow from Termination (p. 136).

Visible To All IAM Users
Select Yes to make the job flow visible and accessible to all IAM users on the AWS account. For more
information, see Configure User Permissions with IAM (p. 274).
7. In the BOOTSTRAP ACTIONS page, select Proceed with no Bootstrap Actions, and then click
Continue.
For more information about bootstrap actions, see Bootstrap Actions (p. 84).
8. In the REVIEW page, review the information, edit as necessary to correct any of the values, and
then click Create Job Flow when the information is correct.
After you click Create Job Flow, your request is processed; when it succeeds, a message appears.
9. Click Close.
The Amazon EMR console shows the new job flow starting. Starting a new job flow may take several
minutes, depending on the number and type of EC2 instances Amazon EMR is launching and
configuring. Click the Refresh button for the latest view of the job flow's progress.
CLI
This example describes how to use the CLI to create a job flow using Hive.
To create a job flow using Hive
• Enter the command appropriate for your operating system:

If you are using Linux or UNIX:

$ ./elastic-mapreduce --create --name "Test Hive" \
    --hive-script s3n://elasticmapreduce/samples/hive-ads/libs/model-build.q \
    --args "-d","LIBS=s3n://elasticmapreduce/samples/hive-ads/libs","-d","INPUT=s3n://elasticmapreduce/samples/hive-ads/tables","-d","OUTPUT=s3n://myawsbucket/hive-ads/output/"

If you are using Microsoft Windows:

c:\ruby elastic-mapreduce --create --name "Test Hive" --hive-script s3n://elasticmapreduce/samples/hive-ads/libs/model-build.q --args "-d","LIBS=s3n://elasticmapreduce/samples/hive-ads/libs","-d","INPUT=s3n://elasticmapreduce/samples/hive-ads/tables","-d","OUTPUT=s3n://myawsbucket/hive-ads/output/"
The output looks similar to the following.
Created job flow JobFlowID
By default, this command launches a job flow to run on a two-node cluster using Amazon EC2 m1.small
instances. Later, when your steps are running correctly on a small set of sample data, you can launch job
flows to run on multiple nodes. You can specify the number of nodes and the type of instance to run with
the --num-instances and --instance-type parameters, respectively.
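For example, the following sketch launches the same Hive job flow on four m1.large instances;
myawsbucket is a placeholder for your own bucket:

$ ./elastic-mapreduce --create --name "Test Hive" \
    --num-instances 4 --instance-type m1.large \
    --hive-script s3n://elasticmapreduce/samples/hive-ads/libs/model-build.q \
    --args "-d","LIBS=s3n://elasticmapreduce/samples/hive-ads/libs","-d","INPUT=s3n://elasticmapreduce/samples/hive-ads/tables","-d","OUTPUT=s3n://myawsbucket/hive-ads/output/"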
API
This section describes the Amazon EMR API Query request parameters you need to create a job flow
using Hive. For an explanation of the parameters unique to RunJobFlow, go to RunJobFlow in the Amazon
Elastic MapReduce (Amazon EMR) API Reference. The response includes a <JobFlowID>, which you
use in other Amazon EMR operations, such as when describing or terminating a job flow. For this reason,
it is important to store job flow IDs.
The Args argument contains location information for your input data, output data, and LIBS, as shown
in the following example.
"Name": "Hive job flow",
"HadoopJarStep":
{
"Jar":"s3://us-west-1.elasticmapreduce/libs/script-runner/script-runner.jar",
"Args":[
"s3://us-west-1.elasticmapreduce/libs/hive/hive-script",
"--base-path",
"s3://us-west-1.elasticmapreduce/libs/hive/",
"--run-hive-script",
"--args",
"-f",
"s3n://elasticmapreduce/samples/hive-ads/libs/model-build.q",
"-d LIBS=s3n://elasticmapreduce/samples/hive-ads/libs"
]
}
Note
All paths are prefixed with their location. The prefix "s3://" refers to the s3n file system. If you
use HDFS, prepend the path with hdfs:///. Make sure to use three slashes (///), as in
hdfs:///home/hadoop/sampleInput2/.
How to Create a Job Flow Using Pig
This section covers the basics of creating a job flow using Pig in Amazon Elastic MapReduce (Amazon
EMR). You'll step through how to create a job flow using Pig with the Amazon EMR console, the
CLI, or the Query API. Before you create your job flow, you'll need to create objects and permissions;
for more information, see Setting Up Your Environment to Run a Job Flow (p. 17).
A job flow using Pig takes SQL-like commands written in Pig Latin, and converts those commands into
Hadoop MapReduce algorithms. The examples that follow are based on the Amazon EMR sample:
Apache Log Analysis using Pig. The sample evaluates Apache log files and then generates a report
containing the total bytes transferred, a list of the top 50 IP addresses, a list of the top 50 external referrers,
and the top 50 search terms using Bing and Google. The Pig script is located in the Amazon S3 bucket
s3n://elasticmapreduce/samples/pig-apache/do-reports2.pig. Input data is located in the
Amazon S3 bucket s3n://elasticmapreduce/samples/pig-apache/input. The output is saved
to an Amazon S3 bucket you created as part of Setting Up Your Environment to Run a Job Flow (p. 17).
Amazon EMR Console
This example describes how to use the Amazon EMR console to create a job flow using Pig.
To create a job flow using Pig
1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at
https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create New Job Flow.
3. In the DEFINE JOB FLOW page, enter the following:
a. Enter a name in the Job Flow Name field.
We recommend that you use a descriptive name. It does not need to be unique.
b. Select which version of Hadoop to run on your cluster in Hadoop Version. You can choose to
run the Amazon distribution of Hadoop or one of two MapR distributions. For more information
about MapR distributions for Hadoop, see Launch a Job Flow on the MapR Distribution for
Hadoop (p. 260).
c. Select Run your own application.
d. Select Pig Program in the drop-down list.
e. Click Continue.