This document discusses using Amazon Elastic MapReduce (EMR) for scalable data processing. EMR allows running Apache Hadoop on the scalable resources of Amazon EC2 and storing data in Amazon S3. It provides a simple web interface and command line tools to define MapReduce jobs that can process large amounts of data across many servers. Examples shown include using EMR to perform word counting on text data and single-nucleotide polymorphism analysis on genomic sequencing data stored in S3.
Data Processing with Amazon EMR
1. DATA PROCESSING
WITH AMAZON ELASTIC
MAPREDUCE,
AMAZON AWS USE CASES
Sergey Sverchkov
Project Manager
Altoros Systems
sergey.sverchkov@altoros.com
Skype: sergey.sverchkov
3. MAPREDUCE
• Simple data-parallel programming model designed for
scalability and fault-tolerance
• Pioneered by Google
– Processes 20 petabytes of data per day
• Popularized by open-source Hadoop project
– Used at Yahoo!, Facebook, Amazon, …
4. AMAZON EC2 SERVICE
• Elastic – increase or decrease capacity within
minutes, not hours or days
• Completely controlled
• Flexible – multiple instance types
(CPU, memory, storage), operating systems, and
software packages
• Reliable – 99.95% availability for each Amazon EC2
Region
• Secure – numerous mechanisms for securing your
compute resources
• Inexpensive – Reserved Instances and Spot Instances
• Easy to start
5. AMAZON S3 STORAGE
• Write, read, and delete objects ranging in size from 1 byte to
5 terabytes
• Objects are stored in a bucket
• Authentication mechanisms
• Options for secure data upload/download and encryption
of data at rest
• Designed to provide 99.999999999% durability and
99.99% availability of objects over a given year
• Reduced Redundancy Storage (RRS)
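The durability figure above can be made concrete with a quick expected-loss calculation. This is only a back-of-the-envelope sketch; the 10,000-objects framing mirrors the example AWS itself uses when describing eleven nines of durability:

```python
# Eleven nines of durability corresponds to an average annual
# loss probability of 1e-11 per object (assumption: losses are
# independent, which is the usual simplification).
annual_loss_prob = 1e-11

objects = 10_000
expected_losses_per_year = objects * annual_loss_prob

# On average, one object is lost every 1/expected_losses_per_year years.
years_per_loss = 1 / expected_losses_per_year

print(expected_losses_per_year)  # ~1e-07 objects per year
print(years_per_loss)            # ~10,000,000 years per lost object
```

In other words, a customer storing 10,000 objects can expect to lose a single object roughly once every ten million years, which is why durability and availability (99.99%) are quoted separately.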
6. AMAZON EMR FEATURES
• Web-based interface and command-line tools for running
Hadoop jobs on Amazon EC2
• Data stored in Amazon S3
• Monitors job and shuts down machines after use
• Small extra charge on top of EC2 pricing
• Significantly reduces the complexity of the time-
consuming set-up, management and tuning of Hadoop
clusters
7. GETTING STARTED – SIGN UP
• Sign up for Amazon EMR / AWS at
http://aws.amazon.com
• You must also be signed up for Amazon S3 and Amazon EC2
• Locate and save AWS credentials:
– AWS Access Key ID
– AWS Secret Access Key
– EC2 Key Pair
• Optionally install on desktop:
– EMR command line client
– S3 command-line client (e.g. s3cmd)
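The saved credentials end up in the EMR command-line client's credentials.json file. A sketch of that file is below; the field names follow the old Ruby elastic-mapreduce CLI and should be double-checked against the CLI's own README, and all values shown are placeholders:

```json
{
  "access_id": "YOUR_AWS_ACCESS_KEY_ID",
  "private_key": "YOUR_AWS_SECRET_ACCESS_KEY",
  "keypair": "your-ec2-key-pair-name",
  "key-pair-file": "/path/to/your-ec2-key-pair.pem",
  "log_uri": "s3n://your-bucket/emr-logs/",
  "region": "us-east-1"
}
```

With this file next to the elastic-mapreduce script, the client picks up your credentials automatically instead of requiring them on every command line.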
13. EMR WORD COUNT SAMPLE
• Start the word count sample job from EMR command line:
$ ./elastic-mapreduce --create --name "word count commandline test" \
    --stream \
    --input s3n://elasticmapreduce/samples/wordcount/input \
    --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
    --reducer aggregate \
    --output s3n://test.emr.bucket/wordcount/output2
• Output contains job number:
Created job flow j-317IN1TUMRQ5B
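The mechanics of this streaming job can be simulated locally. Per the editor's notes, wordSplitter.py emits one line per word with a LongValueSum: prefix, and the built-in aggregate reducer sums values by key. The sketch below is a minimal local imitation of that flow, not the actual wordSplitter.py or EMR code:

```python
from collections import defaultdict

def mapper(line):
    """Mimic wordSplitter.py: emit 'LongValueSum:<word>\t1' per word."""
    for word in line.lower().split():
        yield f"LongValueSum:{word}\t1"

def aggregate_reduce(mapped_lines):
    """Mimic the built-in 'aggregate' reducer: sum counts per key.

    The LongValueSum: prefix tells the aggregate reducer to treat
    the values as longs and add them up.
    """
    counts = defaultdict(int)
    for line in mapped_lines:
        key, value = line.split("\t")
        word = key.split(":", 1)[1]  # strip the aggregator prefix
        counts[word] += int(value)
    return dict(counts)

text = ["the quick brown fox", "the lazy dog"]
mapped = [out for line in text for out in mapper(line)]
print(aggregate_reduce(mapped))  # 'the' appears twice, the rest once
```

On EMR the same logic runs distributed: mappers on many instances emit these lines, Hadoop shuffles them by key, and the aggregate reducer produces the per-word totals written to the output bucket.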
14. WORD COUNT – OUTPUT DATA
• Locate and download output data in the specified output S3 bucket:
15. REAL-WORLD EXAMPLE - GENOTYPING
• Crossbow is a scalable, portable, and automatic Cloud
Computing tool for finding SNPs from short read data.
• Crossbow is designed to be easy to run (a) in "the cloud"
(using Amazon's Elastic MapReduce service), (b) on any
Hadoop cluster, or (c) on any single computer, without
Hadoop.
• Open source and available to anyone:
http://bowtie-bio.sourceforge.net/crossbow/
16. SINGLE-NUCLEOTIDE POLYMORPHISM
• A single-nucleotide
polymorphism
(SNP, pronounced snip) is a
DNA sequence variation
occurring when a single
nucleotide — A, T, C or G
— in the genome (or other
shared sequence) differs
between members of a
biological species or paired
chromosomes in an
individual.
17. SNP ANALYSIS IN AMAZON EMR
• Crossbow web interface
http://bowtie-bio.sourceforge.net/crossbow/ui.html
18. SNP ANALYSIS – DATA IN AMAZON S3
• Data for SNP analysis is uploaded to Amazon S3 bucket
• Output of analysis is placed in S3
19. SNP ANALYSIS – INPUT / OUTPUT DATA
• Input data – single file ~ 1.4GB
@E201_120801:4:1:1208:14983#ACAGTG/1
GAAGGAATAATGAGACCTNACGTTTCTGNNCNNNNNNNNNNNNNNNNNNN
+E201_120801:4:1:1208:14983#ACAGTG/1
gggfggdgfgdgg_e^^^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@E201_120801:4:1:1208:6966#ACAGTG/1
GCTGGGATTACAGACACANGCCACCACANNTNNNNNNNNNNNNNNNNNNN
+E201_120801:4:1:1208:6966#ACAGTG/1
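The input sample above is in FASTQ format: each read spans four lines (an @-prefixed name, the sequence, a +-prefixed repeat of the name, and a quality string). A minimal parser sketch follows; it implements the general FASTQ convention, not Crossbow's own preprocessing code:

```python
def parse_fastq(lines):
    """Group FASTQ lines into (name, sequence, quality) records."""
    lines = [l.strip() for l in lines if l.strip()]
    records = []
    for i in range(0, len(lines), 4):
        name, seq, plus, qual = lines[i:i + 4]
        # Sanity-check the record markers before accepting the read.
        assert name.startswith("@") and plus.startswith("+")
        records.append((name[1:], seq, qual))
    return records

# Truncated read from the sample input above, for illustration.
sample = [
    "@E201_120801:4:1:1208:14983#ACAGTG/1",
    "GAAGGAATAATGAGACCTNACGTTTCTGNN",
    "+E201_120801:4:1:1208:14983#ACAGTG/1",
    "gggfggdgfgdgg_e^^^BBBBBBBBBBBB",
]
records = parse_fastq(sample)
print(records[0][0])  # read name without the leading '@'
```

Note the N characters in the sequence (unresolved bases) and the per-base quality string of equal length; Crossbow's aligner uses both when calling SNPs.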
• Output data – multiple files
chr1 841900  G A 3  A 68 2 2 G 0  0 0 2 0 1.00000 1.00000 1
chr1 922615  T G 2  G 38 3 3 T 67 1 1 4 0 1.00000 1.00000 0
chr1 1011278 A G 12 G 69 1 1 A 0  0 0 1 0 1.00000 1.00000 1
20. SNP ANALYSIS - TIME
• To process 1.4GB on 1 EMR instance – 6 hours
• To process 1.4GB on 2 EMR instances – 4 hours
• To process 1.4GB on 4 EMR instances – 2.5 hours
• Haven’t tried more instances…
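The timings above imply diminishing returns from adding instances: two instances give a 1.5x speedup over one, and four give 2.4x, i.e. 60% parallel efficiency. A quick check of the slide's numbers:

```python
# Instance count -> wall-clock hours, taken from the slide above.
timings = {1: 6.0, 2: 4.0, 4: 2.5}

base = timings[1]
for n, hours in sorted(timings.items()):
    speedup = base / hours
    efficiency = speedup / n
    print(f"{n} instance(s): {speedup:.2f}x speedup, "
          f"{efficiency:.0%} parallel efficiency")
```

The drop-off is expected for this workload: parts of the Crossbow pipeline (input preprocessing, shuffle, final SNP merge) do not parallelize perfectly, so efficiency falls as the cluster grows.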
21. AND MORE CASES FOR AMAZON AWS
Customer 1 – successful migration from dedicated hosting to
Amazon:
• 1 EC2 xlarge Linux instance (15 GB RAM, 4 cores, 64-bit) with 4
EBS volumes of 250 GB in the US West (Northern California) region
• Runs 1 heavy web site with >1K concurrent users
• Tomcat app server and Oracle SE 11.2
• Amazon Elastic IP for the web site
• Continuous Oracle backup to Amazon S3 through Oracle
Secure Backup for S3
• And it costs the customer only …wow
• <2 days for LIVE migration over a weekend
22. AND MORE CASES FOR AMAZON AWS
Customer 2 – successful migration from Rackspace to
Amazon:
• Rackspace hosting + service cost $..K, with a very low service
level; the Rackspace server was a fixed configuration
• Migrated to 1 Amazon 2xlarge EC2 Windows 2008 R2 instance
(34.2 GB RAM, 4 virtual cores); >100 web sites for corporate
customers; 2 EBS volumes of 1.5 TB
• Amazon Oracle RDS as the backend – a fully automated Oracle
database with expandable storage
• 200 GB of user data in RDS
• Full LIVE migration completed in 48 hours with a DNS
name switch
• And the budget is significantly lower!
23. THANK YOU
QUESTIONS AND DISCUSSION ARE WELCOME…
Sergey Sverchkov
sergey.sverchkov@altoros.com
skype: sergey.sverchkov
Editor's Notes
Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). Using Amazon Elastic MapReduce, you can instantly provision as much or as little capacity as you like to perform data-intensive tasks for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research. Amazon Elastic MapReduce lets you focus on crunching or analyzing your data without having to worry about time-consuming set-up, management, or tuning of Hadoop clusters or the compute capacity upon which they sit.

Amazon Elastic MapReduce is ideal for problems that necessitate the fast and efficient processing of large amounts of data. The web service interfaces allow you to build processing workflows and programmatically monitor the progress of running job flows. In addition, you can use the simple web interface of the AWS Management Console to launch your job flows and monitor processing-intensive computation on clusters of Amazon EC2 instances.

Q: Who can use Amazon Elastic MapReduce?
A: Anyone who requires simple access to powerful data analysis can use Amazon Elastic MapReduce. Customers don't need any software development experience to experiment with several sample applications available in the Developer Guide and in our Resource Center.
To sign up for Amazon Elastic MapReduce, click the “Sign up for This Web Service” button on the Amazon Elastic MapReduce detail page http://aws.amazon.com/elasticmapreduce. You must be signed up for Amazon EC2 and Amazon S3 to access Amazon Elastic MapReduce; if you are not already signed up for these services, you will be prompted to do so during the Amazon Elastic MapReduce sign-up process. After signing up, please refer to the Amazon Elastic MapReduce documentation, which includes our Getting Started Guide – the best place to get going with the service.
If you already have an AWS account, skip to the next procedure. If you don't already have an AWS account, use the following procedure to create one.

Note: When you create an account, AWS automatically signs up the account for all services. You are charged only for the services you use.

To create an AWS account:
1. Go to http://aws.amazon.com, and then click Sign Up Now.
2. Follow the on-screen instructions. Part of the sign-up procedure involves receiving a phone call and entering a PIN using the phone keypad.

Install the Amazon EMR Command Line Interface. Topics: Installing Ruby; Installing the Command Line Interface; Configuring Credentials; SSH Setup and Configuration.

AWS Security Credentials: AWS uses security credentials to help protect your data. This section shows you how to view your security credentials so you can add them to your credentials.json file. AWS assigns you an Access Key ID and a Secret Access Key. You include your Access Key ID in all AWS service requests to identify yourself as the sender of the request.

Note: Your Secret Access Key is a shared secret between you and AWS. Keep this ID secret; we use it to bill you for the AWS services you use. Never include the ID in your requests to AWS, and never email the ID to anyone, even if an inquiry appears to originate from AWS or Amazon.com. No one who legitimately represents Amazon will ever ask you for your Secret Access Key.

To locate your AWS Access Key ID and AWS Secret Access Key:
1. Go to the AWS web site at http://aws.amazon.com.
2. Click My Account to display a list of options.
3. Click Security Credentials and log in to your AWS account. Your Access Key ID is displayed in the Access Credentials section. Your Secret Access Key remains hidden as a further precaution. To display your Secret Access Key, click Show in the Your Secret Access Key area, as shown in the following figure.
1. Upload your data and your processing application into Amazon S3. Amazon S3 provides reliable, scalable, easy-to-use storage for your input and output data.
2. Log in to the AWS Management Console to start an Amazon Elastic MapReduce "job flow." Simply choose the number and type of Amazon EC2 instances you want, specify the location of your data and/or application on Amazon S3, and then click the "Create Job Flow" button. Alternatively, you can start a job flow by specifying the same information mentioned above via our Command Line Tools or APIs. For more sophisticated workloads you can choose to install additional software or alter the configuration of your Amazon EC2 instances using Bootstrap Actions.
3. Monitor the progress of your job flow(s) directly from the AWS Management Console, Command Line Tools, or APIs. And, after the job flow is done, retrieve the output from Amazon S3. You can optionally track progress and identify issues in steps, jobs, tasks, or task attempts of your job flows directly from the job flow debug window in the AWS Management Console. Amazon Elastic MapReduce uses Amazon SimpleDB to store job flow state information.
4. Pay only for the resources that you actually consume. Amazon Elastic MapReduce monitors your job flow and, unless you specify otherwise, shuts down your Amazon EC2 instances after the job completes.
The sample input for this job flow is available at s3://elasticmapreduce/samples/wordcount/input. This example uses the built-in reducer called aggregate. This reducer adds up the counts of words being output by the wordSplitter mapper function. It knows to use the Long data type from the prefix on the words.

To run a streaming job flow, enter the following command from the command-line prompt (Linux and UNIX users):

$ ./elastic-mapreduce --create --stream \
    --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
    --input s3://elasticmapreduce/samples/wordcount/input \
    --output [A path to a bucket you own on Amazon S3, such as s3n://myawsbucket] \
    --reducer aggregate
Word count input data size in the sample S3 bucket:

[root@ip-10-166-230-67 ~]# /elimbio/s3cmd-1.0.1/s3cmd du s3://elasticmapreduce/samples/wordcount/input/
19105856 s3://elasticmapreduce/samples/wordcount/input/

Word count input data files:

[root@ip-10-166-230-67 ~]# /elimbio/s3cmd-1.0.1/s3cmd ls s3://elasticmapreduce/samples/wordcount/input/
2009-04-02 02:55   2392524   s3://elasticmapreduce/samples/wordcount/input/0001
2009-04-02 02:55   2396618   s3://elasticmapreduce/samples/wordcount/input/0002
2009-04-02 02:55   1593915   s3://elasticmapreduce/samples/wordcount/input/0003
2009-04-02 02:55   1720885   s3://elasticmapreduce/samples/wordcount/input/0004
2009-04-02 02:55   2216895   s3://elasticmapreduce/samples/wordcount/input/0005
2009-04-02 02:55   1906322   s3://elasticmapreduce/samples/wordcount/input/0006
2009-04-02 02:55   1930660   s3://elasticmapreduce/samples/wordcount/input/0007
2009-04-02 02:55   1913444   s3://elasticmapreduce/samples/wordcount/input/0008
2009-04-02 02:55   2707527   s3://elasticmapreduce/samples/wordcount/input/0009
2009-04-02 02:55    327050   s3://elasticmapreduce/samples/wordcount/input/0010
2009-04-02 02:55         8   s3://elasticmapreduce/samples/wordcount/input/0011
2009-04-02 02:55         8   s3://elasticmapreduce/samples/wordcount/input/0012
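The individual file sizes in the listing add up exactly to the `s3cmd du` total, a handy sanity check when mirroring sample data to your own bucket:

```python
# File sizes as reported by 's3cmd ls' on the sample input bucket.
sizes = [2392524, 2396618, 1593915, 1720885, 2216895, 1906322,
         1930660, 1913444, 2707527, 327050, 8, 8]

total = sum(sizes)
print(total)  # 19105856 bytes, matching the 's3cmd du' output
```

The roughly 19 MB of input is tiny by EMR standards, which is why the sample runs comfortably on a single m1.small instance.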
Q: What is Amazon Elastic MapReduce Bootstrap Actions?
A: Bootstrap Actions is a feature in Amazon Elastic MapReduce that provides users a way to run custom set-up prior to the execution of their job flow. Bootstrap Actions can be used to install software or configure instances before running your job flow.

Q: How can I use Bootstrap Actions?
A: You can write a Bootstrap Action script in any language already installed on the job flow instance, including Bash, Perl, Python, Ruby, C++, or Java. There are several pre-defined Bootstrap Actions available. Once the script is written, you need to upload it to Amazon S3 and reference its location when you start a job flow. Please refer to the Developer's Guide (http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/) for details on how to use Bootstrap Actions.

Q: How do I configure Hadoop settings for my job flow?
A: The Elastic MapReduce default Hadoop configuration is appropriate for most workloads. However, based on your job flow's specific memory and processing requirements, it may be appropriate to tune these settings. For example, if your job flow tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. For this situation, a pre-defined Bootstrap Action is available to configure your job flow on startup. See the Configure Memory Intensive Bootstrap Action in the Developer's Guide for configuration details and usage instructions. An additional pre-defined Bootstrap Action is available that allows you to customize your cluster settings to any value of your choice. See the Configure Hadoop Bootstrap Action in the Developer's Guide for usage instructions.

Q: Can I modify the number of slave nodes in a running job flow?
A: Yes. Slave nodes can be of two types: (1) core nodes, which both host persistent data using the Hadoop Distributed File System (HDFS) and run Hadoop tasks, and (2) task nodes, which only run Hadoop tasks.
While a job flow is running you may increase the number of core nodes, and you may either increase or decrease the number of task nodes. This can be done through the API, Java SDK, or through the command line client. Please refer to the Resizing Running Job Flows section in the Developer's Guide for details on how to modify the size of your running job flow.

Q: When would I want to use core nodes versus task nodes?
A: As core nodes host persistent data in HDFS and cannot be removed, core nodes should be reserved for the capacity that is required until your job flow completes. As task nodes can be added or removed and do not contain HDFS, they are ideal for capacity that is only needed on a temporary basis.
The output looks similar to the following:

Created job flow j-317IN1TUMRQ5B

By default, this command launches a job flow to run on a single-node cluster using an Amazon EC2 m1.small instance. Later, when your steps are running correctly on a small set of sample data, you can launch job flows to run on multiple nodes. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters, respectively.
Your job flow results are stored in a text file. The results file contains a list of all words found with the number of times the word occurred in the data set. Excerpt from the output data:

abandon 3
abandoned 46
abandoning 3
abandonment 6
Crossbow is a scalable, portable, and automatic Cloud Computing tool for finding SNPs from short read data. Crossbow employs Bowtie and a modified version of SOAPsnp to perform the short read alignment and SNP calling, respectively. Crossbow is designed to be easy to run (a) in "the cloud" (in this case, Amazon's Elastic MapReduce service), (b) on any Hadoop cluster, or (c) on any single computer, without Hadoop. Crossbow exploits the availability of multiple computers and processors where possible.
A single-nucleotide polymorphism (SNP, pronounced snip) is a DNA sequence variation occurring when a single nucleotide — A, T, C or G — in the genome (or other shared sequence) differs between members of a biological species or paired chromosomes in an individual. For example, two sequenced DNA fragments from different individuals, AAGCCTA and AAGCTTA, contain a difference in a single nucleotide. In this case we say that there are two alleles: C and T. Almost all common SNPs have only two alleles.

The genomic distribution of SNPs is not homogeneous; SNPs usually occur in non-coding regions more frequently than in coding regions or, in general, where natural selection is acting and fixating the allele of the SNP that constitutes the most favorable genetic adaptation.[1] Besides natural selection, other factors like recombination and mutation rate can also determine SNP density. SNP density can be predicted by the presence of microsatellites, as regions of thousands of nucleotides flanking microsatellites have an increased or decreased density of SNPs depending on the microsatellite sequence.[2]
Before running Crossbow on EMR, you must have an AWS account with the appropriate features enabled. You may also need to install Amazon's elastic-mapreduce tool. In addition, you may want to install an S3 tool, though most users can simply use Amazon's web interface for S3, which requires no installation. If you plan to run Crossbow exclusively on a single computer or on a Hadoop cluster, you can skip this section.

1. Create an AWS account by navigating to the AWS page. Click "Sign Up Now" in the upper right-hand corner and follow the instructions. You will be asked to accept the AWS Customer Agreement.
2. Sign up for EC2 and S3. Navigate to the Amazon EC2 page, click on "Sign Up For Amazon EC2" and follow the instructions. This step requires you to enter credit card information. Once this is complete, your AWS account will be permitted to use EC2 and S3, which are required.
3. Sign up for EMR. Navigate to the Elastic MapReduce page, click on "Sign up for Elastic MapReduce" and follow the instructions. Once this is complete, your AWS account will be permitted to use EMR, which is required.
4. Sign up for SimpleDB. With SimpleDB enabled, you have the option of using the AWS Console's Job Flow Debugging feature. This is a convenient way to monitor your job's progress and diagnose errors.
5. Optional: Request an increase to your instance limit. By default, Amazon allows you to allocate EC2 clusters with up to 20 instances (virtual computers). To be permitted to work with more instances, fill in the form on the Request to Increase page. You may have to speak to an Amazon representative and/or wait several business days before your request is granted.

To see a list of AWS services you've already signed up for, see your Account Activity page. If "Amazon Elastic Compute Cloud", "Amazon Simple Storage Service", "Amazon Elastic MapReduce" and "Amazon SimpleDB" all appear there, you are ready to proceed.

To run:

1. If the input reads have not yet been preprocessed by Crossbow (i.e. the input is FASTQ or .sra), then first (a) prepare a manifest file with URLs pointing to the read files, and (b) upload it to an S3 bucket that you own. See your S3 tool's documentation for how to create a bucket and upload a file to it. The URL for the manifest file will be the input URL for your EMR job. If the input reads have already been preprocessed by Crossbow, make a note of the S3 URL where they're located. This will be the input URL for your EMR job.
2. If you are using a pre-built reference jar, make a note of its S3 URL. This will be the reference URL for your EMR job. See the Crossbow website for a list of pre-built reference jars and their URLs. If you are not using a pre-built reference jar, you may need to build the reference jars and/or upload them to an S3 bucket you own. See your S3 tool's documentation for how to create a bucket and upload to it. The URL for the main reference jar will be the reference URL for your EMR job.
3. In a web browser, go to the Crossbow web interface.
4. Fill in the form according to your job's parameters. We recommend filling in and validating the "AWS ID" and "AWS Secret Key" fields first. Also, when entering S3 URLs (e.g. "Input URL" and "Output URL"), we recommend that users validate the entered URLs by clicking the link below each one. This avoids failed jobs due to simple URL issues (e.g. non-existence of the "Input URL"). For examples of how to fill in this form, see the E. coli EMR and Mouse chromosome 17 EMR examples.

Be sure to make a note of the various numbers and names associated with your accounts, especially your Access Key ID, Secret Access Key, and your EC2 key pair name. You will have to refer to these and other account details in the future.