MAPREDUCE
• Simple data-parallel programming model designed for scalability and fault tolerance
• Pioneered by Google – processes 20 petabytes of data per day
• Popularized by the open-source Hadoop project – used at Yahoo!, Facebook, Amazon, …
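The model itself is small: a map function turns input records into key/value pairs, and a reduce function combines all values that share a key. A purely local sketch in plain Python (no Hadoop; the word-count example and function names are illustrative):

```python
from collections import defaultdict

def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word of every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce step: sum the counts for each distinct key.
    A real framework shuffles/groups pairs by key across machines first."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

if __name__ == "__main__":
    print(reduce_phase(map_phase(["to be or not to be"])))
    # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The scalability and fault tolerance come from the framework, not the user code: the same two functions run unchanged on one machine or on thousands.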
AMAZON EC2 SERVICE
• Elastic – increase or decrease capacity within minutes, not hours or days
• Completely controlled – you have full control of your instances
• Flexible – multiple instance types (CPU, memory, storage), operating systems, and software packages
• Reliable – 99.95% availability for each Amazon EC2 Region
• Secure – numerous mechanisms for securing your compute resources
• Inexpensive – Reserved Instances and Spot Instances
• Easy to start
AMAZON S3 STORAGE
• Write, read, and delete objects containing from 1 byte to 5 terabytes of data
• Objects are stored in buckets
• Authentication mechanisms
• Options for secure data upload/download and encryption of data at rest
• Designed to provide 99.999999999% durability and 99.99% availability of objects over a given year
• Reduced Redundancy Storage (RRS)
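The eleven-nines durability figure becomes concrete with a quick expected-loss calculation (a back-of-the-envelope sketch; the object count is an invented example, not an AWS figure):

```python
# S3 is designed for 99.999999999% ("eleven nines") annual durability per object.
durability = 0.99999999999
p_loss = 1.0 - durability              # ~1e-11 chance of losing a given object per year

objects = 10_000_000                   # hypothetical: ten million stored objects
expected_losses_per_year = objects * p_loss
print(expected_losses_per_year)        # ~1e-4: about one lost object every 10,000 years
```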
AMAZON EMR FEATURES
• Web-based interface and command-line tools for running Hadoop jobs on Amazon EC2
• Data stored in Amazon S3
• Monitors the job and shuts down machines after use
• Small extra charge on top of EC2 pricing
• Significantly reduces the complexity of the time-consuming set-up, management and tuning of Hadoop clusters
GETTING STARTED – SIGN UP
• Sign up for Amazon EMR / AWS at http://aws.amazon.com
• You also need to be signed up for Amazon S3 and Amazon EC2
• Locate and save your AWS credentials:
  – AWS Access Key ID
  – AWS Secret Access Key
  – EC2 Key Pair
• Optionally install on your desktop:
  – EMR command-line client
  – S3 command-line client
GETTING STARTED – SECURITY, TOOLS
EMR JOB FLOW – BASIC STEPS
1. Upload input data to S3
2. Create a job flow by defining the Map and Reduce steps
3. Download output data from S3
EMR WORD COUNT SAMPLE
WORD COUNT – INPUT DATA
• Word count input data size in the sample S3 bucket:
$ ./s3cmd du s3://elasticmapreduce/samples/wordcount/input/
19105856 s3://elasticmapreduce/samples/wordcount/input/
• Word count input data files:
$ ./s3cmd ls s3://elasticmapreduce/samples/wordcount/input/
2009-04-02 02:55 2392524 s3://elasticmapreduce/samples/wordcount/input/0001
2009-04-02 02:55 2396618 s3://elasticmapreduce/samples/wordcount/input/0002
2009-04-02 02:55 1593915 s3://elasticmapreduce/samples/wordcount/input/0003
2009-04-02 02:55 1720885 s3://elasticmapreduce/samples/wordcount/input/0004
2009-04-02 02:55 2216895 s3://elasticmapreduce/samples/wordcount/input/0005
EMR WORD COUNT SAMPLE
• Starting instances, bootstrapping, running job steps:
EMR WORD COUNT SAMPLE
• Start the word count sample job from the EMR command line:
$ ./elastic-mapreduce --create --name "word count commandline test" --stream \
    --input s3n://elasticmapreduce/samples/wordcount/input \
    --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
    --reducer aggregate \
    --output s3n://test.emr.bucket/wordcount/output2
• The output contains the job flow ID:
Created job flow j-317IN1TUMRQ5B
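The --mapper script above, wordSplitter.py, is a Hadoop Streaming mapper: it reads input lines from stdin and writes key/value records to stdout. EMR's built-in aggregate reducer consumes keys prefixed with an aggregator name such as LongValueSum:, which sums the values per key. A minimal sketch of such a mapper (a reconstruction for illustration, not the exact sample code):

```python
#!/usr/bin/env python
import sys

def split_words(lines):
    """Produce one 'LongValueSum:<word>\t1' record per word, the format
    consumed by EMR's built-in 'aggregate' reducer (it sums values per key)."""
    records = []
    for line in lines:
        for word in line.split():
            records.append("LongValueSum:%s\t1" % word.lower())
    return records

if __name__ == "__main__":
    # Hadoop Streaming feeds the input split to the mapper on stdin.
    for record in split_words(sys.stdin):
        print(record)
```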
WORD COUNT – OUTPUT DATA
• Locate and download the output data in the specified output S3 bucket:
REAL-WORLD EXAMPLE – GENOTYPING
• Crossbow is a scalable, portable, and automatic cloud computing tool for finding SNPs from short-read data
• Crossbow is designed to be easy to run (a) in "the cloud" on Amazon's Elastic MapReduce service, (b) on any Hadoop cluster, or (c) on any single computer without Hadoop
• Open source and available to anyone: http://bowtie-bio.sourceforge.net/crossbow/
SINGLE-NUCLEOTIDE POLYMORPHISM
• A single-nucleotide polymorphism (SNP, pronounced "snip") is a DNA sequence variation occurring when a single nucleotide (A, T, C or G) in the genome (or other shared sequence) differs between members of a biological species or between paired chromosomes in an individual.
SNP ANALYSIS IN AMAZON EMR
• Crossbow web interface: http://bowtie-bio.sourceforge.net/crossbow/ui.html
SNP ANALYSIS – DATA IN AMAZON S3
• Data for SNP analysis is uploaded to an Amazon S3 bucket
• Output of the analysis is placed in S3
SNP ANALYSIS – INPUT / OUTPUT DATA
• Input data – a single file, ~1.4 GB:
@E201_120801:4:1:1208:14983#ACAGTG/1
GAAGGAATAATGAGACCTNACGTTTCTGNNCNNNNNNNNNNNNNNNNNNN
+E201_120801:4:1:1208:14983#ACAGTG/1
gggfggdgfgdgg_e^^^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@E201_120801:4:1:1208:6966#ACAGTG/1
GCTGGGATTACAGACACANGCCACCACANNTNNNNNNNNNNNNNNNNNNN
+E201_120801:4:1:1208:6966#ACAGTG/1
• Output data – multiple files:
chr1 841900 G A 3 A 68 2 2 G 0 0 0 2 0 1.00000 1.00000 1
chr1 922615 T G 2 G 38 3 3 T 67 1 1 4 0 1.00000 1.00000 0
chr1 1011278 A G 12 G 69 1 1 A 0 0 0 1 0 1.00000 1.00000 1
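The input above is in FASTQ format: every read occupies four lines (an @-prefixed read ID, the bases, a '+' separator, and one quality character per base). A small illustrative parser for such records (a sketch for clarity, not part of Crossbow):

```python
def parse_fastq(lines):
    """Group a FASTQ stream into (read_id, sequence, qualities) tuples.
    Assumes well-formed input: exactly four lines per record."""
    it = iter(lines)
    records = []
    for header in it:
        seq = next(it).strip()
        next(it)                              # '+' separator line, ignored
        qual = next(it).strip()
        records.append((header.strip().lstrip("@"), seq, qual))
    return records

if __name__ == "__main__":
    sample = [
        "@E201_120801:4:1:1208:14983#ACAGTG/1",
        "GAAGGAATAATGAGACCT",
        "+E201_120801:4:1:1208:14983#ACAGTG/1",
        "gggfggdgfgdgg_e^^^",
    ]
    for read_id, seq, qual in parse_fastq(sample):
        print(read_id, len(seq), len(qual))
```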
SNP ANALYSIS – TIME
• To process 1.4 GB on 1 EMR instance – 6 hours
• To process 1.4 GB on 2 EMR instances – 4 hours
• To process 1.4 GB on 4 EMR instances – 2.5 hours
• Haven't tried more instances…
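These timings translate directly into speedup and parallel efficiency; the sub-linear scaling (each doubling of instances helps, but less than linearly) is typical once fixed overheads such as instance startup and data transfer stop shrinking. A quick calculation from the numbers above:

```python
# Wall-clock hours to process the same 1.4 GB input on n EMR instances.
times = {1: 6.0, 2: 4.0, 4: 2.5}
baseline = times[1]

for n, hours in sorted(times.items()):
    speedup = baseline / hours           # 1.0x, 1.5x, 2.4x
    efficiency = speedup / n             # 100%, 75%, 60% of ideal linear scaling
    print(f"{n} instance(s): {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```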
AND MORE CASES FOR AMAZON AWS
Customer 1 – successful migration from dedicated hosting to Amazon:
• 1 EC2 xlarge Linux instance (15 GB RAM, 4 cores, 64-bit) with 4 EBS volumes of 250 GB in the US West (Northern California) region
• Runs one heavy web site with > 1K concurrent users
• Tomcat application server and Oracle SE 11.2
• Amazon Elastic IP for the web site
• Continuous Oracle backup to Amazon S3 through Oracle Secure Backup for S3
• And it costs the customer only …wow
• < 2 days for a LIVE migration over a weekend
AND MORE CASES FOR AMAZON AWS
Customer 2 – successful migration from Rackspace to Amazon:
• Rackspace hosting + service cost $..K, the service level was very low, and the Rackspace server was fixed in capacity
• Migrated to 1 Amazon 2xlarge (34.2 GB RAM, 4 virtual cores) EC2 Windows 2008 R2 instance
• > 100 web sites for corporate customers
• 2 EBS volumes, 1.5 TB
• Amazon Oracle RDS as the backend – a fully automated Oracle database with expandable storage; 200 GB of user data in RDS
• Full LIVE migration completed in 48 hours with a DNS name switch
• And the budget is significantly lower!!
THANK YOU – WELCOME FOR DISCUSSION…
Sergey Sverchkov
email@example.com
skype: sergey.sverchkov