High Performance Computing (HPC) in the Cloud
Overview
1. Introduction to HPC
2. Hadoop (HDFS, MapReduce)
3. AWS toolkit (Amazon S3, Amazon EMR, Amazon Redshift)
4. Case study
Why?
Bottlenecks in Genome Analysis
Large data files from sequencers.
Computational bottlenecks.
Long processing times.
Data persistence and reliability.
Data security.
How?
Introduction
“High Performance Computing (HPC) most generally refers to the practice of
aggregating computing power in a way that delivers much higher performance
than one could get out of a typical desktop computer or workstation in order to
solve large problems in science, engineering, or business.”
Forms of HPC
Dedicated supercomputer.
Commodity HPC cluster.
Grid computing.
HPC in the cloud.
What?
Hadoop
Open-source, Java-based framework for reliable, scalable, distributed
computing.
Created by Doug Cutting and Mike Cafarella at Yahoo! (2006-08), inspired by
Google's GFS paper (2003).
Key Components
Hadoop Distributed File System (HDFS)
MapReduce
Hadoop
HDFS (Hadoop Distributed File System)
Data management layer
Master-slave architecture
Fault tolerant (blocks replicated across DataNodes)
Key Components:
NameNode
DataNodes
Secondary NameNode
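Day-to-day interaction with HDFS goes through a POSIX-like file-system shell;
a quick sketch (the paths below are illustrative):
hdfs dfs -mkdir -p /user/demo/reads            # create a directory in HDFS
hdfs dfs -put sample.fastq /user/demo/reads/   # copy a local file into the cluster
hdfs dfs -ls /user/demo/reads                  # list directory contents
hdfs fsck /user/demo/reads/sample.fastq -files -blocks   # inspect blocks and replication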
Hadoop - Continued
MapReduce
Processing layer: Mappers and Reducers
Batch-oriented
Key Components (classic MRv1):
JobTracker
TaskTracker
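MapReduce jobs need not be written in Java: with Hadoop Streaming, any
executable can act as mapper or reducer. A minimal word-count sketch, assuming
a standard Hadoop install (the jar path and HDFS paths are illustrative):
cat > wc_map.sh <<'EOF'
tr -s '[:space:]' '\n'    # mapper: emit one word per line (the word is the key)
EOF
cat > wc_red.sh <<'EOF'
uniq -c                   # reducer: keys arrive sorted, so uniq -c totals each word
EOF
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -files wc_map.sh,wc_red.sh \
  -input /user/demo/books \
  -output /user/demo/wordcount \
  -mapper "bash wc_map.sh" \
  -reducer "bash wc_red.sh"
The shuffle between the two phases guarantees that every occurrence of a word
reaches the same reducer, which is what makes the per-word totals correct.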
Hadoop - Architecture
AWS Toolkit - Amazon Elastic MapReduce (EMR)
Managed Hadoop framework.
Also runs other popular distributed frameworks such as Apache Spark, HBase,
Presto, and Flink.
Elastic.
Flexible data storage (S3, HDFS, Redshift, Glacier, RDS).
Secure and reliable.
Full control and root access.
AWS Toolkit - Amazon EMR
# Launch a two-node cluster (release emr-4.5.0) with Hive and Spark
aws emr create-cluster \
  --name "demo" \
  --release-label emr-4.5.0 \
  --instance-type m3.xlarge \
  --instance-count 2 \
  --ec2-attributes KeyName=YOUR-AWS-SSH-KEY \
  --use-default-roles \
  --applications Name=Hive Name=Spark
# Launch a Hive/Pig cluster (legacy AMI 2.4) and run a Pig script as a step
aws emr create-cluster \
  --name "Test cluster" \
  --ami-version 2.4 \
  --applications Name=Hive Name=Pig \
  --use-default-roles \
  --ec2-attributes KeyName=myKey \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
  --steps Type=PIG,Name="Pig Program",ActionOnFailure=CONTINUE,Args=[-f,s3://mybucket/scripts/pigscript.pig,-p,INPUT=s3://mybucket/inputdata/,-p,OUTPUT=s3://mybucket/outputdata/,$INPUT=s3://mybucket/inputdata/,$OUTPUT=s3://mybucket/outputdata/]
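create-cluster prints a ClusterId (of the form j-XXXXXXXX, a placeholder here)
that later commands reference, for example:
aws emr list-clusters --active                       # find running clusters
aws emr describe-cluster --cluster-id j-XXXXXXXX     # check status and master address
aws emr terminate-clusters --cluster-ids j-XXXXXXXX  # shut down to stop billing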
AWS Toolkit - Amazon S3 (Simple Storage Service)
Virtually infinite storage.
Single objects up to 5 TB.
Why use S3?
Durable, low cost, scalable, high performance, secure, integrated, easy to use.
Decouples storage from computation resources.
EMRFS implements the HDFS interface on top of S3, so EMR jobs can read and
write S3 directly.
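Staging data for a cluster is just S3 copies from the AWS CLI; the bucket and
key names below are illustrative:
aws s3 mb s3://mybucket                                    # create a bucket
aws s3 cp reads.fastq s3://mybucket/inputdata/reads.fastq  # upload sequencer output
aws s3 ls s3://mybucket/outputdata/ --recursive            # browse results written by EMR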
AWS Toolkit - Amazon Redshift
Fast, simple, petabyte-scale data warehouse.
Queried with standard SQL.
Massively parallel.
Relational.
Architecture: one leader node plus compute nodes.
Fast: up to 4 GB/sec per node.
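A common pattern is to bulk-load job output from S3 with Redshift's COPY
command; a minimal sketch via psql, where the endpoint, credentials, table,
and IAM role are all assumptions:
# psql prompts for the cluster password
psql -h mycluster.abc123.us-east-1.redshift.amazonaws.com -p 5439 -U admin -d dev <<'SQL'
CREATE TABLE junctions (chrom VARCHAR(32), pos BIGINT, reads INT);
-- COPY loads in parallel across the compute nodes' slices
COPY junctions
FROM 's3://mybucket/outputdata/junctions.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
CSV;
SQL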
Case Study - Rail-RNA
Cloud-enabled spliced aligner that analyzes many samples at once.
Architecture: Amazon S3 + Amazon EMR.
~50,000 human RNA-seq samples (from NCBI's Sequence Read Archive) analyzed
with Rail-RNA: roughly 150 terabases of sequence.
Input to results in about 2 weeks.
Cost: ~$1.40 per sample.
Paper: "Splicing across SRA".
Thank you