High Performance Computing (HPC) in the Cloud
Overview
1. Introduction to HPC
2. Hadoop (HDFS, MapReduce)
3. AWS toolkit (Amazon S3, Amazon EMR, Amazon Redshift)
4. Case study
Why?
Bottlenecks in Genome Analysis
Large data files from sequencers.
Computational bottlenecks.
Long processing times.
Data persistence and reliability.
Data security.
How?
Introduction
“High Performance Computing (HPC) most generally refers to the practice of
aggregating computing power in a way that delivers much higher performance
than one could get out of a typical desktop computer or workstation in order to
solve large problems in science, engineering, or business.”
Forms of HPC
Dedicated supercomputer.
Commodity HPC cluster.
Grid computing.
HPC in the cloud.
What?
Hadoop
Open-source, Java-based framework for reliable, scalable, distributed
computing.
Created by Doug Cutting and Mike Cafarella at Yahoo! (2006-08), inspired by
Google's GFS paper (2003).
Key Components
Hadoop Distributed File System (HDFS)
MapReduce
Hadoop
HDFS (Hadoop Distributed File System)
Data management layer
Master-slave architecture
Fault tolerant (blocks replicated across DataNodes)
Key Components:
NameNode
DataNodes
Secondary NameNode
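Day-to-day interaction with HDFS goes through a POSIX-like file-system shell;
a quick sketch (the paths below are illustrative):
hdfs dfs -mkdir -p /user/demo/reads            # create a directory in HDFS
hdfs dfs -put sample.fastq /user/demo/reads/   # copy a local file into the cluster
hdfs dfs -ls /user/demo/reads                  # list directory contents
hdfs fsck /user/demo/reads/sample.fastq -files -blocks   # inspect blocks and replication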
Hadoop - Continued
MapReduce
Processing layer: Mappers and Reducers
Batch-oriented
Key Components (classic MRv1):
JobTracker
TaskTracker
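MapReduce jobs need not be written in Java: with Hadoop Streaming, any
executable can act as mapper or reducer. A minimal word-count sketch, assuming
a standard Hadoop install (the jar path and HDFS paths are illustrative):
cat > wc_map.sh <<'EOF'
tr -s '[:space:]' '\n'    # mapper: emit one word per line (the word is the key)
EOF
cat > wc_red.sh <<'EOF'
uniq -c                   # reducer: keys arrive sorted, so uniq -c totals each word
EOF
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -files wc_map.sh,wc_red.sh \
  -input /user/demo/books \
  -output /user/demo/wordcount \
  -mapper "bash wc_map.sh" \
  -reducer "bash wc_red.sh"
The shuffle between the two phases guarantees that every occurrence of a word
reaches the same reducer, which is what makes the per-word totals correct.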
Hadoop - Architecture
AWS Toolkit - Amazon Elastic MapReduce (EMR)
Managed Hadoop framework.
Also runs other popular distributed frameworks such as Apache Spark, HBase,
Presto, and Flink.
Elastic.
Flexible data storage (S3, HDFS, Redshift, Glacier, RDS).
Secure and reliable.
Full control and root access.
AWS Toolkit - Amazon EMR
# Launch a two-node cluster (release emr-4.5.0) with Hive and Spark
aws emr create-cluster \
  --name "demo" \
  --release-label emr-4.5.0 \
  --instance-type m3.xlarge \
  --instance-count 2 \
  --ec2-attributes KeyName=YOUR-AWS-SSH-KEY \
  --use-default-roles \
  --applications Name=Hive Name=Spark
# Launch a Hive/Pig cluster (legacy AMI 2.4) and run a Pig script as a step
aws emr create-cluster \
  --name "Test cluster" \
  --ami-version 2.4 \
  --applications Name=Hive Name=Pig \
  --use-default-roles \
  --ec2-attributes KeyName=myKey \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
  --steps Type=PIG,Name="Pig Program",ActionOnFailure=CONTINUE,Args=[-f,s3://mybucket/scripts/pigscript.pig,-p,INPUT=s3://mybucket/inputdata/,-p,OUTPUT=s3://mybucket/outputdata/,$INPUT=s3://mybucket/inputdata/,$OUTPUT=s3://mybucket/outputdata/]
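create-cluster prints a ClusterId (of the form j-XXXXXXXX, a placeholder here)
that later commands reference, for example:
aws emr list-clusters --active                       # find running clusters
aws emr describe-cluster --cluster-id j-XXXXXXXX     # check status and master address
aws emr terminate-clusters --cluster-ids j-XXXXXXXX  # shut down to stop billing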
AWS Toolkit - Amazon S3 (Simple Storage Service)
Virtually infinite storage.
Single objects up to 5 TB.
Why use S3?
Durable, low cost, scalable, high performance, secure, integrated, easy to use.
Decouples storage from computation resources.
EMRFS implements the HDFS interface on top of S3, so EMR jobs can read and
write S3 directly.
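Staging data for a cluster is just S3 copies from the AWS CLI; the bucket and
key names below are illustrative:
aws s3 mb s3://mybucket                                    # create a bucket
aws s3 cp reads.fastq s3://mybucket/inputdata/reads.fastq  # upload sequencer output
aws s3 ls s3://mybucket/outputdata/ --recursive            # browse results written by EMR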
AWS Toolkit - Amazon Redshift
Fast, simple, petabyte-scale data warehouse.
Queried with standard SQL.
Massively parallel.
Relational.
Architecture: one leader node plus compute nodes.
Fast: up to 4 GB/sec per node.
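A common pattern is to bulk-load job output from S3 with Redshift's COPY
command; a minimal sketch via psql, where the endpoint, credentials, table,
and IAM role are all assumptions:
# psql prompts for the cluster password
psql -h mycluster.abc123.us-east-1.redshift.amazonaws.com -p 5439 -U admin -d dev <<'SQL'
CREATE TABLE junctions (chrom VARCHAR(32), pos BIGINT, reads INT);
-- COPY loads in parallel across the compute nodes' slices
COPY junctions
FROM 's3://mybucket/outputdata/junctions.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
CSV;
SQL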
Case Study - Rail-RNA
Cloud-enabled spliced aligner that analyzes many samples at once.
Architecture: Amazon S3 + Amazon EMR.
~50,000 human RNA-seq samples (from NCBI's Sequence Read Archive) analyzed
with Rail-RNA: roughly 150 terabases of sequence.
Input to results in about 2 weeks.
Cost: ~$1.40 per sample.
Paper: "Splicing across SRA".
Thank you