
High Performance Computing (HPC) in cloud



By Accubits Technologies

High Performance Computing (HPC) most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business.

Published in: Data & Analytics

  1. Overview
     - Introduction to HPC
     - Hadoop (HDFS, MapReduce)
     - AWS toolkit (Amazon S3, Amazon EMR, Amazon Redshift)
     - Case study
  2. Why?
  3. Bottlenecks in genome analysis
     - Large data files from sequencers
     - Computational bottleneck
     - Processing time
     - Data persistence and reliability
     - Data security
  4. How?
  5. Introduction
     "High Performance Computing (HPC) most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business."
  6. Forms of HPC
     - Dedicated supercomputer
     - Commodity HPC cluster
     - Grid computing
     - HPC in the cloud
  7. What?
  8. Hadoop
     - Open-source, Java-based framework for reliable, scalable, distributed computing
     - Created by Doug Cutting and Mike Cafarella, 2006-08 at Yahoo!, inspired by Google's GFS paper (2003)
     - Key components: Hadoop Distributed File System (HDFS) and MapReduce
  9. Hadoop: HDFS (Hadoop Distributed File System)
     - Data-management layer
     - Master-slave architecture
     - Fault tolerant
     - Key components: NameNode, Secondary NameNode
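The HDFS layer above is driven through the standard `hdfs dfs` client. A few illustrative commands (the paths and the replication factor are example values, and a running Hadoop cluster is assumed):

```shell
# Illustrative HDFS commands; paths and the replication factor of 3 are
# example values, and a running Hadoop cluster with `hdfs` on the PATH
# is assumed.
hdfs dfs -mkdir -p /user/demo/genomes                  # create a directory tree
hdfs dfs -put sample.fastq /user/demo/genomes/         # copy a local file into HDFS
hdfs dfs -ls /user/demo/genomes                        # list directory contents
hdfs dfs -setrep -w 3 /user/demo/genomes/sample.fastq  # block replication is what gives fault tolerance
```

Block replication across DataNodes, coordinated by the NameNode, is the mechanism behind the fault tolerance mentioned on the slide.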
  10. Hadoop (continued): MapReduce
     - Mappers and reducers
     - Batch oriented
     - Key components: JobTracker, TaskTracker
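The mapper/reducer dataflow on this slide can be sketched with plain shell pipes, no Hadoop required: the map step emits key-value pairs, a sort plays the role of the shuffle, and the reduce step aggregates per key. File names here are made up for the sketch.

```shell
# Word count as map -> shuffle -> reduce, simulated with shell pipes.
# This illustrates the MapReduce dataflow, not Hadoop's actual API.
printf 'apple banana apple\nbanana apple\n' > input.txt

# map: split lines into words and emit (word, 1) pairs
tr -s ' ' '\n' < input.txt | awk '{print $0"\t1"}' > mapped.txt

# shuffle: sorting groups identical keys together, as Hadoop's shuffle does
sort mapped.txt > shuffled.txt

# reduce: sum the counts for each key
awk -F'\t' '{c[$1]+=$2} END {for (k in c) print k, c[k]}' shuffled.txt | sort
```

In real Hadoop the same roles are played by mapper tasks, the framework's sort/shuffle phase, and reducer tasks scheduled by the JobTracker onto TaskTrackers.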
  11. Hadoop architecture
  12. AWS Toolkit - Amazon Elastic MapReduce (EMR)
     - Managed Hadoop framework
     - Runs most popular distributed frameworks, such as Apache Spark, HBase, Presto, and Flink
     - Elastic
     - Flexible data storage (S3, HDFS, Redshift, Glacier, RDS)
     - Secure and reliable
     - Full control and root access
  13. AWS Toolkit - Amazon EMR

      aws emr create-cluster --name "demo" --release-label emr-4.5.0 \
        --instance-type m3.xlarge --instance-count 2 \
        --ec2-attributes KeyName=YOUR-AWS-SSH-KEY --use-default-roles \
        --applications Name=Hive Name=Spark
  14.
      aws emr create-cluster --name "Test cluster" --ami-version 2.4 \
        --applications Name=Hive Name=Pig --use-default-roles \
        --ec2-attributes KeyName=myKey \
        --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
                          InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
        --steps Type=PIG,Name="Pig Program",ActionOnFailure=CONTINUE,Args=[-f,s3://mybucket/scripts/pigscript.pig,-p,INPUT=s3://mybucket/inputdata/,-p,OUTPUT=s3://mybucket/outputdata/,$INPUT=s3://mybucket/inputdata/,$OUTPUT=s3://mybucket/outputdata/]
  15. AWS Toolkit - Amazon S3 (Simple Storage Service)
     - Virtually infinite storage; single objects up to 5 TB
     - Why use S3? Durable, low cost, scalable, high performance, secure, integrated, easy to use
     - Decouples storage from compute resources
     - EMRFS implements the HDFS interface on top of S3, so EMR clusters can read and write S3 directly
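Moving data in and out of S3 takes only a few AWS CLI calls. The bucket and key names below are hypothetical, and configured AWS credentials are assumed:

```shell
# Hypothetical bucket and key names; requires configured AWS credentials.
aws s3 mb s3://my-genomics-bucket                       # create a bucket
aws s3 cp sample.fastq s3://my-genomics-bucket/input/   # upload an object
aws s3 ls s3://my-genomics-bucket/input/                # list objects under a prefix
aws s3 sync results/ s3://my-genomics-bucket/output/    # mirror a local directory to S3
```

Keeping inputs and outputs in S3 like this is what lets an EMR cluster be torn down after a job without losing data, i.e. the storage/compute decoupling on this slide.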
  16. AWS Toolkit - Amazon Redshift
     - Fast, simple, petabyte-scale data warehouse
     - Queried with standard SQL
     - Massively parallel; relational
     - Architecture: one leader node plus compute nodes
     - Fast: 4 GB/sec per node
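Because Redshift speaks the PostgreSQL wire protocol, any psql client can submit SQL to the leader node, which plans the query and fans it out to the compute nodes. The cluster endpoint, database, and table below are hypothetical; port 5439 is Redshift's default:

```shell
# Hypothetical cluster endpoint, database, user, and table; 5439 is the
# default Redshift port. The leader node parallelizes this aggregation
# across the compute nodes.
psql "host=demo-cluster.abc123.us-east-1.redshift.amazonaws.com port=5439 dbname=dev user=admin" \
  -c "SELECT sample_id, COUNT(*) AS reads FROM alignments GROUP BY sample_id ORDER BY reads DESC LIMIT 10;"
```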
  17. Case study - Rail-RNA
     - Cloud-enabled spliced aligner that analyzes many samples at once
     - Architecture: Amazon S3 + Amazon EMR
     - ~50,000 human RNA-seq samples (from the NCBI archive) processed with Rail-RNA; ~150 TB of data
     - Input to results in about 2 weeks
     - Cost: ~$1.40 per sample
     - Paper: "Splicing across SRA"
  18. Thank you
