elasticHPC supports the creation and management of cloud computing resources across multiple public cloud providers, including Amazon, Azure, Google, and clouds supporting OpenStack.
The Case for Docker in Multi-Cloud Enabled Bioinformatics Applications, by Ahmed Abdullah
We have introduced elasticHPC-Docker, based on container technology. Our package enables the creation of a computer cluster with containerized applications and workflows, in private and in different commercial clouds, using a single interface. It also includes options to manage the cluster, to deploy and run bioinformatics applications on large datasets, and to interface with image registries.
Java is finally elastic! OpenJDK improvements and new garbage collection features have enhanced Java's vertical scaling and resource consumption. The JVM can now promptly return unused memory, so its footprint can grow and shrink automatically. In this presentation, we cover the main achievements in vertical scaling, and share the peculiarities and tuning details of different GCs. Find out how to make your Java environments more elastic, so they follow the load and lower the total cost of ownership at scale.
This document discusses cloud-native deployment and Kubernetes. It describes how containers isolate applications and enable portable, consistent deployment across environments. Kubernetes provides a platform for automating deployment, scaling, and management of containerized applications. It schedules containers on hosts and provides services for load balancing and discovery. The document outlines how Kubernetes uses immutable deployments, secrets, and configuration maps to deploy applications in a cloud-native way without breaking production systems during upgrades.
In this video from the 2017 HPC Advisory Council Stanford Conference, Christian Kniep from Gaikai presents: Best Practices: State of Linux Containers.
"Linux containers are gaining more and more momentum across IT ecosystems. This talk provides an overview of what happened in the container landscape (in particular Docker) over the course of the last year and how it impacts datacenter operations, HPC, and high-performance big data. Furthermore, Christian will give an update on, and extend, the 'things to explore' list he presented at the last Lugano workshop, applying what he learned and came across during 2016."
Watch the video: http://wp.me/p3RLHQ-glP
Learn more: http://qnib.org
and
http://www.hpcadvisorycouncil.com/events/2017/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Containerizing GPU Applications with Docker for Scaling to the Cloud, by Subbu Rama
This document discusses containerizing GPU applications with Docker to enable scaling to the cloud. It describes how containers can solve problems of hardware and software portability by allowing applications to run consistently across different infrastructure. The document demonstrates how to build a GPU container using Dockerfiles and deploy it across multiple clouds. It also introduces Boost Containers which combine Bitfusion Boost technology with containers to build virtual GPU machines and clusters, enabling flexible scheduling of GPU workflows without code changes.
The document summarizes upgrades made to the SVG supercomputer in 2012, including:
- Upgrading to Sandy Bridge processors with 192 cores and 1.5TB memory on thin nodes and 512GB memory on fat nodes.
- Installing an InfiniBand FDR 56 Gb/s network with 4 Tb/s aggregate bandwidth and 1 µs MPI latency.
- Configuring queues to take advantage of the Infiniband network and turbo boost, allowing up to 112 cores and 1024GB memory per job.
- Benchmark results showed peak performance of 3788 GFlops on thin nodes and 563 GFlops on fat nodes.
Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. It groups the containers that make up an application into logical units for easy management and discovery. Kubernetes masters manage worker nodes, and pods, the basic building blocks, contain one or more containers. It provides self-healing, horizontal pod autoscaling, service discovery, load balancing, and configuration management.
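Several summaries above describe pods, labels, and replicas abstractly; a minimal sketch of how these pieces relate, built here as a plain Python dict in the shape of a Deployment manifest (the names `web` and `nginx:1.25` are illustrative, not taken from any of the decks):

```python
# Build a minimal Kubernetes Deployment manifest as a plain dict.
# Labels tie the Deployment's selector to the pod template it manages.
def make_deployment(name, image, replicas=2):
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }

manifest = make_deployment("web", "nginx:1.25", replicas=3)
# The selector must match the pod template's labels, or no pods are managed.
print(manifest["spec"]["selector"]["matchLabels"] ==
      manifest["spec"]["template"]["metadata"]["labels"])  # True
```

Serialized to YAML, the same structure is what `kubectl apply` would consume.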
Scaling Jakarta EE Applications Vertically and Horizontally with Jelastic PaaS, by Jelastic Multi-Cloud PaaS
In this presentation, you'll find out which metrics should be tracked to meet an application's load requirements, how to fine-tune scaling triggers to handle different load levels efficiently, and how to automate vertical and horizontal scaling of Jakarta EE applications running in the cloud.
We also share how to integrate load-testing tools to adjust horizontal scaling and to make sure your application can cope with production workloads.
The practical side is demonstrated with Jelastic PaaS: https://jelastic.com/
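The scaling-trigger idea from this talk can be sketched generically: sample a load metric over a window and change the node count when it crosses a threshold. The thresholds and the `cpu_samples` input below are illustrative assumptions, not Jelastic's actual API:

```python
# Generic horizontal-scaling trigger: scale out when average CPU over a
# sampling window exceeds the high threshold, scale in when it drops
# below the low one; stay within [min_nodes, max_nodes].
def scaling_decision(cpu_samples, nodes, high=70.0, low=20.0,
                     min_nodes=1, max_nodes=10):
    avg = sum(cpu_samples) / len(cpu_samples)
    if avg > high and nodes < max_nodes:
        return nodes + 1
    if avg < low and nodes > min_nodes:
        return nodes - 1
    return nodes

print(scaling_decision([85, 90, 80], nodes=2))  # high load -> 3
print(scaling_decision([10, 15, 5], nodes=2))   # low load -> 1
```

Separate high/low thresholds leave a dead band in between, which avoids flapping when the load hovers near a single cutoff.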
A brief introduction to Amazon ECS, Dockerizing a Spring Boot application, CI/CD, and notifications using Slack.
This PPT also explains how a CI/CD pipeline can be built using Jenkins.
Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. It groups containers that make up an application into logical units for easy management and discovery called Pods. ReplicaSets ensure that a specified number of pod replicas are running at any given time. Key components include Pods, Services for enabling network access to applications, and Deployments to update Pods and manage releases.
This document provides an overview of Kubernetes basics. It introduces Kubernetes as an open source container orchestration tool developed by Google to manage the lifecycle of containers. It describes common Kubernetes concepts like pods, deployments, services, and how to install Kubernetes on local, on-premise and cloud environments. It also covers important topics for production use such as health checks, resource restrictions, logging, monitoring and alerts.
Kubernetes is an open-source platform for automating deployment, scaling, and management of containerized applications. It groups containerized applications into logical units called pods and uses labels to select pods and services for management at scale. Kubernetes masters manage the state of the cluster through the API server, scheduler and controller manager, while nodes run the pods and services and report back to the master.
Overview of Kubernetes and its use as a DevOps cluster management framework.
Problems with deployment via kube-up.sh, and improving Kubernetes on AWS via a custom CloudFormation template.
The keynote discussed how containers can provide robustness and improved utilization of resources. Containers isolate applications and enable sets of applications called pods to run together with shared resources. The key challenges discussed were unpredictable interference between containers, low resource utilization, and hard to enforce isolation. Solutions presented were using cgroups for isolation, allowing "slack" resources to be used for lower priority tasks, and moving enforcement directly into the kernel. Kubernetes was introduced as an open source project for orchestrating pods across multiple machines through replication and reconciliation of the actual vs desired state.
Kubernetes for Beginners: An Introductory Guide, by Bytemark
Kubernetes is an open-source tool for managing containerized workloads and services. It allows for deploying, maintaining, and scaling applications across clusters of servers. Kubernetes operates at the container level to automate tasks like deployment, availability, and load balancing. It uses a master-slave architecture with a master node controlling multiple worker nodes that host application pods, which are groups of containers that share resources. Kubernetes provides benefits like self-healing, high availability, simplified maintenance, and automatic scaling of containerized applications.
Federated Mesos Clusters for Global Data Center Designs, by Krishna-Kumar
The document describes Huawei's approach to federating Mesos clusters across multiple data centers. It proposes a multi-master federation approach where each data center runs its own Mesos master that coordinates with other masters. Gossiper modules in each data center gossip with each other to exchange framework and resource information. When one data center reaches a resource threshold, its gossiper will direct work to other data centers based on a simple policy engine. A demo visualization is shown to illustrate work load balancing during normal and failure scenarios.
Momentum around containers has gradually increased over the last two years. Containers virtualize an OS: applications running in each container believe they have full access to their very own copy of that OS. This is analogous to what VMs do when they virtualize at a lower level, the hardware. In the case of containers, it's the OS that does the virtualization and maintains the illusion.
In the recent past, many software companies have quickly adopted container technologies, including Docker, aware of both the threat and the advantage of the approach. Linux vendors, for example, have jumped in, seeing this as an opportunity to grow the Linux market; Microsoft is adding features to support containers, and VMware has worked on integrating Docker support into its virtual machine technology.
Kubernetes is a container orchestration platform that provides a mechanism to manage the resources of containers in the cluster. That mechanism is known as "Requests and Limits".
Requests and limits play a key role not only in resource management but also in application stability, capacity planning, and scheduling (i.e., which node a pod will run on).
In this session we will cover:
- A quick review of Containers, Docker, and Kubernetes.
- Container resource management in Kubernetes.
- Container resource types in Kubernetes.
- 3 different ways to set requests and limits.
- The difference between capacity and allocatable resources.
- Tips and recap.
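One item in the list above, the difference between capacity and allocatable resources, comes down to simple arithmetic: a node's allocatable resources are its capacity minus the kube-reserved and system-reserved amounts and the eviction threshold. A sketch in Python, with made-up quantities for an 8 GiB node:

```python
# Node allocatable = capacity minus reservations and eviction threshold.
# Pods can only be scheduled against the allocatable amount, not raw capacity.
def allocatable(capacity, kube_reserved, system_reserved, eviction_threshold):
    return capacity - kube_reserved - system_reserved - eviction_threshold

# Memory in MiB for an illustrative 8 GiB node.
mem = allocatable(capacity=8192, kube_reserved=512,
                  system_reserved=256, eviction_threshold=100)
print(mem)  # 7324
```

This is why summing pod requests against a node's advertised capacity overestimates what the scheduler will actually place there.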
Slides used for Orchestructure May 2018 workshop.
Labs:
https://github.com/mrbobbytables/k8s-intro-tutorials
Event Information:
https://www.meetup.com/orchestructure/events/250189685/
Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. It was originally developed by Google based on years of experience running production workloads at scale. Kubernetes groups containers into logical units called pods and handles tasks like scheduling, health checking, scaling and rollbacks. The main components include a master node that manages the cluster and worker nodes that run application containers scheduled by the master.
Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. It groups containers that make up an application into logical units for easy management and discovery called pods. Its main components include a master node that manages the cluster and worker nodes that run the applications. It uses labels to identify pods and services and selectors to group related pods. Common concepts include deployments for updating apps, services for network access, persistent volumes for storage, and roles/bindings for access control. The deployment process involves the API server, controllers, scheduler and kubelet to reconcile the desired state and place pods on nodes from images while providing discovery and load balancing.
Federated Kubernetes as a Platform for Distributed Scientific Computing, by Bob Killen
A high-level overview of Kubernetes Federation and the challenges encountered when building out a platform for multi-institutional research and distributed scientific computing.
This document provides an overview of Kubernetes, a container orchestration system. It begins with background on Docker containers and orchestration tools prior to Kubernetes. It then covers key Kubernetes concepts including pods, labels, replication controllers, and services. Pods are the basic deployable unit in Kubernetes, while replication controllers ensure a specified number of pods are running. Services provide discovery and load balancing for pods. The document demonstrates how Kubernetes can be used to scale, upgrade, and rollback deployments through replication controllers and services.
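The replication-controller behavior described above, ensuring that a specified number of pods are running, is a reconciliation loop: compare the actual state with the desired state and act on the difference. A toy sketch, with pods modeled as a plain list of names:

```python
# Toy reconciliation: bring the actual set of pods to the desired count.
def reconcile(pods, desired):
    pods = list(pods)
    while len(pods) < desired:   # too few pods -> create one
        pods.append(f"pod-{len(pods)}")
    while len(pods) > desired:   # too many pods -> delete one
        pods.pop()
    return pods

print(reconcile(["pod-0"], desired=3))                    # scales up to 3
print(reconcile(["pod-0", "pod-1", "pod-2"], desired=1))  # scales down to 1
```

Real controllers run this comparison continuously against the API server's desired state, which is what makes the system self-healing: a crashed pod is simply a deviation to be reconciled away.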
This document provides an agenda and instructions for learning Kubernetes in 90 minutes. The agenda includes exercises on running a first web service in Kubernetes, revisiting pods, deployments and services, deploying with YAML files, and installing a microservices application called Guestbook. Key Kubernetes concepts covered include pods, deployments, services, YAML descriptors, and using deployments to scale applications. The document also provides background on containers, Docker, and the Kubernetes architecture.
This document provides an overview of Kubernetes including:
1) Kubernetes is an open-source platform for automating deployment, scaling, and operations of containerized applications. It provides container-centric infrastructure and allows for quickly deploying and scaling applications.
2) The main components of Kubernetes include Pods (groups of containers), Services (abstract access to pods), ReplicationControllers (maintain pod replicas), and a master node running key components like etcd, API server, scheduler, and controller manager.
3) The document demonstrates getting started with Kubernetes by enabling the master on one node and a worker on another node, then deploying and exposing a sample nginx application across the cluster.
Delivering Bioinformatics MapReduce Applications in the Cloud, by Lukas Forer
This document discusses delivering bioinformatics MapReduce applications in the cloud. It introduces Cloudgene, a graphical workflow system for executing MapReduce programs, and CloudMan, a platform for deploying computational tools and data analysis environments in the cloud. The authors propose implementing Cloudgene as a service within the CloudMan platform to provide bioinformaticians with an integrated environment for executing MapReduce workflows and analyses in the cloud without requiring expertise in cluster administration or computer science. This would allow researchers to leverage scalable cloud resources for processing large genomic datasets.
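The summary above mentions MapReduce workflows without showing the model itself. A self-contained word-count sketch of the map, shuffle, and reduce phases in plain Python (a generic illustration of the programming model, not Cloudgene's actual interface):

```python
from collections import defaultdict

# Map phase: emit (key, 1) pairs for every token in the input.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

# Shuffle + reduce phase: group pairs by key and sum the counts.
def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

reads = ["ACGT ACGT TTGA", "ttga ACGT"]
print(reduce_phase(map_phase(reads)))  # {'acgt': 3, 'ttga': 2}
```

In a real Hadoop job the map and reduce functions run in parallel across cluster nodes, with the framework handling the shuffle; the per-record logic is the same.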
The document discusses two software patterns used in developing Chipster, a bioinformatics application: graceful GUI blocking, which places an opaque layer over the GUI to indicate loading and prevent user interaction; and self-service distributed state management, which distributes application state management to clients to avoid single points of failure in a distributed system. The patterns were found useful for Chipster, which provides bioinformatics analysis tools through a graphical interface and supports distributed computing.
Workshop on Higher Education and Professional Responsibility in CBRN Applied Sciences and Technology across the Sub-Mediterranean Region
3-4 April 2012. Palazzo Zorzi, Venice
Session 3. Inspiring Initiatives and Scientific Cooperation
Contemporary Global Science, Technology, and Innovation (ST&I) Strategies, by Prof. Tafida Ghanem
Contemporary global science, technology, and innovation (ST&I) strategies are concerned with harnessing science and technology for development in the present era. They take the form of national policies, plans, and programs drawn up by the ministries responsible for science and technology in developed and developing countries. They aim to advance research, development, and scientific creativity in all fields at the national and global levels, and to support technology in serving society, solving environmental problems, and achieving sustainable development and long-term growth in all countries of the world.
The document outlines the process and requirements for applying for funding from Saudi Arabia's Strategic Technologies Program Grants. Applicants must submit a proposal following a specific format, including an application form, project details, objectives, methodology, budget, and team CVs. Proposals go through an initial review at the applicants' research institution, then a formal review at KACST. Technical reviews are conducted by external reviewers who evaluate innovation, impact, and feasibility. Funding is provided in installments contingent on semi-annual progress reports. Changes to projects must be approved by the CNPSTI committee. The process aims to fund strategic, innovative research through a rigorous peer-reviewed selection process.
1. The document discusses plans for developing Egypt's e-justice sector and improving access to justice through information technology and communications.
2. The goals are to dramatically improve Egypt's rule of law indicators by making judicial processes more integrated, transparent, and effective through online portals, dashboards, and other digital tools that provide timely access to decisions and court performance metrics.
3. A roadmap is presented that outlines services, systems, and integrations to be developed between 2017-2018, including case management systems, a unified justice portal, digital judicial libraries, and analytics to support transparency and access to justice for the public.
This document summarizes a presentation about using the Barcode of Life Database (BOLD) for data sharing, collaboration, and publication in barcoding projects. It discusses how BOLD allows collaborators from different institutions and countries to work together by giving different levels of access. It also describes how BOLD facilitates sending barcode sequences to GenBank for publication and submitting bibliographies. The presentation encourages making barcoding projects public on BOLD to share data with the research community.
This document provides an introduction to the field of bioinformatics including:
1. Key fundamentals such as the flow of genetic information and challenges of accumulating biological data.
2. Applications such as using tools to help biologists with research through tasks like data analysis, storage and retrieval.
3. Various career paths in bioinformatics which typically require backgrounds in both biology and computer science.
The document provides an introduction to the field of bioinformatics. It discusses how bioinformatics applies computer science to analyze large amounts of biological data from fields like molecular biology, medicine, and biotechnology. It also outlines some of the main topics that will be covered in the course, including biological databases, gene and protein analysis, phylogenetic analysis, and gene prediction.
الثقافة التقنية والمواطنة الالكترونية. محاضرة ضمن البرنامج الثقافي الرمضاني المقام في جامعة الحدود الشمالية لعام 1435هـ.
وتشرح العلاقة بين التقنية والمعرفة التقنية والثقافة التقنية والمواطنة الالكترونية.
تم التطرق لبعض المفاهيم مثل الهوية الرقمية والحكومة الالكترونية.
The document discusses the applications of bioinformatics in drug discovery. It describes how bioinformatics supports computer-aided drug design through computational methods to simulate drug-receptor interactions. It also discusses how virtual high-throughput screening can identify compounds that strongly bind to protein targets. The document outlines the key steps in drug design, including identifying the disease target, studying lead compounds, rational drug design techniques, and testing drugs. It emphasizes that bioinformatics can predict important drug characteristics like absorption and toxicity to save costs during development.
Phil Basford - machine learning at scale with aws sage makerAWSCOMSUM
The document discusses a machine learning endpoint architecture experiment conducted using Amazon SageMaker. Key aspects covered include:
- The reference architecture used Amazon SageMaker endpoints running Docker containers with inference engines like XGBoost and TensorFlow.
- An experiment tested endpoint scaling and performance under load using Artillery. It found endpoints automatically scaled to two instances and each could handle high request volumes, but starting a new instance took 7 minutes.
- Analysis of CloudWatch logs determined that instances handled load evenly and autoscaled as needed when an instance terminated.
We present applications of Azure Services such as Azure IaaS/PaaS and Azure RemoteApp in computational fluid dynamics and sparse linear algebra. We also present Microsoft Machine Learning Studio in prediction of the heating load in the buildings.
Machine learning at scale with aws sage makerPhilipBasford
The document discusses machine learning at scale using serverless architectures on AWS, including a reference architecture using Amazon SageMaker, AWS Lambda, and other services, and details of experiments conducted to test performance, scalability, and operational aspects of deploying machine learning models with a serverless approach. It also covers monitoring metrics, deployment strategies, and using AWS services like X-Ray, CloudWatch, and CodePipeline to enable continuous deployment of machine learning models.
Microservices , Docker , CI/CD , Kubernetes Seminar - Sri Lanka Mario Ishara Fernando
This document discusses microservices and containers. It provides an overview of microservices architecture compared to monolithic architecture, highlighting that microservices are composed of many small, independent services with separate deployments and databases. It then discusses containers and how Docker is used to package and run applications in isolated containers. Finally, it introduces Kubernetes as a container orchestration system to manage and scale multiple containerized applications across a cluster of machines.
Introductio to Docker and usage in HPC applicationsRichie Varghese
This is a basic introduction to Docker and breif comparison of docker and Virtual machines...
You can refer the base papers
1) An Introduction to Docker and Analysis of its Performance - Babak Bashari Rad, Harrison John Bhatti, Mohammad Ahmadi
2) Using Docker in High Performance Computing
Applications - Minh Thanh Chung, Nguyen Quang-Hung, Manh-Thin Nguyen, Nam Thoai
note: Its recommended that you download the file as ppt from https://drive.google.com/open?id=1UtR7q9nLu-uBh1uHtokSyFvCV34InyvR as some demonstration works in slide show only....
The World of Internet
History of cloud computing
What is Cloud Computing?
Types of Cloud Computing
i. Software as a Service(SaaS)
ii. Platform as aService(PaaS)
iii. Infrastructure as a Service(IaaS)
Characteristics of Cloud Computing
Deployment model of Cloud Computing
Architecting .NET solutions in a Docker ecosystem - .NET Fest Kyiv 2019Alex Thissen
Conference: .NET Fest 2019
Location: Kyiv, Ukraine
Abstract: You must have noticed how Docker and containers is playing a more and more important part in .NET development. Docker support is everywhere, so it should be easy to build solutions based on container technology, right? But, it takes a bit more to architect and create a .NET solution that use Docker at its core. Many questions arise: How do you design a solution architecture that fits well with containers? Would I use .NET or .NET Core? What is a proper way to migrate to such an architecture? What changes in the .NET implementation from pre-Docker solutions with micro-services? Where do container orchestrators fit in and how do I build and deploy my solutions on a Docker container cluster, such as Azure Kubernetes Service?
These and many other questions will be answered in this session. You will learn how to design and architect your .NET solutions and get a flying start to create, build and run Docker-based containerized applications.
This document discusses task-based programming models for distributed computing. It defines tasks as distinct units of code that can be executed remotely. Task computing provides distribution by harnessing multiple computing nodes, unlike multithreaded computing within a single machine. The document categorizes task computing into high-performance, high-throughput, and many-task computing. It also describes popular task computing frameworks like Aneka, Condor, Globus Toolkit, and describes developing applications using the Aneka task programming model.
This document provides an overview of cloud computing and distributed systems. It discusses large scale distributed systems, cloud computing paradigms and models, MapReduce and Hadoop. MapReduce is introduced as a programming model for distributed computing problems that handles parallelization, load balancing and fault tolerance. Hadoop is presented as an open source implementation of MapReduce and its core components are HDFS for storage and the MapReduce framework. Example use cases and running a word count job on Hadoop are also outlined.
djypllh5r1gjbaekxgwv-signature-cc6692615bbc55079760b9b0c6636bc58ec509cd0446cb...Dr. Thippeswamy S.
This document discusses task-based distributed computing and the Aneka framework. It defines tasks as distinct units of code that can be executed remotely. Aneka uses a task programming model where tasks implement an interface and are wrapped in AnekaTask objects. Developers create application classes to control task submission and monitoring. Aneka supports various task types including embarrassingly parallel, parameter sweep, and workflows. It integrates with cloud infrastructures and provides APIs for developing distributed applications.
This document provides an overview of cloud computing and the Eucalyptus platform. It defines cloud computing as a large-scale distributed computing paradigm that delivers dynamically scalable computing resources as a service over the Internet. It then describes Eucalyptus as an open-source software that implements cloud computing on computer clusters and is compatible with Amazon EC2. The document outlines the Eucalyptus cloud architecture including components like the Cloud Controller, Cluster Controller, Node Controller, Storage Controller, and Walrus storage. It provides examples of deploying data mining applications on Eucalyptus and Amazon EC2 clouds.
Cloud computing is the natural evolution of computing where resources are provided as a service over the internet. There are different deployment models and types of cloud services including infrastructure as a service, platform as a service, and software as a service. Popular cloud frameworks include Google AppEngine, PubNub, and Jclouds which provide development platforms and services for storage, databases, and notifications in the cloud.
This document discusses cloud computing concepts including definitions, architecture, service models, and simulation tools. It summarizes a student project presentation on cloud computing that examines key aspects like scalability, pay-per-use model, and virtualization. It also evaluates cloud simulators CloudSim, GreenCloud and iCanCloud, comparing their features, scenarios and performance graphs. The document proposes a novel load balancing approach and its implementation through a dynamic information system interface.
This is the presentation on clusters computing which includes information from other sources too including my own research and edition. I hope this will help everyone who required to know on this topic.
This document summarizes a presentation on CloudSim, a toolkit for modeling and simulating cloud computing environments. CloudSim allows modeling resources and services in cloud data centers and testing application services. It features discrete event-driven simulation of large cloud environments and supports modeling virtualized resources, data centers, and network connections. CloudSim has advantages for testing policies in a repeatable and controllable environment and tuning systems before real deployment. The presentation outlines CloudSim's architecture, modeling capabilities, simulation steps, and concludes with discussions of conclusions and future work, as well as green cloud computing.
High Performance Computing (HPC) and Engineering Simulations in the CloudThe UberCloud
UberCloud Customer Workshop for engineers and scientist and their software providers, discussing cloud challenges and their solution, based on novel UberCloud software container technology which allows access and use of cloud resources and engineering applications and data, on demand, at your fingertips.
info.theubercloud.com/case-studies-and-resources
High Performance Computing (HPC) and Engineering Simulations in the CloudWolfgang Gentzsch
UberCloud Customer Workshop for engineers and scientist and their software providers, discussing cloud challenges and their solution, based on novel UberCloud software container technology which allows access and use of cloud resources and engineering applications and data, on demand, at your fingertips.
Similar to Supporting bioinformatics applications with hybrid multi-cloud services (20)
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
Generating privacy-protected synthetic data using Secludy and MilvusZilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
Monitoring and Managing Anomaly Detection on OpenShift.pdfTosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyScyllaDB
Freshworks creates AI-boosted business software that helps employees work more efficiently and effectively. Managing data across multiple RDBMS and NoSQL databases was already a challenge at their current scale. To prepare for 10X growth, they knew it was time to rethink their database strategy. Learn how they architected a solution that would simplify scaling while keeping costs under control.
Main news related to the CCS TSI 2023 (2023/1695)Jakub Marek
An English 🇬🇧 translation of a presentation to the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held in Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 on-line followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving
Manufacturing custom quality metal nameplates and badges involves several standard operations. Processes include sheet prep, lithography, screening, coating, punch press and inspection. All decoration is completed in the flat sheet with adhesive and tooling operations following. The possibilities for creating unique durable nameplates are endless. How will you create your brand identity? We can help!
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Conversational agents, or chatbots, are increasingly used to access all sorts of services using natural language. While open-domain chatbots - like ChatGPT - can converse on any topic, task-oriented chatbots - the focus of this paper - are designed for specific tasks, like booking a flight, obtaining customer support, or setting an appointment. Like any other software, task-oriented chatbots need to be properly tested, usually by defining and executing test scenarios (i.e., sequences of user-chatbot interactions). However, there is currently a lack of methods to quantify the completeness and strength of such test scenarios, which can lead to low-quality tests, and hence to buggy chatbots.
To fill this gap, we propose adapting mutation testing (MuT) for task-oriented chatbots. To this end, we introduce a set of mutation operators that emulate faults in chatbot designs, an architecture that enables MuT on chatbots built using heterogeneous technologies, and a practical realisation as an Eclipse plugin. Moreover, we evaluate the applicability, effectiveness and efficiency of our approach on open-source chatbots, with promising results.
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Discover top-tier mobile app development services, offering innovative solutions for iOS and Android. Enhance your business with custom, user-friendly mobile applications.
In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security.
This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing.
The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
4. Cloud Computing
Cloud deployment models
Public Cloud
Private Cloud
Hybrid Cloud
Community Cloud
5. Cloud Computing
Advantages and disadvantages of cloud computing
Advantages
Service automation and self-service models
Easy to deploy
A migration from CapEx to OpEx
Data recovery and backup
Disadvantages
Security issues
Users have no idea where their data resides
Incompatibility with legacy systems
Higher operational cost for long-term usage
6. Cloud Computing
Cloud Computing for Bioinformatics Applications
Some tools have already been developed for bioinformatics applications on the cloud:
Crossbow
Myrna
CloudBurst
CloudBlast
Cloud-RNA
etc.
These tools are demonstrated on cloud computing, but their techniques are not generic to other tools, and they support only Amazon Web Services.
7. Cloud Computing
Computer cluster middleware packages over the cloud
Middleware packages that support computer cluster management over the cloud:
StarCluster
Vappio
CloudMan
etc.
These middleware packages do not support running a computer cluster over multiple cloud providers.
9. Our contribution
[Diagram: Non-Federated Cloud Cluster (Clusters 1-3, each on a single provider) vs. Federated Cloud Cluster (one cluster spanning Providers 1 and 2)]
Our contribution is to extend bioinformatics applications to run over multiple clusters on different cloud service providers, supporting two types of compute cluster:
Non-Federated Cloud Cluster
Federated Cloud Cluster
10. Our contribution
ElasticHPC supports creation and management of computer clusters for bioinformatics solutions on:
– Amazon Web Services
– Microsoft Windows Azure
– Google Compute Engine
– OpenStack-based clouds
12. Use case scenarios
A simplified version of the variant analysis workflow based on NGS technology serves as the example for our use case scenarios.
The variant analysis workflow: the tools BWA, Picard, and GATK are typically used for the three steps of the workflow. The arrows are labeled with the file formats of the processed data.
13. Multiple clusters over multiple clouds
Multiple independent clusters over multiple clouds, where each cluster processes part of the input data.
14. Multiple clusters over multiple clouds
Using this scenario depends on:
Whether there is a time constraint or not
Reducing the cost within a specific time (Spot instances)
[Diagram: Input Files 1-3 dispatched to Clusters 1-3 across Cloud 1 and Cloud 2; output files stored on object storage or S3]
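The data-parallel scenario above splits one input dataset across independent clusters. A minimal sketch of that splitting step, written for illustration (this is not elasticHPC's actual code; real NGS inputs such as FASTQ must additionally be split at record boundaries):

```python
def split_into_blocks(total_size, n_clusters):
    """Divide total_size bytes into n_clusters roughly equal
    (offset, length) ranges, one range per cluster."""
    base, extra = divmod(total_size, n_clusters)
    blocks, offset = [], 0
    for i in range(n_clusters):
        # The first `extra` clusters absorb the remainder bytes.
        length = base + (1 if i < extra else 0)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 9 GB input (as in the experiments) split over 3 clusters:
blocks = split_into_blocks(9 * 1024**3, 3)
```

Each cluster would then fetch only its own byte range from object storage and run the workflow on it independently.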
15. Multiple clusters over multiple clouds
Each cluster is created in one cloud and solves one step of the workflow.
16. Multiple clusters over multiple clouds
This scenario suits cases of technical limitations: some technical specification prevents a step from running in one cloud, while the other steps can run in a cheaper cloud.
[Diagram: Read Mapping, Mark Duplicates, and Variant Calling steps assigned to Clusters 1-3 on Clouds 1-3; output files stored on object storage or S3]
17. One cluster of federated cloud machines
One cluster composed of machines from different clouds, with one master job queue that dispatches the jobs among the nodes in the different clouds.
18. One cluster of federated cloud machines
The master job queue dispatches jobs among the nodes in the different clouds; it works on the job level rather than on the whole (sub-)workflow level.
[Diagram: master node with a persistent process and a communication layer spanning Cloud 1 and Cloud 2]
19. One cluster of federated cloud machines
• Using this scenario depends on:
• Whether the processing time differs from one job to another
• The characteristics of the processed data
• The Internet connection among the cloud sites
• Good management of input data according to its characteristics
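The federated-cluster idea above, a single master job queue feeding nodes in different clouds, can be sketched with the Python standard library. This is only an illustrative model (not the elasticHPC implementation): worker threads stand in for cloud nodes, and the node names and job payloads are hypothetical.

```python
import queue
import threading

jobs = queue.Queue()      # the master's job queue
results = []
lock = threading.Lock()

def node_worker(node_name):
    """A stand-in for a compute node: pull jobs until a None sentinel."""
    while True:
        job = jobs.get()
        if job is None:          # shutdown signal for this node
            jobs.task_done()
            return
        with lock:
            results.append((node_name, job))  # "run" the job
        jobs.task_done()

# Two nodes, imagined as living in two different clouds.
nodes = [threading.Thread(target=node_worker, args=(f"cloud{i}-node",))
         for i in (1, 2)]
for t in nodes:
    t.start()

# The master dispatches at the job level, not the workflow level.
for job in ["map_reads", "mark_duplicates", "call_variants"]:
    jobs.put(job)
for _ in nodes:
    jobs.put(None)               # one sentinel per node
jobs.join()
for t in nodes:
    t.join()
```

Whichever node is free takes the next job, which is exactly why this scenario is sensitive to heterogeneous job durations and inter-cloud bandwidth, as the bullet list notes.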
23. Implementation of multi-cloud elasticHPC
The three major commercial providers: Amazon, Azure, and Google
Amazon Web Services (AWS)
Execution Model:
• Highest-CPU virtual machine of type c3.8xlarge (32 cores, 108 GB RAM, $1.68/hr)
Storage Model:
• EBS "Elastic Block Store": hard disks and block devices
• S3 "Simple Storage Service": object storage
Pricing models:
• Pay as you go
• Reserved instances
• Spot instances
24. Implementation of multi-cloud elasticHPC
Microsoft Windows Azure
Execution Model:
• Highest-CPU virtual machine of type A9 (16 cores, 112 GB RAM, $4.47/hr)
Storage Models:
• Page Blobs: hard disks and block devices usable as a file system, with a maximum size of 1 TB
• Block Blobs: maximum size of 200 GB
Pricing models:
• Pay as you go ("pay per minute")
25. Implementation of multi-cloud elasticHPC
Google Compute Engine / Google Cloud
Execution Model:
• Highest-CPU virtual machine of type n1-highmem-16 (16 cores, 104 GB RAM, $1.18/hr)
• Google also provides hard disks, snapshots, and images within the execution model
Storage Models:
• Object storage
Pricing models:
• Pay as you go ("pay per minute")
• Sustained use discounts
30. Implementation of multi-cloud elasticHPC
Cluster Manager
Handles all functions related to the creation and management of clusters at a given cloud site, including security settings and storage devices.
31. Implementation of multi-cloud elasticHPC
Job and Data Manager
Handles job submission and data transfer management between the cluster's nodes and the different storage types (block/object storage).
33. Experiments
Variant Analysis Workflow
Input: an exome dataset of size ≈ 9 GB
Tools: BWA for read mapping, Picard for marking duplicates, and GATK for variant calling
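The three workflow steps above can be sketched as a command pipeline. The commands below are illustrative only, assembled but never executed here; the exact flags, file names, and required extra arguments depend on the BWA, Picard, and GATK versions actually used.

```python
def build_workflow(ref="ref.fa", reads="sample.fastq"):
    """Assemble (but do not run) the three-step variant-calling
    workflow from the slides. File names are hypothetical."""
    mapped, dedup, vcf = "mapped.bam", "dedup.bam", "variants.vcf"
    return [
        # Step 1: read mapping with BWA.
        ["bwa", "mem", ref, reads],
        # Step 2: mark duplicates with Picard.
        ["picard", "MarkDuplicates", f"I={mapped}", f"O={dedup}"],
        # Step 3: variant calling with GATK.
        ["gatk", "HaplotypeCaller", "-R", ref, "-I", dedup, "-O", vcf],
    ]

steps = build_workflow()
```

In the multi-cloud scenarios, each element of this list could be handed to a different cluster, with the intermediate BAM files passing through object storage between steps.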
34. Experiments
Experiment 1
The workflow was executed three times independently on:
Google: n1-highmem-8 (8 cores, 52 GB RAM, $0.452/hour)
AWS: m3.2xlarge (8 cores, 30 GB RAM, $0.56/hour)
Azure: Standard A7 (8 cores, 56 GB RAM, $1.00/hour)
The 9 GB input is divided into blocks to be processed in parallel over the cluster nodes.
35. Experiments
Experiment 1
Google and Amazon show comparable performance; Azure has the worst performance.
Running times are in minutes. "MarkD" stands for the mark-duplicates step. The numbers between brackets are the cost in USD.
36. Experiments
Experiment 1
Note that the mark-duplicates step shows no performance improvement when adding more nodes (increasing computing power), because Picard requires all reads as a single input set.
37. Experiments
Experiment 2
The same input dataset is used, but with a stronger machine for the mark-duplicates step on Amazon: c3.8xlarge (32 cores, 108 GB RAM, $1.68/hr).
[Diagram: read mapping and variant calling run on a Google cluster of n1-highmem-8 machines (8 cores, 52 GB RAM, $0.452/hr); the mapped BAM file is transferred to the Amazon c3.8xlarge for the mark-duplicates step; the VCF output file is uploaded to object storage (S3/Google objects)]
38. Experiments
Experiment 2
Google always yields a better cost when parallelization leads to fractions of an hour, since it charges per minute. The best cost with comparable performance for this three-step workflow is therefore achieved with a hybrid cloud of Amazon and Google.
Running times are in minutes, using a single provider and the multi-cloud scenario of two providers. The numbers between brackets are the cost in USD.
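The cost effect described above comes down to billing granularity. A simplified worked example (ignoring discounts and data-transfer fees) using the c3.8xlarge rate from the slides:

```python
import math

def cost_per_hour(minutes, hourly_rate):
    """AWS-style billing in this period: every started hour
    is charged in full."""
    return math.ceil(minutes / 60) * hourly_rate

def cost_per_minute(minutes, hourly_rate):
    """Google/Azure-style billing: charge proportionally
    to the minutes actually used."""
    return minutes / 60 * hourly_rate

# A 75-minute run on a $1.68/hr machine:
aws_like = cost_per_hour(75, 1.68)       # billed as 2 full hours
google_like = cost_per_minute(75, 1.68)  # billed as 1.25 hours
```

When parallelization shrinks each step to a fraction of an hour, the per-hour model keeps charging whole hours, which is why the hybrid Amazon-plus-Google placement wins on cost for this workflow.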
39. Conclusion
We introduce elasticHPC, which creates and manages computer clusters over multiple cloud platforms for bioinformatics applications.
Google and Azure offer a charge-per-minute pricing model.
Amazon charges per hour.
elasticHPC enables the data analyst to use the cloud with the best offer at the time of analysis.
elasticHPC opens the way for the development of more advanced layers for task scheduling and cost-time optimization.
As future work, we will explore different ideas for using shared storage from multiple clouds as a shared file system.
41. Availability and requirements
• Project name: elasticHPC.
• Project home page: http://www.elastichpc.org.
• Operating system(s): Linux.
• Programming languages: Python, C, JavaScript, HTML, shell script.
• Other requirements: compatible with the browsers Firefox, Chrome, Safari, and Opera. See the manual for more details.
• License: free for academics. An authorization license is needed for commercial usage (please contact the corresponding author for more details).
• Any restrictions to use by non-academics: no restrictions.
42. Configurations file
Configuration file sample: Google-specific configuration section
################## BASIC SETTING FOR CLOUD PLATFORMS ##############
[GCE]
# GOOGLE COMPUTE ENGINE CONFIGURATION
PROJECT_ID =
ZONE = us-central1-a
CLIENT_SECRET = config/client_secret.json
COMPUTE_SCOPE = https://www.googleapis.com/auth/compute
OAUTH_STORAGE = oauth2.dat
IMAGE_PROJECT =
SERVICE_EMAIL = default
NETWORK = default
SCOPES = https://www.googleapis.com/auth/devstorage.full_control
API_VERSION = v1
CLUSTER_CLIENT_KEY = keys/key
ROOT_DISK=disks
43. Configurations file
Configuration file sample: Azure-specific configuration section
################## BASIC SETTING FOR CLOUD PLATFORMS ##############
[AZURE]
# MICROSOFT WINDOWS AZURE CONFIGURATIONS
SUBSCRIPTION_ID =
THUMBPRINT =
STORAGE_ACCOUNT =
STORAGE_KEY =
CERTIFICATE_PATH = mycert.pem
PKFILE = mycert.cer
CERT_DATA_PATH = mycert.pfx
CERT_PASSWORD =
REGION = WUS
CONTAINER=newcontainer
44. Configurations file
Configuration file sample: Amazon-specific configuration section
########## BASIC SETTING FOR CLOUD PLATFORMS ########
[AWS]
# AMAZON WEB SERVICES CONFIGURATIONS
pkey= pk.pem
cert= cert.pem
accessKey=
secretKey=
keyPair= instance-key
securityGroup =
keyPairPath= instance-key.pem
INSTANCE_TYPE = m3.medium
MASTER_TYPE = m3.medium
REGION = USW1
ZONE = us-west-1c
45. Configurations file
The Clusters section defines multiple clusters; each cluster has multiple machine sets, and every machine set represents a sub-cluster on a different cloud service provider.
###### DEFINE CLUSTERS #######
[CLUSTERS]
CLUSTERS_LIST= CLUSTER1, CLUSTER2

[CLUSTER1]
### CLUSTER1 is a hybrid cluster over multiple clouds
# CLUSTER CONFIGURATION
CLUSTER_NAME= cluster1
CLUSTER_PREFIX = cluster1
MachineSets=MachineSet2,MachineSet3,MachineSet1
MASTER_NODE_LOCATION= MachineSet2
NFS = True
# NFS CONFIGURATION
NFS_MOUNTING_POINT=/home
NFS_DEVICE=/dev/xvdf
NFS_FSID=0
NFS_EBS_Mode=NEW_VOLUME
# attach a new volume
NFS_NEW_VOLUME_SIZE=10
# in case of attaching an existing volume
GLUSTER=False
GLUSTER_MOUNT_POINT = /gluster/WGA/
GLUSTER_VOLUME_NAME = gv0
GLUSTER_STRIPE = 1
GLUSTER_REPLICATE = 1
GLUSTER_FORMAT_DISK = False

[MachineSet1]
NODES = 2
PROVIDER = GCE
# IMAGE CONFIGURATION
IMAGE_ID = tavaxy2
……
FIREWALL=ehpc,http2,apache2
FW_PORTS=5000,8080,80
FW_PROTOCOLS=tcp,tcp,tcp

[MachineSet2]
NODES = 0
PROVIDER = AWS
IMAGE_ID = ami-077d9a43
……
FW_PORTS=5000,8080,80
FW_PROTOCOLS=tcp,tcp,tcp

[MachineSet3]
NODES=0
PROVIDER = AZURE
IMAGE_ID = ehpc-generic26
OS_URL =
……
FW_PORTS=5000,8080,80
FW_PROTOCOLS=tcp,tcp,tcp
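Since the configuration follows standard INI syntax, a cluster definition like the one above can be read with Python's standard-library configparser. This is only a sketch of how such a file could be consumed (elasticHPC's actual parsing code may differ); a trimmed copy of the sample is inlined here.

```python
import configparser

# Trimmed version of the cluster definition shown on this slide.
sample = """
[CLUSTERS]
CLUSTERS_LIST = CLUSTER1

[CLUSTER1]
CLUSTER_NAME = cluster1
MachineSets = MachineSet2,MachineSet3,MachineSet1
MASTER_NODE_LOCATION = MachineSet2
NFS = True

[MachineSet1]
NODES = 2
PROVIDER = GCE

[MachineSet2]
NODES = 0
PROVIDER = AWS

[MachineSet3]
NODES = 0
PROVIDER = AZURE
"""

config = configparser.ConfigParser()
config.read_string(sample)

cluster = config["CLUSTER1"]
# Each machine set names a sub-cluster on one provider.
machine_sets = [s.strip() for s in cluster["MachineSets"].split(",")]
providers = [config[ms]["PROVIDER"] for ms in machine_sets]
```

Note that configparser treats option names case-insensitively, so lookups like `cluster["MachineSets"]` match the mixed-case keys in the file, while section names such as `[MachineSet1]` must match exactly.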