Cloud Computing:
Safe Haven from the Data Deluge?
Toby Bloom, Ph.D.
Clouds: the solution to all problems?
Agenda
• What is the “cloud”?
• When to use it?
• An example: moving our analysis pipeline to
the cloud
• What works; what doesn’t
What is Cloud Computing?
• Pay-as-you-go compute infrastructure
– Compute servers by the hour
– Storage services by the month
– Network transfers by the byte
• Wide range of other services offered by cloud
providers
• Other definitions:
– Google cloud
• Google apps, pay-as-you-go
• “applications as a service”
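The pay-as-you-go model above can be sketched as a simple cost function. This is only an illustration, not any provider's pricing: the `Rates` fields and every number in the example are hypothetical placeholders chosen to keep the arithmetic easy to follow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical unit prices; real provider prices differ and change often."""
    per_server_hour: float      # compute servers by the hour
    per_gb_month: float         # storage services by the month
    per_gb_transferred: float   # network transfers by the byte, expressed per GB


def monthly_cost(rates: Rates, server_hours: float, gb_stored: float,
                 gb_transferred: float) -> float:
    """Sum the three pay-as-you-go components for one month."""
    compute = rates.per_server_hour * server_hours
    storage = rates.per_gb_month * gb_stored
    network = rates.per_gb_transferred * gb_transferred
    return compute + storage + network


# Placeholder numbers, not real prices or real usage.
example_rates = Rates(per_server_hour=1.0, per_gb_month=0.10, per_gb_transferred=0.05)
example_total = monthly_cost(example_rates, server_hours=200, gb_stored=1000,
                             gb_transferred=800)
print(f"Estimated monthly cost: ${example_total:.2f}")
```

Keeping the three components separate makes it easy to see which one dominates for a given workload.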
Why clouds?
• Small research centers
– 1 or 2 Illuminas can overwhelm IT infrastructure
• Spikes in load
– The week before Marco, the compute queues get very long
• Uneven load
– If load goes up and down unpredictably, don’t want to buy
resources to handle the peaks and leave them idle much of
the time
• Large collaborative projects
– Avoid repeatedly transferring data between centers
– Make computational resources available in one place
– Easier to share all results quickly
The advantage for large projects
1000G Pilot - Fastq lifecycle
• Generate Fastq
• Fastq to NCBI
• Replicate to EBI
• Download to Sanger
• Upload BAM to EBI
• Replicate to NCBI
• Mirror to 3+ analysis sites
• 10+ copies + backups
Goal on Cloud
• Generate Fastq
• NCBI, EBI
• All further processing on Cloud
• 2 files + replicas
Our Experiment: Analysis on the Cloud
• Implement our current Illumina production
analysis pipeline (Picard) on the Amazon cloud
• Compare performance & cost to local
pipelines.
• Tune architecture for the cloud
– How to change the implementation to work best
on the cloud
– Identify general “rules” for cloud implementations
• Test use on some real projects
The Pipeline
Run Level Pipeline (Lane-Level Analysis)
• Extract Illumina Data to Standard Format
• Align reads with BWA or MAQ
• Mark Duplicate Reads
• Re-align reads around known indels
• Calibrate Quality Scores
• Collect Metrics about Libraries and Run
• Verify Sample Identity
• Summary Report
Aggregation Pipeline (Sample-Level Aggregation)
• Merge all data for each library
• Mark Duplicate Reads per library
• Collect Metrics per library
• Merge all libraries for a sample
• Collect Metrics about the Sample
• Downstream pipelines and analysts
Current Status:
• Pipeline Manager and Picard Alignment
Pipeline are running on the Amazon cloud
• Currently running 1000 Genomes Exomes
through Picard on the cloud
– As a high-volume test case
– But also the actual pipeline for the Exome DCC
– ~110 Exomes processed.
• Still restructuring / optimizing
• Cloud capabilities always changing
Challenges of porting to the cloud
• May require substantial re-architecture of
your application
• Getting the data there
• Security/privacy issues
• Efficient utilization of cloud resources
• Predicting usage needs and costs
IT Architecture Differences
Broad IT Architecture:
• Isilon Storage: petabytes in one file system
• Compute Blades: one farm, little local storage
• Load Management Software (LSF/SGE)
Photos from Chris Dagdigian
Amazon Cloud Virtual Architecture
• Compute servers
• Elastic Block Storage (EBS) volumes
• Simple Storage Service (S3)
• Load Management Software (LSF/SGE)
Quick Comparison
Broad
• Ease of development
– Data is all in the same place
all the time
– All servers can access all data
uniformly
– LSF does lots of the work
• Very high throughput
• Easy to add more compute
or more storage, but costly
• But
– Heavy network load
– Response time secondary to
throughput
Amazon Cloud
• Can add more compute or
storage as needed
• Don’t pay for what you don’t
use
• Need to explicitly assign
analyses to specific servers
– And move data there
• Faster turnaround
– Local storage
• But
– Need to make sure you have
enough local storage for each
job
Why does system architecture matter?
(Figure: disk needed and compute needed at each pipeline step: Extract Illumina Data to Standard Format; Align reads with BWA or MAQ; Mark Duplicate Reads; Re-align reads around known indels; …; Merge all data for each library; Mark Duplicate Reads per library; …)
Possible Solutions
• NFS
• Gluster
• Move EBS drives
• Use S3 for interchange
• Custom inter-node transfer
Moving the Alignment Pipeline to the Cloud
(Diagram: compute servers with Elastic Block Storage (EBS) and Simple Storage Service (S3), coordinated by the Pipeline Manager)
Lane level:
• Move fastqs from Broad to S3
• Find allocated server with capacity OR request & initialize new server
• Move fastqs to server
• Run lane-level pipeline
• Write BAM results back to S3
• Release server?
Aggregation:
• Ready to aggregate?
• Allocate existing server or request new one
• Copy BAMs from S3 to server
• Run aggregation pipeline
• Move BAMs back to S3
• Release servers as needed
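The flow above can be sketched as a small in-memory model. This is a toy under stated assumptions, not the actual Pipeline Manager: `Server`, `PipelineManager`, the `scratch_factor` of 3, the placeholder BAM size, and all lane sizes are hypothetical, and S3 is stood in for by a dictionary rather than real cloud calls.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Server:
    """Hypothetical compute server with a fixed amount of local disk."""
    name: str
    disk_gb: float
    reserved: Dict[str, float] = field(default_factory=dict)  # lane -> GB reserved

    def free_gb(self) -> float:
        return self.disk_gb - sum(self.reserved.values())


@dataclass
class PipelineManager:
    """Toy pipeline manager; S3 is modelled as a dict of object key -> size in GB."""
    server_disk_gb: float
    scratch_factor: float = 3.0  # assumed local disk needed per GB of fastq
    s3: Dict[str, float] = field(default_factory=dict)
    servers: List[Server] = field(default_factory=list)

    def upload_fastq(self, lane: str, size_gb: float) -> None:
        # Move fastqs from the local site to S3.
        self.s3[f"fastq/{lane}"] = size_gb

    def find_or_request_server(self, needed_gb: float) -> Server:
        # Find an allocated server with capacity ...
        for server in self.servers:
            if server.free_gb() >= needed_gb:
                return server
        # ... or request and initialize a new one.
        if needed_gb > self.server_disk_gb:
            raise ValueError("job needs more local storage than one server offers")
        server = Server(name=f"server-{len(self.servers) + 1}", disk_gb=self.server_disk_gb)
        self.servers.append(server)
        return server

    def start_lane(self, lane: str) -> Server:
        # Move fastqs to the chosen server and reserve local disk for the job.
        needed_gb = self.s3[f"fastq/{lane}"] * self.scratch_factor
        server = self.find_or_request_server(needed_gb)
        server.reserved[lane] = needed_gb
        return server

    def finish_lane(self, lane: str) -> None:
        # Write the BAM back to S3 (size is a placeholder) and free local disk.
        self.s3[f"bam/{lane}"] = self.s3[f"fastq/{lane}"]
        for server in self.servers:
            server.reserved.pop(lane, None)

    def ready_to_aggregate(self, lanes: List[str]) -> bool:
        return all(f"bam/{lane}" in self.s3 for lane in lanes)

    def release_idle_servers(self) -> int:
        # Release every server that has no job reserved on it.
        idle = [s for s in self.servers if not s.reserved]
        self.servers = [s for s in self.servers if s.reserved]
        return len(idle)


lanes = ["lane1", "lane2", "lane3"]
manager = PipelineManager(server_disk_gb=100.0)
for lane, size_gb in zip(lanes, [20.0, 20.0, 10.0]):
    manager.upload_fastq(lane, size_gb)
placements = {lane: manager.start_lane(lane).name for lane in lanes}
for lane in lanes:
    manager.finish_lane(lane)
ready = manager.ready_to_aggregate(lanes)
released = manager.release_idle_servers()
print(placements, ready, released)
```

In this example the second lane does not fit next to the first, so a second server is requested, while the smaller third lane is packed back onto the first server. The manager has to track local disk explicitly because there is no shared file system doing it for you.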
Challenges of porting to the cloud
• May require substantial re-architecture of
your application
• Getting the data there: network issues
• Security/privacy issues
• Efficient utilization of cloud resources
• Predicting usage needs and costs
Network Capacity and Data Transfer
• Latest test:
– Transfer of 110 exome fastqs, 800 GB zipped
– 15 hours to upload, using 2 cores (and 2 streams)
• Transfer times are highly variable
• Pay for transfer in & out, and for storage monthly
• A small center should not have difficulty transferring
data cycle by cycle for a single machine
(Diagram: Broad to Amazon S3, 1Gb, S3FTP)
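The test figures above imply an average upload rate that is easy to work out. A minimal sketch, assuming decimal gigabytes (1 GB = 10^9 bytes) and assuming the "1Gb" label denotes a 1 Gb/s link; neither assumption is stated on the slide.

```python
from typing import Tuple

GB = 10**9  # assumption: decimal gigabytes


def average_throughput(bytes_moved: float, hours: float) -> Tuple[float, float]:
    """Return (megabytes per second, megabits per second) averaged over the transfer."""
    seconds = hours * 3600
    bytes_per_second = bytes_moved / seconds
    return bytes_per_second / 1e6, bytes_per_second * 8 / 1e6


# 800 GB uploaded in 15 hours, as reported in the test.
mb_per_s, mbit_per_s = average_throughput(800 * GB, 15)
link_fraction = mbit_per_s / 1000   # share of a 1 Gb/s link, if that is what "1Gb" means
per_stream_mbit = mbit_per_s / 2    # the test used 2 streams
print(f"{mb_per_s:.1f} MB/s, {mbit_per_s:.1f} Mb/s, {link_fraction:.0%} of 1 Gb/s")
```

Under those assumptions the average is roughly 15 MB/s, or about 120 Mb/s, well below the nominal link speed. This is an average only; the transfer times themselves were reported as very variable.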
Security!!
• Neither the Amazon cloud nor any other cloud
is currently approved for storing controlled-
access genomic data
• Okay for 1000 Genomes, not for TCGA
• Major limitation of cloud right now
• Not necessarily a technical issue
Job Times and Node Utilization for BWA Alignment of 4 lanes on 1 CC1 node
(Chart: %user and %iowait, 0 to 100, over time from 4:43 PM to 3:58 AM, about 35 hours)
Costs??
• Best estimate:
– Cloud is 2-4X the cost of local compute for our
pipeline
• BUT an apples-to-apples comparison is difficult
• Comparison is more favorable for smaller centers
• Potential big savings for big collaborations
• Cloud costs going down more rapidly than local costs
• Much cheaper if you can predict capacity for the next 1-3
years.
Costs
• Efficient utilization of compute can be difficult
• Noisy neighbors affect utilization & efficiency
• Changing data sizes affect utilization rates and
resource constraints
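Why utilization matters for cost can be shown with one line of arithmetic: a server billed by the hour costs the same whether or not its cores are busy, so the cost per hour of useful work rises as utilization falls. A minimal sketch; the function name and the example rate are hypothetical, not measured values.

```python
def effective_cost_per_busy_hour(hourly_rate: float, utilization: float) -> float:
    """Cost of one hour of useful compute when only a fraction of billed time is busy."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_rate / utilization


example_rate = 2.0  # hypothetical price per billed server hour
costs = {u: effective_cost_per_busy_hour(example_rate, u) for u in (1.0, 0.5, 0.25)}
for utilization, cost in costs.items():
    print(f"utilization {utilization:.0%}: {cost:.2f} per busy hour")
```

Halving utilization doubles the effective price of the work, whatever the actual rate is.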
The Gotchas
• No way to share data among multiple
compute servers at once.
– Need to move data if using different servers for
different steps.
• Network speed variability
• Noisy neighbors
– Always need to use the largest machines
• Security regulations
Conclusions
• Definitely a viable option for small centers using
standard software
• Potential to save costs for large collaborations
• Maybe not cost effective for spikes
• Moving to the cloud is non-trivial
• Large datasets pose challenges
• Security rules need to be resolved
• Costs are hard to predict and difficult to compare
Acknowledgements
• Zach Leber
• Seva Kashin
• Thaniel Novod
• Frans Lawaetz
• John Hanks
• Matthew Trunnell
• Tim Fennell
• Kathleen Tibbetts
• Alex Wysoker
• Kiran Giramella
• Chris Dagdigian
• Vivien Bonazzi
Funding from NHGRI

Cloud Computing: Safe Haven from the Data Deluge? AGBT 2011
 
Chapter 11 Nutrition and Chronic Diseases.pptx
Chapter 11 Nutrition and Chronic Diseases.pptxChapter 11 Nutrition and Chronic Diseases.pptx
Chapter 11 Nutrition and Chronic Diseases.pptx
 
Journal Article Review on Rasamanikya
Journal Article Review on RasamanikyaJournal Article Review on Rasamanikya
Journal Article Review on Rasamanikya
 
TEST BANK For An Introduction to Brain and Behavior, 7th Edition by Bryan Kol...
TEST BANK For An Introduction to Brain and Behavior, 7th Edition by Bryan Kol...TEST BANK For An Introduction to Brain and Behavior, 7th Edition by Bryan Kol...
TEST BANK For An Introduction to Brain and Behavior, 7th Edition by Bryan Kol...
 
Cardiac Assessment for B.sc Nursing Student.pdf
Cardiac Assessment for B.sc Nursing Student.pdfCardiac Assessment for B.sc Nursing Student.pdf
Cardiac Assessment for B.sc Nursing Student.pdf
 
Promoting Wellbeing - Applied Social Psychology - Psychology SuperNotes
Promoting Wellbeing - Applied Social Psychology - Psychology SuperNotesPromoting Wellbeing - Applied Social Psychology - Psychology SuperNotes
Promoting Wellbeing - Applied Social Psychology - Psychology SuperNotes
 
Efficacy of Avartana Sneha in Ayurveda
Efficacy of Avartana Sneha in AyurvedaEfficacy of Avartana Sneha in Ayurveda
Efficacy of Avartana Sneha in Ayurveda
 
Top 10 Best Ayurvedic Kidney Stone Syrups in India
Top 10 Best Ayurvedic Kidney Stone Syrups in IndiaTop 10 Best Ayurvedic Kidney Stone Syrups in India
Top 10 Best Ayurvedic Kidney Stone Syrups in India
 
TEST BANK For Basic and Clinical Pharmacology, 14th Edition by Bertram G. Kat...
TEST BANK For Basic and Clinical Pharmacology, 14th Edition by Bertram G. Kat...TEST BANK For Basic and Clinical Pharmacology, 14th Edition by Bertram G. Kat...
TEST BANK For Basic and Clinical Pharmacology, 14th Edition by Bertram G. Kat...
 
Top-Vitamin-Supplement-Brands-in-India List
Top-Vitamin-Supplement-Brands-in-India ListTop-Vitamin-Supplement-Brands-in-India List
Top-Vitamin-Supplement-Brands-in-India List
 
Hemodialysis: Chapter 4, Dialysate Circuit - Dr.Gawad
Hemodialysis: Chapter 4, Dialysate Circuit - Dr.GawadHemodialysis: Chapter 4, Dialysate Circuit - Dr.Gawad
Hemodialysis: Chapter 4, Dialysate Circuit - Dr.Gawad
 
Vestibulocochlear Nerve by Dr. Rabia Inam Gandapore.pptx
Vestibulocochlear Nerve by Dr. Rabia Inam Gandapore.pptxVestibulocochlear Nerve by Dr. Rabia Inam Gandapore.pptx
Vestibulocochlear Nerve by Dr. Rabia Inam Gandapore.pptx
 
CHEMOTHERAPY_RDP_CHAPTER 4_ANTI VIRAL DRUGS.pdf
CHEMOTHERAPY_RDP_CHAPTER 4_ANTI VIRAL DRUGS.pdfCHEMOTHERAPY_RDP_CHAPTER 4_ANTI VIRAL DRUGS.pdf
CHEMOTHERAPY_RDP_CHAPTER 4_ANTI VIRAL DRUGS.pdf
 

Cloud Computing: Safe Haven from the Data Deluge? AGBT 2011

  • 1. Cloud Computing: Safe Haven from the Data Deluge? Toby Bloom, Ph.D.
  • 2. Clouds: the solution to all problems?
  • 3. Agenda • What is the “cloud”? • When to use it? • An example: moving our analysis pipeline to the cloud • What works; what doesn’t
  • 4. What is Cloud Computing? • Pay-as-you-go compute infrastructure – Compute servers by the hour – Storage services by the month – Network transfers by the byte • Wide range of other services offered by cloud providers • Other definitions: – Google cloud • Google apps, pay-as-you-go • “applications as a service”
  • 5. Why clouds? • Small research centers – 1 or 2 Illuminas can overwhelm IT infrastructure • Spikes in load – The week before Marco, the compute queues get very long • Uneven load – If load goes up and down unpredictably, don’t want to buy resources to handle the peaks and leave them idle much of the time • Large collaborative projects – Avoid repeatedly transferring data between centers – Make computational resources available in one place – Easier to share all results quickly
  • 6. The advantage for large projects 1000G Pilot - Fastq lifecycle Generate Fastq Fastq to NCBI Replicate to EBI Download to Sanger Upload BAM to EBI Replicate to NCBI Mirror to 3+ analysis sites Goal on Cloud 10+ copies + backups Generate Fastq NCBI EBI All further processing on Cloud 2 files + replicas
  • 7. Our Experiment: Analysis on the Cloud • Implement our current Illumina production analysis pipeline (Picard) on the Amazon cloud • Compare performance & cost to local pipelines. • Tune architecture for the cloud – How to change the implementation to work best on the cloud – Identify general “rules” for cloud implementations • Test use on some real projects
  • 8. The Pipeline Extract Illumina Data to Standard Format Align reads with BWA or MAQ Mark Duplicate Reads Re-align reads around known indels Calibrate Quality Scores Collect Metrics about Libraries and Run Verify Sample Identity Summary Report Aggregation Pipeline Merge all data for each library Mark Duplicate Reads per library Collect Metrics per library Merge all libraries for a sample Collect Metrics about the Sample Downstream pipelines and analysts Run Level Pipeline Lane-Level Analysis Sample-Level Aggregation
  • 9. Current Status: • Pipeline Manager and Picard Alignment Pipeline are running on the Amazon cloud • Currently running 1000 Genomes Exomes through Picard on the cloud – As a high-volume test case – But also the actual pipeline for the Exome DCC – ~110 Exomes processed. • Still restructuring / optimizing • Cloud capabilities always changing
  • 10. Challenges of porting to the cloud • May require substantial re-architecture of your application • Getting the data there • Security/privacy issues • Efficient utilization of cloud resources • Predicting usage needs and costs
  • 11. IT Architecture Differences • Broad IT Architecture: – Isilon Storage: petabytes in one file system – Compute Blades: one farm, little local storage – Load Management Software (LSF/SGE) • Photos from Chris Dagdigian
  • 12. Amazon Cloud Virtual Architecture • Compute servers • Elastic Block Storage (EBS) • Simple Storage Service (S3) • Load Management Software (LSF/SGE)
  • 13. Quick Comparison Broad • Ease of development – Data is all in the same place all the time – All servers can access all data uniformly – LSF does lots of the work • Very high throughput • Easy to add more compute or more storage, but costly • But – Heavy network load – Response time secondary to throughput Amazon Cloud • Can add more compute or storage as needed • Don’t pay for what you don’t use • Need to explicitly assign analyses to specific servers – And move data there • Faster turnaround – Local storage • But – Need to make sure you have enough local storage for each job
  • 14. Why does system architecture matter? Extract Illumina Data to Standard Format Align reads with BWA or MAQ Mark Duplicate Reads Re-align reads around known indels … . Merge all data for each library Mark Duplicate Reads per library … Disk needed Compute needed
  • 15. Possible Solutions • NFS • Gluster • Move EBS drives • Use S3 for interchange • Custom inter-node transfer
  • 16. Moving the Alignment Pipeline to the Cloud • Components: Pipeline Manager, compute servers, Elastic Block Storage (EBS), Simple Storage Service (S3) • Lane-level steps: – Move Fastqs from Broad to S3 – Find allocated server with capacity OR request & initialize new server – Move Fastqs to server – Run lane-level pipeline – Write BAM results back to S3 – Release server? • Aggregation steps: – Ready to aggregate? – Allocate existing server or request new one – Copy BAMs from S3 to server – Run aggregation pipeline – Move BAMs back to S3 – Release servers as needed
  • 17. Challenges of porting to the cloud • May require substantial re-architecture of your application • Getting the data there: network issues • Security/privacy issues • Efficient utilization of cloud resources • Predicting usage needs and costs
  • 18. Network Capacity and Data Transfer • Latest test: – Transfer of 110 exome fastqs, 800 GB zipped – 15 hours to upload, using 2 cores (and 2 streams) • Transfer times are very variable • Pay for transfer in and out, and storage monthly • A small center should not have difficulty transferring data cycle by cycle for a single machine • Diagram: Broad to Amazon S3 (1Gb, S3FTP)
  • 19. Security!! • Neither the Amazon cloud nor any other cloud is currently approved for storing controlled- access genomic data • Okay for 1000 Genomes, not for TCGA • Major limitation of cloud right now • Not necessarily a technical issue
  • 20. Job Times and Node Utilization for BWA Alignment of 4 lanes on 1 CC1 node [Chart: %user and %iowait, 0 to 100 percent, plotted against time of day from 4:43 PM to 3:58 AM, about 35 hours]
  • 21. Costs?? • Best estimate: – Cloud is 2-4X the cost of local compute for our pipeline • BUT apples to apples comparison is difficult • Comparison is more favorable for smaller centers • Potential big savings for big collaborations • Cloud costs going down more rapidly than local costs • Much cheaper if you can predict capacity for next 1-3 years.
  • 22. Costs • Efficient utilization of compute can be difficult • Noisy neighbors affect utilization & efficiency • Changing data sizes affect utilization rates and resource constraints
  • 23. The Gotchas • No way to share data among multiple compute servers at once. – Need to move data if using different servers for different steps. • Network speed variability • Noisy neighbors – Need to use the largest machines always • Security regulations
  • 24. Conclusions • Definitely a viable option for small centers using standard software • Potential to save costs for large collaborations • Maybe not cost effective for spikes • Moving to the cloud is non-trivial • Large datasets pose challenges • Security rules need to be resolved • Costs are hard to predict and difficult to compare
  • 25. Acknowledgements • Zach Leber • Seva Kashin • Thaniel Novod • Frans Lawaetz • John Hanks • Matthew Trunnell • Tim Fennell • Kathleen Tibbetts • Alex Wysoker • Kiran Giramella • Chris Dagdigian • Vivien Bonazzi Funding from NHGRI
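
As a rough check on the transfer figures in slide 18, the sketch below works out the average throughput implied by uploading 800 GB in 15 hours. It assumes decimal units (1 GB = 10^9 bytes) and reads the slide's "1Gb" as a 1 Gbit/s link; neither is stated in the slides, and the function name `transfer_rate` is only illustrative.

```python
def transfer_rate(total_bytes: float, seconds: float) -> tuple[float, float]:
    """Return average throughput as (megabytes/s, megabits/s), decimal units."""
    bytes_per_sec = total_bytes / seconds
    return bytes_per_sec / 1e6, bytes_per_sec * 8 / 1e6


# Figures from slide 18: 800 GB zipped, uploaded in 15 hours.
# Assumption: 1 GB = 10**9 bytes.
mb_per_s, mbit_per_s = transfer_rate(800e9, 15 * 3600)

# Assumption: the "1Gb" label on the slide means a 1 Gbit/s link.
link_fraction = mbit_per_s / 1000.0

print(f"{mb_per_s:.1f} MB/s, {mbit_per_s:.1f} Mbit/s, "
      f"{link_fraction:.0%} of a 1 Gbit/s link")
```

Under these assumptions the upload averaged about 14.8 MB/s, or roughly 12% of the nominal link capacity. That fits the slide's remark that transfer times are highly variable, since the link itself was not the limiting factor on average.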