Enis Afgan and Mo Heydarian
Nov 17, 2016.
Galaxy CloudMan
performance on AWS
- System architecture centered around NFS
- NFS was deemed a bottleneck
- Workarounds
- Launch larger instances to compensate
- Create multiple clones of a cluster and split the workload
The past 5+ years
[Diagram: master instance connected over NFS to workers 1..n, backed by an EBS volume and serving web traffic]
At launch: Instance type → Instance size → Disk type [→ IOPS]
At runtime: Worker instance type → Worker instance size → Number of
workers
Choices need to be made before launch
launch.usegalaxy.org
Instances
- Type: m(3|4), c(3|4), r3, x1
- Size: (x|2x|4x|8x|10x|16x|32x)?large
- Network throughput: Low, Medium, High, 10Gigabit, 20Gigabit
- Enhanced networking (if supported)
- EBS Optimized (if supported)
- Instance EBS throughput
Disk
- AWS EFS or Volume type: gp2, io1, st1
- IOPS: 100 to 20,000, w/ burst credits, counted in block sizes of 16 KiB (1MiB
for st1)
- Throughput: 20 to 500 MB/s, w/ burst credits, requires suitable instance type
Infrastructure options & AWS variables
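The IOPS accounting above (counted in 16 KiB blocks, 1 MiB for st1) implies a direct IOPS-to-throughput conversion; a small sketch of that arithmetic (the function name is ours, for illustration only):

```python
# Relationship between IOPS, block size, and throughput, using the
# accounting model the slides describe (16 KiB blocks; 1 MiB for st1).

def iops_to_mbps(iops, block_kib=16):
    """Convert an IOPS figure at a given block size to MiB/s."""
    return iops * block_kib / 1024

# 6,000 IOPS at 16 KiB blocks is ~94 MB/s, well under gp2's
# throughput ceiling, so IOPS rather than bandwidth is the limit there.
print(iops_to_mbps(6000))        # 93.75
print(iops_to_mbps(500, 1024))   # st1 at 1 MiB blocks: 500.0 MB/s
```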
Goal
Empirically deduce some conclusions about the performance issues.
How?
- Collect base infrastructure performance values
- Run a number of tests simulating real-world scenarios
- Implement conclusive and general findings
- Update documentation for the scenario-based findings
- See if analyses have an ‘execution signature’
Throughput
- The speed at which data is transferred out or onto the storage
- Measured in MB/s
IOPS
- Input/Output Operations Per Second
- How often a device can perform IO tasks
Latency
- How long it takes for a device to start an IO task
- Measured in fractions of a second
- (Not captured in these slides)
Terminology explanation
Test plan: scenarios
Test baseline resource performance
- Run synthetic benchmarks with varying infrastructure options
Test job submission rate
- Run ~20 batch workflows each using many tools
Test data throughput
- Run a long workflow with real-world data
Synthetic tests
Run a number of generic benchmark tools on a broader set of resources to get
resource baseline performance.
Benchmarking tools
- fio: Flexible I/O tester
- iPerf3: measure network throughput
Test variables
- Instance types, instance sizes, volume types, network speed, (IO block size)
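fio can emit machine-readable output (--output-format=json); a minimal sketch of pulling throughput and IOPS out of it. The field names follow fio's JSON schema as we understand it ("jobs" -> "read"/"write" -> "bw" in KiB/s, "iops"), and the sample record below is made-up illustrative data, not one of the measurements in these slides:

```python
import json

# Hypothetical fio JSON fragment for illustration; real output comes
# from e.g. `fio --name=seq-read ... --output-format=json`.
sample = json.dumps({
    "jobs": [{"jobname": "seq-read",
              "read": {"bw": 61440, "iops": 480.0}}]
})

def summarize(fio_json):
    """Extract job name, MB/s, and IOPS from a fio JSON report."""
    job = json.loads(fio_json)["jobs"][0]
    r = job["read"]
    return {"name": job["jobname"],
            "MB/s": r["bw"] / 1024,   # fio reports bw in KiB/s
            "IOPS": r["iops"]}

print(summarize(sample))  # {'name': 'seq-read', 'MB/s': 60.0, 'IOPS': 480.0}
```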
Base infrastructure disk performance (fio)
Instance/volume type: gp2 | io1 | st1 or transient
- c4.large: gp2 ~60 MB/s, ~480 IOPS *
- c4.xlarge: gp2 ~90 MB/s, ~720 IOPS *
- c4.2xlarge: gp2 ~120 MB/s, ~950 IOPS *
- c4.4xlarge: gp2 ~160 MB/s, ~1,300 IOPS *; io1 (16,000 PIOPS) ~320 MB/s, ~2,600 IOPS; st1 ~70 MB/s, ~550 IOPS
- c4.8xlarge: gp2 ~160 MB/s, ~1,300 IOPS
- c3.xlarge: gp2 ~120 MB/s, ~950 IOPS
- c3.4xlarge (EBS optimized): gp2 ~160 MB/s, ~1,300 IOPS; transient ~370 MB/s, ~2,900 IOPS
- x1.32xlarge: gp2 ~160 MB/s, ~1,300 IOPS
- m1.xxlarge on Jetstream: ~900 MB/s, ~7,000 IOPS | ~780 MB/s, ~6,000 IOPS
* Source: Ilya Sytchev
Base infrastructure network performance (iPerf3)
Master instance type (worker: c4.4xlarge) and measured network speed:
c4.large ~625 Mbps (80 MB/s) *
c4.xlarge ~1.25 Gbps (160 MB/s) *
c4.2xlarge ~2.5 Gbps (320 MB/s) *
c4.4xlarge ~5 Gbps (640 MB/s) *
c4.8xlarge ~5 Gbps (640 MB/s) (Gigabit driver not configured)
* Source Ilya Sytchev
Tiered network performance (iPerf3)
Tiered network (with master & worker instance type) and measured network speed:
Moderate (c3.xlarge) ~1.15 Gbps (140 MB/s)
High (c4.xlarge) ~5 Gbps (640 MB/s)
10 Gigabit (c4.8xlarge) ~5 Gbps (640 MB/s) - driver not configured
- AWS assigns each instance type a named network-speed tier but does not publish actual throughput values
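The MB/s figures in these network tables appear to use 1024-based megabytes (5 Gbps is listed as 640 MB/s rather than the decimal 625); a small conversion sketch showing both conventions:

```python
def gbps_to_mbs(gbps, mib_based=True):
    """Convert a link speed in Gbit/s to MB/s.
    mib_based=True reproduces the slides' figures (1 Gbit/s = 128 MiB/s);
    mib_based=False uses decimal megabytes (1 Gbit/s = 125 MB/s)."""
    return gbps * (1024 / 8) if mib_based else gbps * (1000 / 8)

print(gbps_to_mbs(5))          # 640.0, matching the slides
print(gbps_to_mbs(5, False))   # 625.0 in decimal MB/s
```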
Sample analyses
- RNA-seq (mouse)
- ChIP-seq (human)
- Training workshop (simulate ~80 participants; use downsampled human data,
~200MB across 4 fastq files)
Test data
- Human: ChIP, GSE82295, 119 compressed fastq files, total ~250GB
- Mouse: RNA, 6 samples, 5 replicas, large input files (40M paired end reads),
GSE72018, 45 fastq files, total ~1.2TB
All test data and workflows are available on S3
- RNA-seq: https://s3.amazonaws.com/cloudman-test/tests/rna/index.html
- ChIP-seq: https://s3.amazonaws.com/cloudman-test/tests/chIP/index.html
- Workshop: https://s3.amazonaws.com/cloudman-test/tests/downsampled-workshop-data/index.html
Tests were done on the 16.11 release of Galaxy CloudMan.
Test plan: use cases
- Many-node cluster
Master: c4.4xlarge ($0.838/hr)
16x r3.2xlarge (16x: 8 cores, 61GB RAM (~8GB/core))
16x $0.665/hr = $10.64/hr
- Large-node cluster
Master: c4.4xlarge ($0.838/hr)
2x m4.16xlarge (2x: 64 cores, 256GB RAM (4GB/core))
2x $3.83/hr = $7.66/hr
- Single large instance
1x x1.32xlarge (128 cores, 1952GB RAM (~15GB/core))
$13.338/hr
Disk: io1; 6TB; 20,000IOPS, 320MB/s ($2.80/hr)
Test plan: infrastructure
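The hourly worker costs above follow directly from the listed per-instance prices; a quick sketch reproducing them (prices hard-coded from the slide, late-2016 on-demand rates):

```python
# Hourly cost of each test configuration from the per-instance prices above.
MASTER = 0.838          # c4.4xlarge master, $/hr
PRICES = {"r3.2xlarge": 0.665, "m4.16xlarge": 3.83, "x1.32xlarge": 13.338}

def cluster_cost(worker_type, n_workers, master=MASTER):
    """Total hourly cost: master plus n workers of the given type."""
    return master + n_workers * PRICES[worker_type]

print(round(16 * PRICES["r3.2xlarge"], 2))       # 10.64, many-node workers
print(round(2 * PRICES["m4.16xlarge"], 2))       # 7.66, large-node workers
print(round(cluster_cost("r3.2xlarge", 16), 3))  # workers plus master
```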
Test matrix: total runtime
Many-node cluster
- RNA-Seq: Upload 3hrs 40min; Runtime (FASTQ trimmer) aborted after ~30hrs
- ChIP-Seq: Upload 1hr 20min; Runtime 4hrs 10min
- Workshop: Runtime 26 min
Large-node cluster
- RNA-Seq: Upload 3hrs 40min; Runtime (FASTQ trimmer) aborted after ~17hrs; Runtime (Trimmomatic) 25hrs 40min
- ChIP-Seq: Upload 1hr 20min; Runtime 4hrs
- Workshop: Runtime 17 min
Single large instance
- RNA-Seq: Upload 1hr 40min; Runtime (Trimmomatic) 31hrs
- ChIP-Seq: Upload ~7 min; Runtime 3hrs 30min
- Workshop: Runtime 16 min
Upload time across instance and volume types
Uploading 119 compressed fastq files totalling ~250GB from S3.
Instance size, upload time, and volume characteristics:
- c4.4xlarge: 40 min (io1, 2TB, 16,000 IOPS)
- m4.10xlarge: 45 min (gp2, 2TB, 6,000 IOPS)
- x1.16xlarge: 15 min (gp2, 2TB, 6,000 IOPS)
- c4.xlarge: est. 4 hrs (after 2hrs, 120GB transferred) (gp2, 2TB, 6,000 IOPS)
CVMFS priming time (mm10 reference, 4.5 GB):
- m4.2xlarge: 15 min
- x1.32xlarge: 10 min
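The upload times above imply rough effective S3-to-volume throughput; a back-of-the-envelope sketch using the ~250GB dataset size (the c4.xlarge figure uses the partial 120GB transferred in 2hrs):

```python
# Effective transfer throughput implied by the upload times above.
def mb_per_s(gb, minutes):
    """Average throughput in MiB/s for gb gigabytes moved in the given time."""
    return gb * 1024 / (minutes * 60)

print(round(mb_per_s(250, 40)))   # ~107 MB/s: c4.4xlarge with io1
print(round(mb_per_s(250, 15)))   # ~284 MB/s: x1.16xlarge
print(round(mb_per_s(120, 120)))  # ~17 MB/s: c4.xlarge (partial, estimated)
```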
ChIP-Seq execution signature (large-node cluster)
[Figures: master network speed (GB/min) and master CPU & memory utilization (%) over time]
ChIP-Seq execution signature
[Figures: master network and CPU & memory utilization, many-node cluster vs. large-node cluster]
Real-world workshop performance
- ~60 participants running ChIP-Seq workflow
- The figures are for the master instance (c4.4xlarge) with 2TB gp2 volume
[Figures: network speed (GB/min), CPU & memory utilization (%), and job counts]
Real-world workshop performance (cont.)
- Load on workers (m4.16xlarge)
[Figures: CPU & memory utilization and network speed for workers w1, w2, and w3]
Puzzling 130,000B volume throughput?
[Figures: volume throughput for gp2/io1 volume types vs. the x1.32xlarge transient disk]
CVMFS experiences (live workshops)
Test matrix: approximate cost
- Costs are approximate, based on the cluster spec listed earlier and the time
taken
- Many-node cluster: RNA-Seq N/A; ChIP-Seq $90; Workshop $15
- Large-node cluster: RNA-Seq $360; ChIP-Seq $70; Workshop $12
- Single large instance: RNA-Seq $530; ChIP-Seq $65; Workshop $16
Comparison with other readily available resources
Fastest cloud option
- ChIP-Seq: Upload ~7min; Runtime 3hrs 30min
- Workshop: 16min
Galaxy Main
- ChIP-Seq: Upload between 8 and 14hrs; Runtime: ran out of disk space
- Workshop: N/A
Jetstream
- ChIP-Seq: Upload 25min; Runtime: ran out of disk space (using a 1TB volume: 25hrs 20min)
- Workshop: 3hrs
Issues
- Trouble creating /mnt/transient_nfs/slurm/slurm.conf: [Errno 11] Resource temporarily unavailable: '/mnt/transient_nfs/slurm/slurm.conf'
- Volume throughput stuck at 130,000B/min?!
Outcomes and observations
- Enabled multi-core jobs (this indirectly reduces total node memory
requirements)
- Enabled Enhanced Networking on the base image
- Promote use of EBS-Optimized instances
- Set CVMFS cache to ~30GB
- Master instance hardware components are not really a bottleneck, but a larger instance is needed to ensure sufficient network throughput
- AWS EBS throughput is a bottleneck?
- Need to use larger instance type (see above)
- The NFS is a problem for many-node clusters
- Use a smaller number of larger instances, even a single instance
- Enable EBS optimized flag, or use instance type that uses it by default (c4,
m4)
- No need to use provisioned IOPS volumes
- If using provisioned IOPS volumes, match it with appropriate instance EBS throughput
- Eg: PIOPS vol max throughput is 320 MB/s; c4.8xlarge has 500MB/s EBS throughput,
while c4.4xlarge offers 250 MB/s
- Suggested configurations
- c4.4xlarge master with m4.10xlarge workers (use 1 per 10 workshop participants or jobs)
- For temporary clusters, use c3/m3/x1 master and transient storage
Recommendations
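The PIOPS-matching recommendation above boils down to taking the minimum of the volume's throughput ceiling and the instance's dedicated EBS bandwidth; a sketch using the figures from the slide:

```python
# Effective EBS throughput is capped by whichever is lower: the volume's
# max throughput or the instance's dedicated EBS bandwidth (MB/s values
# taken from the slide).
VOLUME_MAX = 320            # provisioned-IOPS (io1) volume ceiling
INSTANCE_EBS = {"c4.4xlarge": 250, "c4.8xlarge": 500}

def effective_throughput(instance, volume_max=VOLUME_MAX):
    """Return the throughput (MB/s) the pairing can actually sustain."""
    return min(volume_max, INSTANCE_EBS[instance])

print(effective_throughput("c4.4xlarge"))  # 250: the instance is the cap
print(effective_throughput("c4.8xlarge"))  # 320: the volume is the cap
```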
Future directions
- Update the docs (after the new wiki is live)
- Recommended configurations
- How to use Jetstream volumes
- Incorporate these results into the build/runtime environment
- Update instance types on the launcher
- Tune NFS
- Figure out how to measure utilized disk throughput
- Explore AWS EFS
A big thank you to Mo
for creating all the workflows, finding the data and
watching over the running jobs
ChIP-Seq execution signature (many-node cluster)
[Figures: master network speed (GB/min) and master CPU & memory utilization (%)]

More Related Content

What's hot

Build a Complex, Realtime Data Management App with Postgres 14!
Build a Complex, Realtime Data Management App with Postgres 14!Build a Complex, Realtime Data Management App with Postgres 14!
Build a Complex, Realtime Data Management App with Postgres 14!Jonathan Katz
 
はじめてのGlusterFS
はじめてのGlusterFSはじめてのGlusterFS
はじめてのGlusterFSTakahiro Inoue
 
Node Interactive Debugging Node.js In Production
Node Interactive Debugging Node.js In ProductionNode Interactive Debugging Node.js In Production
Node Interactive Debugging Node.js In ProductionYunong Xiao
 
Ceph Object Storage Performance Secrets and Ceph Data Lake Solution
Ceph Object Storage Performance Secrets and Ceph Data Lake SolutionCeph Object Storage Performance Secrets and Ceph Data Lake Solution
Ceph Object Storage Performance Secrets and Ceph Data Lake SolutionKaran Singh
 
Путь мониторинга 2.0 всё стало другим / Всеволод Поляков (Grammarly)
Путь мониторинга 2.0 всё стало другим / Всеволод Поляков (Grammarly)Путь мониторинга 2.0 всё стало другим / Всеволод Поляков (Grammarly)
Путь мониторинга 2.0 всё стало другим / Всеволод Поляков (Grammarly)Ontico
 
Disaggregating Ceph using NVMeoF
Disaggregating Ceph using NVMeoFDisaggregating Ceph using NVMeoF
Disaggregating Ceph using NVMeoFZoltan Arnold Nagy
 
Python VS GO
Python VS GOPython VS GO
Python VS GOOfir Nir
 
Object Storage with Gluster
Object Storage with GlusterObject Storage with Gluster
Object Storage with GlusterGluster.org
 
Trying and evaluating the new features of GlusterFS 3.5
Trying and evaluating the new features of GlusterFS 3.5Trying and evaluating the new features of GlusterFS 3.5
Trying and evaluating the new features of GlusterFS 3.5Keisuke Takahashi
 
Aerospike Go Language Client
Aerospike Go Language ClientAerospike Go Language Client
Aerospike Go Language ClientSayyaparaju Sunil
 
Stig Telfer - OpenStack and the Software-Defined SuperComputer
Stig Telfer - OpenStack and the Software-Defined SuperComputerStig Telfer - OpenStack and the Software-Defined SuperComputer
Stig Telfer - OpenStack and the Software-Defined SuperComputerDanny Abukalam
 
Taking Your Database Beyond the Border of a Single Kubernetes Cluster
Taking Your Database Beyond the Border of a Single Kubernetes ClusterTaking Your Database Beyond the Border of a Single Kubernetes Cluster
Taking Your Database Beyond the Border of a Single Kubernetes ClusterChristopher Bradford
 
CoreOS + Kubernetes @ All Things Open 2015
CoreOS + Kubernetes @ All Things Open 2015CoreOS + Kubernetes @ All Things Open 2015
CoreOS + Kubernetes @ All Things Open 2015Brandon Philips
 
USENIX NSDI 2016 (Session: Resource Sharing)
USENIX NSDI 2016 (Session: Resource Sharing)USENIX NSDI 2016 (Session: Resource Sharing)
USENIX NSDI 2016 (Session: Resource Sharing)Ryousei Takano
 
Evaluation of RBD replication options @CERN
Evaluation of RBD replication options @CERNEvaluation of RBD replication options @CERN
Evaluation of RBD replication options @CERNCeph Community
 
CloudStack Automated Integration Testing with Marvin
CloudStack Automated Integration Testing with Marvin CloudStack Automated Integration Testing with Marvin
CloudStack Automated Integration Testing with Marvin NetApp
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing GuideJose De La Rosa
 
Declare your infrastructure: InfraKit, LinuxKit and Moby
Declare your infrastructure: InfraKit, LinuxKit and MobyDeclare your infrastructure: InfraKit, LinuxKit and Moby
Declare your infrastructure: InfraKit, LinuxKit and MobyMoby Project
 

What's hot (20)

Build a Complex, Realtime Data Management App with Postgres 14!
Build a Complex, Realtime Data Management App with Postgres 14!Build a Complex, Realtime Data Management App with Postgres 14!
Build a Complex, Realtime Data Management App with Postgres 14!
 
はじめてのGlusterFS
はじめてのGlusterFSはじめてのGlusterFS
はじめてのGlusterFS
 
Node Interactive Debugging Node.js In Production
Node Interactive Debugging Node.js In ProductionNode Interactive Debugging Node.js In Production
Node Interactive Debugging Node.js In Production
 
Ceph Object Storage Performance Secrets and Ceph Data Lake Solution
Ceph Object Storage Performance Secrets and Ceph Data Lake SolutionCeph Object Storage Performance Secrets and Ceph Data Lake Solution
Ceph Object Storage Performance Secrets and Ceph Data Lake Solution
 
Путь мониторинга 2.0 всё стало другим / Всеволод Поляков (Grammarly)
Путь мониторинга 2.0 всё стало другим / Всеволод Поляков (Grammarly)Путь мониторинга 2.0 всё стало другим / Всеволод Поляков (Grammarly)
Путь мониторинга 2.0 всё стало другим / Всеволод Поляков (Grammarly)
 
Disaggregating Ceph using NVMeoF
Disaggregating Ceph using NVMeoFDisaggregating Ceph using NVMeoF
Disaggregating Ceph using NVMeoF
 
Python VS GO
Python VS GOPython VS GO
Python VS GO
 
Object Storage with Gluster
Object Storage with GlusterObject Storage with Gluster
Object Storage with Gluster
 
Trying and evaluating the new features of GlusterFS 3.5
Trying and evaluating the new features of GlusterFS 3.5Trying and evaluating the new features of GlusterFS 3.5
Trying and evaluating the new features of GlusterFS 3.5
 
Aerospike Go Language Client
Aerospike Go Language ClientAerospike Go Language Client
Aerospike Go Language Client
 
Stig Telfer - OpenStack and the Software-Defined SuperComputer
Stig Telfer - OpenStack and the Software-Defined SuperComputerStig Telfer - OpenStack and the Software-Defined SuperComputer
Stig Telfer - OpenStack and the Software-Defined SuperComputer
 
Apache Storm Tutorial
Apache Storm TutorialApache Storm Tutorial
Apache Storm Tutorial
 
Taking Your Database Beyond the Border of a Single Kubernetes Cluster
Taking Your Database Beyond the Border of a Single Kubernetes ClusterTaking Your Database Beyond the Border of a Single Kubernetes Cluster
Taking Your Database Beyond the Border of a Single Kubernetes Cluster
 
CoreOS + Kubernetes @ All Things Open 2015
CoreOS + Kubernetes @ All Things Open 2015CoreOS + Kubernetes @ All Things Open 2015
CoreOS + Kubernetes @ All Things Open 2015
 
USENIX NSDI 2016 (Session: Resource Sharing)
USENIX NSDI 2016 (Session: Resource Sharing)USENIX NSDI 2016 (Session: Resource Sharing)
USENIX NSDI 2016 (Session: Resource Sharing)
 
Evaluation of RBD replication options @CERN
Evaluation of RBD replication options @CERNEvaluation of RBD replication options @CERN
Evaluation of RBD replication options @CERN
 
Storm Anatomy
Storm AnatomyStorm Anatomy
Storm Anatomy
 
CloudStack Automated Integration Testing with Marvin
CloudStack Automated Integration Testing with Marvin CloudStack Automated Integration Testing with Marvin
CloudStack Automated Integration Testing with Marvin
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing Guide
 
Declare your infrastructure: InfraKit, LinuxKit and Moby
Declare your infrastructure: InfraKit, LinuxKit and MobyDeclare your infrastructure: InfraKit, LinuxKit and Moby
Declare your infrastructure: InfraKit, LinuxKit and Moby
 

Similar to Galaxy CloudMan performance on AWS

Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...Flink Forward
 
Perf Vsphere Storage Protocols
Perf Vsphere Storage ProtocolsPerf Vsphere Storage Protocols
Perf Vsphere Storage ProtocolsYanghua Zhang
 
(STG403) Amazon EBS: Designing for Performance
(STG403) Amazon EBS: Designing for Performance(STG403) Amazon EBS: Designing for Performance
(STG403) Amazon EBS: Designing for PerformanceAmazon Web Services
 
Shak larry-jeder-perf-and-tuning-summit14-part2-final
Shak larry-jeder-perf-and-tuning-summit14-part2-finalShak larry-jeder-perf-and-tuning-summit14-part2-final
Shak larry-jeder-perf-and-tuning-summit14-part2-finalTommy Lee
 
Let's Compare: A Benchmark review of InfluxDB and Elasticsearch
Let's Compare: A Benchmark review of InfluxDB and ElasticsearchLet's Compare: A Benchmark review of InfluxDB and Elasticsearch
Let's Compare: A Benchmark review of InfluxDB and ElasticsearchInfluxData
 
Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]
Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]
Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]Kyle Hailey
 
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance BarriersCeph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance BarriersCeph Community
 
Maximizing Amazon EC2 and Amazon EBS performance
Maximizing Amazon EC2 and Amazon EBS performanceMaximizing Amazon EC2 and Amazon EBS performance
Maximizing Amazon EC2 and Amazon EBS performanceAmazon Web Services
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...npinto
 
An introduction and evaluations of a wide area distributed storage system
An introduction and evaluations of  a wide area distributed storage systemAn introduction and evaluations of  a wide area distributed storage system
An introduction and evaluations of a wide area distributed storage systemHiroki Kashiwazaki
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Databricks
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Databricks
 
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...Виталий Стародубцев
 
Accelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheAccelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheNicolas Poggi
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Databricks
 
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...Glenn K. Lockwood
 
Flume and Hadoop performance insights
Flume and Hadoop performance insightsFlume and Hadoop performance insights
Flume and Hadoop performance insightsOmid Vahdaty
 
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...Amazon Web Services
 
Maximizing EC2 and Elastic Block Store Disk Performance
Maximizing EC2 and Elastic Block Store Disk PerformanceMaximizing EC2 and Elastic Block Store Disk Performance
Maximizing EC2 and Elastic Block Store Disk PerformanceAmazon Web Services
 

Similar to Galaxy CloudMan performance on AWS (20)

Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
 
Perf Vsphere Storage Protocols
Perf Vsphere Storage ProtocolsPerf Vsphere Storage Protocols
Perf Vsphere Storage Protocols
 
LUG 2014
LUG 2014LUG 2014
LUG 2014
 
(STG403) Amazon EBS: Designing for Performance
(STG403) Amazon EBS: Designing for Performance(STG403) Amazon EBS: Designing for Performance
(STG403) Amazon EBS: Designing for Performance
 
Shak larry-jeder-perf-and-tuning-summit14-part2-final
Shak larry-jeder-perf-and-tuning-summit14-part2-finalShak larry-jeder-perf-and-tuning-summit14-part2-final
Shak larry-jeder-perf-and-tuning-summit14-part2-final
 
Let's Compare: A Benchmark review of InfluxDB and Elasticsearch
Let's Compare: A Benchmark review of InfluxDB and ElasticsearchLet's Compare: A Benchmark review of InfluxDB and Elasticsearch
Let's Compare: A Benchmark review of InfluxDB and Elasticsearch
 
Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]
Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]
Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]
 
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance BarriersCeph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
 
Maximizing Amazon EC2 and Amazon EBS performance
Maximizing Amazon EC2 and Amazon EBS performanceMaximizing Amazon EC2 and Amazon EBS performance
Maximizing Amazon EC2 and Amazon EBS performance
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
 
An introduction and evaluations of a wide area distributed storage system
An introduction and evaluations of  a wide area distributed storage systemAn introduction and evaluations of  a wide area distributed storage system
An introduction and evaluations of a wide area distributed storage system
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
 
Accelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheAccelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket Cache
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
 
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
 
Flume and Hadoop performance insights
Flume and Hadoop performance insightsFlume and Hadoop performance insights
Flume and Hadoop performance insights
 
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
 
Maximizing EC2 and Elastic Block Store Disk Performance
Maximizing EC2 and Elastic Block Store Disk PerformanceMaximizing EC2 and Elastic Block Store Disk Performance
Maximizing EC2 and Elastic Block Store Disk Performance
 

More from Enis Afgan

Federated Galaxy: Biomedical Computing at the Frontier
Federated Galaxy: Biomedical Computing at the FrontierFederated Galaxy: Biomedical Computing at the Frontier
Federated Galaxy: Biomedical Computing at the FrontierEnis Afgan
 
From laptop to super-computer: standardizing installation and management of G...
From laptop to super-computer: standardizing installation and management of G...From laptop to super-computer: standardizing installation and management of G...
From laptop to super-computer: standardizing installation and management of G...Enis Afgan
 
Horizontal scaling with Galaxy
Horizontal scaling with GalaxyHorizontal scaling with Galaxy
Horizontal scaling with GalaxyEnis Afgan
 
Endofday: A Container Workflow Engine for Scalable, Reproducible Computation
Endofday: A Container Workflow Engine for Scalable, Reproducible ComputationEndofday: A Container Workflow Engine for Scalable, Reproducible Computation
Endofday: A Container Workflow Engine for Scalable, Reproducible ComputationEnis Afgan
 
2016 07 - CloudBridge Python library (XSEDE16)
2016 07 - CloudBridge Python library (XSEDE16)2016 07 - CloudBridge Python library (XSEDE16)
2016 07 - CloudBridge Python library (XSEDE16)Enis Afgan
 
2017.07.19 Galaxy & Jetstream cloud
2017.07.19 Galaxy & Jetstream cloud2017.07.19 Galaxy & Jetstream cloud
2017.07.19 Galaxy & Jetstream cloudEnis Afgan
 
The pulse of cloud computing with bioinformatics as an example
The pulse of cloud computing with bioinformatics as an exampleThe pulse of cloud computing with bioinformatics as an example
The pulse of cloud computing with bioinformatics as an exampleEnis Afgan
 
Cloud computing and bioinformatics
Cloud computing and bioinformaticsCloud computing and bioinformatics
Cloud computing and bioinformaticsEnis Afgan
 
Adding Transparency and Automation into the Galaxy Tool Installation Process
Adding Transparency and Automation into the Galaxy Tool Installation ProcessAdding Transparency and Automation into the Galaxy Tool Installation Process
Adding Transparency and Automation into the Galaxy Tool Installation ProcessEnis Afgan
 
Enabling Cloud Bursting for Life Sciences within Galaxy
Enabling Cloud Bursting for Life Sciences within GalaxyEnabling Cloud Bursting for Life Sciences within Galaxy
Enabling Cloud Bursting for Life Sciences within GalaxyEnis Afgan
 
Introduction to Galaxy and RNA-Seq
Introduction to Galaxy and RNA-SeqIntroduction to Galaxy and RNA-Seq
Introduction to Galaxy and RNA-SeqEnis Afgan
 
IRB Galaxy CloudMan radionica
IRB Galaxy CloudMan radionicaIRB Galaxy CloudMan radionica
IRB Galaxy CloudMan radionicaEnis Afgan
 
GCC 2014 scriptable workshop
GCC 2014 scriptable workshopGCC 2014 scriptable workshop
GCC 2014 scriptable workshopEnis Afgan
 
Data analysis with Galaxy on the Cloud
Data analysis with Galaxy on the CloudData analysis with Galaxy on the Cloud
Data analysis with Galaxy on the CloudEnis Afgan
 
Galaxy workshop
Galaxy workshopGalaxy workshop
Galaxy workshopEnis Afgan
 
CloudMan workshop
CloudMan workshopCloudMan workshop
CloudMan workshopEnis Afgan
 

More from Enis Afgan (16)

Federated Galaxy: Biomedical Computing at the Frontier
Federated Galaxy: Biomedical Computing at the FrontierFederated Galaxy: Biomedical Computing at the Frontier
Federated Galaxy: Biomedical Computing at the Frontier
 
From laptop to super-computer: standardizing installation and management of G...
From laptop to super-computer: standardizing installation and management of G...From laptop to super-computer: standardizing installation and management of G...
From laptop to super-computer: standardizing installation and management of G...
 
Horizontal scaling with Galaxy
Galaxy CloudMan performance on AWS

  • 1. Galaxy CloudMan performance on AWS
  Enis Afgan and Mo Heydarian, Nov 17, 2016
  • 2. The past 5+ years
  - System architecture centered around NFS
  - NFS was deemed a bottleneck
  - Workarounds:
    - Launch larger instances to compensate
    - Create multiple clones of a cluster and split the workload
  (Diagram: master instance serving an EBS volume over NFS to workers 1..n, plus the web interface)
  • 3. Choices need to be made before launch
  - At launch: instance type → instance size → disk type [→ IOPS]
  - At runtime: worker instance type → worker instance size → number of workers
  (launch.usegalaxy.org)
  • 4. Infrastructure options & AWS variables
  Instances
  - Type: m(3|4), c(3|4), r3, x1
  - Size: (x|2x|4x|8x|10x|16x|32x)?large
  - Network throughput tier: Low, Moderate, High, 10 Gigabit, 20 Gigabit
  - Enhanced networking (if supported)
  - EBS-optimized (if supported)
  - Instance EBS throughput
  Disk
  - AWS EFS, or EBS volume type: gp2, io1, st1
  - IOPS: 100 to 20,000, with burst credits, counted in 16 KiB blocks (1 MiB for st1)
  - Throughput: 20 to 500 MB/s, with burst credits; requires a suitable instance type
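The gp2 IOPS figures that appear throughout the tests follow AWS's simple provisioning rule for gp2 volumes: 3 IOPS per provisioned GiB, floored at 100 and (at the time) capped at 10,000. A minimal sketch:

```python
def gp2_baseline_iops(size_gib):
    """Baseline IOPS for an AWS gp2 volume: 3 IOPS per GiB,
    floored at 100 and capped at 10,000 (limits as of late 2016)."""
    return max(100, min(3 * size_gib, 10000))

print(gp2_baseline_iops(2000))  # the 2 TB test volumes -> 6000
```

This matches the 2 TB gp2 volumes used in the upload tests later, which AWS provisions at 6,000 IOPS.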
  • 5. Goal
  Empirically deduce some conclusions about the performance issues.
  How?
  - Collect base infrastructure performance values
  - Run a number of tests simulating real-world scenarios
  - Implement conclusive and general findings
  - Update documentation for the scenario-based findings
  - See if analyses have an 'execution signature'
  • 6. Terminology
  Throughput
  - The speed at which data is transferred to or from storage
  - Measured in MB/s
  IOPS
  - Input/Output Operations Per Second
  - How often a device can perform IO tasks
  Latency
  - How long it takes for a device to start an IO task
  - Measured in fractions of a second (not captured in these slides)
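The two metrics are linked by the IO block size: throughput = IOPS × block size. A quick sketch (the 128 KiB block size is an assumption inferred from the fio numbers on the following slides, not stated in the deck):

```python
def throughput_mib_s(iops, block_kib):
    """Sequential throughput implied by an IOPS figure at a given IO block size."""
    return iops * block_kib / 1024  # KiB/s -> MiB/s

print(throughput_mib_s(480, 128))  # ~60 MiB/s, the c4.large gp2 result below
```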
  • 7. Test plan: scenarios
  - Test baseline resource performance: run synthetic benchmarks with varying infrastructure options
  - Test job submission rate: run ~20 batch workflows, each using many tools
  - Test data throughput: run a long workflow with real-world data
  • 8. Synthetic tests
  Run a number of generic benchmark tools on a broader set of resources to get baseline resource performance.
  Benchmarking tools
  - fio: Flexible I/O tester
  - iPerf3: measures network throughput
  Test variables
  - Instance types, instance sizes, volume types, network speed, (IO block size)
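The slides do not give the exact benchmark invocations; a fio job file along these lines would produce the kind of numbers shown next (the path and parameters are illustrative, not the authors' actual settings):

```ini
; Hypothetical fio job file -- the exact parameters used are not in the slides.
[global]
directory=/mnt/galaxy    ; the NFS-backed Galaxy filesystem (path assumed)
size=4g
runtime=60
time_based
direct=1                 ; bypass the page cache to hit the device
ioengine=libaio

[seq-read]
rw=read
bs=128k                  ; large blocks to measure MB/s throughput

[rand-write]
stonewall                ; start only after seq-read finishes
rw=randwrite
bs=16k                   ; matches the 16 KiB block EBS uses to count IOPS
```

Run with `fio jobfile.fio`; the network side pairs `iperf3 -s` on one node with `iperf3 -c <server-ip>` on another.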
  • 9. Base infrastructure disk performance (fio)
  - c4.large: gp2 ~60 MB/s, ~480 IOPS *
  - c4.xlarge: gp2 ~90 MB/s, ~720 IOPS *
  - c4.2xlarge: gp2 ~120 MB/s, ~950 IOPS *
  - c4.4xlarge: gp2 ~160 MB/s, ~1,300 IOPS *; io1 (16,000 PIOPS) ~320 MB/s, ~2,600 IOPS; st1 ~70 MB/s, ~550 IOPS
  - c4.8xlarge: gp2 ~160 MB/s, ~1,300 IOPS
  - c3.xlarge: gp2 ~120 MB/s, ~950 IOPS
  - c3.4xlarge (EBS optimized): gp2 ~160 MB/s, ~1,300 IOPS; transient ~370 MB/s, ~2,900 IOPS
  - x1.32xlarge: gp2 ~160 MB/s, ~1,300 IOPS
  - m1.xxlarge on Jetstream: ~900 MB/s, ~7,000 IOPS; ~780 MB/s, ~6,000 IOPS
  * Source: Ilya Sytchev
  • 10. Base infrastructure network performance (iPerf3), master to a c4.4xlarge worker
  - c4.large: ~625 Mbps (80 MB/s) *
  - c4.xlarge: ~1.25 Gbps (160 MB/s) *
  - c4.2xlarge: ~2.5 Gbps (320 MB/s) *
  - c4.4xlarge: ~5 Gbps (640 MB/s) *
  - c4.8xlarge: ~5 Gbps (640 MB/s); 10 Gigabit driver not configured
  * Source: Ilya Sytchev
  • 11. Tiered network performance (iPerf3)
  - Moderate (c3.xlarge): ~1.15 Gbps (140 MB/s)
  - High (c4.xlarge): ~5 Gbps (640 MB/s)
  - 10 Gigabit (c4.8xlarge): ~5 Gbps (640 MB/s); driver not configured
  - AWS defines a set of network speed tiers for instance types but publishes no concrete values
  • 12. Test plan: use cases
  Sample analyses
  - RNA-seq (mouse)
  - ChIP-seq (human)
  - Training workshop (simulate ~80 participants; downsampled human data, ~200 MB across 4 fastq files)
  Test data
  - Human ChIP: GSE82295, 119 compressed fastq files, ~250 GB total
  - Mouse RNA: GSE72018, 6 samples, 5 replicates, large input files (40M paired-end reads), 45 fastq files, ~1.2 TB total
  All test data and workflows are available on S3:
  - RNA-seq: https://s3.amazonaws.com/cloudman-test/tests/rna/index.html
  - ChIP-seq: https://s3.amazonaws.com/cloudman-test/tests/chIP/index.html
  - Workshop: https://s3.amazonaws.com/cloudman-test/tests/downsampled-workshop-data/index.html
  Tests were run on the 16.11 release of Galaxy CloudMan.
  • 13. Test plan: infrastructure
  - Many-node cluster: master c4.4xlarge ($0.838/hr); 16x r3.2xlarge (8 cores, 61 GB RAM each, ~8 GB/core); 16 x $0.665/hr = $10.64/hr
  - Large-node cluster: master c4.4xlarge ($0.838/hr); 2x m4.16xlarge (64 cores, 256 GB RAM each, 4 GB/core); 2 x $3.83/hr = $7.66/hr
  - Single large instance: 1x x1.32xlarge (128 cores, 1,952 GB RAM, ~15 GB/core); $13.338/hr
  - Disk in all cases: io1, 6 TB, 20,000 IOPS, 320 MB/s ($2.80/hr)
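The per-hour figures above can be cross-checked with a small helper (prices as quoted on this slide, on-demand, late 2016; the totals below include the master, which the slide lists separately from the worker subtotal):

```python
# On-demand hourly prices quoted on this slide (USD).
PRICES = {"c4.4xlarge": 0.838, "r3.2xlarge": 0.665,
          "m4.16xlarge": 3.83, "x1.32xlarge": 13.338}

def cluster_cost_per_hr(nodes):
    """nodes: mapping of instance type -> count."""
    return sum(PRICES[t] * n for t, n in nodes.items())

many_node = cluster_cost_per_hr({"c4.4xlarge": 1, "r3.2xlarge": 16})
large_node = cluster_cost_per_hr({"c4.4xlarge": 1, "m4.16xlarge": 2})
print(round(many_node, 3), round(large_node, 3))  # 11.478 8.498
```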
  • 14. Test matrix: total runtime
  Many-node cluster
  - RNA-Seq: upload 3 hrs 40 min; FASTQ Trimmer aborted after ~30 hrs of runtime
  - ChIP-Seq: upload 1 hr 20 min; runtime 4 hrs 10 min
  - Workshop: runtime 26 min
  Large-node cluster
  - RNA-Seq: upload 3 hrs 40 min; FASTQ Trimmer aborted after ~17 hrs; Trimmomatic 25 hrs 40 min
  - ChIP-Seq: upload 1 hr 20 min; runtime 4 hrs
  - Workshop: runtime 17 min
  Single large instance
  - RNA-Seq: upload 1 hr 40 min; Trimmomatic 31 hrs
  - ChIP-Seq: upload ~7 min; runtime 3 hrs 30 min
  - Workshop: runtime 16 min
  • 15. Upload time across instance and volume types
  Uploading 119 compressed fastq files totalling ~250 GB from S3:
  - c4.4xlarge: 40 min (io1, 2 TB, 16,000 IOPS)
  - m4.10xlarge: 45 min (gp2, 2 TB, 6,000 IOPS)
  - x1.16xlarge: 15 min (gp2, 2 TB, 6,000 IOPS)
  - c4.xlarge: est. 4 hrs; 120 GB transferred after 2 hrs (gp2, 2 TB, 6,000 IOPS)
  CVMFS priming time (mm10 reference, 4.5 GB):
  - m4.2xlarge: 15 min
  - x1.32xlarge: 10 min
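For context, those upload times translate into average transfer rates as follows (a rough back-of-the-envelope helper; 1 GB taken as 10^9 bytes):

```python
def avg_mb_s(gb_transferred, minutes):
    """Average transfer rate implied by an upload time."""
    return gb_transferred * 1000 / (minutes * 60)

print(round(avg_mb_s(250, 40)))  # c4.4xlarge: ~104 MB/s
print(round(avg_mb_s(250, 15)))  # x1.16xlarge: ~278 MB/s
```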
  • 16. ChIP-Seq execution signature (large-node cluster)
  (Charts: master network speed (GB/min) and master CPU & memory utilization (%) over time)
  • 17. ChIP-Seq execution signature
  (Charts: master network and CPU & memory utilization, large-node vs. many-node cluster)
  • 18. Real-world workshop performance
  - ~60 participants running the ChIP-Seq workflow
  - Figures are for the master instance (c4.4xlarge) with a 2 TB gp2 volume
  (Charts: network speed (GB/min), utilization (%), and job counts)
  • 19. Real-world workshop performance (cont.)
  - Load on the workers (m4.16xlarge)
  (Charts: CPU & memory utilization and network speed for workers w1, w2, and w3)
  • 20. Puzzling 130,000 B volume throughput?
  (Charts: gp2 and io1 volume throughput vs. the transient disk on an x1.32xlarge instance)
  • 22. Test matrix: approximate cost
  Costs are approximate, based on the cluster specs listed earlier and the time taken:
  - Many-node cluster: RNA-Seq N/A; ChIP-Seq $90; Workshop $15
  - Large-node cluster: RNA-Seq $360; ChIP-Seq $70; Workshop $12
  - Single large instance: RNA-Seq $530; ChIP-Seq $65; Workshop $16
  • 23. Comparison with other readily available resources
  - Fastest cloud option: ChIP-Seq upload ~7 min, runtime 3 hrs 30 min; Workshop 16 min
  - Galaxy Main: ChIP-Seq upload between 8 and 14 hrs, runtime failed (no disk space); Workshop N/A
  - Jetstream: ChIP-Seq upload 25 min, ran out of disk space; with a 1 TB volume, 25 hrs 20 min; Workshop 3 hrs
  • 24. Issues
  - Trouble creating /mnt/transient_nfs/slurm/slurm.conf: [Errno 11] Resource temporarily unavailable
  - Volume throughput stuck at 130,000 B/min?!
  • 25. Outcomes and observations
  - Enabled multi-core jobs (this indirectly reduces total node memory requirements)
  - Enabled enhanced networking on the base image
  - Promote the use of EBS-optimized instances
  - Set the CVMFS cache to ~30 GB
  - Master instance hardware is not really a bottleneck, but a larger instance is needed to ensure network throughput
  - AWS EBS throughput is a bottleneck? Need to use a larger instance type (see above)
  - NFS is a problem for many-node clusters
  • 26. Recommendations
  - Use a smaller number of larger instances, even a single instance
  - Enable the EBS-optimized flag, or use an instance type that enables it by default (c4, m4)
  - No need to use provisioned-IOPS (io1) volumes
  - If using provisioned-IOPS volumes, match them with appropriate instance EBS throughput
    - E.g., a PIOPS volume's max throughput is 320 MB/s; c4.8xlarge has 500 MB/s of EBS throughput, while c4.4xlarge offers 250 MB/s
  - Suggested configurations
    - c4.4xlarge master with m4.10xlarge workers (use 1 worker per 10 workshop participants or jobs)
    - For temporary clusters, use a c3/m3/x1 master and transient storage
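The "match the volume with the instance" point reduces to taking the minimum of the two throughput ceilings. A sketch using the figures quoted above:

```python
VOLUME_MAX_MB_S = 320  # max throughput of a PIOPS (io1) volume, as quoted above

def effective_ebs_throughput(instance_ebs_mb_s, volume_mb_s=VOLUME_MAX_MB_S):
    """Effective throughput is capped by both the volume and the
    instance's dedicated EBS bandwidth."""
    return min(instance_ebs_mb_s, volume_mb_s)

print(effective_ebs_throughput(250))  # c4.4xlarge: the instance is the cap -> 250
print(effective_ebs_throughput(500))  # c4.8xlarge: the volume is the cap -> 320
```

In other words, paying for a 320 MB/s volume behind a c4.4xlarge wastes 70 MB/s of what the volume could deliver.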
  • 27. Future directions
  - Update the docs (after the new wiki is live): recommended configurations; how to use Jetstream volumes
  - Include results from this work in the build/runtime environment
  - Update instance types on the launcher
  - Tune NFS
  - Figure out how to measure utilized disk throughput
  - Explore AWS EFS
  • 28. A big thank you to Mo for creating all the workflows, finding the data, and watching over the running jobs.
  • 29. ChIP-Seq execution signature (many-node cluster)
  (Charts: master network speed (GB/min) and CPU & memory utilization (%))