Benchmark results for running the bioinformatics platform Galaxy on the Amazon Web Services cloud. Results cover disk types, instance types and sizes, and varying data sizes.
Galaxy CloudMan performance on AWS
1. Galaxy CloudMan performance on AWS
Enis Afgan and Mo Heydarian
Nov 17, 2016
2. The past 5+ years
- System architecture centered around NFS
- NFS was deemed a bottleneck
- Workarounds
  - Launch larger instances to compensate
  - Create multiple clones of a cluster and split the workload
[Architecture diagram: a master instance backed by EBS, serving the web and exporting storage over NFS to Worker 1..Worker n]
3. Choices need to be made before launch
At launch: Instance type → Instance size → Disk type [→ IOPS]
At runtime: Worker instance type → Worker instance size → Number of workers
launch.usegalaxy.org
4. Infrastructure options & AWS variables
Instances
- Type: m(3|4), c(3|4), r3, x1
- Size: (x|2x|4x|8x|10x|16x|32x)?large
- Network throughput: Low, Moderate, High, 10 Gigabit, 20 Gigabit
- Enhanced networking (if supported)
- EBS-optimized (if supported)
- Instance EBS throughput
Disk
- AWS EFS, or EBS volume type: gp2, io1, st1
- IOPS: 100 to 20,000, with burst credits, counted in block sizes of 16 KiB (1 MiB for st1)
- Throughput: 20 to 500 MB/s, with burst credits; requires a suitable instance type
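For gp2 volumes in particular, baseline IOPS scale with volume size. A minimal sketch of the published formula as it stood when these tests were run (3 IOPS per GiB, floored at 100 and capped at 10,000 for gp2; the 20,000 ceiling above applies to io1):

```python
def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS for a gp2 EBS volume: 3 IOPS per GiB,
    with a 100 IOPS floor and a 10,000 IOPS cap (2016-era limit)."""
    return max(100, min(3 * size_gib, 10_000))

# Volumes under ~1 TiB can also burst to 3,000 IOPS using burst credits.
for size_gib in (50, 334, 1000, 5000):
    print(f"{size_gib:>5} GiB -> {gp2_baseline_iops(size_gib):>6} baseline IOPS")
```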
5. Goal
Empirically characterize the performance issues and draw some general conclusions.
How?
- Collect baseline infrastructure performance values
- Run a number of tests simulating real-world scenarios
- Implement the conclusive, general findings
- Update documentation with the scenario-specific findings
- See if analyses have an ‘execution signature’
6. Terminology
Throughput
- The speed at which data is transferred to or from the storage
- Measured in MB/s
IOPS
- Input/Output Operations Per Second
- How often a device can perform IO tasks
Latency
- How long it takes for a device to start an IO task
- Measured in fractions of a second
- (Not captured in these slides)
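These metrics are linked: sustained throughput is roughly IOPS multiplied by the IO block size, which is why AWS counts IOPS in 16 KiB blocks (slide 4). A quick worked check, not taken from the slides:

```python
KIB = 1024

def implied_throughput_mb_s(iops: int, block_size_bytes: int) -> float:
    """Sustained throughput implied by an IOPS figure at a given block size."""
    return iops * block_size_bytes / 1_000_000  # decimal MB/s

# 10,000 IOPS at the 16 KiB accounting block size is ~164 MB/s:
print(f"{implied_throughput_mb_s(10_000, 16 * KIB):.0f} MB/s")
```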
7. Test plan: scenarios
Test baseline resource performance
- Run synthetic benchmarks with varying infrastructure options
Test job submission rate
- Run ~20 batch workflows each using many tools
Test data throughput
- Run a long workflow with real-world data
8. Synthetic tests
Run a number of generic benchmark tools across a broad set of resources to establish baseline resource performance.
Benchmarking tools
- fio: Flexible I/O tester
- iPerf3: measure network throughput
Test variables
- Instance types, instance sizes, volume types, network speed, (IO block size)
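As a concrete illustration (not the exact invocations used in these tests), a minimal Python driver that sweeps fio over a few IO block sizes and reports the measured write bandwidth; the job name, file size, and runtime are assumptions:

```python
import json
import subprocess

# Sweep fio over a few block sizes; all parameters here are illustrative.
for bs in ("4k", "16k", "1m"):
    result = subprocess.run(
        ["fio", "--name=seqwrite", "--rw=write", f"--bs={bs}",
         "--size=1G", "--ioengine=libaio", "--direct=1",
         "--runtime=30", "--time_based", "--output-format=json"],
        capture_output=True, text=True, check=True,
    )
    job = json.loads(result.stdout)["jobs"][0]
    print(f"bs={bs}: {job['write']['bw'] / 1024:.1f} MiB/s")  # fio reports KiB/s
```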
11. Tiered network performance (iPerf3)
Measured network speed, by tier (with master & worker instance type):
- Moderate (c3.xlarge): ~1.15 Gbps (140 MB/s)
- High (c4.xlarge): ~5 Gbps (640 MB/s)
- 10 Gigabit (c4.8xlarge): ~5 Gbps (640 MB/s); driver not configured
- AWS defines a set of network speed tiers for instance types but publishes no values
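A minimal sketch of how such a point-to-point measurement can be taken, assuming iperf3 is installed and a server is already running on the master (`iperf3 -s`); the hostname is a placeholder:

```python
import json
import subprocess

MASTER = "master.cluster.internal"  # placeholder hostname

# Run a 10-second TCP test against the master and parse the JSON report.
out = subprocess.run(
    ["iperf3", "-c", MASTER, "-t", "10", "--json"],
    capture_output=True, text=True, check=True,
).stdout
bps = json.loads(out)["end"]["sum_received"]["bits_per_second"]
print(f"{bps / 1e9:.2f} Gbps (~{bps / 8 / 1e6:.0f} MB/s)")
```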
12. Test plan: use cases
Sample analyses
- RNA-seq (mouse)
- ChIP-seq (human)
- Training workshop (simulate ~80 participants; use downsampled human data, ~200MB across 4 fastq files)
Test data
- Human: ChIP-seq, GSE82295, 119 compressed fastq files, total ~250GB
- Mouse: RNA-seq, 6 samples, 5 replicates, large input files (40M paired-end reads), GSE72018, 45 fastq files, total ~1.2TB
All test data and workflows are available on S3 (see the fetch sketch below)
- RNA-seq: https://s3.amazonaws.com/cloudman-test/tests/rna/index.html
- ChIP-seq: https://s3.amazonaws.com/cloudman-test/tests/chIP/index.html
- Workshop: https://s3.amazonaws.com/cloudman-test/tests/downsampled-workshop-data/index.html
Tests were done on the 16.11 release of Galaxy CloudMan.
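The index pages above are served over plain HTTP, so the data sets can be pulled without AWS credentials; a minimal sketch, using the RNA-seq index URL from the list:

```python
import urllib.request

# Index page of the RNA-seq test data set listed above.
URL = "https://s3.amazonaws.com/cloudman-test/tests/rna/index.html"
with urllib.request.urlopen(URL) as resp:
    print(resp.read().decode()[:500])  # peek at the start of the listing
```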
22. Test matrix: approximate cost
- Costs are approximate, based on the cluster spec listed earlier and the time taken

                        RNA-Seq   ChIP-Seq   Workshop
Many-node cluster       N/A       $90        $15
Large-node cluster      $360      $70        $12
Single large instance   $530      $65        $16
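These figures follow from cluster size, hourly instance price, and wall-clock time; a minimal sketch of the arithmetic (the rate and counts below are placeholders, not the actual 2016 prices):

```python
def run_cost(hourly_rate_usd: float, num_instances: int, hours: float) -> float:
    """Approximate on-demand compute cost of a run (EBS and transfer excluded)."""
    return hourly_rate_usd * num_instances * hours

# Placeholder numbers: 1 master + 9 workers at a hypothetical $2.00/hr for 2 hours.
print(f"${run_cost(2.00, 10, 2):.2f}")  # $40.00
```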
23. Comparison with other readily available resources
ChIP-Seq
- Fastest cloud option: upload ~7min; runtime 3hrs 30min
- Galaxy Main: upload between 8 and 14hrs; runtime N/A (ran out of disk space)
- Jetstream: upload 25min; runtime ran out of disk space (using a 1TB volume: 25hrs 20min)
Workshop
- Fastest cloud option: 16min
- Galaxy Main: N/A
- Jetstream: 3hrs
25. Outcomes and observations
- Enabled multi-core jobs (this indirectly reduces total node memory requirements)
- Enabled Enhanced Networking on the base image
- Promote use of EBS-optimized instances
- Set the CVMFS cache to ~30GB
- Master instance hardware components are not really a bottleneck, but a larger instance is needed to ensure network throughput
- AWS EBS throughput is a bottleneck?
  - Need to use a larger instance type (see above)
- NFS is a problem for many-node clusters
26. Recommendations
- Use a smaller number of larger instances, even a single instance
- Enable the EBS-optimized flag, or use an instance type that enables it by default (c4, m4)
- No need to use provisioned-IOPS (io1) volumes
- If using provisioned-IOPS volumes, match them with appropriate instance EBS throughput (see the sketch after this list)
  - E.g., a PIOPS volume's max throughput is 320 MB/s; c4.8xlarge has 500 MB/s of EBS throughput, while c4.4xlarge offers 250 MB/s
- Suggested configurations
  - c4.4xlarge master with m4.10xlarge workers (use 1 per 10 workshop participants or jobs)
  - For temporary clusters, use a c3/m3/x1 master and transient storage
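To make the matching rule concrete, a small sketch that flags a volume whose maximum throughput exceeds what the instance's dedicated EBS link can carry; the two instance figures and the 320 MB/s io1 cap are the ones quoted above:

```python
# Dedicated EBS throughput per instance type, in MB/s (figures from the slide).
EBS_THROUGHPUT_MB_S = {"c4.4xlarge": 250, "c4.8xlarge": 500}

IO1_MAX_THROUGHPUT_MB_S = 320  # max throughput of a provisioned-IOPS volume

def volume_is_matched(instance_type: str) -> bool:
    """True if the instance's EBS link can absorb the volume's full throughput."""
    return EBS_THROUGHPUT_MB_S[instance_type] >= IO1_MAX_THROUGHPUT_MB_S

for itype in sorted(EBS_THROUGHPUT_MB_S):
    status = "ok" if volume_is_matched(itype) else "instance EBS link is the cap"
    print(f"{itype}: {status}")
```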
27. Future directions
- Update the docs (after the new wiki is live)
  - Recommended configurations
  - How to use Jetstream volumes
- Incorporate these results into the build/runtime environment
  - Update instance types on the launcher
  - Tune NFS
- Figure out how to measure utilized disk throughput
- Explore AWS EFS
28. A big thank you to Mo for creating all the workflows, finding the data, and watching over the running jobs