This document summarizes a collaboration between the National Microbiology Laboratory (NML) of the Public Health Agency of Canada and the Communications Research Centre (CRC) to explore using cloud computing to augment on-premise high performance computing (HPC) infrastructure. As a 6-week challenge, the team migrated part of the NML's HPC architecture to CRC's Virtual Research Domain on Amazon Web Services. The goals were to see whether the cloud could cost-effectively extend NML's HPC capacity during bursts, support real scientific work, and perform comparably to the on-premise system. The team took a three-phase approach: initial migration, optimization, and measurement.
29. Democratization of High
Performance Computing (HPC)
Results from a collaboration between:
The National Microbiology Laboratory; Public Health Agency
of Canada
And the Communications Research Centre; Innovation,
Science and Economic Development
15 May 2019
30. The Collaborators
• The Communications Research Centre: The Government of
Canada’s client-driven applied research centre for advanced
telecommunications. Canada’s innovator in wireless
telecommunications focused on what is possible and what works.
• The Public Health Agency’s National Microbiology Laboratory,
Canada’s only Level 4 biocontainment laboratory, has a focus on
preventing, monitoring, detecting and responding to public health
disease threats. Novel approaches are continually developed and
applied.
31. The Problem
Public Health Agency’s (PHAC) National Microbiology Laboratory
(NML) experiences significant bursts of compute activity, maxing
out their on-premise HPC infrastructure
• Can the cloud be used to extend their HPC data centre?
• Can cloud based HPC be cost effective?
• Can CRC and NML use cloud based HPC to do real science?
32. Our 6-Week Challenge
Migrate part of the NML HPC architecture into CRC’s VRD
(Virtual Research Domain)
• Using Amazon Web Services
• HPC proof of concept
• Real world use cases with public health significance
• Benchmark with on-premise system
34. Team Composition
• Small interdisciplinary team of Computer Systems (CS),
Biologist (BI), and Research Engineer (ENG)
• Agile approach with weekly sprints
38. Phase 2 – Scaling – Insufficient Capacity?
Error: Insufficient Instance Capacity.
We currently do not have sufficient capacity in the Availability Zone (AZ) you
requested
[Diagram: an AWS Region containing four Availability Zones (AZ)]
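The slides do not say how the team worked around this error. One common response, sketched here purely as an illustration, is to retry the same launch request in the region's other Availability Zones. The `launch` callable, the `InsufficientCapacity` exception, and the zone names are all hypothetical stand-ins and are not taken from the presentation or from any AWS SDK.

```python
from typing import Callable, Sequence


class InsufficientCapacity(Exception):
    """Hypothetical stand-in for a provider's capacity error."""


def launch_with_az_fallback(launch: Callable[[str], str], zones: Sequence[str]) -> str:
    """Try each zone in order and return the result of the first successful launch."""
    failures = []
    for zone in zones:
        try:
            return launch(zone)
        except InsufficientCapacity as err:
            # Record the failure and move on to the next zone.
            failures.append(f"{zone}: {err}")
    raise InsufficientCapacity("no capacity in any zone: " + "; ".join(failures))


def fake_launch(zone: str) -> str:
    # Simulated launcher for the example: only the third zone has capacity.
    if zone != "zone-c":
        raise InsufficientCapacity("Insufficient Instance Capacity")
    return f"instance-in-{zone}"


result = launch_with_az_fallback(fake_launch, ["zone-a", "zone-b", "zone-c", "zone-d"])
print(result)
```

In a real deployment the `launch` callable would wrap the provider's instance-launch request; it is injected here so the fallback logic can be exercised without network access.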
43. Phase 3 – Real Use Cases
• Foodborne outbreak (baking flour recall)
• Antimicrobial resistance genes detection (MCR-1)
• Data publicly available
44. Phase 3 - Benchmarks
Two benchmarks
• 10K sample simulation (10TB)
• 100K sample simulation (100TB)
The benchmarks were each run on:
• On premise
• Cloud - lift and shift (10K only due to cost)
• Cloud - optimized
45. Phase 3 – Run Times
                       10K time (h:mm)   100K time (h:mm)
On premise             0:25              3:36
Cloud - lift & shift   1:05              --
Cloud - optimized      0:07              0:26
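To make the comparison concrete, the short sketch below converts the reported h:mm run times to minutes and computes each cloud configuration's speedup relative to on premise. The figures are copied from the run times above; the dictionary layout and function names are illustrative choices, not part of the original work.

```python
def to_minutes(hmm: str) -> int:
    """Convert an 'h:mm' string to whole minutes."""
    hours, minutes = hmm.split(":")
    return int(hours) * 60 + int(minutes)


# Run times as reported on the slide (lift & shift was not run at 100K).
run_times = {
    "10K": {"on_premise": "0:25", "lift_and_shift": "1:05", "optimized": "0:07"},
    "100K": {"on_premise": "3:36", "optimized": "0:26"},
}


def speedup(benchmark: str, config: str) -> float:
    """On-premise time divided by the given configuration's time (>1 means faster)."""
    times = run_times[benchmark]
    return to_minutes(times["on_premise"]) / to_minutes(times[config])


for benchmark, times in run_times.items():
    for config in times:
        if config != "on_premise":
            print(f"{benchmark} {config}: {speedup(benchmark, config):.2f}x")
```

By this arithmetic the optimized cloud setup ran roughly 3.6 times faster than on premise for the 10K benchmark and roughly 8.3 times faster for the 100K benchmark, while the unoptimized lift and shift took 2.6 times as long as on premise.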
46. Phase 3 – Cost
                        Lift & Shift   Optimized
Base
Base storage (100 TB)   $1,005/day     $75/day
Base CPU                $67/day        $55/day
Base total              $1,072/day     $130/day
Burst
Burst CPU - 10K         $73            $62
Burst CPU - 100K        --             $220
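The cost figures above can be checked and compared with a few lines of arithmetic. The sketch below uses only the dollar amounts reported on the slide, confirms that each base total equals storage plus CPU, and computes the percentage reduction from lift and shift to optimized. The variable and function names are illustrative.

```python
# Daily base costs in dollars, as reported on the slide.
base_costs = {
    "lift_and_shift": {"storage": 1005, "cpu": 67, "total": 1072},
    "optimized": {"storage": 75, "cpu": 55, "total": 130},
}

# Burst CPU cost for the 10K benchmark, in dollars per run.
burst_10k = {"lift_and_shift": 73, "optimized": 62}


def percent_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100


# Sanity check: each reported total is the sum of its components.
totals_consistent = all(
    c["storage"] + c["cpu"] == c["total"] for c in base_costs.values()
)

before, after = base_costs["lift_and_shift"], base_costs["optimized"]
storage_cut = percent_reduction(before["storage"], after["storage"])
total_cut = percent_reduction(before["total"], after["total"])
burst_cut = percent_reduction(burst_10k["lift_and_shift"], burst_10k["optimized"])

print(f"totals consistent: {totals_consistent}")
print(f"base storage reduction: {storage_cut:.1f}%")
print(f"base total reduction: {total_cut:.1f}%")
print(f"10K burst CPU reduction: {burst_cut:.1f}%")
```

Most of the saving comes from storage: the base storage cost falls by about 92.5%, which brings the base total down by about 87.9%, while the 10K burst CPU cost falls by about 15%.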
48. Successfully Demonstrated:
• Elements of the NML HPC system can be migrated to the
cloud
• HPC systems can be optimized for cloud usage
• Cloud HPC can be cost effective
• Cloud HPC can be used for real science
49. Takeaways
• HPC is available to all - in the Cloud
• Cloud is scalable on demand and cost effective
• Collaborations with HPC can produce viable results
and meet actual common requirements
• ‘Early wins’ are possible