Enabling Research Using Cloud Computing

P U B L I C S E C T O R
S U M M I T
O T T A W A

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Sanjay Padhi, AWS Research Initiatives
Paul Astell, Communications Research Centre, Canada

Why do researchers use AWS?
• Time to Science: Access research infrastructure in minutes
• Low Cost: Pay-as-you-go pricing
• Elastic: Easily add or remove capacity
• Globally Accessible: Easily collaborate with researchers around the world
• Secure: A collection of tools to protect data and privacy
• Scalable: Access to AWS global infrastructure

AWS Global Infrastructure
20 Regions, 61 Availability Zones, 155 Edge Locations and 11 Regional Edge Caches in 65 cities across 29 countries
Region & Number of Availability Zones
• AWS GovCloud (US): US-East (3), US-West (3)
• US West: Oregon (4), Northern California (3)
• US East: N. Virginia (6), Ohio (3)
• Canada: Central (2)
• South America: São Paulo (3)
• Europe: Ireland (3), Frankfurt (3), London (3), Paris (3), Stockholm (3)
• Asia Pacific: Singapore (3), Sydney (3), Tokyo (4), Seoul (2), Mumbai (2), Osaka-Local (1)
• China: Beijing (2), Ningxia (3)
New Regions: Bahrain, Cape Town, Hong Kong, Jakarta and Milan

Evolution in compute services
Virtual server hosting, container management, and serverless computing
• Virtual Servers (Amazon EC2): Provides resizable cloud-based compute capacity in the form of EC2 instances, which are equivalent to virtual servers
• Resource Isolation (Amazon EC2 Container Service): A highly scalable, high performance container management service
• Serverless Computing (AWS Lambda): Run code without thinking about servers. Serverless compute for stateless code execution in response to triggers
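To make "stateless code execution in response to triggers" concrete, here is a minimal sketch of the kind of Python handler AWS Lambda invokes. The `(event, context)` signature is Lambda's Python convention. The sample event mimics the shape of an S3 object notification, and the bucket and key names are invented for illustration; none of this comes from the talk itself.

```python
import json


def lambda_handler(event, context):
    """Stateless handler: everything it needs arrives in `event`."""
    # Each record describes one object that triggered the invocation.
    objects = [
        (rec["s3"]["bucket"]["name"], rec["s3"]["object"]["key"])
        for rec in event.get("Records", [])
    ]
    # Return a small JSON summary; no state is kept between invocations.
    return {"statusCode": 200, "body": json.dumps({"objects": objects})}


if __name__ == "__main__":
    # Hypothetical trigger payload, trimmed to the fields the handler reads.
    sample_event = {
        "Records": [
            {"s3": {"bucket": {"name": "example-bucket"},
                    "object": {"key": "runs/sample-001.fastq"}}}
        ]
    }
    print(lambda_handler(sample_event, None))
```

Because the handler keeps nothing between calls, the service can run as many copies in parallel as there are incoming triggers.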

Diversified compute options: instance types – CPU, GPUs, FPGAs, …
• General purpose: M5, M4
• Compute optimized: C5, C4
• Storage and IO optimized: I3, H1, D2
• Accelerated computing: P3, P2, G3, F1
• Memory optimized: R4, X1/e, R3
M5.24xlarge
• 96 vCPU
• 384GB RAM
• Up to 25 Gbps n/w
• EBS only
• 9k EBS Mbps
• New Nitro light hypervisor + dedicated h/w
C5.18xlarge
• 72 vCPU
• 144GB RAM
• EBS only
• 9k EBS Mbps
• Up to 25 Gbps w/ENA
T2.2xlarge
• 8 vCPU
• 32GB RAM
• EBS only
• 81 CPU credits/hr
X1e.32xlarge
• 128 vCPU
• 4TB RAM
• 2 x 1.9TB SSD
• 14k EBS Mbps
R4.16xlarge
• 64 vCPU
• 488GB RAM
• SSD EBS
• 25 Gbps
H1.16xlarge
• 64 vCPU
• 256GB RAM
• 8 x 2TB HDD
• 25 Gbps
I3.16xlarge
• 64 vCPU
• 488GB RAM
• 8 x 2TB NVMe SSD
• 25 Gbps
I3.metal (preview)
• 36 cores/72
• 512GB RAM
• 8 x 2TB NVMe SSD
• 25 Gbps
D2.8xlarge
• 36 vCPU
• 256GB RAM
• 8 x 2TB HDD
• 25 Gbps
P3.16xlarge
• 8 GPU Tesla V100
• 5k CUDA/640 Tensor cores
• 488GB RAM
• 64GB GPU RAM
• NVLink p-2-p
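One way to read the instance specifications above is memory per vCPU, which separates the compute-optimized, general-purpose, and memory-optimized families. The sketch below uses only the vCPU and RAM figures listed on this slide. Treating the X1e's "4TB" as 4096 GB is an assumption made for the arithmetic.

```python
# (vCPUs, RAM in GB) as listed on the slide.
INSTANCES = {
    "C5.18xlarge": (72, 144),
    "M5.24xlarge": (96, 384),
    "T2.2xlarge": (8, 32),
    "R4.16xlarge": (64, 488),
    "X1e.32xlarge": (128, 4096),  # slide says 4TB; assumed 1 TB = 1024 GB
}


def gb_per_vcpu(name):
    """Return the RAM-to-vCPU ratio for one instance type."""
    vcpus, ram_gb = INSTANCES[name]
    return ram_gb / vcpus


if __name__ == "__main__":
    # Print the instances from the lowest to the highest ratio.
    for name in sorted(INSTANCES, key=gb_per_vcpu):
        print(f"{name:14s} {gb_per_vcpu(name):6.1f} GB per vCPU")
```

On these figures the compute-optimized C5 sits at 2 GB per vCPU, the general-purpose M5 and T2 at 4 GB, and the memory-optimized R4 and X1e well above that.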

Evolutions in hardware accelerators
CPUs vs GPUs vs FPGA for Compute
CPU
• 10s-100s of processing cores
• Pre-defined instruction set and data path widths
• Optimized for general-purpose computing
GPU
• 1,000s of processing cores
• Pre-defined instruction set and data path widths
• Highly effective at parallel execution
FPGA
• Millions of programmable digital logic cells
• No predefined instruction set or data path widths
• Hardware timed execution

Amazon P3: GPU Compute Instance
• Up to 8 NVIDIA V100 GPUs in a single instance, with NVLink for peer-to-peer GPU communication
• Supports a wide variety of use cases including deep learning, HPC, financial computing, and batch rendering
Amazon G3: GPU Graphics Instance
• Up to 4 NVIDIA M60 GPUs, with GRID Virtual Workstation features and licenses
• Designed for workloads such as 3D rendering, 3D visualizations, graphics-intensive remote workstations, video encoding, and virtual reality applications
Amazon F1: FPGA Instance
• Up to 8 Xilinx Virtex® UltraScale+™ VU9P FPGAs in a single instance
• Programmable via hardware description languages (VHDL, Verilog) or OpenCL
• Growing marketplace of pre-built application accelerations
• Designed for hardware-accelerated applications including financial computing, genomics, accelerated search, and image processing

Large Hadron Collider
The Large Hadron Collider at CERN involves 6,000+ researchers from over 40 countries and produces approximately 25 PB of data each year.
The ATLAS and CMS experiments are using AWS for Monte Carlo simulations, processing, and analysis of LHC data.
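For a sense of scale, the 25 PB per year quoted above can be converted into an average sustained data rate. This is a back-of-the-envelope sketch only: it assumes decimal petabytes (10^15 bytes) and a 365-day year, and real data taking is bursty rather than uniform.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
BYTES_PER_PB = 10**15  # assumption: decimal petabytes


def average_rate_mb_per_s(petabytes_per_year):
    """Average rate in MB/s if the yearly volume were spread evenly."""
    bytes_per_second = petabytes_per_year * BYTES_PER_PB / SECONDS_PER_YEAR
    return bytes_per_second / 10**6


if __name__ == "__main__":
    print(f"{average_rate_mb_per_s(25):.0f} MB/s on average")
```

Under these assumptions the yearly volume corresponds to roughly 0.8 GB of new data every second, around the clock.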

Country: Tier 1 Site
• Canada: TRIUMF
• Germany: KIT
• Spain: PIC
• France: IN2P3
• Italy: INFN
• Nordic countries: Nordic Datagrid Facility
• Netherlands: NIKHEF / SARA
• Republic of Korea: GSDC at KISTI
• Russian Federation: RRC-KI and JINR
• Taipei: ASGC
• United Kingdom: GridPP
• US: Fermilab-CMS
• US: BNL ATLAS

Streaming data: interactions every 25 nanoseconds; occupancy (finding patterns)

Clouds provided elasticity in computing

Research using hybrid cloud computing
On demand auto-expansion to AWS
~60,000 slots using AWS spot instances. A factor of 5 larger than Fermilab capacity!
https://aws.amazon.com/blogs/aws/experiment-that-discovered-the-higgs-boson-uses-aws-to-probe-nature/

HTCondor Annex for elasticity
Available in AWS Marketplace
https://aws.amazon.com/marketplace/search/results?searchTerms=htcondor

Natural Language Processing
1.1 Million vCPUs & Amazon EC2 Spot Instances
https://aws.amazon.com/blogs/aws/natural-language-processing-at-clemson-university-1-1-million-vcpus-ec2-spot-instances/

Children's Hospital of Philadelphia and Edico Genome:
Achieved fastest-ever analysis of 1,000 genomes
GUINNESS WORLD RECORDS:
• Title for fastest time to analyze 1,000 human genomes
• Used Amazon EC2 F1 instances
• The study was completed in two hours and twenty-five minutes

Hubble space imagery on AWS:
28 years of data now available in the cloud for research
https://aws.amazon.com/blogs/publicsector/hubble-space-imagery-on-aws-28-years-of-data-now-available-in-the-cloud/

AWS helps researchers ‘see’ neutrinos
https://aws.amazon.com/blogs/aws/nova-uses-aws-to-shed-light-on-neutrino-mysteries/
Neutrinos: ghosts of the universe - researchers use AWS to detect particles

The Amazon ML stack: Broadest & deepest set of capabilities
AI SERVICES
• Vision: Rekognition Image, Rekognition Video, Textract
• Speech: Polly, Transcribe
• Language: Translate, Comprehend & Comprehend Medical
• Chatbots: Lex
• Forecasting: Forecast
• Recommendations: Personalize
ML SERVICES
Amazon SageMaker (Build, Train, Deploy)
• Pre-built algorithms & notebooks
• Data labeling (Ground Truth)
• One-click model training & tuning
• Optimization (Neo)
• One-click deployment & hosting
• Reinforcement learning
• Algorithms & models (AWS Marketplace for Machine Learning)
ML FRAMEWORKS & INFRASTRUCTURE
• Frameworks
• Interfaces
• Infrastructure: EC2 P3 & P3dn, EC2 C5, FPGAs, Greengrass, Elastic Inference

Machine Learning - Amazon SageMaker
Pre-built notebooks for common problems
Built-in, high-performance algorithms
• Supervised learning
• Unsupervised learning
• Reinforcement learning
ALGORITHMS
• K-Means Clustering
• Principal Component Analysis
• Neural Topic Modeling
• Factorization Machines
• Linear Learner - Regression
• Linear Learner - Classification
• XGBoost
• Latent Dirichlet Allocation
• Seq2Seq
FRAMEWORKS
• Apache MXNet
• TensorFlow
• Caffe2, CNTK, PyTorch, Torch
Workflow
• Set up and manage environments for training
• Train and tune model (trial and error)
• Deploy model in production
• Scale and manage the production environment

Automatic grading of diabetic retinopathy through deep learning using AWS

Machine learning for improving disaster management and response using AWS
Figure panels: Hurricane Irma predicted path; Hurricane Irma real path. Source: Weather Channel
AWS re:Invent: Machine Learning for Improving Disaster Management and Response

Machine learning for improving disaster management and response using AWS
https://arxiv.org/abs/1806.07378

Insights into koala genome

UK: NHS patient records to be stored in AWS cloud platform
https://www.ukauthority.com/articles/nhs-digital-builds-data-services-platform/

Democratization of High Performance Computing (HPC)
Results from a collaboration between:
The National Microbiology Laboratory, Public Health Agency of Canada
and the Communications Research Centre, Innovation, Science and Economic Development
15 May 2019

The Collaborators
• The Communications Research Centre: The Government of Canada’s client-driven applied research centre for advanced telecommunications. Canada’s innovator in wireless telecommunications focused on what is possible and what works.
• The Public Health Agency’s National Microbiology Laboratory, Canada’s only Level 4 biocontainment laboratory, has a focus on preventing, monitoring, detecting and responding to public health disease threats. Novel approaches are continually developed and applied.

The Problem
The Public Health Agency’s (PHAC) National Microbiology Laboratory (NML) experiences significant bursts of compute activity, maxing out its on-premise HPC infrastructure
• Can the cloud be used to extend their HPC data centre?
• Can cloud-based HPC be cost effective?
• Can CRC and NML use cloud-based HPC to do real science?
Our 6-Week Challenge
Migrate part of the NML HPC architecture into CRC’s VRD (Virtual Research Domain)
• Using Amazon Web Services
• HPC proof of concept
• Real world use cases with public health significance
• Benchmark with on-premise system

Three Phase Approach
• Phase 1 – Lift and shift (~1-2 weeks)
• Phase 2 – Optimize (~2-3 weeks)
• Phase 3 – Measure (1 week)
Total: 6 weeks

Team Composition
• Small interdisciplinary team of Computer Systems (CS), Biologist (BI), and Research Engineer (ENG)
• Agile approach with weekly sprints

Phase 1 – Existing NML On-premise HPC
• 7,000 CPU
• 40 TB RAM
Diagram components: Cluster, Data Store, Interface

Phase 2 – Scaling – Insufficient Capacity?
Error: Insufficient Instance Capacity.
We currently do not have sufficient capacity in the Availability Zone (AZ) you requested
(Diagram: one AWS Region containing four AZs)
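A common response to this error is to retry the same request in another Availability Zone of the Region. The sketch below simulates that fallback in plain Python. It is an illustration of the idea, not the team's implementation: `launch`, the zone names, and the capacity numbers are all hypothetical, and no AWS API is called.

```python
class InsufficientCapacity(Exception):
    """Stand-in for the 'Insufficient Instance Capacity' error on the slide."""


def launch_with_fallback(zones, count, launch):
    """Try each Availability Zone in turn until the request is accepted.

    `launch(zone, count)` is a hypothetical callable that either returns
    a list of instance IDs or raises InsufficientCapacity.
    """
    failed = []
    for zone in zones:
        try:
            return zone, launch(zone, count)
        except InsufficientCapacity:
            failed.append(zone)  # remember the zone and move on to the next
    raise InsufficientCapacity(f"no capacity in any of: {failed}")


if __name__ == "__main__":
    # Simulated free capacity per zone (made-up numbers).
    free = {"az-a": 10, "az-b": 200, "az-c": 50}

    def fake_launch(zone, count):
        if free[zone] < count:
            raise InsufficientCapacity(zone)
        free[zone] -= count
        return [f"{zone}-i{n}" for n in range(count)]

    zone, ids = launch_with_fallback(["az-a", "az-b", "az-c"], 100, fake_launch)
    print(zone, len(ids))
```

Spreading requests across zones this way is one reason a multi-AZ design helps with large bursts, at the price of data having to be reachable from every zone.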

Phase 2 – Scaling Mechanism
AWS auto-scaling vs custom
• AWS auto-scaling: inefficient scaling
• Custom: improved scaling

Phase 2 – Data Store Compare – Multi AZ
Data store speed and cost – multi AZ


Phase 2 – Final Architecture
• Multi AZ
• Custom scaling
• S3 for data store

Phase 3 – Real Use Cases
• Foodborne outbreak (baking flour recall)
• Antimicrobial resistance genes detection (MCR-1)
• Data publicly available

Phase 3 - Benchmarks
Two benchmarks
• 10K sample simulation (10TB)
• 100K sample simulation (100TB)
The benchmarks were each run on:
• On premise
• Cloud - lift and shift (10K only due to cost)
• Cloud - optimized

Phase 3 – Run Times
                       10K time (h:mm)   100K time (h:mm)
On premise             0:25              3:36
Cloud - lift & shift   1:05              --
Cloud - optimized      0:07              0:26
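The run times above can be turned into speedup factors relative to the on-premise system. The sketch below uses only the figures from the table; the function and dictionary names are illustrative.

```python
def minutes(h_mm):
    """Convert an 'h:mm' string to whole minutes."""
    hours, mins = h_mm.split(":")
    return int(hours) * 60 + int(mins)


# Run times copied from the table; lift & shift has no 100K run.
RUN_TIMES = {
    "On premise": {"10K": "0:25", "100K": "3:36"},
    "Cloud - lift & shift": {"10K": "1:05"},
    "Cloud - optimized": {"10K": "0:07", "100K": "0:26"},
}


def speedup(benchmark, baseline="On premise", candidate="Cloud - optimized"):
    """How many times faster `candidate` is than `baseline`."""
    base = minutes(RUN_TIMES[baseline][benchmark])
    cand = minutes(RUN_TIMES[candidate][benchmark])
    return base / cand


if __name__ == "__main__":
    for bench in ("10K", "100K"):
        print(f"{bench}: optimized cloud is {speedup(bench):.1f}x faster")
    ls = speedup("10K", candidate="Cloud - lift & shift")
    print(f"10K: lift & shift runs at {ls:.2f}x on-premise speed")
```

On these numbers the optimized cloud setup is about 3.6 times faster than on premise for the 10K benchmark and about 8.3 times faster for 100K, while the unoptimized lift and shift is slower than on premise.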

Phase 3 – Cost
Base                    Lift & Shift   Optimized
Base storage (100 TB)   $1,005/day     $75/day
Base CPU                $67/day        $55/day
Base total              $1,072/day     $130/day
Burst
Burst CPU - 10K         $73            $62
Burst CPU - 100K        --             $220
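As a quick check on the cost table, the sketch below recomputes the base totals from the storage and CPU rows and expresses the drop from lift and shift to the optimized setup as a percentage. Only figures from the table are used; the names are illustrative.

```python
# Base daily costs in dollars, copied from the table.
BASE_PER_DAY = {
    "Lift & Shift": {"storage": 1005, "cpu": 67},
    "Optimized": {"storage": 75, "cpu": 55},
}


def base_total(config):
    """Sum of base storage and base CPU cost per day."""
    return sum(BASE_PER_DAY[config].values())


def reduction(before, after):
    """Fractional cost reduction going from `before` to `after`."""
    return 1 - after / before


if __name__ == "__main__":
    before = base_total("Lift & Shift")
    after = base_total("Optimized")
    print(f"Base total: ${before}/day -> ${after}/day "
          f"({reduction(before, after):.0%} lower)")
    storage = reduction(BASE_PER_DAY["Lift & Shift"]["storage"],
                        BASE_PER_DAY["Optimized"]["storage"])
    print(f"Storage alone: {storage:.0%} lower")
```

The totals match the table ($1,072 and $130 per day), and most of the roughly 88% saving comes from the storage line.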

Successfully Demonstrated:
• Elements of the NML HPC system can be migrated to the cloud
• HPC systems can be optimized for cloud usage
• Cloud HPC can be cost effective
• Cloud HPC can be used for real science

Takeaways
• HPC is available to all - in the Cloud
• Cloud is scalable on demand and cost effective
• Collaborations with HPC can produce viable results and meet actual common requirements
• ‘Early wins’ are possible

Thank you!
Sanjay Padhi
sanpadhi@amazon.com
