This session features ways customers have accelerated scientific research using the AWS Cloud
Speaker: Brendan Bouffler, Scientific Computing, Amazon Web Services
Time to Science, Time to Results: Accelerating Scientific Research in the Cloud
1. AWS Government, Education, &
Nonprofits Symposium
Canberra, Australia | May 6, 2015
Time to Science, Time to Results: Accelerating Scientific Research in the Cloud
Brendan Bouffler (“boof”)
Scientific Computing Group, Amazon Web Services
2. AWS Global Impact Initiatives for Science
AWS SciCo Team
• Dedicated team focusing on Scientific Computing & Research workloads.
• Globally focussed and engaged in Big Science projects like the SKA.
• Leveraging AWS resources all over the world.
• Ensuring the cloud is able to make a disruptive impact on science.
AWS Research Grants
• Grants to initiate & support development of cloud-enabled technologies.
• Typically one-off grants of AWS resources like EC2 (compute) or S3 & EBS (storage), or more exotic resources like Kinesis & Twitter feeds.
• Frequently result in reusable resources, like AMIs or open data, which we strongly encourage.
• Lower the risk of trying the cloud.
AWS Hosted Public Datasets
• Large and globally significant datasets hosted and paid for by AWS for community use.
• Data can be quickly and easily processed with elastic computing resources in the surrounding cloud.
• AWS hopes to enable more innovation, more quickly.
• Provided in partnership with content owners, who curate the data.
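As a quick sketch of what using a Public Data Set looks like in practice: with boto (the Python interface shown later in this deck), public data needs no credentials at all. The bucket and key names below are placeholders, not a real data set.

from boto.s3.connection import S3Connection

# Public Data Sets are plain S3 buckets, so anonymous access is enough.
conn = S3Connection(anon=True)
bucket = conn.get_bucket('example-public-dataset', validate=False)
key = bucket.get_key('catalogue/part-0001.csv')   # hypothetical object
if key is not None:
    key.get_contents_to_filename('part-0001.csv')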
3. We are providing a grants pool of AWS credits and up to one petabyte of storage for an AWS Public Data Set.
The data set will initially be provided by several of the SKA’s precursor telescopes, including CSIRO’s ASKAP and ICRAR’s MWA in Australia, and KAT-7 (pathfinder to the SKA precursor telescope MeerKAT) in South Africa.
The grants are open to anyone who is making use of radio astronomical telescopes or radio astronomical data resources around the world.
The grants will be administered by the SKA, which will be looking for innovative, cloud-based algorithms and tools able to handle and process this never-ending data stream.
https://aws.amazon.com/blogs/aws/new-astrocompute-in-the-cloud-grants-program/
What AWS is doing with the SKA
4. Amazon.com in 2006: a $7B retail business, 10,000 employees, and a whole lot of servers.
In 2014, every day AWS adds enough server capacity to power this $7B enterprise.
14. Spot Market
[Chart: # CPUs vs. time, with Spot filling spare capacity]
Our ultimate space filler.
Spot Instances allow you to name your own price for spare AWS computing capacity.
Great for workloads that aren’t time-sensitive, and especially popular in research (hint: it’s really cheap).
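As a sketch of how simple this is from code, a Spot bid is a single boto call (the AMI ID and bid price below are placeholders):

import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')
# Bid $0.05/hour for two m1.small instances; they run while the Spot
# price stays at or below the bid.
requests = conn.request_spot_instances(
    price='0.05',
    image_id='ami-xxxxxxxx',      # placeholder AMI
    count=2,
    instance_type='m1.small')
print([r.id for r in requests])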
15. Cloud Growth
[Chart: aggregate # CPUs vs. time, many workloads smoothing into steady growth]
Predictable growth.
All of this makes it much easier for AWS to predict growth in aggregate demand, and hence to invest more to grow the cloud.
As a result, we’re expanding the cloud all the time, ready for more workload.
16. Time traveling workloads
[Two charts: the same number of core-hours consumed as a wide, short burst or as a narrow, week-long run]
Wall clock time: 1 hour vs. wall clock time: 1 week. Cost: equal.
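The arithmetic is worth spelling out: with per-hour pricing, cost depends only on core-hours consumed, not on how they are arranged in time. The price below is illustrative only.

price = 0.05                 # illustrative $/core/hour

narrow = 6 * (7 * 24)        # 6 cores for a week  -> 1,008 core-hours
wide = 1008 * 1              # 1,008 cores for 1 hour -> 1,008 core-hours

print(narrow * price)        # 50.4
print(wide * price)          # 50.4 -- same cost, answers arrive 168x sooner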
17. The Solution
When you only pay for what you use, you have options.
• If you’re only able to use your compute, say, 30% of the time, you only pay for that time.
1. Pocket the savings: buy chocolate, buy a spectrometer, hire a research assistant.
2. Go faster: use 3x the cores to run your jobs at 3x the speed.
3. Go large: do 3x the science, or consume 3x the data.
18. Why do researchers love using AWS?
• Time to Science: access research infrastructure in minutes.
• Low Cost: pay-as-you-go pricing.
• Elastic: easily add or remove capacity.
• Globally Accessible: easily collaborate with researchers around the world.
• Secure: a collection of tools to protect data and privacy.
• Scalable: access to effectively limitless capacity.
19. Collaboration is easier in the cloud
More time spent computing the data than moving the data.
23. C4
Intel Xeon E5-2666 v3, custom built for AWS.
Intel Haswell, 16 FLOPs per clock cycle
2.9 GHz, turbo to 3.5 GHz
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/c4-instances.html
Processor Number: E5-2666 v3
Intel® Smart Cache: 25 MiB
Instruction Set: 64-bit
Instruction Set Extensions: AVX 2.0
Lithography: 22 nm
Processor Base Frequency: 2.9 GHz
Max All Core Turbo Frequency: 3.2 GHz
Max Turbo Frequency: 3.5 GHz (available on c4.2xlarge)
Intel® Turbo Boost Technology: 2.0
Intel® vPro Technology: Yes
Intel® Hyper-Threading Technology: Yes
Intel® Virtualization Technology (VT-x): Yes
Intel® Virtualization Technology for Directed I/O (VT-d): Yes
Intel® VT-x with Extended Page Tables (EPT): Yes
Intel® 64: Yes
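Getting time on one of these is a one-liner. A sketch with boto (the AMI, key pair, and subnet IDs are placeholders; C4 instances are VPC-only, so a subnet is required):

import boto.ec2

conn = boto.ec2.connect_to_region('ap-southeast-2')
reservation = conn.run_instances(
    'ami-xxxxxxxx',                  # placeholder AMI (HVM)
    instance_type='c4.2xlarge',
    key_name='my-key',               # placeholder key pair
    subnet_id='subnet-xxxxxxxx')     # placeholder subnet; C4 needs a VPC
print(reservation.instances[0].id)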
24. boto – the AWS API
Java, Python, Ruby, PHP, Perl, Shell … and many other languages.
http://boto.readthedocs.org/en/latest/
Anything you can do in the GUI, you can do on the command line.
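For example, a minimal boto session that does what the console does, from Python (the region is just an example):

import boto
import boto.ec2

# List your S3 buckets ...
s3 = boto.connect_s3()
for bucket in s3.get_all_buckets():
    print(bucket.name)

# ... and your EC2 instances.
ec2 = boto.ec2.connect_to_region('ap-southeast-2')
for reservation in ec2.get_all_reservations():
    for instance in reservation.instances:
        print(instance.id, instance.state)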
25. AWS CLI – command line interface
http://aws.amazon.com/cli/
Anything you can do in the GUI, you can do on the command line.
as-create-launch-config spotlc-5cents
--image-id ami-e565ba8c
--instance-type m1.small
--spot-price "0.05"
. . .
as-create-auto-scaling-group spotasg
--launch-configuration spotlc-5cents
--availability-zones "us-east-1a,us-east-1b"
--max-size 16
--min-size 1
--desired-capacity 3
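The same Spot-backed Auto Scaling setup, sketched with boto instead of the CLI (reusing the IDs from the example above):

from boto.ec2.autoscale import (AutoScaleConnection, LaunchConfiguration,
                                AutoScalingGroup)

conn = AutoScaleConnection()
lc = LaunchConfiguration(name='spotlc-5cents',
                         image_id='ami-e565ba8c',
                         instance_type='m1.small',
                         spot_price='0.05')
conn.create_launch_configuration(lc)

asg = AutoScalingGroup(group_name='spotasg',
                       launch_config=lc,
                       availability_zones=['us-east-1a', 'us-east-1b'],
                       min_size=1, max_size=16, desired_capacity=3)
conn.create_auto_scaling_group(asg)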
26. Bright Cluster Manager
http://www.brightcomputing.com
Bright Cluster Manager is an established and very popular HPC cluster management platform that can simultaneously manage both on-premises clusters and infrastructure in the cloud - all using the same system images.
Bright has offices in the UK, Netherlands (HQ) and US.
27. Bright Cluster Manager
1. User submits a job to the queue
2. Bright creates a “data-transfer” job
3. Bright runs the compute job when the data-transfer job is complete
4. Bright transfers output data back after completion
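Bright automates this, but the underlying pattern is just scheduler job dependencies. A generic sketch of the same staging pattern using plain SGE (not Bright's actual mechanism; the job scripts are hypothetical):

import subprocess

# 1. Stage the input data in, and name the job so others can depend on it.
subprocess.check_call(['qsub', '-N', 'stage_in', 'transfer_in.qsub'])

# 2. Hold the compute job until the transfer has finished.
subprocess.check_call(['qsub', '-N', 'compute',
                       '-hold_jid', 'stage_in', 'run_model.qsub'])

# 3. Stage the results back out once the compute job completes.
subprocess.check_call(['qsub', '-N', 'stage_out',
                       '-hold_jid', 'compute', 'transfer_out.qsub'])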
28. cfnCluster – provision an HPC cluster in minutes
#cfncluster
https://github.com/awslabs/cfncluster
cfncluster is a sample code framework that deploys and maintains clusters on AWS. It is reasonably agnostic to what the cluster is for and can easily be extended to support different frameworks. The CLI is stateless; everything is done using CloudFormation or resources within AWS.
10 minutes
29. Configuration is simple ….
There’s not a great deal involved in getting a cluster up and running; this config file is enough to do it. There are more options available, but this is the minimum set.
[global]
cluster_template = default
update_check = true
sanity_check = true
[aws]
aws_region_name = ap-southeast-2
[cluster default]
key_location = /Users/bouffler/.ssh
key_name = boof-cluster
compute_instance_type = c3.2xlarge
scheduler = sge
vpc_settings = public
[vpc public]
vpc_id = vpc-c48a4fa1
master_subnet_id = subnet-3108f146
10 minutes
30. Infrastructure as code
#cfncluster
The creation process might take a few minutes (maybe up to 5 minutes or so, depending on how you configured it).
Because the API to CloudFormation (the service that does all the orchestration) is asynchronous, we can kill the terminal session if we want to and watch the whole show from the AWS console, where you’ll find it all under the CloudFormation dashboard, in the Events tab for this stack.
$ cfnCluster create boof-cluster
Starting: boof-cluster
Status: cfncluster-boof-cluster - CREATE_COMPLETE
Output:"MasterPrivateIP"="10.0.0.17"
Output:"MasterPublicIP"="54.66.174.113"
Output:"GangliaPrivateURL"="http://10.0.0.17/ganglia/"
Output:"GangliaPublicURL"="http://54.66.174.113/ganglia/"
32. Yes, it’s a real HPC cluster
#cfncluster
Now you have a cluster, probably running CentOS 6.x, with Sun Grid Engine as the default scheduler, and Open MPI and a bunch of other great utilities you’re already familiar with installed. You also have a shared filesystem in /shared and an Auto Scaling group ready to expand the number of compute nodes in the cluster when the existing ones get busy.
arthur ~ [26] $ cfnCluster create boof-cluster
Starting: boof-cluster
Status: cfncluster-boof-cluster - CREATE_COMPLETE
Output:"MasterPrivateIP"="10.0.0.17"
Output:"MasterPublicIP"="54.66.174.113"
Output:"GangliaPrivateURL"="http://10.0.0.17/ganglia/"
Output:"GangliaPublicURL"="http://54.66.174.113/ganglia/"
arthur ~ [27] $ ssh ec2-user@54.66.174.113
The authenticity of host '54.66.174.113 (54.66.174.113)' can't be established.
RSA key fingerprint is 45:3e:17:76:1d:01:13:d8:d4:40:1a:74:91:77:73:31.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '54.66.174.113' (RSA) to the list of known hosts.
[ec2-user@ip-10-0-0-17 ~]$ df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/xvda1 10185764 7022736 2639040 73% /
tmpfs 509312 0 509312 0% /dev/shm
/dev/xvdf 20961280 32928 20928352 1% /shared
[ec2-user@ip-10-0-0-17 ~]$ qhost
HOSTNAME ARCH NCPU NSOC NCOR NTHR LOAD MEMTOT MEMUSE SWAPTO SWAPUS
----------------------------------------------------------------------------------------------
global - - - - - - - - - -
ip-10-0-0-136 lx-amd64 8 1 4 8 - 14.6G - 1024.0M -
ip-10-0-0-154 lx-amd64 8 1 4 8 - 14.6G - 1024.0M -
[ec2-user@ip-10-0-0-17 ~]$ qstat
[ec2-user@ip-10-0-0-17 ~]$
[ec2-user@ip-10-0-0-17 ~]$ ed hw.qsub
hw.qsub: No such file or directory
a
#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -pe mpi 2
#$ -S /bin/bash
#
module load openmpi-x86_64
mpirun -np 2 hostname
.
w
110
q
[ec2-user@ip-10-0-0-17 ~]$ ll
total 4
-rw-rw-r-- 1 ec2-user ec2-user 110 Feb 1 05:57 hw.qsub
[ec2-user@ip-10-0-0-17 ~]$ qsub hw.qsub
Your job 1 ("hw.qsub") has been submitted
[ec2-user@ip-10-0-0-17 ~]$
[ec2-user@ip-10-0-0-17 ~]$ qstat
job-ID prior name user state submit/start at queue
slots ja-task-ID
---------------------------------------------------------------------------
---------------------
1 0.55500 hw.qsub ec2-user r 02/01/2015 05:57:25
all.q@ip-10-0-0-44.ap-southeas 2
[ec2-user@ip-10-0-0-17 ~]$ qstat
[ec2-user@ip-10-0-0-17 ~]$ ls -l
total 8
-rw-rw-r-- 1 ec2-user ec2-user 110 Feb 1 05:57 hw.qsub
-rw-r--r-- 1 ec2-user ec2-user 26 Feb 1 05:57 hw.qsub.o1
[ec2-user@ip-10-0-0-17 ~]$ cat hw.qsub.o1
ip-10-0-0-136
ip-10-0-0-154
[ec2-user@ip-10-0-0-17 ~]$
33. Upgrade from Ivy Bridge to Haswell
#cfncluster
Yes, really :-)
You can upgrade your whole cluster in a keystroke or two. It’s an easy way to test which CPUs or instance properties matter to your code’s performance. For example, you may find that Haswell doesn’t improve your code’s performance enough to make the additional cost worthwhile, in which case you can just as easily downgrade the CPUs you’re using.
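As a sketch, assuming the cluster was built from the config shown earlier: change one line in the [cluster] section and push the change out to the running stack (with cfncluster, via its update command, which reapplies the CloudFormation template).

[cluster default]
compute_instance_type = c4.2xlarge    # was c3.2xlarge (Ivy Bridge)

$ cfncluster update boof-cluster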
34. Config options to explore …
#cfncluster
Many options, but the most interesting ones immediately are:
# Compute Server EC2 instance type
# (defaults to t2.micro for default template)
compute_instance_type = t2.micro
# Master Server EC2 instance type
# (defaults to t2.micro for default template)
#master_instance_type = t2.micro
# Initial number of EC2 instances to launch as compute nodes in the cluster.
# (defaults to 2 for default template)
#initial_queue_size = 1
# Maximum number of EC2 instances that can be launched in the cluster.
# (defaults to 10 for the default template)
#max_queue_size = 10
# Boolean flag to set autoscaling group to maintain initial size and scale back
# (defaults to false for the default template)
#maintain_initial_size = true
# Cluster scheduler
# (defaults to sge for the default template)
scheduler = sge
# Type of cluster to launch i.e. ondemand or spot
# (defaults to ondemand for the default template)
#cluster_type = ondemand
# Spot price for the ComputeFleet
#spot_price = 0.00
# Cluster placement group. This placement group must already exist.
# (defaults to NONE for the default template)
#placement_group = NONE
35. How is AWS Used for Scientific Computing?
• High Performance Computing (HPC) for Engineering and Simulation
• High Throughput Computing (HTC) for Data-Intensive Analytics
• Hybrid Supercomputing centres
• Collaborative Research Environments
• Citizen Science
• Science-as-a-Service
• Science where the workload changes (hint: almost all science)
37. Astronomy in the Cloud
CHILES will produce the first neutral hydrogen deep field, to be carried out with the VLA in B array and covering a redshift range from z=0 to z=0.45. The field is centered on the COSMOS field. It will produce neutral hydrogen images of at least 300 galaxies spread over the entire redshift range.
Working with AWS’s SciCo team to exploit the Spot market in the cloud, the team at ICRAR in Australia have been able to implement the entire processing pipeline in the cloud for around $2,000 per month, which means the $1.75M they otherwise needed to spend on an HPC cluster can be spent on way cooler things that impact their research … like astronomers.
38. High Throughput Computing at Scale
The Large Hadron Collider @ CERN includes 6,000+ researchers from over 40 countries and produces approximately 25 PB of data each year.
The ATLAS and CMS experiments are using AWS for Monte Carlo simulations and analysis of LHC data.
39. Zooniverse
“The Zooniverse is heavily reliant on Amazon Web Services (AWS), particularly Elastic Compute Cloud (EC2) virtual private servers and Simple Storage Service (S3) data storage. AWS is the most cost-effective solution for the dynamic needs of Zooniverse’s infrastructure …”
http://wwwconference.org/proceedings/www2014/companion/p1049.pdf
The World’s Largest Citizen Science Platform
… cost is a factor – running a central API means that when the Zooniverse is quiet and there aren’t many people about, we can scale back the number of servers we’re running (automagically on Amazon Web Services) to a minimal level.
40. Novartis
39 years of computational chemistry in 9 hours
Novartis ran a project that involved virtually screening 10 million compounds against a common cancer target in less than a week. They calculated that it would take 50,000 cores and close to a $40 million investment if they wanted to run the experiment internally.
Partnering with Cycle Computing and Amazon Web Services (AWS), Novartis built a platform leveraging Amazon Simple Storage Service (Amazon S3), Amazon Elastic Block Store (Amazon EBS), and four Availability Zones. The project ran across 10,600 Spot Instances (approximately 87,000 compute cores) and allowed Novartis to conduct 39 years of computational chemistry in 9 hours for a cost of $4,232. Out of the 10 million compounds screened, three were successfully identified.
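Those numbers hang together on the back of an envelope:

years = 39
core_hours = years * 365 * 24     # ~341,640 core-hours of chemistry
print(core_hours / 9.0)           # ~38,000 effective cores over 9 hours,
                                  # comfortably within the ~87,000 they ran
print(4232.0 / core_hours)        # ~$0.012 per core-hour on the Spot market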
41. Globus Genomics
Globus Genomics is an indispensable platform for Core Labs (bioinformatics, sequencing, HPC) to meet their customers’ needs for cost-effective, large-scale NGS analysis. Globus Genomics provides a flexible, extensible solution to address the varying analysis and resource requirements of bioscience researchers, through powerful data management tools, customized workflow environments, and cloud-based elastic computational infrastructure.
www.globus.org/genomics
42. Aquaria – 3D Protein Visualization
Aquaria is a publicly available web tool, designed for biologists, for visualizing and working with the 3D structure of proteins. It has radically simplified the process of analyzing more than 500,000 proteins from the Protein Data Bank. Being able to visualize the three-dimensional structure of proteins has been of great interest to scientists since long before the genomic age.
The project was led by Dr Sean O’Donoghue from the CSIRO in Australia along with a team from the Garvan Institute in Sydney, and a key collaboration with Dr Andrea Schafferhans from the Technical University of Munich.
Aquaria is fast, comes with an easy-to-use interface, and contains twice as many models as all other similar resources combined. It also allows users to view additional information, like the genetic differences between individuals, mapped onto 3D structures.
http://aquaria.ws/
43. What the SKA is saying about AWS
“No one’s ever built anything this big before, and we really don’t understand the ins and outs of operating it … Cloud systems — which provide on-demand, ‘elastic’ access to shared, remote computing resources — would provide an amount of flexibility for the project that buying dedicated hardware might not.”
- SKA architect Tim Cornwell, Nature, May 27, 2014
44. We are providing a grants pool of AWS credits and up to one petabyte of storage for an AWS Public Data Set.
The data set will initially be provided by several of the SKA’s precursor telescopes, including CSIRO’s ASKAP and ICRAR’s MWA in Australia, and KAT-7 (pathfinder to the SKA precursor telescope MeerKAT) in South Africa.
The grants are open to anyone who is making use of radio astronomical telescopes or radio astronomical data resources around the world.
The grants will be administered by the SKA, which will be looking for innovative, cloud-based algorithms and tools able to handle and process this never-ending data stream.
https://aws.amazon.com/blogs/aws/new-astrocompute-in-the-cloud-grants-program/
What AWS is doing with the SKA
45. We Feel Emotion Explorer
We Feel is a project that explores whether social media can provide an accurate, real-time signal of the world’s emotional state. It is a joint collaboration between CSIRO, mental health researchers at The Black Dog Institute, Amazon Web Services, and Gnip. We Feel is built on Amazon’s Big Data technologies and currently analyzes approximately 27 million tweets per day.
The outcomes?
1. We can now monitor, in real time, the emotional health of the world.
2. Seamlessly scale infrastructure up or down in direct relation to social activity.
3. Amazon’s Big Data platform enables real-time trend analysis, queries of historical data, and geospatial analytics.
http://wefeel.csiro.au/
46. “The AWS model works when we have the greatest variety of uncoupled workloads all using the cloud. When it works, it drives the cost of computation down to trivial levels so people can concentrate more on their data, their science and their ideas, rather than bothering to worry about infrastructure.
Science is one of the greatest areas of computation and also happens to be the one that can most benefit from that democratisation in cost and global accessibility, and where we think Amazon can make a huge, really disruptive, impact on the world by participating - which is, at the most basic level, what we are about as a company.”