Risk Management and Particle Accelerators: Innovating with New Compute Platforms on AWS

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Adrian White
Senior SciCo Technical Manager, Amazon Web Services
Level 200

“… Programmers tend to set the size of problems
to fully exploit the computing power that becomes
available as the resources improve.”
Gustafson's law
https://en.wikipedia.org/wiki/Gustafson's_law

Understanding Capacity Needs
• Time: How quickly can we get results?
• Cost: At what cost basis?
• Scale: What scale do we need to get there?

Traditional Capacity Modelling
[Chart: utilisation (%), from 0 to 100, plotted against time]
Drive utilisation as high as possible, but not exceeding available capacity.

Real-World Capacity Planning
[Chart: utilisation (low, medium, high) plotted against time]
Model capacity needs and estimate low, medium and high utilisation.

Real-World Capacity Planning
[Chart: utilisation (low, medium, high), with the region where users are constrained]
Invariably, capacity is underestimated or demand grows faster than expected.

Introducing Queues
[Diagram: shared file storage, remote sites, corporate datacenter]
Queuing architectures help share constrained resources.

But applications, jobs and people still have to wait.

“Invention requires two things: the ability to try a
lot of experiments, and not having to live with
the collateral damage of failed experiments.”
– Andy Jassy, Amazon Web Services

Let’s Move This Model To AWS
[Diagram: head node, auto scaling compute grid, S3, RDS, EFS, AWS Direct Connect, remote visualisation, AWS Snowball]

Scale Grids Based On Work To Be Done
[Diagram: head node, S3, RDS, EFS, AWS Direct Connect, remote visualisation, AWS Snowball]
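
A minimal sketch of sizing a grid from the work to be done, assuming the backlog is known as a count of queued jobs. The function name and all parameters are hypothetical and are not part of any AWS API.

```python
import math


def desired_nodes(queued_jobs: int, jobs_per_node: int,
                  min_nodes: int = 0, max_nodes: int = 100) -> int:
    """Return how many compute nodes to run for the current backlog.

    All names and defaults here are illustrative assumptions.
    """
    if jobs_per_node <= 0:
        raise ValueError("jobs_per_node must be positive")
    # Round up so a partly filled node is still provisioned.
    needed = math.ceil(queued_jobs / jobs_per_node)
    # Clamp to the floor and ceiling the grid is allowed to use.
    return max(min_nodes, min(max_nodes, needed))
```

An empty queue scales the grid to `min_nodes`, so idle capacity is not paid for; `max_nodes` caps spend when the backlog spikes.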

Add Entire Compute Grids On-demand
[Diagram: two head nodes, each with its own compute grid, sharing S3, RDS, EFS, AWS Direct Connect, remote visualisation, AWS Snowball]

On-demand, Auto Scaling Clusters On AWS
• AWS Batch automatically provisions compute resources tailored to the needs of your jobs using Amazon EC2 and EC2 Spot.
• Alces Flight is available in the AWS Marketplace and bundles 1000+ commonly used applications.
  https://aws.amazon.com/marketplace/
• CfnCluster is provided by AWS to quickly provision configurable clusters and grid computing environments.

http://www.kcpt.org/files/uploads/2015/01/mezzanine_386.jpg

LHC – World Wide Computing Grid
[Diagram: detector(s), on-line system, off-line system; link rates labelled PB/s, GB/s, 100s MB/s, 600 Mbps, 600 Mbps]
Tier 0: CERN
Tier 1: France, Germany, Italy, Fermilab
Tier 2: Caltech
Tier 3

Bursting Compute Using AWS Spot

An Aside: Monte Carlo Methods
Monte Carlo methods vary, but tend to follow
a particular pattern:
1. Define a domain of possible inputs
2. Generate inputs randomly from a probability
distribution over the domain
3. Perform a deterministic computation
on the inputs.
4. Aggregate the results
Monte Carlo simulations require many, many parallel iterations…
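
The four steps above can be sketched with a classic toy problem, estimating pi. This is an illustration of the pattern only, not a simulation from the talk; the function name and seed parameter are assumptions.

```python
import random


def estimate_pi(samples: int, seed: int = 0) -> float:
    """Estimate pi by sampling points uniformly in the unit square."""
    rng = random.Random(seed)  # seeded so a run is repeatable
    inside = 0
    for _ in range(samples):
        # Steps 1 and 2: the domain is the unit square; draw inputs uniformly.
        x, y = rng.random(), rng.random()
        # Step 3: deterministic computation, is the point in the quarter circle?
        if x * x + y * y <= 1.0:
            inside += 1
    # Step 4: aggregate. The hit fraction approximates pi / 4.
    return 4.0 * inside / samples


if __name__ == "__main__":
    print(estimate_pi(100_000))
```

Each iteration is independent, so batches with different seeds could run as separate jobs and have their results averaged afterwards.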

Scale with AWS: Brookhaven Labs
• More than 500 million events were fully
simulated using Monte Carlo methods in 10
days using 2.9 million jobs
• This would have taken 6 weeks on-premises
without AWS
• Used multiple AWS regions to minimise storage
costs and improve latency for data access
• The HEP Cloud project added 58,000 vCPUs
elastically to their on-premises facility for the
CMS experiment

Compute Strategies at Scale
[Chart: reserved capacity, on-demand capacity, Spot capacity]
Blend commercial models:
1. Reserve your baseline capacity
2. Supplement with on-demand
3. Aggressively use Spot capacity for suitable workloads.
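
A rough sketch of how a blended model could be costed, assuming a fixed reserved baseline and a known share of burst work that tolerates interruption. The function, its parameters and any rates passed to it are hypothetical, not AWS prices.

```python
def blended_hourly_cost(demand: int, baseline: int, spot_share: float,
                        reserved_rate: float, on_demand_rate: float,
                        spot_rate: float) -> float:
    """Hourly cost of meeting `demand` instances with a blended model.

    baseline:   instances covered by reservations (paid for even if idle)
    spot_share: fraction of demand above the baseline that can run on Spot
    All rates are illustrative per-instance hourly prices.
    """
    if not 0.0 <= spot_share <= 1.0:
        raise ValueError("spot_share must be between 0 and 1")
    burst = max(0, demand - baseline)   # demand the reservation does not cover
    spot = burst * spot_share           # interruptible part of the burst
    on_demand = burst - spot            # remainder runs on-demand
    return (baseline * reserved_rate
            + on_demand * on_demand_rate
            + spot * spot_rate)
```

For example, `blended_hourly_cost(100, 40, 0.5, 1.0, 2.0, 0.5)` prices 40 reserved, 30 on-demand and 30 Spot instances.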

Spot Exists in Different Markets
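C3 Spot prices by instance size and Availability Zone, against the On-Demand price:

Size   1b      1c      1a      On-Demand
8XL    $0.27   $0.29   $0.50   $1.76
4XL    $0.30   $0.16   $0.21   $0.88
2XL    $0.07   $0.08   $0.08   $0.44
XL     $0.05   $0.04   $0.04   $0.22
L      $0.01   $0.04   $0.01   $0.11

A Spot market exists for:
• Each instance family
• Each instance size
• Each Availability Zone
• In every region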

Spot Strategies at Scale
1. Let Spot Fleet take care of the detail
• Lowest price vs diversified allocation
• Spot Fleet instance weighting
2. Make the cluster or workload manager “Spot aware”
• HTCondor & Condor Annex in HEP
• Toil workflow engine in Genomics & Life Sciences
3. Use services like Amazon ECS and AWS Batch to
manage the details
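
The difference between the two allocation strategies in point 1 can be illustrated with a simplified model. This is a sketch of the idea only, not Spot Fleet's actual algorithm; the pool names and prices in the usage note are made up.

```python
from typing import Dict


def lowest_price(pools: Dict[str, float], target: int) -> Dict[str, int]:
    """Place the whole target in the single cheapest pool."""
    cheapest = min(pools, key=pools.get)
    return {cheapest: target}


def diversified(pools: Dict[str, float], target: int) -> Dict[str, int]:
    """Spread the target as evenly as possible across all pools."""
    names = sorted(pools, key=pools.get)  # cheapest pools take any remainder
    share, extra = divmod(target, len(names))
    return {name: share + (1 if i < extra else 0)
            for i, name in enumerate(names)}
```

With hypothetical pools `{"a": 0.50, "b": 0.27, "c": 0.29}` and a target of 10, `lowest_price` puts all 10 instances in pool `b`, while `diversified` spreads them 4/3/3. The first minimises cost; the second limits how much capacity is lost if one pool's price spikes.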

Use AWS Batch to Manage Spot
1. Source data lands in S3
2. S3 events trigger a Lambda function to add a job to the queue
3. AWS Batch holds the queue of runnable jobs
4. A custom CloudWatch metric monitors queue length
5. CloudWatch alarms trigger auto scaling of the ECS cluster
6. The ECS / Spot cluster runs the jobs
7. Output data products are written to S3
8. Additional analytics, e.g. Spark, Tez, Hive, MLlib
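
A minimal sketch of the Lambda function that turns an S3 event into a queued job, assuming boto3's Batch `submit_job` call. The queue name, job definition name and the `input` parameter are hypothetical placeholders; the optional `batch_client` argument exists only so the handler can be exercised without AWS access.

```python
import re
from urllib.parse import unquote_plus

JOB_QUEUE = "example-queue"          # hypothetical queue name
JOB_DEFINITION = "example-job-def"   # hypothetical job definition name


def handler(event, context, batch_client=None):
    """Submit one AWS Batch job per object in an S3 event notification."""
    if batch_client is None:
        import boto3  # imported lazily so local tests do not need AWS
        batch_client = boto3.client("batch")

    job_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        # Job names are restricted to letters, digits, hyphens and
        # underscores, so replace anything else and cap the length.
        job_name = ("job-" + re.sub(r"[^A-Za-z0-9_-]", "-", key))[:128]
        response = batch_client.submit_job(
            jobName=job_name,
            jobQueue=JOB_QUEUE,
            jobDefinition=JOB_DEFINITION,
            parameters={"input": f"s3://{bucket}/{key}"},
        )
        job_ids.append(response["jobId"])
    return job_ids
```

The handler only enqueues work; scaling the cluster is left to the CloudWatch metric and alarms described in the flow.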

“No server is easier to
manage than ‘no server’.”
– Werner Vogels, Amazon.com

AWS Lambda – How it Works
Bring your own code
• Node.js, Java, Python
• Java = any JVM-based language such as Scala, Clojure, etc.
• Bring your own libraries
Flexible invocation paths
• Event or RequestResponse invoke options
• Existing integrations with various AWS services
Simple resource model
• Select memory from 128 MB to 1.5 GB in 64 MB steps
• CPU and network allocated proportionately to RAM
• Reports actual usage
Fine-grained permissions
• Uses IAM role for Lambda execution permissions
• Uses resource policy for AWS event sources

CSIRO – Cloud-based CRISPR Prediction GT-Scan2
CRISPR/Cas9 technology provides genome editing capability.
This has application for personalised medicine and agriculture.
CSIRO built GT-Scan2 to:
1. Better understand the science
2. Provide higher powered computational tools
• Super-computing-scale analysis
• Interactive real time analysis (query style research)
[Figure labels: GT-Scan2; ranked choices]

CSIRO – CRISPR Search with AWS Lambda
GT-Scan2.0 is implemented as a
microservices architecture using
AWS Lambda
Serverless:
• Does not require users to
have high-compute power
Scalable:
• Can be easily scaled to
whole genome analysis
Also implemented as a “stand-alone” version:
• Can be run on local servers
• Can incorporate your own
ChIP-seq data rather than
public data

Lambda in the Context of Grid Computing
Source: “Occupy the Cloud: Distributed Computing for the 99%”
https://arxiv.org/pdf/1702.04024.pdf

Cost Considerations for Lambda at Scale?
Source: www.cloudhealthtech.com/blog/how-use-lambda-ec2-save-most-money
[Chart: cost per million executions ($0 to $20) against millions of function executions (1, 3, 6, 10, 40), comparing Lambda, On-demand and 3 year RI; other values marked on the chart: $14, $4, 26]
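
A back-of-envelope sketch of why the comparison depends on volume: Lambda cost per execution is roughly flat, while a fixed-price server gets cheaper per execution as volume grows. The default prices and the 100 ms billing increment below are assumptions taken from commonly published Lambda list prices; check current pricing before relying on them. Free tier is ignored and both function names are hypothetical.

```python
import math


def lambda_cost_per_million(memory_mb: int, duration_ms: float,
                            price_per_million_requests: float = 0.20,
                            price_per_gb_second: float = 0.00001667,
                            billing_increment_ms: int = 100) -> float:
    """Approximate Lambda cost in dollars for one million executions."""
    # Duration is rounded up to the next billing increment (assumed).
    billed_ms = math.ceil(duration_ms / billing_increment_ms) * billing_increment_ms
    gb_seconds = (memory_mb / 1024) * (billed_ms / 1000)
    return price_per_million_requests + 1_000_000 * gb_seconds * price_per_gb_second


def fixed_cost_per_million(monthly_cost: float, millions_per_month: float) -> float:
    """Cost per million executions for capacity with a fixed monthly price."""
    if millions_per_month <= 0:
        raise ValueError("millions_per_month must be positive")
    return monthly_cost / millions_per_month
```

Comparing the two at different monthly volumes shows where a fixed-price instance starts to undercut Lambda for a given function size.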

https://aws.amazon.com/solutions/case-studies/aon/
Stochastic Simulations on AWS: Aon Benfield

GPUs for Risk Modelling & Hedging
The Challenge
Spinning up large numbers of GPUs quickly and
inexpensively to meet ABSI’s customers’ financial
modelling and reporting needs
ABSI uses proprietary algorithms (Monte Carlo
simulations) running millions of times
The Solution
ABSI moved its infrastructure to AWS and deprecated its
co-located data centre
ABSI built a front-end on AWS for its processing solution,
automatically running GPU instances on Amazon EC2
using EBS in an Amazon VPC for security.
The Result
Can be as much as 500 times more efficient in terms of
performance per dollar for some clients
“Using AWS helps us reduce a 10-
day process to 10 minutes. That’s
transformative: it broadens our ability
to discover.”
Peter Phillips
Managing Director, Aon Benfield Securities
UK-based Aon plc, the ultimate parent company of Aon
Benfield Securities, is a leading global provider of risk
management, insurance and reinsurance brokerage

“… Programmers tend to set the size of problems
to fully exploit the computing power that becomes
available as the resources improve.”
Gustafson's Law
https://en.wikipedia.org/wiki/Gustafson's_law

Let’s remove constraints.
What can you build?

Further Information
AWS resources:
• High Performance Computing
• Financial Services Grid Computing on Amazon Web Services
• Research & Technical Computing on AWS
Blog posts and articles:
• Genome Engineering Applications: Early Adopters of the Cloud
• Experiment that Discovered the Higgs Boson Uses AWS to Probe Nature

Thank you!
whiteadr@amazon.com
