AWS re:Invent 2016: Building HPC Clusters as Code in the (Almost) Infinite Cloud (CMP318)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Nathan McGuirt, Manager, Solutions Architecture, AWS
Gabriele Garzoglio, HEP Cloud Facility Project Manager, Fermilab
December 2016
Building HPC Clusters as Code
in the (Almost) Infinite Cloud
CMP318

What to Expect from the Session
• Why customers are using AWS for HPC/HTC
• Leveraging Spot Instances for big compute at low cost
• Accelerating deployment with automation and managed
services

Agenda
• Why AWS for HPC?
• Automating cluster deployment
• Fermi National Accelerator Laboratory
• Demo of scaling jobs on a budget

High Performance Computing (HPC) vs.
High Throughput Computing (HTC)
HPC: High performance computing
(cluster computing)
- Tightly clustered
- Latency sensitive
HTC: High throughput computing
(grid computing)
- Less inter-node communication
- More horizontal scalability (pleasingly
parallel)

Data
AWS Snowball AWS Direct Connect

Cost – Spot market
Request
1
2
3
4
5
6
7
8
9
Bid Price
$1.00
$0.55
$0.50
$0.33
$0.20
$0.18
$0.15
$0.10
$0.05
Spot Price
$0.20
$0.20
$0.20
$0.20
$0.20

Automation
• Fully custom
• APIs
• AWS CloudFormation
• Managed services
• Amazon EMR
• AWS Batch
• Software cluster management solutions
• CFNCluster
• Alces Flight
• Partner offerings

API - SDKs
Java Python PHP .NET Ruby nodeJS
iOS Android AWS Toolkit
for Visual
Studio
AWS Toolkit
for Eclipse
Tools for
Windows
PowerShell
CLI

CloudFormation
Resources:
Ec2Instance:
Type: AWS::EC2::Instance
Properties:
SecurityGroups:
- Ref: InstanceSecurityGroup
KeyName: mykey
ImageId: ''
InstanceSecurityGroup:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: Enable SSH access via port 22
SecurityGroupIngress:
- IpProtocol: tcp
FromPort: '22'
ToPort: '22'
CidrIp: 0.0.0.0/0

AWS CFNCluster
$ pip install cfncluster
...
$ cfncluster configure
...
$ cfncluster run mycluster

Alces Flight
Alces Flight is a software offering self-service
supercomputers via the AWS Marketplace.
Creates self-scaling clusters with more than
750 popular scientific applications pre-installed,
complete with libraries and various compiler
optimizations, ready to run. The clusters use
the AWS Spot Instances by default.

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Gabriele Garzoglio, HEP Cloud Facility Co-Project Manager, Fermilab
December 2016
The HEP Cloud Facility
Elastic Computing for High Energy Physics

Computing at the Fermi National Accelerator Laboratory
Lead United States particle physics laboratory
• Funded by the Department of Energy
• ~100 PB of data on tape
• High Throughput Computing characterized by:
• “Pleasingly parallel” tasks
• High CPU instruction / Bytes IO ratio
• But still lots of I/O. See Pfister: “In Search of
Clusters”
Focus on Neutrino Physics
• Including the NOvA Experiment
Strong collaborations with international
laboratories
• CERN / Large Hardron Collider (LHC)
Experiments
• Brookhaven National Laboratory (BNL)
• Lead institution (“Tier-1”) for the Compact Muon
Solenoid (CMS)

Drivers of Facility Evolution: Capacity / Cost / Elasticity
Price of one core-year on
Commercial CloudsHEP needs: 10-100 x today capacity
Facility size: 15k cores
NOvA experiment jobs in queue at FNAL
Usage is not steady-state
CMS Analysis Users – Yearly Cycle

Vision for Facility Evolution
• Strategic Plan for U.S. Particle Physics (P5 Report to the U.S. funding agencies)
Fermilab Facility
HTC, HPC Cores
68.7K
Disk Systems
37.6 PB
Tape
101 PB
10/100 Gbit
Networking
~5k internal
network ports
The Facility Today is “Fixed”
Rapidly evolving computer architectures
and increasing data volumes require
effective crosscutting solutions that are
being developed in other science
disciplines and in industry.
• HEP Cloud Vision Statement
– HEPCloud is envisioned as a portal to an ecosystem of diverse computing resources commercial or
academic
– Provides “complete solutions” to users, with agreed upon levels of service
– The Facility routes to local or remote resources based on workflow requirements, cost, and efficiency of
accessing various resources
– Manages allocations of users to target compute engines
• Pilot project to explore feasibility, capability of HEPCloud
– Goal of moving into production during FY18
– Seed money provided by industry

HEP Cloud Architecture
Overview External Relationships

HEP Cloud Architecture
Overview External Relationships
Basic idea: Add
disparate resources
(Cloud VM, HPC slots,
Grid nodes, local
resources) into a
central resource pool.

Fermilab HEPCloud: Expanding to the Cloud
Reference herein to any specific commercial product, process, or service by trade name, trademark,
manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or
favoring by the United States Government or any agency thereof.
– Provisioning
– Performance
– Image portability
– On-demand services
• Where to start?
– Market leader:
Amazon Web Services (AWS)
• Integration challenges that needs to
be managed to run at scale:
– Networking
– Storage and data movement
– Monitoring and accounting
– Security

Integration Challenges: Provisioning – Create an Overlay Batch System with
GlideinWMS and HTCondor
condor
submit
VO Frontend
HTCondor
Central Manager
HTCondor
Schedulers
HTCondor
Schedulers
Frontend
Grid Site
Virtual Machine
Job
Local Resources
Virtual Machine
Job
GlideinWMS Factory
HTCondor-G
High Performance
Computers
Virtual Machine
Job
Cloud Provider
Virtual Machine
VM
Glidein
HTCondor
Startd
Job
Pull Job

Integration Challenges: Provisioning – Containing costs
• Using AWS Spot market to
contain costs
• Workflows are already engineered
to sustain preemption from the
Grid
– Job are “short”, i.e., killed jobs are
affordable w/o checkpointing
– Preempted jobs are automatically
resubmitted
– Data management systems
identify files in a dataset that were
not processed and allow recovery
CMS use case:
Histogram of number
of times each job
started
(measure of
preemption)
NOvA use case:
number of VMs
running (blue) and
preempted (red)
every hour
2.5M jobs
with no
preemption
240 VM / h
60 VM preempted
in 1h
400K jobs
with one
preemption

Integration Challenges: Provisioning – Containing costs
• The Decision Engine oversees
the costs and optimizing VM
placement using the status of the
facility, the historical prices, and
the job characteristics
Bid at 25% x on-demand price has lowest expected cost
• Based on pre-emption history,
calculating the probability that a 5-
24 h job finishes within a week
although it has to restart due to
preemption, for various bidding
algorithms.
$0.25 / h

Integration Challenges: Performance
Benchmarks used to compare workflow duration on AWS (and $$) with local execution
Need EBS
Need EBS
32 cores
scale w/ cores Need EBS
Need EBS
32 cores
scale w/ cores
Need
parallel
streams
c3.2xlarge c3.2xlarge
good candidate – want > 1
From AWS to
FNAL: 7Gbps
Access to S3 always
saturates the 1 Gbps
interface

Integration Challenges: Performance
CMS Use Case:
Wallclock distribution by AWS instance type

Integration Challenges: Image Portability
Build VM management tool,
considering:
• HVM virtualization (HW VM
+ Xen) on AWS: gives
access to all AWS
resources
• Contain VM size (saves
import time and cost)
• Import process covers
multiple AWS accounts and
regions
• AuthN with AWS use short-
lived role-based tokens,
rather than long term keys
Build “Golden Image” from standard Fermilab Worker Node configuration VM.

Integration Challenges: On-demand Services
Jobs depend on software services to run
Automating the deployment of these services on AWS on-demand - enables scalability and cost savings
• Services include data caching (e.g., Squid) WMS , submission service, data transfer, etc.
• As services are made deployable on-demand, instantiate ensemble of services together (e.g.,
through AWS CloudFormation)
Example: on-demand Squid
• Deploy Squid via
auto-scaling services.
Squid is deployed if average
group bandwidth utilization
is too high. Server is
deployed or destroyed in
30 seconds.
• Front Squids with a
load balancer.
• Name the load balancer for that
region via Route 53
Auto Scaling
group
CloudFormation

"SquidInstanceType" : { "Type" : "String", "Default" : "c3.xlarge", … },
"SquidLaunchConfiguration" : { "Type" : "AWS::AutoScaling::LaunchConfiguration",
"Properties" : {
"InstanceType" : { "Ref" : "SquidInstanceType" },
"ImageId" : { "Fn::FindInMap" : [ "AMIRegionMap", {"Ref":"AWS::Region"}, "SquidAMI" ]},
"SecurityGroups" : [ { "Fn::FindInMap" :
["SecurityGroupRegionMap",{"Ref":"AWS::Region"}, "SquidSG" ] } ],
… } }
"SquidAutoscalingGroup" : { "Type" : "AWS::AutoScaling::AutoScalingGroup",
"Properties" : {
"AvailabilityZones" : {"Ref" : "AvailabilityZones"},
"LaunchConfigurationName" : {"Ref" : "SquidLaunchConfiguration" },
"LoadBalancerNames" : [ {"Ref" : "SquidLoadBalancer" } ],
… } },
"SquidAutoscaleUpPolicy" : { "Type" : "AWS::AutoScaling::ScalingPolicy",
"Properties" : {
"AdjustmentType" : "ChangeInCapacity",
"AutoScalingGroupName" : { "Ref" : "SquidAutoscalingGroup" },
"ScalingAdjustment" : "1”
… } },
…
Integration Challenges: On-demand Services – CloudFormation

"SquidNetworkBandwidthHighAlarm" : { "Type" : "AWS::CloudWatch::Alarm",
"Properties" : {
"AlarmDescription" : "Scale up if average NetworkIn > for 5 minutes",
"MetricName" : "NetworkOut",
"Statistic" : "Average",
"Period" : "300",
"Threshold" : "1100000000",
"AlarmActions" : [ { "Ref" : "SquidAutoscaleUpPolicy" } ],
"ComparisonOperator" : "GreaterThanThreshold”,
… } }
…
"SecurityGroupRegionMap" : {
"us-west-2“ : { "SquidSG" : "sg-xxxxf6cb" },
"us-east-1" : { "SquidSG" : "sg-xxxx70ca" },
… }
"SquidLoadBalancer" : {"Type" : "AWS::ElasticLoadBalancing::LoadBalancer",
"Properties" : {
"CrossZone" : "false",
"SecurityGroups" : [ {"Fn::FindInMap" :
[ "SecurityGroupRegionMap", { "Ref" : "AWS::Region" } , "SquidSG" ] } ],
"Listeners" : [ { "LoadBalancerPort":"3128", "InstancePort":"3128", "Protocol":"TCP" } ],
"HealthCheck" : { "Target" : "TCP:3128", "HealthyThreshold" : "3", … }
… } }

"elbHostedZone": { "Type" : "AWS::Route53::HostedZone",
"Properties" : {
"HostedZoneConfig" : {
"Comment" : "auto-generated private hosting zone for ELB” },
"Name" : { "Fn::Join" : ["", [{"Ref":"AvailabilityZone"},".elb.fnaldata.org.”]]},
"VPCs" : [{
"VPCId" : { … },
"VPCRegion" : { "Ref" : "AWS::Region"} }]
} }
"elbDNS" : { "Type" : "AWS::Route53::RecordSet",
"Properties" : {
"HostedZoneId" : { "Ref" : "elbHostedZone" },
"Name" : { "Fn::Join" :
["", ["elb2.",{"Ref":"AvailabilityZone"},".elb.fnaldata.org."]]},
"ResourceRecords" : [ { "Fn::GetAtt" : [ "SquidLoadBalancer", "DNSName" ] } ]
… } }
Clients call Squid as elb2.<AvailabilityZone>.elb.fnaldata.org

Integration Challenges: Networking
Implement routing / firewall configuration
to use peered ESNet / AWS to route
data flow through ESNet
AWS / ESNet data egress cost waiver
• For data transferred through
ESNet, transfer charges are
waived for data costs up to 15%
of the total

Integration Challenges: Storage and Data Movement
Integrate S3 storage stage-in/-out for AWS internal /
external access - enables flexibility on data
management
• Consider O(1000) jobs finishing on the cloud and
transferring output to remote storage
• Storage bandwidth capacity is limited
• Two main strategies for data transfers:
1. Fill the available network transfer by having some
jobs wait - Put the jobs on a queue and transfer
data from as many jobs as possible - idle VMs
have a cost
2. Store data on S3 almost concurrently (due to high
scalability) and transfer data back asynchronously
- data on S3 has a cost
• The cheapest strategy depends on the storage
bandwidth, number of jobs, etc.
S3

Integration Challenges: Monitoring and Accounting
Monitor # GCloud VMs (S. Korea Priv. Cloud) Monitor # AWS VMs
Accounting:
$ by VO and VM Type
Monitor
HEP Cloud
Slots

NoVA Data Processing
Processing the 2014/2015 dataset
3 use cases: Particle ID, Montecarlo ,
Data Reconstruction
Received AWS research grant
Dark Energy Survey
Gravitational Waves
Search for optical
counterpart of events
detected by LIGO/VIRGO
gravitational wave detectors (FNAL LDRD)
Modest CPU needs, but want 5-10 hour turnaround
Burst activity driven entirely by physical phenomena
(gravitational wave events are transient)
Rapid provisioning to peak
CMS Monte Carlo Simulation
Generation (and detector simulation, digitization,
reconstruction) of simulated events in time for
Moriond conference.
58,000 compute cores, steady-state
Demonstrates scalability
Received AWS research grant
Initial HEPCloud Use Cases

Results from the CMS Use Case
• All CMS simulation requests fulfilled by the conference
deadline (Rencontres de Moriond 2016 )
– 2.9 million jobs, 15.1 million wall hours
• 9.5% badput – includes preemption from spot pricing
• 87% CPU efficiency
– 518 million events generated

CMS Reaching ~60k slots on AWS with HEPCloud
10% Test 25%
60000 slots
10000 VM
Each color corresponds to a
different region / zone /machine type

HEPCloud AWS: 25% of CMS global capacity
Production
Analysis
Reprocessing
Production on AWS
via FNAL HEPCloud
Production
Analysis
Reprocessing
Production on AWS
via FNAL HEPCloud

On-premises vs. cloud cost comparison
Average cost per core-hour
• On-premises resource: 0.9 cents per
core-hour
• Includes power, cooling, staff,
but assumes 100% utilization
• Off-premises at AWS (CMS use case):
1.4 cents per core-hour
• Off-premises at AWS (NOvA use case):
3.0 cents per core-hour
• Use case demanded bigger VM
Benchmarks
• Specialized (“ttbar”) benchmark focused on HEP workflows
• On-premises: 0.0163 ttbar /s (higher = better)
• Off-premises: 0.0158 ttbar /s
Raw compute performance roughly equivalent
Cloud costs approaching equivalence
Amazon provisions/retires 60k cores for our system in ~1 hour

Acknowledgements
The support from the Computing Sector
The Fermilab HEPCloud Facility team
AWS and their engagement team, in particular Jamie Baker
The HTCondor team
The collaboration and contributions from KISTI, in particular Dr. Seo-Young Noh
The Illinois Institute of Technology (IIT) students and professors Ioan Raicu and
Shangping Ren
The Italian National Institute of Nuclear Physics (INFN) summer student program
• NOvA: http://cd-docdb.fnal.gov/cgi-bin/ShowDocument?docid=5774
• CMS: http://cd-docdb.fnal.gov/cgi-bin/ShowDocument?docid=5750
For More Information:

Remember to complete
your evaluations!

Related Sessions
CMP201 - Auto Scaling – The Fleet Management Solution for Planet Earth

AWS re:Invent 2016: Building HPC Clusters as Code in the (Almost) Infinite Cloud (CMP318)

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to AWS re:Invent 2016: Building HPC Clusters as Code in the (Almost) Infinite Cloud (CMP318)

Similar to AWS re:Invent 2016: Building HPC Clusters as Code in the (Almost) Infinite Cloud (CMP318) (20)

More from Amazon Web Services

More from Amazon Web Services (20)

Recently uploaded

Recently uploaded (20)

AWS re:Invent 2016: Building HPC Clusters as Code in the (Almost) Infinite Cloud (CMP318)