© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Dr. Jeffrey B. Layton
Global Scientific Computing
June 20, 2016
Building HPC Clusters as Code in
the [Almost] Infinite Cloud
Agenda
• Why cloud for HPC?
• Tools for creating clusters in the cloud
• SPOT + HPC = peas and carrots
• Fermi National Accelerator Laboratory
• Demo of scaling jobs on a budget
• Summary
Why cloud for HPC?
Scalability
• If you need to run on lots of cores, just spin them up
• If you don’t need nodes, turn them off (and don’t pay for them)
Time to research
• Usually on-premises high-performance computing (HPC) resources are centralized (shared)
• Researchers like to have their own nodes when they need them
World-wide collaboration
• Share data and interact with it by using the cloud
Latest technology and various instance types
Can save $$$
Flexibility: code as infrastructure
AWS HPC architectures—phases of deployment
• Fork lift
– Make it look like on-premises
• Cloud “port”
– Adapt to cloud features
– Auto Scaling
– Spot
• Born in the cloud
– Cycle computing
– Rethink application
– Microservices and serverless computing
You must think in “cloud”
You cannot think in “on-prem” and transpose
You must think in “cloud”
Do you think you can do that, Mr. Gant?
AWS HPC architecture
On-premises
• Master Node
• 4 Compute Nodes
• Storage (NFS, Parallel)
AWS Cloud
• Master Instance
• 6 Compute Instances
• Storage (NFS, Parallel)
HPC tools
MIT StarCluster
• No longer supported or developed
Bright Cluster Manager
• Good for hybrid solutions
CloudyCluster
• Omnibond (out of Clemson University)
Amazon CfnCluster
• Getting started writing your own tools
Alces Flight—on AWS Marketplace
Alces Flight
Alces Flight is software offering self-service
supercomputers by using the AWS
Marketplace (the cloud’s “App Store”). It
creates self-scaling clusters with more than 750
popular scientific applications pre-installed,
complete with libraries and various compiler
optimizations, ready to run. The clusters use
the AWS Spot market by default.
5 minutes
http://alces-flight.com
Alces Flight is familiar and flexible
• Same tools as virtually all HPC systems
• Environment modules
• Job scheduler (SGE)
• Catalog of 750+ prebuilt scientific applications and
libraries including visualization tools
• Alces gridware tool for application management
• Integrated with modules
• Defaults to the Spot market
• Auto Scaling cluster based on queued jobs
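The idea of scaling on queued jobs can be sketched as a small sizing rule. This is a hypothetical illustration, not the Alces Flight implementation: the function name `desired_nodes`, its arguments, and the "one node per block of queued slots" policy are all assumptions.

```shell
# Hypothetical sizing rule (an assumption, NOT the Alces Flight logic):
# request enough compute nodes to cover the queued slots, clamped
# between a minimum and a maximum cluster size.
desired_nodes() {
  local queued_slots="$1" slots_per_node="$2" min_nodes="$3" max_nodes="$4"
  # Integer ceiling division: ceil(queued_slots / slots_per_node)
  local needed=$(( (queued_slots + slots_per_node - 1) / slots_per_node ))
  if [ "$needed" -lt "$min_nodes" ]; then needed="$min_nodes"; fi
  if [ "$needed" -gt "$max_nodes" ]; then needed="$max_nodes"; fi
  echo "$needed"
}

# 10 queued slots, 4 slots per node, between 1 and 8 nodes
desired_nodes 10 4 1 8   # prints 3
```

The minimum and maximum play the same role as the initial and maximum node counts requested when the cluster is created.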
Flight enables collaboration
Access the graphical console of your
control node simultaneously with your
collaborators
• Run visual apps that use the elastic
cluster to drive visual results, and
work together in the visual
console in real time
Shared and secure cloud workspaces
• Control access and focus on data
analysis
• Make more discoveries faster
• Save lives
• Change the world
Collaborative IGV
Integrative Genomics
Viewer (IGV) workspace
for variant analysis
SPOT + HPC = peas and carrots
Multiple pricing methods
[Figure: Spot Market filler; axes: # CPUs vs. time]
Spot Market
Our ultimate space
filler.
Spot Instances allow you
to name your own price
for spare AWS
computing capacity.
Great for workloads that
aren’t time sensitive, and
especially popular in
research (hint: it’s really
cheap).
Spot vs. On-Demand (YMMV)
4 compute nodes, 2 hours
• On-Demand, us-east
• $19.13
• Spot (us-west-1)
• $7.22
• About 38% of the On-Demand cost (a bit over 1/3)
16 compute nodes, 32 hours
• On-Demand, us-east
• $1,018.77
• Spot (us-west-1)
• $223.11
• About 22% of the On-Demand cost (a bit over 1/5)
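The two comparisons above reduce to a simple ratio. A minimal sketch, using the dollar figures from the slide converted to integer cents so that plain shell arithmetic is enough; the function name `spot_percent` is an assumption. Note that the two prices come from different regions, so this is a rough comparison only.

```shell
# Spot cost as a percentage of On-Demand cost, rounded to the
# nearest whole percent. Inputs are integer cents.
spot_percent() {
  local spot_cents="$1" ondemand_cents="$2"
  echo $(( (spot_cents * 100 + ondemand_cents / 2) / ondemand_cents ))
}

spot_percent 722 1913        # 4 nodes, 2 hours: prints 38
spot_percent 22311 101877    # 16 nodes, 32 hours: prints 22
```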
Fermi National Accelerator Laboratory
Dr. Panagiotis Spentzouris
Fermi National Accelerator Laboratory
Fermilab is America’s particle physics and accelerator lab.
• Mission: solve the mysteries of matter, energy, space and time
for the benefit of all.
More than 4,200 scientists worldwide use Fermilab and its
particle accelerators, detectors and computers for their
research.
Particle Physics Science Drivers
Utilize high-energy particle beam collisions to discover
• the origin of mass, the nature of dark matter, extra dimensions.
Employ high-flux beams to explore
• neutrino interactions, to answer questions about the origins of
the universe, matter-antimatter asymmetry, force unification.
• rare processes, to open a doorway to realms of ultra-high
energies, close to the unification scale.
Massive instruments generate massive data
Fermilab experiments
The big data frontier… from Wired
Fermilab Facility Evolution: HEPCloud
HEPCloud: provides cost-effective and efficient “elastic”
resource deployment, using a sophisticated decision engine
and middleware for automation. A single portal to
heterogeneous computing and storage resources, both local
and “rental” (commercial or academic).
• Initial focus on commercial clouds → AWS
AWS infrastructure
• AWS CloudFormation
automates the setup and
teardown of the Amazon Route
53 DNS entries, the Elastic
Load Balancing load balancer,
the Auto Scaling group, and
Amazon CloudWatch
monitoring
• Launched in each Availability
Zone prior to workflows being
run
On-demand services
• Workflows depend on software services to run
• Automating the deployment of these services on AWS on demand enables scalability and cost savings
– Services include data caching (e.g., Squid), WMS, submission service, data transfer, etc.
– As services are made deployable on demand, instantiate an ensemble of services together
(e.g., through AWS CloudFormation)
• Example: on-demand Squid
Decision engine
Reaching ~60k slots on AWS with HEPCloud
HEPCloud/AWS: 25% of CMS global capacity
HPC needs of particle physics workflows
Now that the HTC use case is out of the way…
Machine learning for pattern recognition
Specialized HPC demands
Very large computations (petascale) of physics processes
necessary for theoretical interpretations
Very large computations (petascale) for modeling particle
accelerators and detectors
Demo: Scaling jobs on a budget
Summary
Easy to “recreate” clusters in the cloud
• Extremely scalable and flexible
Spot + HPC is a wonderful combination
• Saves time and money
Customer example—FNAL
Alces Flight in AWS Marketplace
This is only the beginning—rethink HPC applications for
fault tolerance, extreme scalability, etc.
Thank you!
Demo backups
Introduction
Set up a 2-node cluster (2 compute nodes) where:
• Master node = c4.8xlarge
• 2x compute nodes = c4.xlarge
• 10GigE networking
Run compute nodes on the Spot market and the master node On-Demand
Access cluster from Microsoft Windows box (using PuTTY)
Step 1
Start up cluster using Alces Flight JSON file
• CloudFormation service
• Click Create Stack
• Answer questions
• Key file is critical! You will use it to log in to master node.
• Choose a reasonable Spot price (check current market in region)
– http://aws.amazon.com/ec2/pricing/ (near bottom of page)
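The console steps above have a command-line counterpart in the AWS CLI (`aws cloudformation create-stack`). The sketch below only prints the command rather than running it. The template file name, the parameter keys `KeyPair` and `SpotPrice`, and the bid value are hypothetical placeholders; the real keys are whatever the template defines. `--capabilities CAPABILITY_IAM` is only needed if the template creates IAM resources.

```shell
# Print (do not execute) an AWS CLI call that would create the stack.
# ASSUMPTIONS: the template file name and the parameter keys
# (KeyPair, SpotPrice) are hypothetical placeholders; use the keys
# that the actual template defines.
build_create_stack_cmd() {
  local stack_name="$1" template_file="$2" key_pair="$3" spot_price="$4"
  printf 'aws cloudformation create-stack'
  printf ' --stack-name %s' "$stack_name"
  printf ' --template-body file://%s' "$template_file"
  printf ' --parameters ParameterKey=KeyPair,ParameterValue=%s' "$key_pair"
  printf ' ParameterKey=SpotPrice,ParameterValue=%s' "$spot_price"
  printf ' --capabilities CAPABILITY_IAM\n'
}

# Example with made-up values; review the printed command before running it.
build_create_stack_cmd demo-cluster flight-template.json my-key 0.10
```

Printing the command first makes it easy to check the key pair and Spot price before anything is launched.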
AWS CloudFormation page
Create a stack
• Specify the
details of
template
instantiation
• Called a “stack”
• Allows you to
tailor stack to
needs
Stack details—top portion
Name of cluster
Spot bid
Instance type for compute nodes
Amazon S3 bucket for customizations
Key pair for that region
Number of initial nodes in cluster
Stack details—bottom portion
Storage capacity for instances
Master node type
Max number of nodes
Network CIDR
User name
Optional tags
Review of stack configuration—1
Review of stack configuration—2
Don’t forget to check this box!
Stack gets created—1
Stack gets created—2
Stack is done and cluster exists!
Demo 2
Cluster configuration
Recall that Alces Flight comes with:
• Environment modules (connected to Alces Gridware)
• pdsh
• SGE job scheduler
• GNU Compilers
• Alces Gridware
• Built on CentOS 7
• 750+ applications and libraries (MPI included)
Log in to the master node (using PuTTY) and try out commands. Run an
application.
Start up PuTTY
Copy/Paste IP address
Keep alive in PuTTY
• Go to Connection on
left menu
• Click on it
• Select Enable TCP
keepalives
• Keeps PuTTY
connection alive
Add key to PuTTY session
• Go to SSH on left menu
• Expand menu
• Select Auth
• Use Browse to locate the
private key (should be the
same as was used when
cluster was created)
• Note: Has to be in .ppk
format (might have to
convert it from .pem format)
Log in to master node!
• Use “alces” as login (should
match what you input to create
cluster)
• No password needed (uses
the private key)
• Ready to go!
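From a Linux or macOS machine, the same login can be done with OpenSSH instead of PuTTY. A minimal sketch that prints the equivalent command; the helper name `ssh_login_cmd`, the key file name, and the IP address (a documentation-only address) are assumptions. OpenSSH reads the .pem key directly, so no .ppk conversion is needed in this case.

```shell
# Build the OpenSSH equivalent of the PuTTY session described above.
# TCPKeepAlive mirrors PuTTY's "Enable TCP keepalives" option;
# ServerAliveInterval adds an SSH-level keepalive every 60 seconds.
ssh_login_cmd() {
  local key_file="$1" user="$2" host="$3"
  echo "ssh -i $key_file -o TCPKeepAlive=yes -o ServerAliveInterval=60 $user@$host"
}

# Placeholder key file and address; substitute the values for your cluster.
ssh_login_cmd mykey.pem alces 203.0.113.10
```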
Check number of nodes
• pdsh uses genders
– “nodes” are only
compute nodes
– “cluster” includes
master node
• Be sure to check
“qhost” for compute
nodes
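A quick way to turn pdsh output into a node count is to count the distinct host prefixes, since pdsh prefixes each output line with "hostname: ". A minimal sketch; the helper name `count_hosts` and the host names in the sample are assumptions.

```shell
# Count the distinct hosts that answered a pdsh command.
count_hosts() {
  # pdsh prefixes every output line with "hostname: "
  cut -d: -f1 | sort -u | awk 'END { print NR }'
}

# On a live cluster (genders group "nodes" = compute nodes only):
#   pdsh -g nodes hostname | count_hosts
# Illustrative input with made-up host names:
printf 'node01: node01\nnode02: node02\n' | count_hosts   # prints 2
```

Using the "cluster" group instead of "nodes" would include the master node in the count.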
Available modules at boot
• AWS command line tools installed by default
“alces gridware list”
Install an application
• Search for application
using “alces gridware
search …”
• Install application using
“alces gridware install …”
• Environment modules are
updated when application is
installed
Modules after installing application
• To run application
don’t forget to
load the
application
module!
Run application
Remove module and application
• First, remove
module
• Second, run
“alces gridware
purge …”
Demo 3
Cluster is up—show running MPI application
• Which MPI application (make it something reasonable)
Install application
• Show change in modules
Job script (go over details)
Submit job—show output of qstat
• Auto Scaling?
Show output from application (yes it’s running)
Cluster MPI definition
Install MPI application using alces gridware
Load module
Set up job script
Submit job
Watch it run (run, app, run)
List available depots
Install benchmark depot
• Installs depot
• Abbreviated
output
Check modules
New modules
Don’t forget to
load modules
before running!
Create job script
Don’t forget that Alces Flight uses SGE
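#!/bin/bash
# Merge stderr into stdout, name the job "imb", write output to $HOME/imb_out.<job id>
#$ -j y -N imb -o $HOME/imb_out.$JOB_ID
# Request 2 slots in the mpinodes-verbose parallel environment,
# run from the current directory, and export the current environment
#$ -pe mpinodes-verbose 2 -cwd -V
module load mpi/openmpi
module load apps/imb
mpirun IMB-MPI1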
Alces also has job templates available:
“alces gridware templates list”
Submit job and check status
Once job is done—check output
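While the job is queued or running, the default `qstat` listing can be summarized by job state (for example `qw` for waiting, `r` for running). A minimal sketch that assumes the default SGE `qstat` layout, with two header lines and the state in the fifth column; the helper name `job_states` and the sample rows are made up for illustration and were not captured from a real cluster.

```shell
# Summarize qstat output as "<state> <count>" lines.
job_states() {
  # Skip the two header lines; the job state is the fifth column.
  awk 'NR > 2 { count[$5]++ } END { for (s in count) print s, count[s] }' | sort
}

# On a live cluster: qstat | job_states
# Illustrative sample only (made up, not real output):
job_states <<'EOF'
job-ID  prior    name  user   state  submit/start at      queue  slots
-----------------------------------------------------------------------
1       0.55500  imb   alces  r      06/20/2016 10:00:00  all.q  2
2       0.00000  imb   alces  qw     06/20/2016 10:01:00         2
EOF
```

Once the job no longer appears in `qstat`, the results are in the file named by the `-o` option of the job script.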

Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

  • 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Dr. Jeffrey B. Layton Global Scientific Computing June 20, 2016 Building HPC Clusters as Code in the [Almost] Infinite Cloud
  • 2. Agenda • Why cloud for HPC? • Tools for creating clusters in the cloud • SPOT + HPC = peas and carrots • Fermi National Accelerator Laboratory • Demo of scaling jobs on a budget • Summary
  • 3. Why cloud for HPC? Scalability • If you need to run on lots of cores, just spin them up • If you don’t need nodes, turn them off (and don’t pay for them) Time to research • Usually on-premises high-performance computing (HPC) resources are centralized (shared) • Researchers like to have their own nodes when they need them World-wide collaboration • Share data and interact with it by using the cloud Latest technology and various instance types Can save $$$ Flexibility: code as infrastructure
  • 4. AWS HPC architectures—phases of deployment • Fork lift • Make it look like on-premises • Cloud “port” • Adapt to cloud features • Auto Scaling • Spot • Born in the cloud • Cycle computing • Rethink application • Microservices and serverless computing You must think in “cloud” You cannot think in “on-prem” and transpose You must think in “cloud” Do you think you can do that, Mr. Gant?
  • 5. AWS HPC architecture Master Node Compute Node Compute Node Compute Node Compute Node Storage (NFS, Parallel) Master Instance Compute Instance Compute Instance Compute Instance Compute Instance Storage (NFS, Parallel) On-premises AWS Cloud Compute Instance Compute Instance
  • 6. HPC tools MIT StarCluster • No longer being supported nor developed Bright Cluster Manager • Good for hybrid solutions CloudyCluster • Omnibond (out of Clemson University) Amazon Cfncluster • Getting started writing your own tools Alces Flight—on AWS Marketplace
  • 7. Alces Flight Alces Flight is software offering self-service supercomputers by using the AWS Marketplace (the cloud’s “App Store”). It creates self-scaling clusters with more than 750 popular scientific applications pre-installed, complete with libraries and various compiler optimizations, ready to run. The clusters use the AWS Spot market by default. 5 minutes http://alces-flight.com
  • 8. Alces Flight is familiar and flexible • Same tools as virtually all HPC systems • Environment modules • Job scheduler (SGE) • Catalog of 750+ prebuilt scientific applications and libraries including visualization tools • Alces gridware tool for application management • Integrated with modules • Defaults to the Spot market • Auto Scaling cluster based on queued jobs
  • 9. Flight enables collaboration Access the graphical console of your control node simultaneously with your collaborators • Run visual apps that use the elastic cluster to drive visual results, and work together through the visual console in real time Shared and secure cloud workspaces • Control access and focus on data analysis • Make more discoveries faster • Save lives • Change the world Collaborative IGV Integrative Genomics Viewer (IGV) workspace for variant analysis
  • 10. SPOT + HPC = peas and carrots
  • 12. Spot Market [Chart: # CPUs over time; Spot Market as filler] Our ultimate space filler. Spot Instances allow you to name your own price for spare AWS computing capacity. Great for workloads that aren’t time sensitive, and especially popular in research (hint: it’s really cheap).
  • 13. Spot vs. On-Demand (YMMV) 4 compute nodes, 2 hours • On-Demand, us-east • $19.13 • Spot (us-west-1) • $7.22 • Almost 1/3 the cost! 16 compute nodes, 32 hours • On-Demand, us-east • $1,018.77 • Spot (us-west-1) • $223.11 • Almost 1/5 the cost!
  • 14. Fermi National Accelerator Laboratory Dr. Panagiotis Spentzouris
  • 15. Fermi National Accelerator Laboratory Fermilab is America’s particle physics and accelerator lab. • Mission: solve the mysteries of matter, energy, space and time for the benefit of all. More than 4,200 scientists worldwide use Fermilab and its particle accelerators, detectors and computers for their research.
  • 16. Particle Physics Science Drivers Utilize high-energy particle beam collisions to discover • the origin of mass, the nature of dark matter, extra dimensions. Employ high-flux beams to explore • neutrino interactions, to answer questions about the origins of the universe, matter-antimatter asymmetry, force unification. • rare processes, to open a doorway to realms of ultra-high energies, close to the unification scale.
  • 17. Massive instruments generate massive data Fermilab experiments
  • 18. The big data frontier… from Wired
  • 19. Fermilab Facility Evolution: HEPCloud HEPCloud: Provide cost-effective and efficient “elastic” resource deployment, utilizing a sophisticated decision engine and middleware for automation. A single portal to heterogeneous computing and storage resources, both local and “rental” (commercial or academic). • Initial focus on commercial clouds ➡️ AWS
  • 20. AWS infrastructure • AWS CloudFormation automates the setup and teardown of the Amazon Route 53 DNS entries, the Elastic Load Balancing load balancer, the Auto Scaling group, and Amazon CloudWatch monitoring • Launched in each Availability Zone prior to workflows being run
  • 21. On Demand services ● Workflows depend on software services to run ● Automating the deployment of these services on AWS on demand enables scalability and cost savings o Services include data caching (e.g. Squid), WMS, submission service, data transfer, etc. o As services are made deployable on demand, instantiate an ensemble of services together (e.g. through AWS CloudFormation) ● Example: On-demand Squid
  • 23. Reaching ~60k slots on AWS with HEPCloud
  • 24. HEPCloud/AWS: 25% of CMS global capacity
  • 25. HPC needs of particle physics workflows Now that the HTC use case is out of the way… Machine learning for pattern recognition Specialized HPC demands Very large computations (petascale) of physics processes necessary for theoretical interpretations Very large computations (petascale) for modeling particle accelerators and detectors
  • 26. Demo: Scaling jobs on a budget
  • 28. Summary Easy to “recreate” clusters in the cloud • Extremely scalable and flexible Spot + HPC is a wonderful combination • Saves time and money Customer example—FNAL Alces Flight in AWS Marketplace This is only the beginning—rethink HPC applications for fault tolerance, extreme scalability, etc.
  • 31. Introduction Set up a 2-node cluster (2 compute nodes) where: • Master node = c4.8xlarge • 2x compute nodes = c4.xlarge • 10GigE networking Run compute nodes on Spot market and master node On-Demand Access cluster from Microsoft Windows box (using PuTTY)
  • 32. Step 1 Start up cluster using Alces Flight JSON file • CloudFormation service • Click Create Stack • Answer questions • Key file is critical! You will use it to log in to master node. • Choose a reasonable Spot price (check current market in region) – http://aws.amazon.com/ec2/pricing/ (near bottom of page)
  • 34. Create a stack • Specify the details of template instantiation • Called a “stack” • Allows you to tailor stack to needs
  • 35. Stack details—top portion Name of cluster Spot bid Instance type for compute nodes Amazon S3 bucket for customizations Key pair for that region Number of initial nodes in cluster
  • 36. Stack details—bottom portion Storage capacity for instances Master node type Max number of nodes Network CIDR User name
  • 38. Review of stack configuration—1
  • 39. Review of stack configuration—2 Don’t forget to check this box!
  • 42. Stack is done and cluster exists!
  • 44. Cluster configuration Recall that Alces Flight comes with: • Environment modules (connected to Alces Gridware) • Pdsh • SGE job scheduler • GNU Compilers • Alces Gridware • Built on CentOS 7 • 750+ applications and libraries (MPI included) Log into master node and try out commands (PuTTY). Run application.
  • 46. Keep alive in PuTTY • Go to Connection on left menu • Click on it • Select Enable TCP keepalives • Keeps PuTTY connection alive
  • 47. Add key to PuTTY session • Go to SSH on left menu • Expand menu • Select Auth • Use Browse to locate the private key (should be the same as was used when cluster was created) • Note: Has to be in .ppk format (might have to convert it from .pem format)
  • 48. Log in to master node! • Use “alces” as login (should match what you input to create cluster) • No password needed (uses the key pair) • Ready to go!
  • 49. Check number of nodes • pdsh uses genders – “nodes” are only compute nodes – “cluster” includes master node • Be sure to check “qhost” for compute nodes
  • 50. Available modules at boot • AWS command line tools installed by default
  • 52. Install an application • Search for application using “alces gridware search …” • Install application using “alces gridware install …” • Environment modules are updated when an application is installed
  • 53. Modules after installing application • To run application don’t forget to load the application module!
  • 55. Remove module and application • First, remove module • Second, run “alces gridware purge …”
  • 57. Demo 3 Cluster is up—show running MPI application • Which MPI application (make it something reasonable) Install application • Show change in modules Job script (go over details) Submit job—show output of qstat • Auto Scaling? Show output from application (yes it’s running)
  • 58. Cluster MPI definition Install MPI application using alces gridware Load module Set up job script Submit job Watch it run (run, app, run)
  • 60. Install benchmark depot • Installs depot • Abbreviated output
  • 61. Check modules New modules Don’t forget to load modules before running!
  • 62. Create job script Don’t forget that Alces Flight uses SGE #!/bin/bash #$ -j y -N imb -o $HOME/imb_out.$JOB_ID #$ -pe mpinodes-verbose 2 -cwd -V module load mpi/openmpi module load apps/imb mpirun IMB-MPI1 Alces also has job templates available: “alces gridware templates list”
  • 63. Submit job and check status
  • 64. Once job is done—check output
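
The job script from slide 62 is easier to read with its line breaks restored. The sketch below writes it to a file so it can be submitted on the master node. The helper name write_imb_job and the file name imb.sh are assumptions made for illustration; the SGE directives, parallel environment name, and module names are taken from the slide as-is and may differ on your cluster.

```shell
#!/bin/bash
# Sketch only: write_imb_job and imb.sh are hypothetical names, not part of Alces Flight.
# The script body follows slide 62, with plain ASCII hyphens on every SGE option.
write_imb_job() {
  local out="${1:-imb.sh}"
  # Quoted heredoc keeps $HOME and $JOB_ID literal so SGE expands them at run time.
  cat > "$out" <<'EOF'
#!/bin/bash
#$ -j y -N imb -o $HOME/imb_out.$JOB_ID
#$ -pe mpinodes-verbose 2 -cwd -V
module load mpi/openmpi
module load apps/imb
mpirun IMB-MPI1
EOF
}

write_imb_job imb.sh
```

On the master node, the resulting file would typically be submitted with “qsub imb.sh” and monitored with “qstat”. With the -o setting above, output should land in a file named imb_out.<job id> in the home directory.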

Editor's Notes

  1. Alces Flight can instantly take a researcher from zero to hero by building HPC clusters at any scale with one of the largest catalogs of scientific applications ever put in one place, all immediately accessible in the AWS Cloud. Through AWS Marketplace, Alces Flight gives researchers access to a massive catalog of scientific apps in exactly the same way they’re used to working with a national supercomputing center, along with libraries, compilers and job schedulers that provide a very familiar look and feel. It also provides many things that national shared facilities can’t easily provide, like console access for GUIs and visualization tools, or admin access to install packages or modify the environment to suit the specific needs of a user.