High Performance Computing in AWS, Immersion Day Huntsville 2019


Learn more about High Performance Computing on AWS for the Department of Defense.

  1. 1. Brad Dispensa, Pr. Security and Compliance SA, WW Public Sector. HPC on AWS. Wednesday, October 9, 2019. © 2019 Amazon Web Services, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon Web Services, Inc.
  2. 2. Talk Outline • Brief Overview: Cloud Computing on AWS • Why use AWS for HPC? • HPC Building Blocks in AWS: Compute, Networking, Storage, Deployment Tools • Customer Success Stories and Example Use Cases
  3. 3. 1 What is Cloud Computing on AWS?
  4. 4. Global Infrastructure: 69 Availability Zones within 22 geographic Regions around the world (map legend: Coming soon). We add the equivalent of an entire Fortune 500 company’s compute capacity every day.
  5. 5. AWS Availability Zones, Data Centers, Servers. [Diagram: five Availability Zones (AZ) and two Transit centers.] Over 50,000 servers, and often over 80,000.
  6. 6. 2 Why AWS for HPC?
  7. 7. Why use AWS for HPC? • Amazing Amount of Compute Capacity: Economy of Scale • A virtually unlimited number of architecture options: Instance Types, OS, Traditional Cluster, Auto Scaling Clusters, Serverless, GPUs • Extensive deployment options, “Infrastructure as Code”: Console, Configuration Control, Automated, SDK, Bash/CLI, AWS CloudFormation • Lots of useful services: Amazon DynamoDB, Amazon CloudWatch, Amazon Glacier, AWS Lambda, and much more!
  8. 8. Great Features for HPC Workloads Experimentation without Fear! Activate Multiple Compute Clusters Simultaneously! A Supercomputer at the Fingertip of EACH Scientist! Start and stop instances or entire clusters! Take Advantage of Spot Pricing Receive Continuous Updates Compute, Network, Storage All Services Immediate Access to latest Technology
  9. 9. The Life of an Average HPC Code on a Supercomputer: The average number of cores: 14 The average wall-clock time: 1.69 hrs The average queue wait time: 4.4 days! Cloud Improves “Workload Throughput” Think: “Workload Throughput” § The job queue becomes the capacity buffer § Job completion times are hard to predict § Users are frustrated and run fewer jobs § Innovation is throttled by fixed IT resources Run many Jobs in Parallel, Use it when you need it Pay only for what you use Right-size clusters and resources Optimize each workload for performance
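A quick worked calculation makes the queue-wait point concrete. The sketch below uses only the two averages quoted on the slide (1.69 hours of wall-clock time, 4.4 days of queue wait); the function names are illustrative, not part of any AWS tool.

```python
# Averages quoted on the slide above.
AVG_WALL_CLOCK_HOURS = 1.69
AVG_QUEUE_WAIT_DAYS = 4.4


def turnaround_hours(wall_clock_hours: float, queue_wait_days: float) -> float:
    """Time from job submission to result, in hours."""
    return queue_wait_days * 24 + wall_clock_hours


def compute_fraction(wall_clock_hours: float, queue_wait_days: float) -> float:
    """Share of the turnaround time the job actually spends computing."""
    return wall_clock_hours / turnaround_hours(wall_clock_hours, queue_wait_days)


total = turnaround_hours(AVG_WALL_CLOCK_HOURS, AVG_QUEUE_WAIT_DAYS)
fraction = compute_fraction(AVG_WALL_CLOCK_HOURS, AVG_QUEUE_WAIT_DAYS)
print(f"turnaround: {total:.2f} h, computing for {fraction:.1%} of it")
```

With these averages the job computes for under 2% of its turnaround time; the rest is spent waiting in the queue.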
  10. 10. Time-to-results Efficiency. [Figure: jobs of varying core counts queued under a fixed data center capacity limit, compared with the same jobs run on elastic capacity; axis: Cores.] Finite capacity, usually with long queues to wait in, versus massive capacity when needed to speed up time to results, and an agile environment when additional hardware and software experimentation is needed. “For every $1 spent on HPC, businesses see $463 in incremental revenues and $44 in incremental profit.”
  11. 11. What this can look like… Scale using Elastic Capacity. Time +00h: <1,000 cores. [Chart axis: # of Cores, 0 to 75,000.]
  12. 12. What this can look like… Scale using Elastic Capacity. Time +24h: >75,000 memory optimized cores. [Chart axis: # of Cores, 0 to 75,000.]
  13. 13. What this can look like… Scale using Elastic Capacity. Time +72h: <1,000 cores. [Chart axis: # of Cores, 0 to 75,000.]
  14. 14. What this can look like… Scale using Elastic Capacity. Time +120h: >30,000 GPU optimized cores. [Chart axis: # of Cores, 0 to 75,000.]
  15. 15. 3 AWS Building Blocks
  16. 16. AWS HPC Building Blocks Outline Compute Storage/Data Management Networking Deployment Tools
  17. 17. “What did you say???” – The AWS Lingo
      Term            Meaning
      AWS             Amazon Web Services
      EC2             Elastic Compute Cloud; an AWS service providing virtual machines
      AMI             Amazon Machine Image, a virtual image for a virtual machine
      Instance        One launched virtual server
      EBS             Elastic Block Store, data storage attached to an EC2 Instance
      VPC             Virtual Private Cloud, your private piece of the cloud
      S3              Simple Storage Service, amazing object storage service
      Security Group  The instance firewall
  18. 18. Compute
  19. 19. CHOICE OF AWS INSTANCES FOR HPC • General purpose: M4, M5 • Compute optimized, core count: C4, C5(n) • Memory optimized: R4, R5, X1, z1d • Storage and IO optimized: I3, D2 • GPU, FPGA accelerated: P2, P3dn, F1
  20. 20. Amazon EC2 Instances. In c4.large, “c” is the Instance Family, “4” is the Instance Generation, and “large” is the Instance Size. Vertical Scaling: c4.xlarge, c4.2xlarge, c4.4xlarge, c4.8xlarge.
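The naming scheme on this slide (family, generation, size) can be sketched as a small parser. This is an illustration only: the regular expression and the `attributes` field (for suffix letters such as the “n” in c5n or the “d” in z1d) are assumptions made for this example, not an official AWS grammar.

```python
import re
from typing import NamedTuple


class InstanceType(NamedTuple):
    family: str       # e.g. "c" for compute optimized
    generation: int   # e.g. 4 in "c4"
    attributes: str   # suffix letters such as "n" or "d" (hypothetical field name)
    size: str         # e.g. "large", "8xlarge"


# Assumed pattern: letters, digits, optional suffix letters, a dot, then the size.
_PATTERN = re.compile(r"^([a-z]+)(\d+)([a-z]*)\.([0-9a-z]+)$")


def parse_instance_type(name: str) -> InstanceType:
    """Split a name such as 'c5n.18xlarge' into its parts."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized instance type name: {name!r}")
    family, generation, attributes, size = match.groups()
    return InstanceType(family, int(generation), attributes, size)


for name in ("c4.large", "c5n.18xlarge", "z1d.12xlarge"):
    print(name, "->", parse_instance_type(name))
```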
  21. 21. Selecting an instance type for an HPC
      Instance Type    vCPU  Memory (GiB)  Storage (GB)  Networking Performance  Physical Processor   Vector Engine  Clock Speed (GHz)  Hypervisor
      c4.8xlarge       36    60            EBS Only      10 Gigabit              Intel Xeon V3        AVX2           2.9                Xen based
      c5(n).18xlarge   72    144           EBS Only      25/100 Gigabit          Intel Xeon Platinum  AVX512         3.5                Nitro
      m5(d).24xlarge   48    384           EBS/SSD       25 Gigabit              Intel Xeon Platinum  AVX512         2.5                Nitro
      r4.16xlarge      32    488           EBS Only      25 Gigabit              Intel Xeon V4        AVX2           2.3                Xen based
      r5(d).24xlarge   96    768           EBS/SSD       25 Gigabit              Intel Xeon Platinum  AVX512         3.1                Nitro
      x1.32xlarge      128   1,952         SSD           25 Gigabit              Intel Xeon V3        AVX2           2.3                Nitro
      z1d.12xlarge     48    384           SSD           25 Gigabit              Intel Xeon Platinum  AVX512         4.0                Nitro
  22. 22. High network bandwidth compute instances: C5n, P3dn, i3en C5n § First “network optimized” instances on AWS § Will deliver up to 100Gbps network throughput § Instances based on C5/P3/i3 instances: § Intel Skylake/Broadwell CPUs § Nitro System (hypervisor and ENA) § Intended for network-intensive applications including HPC
  23. 23. High bandwidth compute instances: C5n. HPC stack on AWS: 3D graphics virtual workstation; License managers and cluster head nodes with job schedulers; Cloud-based, auto-scaling HPC clusters; Shared file storage; Storage cache. Massively scalable performance • C5n instances will offer up to 100 Gbps of network bandwidth • Significant improvements in maximum bandwidth, packets per second, and packet processing • Custom designed Nitro network cards • Purpose-built to run network bound workloads including distributed cluster and database workloads, HPC, real-time communications and video streaming
  24. 24. High bandwidth compute instances: P3dn. Optimized for distributed ML training • One of the most powerful GPU instances available in the cloud • Distributed machine learning training across multiple GPU instances • 100 Gbps of networking throughput • Based on NVIDIA’s latest GPU, the Tesla V100, with 32GB of memory each • The largest Amazon Elastic Compute Cloud (Amazon EC2) P3 instance size available
  25. 25. High clock speed compute instances: Z1d. Up to 4 GHz sustained, all-turbo performance • Z1d instances are optimized for memory-intensive, compute-intensive applications • Custom Intel Xeon Scalable processor • Up to 4 GHz sustained, all-turbo performance • Up to 384 GiB DDR4 memory • Enhanced networking, up to 25 Gbps throughput
  26. 26. Network “It’s as fast as its SLOWEST Component”
  27. 27. AWS is Committed to Networking at Scale • The AWS Network is Custom Built – Full bi-section bandwidth in placement groups – Designed such that all ports can run flat out – No blocking, no oversubscription – Continuously improving – Commodity parts on a Moore’s Law pace • Enhanced Networking – Reduced instance-to-instance latency – Reduced jitter • Amazon Elastic Network Adapter – New PCI network device developed for EC2 – Available on newer instances, including C5, M5, R4, R5, Z1d – Ability to scale across a variety of bandwidths • 10 and 20 Gbps today “We love where we are right now,” AND “It will only get better!”
  28. 28. ELASTIC NETWORK ADAPTER § Latest generation of Enhanced Networking § Hardware Checksums § Multi-Queue Support § Receive Side Steering § 25Gbps in a Placement Group § Open Source Amazon Network Driver
  29. 29. Remember this? [Diagram: five Availability Zones (AZ) and two Transit centers.] Our Network Needs to Scale! 22 Regions – 69 Availability Zones – 87 Edge Locations
  30. 30. Elastic Fabric Adapter (EFA): best for large HPC workloads. Scale tightly-coupled HPC applications on AWS. Instances shown: C5n, P3dn, i3en.
  31. 31. AWS Elastic Fabric Adapter – EFA § Proprietary, AWS-designed fabric network § Built on top of network optimized instances on AWS § Delivering up to 100Gbps network throughput § Delivering below 15µs latencies for HPC applications § Optimized for OpenMPI and other MPI libraries § Supported on C5n, R5n, M5n, and P3dn instances Amazon Confidential – provided under NDA
  32. 32. HPC software stack in Amazon EC2 Userspace Kernel Without EFA With EFA
  33. 33. What can EFA do? Thanks to Metacomp Technologies and the Klingon Empire.
  34. 34. OpenFoam benchmark MotorBike 140M
  35. 35. Storage
  36. 36. Comprehensive portfolio of storage options for HPC • Block storage, Amazon EBS: elastic, high performance block storage at any scale • File storage, Amazon EFS: petabyte-scale, elastic file storage sharable across applications, instances and servers • Object storage, Amazon S3: low cost, highly scalable cloud storage with 99.999999999% durability
  37. 37. Amazon FSx for Lustre: Fully managed high performance parallel shared file system • High performing parallel distributed file system • Tune complex performance parameters • Massively scalable performance: 100+ GiB/s throughput, millions of IOPS, consistent low latencies
  38. 38. High and scalable performance • Each terabyte (TB) of storage provides 200 MB/second of file system throughput and ~5,000 IOPS • Parallel file system • 100+ GiB/s throughput • Millions of IOPS • Consistent sub-millisecond latencies • Supports concurrent access from hundreds of thousands of cores • SSD-based
  39. 39. File system throughput & IOPS scale linearly with storage capacity. Each TB of storage provides 200 MB/s of baseline throughput, and up to 12x burst throughput. File systems can scale to hundreds of GB/s and millions of IOPS.
      Capacity  Baseline throughput  Burst throughput
      1 TB      200 MB/s             up to 2.4 GB/s
      10 TB     2 GB/s               up to 24 GB/s
      50 TB     10 GB/s              up to 120 GB/s
      100 TB    20 GB/s              up to 240 GB/s
      1 PB      200 GB/s             at least 240 GB/s
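The linear scaling rule on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 GB/s = 1,000 MB/s) and applies the 12x burst multiplier only to the 1 TB to 100 TB rows, since the slide's 1 PB row lists burst as “at least 240 GB/s” rather than 12x the baseline. The constant and function names are illustrative.

```python
# Figures from the slide: 200 MB/s of baseline throughput per TB, up to 12x burst.
BASELINE_MB_S_PER_TB = 200
BURST_MULTIPLIER = 12


def baseline_throughput_mb_s(capacity_tb: float) -> float:
    """Baseline throughput in MB/s for a file system of the given capacity."""
    return capacity_tb * BASELINE_MB_S_PER_TB


def burst_ceiling_mb_s(capacity_tb: float) -> float:
    """Upper bound on burst throughput in MB/s under the 12x rule."""
    return baseline_throughput_mb_s(capacity_tb) * BURST_MULTIPLIER


for capacity_tb in (1, 10, 50, 100):
    base = baseline_throughput_mb_s(capacity_tb)
    burst = burst_ceiling_mb_s(capacity_tb)
    print(f"{capacity_tb:>4} TB: baseline {base / 1000:g} GB/s, "
          f"burst up to {burst / 1000:g} GB/s")
```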
  40. 40. Deployment Tools - Orchestration
  41. 41. Easy cluster management: AWS ParallelCluster Simplifies deployment of HPC in the cloud, including integrating with popular HPC schedulers Integrated with AWS Batch, Amazon FSx for Lustre and Elastic Fabric Adapter
  42. 42. AWS ParallelCluster § Simplifies deployment of HPC clusters in the cloud § Integrates with popular HPC schedulers such as SLURM, Grid Engine, Torque § Built on AWS CloudFormation § Easy to modify to meet specific application or project requirements • Latest Features: – Multiple EBS volumes – Custom AMI support • Bring your own custom AMI, not just build from our default AMI – Easy use of EFS and Lustre • Launch Templates support (for EFA) – Support for C5n and other new instance types – AWS Batch Integration • Open Source, available on GitHub
  43. 43. AWS Batch • Dynamically provisions resources • Plans, schedules, and executes • No additional components to install. [Diagram labels: Event (changes in data state, requests to endpoints, scheduled triggers, services (anything)); Job queue; Compute (Auto Scaling); Execution (your code).]
  44. 44. Efficient job scheduling: Multi-node parallel job support on AWS Batch. Simplify your compute clusters and scale jobs across multiple instances with AWS Batch support for Multi-node Parallel (MNP) jobs. [Diagram: “My job” runs as Container 1 on Instance 1, Container 2 on Instance 2, Container 3 on Instance 3, and Container 4 on Instance 4.]
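As a rough illustration of what a multi-node parallel job definition might look like, the sketch below builds a request payload as a plain Python dictionary. The field names (`type`, `nodeProperties`, `numNodes`, `mainNode`, `nodeRangeProperties`, `targetNodes`, `container`) follow the AWS Batch RegisterJobDefinition request shape as understood at the time of writing and should be checked against the current API reference. The helper name, image name, and resource values are hypothetical, and no AWS call is made here.

```python
from typing import Any, Dict, Sequence


def build_mnp_job_definition(
    name: str,
    image: str,
    num_nodes: int,
    vcpus: int,
    memory_mib: int,
    command: Sequence[str],
) -> Dict[str, Any]:
    """Build a multi-node parallel job definition payload (hypothetical helper)."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return {
        "jobDefinitionName": name,
        "type": "multinode",
        "nodeProperties": {
            "numNodes": num_nodes,
            "mainNode": 0,  # index of the node that coordinates the job
            "nodeRangeProperties": [
                {
                    # One container spec applied to every node, 0 through N-1.
                    "targetNodes": f"0:{num_nodes - 1}",
                    "container": {
                        "image": image,
                        "vcpus": vcpus,
                        "memory": memory_mib,
                        "command": list(command),
                    },
                }
            ],
        },
    }


# Four nodes, mirroring the four instances in the slide's diagram.
job_definition = build_mnp_job_definition(
    name="my-job",                           # hypothetical name
    image="example/my-mpi-app:latest",       # hypothetical container image
    num_nodes=4,
    vcpus=8,
    memory_mib=16384,
    command=["mpirun", "./my_solver"],       # hypothetical command
)
print(job_definition["nodeProperties"]["nodeRangeProperties"][0]["targetNodes"])
```

A payload like this would typically be passed as keyword arguments to the `register_job_definition` method of a boto3 Batch client.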
  45. 45. Orchestration tools include support for capacity and cost optimization Use Reserved Instances for known/steady-state workloads Scale using Spot, On-Demand, or both Evaluate the trade-off of time to solution vs. cost for scaling
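The trade-off between time to solution and cost can be sketched with a simple model. Everything numeric below is a made-up assumption for illustration: the per-core-hour prices are hypothetical, not AWS list prices, and the `parallel_efficiency` parameter is a simplification of real scaling behavior.

```python
# Hypothetical prices in dollars per core-hour; not actual AWS pricing.
HYPOTHETICAL_ON_DEMAND_PRICE = 0.05
HYPOTHETICAL_SPOT_PRICE = 0.015


def time_to_solution_hours(core_hours: float, cores: int,
                           parallel_efficiency: float = 1.0) -> float:
    """Wall-clock hours to finish a workload of the given size on `cores` cores."""
    return core_hours / (cores * parallel_efficiency)


def workload_cost(core_hours: float, cores: int, price_per_core_hour: float,
                  parallel_efficiency: float = 1.0) -> float:
    """Total cost: every provisioned core is billed for the whole run."""
    hours = time_to_solution_hours(core_hours, cores, parallel_efficiency)
    return hours * cores * price_per_core_hour


WORKLOAD_CORE_HOURS = 10_000  # hypothetical workload size

for cores, efficiency in ((100, 1.0), (1_000, 1.0), (1_000, 0.8)):
    hours = time_to_solution_hours(WORKLOAD_CORE_HOURS, cores, efficiency)
    on_demand = workload_cost(WORKLOAD_CORE_HOURS, cores,
                              HYPOTHETICAL_ON_DEMAND_PRICE, efficiency)
    spot = workload_cost(WORKLOAD_CORE_HOURS, cores,
                         HYPOTHETICAL_SPOT_PRICE, efficiency)
    print(f"{cores:>5} cores at {efficiency:.0%} efficiency: {hours:6.1f} h, "
          f"on-demand ${on_demand:,.2f}, spot ${spot:,.2f}")
```

Under perfect scaling, ten times the cores finish ten times sooner at the same total cost; cost only rises as parallel efficiency drops, which is the trade-off the slide asks you to evaluate.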
  46. 46. 4 Customer References
  47. 47. Running HPC applications at extreme scale. Single HPC cluster of 1 million vCPUs. Accelerating time to innovation: 20 days → 8 hours. “Storage technology is amazingly complex and we’re constantly pushing the limits of physics and engineering to deliver next-generation capacities and technical innovation. This successful collaboration with AWS shows the extreme scale, power and agility of cloud-based HPC to help us run complex simulations for future storage architecture analysis and materials science explorations. Using AWS to easily shrink simulation time from 20 days to 8 hours allows Western Digital R&D teams to explore new designs and innovations at a pace unimaginable just a short time ago.” (Steve Phillpott, CIO, Western Digital)
  48. 48. Descartes Labs makes the Top500 List running on AWS running-in-concert-on-aws-bf1610679978
  49. 49. 3 million core-hours of Amazon EC2 Spot capacity Complete sequencing of 3.24 billion base pairs
  50. 50. Helping financial institutions model investment risks • Manage 50X the number of securities • Run risk models 4,000 times faster • In hours, instead of months
  51. 51. 600 times faster Engineering simulations Helping to make supersonic flights mainstream
  52. 52. Flexible configuration and virtually unlimited scalability to grow and shrink your infrastructure as your HPC workloads dictate, not the other way around HPC on AWS
  53. 53. Thank You! Any Questions? For More Information: