
Intro to High Performance Computing in the AWS Cloud


Visit http://aws.amazon.com/hpc for more information about HPC on AWS.

High Performance Computing (HPC) allows scientists and engineers to solve complex science, engineering, and business problems using applications that require high bandwidth, low latency networking, and very high compute capabilities. AWS allows you to increase the speed of research by running high performance computing in the cloud and to reduce costs by providing Cluster Compute or Cluster GPU servers on-demand without large capital investments. You have access to a full-bisection, high bandwidth network for tightly-coupled, IO-intensive workloads, which enables you to scale out across thousands of cores for throughput-oriented applications.

  • Credit to: David Pellerin, Dougal Ballantyne, Angel Pizarro, and Ryan Shuttleworth for a lot of the content for these slides.
    Amy Sun for many of the AWS graphics.
    Ian Meyers for the financial grid computing architecture.

    Last Updated: July 26, 2014 – butlerb@

    Point of contact: Ben Butler (butlerb@)

    Tags: #hpc, #high performance computing, #cluster, #MPI, #SR-IOV, #bi-section, #bandwidth, #architecture, #instance
  • Unlimited Infrastructure – increased scalability and elasticity – go from tens to thousands of instances

    Efficient clusters – our compute instances deliver a high ratio of actual to theoretical performance (Rmax/Rpeak) – this means you need fewer nodes in your cluster than with less efficient clusters – HUGE potential cost savings. Tune your cluster, don't tune to your cluster – with AWS you can pick the right EC2 instance type instead of being forced to optimize your workload for the fixed cluster you have in-house


    Low Cost with Flexible pricing – multiple pricing models, pay as you go, no CAPEX

    Increased collaboration – access to clusters and data can be from anywhere with an internet connection

    Faster time to results – focus on your business/science and increase the efficiency of your people by not burdening them with IT. While you can benchmark a cluster on the performance of a single running job, it is more important and comprehensive to benchmark the total time it takes to provision and use the cluster end-to-end

    Concurrent Clusters on demand – no more waiting in a queue, run multiple jobs simultaneously with an API call


  • Other Use Cases:
    Science-as-a-Service
    Large-scale HTC (100,000+ core clusters)
    Small to medium-scale MPI clusters (hundreds of nodes)
    Many small MPI clusters working in parallel to explore parameter space
    GPGPU workloads
    Dev/test of MPI workloads prior to submitting to supercomputing centers
    Collaborative research environments
    On-demand academic training/lab environments
    Ephemeral clusters, custom tailored to the task at hand, created for various stages of a pipeline
  • Need design graphics here
    Slot in customers are here
    Customer logos

    *BEN: check with David Pellerin, Jamie and Dougal on the best verticals with the best customer reference for each vertical
  • Check on using screenshot of a website

  • http://news.cnet.com/8301-1001_3-57611919-92/supercomputing-simulation-employs-156000-amazon-processor-cores
    http://blog.cyclecomputing.com/2013/11/back-to-the-future-121-petaflopsrpeak-156000-core-cyclecloud-hpc-runs-264-years-of-materials-science.html

    Computational compound analysis of a solar panel material; estimated computation time: 264 years
  • DESIGN/DEV:
    Need a much better graphic to signify the idea of using multiple EC2 purchasing options to optimize cost.

    Medium RI to replace Light RI in the chart.
  • http://aws.amazon.com/solutions/case-studies/harvard/
    AWS Case Study: Harvard Medical School

    About Harvard Medical School
    The Laboratory for Personalized Medicine (LPM), of the Center for Biomedical Informatics at Harvard Medical School, run by Dr. Peter Tonellato, combined the power of high throughput sequencing and biomedical data collection technologies with the flexibility of Amazon Web Services (AWS) to develop innovative whole genome analysis testing models in record time. “The combination of our approach to biomedical computing and AWS allowed us to focus our time and energy on simulation development, rather than technology, to get results quickly,” said Tonellato. “Without the benefits of AWS, we certainly would not be as far along as we are.”

    The Challenge
    Tonellato’s lab focuses on personalized medicine—preventive healthcare for individuals based on their genetic characteristics—by creating models and simulations to assess the clinical value of new genetic tests.

    Other projects include simulating large patient populations to aid in clinical trial simulations and predictions. To overcome the difficulty of finding enough real patient data for modeling, LPM creates patient avatars—literally “virtual” patients. The lab can create different sets of avatars for different genetic tests and then replicate huge numbers of them based on the characteristics of hospital populations.

    Tonellato needed to find an efficient way to manipulate many avatars, sometimes as many as 100 million at a time. “In addition to being able to handle enormous amounts of data,” he said, “I wanted to devise a system where postdoctoral researchers can scope a genetic risk situation, determine the appropriate simulation and analysis to create the avatars, and then quickly build web applications to run the simulations, rather than spend their time troubleshooting computing technology.”

    Why Amazon Web Services
    In 2006, Tonellato turned to cloud computing to address the complex and highly variable computational need. “I evaluated several alternatives but found nothing as flexible and robust as Amazon Web Services,” he said. Having built datacenters previously, Tonellato could not afford the time he knew would be required to set up servers and then write code. Instead, he decided to conduct a test to see how fast his team could put together a series of custom Amazon Machine Images (AMIs) that would reflect the optimal development environment for researchers’ web applications.

    Now, Tonellato’s lab has extended their efforts to integrate Spot Instances into their workflows so that they could stretch their grant money even further. According to Tonellato, “We leverage Spot Instances when running Amazon Elastic Compute Cloud (Amazon EC2) clusters to analyze entire genomes. We have the potential to run even more worker nodes at less cost when using Spot Instances, so it is a huge saving in both time and cost for us. To take advantage of these savings, it just took us a day of engineering, and we saw roughly 50% savings in cost.” Tonellato’s lab leverages MIT’s StarCluster tools, which have built-in capabilities to manage an Oracle Grid Engine cluster on Spot Instances. Erik Gafni, a programmer in Tonellato’s lab, performed the integration of StarCluster into their workflow. According to Gafni, “Using StarCluster, it was incredibly easy to configure, launch, and start using a running Spot Cluster in less than 10 minutes.”

    In addition, the LPM recognized the need for published resources about how to effectively use cloud computing in an academic environment and published an educational primer in PLoS Computational Biology to address this need. “We believe this article clearly shows how an academic lab can effectively use AWS to manage their computing needs. It also demonstrates how to think about computational problems in relation to AWS costs and computing resources,” says Vincent Fusaro, lead author and senior research fellow in the LPM.

    The Benefits
    “The AWS solution is stable, robust, flexible, and low cost,” Tonellato commented. “It has everything to recommend it.”

    Tonellato runs his simulations on Amazon EC2, which provides customers with scalable compute capacity in the cloud. Designed to make web-scale computing easier for developers, Amazon EC2 makes it possible to create and provision compute capacity in the cloud within minutes.

    Tonellato’s lab is thrilled with their AWS solution. “The number of genetic tests available to doctors and hospitals is constantly increasing,” Tonellato explained, “and they can be very expensive. We’re interested in determining which tests will result in better patient care and better results.” He added, “We believe our models may dramatically reduce the time it usually takes to identify the tests, protocols, and trials that are worth pursuing aggressively for both FDA approval and clinical use.”

    Next Step
    To learn more about how AWS can help your big data needs, visit our Big Data details page: http://aws.amazon.com/big-data/.

  • *Ben: Slide Notes

    Loosely Coupled
    Embarrassingly parallel -
    Elastic -
    Batch workloads –

    Tightly Coupled
    Interconnected jobs – MPI description here
    Network sensitivity – enhanced networking, high throughput, low latency, predictable performance
    Job specific algorithms –
  • Supporting Services
    Data management
    Task distribution
    Workflow management
  • Feature – Details
    Flexible – Run Windows or Linux distributions
    Scalable – Wide range of instance types, from micro to cluster compute
    Machine Images – Configurations can be saved as Amazon Machine Images (AMIs) from which new instances can be created
    Full control – Full root or administrator rights
    Secure – Full firewall control via Security Groups
    Monitoring – Publishes metrics to Amazon CloudWatch
    Inexpensive – On-Demand, Reserved, and Spot pricing options
    VM Import/Export – Import and export VM images to transfer configurations in and out of EC2
  • * DESIGN/DEV:
    Graphic to signify console and CLI – using code to provision thousands of instances, instead of provisioning datacenters


    Example CLI Line
    ec2-run-instances ami-b232d0db
    --instance-count 3000
    --availability-zone eu-west-1a
    --instance-type m1.xlarge
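  • Illustrative note (not part of the original deck): a rough boto3 equivalent of the CLI line above. boto3, the client setup, and the region choice are assumptions; the AMI ID, count, and instance type are simply the values reused from the example.
    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")

    # Launch up to 3000 m1.xlarge instances from the example AMI in eu-west-1a.
    response = ec2.run_instances(
        ImageId="ami-b232d0db",   # AMI ID reused from the CLI example above
        MinCount=1,               # start even if the full count is not immediately available
        MaxCount=3000,
        InstanceType="m1.xlarge",
        Placement={"AvailabilityZone": "eu-west-1a"},
    )
    print(len(response["Instances"]), "instances launching")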
  • Auto Scaling is the tool that allows just-in-time provisioning of compute resources based on the policies and events that you specify
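  • Illustrative sketch (not in the original deck) of that idea with boto3; the group, launch configuration, AMI, and policy names are hypothetical.
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

    # Hypothetical worker-node group that can scale between 0 and 1000 instances.
    autoscaling.create_launch_configuration(
        LaunchConfigurationName="hpc-worker-lc",
        ImageId="ami-b232d0db",          # illustrative AMI ID
        InstanceType="c3.8xlarge",
    )
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="hpc-workers",
        LaunchConfigurationName="hpc-worker-lc",
        MinSize=0,
        MaxSize=1000,
        AvailabilityZones=["eu-west-1a"],
    )
    # Simple scaling policy: add 100 instances whenever its associated alarm fires.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="hpc-workers",
        PolicyName="scale-out-on-demand",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=100,
    )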
  • This family includes the C1, CC2, and C3 instance types, and is optimized for applications that benefit from high compute power. Compute-optimized instances have a higher ratio of vCPUs to memory than other families, and the lowest cost per vCPU among all Amazon EC2 instance types. We recommend compute-optimized instances for running compute-bound scale out applications. Examples of such applications include high performance front-end fleets and web-servers, on-demand batch processing, distributed analytics, and high performance science and engineering applications.
    C3 instances are the latest generation of compute-optimized instances, providing customers with the highest performing processors and the lowest price/compute performance available in EC2 currently.

    Each virtual CPU (vCPU) on C3 instances is a hardware hyper-thread from a 2.8 GHz Intel Xeon E5-2680v2 (Ivy Bridge) processor allowing you to benefit from the latest generation of Intel processors.

    C3 instances support Enhanced Networking, which delivers improved inter-instance latencies, lower network jitter, and significantly higher packet per second (PPS) performance. This makes it possible to create high performance clusters of C3 instances that deliver hundreds of teraflops of floating point performance (a placement group sketch follows this note).

    Compared to C1 instances, C3 instances provide faster processors, approximately double the memory per vCPU, support for clustering and SSD-based instance storage.

    CC2 instances feature 2.6GHz Intel Xeon E5-2670 processors, high core count (32 vCPUs) and support for cluster networking. CC2 instances are optimized specifically for HPC applications that benefit from high bandwidth, low latency networking and high compute capacity. 
    Popular use cases: High-traffic web applications, ad serving, batch processing, video encoding, distributed analytics, high-energy physics, genome analysis, and computational fluid dynamics.
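  • Illustrative sketch (not in the original deck): launching C3 instances into a cluster placement group with boto3 to get the low latency, high bandwidth networking described above; the placement group name and AMI ID are hypothetical.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Cluster placement groups co-locate instances for low latency, high bandwidth networking.
    ec2.create_placement_group(GroupName="hpc-cluster-pg", Strategy="cluster")

    ec2.run_instances(
        ImageId="ami-b232d0db",      # illustrative AMI ID
        MinCount=32,
        MaxCount=32,
        InstanceType="c3.8xlarge",
        Placement={"GroupName": "hpc-cluster-pg"},
    )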
  • Nvidia GPUs (cg1.4xlarge)
    Intel Nehalem (cc1.4xlarge)
    2TB of SSD 120,000 IOPS (hi1.4xlarge)
    Intel Sandy Bridge E5-2670 (cc2.8xlarge)
    Sandy Bridge, NUMA, 240GB RAM (cr1.8xlarge)
    48 TB of ephemeral storage (hs1.8xlarge)


    2.6 GHz Sandy Bridge CPU w/ Turbo enabled
    1 NVIDIA GK104 GPU (Kepler)
    8 vCPUs, 15 GiB of RAM
    60GB SSD storage

    Ideally suited for remote desktop and 3D
    Supports DirectX, OpenGL, CUDA, OpenCL
    Wide range of platform partners including Citrix, Otoy, NICE Software


  • http://aws.amazon.com/solutions/case-studies/national-taiwan-university/

    About National Taiwan University
    Fast Crypto Lab is a research group within National Taiwan University, in Taiwan. The group’s research activities focus on the design and analysis of efficient algorithms to solve important mathematical problems, as well as the development and implementation of these algorithms on massively parallel computers.
    Why Amazon Web Services
    Prior to signing on with Amazon Web Services (AWS), the group used a private cloud and ran Hadoop on their own machines. Prof. Chen-Mou Cheng, the Principal Investigator of Fast Crypto Lab, explains why the research group made the switch to AWS: “It is quite easy to get started with AWS with its clear and flexible interface. Amazon Elastic Compute Cloud (Amazon EC2) provides a common measure of cost across problems of a different nature. For problems that are the same or similar, Amazon EC2 can also be used as a metric for comparing alternative or competing algorithms and their implementations.”
    Chen-Mou adds, “When using Amazon EC2 as a metric, the parallelizability of the algorithm or the parallelization of the implementation is explicitly taken into account, as opposed to being assumed or unspecified. The Amazon EC2 metric is thus practical and easy to use.”
    The group now uses Hadoop Streaming in their architecture, and runs their programs with Amazon Elastic MapReduce (Amazon EMR) and Cluster GPU Instances for Amazon EC2.
    “Our purpose is to break the record of solving the shortest vector problem (SVP) in Euclidean lattices," Chen-Mou says. "The problem plays an important role in the field of information science. We estimated that we would need 1,000 cg1.4xlarge instance-hours. We ended up using 50 cg1.4xlarge instances for about 10 hours to solve our problem. Now, the vectors we found are considered the hardest SVP anyone has solved so far. We only spent $2,300 for using the 100 Tesla M2050 for 10 hours, which is quite a good deal.”
    The Benefits
    Since switching to AWS, the group indicates that the machine maintenance costs have been reduced, and they have experienced more stable and scalable computational power. The group’s favorite component of AWS is Amazon CloudWatch, which it uses to watch computer utilities while also improving their program.
    Looking into the future, Chen-Mou says, “We want to increase our GPU Cluster quota and solve a higher dimension SVP. We are also considering renting an AWS machine for setting up an SVN server.”
  • Customer reference here?
  • Speaker Notes
  • SEC’s Market Information Data Analytics System (MIDAS) provided by AWS partner Tradeworx
    Powerful AWS-based system for big data market analytics
    2M transaction messages per second; 20B records, 1TB of data per day
    From RFP to contract award to production in ~12 months
    See http://sec.gov/marketstructure
    “For the growing team of quant types now employed at the SEC, MIDAS is becoming the world’s greatest data sandbox. And the staff is planning to use it to make the SEC a leader in its use of market data”
    – Elisse B. Walter, Chairman of the SEC

    “This basically propels the SEC from zero to 60 in one fell swoop, going from being way behind even the most basic market participant to being on par if not ahead of the vast majority of market participants, in terms of their system and analytical capabilities.”
    – Gregg E. Berman, Associate Director, Office of Analytics and Research, SEC

  • Flip diagram
  • Task Clusters is HPC specific terminology
  • http://media.amazonwebservices.com/architecturecenter/AWS_ac_ra_loganalysis_11.pdf

    To upload large data sets into AWS, it is critical to make
    the most of the available bandwidth. You can do so by
    uploading data into Amazon Simple Storage Service (S3) in
    parallel from multiple clients, each using multithreading to
    enable concurrent uploads or multipart uploads for further
    parallelization. TCP settings like window scaling and selective
    acknowledgement can be adjusted to further enhance
    throughput. With the proper optimizations, uploads of several
    terabytes a day are possible (a small parallel-upload sketch
    follows these notes). Another alternative for huge data sets
    might be AWS Import/Export, which supports sending storage
    devices to AWS and inserting their contents directly into
    Amazon S3 or Amazon EBS volumes.

    Parallel processing of large-scale jobs is critical, and
    existing parallel applications can typically be run on
    multiple Amazon Elastic Compute Cloud (EC2) instances.
    A parallel application may sometimes assume large scratch
    areas that all nodes can efficiently read and write from. S3
    can be used as such a scratch area, either directly using
    HTTP or using a FUSE layer (for example, s3fs or SubCloud)
    if the application expects a POSIX-style file system.

    Once the job has completed and the result data is
    stored in Amazon S3, Amazon EC2 instances can be
    shut down, and the result data set can be downloaded. The
    output data can be shared with others, either by granting read
    permissions to select users or to everyone, or by using
    time-limited URLs.

    Instead of using Amazon S3, you can use Amazon
    EBS to stage the input set, act as a temporary storage
    area, and/or capture the output set. During the upload, the
    concepts of parallel upload streams and TCP tweaking also
    apply. In addition, uploads that use UDP may increase speed
    further. The result data set can be written into EBS volumes,
    at which time snapshots of the volumes can be taken for
    sharing.
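  • Illustrative parallel-upload sketch (not part of the reference architecture PDF) using boto3 managed transfers; the bucket, key, and file names are hypothetical.
    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")

    # Split large objects into 64 MB parts and upload up to 10 parts concurrently.
    config = TransferConfig(
        multipart_threshold=64 * 1024 * 1024,
        multipart_chunksize=64 * 1024 * 1024,
        max_concurrency=10,
    )
    s3.upload_file("dataset-001.tar", "example-hpc-input-bucket",
                   "input/dataset-001.tar", Config=config)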
  • http://media.amazonwebservices.com/architecturecenter/AWS_ac_ra_financialgrid_12.pdf

    Data sources for market, trade, and counterparties are
    installed on startup from on-premises data sources, or
    from Amazon Simple Storage Service (Amazon S3).

    AWS Direct Connect can be used to establish a low
    latency and reliable connection between the corporate
    data center site and AWS, in 1 Gbps or 10 Gbps increments. For
    situations with lower bandwidth requirements, a VPN
    connection to the VPC Gateway can be established.

    Private subnetworks are specifically created for
    customer source data, compute grid clients, and the
    grid controller and engines.

    Application and corporate data can be securely stored
    in the cloud using the Amazon Relational Database
    Service (Amazon RDS).

    Grid controllers and grid engines are running Amazon
    Elastic Compute Cloud (Amazon EC2) instances
    started on demand from Amazon Machine Images (AMIs)
    that contain the operating system and grid software.

    Static data such as holiday calendars and QA libraries
    and additional gridlib bootstrapping data can be
    downloaded on startup by grid engines from Amazon S3.

    Grid engine results can be stored in Amazon
    DynamoDB, a fully managed database providing
    configurable read and write throughput, allowing scalability on
    demand.

    Results in Amazon DynamoDB are aggregated using
    a map/reduce job in Amazon Elastic MapReduce
    (Amazon EMR) and final output is stored in Amazon S3.

    The compute grid client collects aggregate results from
    Amazon S3.

    Aggregate results can be archived using Amazon
    Glacier, a low-cost, secure, and durable storage service.
    (A short DynamoDB sketch of storing grid results follows these notes.)
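  • Illustrative sketch (not part of the reference architecture PDF) of a grid engine writing a result item to DynamoDB with boto3; the region, table name, and attribute names are hypothetical.
    import boto3

    dynamodb = boto3.resource("dynamodb", region_name="us-east-1")  # region is an assumption
    results_table = dynamodb.Table("grid-results")                  # hypothetical table name

    # Each grid engine writes one item per completed valuation task.
    results_table.put_item(
        Item={
            "task_id": "trade-000123",       # hypothetical partition key
            "engine_id": "i-0abc1234",
            "value_at_risk": "152340.25",    # stored as a string to avoid float precision issues
            "completed_at": "2014-07-26T12:00:00Z",
        }
    )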
  • http://aws.amazon.com/solutions/case-studies/bankinter/

    About Bankinter
    Bankinter was founded in June, 1965 as a Spanish industrial bank through a joint venture by Banco de Santander and Bank of America. It is currently listed among the top ten banks in Spain. Bankinter has provided online banking services since 1996, when they pioneered the offering of real-time stock market operations. More than 60% of Bankinter transactions are performed through remote channels; 46% of those transactions are through the Internet. Today Bankinter.com and Bankinter brokerage services continue to lead the European banking industry in online financial operations.

    The Challenge
    Bankinter uses Amazon Web Services (AWS) as an integral part of their credit-risk simulation application, developing complex algorithms to simulate diverse scenarios in order to evaluate the financial health of Bankinter clients. "This requires high computational power,” says Bankinter Director of Technological Innovation, Javier Roldán. “We need to perform at least 5,000,000 simulations to get realistic results.”
    Why Amazon Web Services
    Bankinter uses the flexibility and power of Amazon Elastic Compute Cloud (Amazon EC2) to perform these simulations, subdividing processes through a grid of Amazon EC2 instances and implementing simulations in parallel on several Amazon EC2 instances to obtain the result in a very effective time period.
    Bankinter used Java to develop their application and the Amazon Software Development Kit (SDK) to automate the provisioning process of AWS elements. Through the use of AWS, Bankinter decreased the average time-to-solution from 23 hours to 20 minutes and dramatically reduced processing, with the ability to reduce even further when required. Amazon EC2 also allowed Bankinter to adapt from a big batch process to a parallel paradigm, which was not previously possible. Costs were also dramatically reduced with this cloud-based approach.
    The Benefits
    Bankinter plans to expand their use of AWS for future applications and business units. “The AWS platform, with its unlimited and flexible computational power, is a good fit for our risk-simulation process requirements,” says Roldán. “With AWS, we now have the power to decide how fast we want to obtain simulation results. More important, we have the ability to run simulations that were not possible before due to the large amount of infrastructure required.”
  • http://media.amazonwebservices.com/architecturecenter/AWS_ac_ra_batch_03.pdf


    Users interact with the Job Manager application, which
    is deployed on an Amazon Elastic Compute Cloud
    (EC2) instance. This component controls the process of
    accepting, scheduling, starting, managing, and completing
    batch jobs. It also provides access to the final results, job and
    worker statistics, and job progress information.
    Raw job data is uploaded to Amazon Simple Storage
    Service (S3), a highly-available and persistent data
    store.

    Individual job tasks are inserted by the Job Manager in
    an Amazon Simple Queue Service (SQS) input
    queue on the user’s behalf.

    Worker nodes are Amazon EC2 instances deployed
    on an Auto Scaling group. This group is a container
    that ensures health and scalability of worker nodes. Worker
    nodes pick up job parts from the input queue automatically
    and perform single tasks that are part of the list of batch
    processing steps.

    Interim results from worker nodes are stored in
    Amazon S3.

    Progress information and statistics are stored on the
    analytics store. This component can be either an
    Amazon DynamoDB table or a relational database such as
    an Amazon Relational Database Service (RDS) instance.
    Optionally, completed tasks can be inserted in an
    Amazon SQS queue for chaining to a second
    processing stage. (A short SQS sketch follows these notes.)
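  • Illustrative sketch (not part of the reference architecture PDF) of the job manager enqueueing tasks and a worker node polling the input queue with boto3; the queue name and message body are hypothetical.
    import boto3

    sqs = boto3.resource("sqs", region_name="us-east-1")  # region is an assumption
    input_queue = sqs.get_queue_by_name(QueueName="batch-input-queue")  # hypothetical name

    # Job manager: enqueue one message per job task.
    input_queue.send_message(MessageBody='{"job_id": "job-42", "task": 7}')

    # Worker node: long-poll for tasks, process them, then delete them from the queue.
    for message in input_queue.receive_messages(MaxNumberOfMessages=10, WaitTimeSeconds=20):
        print("processing", message.body)
        message.delete()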
  • Engage Sales and Solutions Architects

    You have long queues to use your cluster
    Your jobs are varied enough that you spend more time optimizing for the cluster you have
    You are benchmarking your cluster
    Find out how your HPC workloads can run on AWS
    We can find the right partner to help manage your HPC workload for you
  • A centralized repository of public datasets
    Seamless integration with cloud based applications
    No charge to the community
    Some of the datasets available today:
    1000 Genomes Project
    Ensembl
    GenBank
    Illumina – Jay Flatley Human Genome Dataset
    YRI Trio Dataset
    The Cannabis Sativa Genome
    UniGene
    Influenza Virus
    PubChem
  • Better graphics needed for these groups
  • These can be their own slide
    Stick these in an appendix




    Analytics – log scanning and simulations
    Financial Modeling and Analysis – wealth management simulations, value at risk, counterparty value analytics
    Image and Media encoding – render and encode media assets
    Testing – load, integration, canary, and security testing
    Process big data – Twitter feeds, genomes, trend analysis
    Geospatial analysis – rendering
    Scientific computing – simulations, drug discovery
    Web crawling
    Particle Physics Simulations
    Artificial Intelligence Research
    Drug Discovery
    Scientific Collaboration and Centralized Data management
    GPU – Use Cases
    Molecular Dynamics
    Seismic Analysis
    Genome Assembly and Alignment
    Visualize molecules
    Simulation
    Floating point applications
    Page ranking
    Video transcoding

  • http://www.foetron.com/case-studies/amazon/cyclopic-energy/
    http://aws.amazon.com/solutions/case-studies/cyclopic-energy/
  • http://aws.amazon.com/solutions/case-studies/mentor-graphics/

    Mentor Graphics is a global supplier of EDA and mechanical CAD tools
    ASIC design and simulation, mask verification
    Board-level design products
    EM and CFD solvers
    Business case: enable cloud-based deployments for product evaluation and training. Create a virtual lab environment with IP security and a responsive user experience.
    Solution: Mentor Virtual Labs, built on AWS
  • Time accurate fluid dynamics
    SBIR-funded project for the US Air Force Research Laboratory (AFRL)
    SAS 70 Type II certification and VPN-level access required
    Additional security measures:
    Uploaded and downloaded data was encrypted
    Dedicated EC2 cluster instances were provisioned
    Data was purged upon completion of the run

    “The results of this case were impressive. Using Amazon EC2 the large-scale, time accurate simulation was turned around in just 72 hours with computing infrastructure costs well below $1,000.” http://aws.amazon.com/solutions/case-studies/aerodynamic-solutions/
  • HGST application roadmap:
    Collaboration tools
    Molecular dynamics
    CAD, CFD, EDA
    Big data for manufacturing yield analysis
  • http://aws.amazon.com/solutions/case-studies/pfizer/

    About Pfizer
    Pfizer, Inc. applies science and global resources to improve health and well-being at every stage of life. The company strives to set the standard for quality, safety, and value in the discovery, development, and manufacturing of medicines for people and animals.
    The Challenge
    Pfizer’s high performance computing (HPC) software and systems for worldwide research and development (WRD) support large-scale data analysis, research projects, clinical analytics, and modeling. Pfizer’s HPC services are used across the spectrum of WRD efforts, from the deep biological understanding of disease, to the design of safe, efficacious therapeutic agents.
    Why Amazon Web Services
    Dr. Michael Miller, Head of HPC for R&D at Pfizer explains why Pfizer initially considered using Amazon Web Services (AWS) to handle its peak computing needs: “The Amazon Virtual Private Cloud (Amazon VPC) was a unique option that offered an additional level of security and an ability to integrate with other aspects of our infrastructure.”
    Pfizer has now set up an instance of the Amazon VPC to provide a secure environment with which to carry out computations for WRD. They say, “We accomplished this by customizing the ‘job scheduler’ in our HPC environment to recognize VPC workload, and start and stop instances as needed to address the workflow. Research can be unpredictable, especially as the on-going science raises new questions.” The VPC has enabled Pfizer to respond to these challenges by providing the means to compute beyond the capacity of the dedicated HPC systems, which provides answers in a timely manner.
    Pfizer’s solution was written in C and is based on the Amazon Elastic Compute Cloud (Amazon EC2) command line tools. They say, “We are currently migrating this solution over to a commercial API that will enable additional provisioning and usage tracking capabilities.”
    The Benefits
    The primary cost savings has been in cost avoidance: “Pfizer did not have to invest in additional hardware and software, which is only used during peak loads; that savings allowed for investments in other WRD activities.”
    For Pfizer, AWS is a fit-for-purpose solution. Dr. Miller explains, “It is not a replacement for, but rather an addition to, our capabilities for HPC WRD activities, providing a unique augmentation to our computing capabilities.”
    Overall, “AWS enables Pfizer’s WRD to explore specific difficult or deep scientific questions in a timely, scalable manner and helps Pfizer make better decisions more quickly.”
    Looking ahead, Pfizer is interested in exploring Amazon Simple Storage Service (Amazon S3) for storing reference data to expand the type of computational problems they can address.

Transcript

  • 1. Ben Butler Sr. Mgr., Big Data & HPC Marketing Amazon Web Services butlerb@amazon.com @bensbutler aws.amazon.com/hpc High Performance Computing on AWS
  • 2. Take a typical big computation task… Big Job
  • 3. …that an average cluster is too small (or simply takes too long to complete)… Big Job Cluster
  • 4. …optimization of algorithms can give some leverage… Big Job Cluster
  • 5. …and complete the task in hand… Big Job Cluster
  • 6. Applying a large cluster… Big Job Cluster
  • 7. …can sometimes be overkill and too expensive Big Job Cluster
  • 8. AWS instance clusters can be balanced to the job in hand… Big Job
  • 9. …nor too large… Small Job
  • 10. …nor too small… Bigger Job
  • 11. …with multiple clusters running at the same time
  • 12. Why AWS for HPC? Low cost with flexible pricing Efficient clusters Unlimited infrastructure Faster time to results Concurrent Clusters on-demand Increased collaboration
  • 13. Customers running HPC workloads on AWS
  • 14. Popular HPC workloads on AWS Transcoding and Encoding Monte Carlo Simulations Computational Chemistry Government and Educational Research Modeling and Simulation Genome processing
  • 15. Academic research Biopharma Utilities Manufacturing Materials Design Auto & Aerospace Across several key industry verticals
  • 16. Jun 2014 Top 500 list 484.2 TFlop/s 26,496 cores in a cluster of EC2 C3 instances Intel Xeon E5-2680v2 10C 2.800GHz processors LinPack Benchmark TOP500: 76th fastest supercomputer on-demand
  • 17. Elastic Cloud-Based Resources Actual demand Resources scaled to demand Benefits of Agility Waste Customer Dissatisfaction Actual Demand Predicted Demand Rigid On-Premises Resources
  • 18. Unilever: augmenting existing HPC capacity The key advantage that AWS has over running this workflow on Unilever’s existing cluster is the ability to scale up to a much larger number of parallel compute nodes on demand. Pete Keeley Unilever Research’s eScience IT Lead for Cloud Solutions ” “ • Unilever’s digital data program now processes genetic sequences twenty times faster
  • 19. Scalability on AWS Time:+00h Scale using Elastic Capacity <10 cores
  • 20. Scalability on AWS Time: +24h Scale using Elastic Capacity >1500 cores
  • 21. Scalability on AWS Time:+72h Scale using Elastic Capacity <10 cores
  • 22. Time: +120h Scale using Elastic Capacity >600 cores Scalability on AWS
  • 23. Schrodinger & CycleComputing: computational chemistry Simulation by Mark Thompson of the University of Southern California to see which of 205,000 organic compounds could be used for photovoltaic cells for solar panel material. Estimated computation time 264 years completed in 18 hours. • 156,314 core cluster across 8 regions • 1.21 petaFLOPS (Rpeak) • $33,000 or 16¢ per molecule
  • 24. Cost Benefits of HPC in the Cloud Pay As You Go Model Use only what you need Multiple pricing models On-Premises Capital Expense Model High upfront capital cost High cost of ongoing support
  • 25. Reserved Make a low, one-time payment and receive a significant discount on the hourly charge For committed utilization Free Tier Get Started on AWS with free usage & no commitment For POCs and getting started On-Demand Pay for compute capacity by the hour with no long-term commitments For spiky workloads, or to define needs Spot Bid for unused capacity, charged at a Spot Price which fluctuates based on supply and demand For time-insensitive or transient workloads Dedicated Launch instances within Amazon VPC that run on hardware dedicated to a single customer For highly sensitive or compliance related workloads Many pricing models to support different workloads
  • 26. 0 2 4 6 8 10 12 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 Heavy Utilization Reserved Instances Light RI Light RILight RILight RI On-Demand Spot and On- Demand 100% 80% 60% 40% 20% Utilization Over Time Optimize Cost by using various EC2 instance pricing models
  • 27. Harvard Medical School: simulation development The combination of our approach to biomedical computing and AWS allowed us to focus our time and energy on simulation development, rather than technology, to get results quickly. Without the benefits of AWS, we certainly would not be as far along as we are. Dr. Peter Tonellato, LPM, Center for Biomedical Informatics, Harvard Medical School ” “ • Leveraged EC2 spot instances in workflows • 1 day worth of effort resulted in 50% in cost savings
  • 28. Characterizing HPC Embarrassingly parallel Elastic Batch workloads Interconnected jobs Network sensitivity Job specific algorithms Loosely Coupled Tightly Coupled
  • 29. Characterizing HPC Embarrassingly parallel Elastic Batch workloads Interconnected jobs Network sensitivity Job specific algorithms Data management Task distribution Workflow management Loosely Coupled Supporting Services Tightly Coupled
  • 30. Characterizing HPC Embarrassingly parallel Elastic Batch workloads Interconnected jobs Network sensitivity Job specific algorithms Data management Task distribution Workflow management Loosely Coupled Supporting Services Tightly Coupled
  • 31. Elastic Compute Cloud (EC2) c3.8xlarge g2.medium m3.large Basic unit of compute capacity,virtual machines Range of CPU, memory & local disk options Choice of instance types, from micro to cluster compute Compute Services
  • 32. CLI, API and Console Scripted configurations Automation & Control
  • 33. Auto Scaling Automatic re-sizing of compute clusters based upon demand and policies
  • 34. Characterizing HPC Embarrassingly parallel Elastic Batch workloads Interconnected jobs Network sensitivity Job specific algorithms Data management Task distribution Workflow management Loosely Coupled Supporting Services Tightly Coupled
  • 35. What if you need to: Implement MPI? Code for GPUs?
  • 36. Cluster compute instances Implement HVM process execution Intel® Xeon® processors 10 Gigabit Ethernet – C3 has Enhanced Networking, SR-IOV cc2.8xlarge 32 vCPUs 2.6 GHz Intel Xeon E5-2670 Sandy Bridge 60.5 GB RAM 2 x 320 GB Local SSD c3.8xlarge 32 vCPUs 2.8 GHz Intel Xeon E5-2680v2 Ivy Bridge 60GB RAM 2 x 320 GB Local SSD Tightly coupled
  • 37. Network placement groups Cluster instances deployed in a Placement Group enjoy low latency, full bisection 10 Gbps bandwidth 10Gbps Tightly coupled
  • 38. GPU compute instances cg1.4xlarge 33.5 EC2 Compute Units 20GB RAM 2x NVIDIA GPU 448 Cores 3GB Mem g2.2xlarge 26 EC2 Compute Units 16GB RAM 1x NVIDIA GPU 1536 Cores 4GB Mem G2 instances Intel® Xeon® E5-2670 1 NVIDIA Kepler GK104 GPU I/O Performance: Very High (10 Gigabit Ethernet) CG1 instances Intel® Xeon® X5570 processors 2 x NVIDIA Tesla “Fermi” M2050 GPUs I/O Performance: Very High (10 Gigabit Ethernet) Tightly coupled
  • 39. National Taiwan University: shortest vector problem Our purpose is to break the record of solving the shortest vector problem (SVP) in Euclidean lattices…the vectors we found are considered the hardest SVP anyone has solved so far. Prof. Chen-Mou Cheng Principal Investigator of Fast Crypto Lab ” “ • $2,300 for using 100x Tesla M2050 for ten hours
  • 40. CUDA & OpenCL Massive parallel clusters running in GPUs NVIDIA Tesla cards in specialized instance types Tightly coupled
  • 41. Embarrassingly parallel Elastic Batch workloads Interconnected jobs Network sensitivity Job specific algorithms Data management Task distribution Workflow management Characterizing HPC Loosely Coupled Supporting Services Tightly Coupled
  • 42. Data management Fully-managed SQL, NoSQL, and object storage Relational Database Service Fully-managed database (MySQL, Oracle, MSSQL, PostgreSQL) DynamoDB NoSQL, Schemaless, Provisioned throughput database S3 Object datastore up to 5TB per object Internet accessibility Supporting Services
  • 43. “Big Data” changes dynamic of computation and data sharing Collection Direct Connect Import/Export S3 DynamoDB Computation EC2 GPUs Elastic MapReduce Collaboration CloudFormation Simple Workflow S3 Moving compute closer to the data
  • 44. Tradeworx: Market Information Data Analytics System For the growing team of quant types now employed at the SEC, MIDAS is becoming the world’s greatest data sandbox. And the staff is planning to use it to make the SEC a leader in its use of market data Elisse B. Walter, Chairman of the SEC Tradeworx ” “ • Powerful AWS-based system for market analytics • 2M transaction messages/sec; 20B records and 1TB/day
  • 45. Feeding workloads Using highly available Simple Queue Service to feed EC2 nodes Processing task/processing trigger Amazon SQS Processing results Supporting Services
  • 46. Coordinating workloads & task clusters Handle long running processes across many nodes and task steps with Simple Workflow Task A Task B (Auto-scaling) Task C Supporting Services 3 1 2
  • 47. Architecture of large scale computing and huge data sets
  • 48. NYU School of Medicine: Transferring large data sets Transferring data is a large bottleneck; our datasets are extremely large, and it often takes more time to move the data than to generate it. Since our collaborators are all over the world, if we can’t move it they can’t use it. ” “ • Uses Globus Online • Data transfer speeds of up to 50MB/s Dr. Stratos Efstathiadis Technical Director of the HPC facility, NYU
  • 49. Architecture of a financial services grid computing
  • 50. Bankinter: credit-risk simulation With AWS, we now have the power to decide how fast we want to obtain simulation results. More important, we have the ability to run simulations that were not possible before due to the large amount of infrastructure required. Javier Roldán Director of Technological Innovation, Bankinter ” “ • Reduced processing time of 5,000,000 simulations from 23 hours to 20 minutes
  • 51. Architecture of queue-based batch processing
  • 52. When to consider running HPC workloads on AWS New ideas New HPC project Proof of concept New application features Training Benchmarking algorithms Remove the queue Hardware refresh cycle Reduce costs Collaboration of results Increase innovation speed Reduce time to results Improvement
  • 53. AWS Grants Program aws.amazon.com/grants
  • 54. AWS Public Data Sets aws.amazon.com/datasets free for everyone
  • 55. AWS Marketplace – HPC category aws.amazon.com/marketplace
  • 56. Getting Started with HPC on AWS aws.amazon.com/hpc contact us, we are here to help Sales and Solutions Architects Enterprise Support Trusted Advisor Professional Services
  • 57. cfncluster (“CloudFormation cluster”) Command Line Interface Tool Deploy and demo an HPC cluster For more info: aws.amazon.com/hpc/resources Try out our HPC CloudFormation-based demo
  • 58. HPC Partners and Apps
  • 59. Oil and Gas Seismic Data Processing Reservoir Simulations, Modeling Geospatial applications Predictive Maintenance Manufacturing & Engineering Computational Fluid Dynamics (CFD) Finite Element Analysis (FEA) Wind Simulation Life Sciences Genome Analysis Molecular Modeling Protein Docking Media & Entertainment Transcoding and Encoding DRM, Encryption Rendering Scientific Computing Computational Chemistry High Energy Physics Stochastic Modeling Quantum Analysis Climate Models Financial Monte Carlo Simulations Wealth Management Simulations Portfolio, Credit Risk Analytics High Frequency Trading Analytics Customers are using AWS for more and more HPC workloads
  • 60. Accelerating materials science 2.3M core-hours for $33K So What Does Scale Mean on AWS?
  • 61. Cyclopic energy: computational fluid dynamics AWS makes it possible for us to deliver state-of-the-art technologies to clients within timeframes that allow us to be dynamic, without having to make large investments in physical hardware. Rick Morgans Technical Director (CTO), cyclopic energy ” “ • Two months worth of simulations finished in two days
  • 62. Mentor Graphics: virtual lab for design and simulation Thanks to AWS, the Mentor Graphics customer experience is now fast, fluid, and simple. Ron Fuller Senior Director of Engineering, Mentor ” “ • Developed a virtual lab for ASIC design and simulation for product evaluation and training
  • 63. AeroDynamic Solutions: turbine engine simulation We’re delighted to be working closely with the U.S. Air Force and AWS to make time accurate simulation a reality for designers large and small. George Fan CEO, AeroDynamic Solutions ” “ • Time accurate simulation was turned around in 72 hours with infrastructure costs well below $1,000
  • 64. HGST: molecular dynamics simulation HGST is using AWS for a higher performance, lower cost, faster deployed solution vs buying a huge on-site cluster. Steve Philpott CIO, HGST ” “ • Uses HPC on AWS for CAD, CFD, and CDA
  • 65. Pfizer: large-scale data analytics and modeling AWS enables Pfizer’s Worldwide Research and Development to explore specific difficult or deep scientific questions in a timely, scalable manner and helps Pfizer make better decisions more quickly. Dr. Michael Miller Head of HPC for R&D, Pfizer ” “ • Pfizer avoided having to procure new HPC hardware by being able to use AWS for peak workloads.
  • 66. Thank you! Please visit aws.amazon.com/hpc for more info