Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Invent 2013
 


"Running high-performance scientific and engineering applications is challenging no matter where you do it. Join IT executives from Hitachi Global Storage Technology, The Aerospace Corporation, Novartis, and Cycle Computing and learn how they have used the AWS cloud to deploy mission-critical HPC workloads.
Cycle Computing leads the session on how organizations of any scale can run HPC workloads on AWS. Hitachi Global Storage Technology discusses experiences using the cloud to create next-generation hard drives. The Aerospace Corporation provides perspectives on running MPI and other simulations, and offer insights into considerations like security while running rocket science on the cloud. Novartis Institutes for Biomedical Research talks about a scientific computing environment to do performance benchmark workloads and large HPC clusters, including a 30,000-core environment for research in the fight against cancer, using the Cancer Genome Atlas (TCGA)."



Presentation Transcript

  • Real-world Cloud HPC at Scale, for Production Workloads Jason A Stowe, Cycle Computing November 15, 2013 © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  • We believe that utility access to HPC accelerates invention
  • Goals for today • See real-world use cases from 3 leading engineering and scientific computing users – Steve Phillpott, CIO, HGST, A Western Digital Company – Bill E. Williams, Director, The Aerospace Corporation – Michael Steeves, Sr. Systems Engineer, Novartis • Understand the motivations, strategies, and lessons learned in running HPC / Big Data workloads in the cloud • See the varying scales and application types that run well, including a 1.21 PetaFLOPS environment
  • Agenda • Introduction • Steve Phillpott – Journey into the Cloud • Bill Williams – Cloud Computing @ Aerospace • Michael Steeves – Accelerating Science • Spot, On-demand, & Other Production Uses • Questions and Answers
  • Journey to the Cloud Steve Phillpott CIO HGST, a Western Digital Company © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  • Cloud & Datacenter / Performance Enterprise / Capacity Enterprise. Founded in 2003 through the combination of the hard drive businesses of IBM, the inventor of the hard drive, and Hitachi, Ltd (+3 acquisitions). Acquired by Western Digital in 2012. More than 4,200 active worldwide patents. Headquartered in San Jose, California; approximately 41,000 employees worldwide. Develops innovative, advanced hard disk drives, enterprise-class solid state drives, external storage solutions and services; delivers intelligent storage devices that tightly integrate hardware and software to maximize solution performance. Product sidebar – Performance Enterprise: PCIe Enterprise SSD, SAS 10K & 15K HDDs; Capacity Enterprise: Ultrastar® 7200 RPM & CoolSpin HDDs, Ultrastar® & MegaScale DC™.
  • Zero to Cloud in 6+ Months, by 31 Oct 2013: Cloud eMail – Microsoft Office 365 (April 2013); cloud eMail archiving/eDiscovery; external single sign-on (off VPN); cloud file/collaboration – Box; cloud CRM – Salesforce.com, integrated to save files in Box; cloud High Performance Computing (HPC) on Amazon AWS; cloud Big Data platform on Amazon AWS.
  • Responding to the Changing Business Model. Where is our business model headed? Using "The New Age of Innovation" as a guide: N=1 (focus on the individual customer experience) and R=G (resources are global). Implications: an increase in strategic partnering; the need for a high level of flexibility; leveraging external expertise. Use of the cloud/SaaS aligns with the virtual business model: a variable cost model is critically important; lightweight, scalable services; reduced up-front capital spend; accelerated provisioning; pay as you go.
  • Paradigm Shift: Consumerization of IT – "I have better technology at home." The consumer web is a new paradigm in ease of use and reduced cost, driven by a series of platforms that are household brand names today. When we use these platforms, it continually amazes us how easily and consistently they work. A new set of services: DRM to iTunes. Yet our workplace applications are cumbersome, costly, difficult to navigate, and require extensive support. (Workday, 2009)
  • The Big Switch – The Box Has Disappeared: the transformation of computing as we know it. Physical to virtual/digital move – do you really care which computer processed your last Google search? Efficiency – do not waste a CPU cycle or a byte of memory; today we build a 4-story building and use only the 1st floor. Utility – IT as a service: plug it in and get it; where the electricity industry has gone, computing is following, and the shift is almost invisible to the end user. DATA is the value to the organization, not the "where".
  • Enabling the Virtual Organization – reframing IT away from thinking of "The App": business intelligence and analytics; end-to-end business processes; enterprise data management; new computing platforms; strategic outsourcing; Software as a Service (SaaS). New IT organizational structures support and align to the "new business model".
  • Creating an Innovation Playground: Where to Start and How to Evolve. IT supports business strategy; executive buy-in – CEO, CIO, InfoSec, etc.; reduce cap-ex and optimize DC usage; build expertise; implement outcome-defined knowledge. Learn/Educate (Awareness): team involvement, conferences, vendor briefings, expert services, best practices. Play/Experiment (Understanding): team approach, hands-on approach, understand the value proposition, understand constraints. Migrate (Transition): migrate dev/test environments, migrate or launch new apps on the cloud, identify apps fit for cloud computing, define new processes, collaborate with other companies. Commitment: embrace success, showcase cost savings, build an enterprise cloud strategy, learn from each experience, expand accordingly.
  • Multiple Opportunities to Leverage Amazon Web Services (AWS). AWS has ">5x the compute capacity of the next 14 providers combined" (Gartner, Aug 2013): access to massive compute and storage, billed by the hour – only pay for what is used. HGST Japan Research Lab is using AWS for a higher-performance, lower-cost, faster-deployed solution vs. buying a huge on-site cluster. Develop AWS competency – many opportunities, since in-house and commercial HPC applications are "cloud ready". Provide computing when needed: reduce capital investment and risk and increase flexibility. Faster response to business needs: rapid prototyping to pilot new IT capabilities with a "PO process"; set up users, allocate compute and storage in minutes, load apps, and go. AWS also provides a great option for disaster recovery for our on-premise clusters and storage.
  • HGST's Amazon HPC Platform. Large-scale molecular dynamics simulation for HDI: lube molecules spreading onto COC; lube depletion in TAR with a 2D heat profile (heat spot in TAR, 36 nm; ~300,000 atoms). [Chart: atom count (1.E+03 to 1.E+07) vs. number of cores (0-600) for three simulation cases, with relaxation times from under 1 ns to 5 ns.] Read/write simulation applications: MAGLAND (read/write magnetics, electro-magnetic fields, mechanical), CST, Commercial LLG, Ansys, Ansys HFSS. Base HPC platform scalable to thousands of instances to support numerous simultaneous simulations; pre- and post-processing server farms; new G2 instances add visualization capabilities.
  • Big Data's "3 V's". Best pragmatic definition, from Snijders et al.: "Data sets so large and complex that they become awkward to work with using standard tools and techniques." Volume (data collected; analysis & metadata creation): terabytes → petabytes & exabytes. Velocity (data acquisition; analysis & action): batch → real-time & streaming. Variety (data sources, data types, applications): structured → unstructured, semi-structured & structured; the key difference is that data structure does not need to be defined before loading. Implications & opportunities: hardware and software optimization; architectural shifts – scale-out systems, distributed filesystems, tiered storage, Hadoop…
  • Data Sources → Big Data Platform → Consumers. Data sources (slider, wafer, media, substrate, HDD, HGA, failure screen tests, field data, supplier, customer) feed all raw parametric, logistic, and vintage data into the platform; parallelized batch analytics produce enriched data and raw extracts; end-to-end integrated data flows into a new unified EDW, SAP/DWs, and app-specific views. Consumers: optimize/reduce testing, proactive drift identification, ad hoc analysis, customer FA via field data, new high-value parameters – via SAS, Compellon, or other predictive analytic tools, plus Tableau and other tools.
  • Characteristics of a "Typical" Hadoop / Big Data Cluster. Hadoop handles large data volumes and reliability in the software tier: it distributes data across the cluster and uses replication to ensure data reliability and fault tolerance. Each machine in a Hadoop cluster stores AND processes data, so machines must do both well; processing is sent directly to the machines storing the data. Hadoop MapReduce compute-bound operations and workloads: clustering/classification, complex text mining, natural-language processing, feature extraction. Hadoop MapReduce I/O-bound operations and workloads: indexing, grouping, data importing and exporting, data movement and transformation (see the sketch below). Big Data solutions must support a large variety of compute and I/O operations and storage needs … enter "the Cloud".
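To make the I/O-bound "grouping" style of MapReduce job concrete, here is a minimal Hadoop Streaming word count in Python. This is a sketch only, not from the talk; the jar location, input/output paths, and the single-script map/reduce switch are all illustrative assumptions:

```python
#!/usr/bin/env python3
"""Minimal Hadoop Streaming word count. Illustrative invocation:
  hadoop jar hadoop-streaming.jar \
      -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
      -input /data/in -output /data/out
"""
import sys

def mapper():
    # The map step runs on the node holding the input split, i.e.
    # "processing sent directly to the machines storing the data".
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts map output by key, so identical words arrive adjacently.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current and current is not None:
            print(f"{current}\t{count}")
            count = 0
        current = word
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```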
  • AWS Big Data Platform Storage Services. Amazon EBS: block storage for elastic computing; optimized for performance (SSD / 15K / 10K); highly virtualized, SAN-based. Amazon S3: "generic" object storage, the bulk of AWS storage today; virtualized or reserved use; server/network-based. Amazon Glacier: cold/cool storage; the lowest-cost model for the least-used data; 3-5 hour latency, sequentialized access.
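As an illustration of the S3-to-Glacier split described above, the sketch below uses boto3 (the current AWS SDK for Python, which postdates this 2013 talk); the bucket name, key prefix, and 30-day threshold are made up for the example:

```python
import boto3

s3 = boto3.client("s3")

# Land simulation output in S3, the bulk "generic" object tier.
s3.upload_file("results/run_001.tar.gz",
               "example-hpc-results",          # hypothetical bucket
               "runs/run_001.tar.gz")

# Lifecycle rule: after 30 days, transition cold results to Glacier,
# the "least used data" tier with 3-5 hour retrieval latency.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-hpc-results",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-cold-runs",
            "Filter": {"Prefix": "runs/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        }]
    },
)
```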
  • HGST's Other Amazon Use Cases/Capabilities: petabyte-scale data warehousing ("between Glacier & S3"); running data visualization tools in AWS; a resource tracking tool, including a Tableau instance for reporting and visualization. More and more users are coming to IT asking how to leverage this new compute capability.
  • We Are Just Starting with the Cloud: current results are from a 6-month effort; re-aligning business group leadership; demands and use will grow and accelerate. Cloud + HGST IT = strong innovation and business partner.
  • Cloud Computing @ Aerospace Bill Williams, The Aerospace Corporation © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  • Introduction and Background • IT executive for The Aerospace Corporation (Aerospace) • Manage HPC compute and cloud resources for the Aerospace corporation • Career path has taken me through end-user support, system administration, and enterprise architecture
  • Agenda • Who is Aerospace? • High Performance Computing @ Aerospace • Services Provided • Cloud Motivation • Where are we today? • What makes this work? • Challenges • Lessons Learned
  • Who is Aerospace? [Video]
  • High Performance Computing @ Aerospace • Allow engineers and scientists to focus on their discipline and research • Reduce and eliminate complexity in using High Performance Computing (HPC) resources • Supply and support centralized and networked HPC resources
  • Services Provided • Cluster Computing "Big Iron Linux" • Dense Core Computing • High Performance Cloud Computing • High Performance Storage Systems • Software Development Revision Control Repository
  • Cloud Motivation • Respond to an increasing and variable demand • Improve resource deployments and use • Enhance provisioning • Improve security posture • Improve disaster recovery posture • Greener
  • Where are we today? • Successfully established elastic clusters in AWS GovCloud – workload runs include Monte Carlo and array simulations • Key features of the GovCloud clusters are auto-scaling and on-demand computing • Compute instances are created as needed to meet job computational requirements • Making strides towards mimicking internal clusters in GovCloud
  • What makes this work? • AWS GovCloud – GovCloud is FedRAMP compliant • Secure transport to and from Aerospace – VPC provides an additional layer of security while data is in transit • Cycle Computing – Cycle provides cluster auto-scaling
  • Lessons Learned • Enhanced analytics and business intelligence • Customer success stories • Standard images • Demonstrated operational “agility”
  • Lessons Learned • Domain space is dynamic • Expertise required • Layers of complexity • Ensuring data security (in hybrid deployment model)
  • Challenges • Establishing a cloud storage infrastructure • Determining appropriate bandwidth between Aerospace and GovCloud • Library replication of internal systems • System integration with internal authentication services • Ensuring a seamless transition to hybrid services
  • What's Next? • Expand offerings • Explore chargeback • Explore "cloudifying" other HPC platforms • Track technology • Provide workload-specific ad hoc offerings • Provide surge capability for HPC resources
  • Accelerating Science Michael Steeves, Novartis Institutes for Biomedical Research
  • Novartis Institutes for BioMedical Research (NIBR) • Unique research strategy driven by patient needs • World-class research organization with about 6,000 scientists globally • Intensifying focus on molecular pathways shared by various diseases • Integration of clinical insights with mechanistic understanding of disease • Research-to-development transition redefined through fast and rigorous "proof-of-concept" trials • Strategic alliances with academia and biotech strengthen the preclinical pipeline
  • Accelerating the Science. Requirements: large-scale computational chemistry; simulation results in under a week; the ability to run multiple experiments on demand. Challenges: sustained access to 50,000+ compute cores; the ability to monitor and re-launch jobs; no additional capital expenditure; the internal HPCC already running at capacity. Job profile: embarrassingly parallel, CPU bound, with low I/O, memory, and network requirements (a sketch of this job shape follows). Virtual screening: docking compound molecules ("keys") against a target molecule's binding site ("lock").
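A minimal sketch of that job shape in Python: every compound is scored independently, so the work fans out with no inter-task communication. The scoring function and compound names below are stand-ins, not Novartis's actual pipeline:

```python
import hashlib
from multiprocessing import Pool

def dock_against_target(compound: str) -> float:
    # Stand-in for a real docking/scoring kernel (illustrative only).
    digest = hashlib.md5(compound.encode()).hexdigest()
    return (int(digest, 16) % 10_000) / 10_000

def score_compound(compound: str):
    # Each compound is scored on its own: CPU bound, low I/O, and no
    # communication between tasks -- embarrassingly parallel by construction.
    return compound, dock_against_target(compound)

if __name__ == "__main__":
    compounds = [f"CPD-{i:07d}" for i in range(100_000)]  # hypothetical library
    with Pool() as pool:  # one worker per core; the same fan-out scales across nodes
        scores = pool.map(score_compound, compounds)
    top_hits = sorted(scores, key=lambda r: r[1])[:10]
    print(top_hits)
```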
  • The Cloud: Flexible Science on Flexible Infrastructure. Engineering the right infrastructure for a workload: the software runs the same job many times across instance types, measures the throughput, and determines the $ per job; use the instances that provide the best scientific ROI. The CC2 instance (Intel Xeon® 'Sandy Bridge') ran best for this workload (a sketch of the $/job calculation follows).
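A sketch of that $-per-job comparison. The throughput numbers and hourly prices below are invented for illustration, not the actual 2013 benchmark data:

```python
# Hypothetical benchmark output: jobs/hour per instance type, and an
# illustrative hourly price for each (not real pricing).
benchmarks = {
    "cc2.8xlarge": {"jobs_per_hour": 170.0, "price_per_hour": 2.40},
    "c1.xlarge":   {"jobs_per_hour": 35.0,  "price_per_hour": 0.58},
    "m1.xlarge":   {"jobs_per_hour": 18.0,  "price_per_hour": 0.48},
}

def cost_per_job(stats: dict) -> float:
    return stats["price_per_hour"] / stats["jobs_per_hour"]

# The cheapest $/job, not the cheapest $/hour, wins -- which is how a
# big instance like CC2 can come out ahead of smaller, cheaper ones.
for itype, stats in sorted(benchmarks.items(), key=lambda kv: cost_per_job(kv[1])):
    print(f"{itype}: ${cost_per_job(stats):.4f}/job")
```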
  • Super Computing in the Cloud. Metrics: Compute Hours of Science – 341,700 hours; Compute Days of Science – 14,238 days; Compute Years of Science – 39 years; AWS Instance Count (CC2) – 10,600 instances. Equivalent to a $44 million infrastructure; 10 million compounds screened; 39 drug-design years in 11 hours for a cost of …$4,232; 3 compounds identified and synthesized for screening.
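The slide's unit conversions check out; a quick verification of the arithmetic, using only the numbers above:

```python
compute_hours = 341_700
print(compute_hours / 24)        # 14237.5 -> the slide's 14,238 compute days
print(compute_hours / 24 / 365)  # ~39.0   -> the slide's 39 compute years
print(4_232 / compute_hours)     # ~$0.012 per compute-hour of science
```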
  • Key Learnings / What's Next? The diversity of life sciences brings unique challenges: spend the time analyzing and tuning; flexibility, scalability, and performance; time to rethink and retool; challenge the science and the scientist; collaboration. Future plans: the Chemical Universe – 166 billion compounds (extreme-scale CPU); next-generation sequencing in the cloud (extreme CPU, memory, I/O); "disruptive" technologies – imaging (10x that of NGS!).
  • Using On-Demand and Spot Instances Together. When task durations are greater than 1 hour, or tasks require multiple machines (MPI) for long periods, use on-demand. Shorter workloads work great on Spot Instances. If you want a guaranteed end time, use on-demand as well (see the helper below), so the architecture looks like…
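That rule of thumb, encoded as a small Python helper; the function and parameter names are mine, not Cycle's API:

```python
def choose_purchase_model(task_hours: float,
                          needs_mpi: bool,
                          guaranteed_end: bool) -> str:
    """Long tasks, tightly coupled (MPI) tasks, and hard deadlines go
    on-demand; short, interruptible tasks ride cheaper Spot capacity."""
    if task_hours > 1 or needs_mpi or guaranteed_end:
        return "on-demand"
    return "spot"

# Examples of the rule in action:
print(choose_purchase_model(0.5, False, False))  # spot
print(choose_purchase_model(6.0, True,  False))  # on-demand
print(choose_purchase_model(0.5, False, True))   # on-demand
```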
  • User Scale from 150 to 150,000+ Cores. CycleCloud deploys secured, auto-scaled HPC clusters alongside legacy internal HPC, with a shared FS / S3 for data. The HPC orchestration handles Spot Instance bids and loss: check the job load; calculate the ideal HPC cluster; properly price the bids (load-based Spot bidding); manage Spot Instance loss. On-demand execute nodes give a guaranteed finish; Spot Instance execute nodes are auto-started and auto-stopped when that makes the calculation faster/cheaper. A sketch of the loop follows.
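A hedged sketch of that orchestration loop, rendering the four steps on the slide as code; this is not CycleCloud's actual implementation, and `scheduler` and `cloud` are hypothetical interfaces:

```python
import time

def orchestrate(scheduler, cloud, max_nodes: int, bid_price: float):
    while True:
        queued = scheduler.queued_jobs()            # 1. check job load
        ideal = min(queued, max_nodes)              # 2. calculate ideal HPC cluster
        running = cloud.running_spot_nodes()
        if ideal > running:                         # 3. properly price the bids
            cloud.request_spot(ideal - running, max_price=bid_price)
        lost = cloud.reclaimed_spot_nodes()         # 4. manage Spot Instance loss
        if lost:
            scheduler.requeue(lost)  # jobs must be retryable for this to be safe
        time.sleep(60)
```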
  • Other Production Use Cases • Sequencing, genomics, life sciences • MPI workloads for FEA, CFD, energy, utilities • MATLAB and R applications for stats/modeling • Windows HPC Server cluster for finance • Heat transfer and other FEA • Insurance risk management • Rendering/VFX
  • Designing Solar Materials. The challenge is efficiency: we need to efficiently turn photons from the sun into electricity. The number of possible materials is limitless, so we need to separate the right compounds from the useless ones. If the 20th century was the century of silicon, the 21st will be all organic. How do we find the right material out of 205,000 without spending the entire 21st century looking for it?
  • Challenge: 205,000 compounds totaling 2,312,959 core-hours, or 264 core-years
  • 205,000 molecules, 264 years of computing: 16,788 Spot Instances, 156,314 cores!
  • 205,000 molecules, 264 years of computing: 156,314 cores = 1.21 PetaFLOPS (Rpeak), equivalent to #29 on the June 2013 Top500 list
  • 205,000 molecules, 264 years of computing: done in 18 hours; access to a $68M system for $33k
  • 1.21 PetaFLOPS, 156,000 core cluster
  • Solution: 205,000 compounds and 264 core-years, run on a 156k-core utility HPC cluster in 18 hours for $0.16/molecule, using Schrödinger Materials Science tools, CycleCloud, and AWS Spot Instances (a quick arithmetic check follows)
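The headline numbers are self-consistent; a quick check using only figures from the slides above:

```python
core_hours = 2_312_959
print(core_hours / 24 / 365)   # ~264     -> the stated 264 core-years
print(core_hours / 18)         # ~128,500 cores busy on average over 18 hours
print(156_314 / 16_788)        # ~9.3     cores per Spot Instance on average
print(205_000 * 0.16)          # 32,800   -> the quoted ~$33k total
```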
  • Thanks to our speakers!
  • Question and Answer. How does utility HPC apply to your organization? Follow us: @cyclecomputing, @jasonastowe. Come to Cycle's booth: #1112. We're hiring: jointheteam@cyclecomputing.com
  • Please give us your feedback on this presentation BDT212 As a thank you, we will select prize winners daily for completed surveys!