"Running high-performance scientific and engineering applications is challenging no matter where you do it. Join IT executives from Hitachi Global Storage Technology, The Aerospace Corporation, Novartis, and Cycle Computing and learn how they have used the AWS cloud to deploy mission-critical HPC workloads.
Cycle Computing leads the session on how organizations of any scale can run HPC workloads on AWS. Hitachi Global Storage Technology discusses experiences using the cloud to create next-generation hard drives. The Aerospace Corporation provides perspectives on running MPI and other simulations, and offer insights into considerations like security while running rocket science on the cloud. Novartis Institutes for Biomedical Research talks about a scientific computing environment to do performance benchmark workloads and large HPC clusters, including a 30,000-core environment for research in the fight against cancer, using the Cancer Genome Atlas (TCGA)."
Big Data and High Performance Computing Solutions in the AWS Cloud | Amazon Web Services
Managing big data and running supercomputing jobs used to be for only well-funded research organizations and large corporations, but not any longer. AWS has democratized supercomputing and big data for the masses! AWS can provide you with the 64th fastest supercomputer in the world, on-demand and pay as you go. Hear from Ben Butler, Head of AWS Big Data Marketing, to learn how our customers are using big data and high performance computing to change the world. Not only is AWS technology available to everyone, but it is self-service and cheaper than ever before, featuring innovative technology and flexible pricing models – our AWS cloud computing platform has disrupted big data and HPC. Learn from customer successes, as Ben shares real-world case studies describing the specific big data and high performance computing challenges being solved on AWS. We will conclude with a discussion around the tutorials, public datasets, test drives, and our grants program - all of the tools needed to get you started quickly.
AWS Webcast - An Introduction to High Performance Computing on AWS | Amazon Web Services
High Performance Computing (HPC) allows scientists and engineers to solve complex science, engineering, and business problems using applications that require high bandwidth, low latency networking, and very high compute capabilities. Learn how the AWS cloud can cost-effectively provide the scalable computing resources, storage services, and analytic tools that enable running various kinds of HPC workloads. Who should attend? Engineers, architects, product managers, data scientists, high performance computing specialists, and researchers from industry and academia, along with technically-minded business stakeholders looking to put data to work for their organization.
Drug discovery at 2x speed. Faster, more comprehensive testing approval processes. Identifying gene targets in massive sequencing data sets. These goals are ambitious yet attainable, but not without increasing the computational capabilities of today's researchers. While everyone agrees that simply deploying more infrastructure is not the answer, running that work in the cloud is not without challenges. In this talk we will discuss and illustrate elements of those workloads that Cycle Computing's customers have run on AWS, generating vastly better results than would have been attained on traditional infrastructure. We will cover some common problems they encountered, and how they resolved them using Amazon EC2, S3, Glacier, and Cycle's software.
Presenters: Dougal Ballantyne, Business Development, AWS; Rob Futrick, CTO, Cycle Computing
Those who out-compute can often out-compete. The cloud gives you access to a massive amount of compute power when you need it. This talk presents an introduction to HPC in the cloud, including the benefits of HPC in the cloud, how to get started, some tools to use, and how you can manage data. We will showcase several examples of HPC in the cloud from a number of public sector and commercial customers.
Created by: Dr. Jeff Layton, Principal Solutions Architect
Visit http://aws.amazon.com/hpc for more information about HPC on AWS.
High Performance Computing (HPC) allows scientists and engineers to solve complex science, engineering, and business problems using applications that require high bandwidth, low latency networking, and very high compute capabilities. AWS allows you to increase the speed of research by running high performance computing in the cloud and to reduce costs by providing Cluster Compute or Cluster GPU servers on-demand without large capital investments. You have access to a full-bisection, high bandwidth network for tightly-coupled, IO-intensive workloads, which enables you to scale out across thousands of cores for throughput-oriented applications.
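To make the cluster networking point concrete, here is a minimal boto3 sketch, not taken from any of the sessions above, of launching a tightly coupled cluster into a "cluster" placement group; the region, AMI ID, instance type, and node count are placeholder assumptions.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# A "cluster" placement group packs instances onto the low-latency,
# full-bisection-bandwidth network described above.
ec2.create_placement_group(GroupName="hpc-demo", Strategy="cluster")

# Launch the compute nodes into the placement group.
response = ec2.run_instances(
    ImageId="ami-12345678",       # placeholder AMI
    InstanceType="c5n.18xlarge",  # network-optimized instance type (assumed)
    MinCount=8,
    MaxCount=8,
    Placement={"GroupName": "hpc-demo"},
)
print([i["InstanceId"] for i in response["Instances"]])
```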
(BDT201) Big Data and HPC State of the Union | AWS re:Invent 2014 | Amazon Web Services
Leveraging big data and high performance computing (HPC) solutions enables your organization to make smarter and faster decisions that influence strategy, increase productivity, and ultimately grow your business. We kick off the Big Data and HPC track with the latest advancements in data analytics, databases, storage, and HPC at AWS. Hear customer success stories and discover how to put data to work in your own organization.
(CMP202) Engineering Simulation and Analysis in the Cloud | Amazon Web Services
"Building great products, ones that are aesthetically appealing as well as functionally sound, requires cutting-edge design and engineering. Given the high cost of physical testing prototypes, engineering organizations are turning to simulation and analysis using digital models, but compute requirements for these have traditionally required expensive on-premises infrastructure. But now, engineering organizations can use high-performance computing services from AWS and solutions from AWS technology partners to innovate at scale globally, with no up-front capital infrastructure investment.
In this session, AWS Partner Ansys shares how they help customers of all sizes design and engineer better products through digital simulation and analysis using HPC on AWS."
How do you calculate the cost of a Hadoop infrastructure on Amazon AWS, given some data volume estimates and a rough use case?
This presentation attempts to compare the different options available on AWS.
The TCO Calculator - Estimate the True Cost of Hadoop | MapR Technologies
http://bit.ly/1wsAuRS - There are many hidden costs for Apache Hadoop that have different effects across different Hadoop distributions. With the new MapR TCO calculator, organisations have a simple, reliable, fact-based tool for comparing costs.
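As a rough illustration of the kind of arithmetic such a calculator performs, here is a back-of-the-envelope sizing sketch in Python; every number in it is an invented assumption, not AWS or MapR pricing.

```python
import math

# Back-of-the-envelope Hadoop-on-AWS sizing; all figures are
# illustrative assumptions.
data_tb = 100            # raw data volume estimate
replication = 3          # HDFS replication factor
usable_tb_per_node = 8   # usable disk per node after OS/overhead
hourly_price = 1.0       # assumed on-demand $/hour per node

nodes = math.ceil(data_tb * replication / usable_tb_per_node)
monthly_compute = nodes * hourly_price * 730   # ~730 hours per month

print(f"{nodes} nodes, ~${monthly_compute:,.0f}/month compute")
# Transient clusters with data in S3 change this math: you pay compute
# only for job hours instead of 730 hours per node.
```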
Learn from Accubits Technologies
High Performance Computing (HPC) most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business.
Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ... | Spark Summit
Come explore a feature we’ve created that is not supported out of the box: the ability to add or remove nodes from always-on, real-time Spark Streaming jobs. Elastic Spark Streaming jobs can automatically adjust to the demands of traffic or volume. Using a set of configurable utility classes, these jobs scale down when lulls are detected and scale up when load is too high. We process multiple TBs per day with billions of events. Our traffic pattern experiences natural peaks and valleys with the occasional sustained, unexpected spike. Elastic jobs have freed us from manual intervention, given back developer time, and made a large financial impact through maximized resource utilization.
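As a rough illustration of the scaling policy described above (not the speakers' actual utility classes), here is a minimal Python sketch; the metric source is a stub and the thresholds are invented.

```python
import random

BATCH_INTERVAL_S = 60        # streaming batch interval (assumed)
SCALE_UP_RATIO = 0.9         # busy: batches nearly overrun the interval
SCALE_DOWN_RATIO = 0.4       # lull: batches finish well early

def get_avg_batch_seconds(window: int = 10) -> float:
    """Stub: a real job would read batch processing times from the
    Spark listener/metrics API over the last `window` batches."""
    return random.uniform(10, 70)

def autoscale(current_nodes: int) -> int:
    ratio = get_avg_batch_seconds() / BATCH_INTERVAL_S
    if ratio > SCALE_UP_RATIO:            # falling behind: add nodes
        return current_nodes + 2
    if ratio < SCALE_DOWN_RATIO and current_nodes > 2:
        return current_nodes - 1          # lull detected: shed a node
    return current_nodes

print(autoscale(10))
```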
In this session, learn the basics of HPC on AWS and watch a demonstration of how to launch a large cluster.
Presenter: Scott Eberhardt, Specialist Solutions Architect, AWS UK
Challenges for running Hadoop on AWS - Advanced AWS Meetup | Andrei Savu
Nowadays we have all the tools we need to spin up and tear down clusters with hundreds of nodes in minutes, which puts more pressure on the tools we use to configure and monitor our applications. This challenge is even more interesting when we have to deal with long-running distributed data storage and processing systems like Hadoop. In this talk we will look at some of the challenges of creating and managing Hadoop clusters in AWS, discuss improvement opportunities in monitoring (e.g., detecting and dealing with instance failure, resource contention, and noisy neighbors), and touch on the future and how we should go about disconnecting workload dispatch from cluster lifecycle.
The Pandemic Changes Everything, the Need for Speed and Resiliency | Alluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
The Pandemic Changes Everything, the Need for Speed and Resiliency
Parviz Peiravi, Global CTO of Financial Services Solutions, Intel
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis... | Spark Summit
Redis accelerates Apache Spark execution by 45 times when used as a shared, distributed in-memory datastore for Spark in analyses like time-series data range queries. With redis-ml, the Redis module for machine learning, spark-ml model implementations gain a new real-time serving layer that offloads model processing directly to Redis, allows multiple applications to reuse the same models, and speeds up classification and execution of these models by 13x. Join this session to learn more about the Redis Labs connector for Apache Spark, which enhances production implementations of real-time big data processing.
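For flavor, here is a minimal PySpark sketch of the shared-datastore pattern using the spark-redis connector; the host, table, and column names are assumptions, and the connector jar must be on the Spark classpath. This mirrors the pattern described above, not the exact demo from the session.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("redis-demo")
         .config("spark.redis.host", "localhost")   # assumed Redis host
         .config("spark.redis.port", "6379")
         .getOrCreate())

df = spark.createDataFrame([(1, 27.3), (2, 31.9)], ["sensor_id", "temp"])

# Persist the DataFrame into Redis so other applications can reuse it.
(df.write.format("org.apache.spark.sql.redis")
   .option("table", "readings")        # Redis key prefix (illustrative)
   .option("key.column", "sensor_id")  # column used as the key suffix
   .mode("overwrite")
   .save())
```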
High Performance Computing on AWS: Accelerating Innovation with virtually unl... | Amazon Web Services
In this session, learn how you can innovate without limits, reduce costs, and get your results to market faster by moving your HPC workloads to AWS. Learn how HPC on AWS lets your research needs dictate your HPC architecture requirements, not the other way around. Understand how to create, operate, and tear down secure, well-optimized HPC clusters in minutes.
Deciding on the deployment model is critical when enterprises adopt Hadoop. Initially, the bare-metal model (an on-premises cluster with physical servers) was popular, to avoid I/O overhead in virtualized environments. These days, however, the cloud is also a contending option, with compelling cost savings and ease of operation. To aid in assessing the deployment options, Accenture Technology Labs developed the Accenture Data Platform Benchmark suite and a total cost of ownership (TCO) model, and tuned and compared the performance of bare-metal Hadoop clusters and a Hadoop cloud service. Interestingly enough, the study discovered that the price/performance ratio is not a critical factor in making a Hadoop deployment decision. Employing empirical and systematic analyses, the study found comparable price/performance ratios for bare-metal Hadoop clusters and Hadoop-as-a-service. Moreover, cheaper purchasing options (e.g., long-term contracts) provide a better ratio than bare metal in many cases. This result debunks the idea that the cloud is not suitable for Hadoop MapReduce workloads due to their heavy I/O requirements. Furthermore, the study finds that the Hadoop default configuration leaves ample headroom for performance tuning, and that cloud infrastructure enables even further performance-tuning opportunities.
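A toy version of the study's price/performance comparison, with invented placeholder numbers rather than Accenture's measurements, looks like this:

```python
# Dollars per completed job: lower is better. All figures invented.
def price_performance(hourly_cost: float, jobs_per_hour: float) -> float:
    return hourly_cost / jobs_per_hour

bare_metal = price_performance(hourly_cost=50.0, jobs_per_hour=10.0)
cloud = price_performance(hourly_cost=60.0, jobs_per_hour=12.5)
print(f"bare metal: ${bare_metal:.2f}/job, cloud: ${cloud:.2f}/job")
```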
High Performance Computing (HPC) has been driving technology advancements for many decades. HPC enables performance-demanding applications and workloads to solve complex problems while dramatically reducing time to solution. With a history of requiring very large data centers, HPC is now on the edge of a paradigm shift. The AWS Cloud gives customers access to near-infinite compute and storage resources without the overhead of running their own data centers. A vast number of HPC segments and verticals are already seeing great success running their workloads on AWS: Life Sciences, Financial Services, Energy & Geo Sciences, and Manufacturing are all successfully deploying their applications on AWS. In these two sessions we will discuss how AWS can help you run HPC workloads in the cloud. The first session is a general introduction to HPC on AWS.
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack | Alluxio, Inc.
Alluxio Tech Talk
January 21, 2020
Speakers:
Matt Fuller, Starburst
Dipti Borkar, Alluxio
With the advent of the public clouds and data increasingly siloed across many locations -- on premises and in the public cloud -- enterprises are looking for more flexibility and higher performance approaches to analyze their structured data.
Join us for this tech talk where we’ll introduce the Starburst Presto, Alluxio, and cloud object store stack for building a highly concurrent, low-latency analytics platform. This stack provides a strong solution for running fast SQL across multiple storage systems, including HDFS, S3, and others, in public cloud, hybrid cloud, and multi-cloud environments. You’ll learn more about the following (a short query sketch follows the list):
- The architecture of Presto, an open source distributed SQL engine
- How the Presto + Alluxio stack queries data from cloud object storage like S3 for faster and more cost-effective analytics
- Achieving data locality and cross-job caching with Alluxio regardless of where data is persisted
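Here is the short query sketch mentioned above: a hedged example using the presto-python-client package, with placeholder host, catalog, and table names; the Hive table behind the query would sit on an alluxio:// path so Presto reads S3 data through Alluxio's cache.

```python
import prestodb

# Connect to a Presto coordinator (hostname and names are illustrative).
conn = prestodb.dbapi.connect(
    host="presto-coordinator", port=8080,
    user="analyst", catalog="hive", schema="default",
)
cur = conn.cursor()
# The "orders" table is assumed to be backed by an alluxio:// location.
cur.execute("SELECT region, count(*) FROM orders GROUP BY region")
for row in cur.fetchall():
    print(row)
```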
Technical computing (high-performance computing) used to be the domain of specialists using expensive, proprietary equipment. Today, technical computing is going mainstream, becoming an irreplaceable competitive tool for research scientists and businesses alike.
Here's a look at Dell’s pioneering role in the evolution of technical computing, with a focus on the key industry trends and technologies that will bring the next generation of tools and functionality to research and development organizations around the world.
Building a Just-in-Time Application Stack for Analysts | Avere Systems
Slide presentation from Webinar on February 17, 2016.
People in analytical roles are demanding more and more compute and storage to get their jobs done. Instead of building out infrastructure for a few employees or a department, systems engineers and IT managers can find value in creating a compute stack in the cloud to meet the fluctuating demand of their clients.
In this 45-minute webinar, you’ll learn:
- How to identify the right analytical workloads
- How to create a scalable compute environment using the cloud for analysts in under 10 minutes
- How to best manage costs associated with the cloud compute stack
- How to create dedicated client stacks with their own scratch space as well as general access to reference data
Health systems departments, research & development departments, and business analyst groups all face silos of these challenging, compute-intensive use cases. By learning how to quickly build this flexible workflow that can be scaled up and down (or off) instantly, you can support business objectives while efficiently managing costs.
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio | Alluxio, Inc.
Alluxio Global Online Meetup
Apr 23, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Jiao (Jennie) Wang, Intel
Tsai Louie, Intel
Bin Fan, Alluxio
Today, many people run deep learning applications with training data in separate storage such as object stores or remote data centers. This presentation will demo the Intel Analytics Zoo + Alluxio stack, an architecture that enables high performance while keeping cost and resource efficiency balanced, without the network becoming an I/O bottleneck.
Intel Analytics Zoo is a unified data analytics and AI platform open-sourced by Intel. It seamlessly unites TensorFlow, Keras, PyTorch, Spark, Flink, and Ray programs into an integrated pipeline, which can transparently scale from a laptop to large clusters to process production big data. Alluxio, as an open-source data orchestration layer, accelerates data loading and processing in Analytics Zoo deep learning applications.
In this talk, we will go over the following (a short data-path sketch follows the list):
- What is Analytics Zoo and how it works
- How to run Analytics Zoo with Alluxio in deep learning applications
- Initial performance benchmark results using the Analytics Zoo + Alluxio stack
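Here is the short data-path sketch mentioned above. It uses plain PySpark rather than Analytics Zoo's own API, and the Alluxio master host, port, and dataset path are placeholders; the point is only that training reads go through an alluxio:// URI so repeated epochs hit the cache.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("alluxio-read-demo").getOrCreate()

# Requires the Alluxio client jar on the Spark classpath. Host, port,
# and path are placeholders.
images = spark.sparkContext.binaryFiles(
    "alluxio://alluxio-master:19998/datasets/train"
)
print(images.count())  # each element is (path, bytes) ready for decoding
```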
Deep Learning in the Cloud at Scale: A Data Orchestration Story | Alluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Deep Learning in the Cloud at Scale: A Data Orchestration Story
Mickey Zhang, Software Engineer (Microsoft)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Solving enterprise challenges through scale out storage & big compute final | Avere Systems
Google Cloud Platform, Avere Systems, and Cycle Computing experts will share best practices for advancing solutions to big challenges faced by enterprises with growing compute and storage needs. In this “best practices” webinar, you’ll hear how these companies are working to improve results that drive businesses forward through scalability, performance, and ease of management.
The slides were from a webinar presented January 24, 2017. The audience learned:
- How enterprises are using Google Cloud Platform to gain compute and storage capacity on-demand
- Best practices for efficient use of cloud compute and storage resources
- Overcoming the need for file systems within a hybrid cloud environment
- Understand how to eliminate latency between cloud and data center architectures
- Learn how to best manage simulation, analytics, and big data workloads in dynamic environments
- Look at market dynamics drawing companies to new storage models over the next several years
Presenters laid out a foundation for building infrastructure that supports ongoing demand growth.
Accelerate Analytics and ML in the Hybrid Cloud Era | Alluxio, Inc.
Alluxio Webinar
April 6, 2021
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Alex Ma, Alluxio
Peter Behrakis, Alluxio
Many companies we talk to have on-premises data lakes and use the cloud(s) to burst compute. Many are now establishing new object data lakes as well. As a result, analytics engines such as Hive, Spark, and Presto, as well as machine learning workloads, are experiencing sluggish response times with data and compute in multiple locations. We also know there is an immense and growing data management burden to support these workflows.
In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it solves the performance and data management challenges we see.
In this tech talk, we'll go over:
- What is Alluxio Data Orchestration?
- How does it work?
- Alluxio customer results
ICP for Data - Enterprise platform for AI, ML and Data Science | Karan Sachdeva
IBM Cloud Private for Data is the ultimate platform for all AI, ML, and Data Science workloads: an integrated analytics platform based on containers and microservices. It works with Kubernetes and Docker, even with Red Hat OpenShift, and delivers a variety of business use cases across all industries: FS, Telco, Retail, Manufacturing, etc.
ADV Slides: Comparing the Enterprise Analytic Solutions | DATAVERSITY
Data is the foundation of any meaningful corporate initiative. Fully master the necessary data, and you’re more than halfway to success. That’s why leverageable (i.e., multiple use) artifacts of the enterprise data environment are so critical to enterprise success.
Build them once (keep them updated), and use again many, many times for many and diverse ends. The data warehouse remains focused strongly on this goal. And that may be why, nearly 40 years after the first database was labeled a “data warehouse,” analytic database products still target the data warehouse.
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture | DATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp, but not the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
Imagine an entire IT infrastructure controlled not by hands and hardware, but by software. One in which application workloads such as big data, analytics, simulation and design are serviced automatically by the most appropriate resource, whether running locally or in the cloud. A Software Defined Infrastructure enables your organization to deliver IT services in the most efficient way possible, optimizing resource utilization to accelerate time to results and reduce costs. It is the foundation for a fully integrated software defined environment, optimizing your compute, storage and networking infrastructure so you can quickly adapt to changing business requirements. A comprehensive portfolio of management tools dynamically manage workloads and data, transforming a static IT infrastructure into a workload-, resource-, and data-aware environment.
Learn more: http://ibm.co/1wkoXtc
Watch the video presentation: http://insidehpc.com/2015/03/slidecast-software-defined-infrastructure/
Big Data, IoT, data lake, unstructured data, Hadoop, cloud, and massively parallel processing (MPP) are all just fancy words unless you can find uses cases for all this technology. Join me as I talk about the many use cases I have seen, from streaming data to advanced analytics, broken down by industry. I’ll show you how all this technology fits together by discussing various architectures and the most common approaches to solving data problems and hopefully set off light bulbs in your head on how big data can help your organization make better business decisions.
Design Choices for Cloud Data Platforms | Ashish Mrig
You have decided to migrate your workload to the cloud, congratulations! Which database should be used to host and query your data? Most people go with the default: AWS -> Redshift, GCP -> BigQuery, Azure -> Synapse, and so on. This presentation will go over design considerations, guidelines, and best practices for choosing your data platform, going beyond the default choices. We will talk about the evolution of databases, design, data modeling, and how to minimize cost.
Maximizing Oil and Gas (Data) Asset Utilization with a Logical Data Fabric (A... | Denodo
Watch full webinar here: https://bit.ly/3g9PlQP
It is no news that Oil and Gas companies are constantly faced with immense pressure to stay competitive, especially in the current climate, while striving to become data-driven at the heart of the process to scale and gain greater operational efficiencies across the organization.
Hence the need for a logical data layer to help Oil and Gas businesses move towards a unified, secure, and governed environment that efficiently optimizes the potential of data assets across the enterprise and delivers real-time insights.
Tune in to this on-demand webinar where you will:
- Discover the role of data fabrics and Industry 4.0 in enabling smart fields
- Understand how to connect data assets and the associated value chain to high impact domain areas
- See examples of organizations accelerating time-to-value and reducing NPT
- Learn best practices for handling real-time/streaming/IoT data for analytical and operational use cases
GCP On Prem Buyers Guide - White-paper | Qubole | Vasu S
A buyer's guide for migrating a data lake to Google Cloud: we look at the efficiency and agility an organization can achieve by adopting the Qubole open data lake platform and Google Cloud Platform.
https://www.qubole.com/resources/white-papers/gcp-on-prem-buyers-guide
Bitkom Cray presentation - on HPC affecting big data analytics in FS | Philip Filleul
High-value analytics in FS are being enabled by graph, machine learning, and Spark technologies. To make these real at production scale, HPC technologies are more appropriate than commodity clusters.
Similar to Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Invent 2013
How to build Forecasting services using ML and deep learn... algorithms | Amazon Web Services
Forecasting is an important process for many companies, used in many areas to accurately predict the growth and distribution of a product, the resources needed on production lines, financial presentations, and much more. Amazon uses advanced forecasting techniques, and some of these services have been made available to all AWS customers.
In this session we will show how to pre-process data that contains a temporal component and then use an algorithm that, starting from the type of data analyzed, produces an accurate forecast.
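As a hedged illustration of that pre-processing step (not the session's actual notebook), here is a small pandas sketch that regularizes a time series and builds the lag features most forecasting algorithms expect:

```python
import pandas as pd

# Toy daily sales series; values are invented.
df = pd.DataFrame(
    {"sold": [12, 15, 11, 18, 20, 17, 22]},
    index=pd.date_range("2020-01-01", periods=7, freq="D"),
)

daily = df.resample("D").sum().fillna(0)   # fill gaps in the series
daily["lag_1"] = daily["sold"].shift(1)    # yesterday's value
daily["rolling_3"] = daily["sold"].rolling(3).mean()  # short-term trend
print(daily.dropna())
```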
Big Data for Startups: how to create Big Data applications in Server... mode | Amazon Web Services
The variety and quantity of data created every day is accelerating ever faster and represents a unique opportunity to innovate and create new startups.
However, managing large amounts of data can seem complex: building large-scale Big Data clusters looks like an investment accessible only to established companies. But the elasticity of the Cloud and, in particular, Serverless services allow us to break through these limits.
Let's see, then, how it is possible to develop Big Data applications quickly, without worrying about the infrastructure, dedicating all our resources to developing our ideas to create innovative products.
You can now use Amazon Elastic Kubernetes Service (EKS) to run Kubernetes pods on AWS Fargate, the serverless compute engine built for containers on AWS. This makes it easier than ever to build and run your Kubernetes applications in the AWS cloud. In this session we will present the main features of the service and how to deploy your application in just a few steps.
Twenty years ago, Amazon went through a radical transformation aimed at increasing its pace of innovation. Over this period we learned how changing our approach to application development allowed us to greatly increase agility and release velocity and, ultimately, enabled us to build more reliable and scalable applications. In this session we will explain how we define modern applications and how building modern apps affects not only the application architecture but also the organizational structure, development release pipelines, and even the operating model. We will also describe common approaches to modernization, including the approach used by Amazon.com itself.
How to spend up to 90% less with containers and Spot Instances | Amazon Web Services
The use of containers keeps growing.
When properly designed, container-based applications are very often stateless and flexible.
AWS ECS, EKS, and Kubernetes on EC2 can take advantage of Spot Instances, yielding average savings of 70% compared to On-Demand Instances. In this session we will explore the characteristics of Spot Instances and how they can easily be used on AWS. We will also learn how Spreaker uses Spot Instances to run applications of various kinds, in production, at a fraction of the on-demand cost!
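A worked version of that savings claim, using invented placeholder prices rather than actual AWS rates:

```python
# All prices below are illustrative assumptions, not AWS rates.
on_demand_hourly = 0.40   # assumed on-demand $/hour per container host
spot_hourly = 0.12        # assumed Spot $/hour (~70% lower)
hosts, hours = 20, 730    # fleet size and hours in a month

on_demand_cost = on_demand_hourly * hosts * hours
spot_cost = spot_hourly * hosts * hours
saving = 1 - spot_cost / on_demand_cost
print(f"${on_demand_cost:,.0f} vs ${spot_cost:,.0f}: {saving:.0%} saved")
```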
In recent months, many customers have been asking us how to monetise Open APIs, simplify Fintech integrations, and accelerate the adoption of various Open Banking business models. Therefore, AWS and FinConecta would like to invite you to the Open Finance marketplace presentation on October 20th.
Event Agenda:
Open banking so far (short recap)
• PSD2, OB UK, OB Australia, OB LATAM, OB Israel
Intro to Open Finance marketplace
• Scope
• Features
• Tech overview and Demo
The role of the Cloud
The Future of APIs
• Complying with regulation
• Monetizing data / APIs
• Business models
• Time to market
One platform for all: a Strategic approach
Q&A
Make your startup's offering unique in the market with Machine Lea... services | Amazon Web Services
To create value and build a differentiated, recognizable offering, successful startups know how to combine established technologies with innovative, purpose-built components.
AWS provides ready-to-use services and, at the same time, lets you customize and build the differentiating elements of your own offering.
Focusing on Machine Learning technologies, we will see how to select the artificial intelligence services offered by AWS and, with the help of a demo, how to build custom Machine Learning models using SageMaker Studio.
OpsWorks Configuration Management: automate the management and deployments of... | Amazon Web Services
With the traditional approach to IT, implementing DevOps techniques was difficult for many years: they often involved manual activities that occasionally led to application downtime and interrupted users' operations. With the advent of the cloud, DevOps techniques are now within everyone's reach, at low cost, for any kind of workload, guaranteeing greater system reliability and resulting in significant improvements to business continuity.
AWS provides AWS OpsWorks as a Configuration Management tool that aims to automate and simplify the management and deployment of EC2 instances by means of Chef and Puppet workloads.
Learn how to use AWS OpsWorks to guarantee the reliability of your application running on EC2 instances.
Microsoft Active Directory on AWS to support your Windows Workloads | Amazon Web Services
Do you want to know the options for running Microsoft Active Directory on AWS? When moving Microsoft workloads to AWS, it is important to consider how to deploy Microsoft Active Directory to support group policy management, authentication, and authorization. In this session, we discuss the options for deploying Microsoft Active Directory on AWS, including AWS Directory Service for Microsoft Active Directory and deploying Active Directory on Windows on Amazon Elastic Compute Cloud (Amazon EC2). We cover topics such as integrating your on-premises Microsoft Active Directory environment into the cloud and using SaaS applications, such as Office 365, with AWS Single Sign-On.
From facial recognition to detecting fraud or manufacturing defects, image and video analysis based on artificial intelligence techniques is evolving and being refined at a rapid pace. In this webinar we will explore the possibilities offered by AWS services for applying state-of-the-art computer vision techniques to real-world scenarios.
Amazon Web Services and VMware are organizing a free virtual event next Wednesday, October 14, from 12:00 to 13:00, dedicated to VMware Cloud™ on AWS, the on-demand service that lets you run applications in cloud environments based on VMware vSphere® and access a wide range of AWS services, taking full advantage of the AWS cloud while protecting existing VMware investments.
Many organizations reap the benefits of the cloud by migrating their Oracle workloads, securing significant gains in agility and cost efficiency.
Migrating these workloads can create complexity during application modernization and refactoring, on top of which come performance risks that can be introduced when moving applications out of on-premises data centers.
Build your first serverless ledger-based app with QLDB and NodeJS | Amazon Web Services
Many companies today build applications with ledger-type functionality, for example to verify the history of credits and debits in banking transactions, or to track the supply chain flow of their products.
At the heart of these solutions are ledger databases, which provide a transparent, immutable, and cryptographically verifiable transaction log, but they are complex and costly tools to manage.
Amazon QLDB eliminates the need to build complex custom systems by providing a fully managed, serverless ledger database.
In this session we will find out how to build a complete serverless application that uses QLDB's capabilities.
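The session uses NodeJS; as a hedged sketch in Python instead, here is what such an application's data layer might look like with the pyqldb driver (the ledger and table names are placeholders, and the ledger must already exist).

```python
from pyqldb.driver.qldb_driver import QldbDriver

# "demo-ledger" and the payments table are placeholder assumptions.
driver = QldbDriver(ledger_name="demo-ledger")

def read_account(txn):
    # Runs inside one QLDB transaction; the journal behind the table is
    # immutable and cryptographically verifiable.
    cursor = txn.execute_statement(
        "SELECT * FROM payments WHERE account = ?", "a-123"
    )
    return list(cursor)

for doc in driver.execute_lambda(read_account):
    print(doc)
```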
With the rise of microservice architectures and rich mobile and web applications, APIs are more important than ever for delivering an exceptional user experience to end users. In this session we will learn how to tackle modern API design challenges with GraphQL, an open-source API query language used by Facebook, Amazon, and others, and how to use AWS AppSync, a managed serverless GraphQL service on AWS. We will dig into several scenarios, seeing how AppSync can help solve these use cases by building modern APIs with real-time and offline data-update capabilities.
We will also learn how Sky Italia uses AWS AppSync to deliver real-time sports updates to users of its web portal.
Oracle Database and VMware Cloud™ on AWS: the myths to debunk | Amazon Web Services
Many organizations reap the benefits of the cloud by migrating their Oracle workloads, securing significant gains in agility and cost efficiency.
Migrating these workloads can create complexity during application modernization and refactoring, on top of which come performance risks that can be introduced when moving applications out of on-premises data centers.
In these slides, AWS and VMware experts present simple, practical tips to ease and streamline the migration of Oracle workloads while accelerating the transformation to the cloud; they dive into the architecture and show how to take full advantage of VMware Cloud™ on AWS.
Amazon Elastic Container Service (Amazon ECS) is a highly scalable container management service that simplifies the management of Docker containers through an orchestration layer that controls deployment and lifecycle. In this session we will present the main features of the service, reference architectures for different workloads, and the simple steps needed to quickly migrate one or more of your containers.
JMeter webinar - integration with InfluxDB and Grafana | RTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB, and Grafana (a metrics-query sketch follows the list):
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
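Here is the metrics-query sketch mentioned above: a hedged example using the influxdb Python client, where the database and measurement names follow the defaults of JMeter's InfluxDB backend listener but are assumptions here.

```python
from influxdb import InfluxDBClient

# Host, database, and measurement/field names are assumptions based on
# the backend listener's defaults.
client = InfluxDBClient(host="localhost", port=8086, database="jmeter")
result = client.query(
    'SELECT mean("avg") FROM "jmeter" '
    "WHERE time > now() - 15m GROUP BY time(1m)"
)
for point in result.get_points():
    print(point)   # the same series Grafana would chart
```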
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Neuro-symbolic is not enough, we need neuro-*semantic* | Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply doing machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be realized when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this is illustrated with link prediction over knowledge graphs, but the argument is general.
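To make the link-prediction illustration concrete, here is a toy TransE-style scorer; the embeddings are random stand-ins, not a trained knowledge-graph model, so only the scoring rule itself is meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50
# Random placeholder embeddings; a real model would learn these.
entity = {e: rng.normal(size=dim) for e in ["Rome", "Italy", "Paris"]}
relation = {"capital_of": rng.normal(size=dim)}

def score(head: str, rel: str, tail: str) -> float:
    """TransE plausibility: smaller ||h + r - t|| means more plausible."""
    return -np.linalg.norm(entity[head] + relation[rel] - entity[tail])

print(score("Rome", "capital_of", "Italy"))
print(score("Paris", "capital_of", "Italy"))
```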
Epistemic Interaction - tuning interfaces to provide information for AI support | Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
The Art of the Pitch: WordPress Relationships and Sales | Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers, without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality | Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... | BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf | 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build, inspired by diverse, explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies need to be explicitly articulated, and we need to develop theories of change in the context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses.
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
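For a flavor of the Python binding mentioned above, here is a minimal sketch, assuming pypowsybl is installed via pip; it is an illustration, not an excerpt from the workshop notebook. It loads the IEEE 14-bus example network bundled with the library and runs an AC power flow.

# Minimal pypowsybl sketch: load a sample network and run an AC power flow.
import pypowsybl as pp

# Create the bundled IEEE 14-bus test network that ships with the library.
network = pp.network.create_ieee14()

# Run an AC load flow and check convergence of the main connected component.
results = pp.loadflow.run_ac(network)
print(results[0].status)

# Bus voltages come back as a pandas DataFrame.
print(network.get_buses()[['v_mag', 'v_angle']].head())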
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
• UI automation introduction
• UI automation sample
• Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell us all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details of how to best design a sturdy architecture within ODC.
3. Goals for today
• See real-world use cases from 3 leading engineering and scientific computing users
  – Steve Philpott, CIO, HGST, A Western Digital Company
  – Bill E. Williams, Director, The Aerospace Corporation
  – Michael Steeves, Sr. Systems Engineer, Novartis
• Understand the motivations, strategies, and lessons learned in running HPC / Big Data workloads in the cloud
• See the varying scales and application types that run well, including a 1.21 PetaFLOPS environment
4. Agenda
• Introduction
• Steve Philpott – Journey into Cloud
• Bill Williams – Cloud Computing @ Aerospace
• Michael Steeves – Accelerating Science
• Spot, On-demand, & Other Production uses
• Questions and answers
6. HGST Overview
Founded in 2003 through the combination of the hard drive businesses of IBM, the inventor of the hard drive, and Hitachi, Ltd. (plus 3 acquisitions)
Acquired by Western Digital in 2012
More than 4,200 active worldwide patents
Headquartered in San Jose, California
Approximately 41,000 employees worldwide
Develops innovative, advanced hard disk drives, enterprise-class solid state drives, external storage solutions and services
Delivers intelligent storage devices that tightly integrate hardware and software to maximize solution performance
[Product portfolio graphic for Cloud & Datacenter: Performance Enterprise – PCIe Enterprise SSDs and SAS 10K & 15K HDDs (Ultrastar®); Capacity Enterprise – 7200 RPM & CoolSpin HDDs (Ultrastar® & MegaScale DC™)]
7. Zero to Cloud in 6+ Months
By 31 Oct 2013:
• Cloud eMail – Microsoft Office365 (April 2013)
• Cloud eMail archiving/eDiscovery
• External Single Sign-On (off VPN)
• Cloud File/Collaboration – BOX
• Cloud CRM – Salesforce.com, integrated to save files in BOX
• Cloud High Performance Computing (HPC) on Amazon AWS
• Cloud Big Data Platform on Amazon AWS
8. Responding to the Changing Business Model
Where is our business model headed? The "New Age of Innovation" as a guide:
• N=1 – focus on individual customer experience
• R=G – resources are global
Implications:
• Increase in strategic partnering
• Need for a high level of flexibility
• Leveraging external expertise
Use of the Cloud/SaaS aligns with the virtual business model:
• Variable cost model critically important
• Lightweight, scalable services
• Reduced up-front capital spend
• Accelerated provisioning
• Pay as you go
9. Paradigm Shift: Consumerization of IT
"I have better technology at home"
The consumer web has set a new paradigm in ease of use and reduced cost:
• It has been driven by a series of platforms, and these platforms are household brand names today
• When we use these platforms, it continually amazes us how easily and consistently they work
• A new set of services: DRM to iTunes
Yet our workplace applications are cumbersome, costly, difficult to navigate, and require extensive support (Workday, 2009)
10. The Big Switch – The Box has Disappeared
The transformation of computing as we know it:
• Physical to virtual/digital move – do you really care which computer processed your last Google search?
• Efficiency – do not waste a CPU cycle or a byte of memory; today we build a 4-story building and only use the 1st floor
• Utility: IT as a service – plug it in and get it; where the electricity industry has gone, computing is following, and the shift is almost invisible to the end user
DATA is the value to the organization, not the "where"
11. Enabling the Virtual Organization
Reframing IT away from thinking of "the app":
• Business intelligence and analytics
• End-to-end business processes
• Enterprise data management
• New computing platforms
• Strategic outsourcing
• Software as a Service (SaaS)
New IT organizational structures: support and align to the "new business model"
12. Creating an Innovation Playground: Where to Start and How to Evolve
IT supports business strategy; executive buy-in (CEO, CIO, InfoSec, etc.); reduce cap-ex and optimize datacenter usage.
The journey runs through four stages, from awareness through understanding and transition to commitment:
• Educate / Learn / Play (awareness): team involvement, conferences, vendor briefings, expert services, best practices
• Experiment (understanding): team approach, hands-on approach, understand the value proposition, understand constraints
• Migrate (transition): migrate dev/test environments; migrate or launch new apps on the cloud; identify apps fit for cloud computing; define new processes; collaborate with other companies
• Implement (commitment), with the outcome defined and expertise built: embrace success, showcase cost savings, build an enterprise cloud strategy, learn from each experience, expand accordingly
13. Multiple Opportunities to Leverage Amazon Web Services (AWS)
AWS: ">5x the compute capacity than next 14 providers combined" – Gartner, Aug 2013
• Access to massive compute and storage, billed by the hour; only pay for what is used
• HGST Japan Research Lab: using AWS for a higher-performance, lower-cost, faster-deployed solution vs. buying a huge on-site cluster
• Develop AWS competency
• Many opportunities: in-house and commercial HPC applications are "cloud ready"
• Provide computing when needed: reduce capital investment and risk, and increase flexibility
• Faster response to business needs: rapid prototyping to pilot new IT capabilities with a "PO process"; set up users, allocate compute and storage in minutes, load apps and go
• AWS provides a great option for disaster recovery for our "on-premise" clusters and storage
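Slide 13's "allocate compute and storage in minutes" boils down to a couple of API calls. As a hedged illustration, not HGST's actual tooling, here is a boto3 sketch that requests a small batch of instances; the AMI, key pair, and security group IDs are hypothetical placeholders.

# Request a small batch of cluster-compute instances for an HPC run.
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

response = ec2.run_instances(
    ImageId='ami-0123456789abcdef0',            # hypothetical cluster AMI
    InstanceType='cc2.8xlarge',                 # the CC2 class mentioned later in the deck
    MinCount=1,
    MaxCount=8,
    KeyName='hpc-keypair',                      # hypothetical key pair
    SecurityGroupIds=['sg-0123456789abcdef0'],  # hypothetical security group
)

for inst in response['Instances']:
    print(inst['InstanceId'], inst['State']['Name'])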
14. HGST’s Amazon HPC Platform
[Chart: number of atoms simulated (1.E+03 up to 1.E+07) vs. number of cores (0 to 600) for molecular dynamics cases – Case 1 and Case 2 with relaxation times from < 1 ns to 5 ns; Case 3: lube depletion in TAR, 2D heat profile of a 36 nm heat spot, 300,000 atoms, relaxation time 5 ns; lube molecules spreading onto COC]
Simulation applications on the platform:
• Molecular dynamics simulation – basic and large-scale molecular simulation for HDI
• MAGLAND simulation – read/write magnetics, electro-magnetic fields, mechanical
• CST, commercial LLG, Ansys, and Ansys HFSS – read/write magnetics and electro-magnetic fields
• Pre- and post-processing server farms
Base HPC platform: scalable to thousands of instances to support numerous simultaneous simulations
New G2 instances add visualization capabilities
15. Big Data’s “3 V’s”
Working definition from Snijders et al.: “Data sets so large and complex that they become awkward to work with using standard tools and techniques”
• Volume (data collected; analysis & metadata creation) – pragmatic trend: from terabytes to petabytes & exabytes
• Velocity (data acquisition; analysis & action) – pragmatic trend: from batch to real-time & streaming
• Variety (data sources; data types; applications) – pragmatic trend: from structured to unstructured, semi-structured & structured
Implications & opportunities:
• Hardware and software optimization
• Architectural shifts: scale-out systems, distributed filesystems, tiered storage, Hadoop…
Key difference: the data structure does not need to be defined before loading
16. Data Sources and the Big Data Platform
[Diagram: data flows from manufacturing and field sources through the platform out to analytic consumers]
• Sources: slider, wafer, media, substrate, HGA, HDD, field data, supplier – all raw parametric, logistic, and vintage data
• Platform: parallelized batch analytics over raw extracts produce enriched, end-to-end integrated data
• Downstream stores: SAP/DWs, app-specific views, and a new unified EDW
• Consumers and tools: SAS, Compellon, or other predictive analytic tools; Tableau and other tools
• Use cases: optimize/reduce testing, failure screen tests, proactive drift identification, customer FA via field data, ad hoc analysis, new high-value parameters
17. Characteristics of a “Typical” Hadoop / Big Data Cluster
Hadoop handles large data volumes and reliability in the software tier:
• Hadoop distributes data across the cluster and uses replication to ensure data reliability and fault tolerance
• Each machine in a Hadoop cluster stores AND processes data, so machines must do both well
• Processing is sent directly to the machines storing the data
Hadoop MapReduce compute-bound operations and workloads:
• Clustering/classification
• Complex text mining
• Natural-language processing
• Feature extraction
Hadoop MapReduce I/O-bound operations and workloads:
• Indexing
• Grouping
• Data importing and exporting
• Data movement and transformation
Big data solutions must support a large variety of compute and I/O operations and storage needs… enter “the Cloud”
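To make the MapReduce split concrete, here is a minimal word-count job in the Hadoop Streaming style, a classic example of the I/O-bound grouping workloads listed above. It is a generic illustration, not HGST's code; Hadoop Streaming pipes records through the scripts via stdin/stdout.

# mapper.py – emits one (word, 1) pair per token read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py – Hadoop sorts map output by key, so all counts for a
# given word arrive contiguously and can be summed in one pass.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")

The job would typically be submitted with Hadoop's streaming jar, pointing its -mapper and -reducer options at these two scripts.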
18. AWS Big Data Platform Storage Services
• Amazon EBS – block storage for elastic computing; optimized for performance (SSD / 15K / 10K); highly virtualized / SAN-based
• Amazon S3 – “generic” object storage; the bulk of AWS storage today; virtualized or reserved use, server/network-based
• Amazon Glacier – cold/cool storage; the lowest-cost model for the least-used data; 3-5 hour latency / sequentialized
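The S3-to-Glacier tiering described above can be automated with a lifecycle rule. A hedged boto3 sketch follows; the bucket name, object key, and 90-day threshold are illustrative assumptions, not anything from the deck.

# Park bulk results in S3 and tier cold data down to Glacier automatically.
import boto3

s3 = boto3.client('s3')
bucket = 'hgst-hpc-results'  # hypothetical bucket name

# Upload a results archive to S3.
s3.upload_file('run42_output.tar.gz', bucket, 'runs/run42_output.tar.gz')

# Lifecycle rule: transition objects under runs/ to Glacier after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'cold-runs-to-glacier',
            'Filter': {'Prefix': 'runs/'},
            'Status': 'Enabled',
            'Transitions': [{'Days': 90, 'StorageClass': 'GLACIER'}],
        }]
    },
)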
19. HGST’s Other Amazon Use Cases/Capabilities
• Petabyte-scale data warehousing
• Storage “between Glacier & S3”
• Running data visualization tools in AWS
• A resource tracking tool, including a Tableau instance for reporting and visualization
More and more users are coming to IT asking how to leverage this new compute capability
20. We Are Just Starting with the Cloud
• Current results from a 6-month effort
• Re-aligning business group leadership
• Demands and use to grow and accelerate
Cloud + HGST IT = strong innovation and business partner
22. Introduction and Background
• IT Executive for The Aerospace Corporation (Aerospace)
• Manage HPC compute and cloud resources for the Aerospace corporation
• Career path has taken me through end-user support, system administration, and enterprise architecture
23. Agenda
• Who is Aerospace?
• High Performance Computing @ Aerospace
• Services Provided
• Cloud Motivation
• Where are we today?
• What makes this work?
• Challenges
• Lessons Learned
25. High Performance Computing @ Aerospace
• Allow engineers and scientists to focus on their discipline and research
• Reduce and eliminate complexity in using High Performance Computing (HPC) resources
• Supply and support centralized and networked HPC resources
27. Cloud Motivation
• Respond to an increasing and variable demand
• Improve resource deployments and use
• Enhance provisioning
• Improve security posture
• Improve disaster recovery posture
• Greener
28. Where are we today?
• Successfully established elastic clusters in AWS GovCloud
  – Workload runs include Monte Carlo and array simulations
• Key features of the GovCloud clusters are auto-scaling and on-demand computing
• Compute instances are created as needed to meet job computational requirements (a sizing sketch follows below)
• Making strides towards mimicking internal clusters in GovCloud
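"Created as needed" means deriving a cluster size from the queued work. The following toy heuristic is purely illustrative; every number and name in it is an assumption, not Aerospace's or Cycle's actual policy.

# Toy autoscaling heuristic: size the cluster so queued work drains
# in roughly target_makespan_hours. All parameters are illustrative.
import math

def desired_nodes(queued_core_hours: float,
                  cores_per_node: int = 16,
                  target_makespan_hours: float = 2.0,
                  max_nodes: int = 500) -> int:
    """Return how many nodes to run to finish queued work in the target window."""
    cores_needed = queued_core_hours / target_makespan_hours
    return min(max_nodes, math.ceil(cores_needed / cores_per_node))

print(desired_nodes(queued_core_hours=4800))  # -> 150 nodes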
29. What makes this work?
• AWS GovCloud
  – GovCloud is FedRAMP compliant
• Secure transport to and from Aerospace
  – VPC provides an additional layer of security while data is in transit
• Cycle Computing
  – Cycle provides cluster auto-scaling
30. Lessons Learned
• Enhanced analytics and business intelligence
• Customer success stories
• Standard images
• Demonstrated operational “agility”
31. Lessons Learned
• Domain space is dynamic
• Expertise required
• Layers of complexity
• Ensuring data security (in a hybrid deployment model)
32. Challenges
• Establishing a cloud storage infrastructure
• Determining appropriate bandwidth between Aerospace and GovCloud
• Library replication of internal systems
• System integration with internal authentication services
• Ensuring a seamless transition to hybrid services
33. What’s Next?
• Expand offerings
• Explore charge-back
• Explore “cloudifying” other HPC platforms
• Track technology
• Provide workload-specific ad-hoc offerings
• Provide surge capability for HPC resources
35. Novartis Institutes for BioMedical Research (NIBR)
• Unique research strategy driven by patient needs
• World-class research organization with about 6,000 scientists globally
• Intensifying focus on molecular pathways shared by various diseases
• Integration of clinical insights with mechanistic understanding of disease
• Research-to-development transition redefined through fast and rigorous “proof-of-concept” trials
• Strategic alliances with academia and biotech strengthen the preclinical pipeline
36. Accelerating the Science
Requirements:
• Large-scale computational chemistry simulation
• Results in under a week
• Ability to run multiple experiments “on-demand”
Challenges:
• Sustained access to 50,000+ compute cores
• Ability to monitor and re-launch jobs
• No additional capital expenditure
• Internal HPCC already running at capacity
Job profile:
• Embarrassingly parallel
• CPU-bound
• Low I/O, memory, and network requirements
[Graphic: virtual screening – a target molecule’s binding site (the “lock”) matched against compound molecules (the “keys”)]
37. The Cloud: Flexible Science on Flexible Infrastructure
Engineering the right infrastructure for a workload:
• The software runs the same job many times across instance types
• It measures the throughput and determines the $ per job
• Use the instances that provide the best scientific ROI
CC2 instances (Intel Xeon® ‘Sandy Bridge’) ran best for this workload; a sketch of the selection logic follows
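The benchmark-then-rank step reduces to a cost-per-job comparison. Here is a minimal sketch of that idea; the prices and throughput numbers are invented placeholders, not Novartis's measurements.

# Rank instance types by dollars per job from benchmark runs of the same job.
benchmarks = {
    # instance type: (hourly price in $, jobs completed per hour) – placeholders
    'cc2.8xlarge': (2.40, 120.0),
    'm3.2xlarge':  (0.90,  30.0),
    'c3.4xlarge':  (1.20,  55.0),
}

def cost_per_job(hourly_price: float, jobs_per_hour: float) -> float:
    return hourly_price / jobs_per_hour

for itype, (price, rate) in sorted(benchmarks.items(),
                                   key=lambda kv: cost_per_job(*kv[1])):
    print(f"{itype:12s} ${cost_per_job(price, rate):.4f}/job")

best = min(benchmarks, key=lambda t: cost_per_job(*benchmarks[t]))
print("best scientific ROI:", best)  # cc2.8xlarge wins with these numbers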
38. Super Computing in the Cloud
• Compute hours of science: 341,700 hours
• Compute days of science: 14,238 days
• Compute years of science: 39 years
• AWS instance count (CC2): 10,600 instances
• Infrastructure equivalent: $44 million
• 10 million compounds screened
• 39 drug-design years in 11 hours, for a cost of… $4,232
• 3 compounds identified and synthesized for screening
39. Key Learnings/What’s Next?
• The diversity of life sciences brings unique challenges
• Spend the time analyzing and tuning
• Flexibility, scalability, and performance
• Time to rethink and retool
• Challenge the science and the scientist
• Collaboration
Future plans:
• Chemical universe: 166 billion compounds (extreme-scale CPU)
• Next-generation sequencing in the cloud (extreme CPU, memory, I/O)
• “Disruptive” technologies – imaging (10x that of NGS!)
40. Using On-Demand and Spot Instances Together
• When task durations are longer than 1 hour or require multiple machines (MPI) for long periods, use on-demand
• Shorter workloads work great for Spot Instances
• If you want a guaranteed end time, use on-demand as well, so the architecture looks like…
41. CycleCloud Deploys Secured, Auto-scaled HPC Clusters
[Architecture diagram: a user scales from 150 to 150,000+ cores]
• CycleCloud checks the job load, calculates the ideal HPC cluster, and properly prices the Spot bids (load-based Spot bidding)
• On-Demand execute nodes provide a guaranteed finish
• Spot Instance execute nodes are auto-started and auto-stopped when the calculation is faster/cheaper; HPC orchestration handles Spot Instance bids and loss
• A shared filesystem bridges the legacy internal HPC cluster, with FS / S3 for storage
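The placement rule from slide 40 can be stated in a few lines of code. This is a paraphrase for illustration only, not Cycle's production logic.

# Slide 40's market-placement rule as a tiny function.
def choose_market(task_hours: float, is_mpi: bool,
                  needs_guaranteed_finish: bool) -> str:
    # Long tasks, multi-machine MPI runs, and hard deadlines go on-demand;
    # everything shorter rides the cheaper Spot market.
    if task_hours > 1.0 or is_mpi or needs_guaranteed_finish:
        return 'on-demand'
    return 'spot'

print(choose_market(0.5, is_mpi=False, needs_guaranteed_finish=False))  # spot
print(choose_market(6.0, is_mpi=True,  needs_guaranteed_finish=False))  # on-demand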
42. Other Production use cases
• Sequencing, genomics, life sciences
• MPI workloads for FEA, CFD, energy, utilities
• MATLAB and R applications for stats/modeling
• Windows HPC Server cluster for finance
• Heat transfer and other FEA
• Insurance risk management
• Rendering/VFX
43. Designing Solar Materials
The challenge is efficiency: we need to efficiently turn photons from the sun into electricity.
The number of possible materials is limitless:
• Need to separate the right compounds from the useless ones
• If the 20th century was the century of silicon, the 21st will be all organic
How do we find the right material out of 205,000 without spending the entire 21st century looking for it?
51. Question and Answer
How does utility HPC apply to your organization?
Follow us: @cyclecomputing, @jasonastowe
Come to Cycle’s booth: #1112
We’re hiring: jointheteam@cyclecomputing.com
52. Please give us your feedback on this presentation (BDT212)
As a thank you, we will select prize winners daily for completed surveys!