SlideShare a Scribd company logo
CORAL
Briefing on
CORAL-2 RFP
and
Draft Technical Requirements
Vendor Webinar Meeting
December 6, 2017
COLLABORATION
OAK RIDGE ARGONNE LIVERMORE
http://procurement.ornl.gov/rfp/CORAL2/
2
What is CORAL?
• CORAL is a Collaboration of Oak Ridge, Argonne, and
Lawrence Livermore Labs
• Began in 2012 to acquire three systems for delivery in 2017
with a single RFP
• This CORAL partnership and process was considered a big
success in the first acquisition
• So the partnership is being continued for the second round of
system acquisitions (called CORAL-2) with a single RFP for
delivery of a system in late 2021 and one or two systems in
late 2022.
3
LLNL
IBM/NVidia
P9/Volta
Classified
DOE HPC Facilities Systems
NERSC-9
Crossroads
Frontier
El Capitan
Pre-Exascale Systems Exascale Systems
Argonne
IBM BG/Q
Unclassified
Argonne
Intel/Cray KNL
Unclassified
ORNL
Cray/NVidia K20
Unclassified
LBNL
Cray/Intel Xeon/KNL
Unclassified
LBNL
TBD
Unclassified
LANL/SNL
TBD
Classified
Argonne
Intel/Cray TBD
Unclassified
ORNL
TBD
Unclassified
LLNL
TBD
Classified
LANL/SNL
Cray/Intel Xeon/KNL
Classified
2013 2016 2018 2020 2021-2022
Summit
Sierra
ORNL
IBM/NVidia
P9/Volta
Unclassified
LLNL
IBM BG/Q
Classified
Sequoia
CORI
A21
Trinity
Theta
Mira
Titan
4
RFP Will Request Two Proposals
• ANL is getting a 2021 Intel system (Aurora) outside this RFP
• The RFP requests proposals for an ORNL system delivered in
late 2021 (accepted in 2022) with the requirement that the
system must be diverse from the ALCF 2021 system.
• The RFP requests proposals for a LLNL system delivered in
late 2022 (accepted in 2023). There is no diversity requirement.
• ANL, as CORAL partner, has the potential to choose a system
from the same pool as LLNL .
• The reasons for just one RFP:
– Same RFP release time
– Same system requirements (based on ECP exascale definition)
– Only differences are facility descriptions and I/O subsystems
– Potential opportunity to share NRE costs
5
CORAL-2 RFP Time Line
• Vendors are free to talk with the DOE labs about the RFP until the
RFP is officially released (e.g. unachievable requirements)
• RFP Release goal is February 2018
• Responses due eight business weeks later (early April)
• Evaluations by Lab teams
• Selections made in May 2018
• Negotiations for NRE contracts begin (50 pg proposal)
– with goal of awards before October 2018
– one for each build contract (less if shared NRE)
• Negotiations for System Build Contracts begin
– with goal of awards by start of 2019
– Set Go/No Go date to change targets to firm requirements
6
LLNL El Capitan System goals and objectives
• Support the ongoing and future workload for
the NNSA Stockpile Stewardship Program.
– Life Extension Programs
– Demands from the Predictive Capability
Framework and the experimental campaigns
• Replace Sierra with a system to support:
– Hostile environment assessment
– Uncertainty quantification for certification
– Very large weapons’ science calculations
• Siting in 2022, acceptance in 2023
• Available to support growing mission needs
through 2027
7
ORNL Frontier System goals and objectives
• Provide the Leadership computing
capabilities needed for the DOE Office of
Science mission thru 2026
– Capabilities for INCITE, ECP, and ALCC
science projects
– Capabilities assessed from gathered user
requirements
– Increased need for data science and
deep learning
• Siting in 2021. Run in parallel with Summit
until Summit end-of-life (2023)
8
Technical requirements contents
Chapter 3 High Level System Requirements
Chapter 4 Application Benchmarks
Chapter 5 Compute Partition
Chapter 6 I/O Subsystem
Chapter 7 Interconnect
Chapter 8 OS, Middleware, System Resource Management
Chapter 9 Front-end Environment
Chapter 10 System Management and RAS
Chapter 11 Maintenance and Support
Chapter 12 Facility Requirements
Chapter 13 Project Management
Started with the successful RFP developed last time in CORAL
As before, chapter teams with experts from all three labs
worked on each chapter
9
Basic RFP Terminology Definitions
The technical requirements have these designations:
• Mandatory Requirement (MR): (only 3)
– Requirements that are essential to CORAL RFP
– Required to be responsive
• Mandatory Option (MO):
– Separately priced system options that may be exercised in individual build contracts
– Required to be responsive
• Technical Option (TO-1):
– System upgrade option that each laboratory may choose
– Not required to be responsive
• Target Requirement (TR):
– Performance features that are important to CORAL requirements
– Graded (1, 2 or 3) based on overall importance to CORAL
• 1 is essential; 2 is important 3 is desirable
– Not required to be responsive
10
Technical requirements guiding principles
• Minimize number of mandatory requirements and allow
consideration of widest range of architectural solutions.
• Focus on requiring science and throughput performance.
Avoid overly prescriptive explicit speeds and feeds.
• Agree on common technical requirements across all three
Laboratories – not three separate sets of requirements.
• Require vendors to describe available options to adjust system
size and configuration to meet individual site needs and/or
budgetary constraints.
11
SOW has three mandatory requirements
1. Description of the CORAL system
– Overall system architecture and details of the interconnect
– Detailed node architecture diagram with all data movement paths
2. Description of the software stack
– Describe all software components and license strategy for each
– Entire system software stack including system management and
program development
3. Description of high level project management
– Proposed multi-year collaboration plan
– Plan to deliver systems
– Perceived risks, risk management plan, and potential mitigations
– Prime’s ability to ensure the responsiveness of its partners to the
performance requirements
12
High level system requirements (1 of 4)
It is desirable for overall system to meet or exceed the following:
CORAL system peak (TR-1)
– Peak of at least 1300 PF (double precision floating point)
Reasons
• Why a peak requirement? Provides vendors a target to shoot for.
– Vendor may have to propose more to meet 50X benchmarks FOMs
– There is a technical option requirement to scale peak up or down.
• Providing 50X science and throughput over today’s largest DOE systems is a mission
need for ECP
– Given Titan and Sequoia performance today, 1300 PF was considered a minimum
size to achieve 50X science. 1 EF peak not expected to be enough
13
It is desirable for overall system to meet or exceed the following:
Total Memory (TR-1)
– Aggregate of at least 8 PB total system memory
– Must propose the same memory capacity and configuration as is used to obtain
the reported benchmark results
– Place a high value on solutions that provide mostly fast memory (e.g., HBM)
Reasons
• Science teams wanted to be sure system has enough memory to do the science
without prescribing a particular memory architecture or configuration.
• Study of exascale apps showed that at least 4-6 PB of HBM needed. Benchmarks set
up to require 5 PB.
• Data Science and Deep Learning apps often desire large memory per node.
• The vendors can propose a wide range of memory architectures. RFP does not
prescribe a particular architecture.
High level system requirements (2 of 4)
14
It is desirable for overall system to not exceed the following:
Maximum Power Consumption (TR-1)
– Cannot exceed 40 MW for the 2021 system or the 2022 system including all
peripherals and the file system.
– power consumption between 20-30 MW preferred
Reasons
• All three labs agreed that 40 MW would be the maximum requirement based on
– Available power at each facility for the new system
– Annual cost of power
– Evaluation of the RFI responses for expected power requirements for systems in
the 2021-2023 timeframe
• The 2022 proposals are expected to have more capability compared to the 2021
proposals for a given power consumption.
High level system requirements (3 of 4)
15
• Scale the system size (TO-1)
• Scale the system memory (TO-1)
• Scale the system interconnect (TO-1)
• Scale the system I/O (TO-1)
• Scale the integrated telemetry DB (TO-1)
• CORAL-scalable unit (MO)
• Options for mid-life upgrades (TO-1)
Because the different
CORAL sites may want
to procure different
system configurations
CORAL sites want to procure
smaller versions of the system
System does not have to be
upgradeable, but if it is, what
are the options and time-line
Require vendors to describe any available
options to adjust system
16
Early access to CORAL hardware/software (TR-1)
Throughout the long-term collaboration, Offeror will propose mechanisms
to provide the Laboratories with early access to hardware and software
technology for evaluation, feedback, and application testing
Reason To enhance the partnership nature of the vendor relationship
Two runtime variability requirements (TR-1)
Runtime variability <= 5% on a dedicated system running the benchmarks
Runtime variability <= 30% on a busy system with representative workload
Reason Reproducible performance from run to run is a highly desired
property by science teams. Values of 5% and 30% come from experience
with original CORAL RFP
Other notable requirements
17
Offeror will provide an integrated telemetry database (ITDB) that facilitates
collection and analysis of system management and monitoring data to help
understand utilization of all system resources, failure modes and trends, and to
correlate the various data sources. The database will be scalable and accessible
through a web interface and command-line interface.
• The database will provide rich search capabilities and advanced analytics that
enable authorized users to perform complex event correlation and problem
determination as well as to avoid outages proactively and to optimize system
operation and performance.
• Authentication and authorization mechanisms will enable subsets of data to be made
available to different groups of people (e.g., system administrators, operators, end-
users, and researchers).
Integrated Telemetry Database and
Analytics Infrastructure (TR-1)
New requirement that has arisen since the original CORAL
RFP. Sys admins, operators, tool developers, and end-users
all had input into this requirement.
18
2021 and 2022 IO Subsystems (TR-1)
The evolution of node local and system local storage has led to the replacement
of an optional storage system requirement in the original CORAL RFP with a
TR-1 requirement to include the entire I/O subsystem in the RFP response.
ORNL and LLNL use their file systems in different ways that require different I/O
subsystem requirements:
• The I/O subsystem for the 2021 system at ORNL will be the center-wide storage
system, providing a global POSIX namespace that is accessible by both the CN
partition as well as external resources in the facility;
• The I/O subsystem for the 2022 system at LLNL will not be a center-wide system but
will support two types of namespaces: a global namespace and transient
namespaces and will support two logical storage tiers, each optimized based on
their performance and data residency requirements.
The 2021 System’s I/O subsystem and the 2022 System’s I/O subsystem
requirements are described in separate sections to make it clear to the vendors
what to respond to in each of the two proposals.
19
Application performance requirements are
the highest priority to CORAL
An average “figure of merit” (FOM) improvement of
at least 50X for scalable science, throughput and data science apps
over today’s DOE systems.
– Reference machines are Sequoia and Titan
– The Offeror will provide actual, predicted and/or extrapolated performance results
for the proposed system for the following:
• CORAL-2 system performance (TR-1)
– Average FOM over five scalable science apps >= 50
– Average FOM over six throughput apps >= 50
– Average FOM over four data science and machine learning apps >= 50
– Raw results for nine skeleton apps
20
CORAL-2 benchmarking strategy
• Assuring that actual DOE applications perform well on
CORAL platforms is key to success
– Develop a benchmark suite that is representative of the
DOE workload
– Develop a benchmark suite that can evaluate future
technology options
– Provide flexibility to vendors for diverse architectures
• Full applications not ideal to achieve all these goals
– Millions of lines of code, may be MPI only, & perhaps
export controlled
– Limit benchmark suite to smaller applications, skeleton
benchmarks, and micro-benchmarks
21
CORAL-2 benchmark categories represent DOE
workloads and technical requirements
Benchmark Categories
• Scalable Science Benchmarks
– Expected to run at full scale of the CORAL-2
systems
• Throughput Benchmarks
– Represent large ensemble runs; may be subsets
of full applications; run many small to medium
sized jobs across full system
• Data Science & Deep Learning Benchmarks
– Represent emerging data intensive & deep
learning workloads
– Integer operations, instruction throughput,
indirect addressing, graph analysis
– ML algorithms, regression analysis, training
• Skeleton Benchmarks
– Investigate various platform characteristics
including network performance, threading
overheads, I/O, memory, memory hierarchies,
system software, and programming models
Scalable
Science
Through
put
Skeleton
Breadth
Complexity
Data
Sciences
and
Machine
Learning
Scalable
Science
Throughput Data
Sciences
&
Deep
Learning
22
CORAL-2 benchmarking suite uses
proxy-apps and a few larger applications
Scalable
Science Throughput
Data Science
Deep Learning Skeleton
QMCPACK
HACC
NEKbone
ACME
VPIC
Quicksilver
Kripke
LAMMPS
AMG
PENNANT
Laghos
Integer sort
Havoq
Big Data Suite
Deep Learning Suite
MPI suite
Memory suite
GEMM
Pynamic
CLOMP
ML suite
I/O suite
RAJA suite
Hash
All Benchmarks are Target Requirements (TR-1) in the RFP
CORAL-2 Benchmarks code and
procedures are available at
https://asc.llnl.gov/CORAL-2-benchmarks
23
CORAL-2 performance targets for scalable science
and throughput codes address key
application/workload requirements
50X Improvement on
Scalable Science
50X Improvement on
Throughput
Geometric Mean
Geometric Mean
QMCPACK
HACC
NEKbone
ACME
VPIC
Each run/projected at full machine scale
Recommend 5X larger problem and 10X faster
Each run/projected at 1/24th and 1/4th machine scale
Multiple jobs to fill machine
Recommend 2-10X larger problem and 5-25X faster
Quicksilver
Kripke
LAMMPS
AMG
PENNANT
Laghos
24
CORAL-2 system performance targets will be
projected for each benchmark category
Quicksilver
Kripke
AMG
Laghos
PENNANT
LAMMPS
Quicksilver
Kripke
AMG
Quicksilver
Kripke
AMG
Quicksilver
Kripke
AMG
Laghos Laghos Laghos
PENNANT PENNANT PENNANT
LAMMPS LAMMPS LAMMPS
Throughput Benchmarks
QMCPACK
HACC
NEKbone
ACME
VPIC
Scalable Science Benchmarks
(each run/projected at full machine scale)
Si = projected FOMi / baseline FOMi
Throughput 1
Throughput 2
Throughput 3
Throughput 4
𝑆𝑆24
𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 = �
𝑖𝑖=1
24
𝑆𝑆𝑖𝑖
1
24
𝑆𝑆4
𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 = �
𝑖𝑖=1
6
𝑆𝑆𝑖𝑖
1
6
𝑆𝑆𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 𝑠𝑠 𝑠𝑠𝑠𝑠𝑠𝑠 = �
𝑖𝑖=1
5
𝑆𝑆𝑖𝑖
1
5
Box represent full scale
Run mode#2: Ensembles @ 1/24th machine scale
Run mode#1: Capability @ 1/4th machine scale
25
CORAL-2 addresses emerging data centric
and deep learning workloads
Stressed features
• Integer operations
• Instruction throughput
• Indirect addressing
• Both full machine and single
node benchmarks
• Parallel runs across entire
target platform
Exercised capabilities
• Interconnect
• Entire memory hierarchy
• Irregular access patterns
50X Improvement on
Data Science &
Deep Learning
Geometric Mean
Integer Sort
Havoq
Big Data Suite
Deep Learning Suite
26
CORAL-2 benchmarking includes emerging Data Science,
Deep Learning, & Big Data Analytic workloads
• Data Science
– Parallel Integer Sort + Havoq (Graph Analytics) Triangle counting
• Deep Learning Benchmark suite contains:
– Convolutional Neural Networks (CNNs) that comprise convolutional layers followed by fully
connected layers;
– Long short-term memory (LSTM) recurrent neural network (RNN) architecture that
remembers values over arbitrary intervals to deal with temporal and time-series prediction;
– Distributed training code for classification in ImageNet data set at scale
• Big Data Benchmark suite contains:
– Subset of CANDLE benchmark codes that implement deep learning architectures that are
relevant to problems in cancer.
– Address problems at different biological scales, specifically problems at the molecular,
cellular and population scales
– Two diverse benchmark problems, namely
• P1B1, a sparse autoencoder to compress the expression profile into a low-dimensional vector
• P3B1 a multi-task deep neural net for data extraction from clinical reports.
27
CORAL-2 system performance targets will be
projected for each benchmark category
Big Data
Big Data
Deep Learning
Big Data
Deep Learning
Deep Learning
Big Data
Big Data
Deep Learning
Big Data
Big Data
Deep Learning
Big Data
Big Data
Deep Learning
Big Data Big Data Big Data
Deep Learning Deep Learning Deep Learning
Deep Learning Deep Learning Deep Learning
Big Data and Deep Learning
Integer Sort
Havoq
Data/Graph Analytics
Si = projected FOMi / baseline FOMi
𝑆𝑆𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑 = �
𝑖𝑖=1
4
𝑆𝑆𝑖𝑖
1
4
Box represent full scale
Run mode: Small Ensembles
Run mode: Full machine scale
28
Rules for CORAL-2 benchmark procedures
• Define a variety of problem sizes:
− Small is single node
− Medium is ~1K nodes
− Large is reference problem on current
platforms
− CORAL-2 class problems are:
− SS: 5X or larger + 10X faster
− TP: 2-10X larger + 5-25X faster
− Require 5 PB of memory
• Follow CORAL-2 code modification
guidelines
− Baseline and optimized results
allowed
• Offeror estimates sustained performance
for science, throughput, & data science
deep learning benchmarks
− CORAL-2 class problem sizes
29
Allowed code modification to CORAL-2
benchmarks
• Benchmarks may be modified as necessary to get them to
compile and run
– Portability changes for programming models are allowed
• A full set of benchmark runs must be reported with this
“baseline” source code
– Can include non-intrusive and/or portable optimizations
• E.g., compiler flags and standard pragma-style guidance
– Can include anticipated changes to system software
• E.g., MPI and OpenMP runtime improvements
• At least 1 GB per MPI task and use threading within each
task if necessary to utilize all compute resources
– Requirement tied directly to current CORAL codes and
production usage
30
Allowed code modifications
for optimized results
• Offeror may also report optimized results
– Any and all code modifications are allowed
– However, wholesale algorithmic changes that are strongly architecture
specific have less value
– All benchmark code modification will be documented and provided to
CORAL-2
• CORAL-2 and Offeror will continue to improve the efficiency and
scalability of all benchmarks between award of the contracts and
delivery of the systems
– Emphasis on higher level optimizations as well as compiler
optimization technology improvements while maintaining readable and
maintainable code
31
Questions?
http://procurement.ornl.gov/rfp/CORAL2/
ORNL Contract Administrator Contact Information:
William (Willy) Besancenez
Procurement Officer
Acquisition Management Services Division
Oak Ridge National Laboratory
Phone - (865) 576-1538
E-mail - besancenezwr@ornl.gov

More Related Content

What's hot

Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan EwenAdvanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
confluent
 
Nagarekha_Software_Testing_5.6 Yrs
Nagarekha_Software_Testing_5.6 YrsNagarekha_Software_Testing_5.6 Yrs
Nagarekha_Software_Testing_5.6 Yrs
Nagarekha Krishnappa
 

What's hot (20)

Tapping into the core
Tapping into the coreTapping into the core
Tapping into the core
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
 
Mainframe - OPC
Mainframe -  OPCMainframe -  OPC
Mainframe - OPC
 
Android Operating System
Android Operating SystemAndroid Operating System
Android Operating System
 
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan EwenAdvanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
 
RDF Stream Processing Tutorial: RSP implementations
RDF Stream Processing Tutorial: RSP implementationsRDF Stream Processing Tutorial: RSP implementations
RDF Stream Processing Tutorial: RSP implementations
 
Monitoring Kafka without instrumentation using eBPF with Antón Rodríguez | Ka...
Monitoring Kafka without instrumentation using eBPF with Antón Rodríguez | Ka...Monitoring Kafka without instrumentation using eBPF with Antón Rodríguez | Ka...
Monitoring Kafka without instrumentation using eBPF with Antón Rodríguez | Ka...
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDB
 
Storage tiering and erasure coding in Ceph (SCaLE13x)
Storage tiering and erasure coding in Ceph (SCaLE13x)Storage tiering and erasure coding in Ceph (SCaLE13x)
Storage tiering and erasure coding in Ceph (SCaLE13x)
 
RHCSA EX200 - Summary
RHCSA EX200 - SummaryRHCSA EX200 - Summary
RHCSA EX200 - Summary
 
Research on Comparative Study of Different Mobile Operating System_Part-2
Research on Comparative Study of Different Mobile Operating System_Part-2Research on Comparative Study of Different Mobile Operating System_Part-2
Research on Comparative Study of Different Mobile Operating System_Part-2
 
The State of libfabric in Open MPI
The State of libfabric in Open MPIThe State of libfabric in Open MPI
The State of libfabric in Open MPI
 
Ceph Object Storage Performance Secrets and Ceph Data Lake Solution
Ceph Object Storage Performance Secrets and Ceph Data Lake SolutionCeph Object Storage Performance Secrets and Ceph Data Lake Solution
Ceph Object Storage Performance Secrets and Ceph Data Lake Solution
 
Android Presentation
Android Presentation Android Presentation
Android Presentation
 
Apache Kafka® Security Overview
Apache Kafka® Security OverviewApache Kafka® Security Overview
Apache Kafka® Security Overview
 
OpenDaylight app development tutorial
OpenDaylight app development tutorialOpenDaylight app development tutorial
OpenDaylight app development tutorial
 
Nagarekha_Software_Testing_5.6 Yrs
Nagarekha_Software_Testing_5.6 YrsNagarekha_Software_Testing_5.6 Yrs
Nagarekha_Software_Testing_5.6 Yrs
 
XML Schema.pdf
XML Schema.pdfXML Schema.pdf
XML Schema.pdf
 
Android with kotlin course
Android with kotlin courseAndroid with kotlin course
Android with kotlin course
 
OFI Overview 2019 Webinar
OFI Overview 2019 WebinarOFI Overview 2019 Webinar
OFI Overview 2019 Webinar
 

Similar to CORAL-2 Exascale Computing RFP and Draft Technical Requirements

Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...
confluent
 
Cse viii-advanced-computer-architectures-06cs81-solution
Cse viii-advanced-computer-architectures-06cs81-solutionCse viii-advanced-computer-architectures-06cs81-solution
Cse viii-advanced-computer-architectures-06cs81-solution
Shobha Kumar
 
Task allocation on many core-multi processor distributed system
Task allocation on many core-multi processor distributed systemTask allocation on many core-multi processor distributed system
Task allocation on many core-multi processor distributed system
Deepak Shankar
 
TRL technology readiness level of p.pptx
TRL technology readiness level of p.pptxTRL technology readiness level of p.pptx
TRL technology readiness level of p.pptx
anitapansare1
 

Similar to CORAL-2 Exascale Computing RFP and Draft Technical Requirements (20)

Update on Trinity System Procurement and Plans
Update on Trinity System Procurement and PlansUpdate on Trinity System Procurement and Plans
Update on Trinity System Procurement and Plans
 
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...
 
Development Trends of Next-Generation Supercomputers
Development Trends of Next-Generation SupercomputersDevelopment Trends of Next-Generation Supercomputers
Development Trends of Next-Generation Supercomputers
 
05 Preparing for Extreme Geterogeneity in HPC
05 Preparing for Extreme Geterogeneity in HPC05 Preparing for Extreme Geterogeneity in HPC
05 Preparing for Extreme Geterogeneity in HPC
 
The Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a MissionThe Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a Mission
 
RFP Briefing_Meralco EDW & BI Project v2.0.pptx
RFP Briefing_Meralco EDW & BI Project v2.0.pptxRFP Briefing_Meralco EDW & BI Project v2.0.pptx
RFP Briefing_Meralco EDW & BI Project v2.0.pptx
 
The U.S. Exascale Computing Project: Status and Plans
The U.S. Exascale Computing Project: Status and PlansThe U.S. Exascale Computing Project: Status and Plans
The U.S. Exascale Computing Project: Status and Plans
 
Getting to the Edge – Exploring 4G/5G Cloud-RAN Deployable Solutions
Getting to the Edge – Exploring 4G/5G Cloud-RAN Deployable SolutionsGetting to the Edge – Exploring 4G/5G Cloud-RAN Deployable Solutions
Getting to the Edge – Exploring 4G/5G Cloud-RAN Deployable Solutions
 
Cse viii-advanced-computer-architectures-06cs81-solution
Cse viii-advanced-computer-architectures-06cs81-solutionCse viii-advanced-computer-architectures-06cs81-solution
Cse viii-advanced-computer-architectures-06cs81-solution
 
Raghu nambiar:industry standard benchmarks
Raghu nambiar:industry standard benchmarksRaghu nambiar:industry standard benchmarks
Raghu nambiar:industry standard benchmarks
 
Oslc case study (poc results) v1.1
Oslc case study (poc results) v1.1Oslc case study (poc results) v1.1
Oslc case study (poc results) v1.1
 
Task allocation on many core-multi processor distributed system
Task allocation on many core-multi processor distributed systemTask allocation on many core-multi processor distributed system
Task allocation on many core-multi processor distributed system
 
TRL technology readiness level of p.pptx
TRL technology readiness level of p.pptxTRL technology readiness level of p.pptx
TRL technology readiness level of p.pptx
 
Automated Discovery of Performance Regressions in Enterprise Applications
Automated Discovery of Performance Regressions in Enterprise ApplicationsAutomated Discovery of Performance Regressions in Enterprise Applications
Automated Discovery of Performance Regressions in Enterprise Applications
 
Interoperability
InteroperabilityInteroperability
Interoperability
 
Accelerated development in Automotive E/E Systems using VisualSim Architect
Accelerated development in Automotive E/E Systems using VisualSim ArchitectAccelerated development in Automotive E/E Systems using VisualSim Architect
Accelerated development in Automotive E/E Systems using VisualSim Architect
 
A Collaborative Research Proposal To The NSF Research Accelerator For Multip...
A Collaborative Research Proposal To The NSF  Research Accelerator For Multip...A Collaborative Research Proposal To The NSF  Research Accelerator For Multip...
A Collaborative Research Proposal To The NSF Research Accelerator For Multip...
 
REEJA_CV1
REEJA_CV1REEJA_CV1
REEJA_CV1
 
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsImproving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
 
OpenPOWER Webinar
OpenPOWER Webinar OpenPOWER Webinar
OpenPOWER Webinar
 

More from inside-BigData.com

Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
inside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
inside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
inside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
inside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
inside-BigData.com
 

More from inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Recently uploaded

Recently uploaded (20)

De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 

CORAL-2 Exascale Computing RFP and Draft Technical Requirements

  • 1. CORAL Briefing on CORAL-2 RFP and Draft Technical Requirements Vendor Webinar Meeting December 6, 2017 COLLABORATION OAK RIDGE ARGONNE LIVERMORE http://procurement.ornl.gov/rfp/CORAL2/
  • 2. 2 What is CORAL? • CORAL is a Collaboration of Oak Ridge, Argonne, and Lawrence Livermore Labs • Began in 2012 to acquire three systems for delivery in 2017 with a single RFP • This CORAL partnership and process was considered a big success in the first acquisition • So the partnership is being continued for the second round of system acquisitions (called CORAL-2) with a single RFP for delivery of a system in late 2021 and one or two systems in late 2022.
  • 3. 3 LLNL IBM/NVidia P9/Volta Classified DOE HPC Facilities Systems NERSC-9 Crossroads Frontier El Capitan Pre-Exascale Systems Exascale Systems Argonne IBM BG/Q Unclassified Argonne Intel/Cray KNL Unclassified ORNL Cray/NVidia K20 Unclassified LBNL Cray/Intel Xeon/KNL Unclassified LBNL TBD Unclassified LANL/SNL TBD Classified Argonne Intel/Cray TBD Unclassified ORNL TBD Unclassified LLNL TBD Classified LANL/SNL Cray/Intel Xeon/KNL Classified 2013 2016 2018 2020 2021-2022 Summit Sierra ORNL IBM/NVidia P9/Volta Unclassified LLNL IBM BG/Q Classified Sequoia CORI A21 Trinity Theta Mira Titan
  • 4. 4 RFP Will Request Two Proposals • ANL is getting a 2021 Intel system (Aurora) outside this RFP • The RFP requests proposals for an ORNL system delivered in late 2021 (accepted in 2022) with the requirement that the system must be diverse from the ALCF 2021 system. • The RFP requests proposals for a LLNL system delivered in late 2022 (accepted in 2023). There is no diversity requirement. • ANL, as CORAL partner, has the potential to choose a system from the same pool as LLNL . • The reasons for just one RFP: – Same RFP release time – Same system requirements (based on ECP exascale definition) – Only differences are facility descriptions and I/O subsystems – Potential opportunity to share NRE costs
  • 5. 5 CORAL-2 RFP Time Line • Vendors are free to talk with the DOE labs about the RFP until the RFP is officially released (e.g. unachievable requirements) • RFP Release goal is February 2018 • Responses due eight business weeks later (early April) • Evaluations by Lab teams • Selections made in May 2018 • Negotiations for NRE contracts begin (50 pg proposal) – with goal of awards before October 2018 – one for each build contract (less if shared NRE) • Negotiations for System Build Contracts begin – with goal of awards by start of 2019 – Set Go/No Go date to change targets to firm requirements
  • 6. 6 LLNL El Capitan System goals and objectives • Support the ongoing and future workload for the NNSA Stockpile Stewardship Program. – Life Extension Programs – Demands from the Predictive Capability Framework and the experimental campaigns • Replace Sierra with a system to support: – Hostile environment assessment – Uncertainty quantification for certification – Very large weapons’ science calculations • Siting in 2022, acceptance in 2023 • Available to support growing mission needs through 2027
  • 7. 7 ORNL Frontier System goals and objectives • Provide the Leadership computing capabilities needed for the DOE Office of Science mission thru 2026 – Capabilities for INCITE, ECP, and ALCC science projects – Capabilities assessed from gathered user requirements – Increased need for data science and deep learning • Siting in 2021. Run in parallel with Summit until Summit end-of-life (2023)
  • 8. 8 Technical requirements contents Chapter 3 High Level System Requirements Chapter 4 Application Benchmarks Chapter 5 Compute Partition Chapter 6 I/O Subsystem Chapter 7 Interconnect Chapter 8 OS, Middleware, System Resource Management Chapter 9 Front-end Environment Chapter 10 System Management and RAS Chapter 11 Maintenance and Support Chapter 12 Facility Requirements Chapter 13 Project Management Started with the successful RFP developed last time in CORAL As before, chapter teams with experts from all three labs worked on each chapter
  • 9. 9 Basic RFP Terminology Definitions The technical requirements have these designations: • Mandatory Requirement (MR): (only 3) – Requirements that are essential to CORAL RFP – Required to be responsive • Mandatory Option (MO): – Separately priced system options that may be exercised in individual build contracts – Required to be responsive • Technical Option (TO-1): – System upgrade option that each laboratory may choose – Not required to be responsive • Target Requirement (TR): – Performance features that are important to CORAL requirements – Graded (1, 2 or 3) based on overall importance to CORAL • 1 is essential; 2 is important 3 is desirable – Not required to be responsive
  • 10. 10 Technical requirements guiding principles • Minimize number of mandatory requirements and allow consideration of widest range of architectural solutions. • Focus on requiring science and throughput performance. Avoid overly prescriptive explicit speeds and feeds. • Agree on common technical requirements across all three Laboratories – not three separate sets of requirements. • Require vendors to describe available options to adjust system size and configuration to meet individual site needs and/or budgetary constraints.
  • 11. 11 SOW has three mandatory requirements 1. Description of the CORAL system – Overall system architecture and details of the interconnect – Detailed node architecture diagram with all data movement paths 2. Description of the software stack – Describe all software components and license strategy for each – Entire system software stack including system management and program development 3. Description of high level project management – Proposed multi-year collaboration plan – Plan to deliver systems – Perceived risks, risk management plan, and potential mitigations – Prime’s ability to ensure the responsiveness of its partners to the performance requirements
  • 12. 12 High level system requirements (1 of 4) It is desirable for overall system to meet or exceed the following: CORAL system peak (TR-1) – Peak of at least 1300 PF (double precision floating point) Reasons • Why a peak requirement? Provides vendors a target to shoot for. – Vendor may have to propose more to meet 50X benchmarks FOMs – There is a technical option requirement to scale peak up or down. • Providing 50X science and throughput over today’s largest DOE systems is a mission need for ECP – Given Titan and Sequoia performance today, 1300 PF was considered a minimum size to achieve 50X science. 1 EF peak not expected to be enough
  • 13. 13 It is desirable for overall system to meet or exceed the following: Total Memory (TR-1) – Aggregate of at least 8 PB total system memory – Must propose the same memory capacity and configuration as is used to obtain the reported benchmark results – Place a high value on solutions that provide mostly fast memory (e.g., HBM) Reasons • Science teams wanted to be sure system has enough memory to do the science without prescribing a particular memory architecture or configuration. • Study of exascale apps showed that at least 4-6 PB of HBM needed. Benchmarks set up to require 5 PB. • Data Science and Deep Learning apps often desire large memory per node. • The vendors can propose a wide range of memory architectures. RFP does not prescribe a particular architecture. High level system requirements (2 of 4)
  • 14. 14 It is desirable for overall system to not exceed the following: Maximum Power Consumption (TR-1) – Cannot exceed 40 MW for the 2021 system or the 2022 system including all peripherals and the file system. – power consumption between 20-30 MW preferred Reasons • All three labs agreed that 40 MW would be the maximum requirement based on – Available power at each facility for the new system – Annual cost of power – Evaluation of the RFI responses for expected power requirements for systems in the 2021-2023 timeframe • The 2022 proposals are expected to have more capability compared to the 2021 proposals for a given power consumption. High level system requirements (3 of 4)
  • 15. 15 • Scale the system size (TO-1) • Scale the system memory (TO-1) • Scale the system interconnect (TO-1) • Scale the system I/O (TO-1) • Scale the integrated telemetry DB (TO-1) • CORAL-scalable unit (MO) • Options for mid-life upgrades (TO-1) Because the different CORAL sites may want to procure different system configurations CORAL sites want to procure smaller versions of the system System does not have to be upgradeable, but if it is, what are the options and time-line Require vendors to describe any available options to adjust system
  • 16. 16 Early access to CORAL hardware/software (TR-1) Throughout the long-term collaboration, Offeror will propose mechanisms to provide the Laboratories with early access to hardware and software technology for evaluation, feedback, and application testing Reason To enhance the partnership nature of the vendor relationship Two runtime variability requirements (TR-1) Runtime variability <= 5% on a dedicated system running the benchmarks Runtime variability <= 30% on a busy system with representative workload Reason Reproducible performance from run to run is a highly desired property by science teams. Values of 5% and 30% come from experience with original CORAL RFP Other notable requirements
  • 17. 17 Offeror will provide an integrated telemetry database (ITDB) that facilitates collection and analysis of system management and monitoring data to help understand utilization of all system resources, failure modes and trends, and to correlate the various data sources. The database will be scalable and accessible through a web interface and command-line interface. • The database will provide rich search capabilities and advanced analytics that enable authorized users to perform complex event correlation and problem determination as well as to avoid outages proactively and to optimize system operation and performance. • Authentication and authorization mechanisms will enable subsets of data to be made available to different groups of people (e.g., system administrators, operators, end- users, and researchers). Integrated Telemetry Database and Analytics Infrastructure (TR-1) New requirement that has arisen since the original CORAL RFP. Sys admins, operators, tool developers, and end-users all had input into this requirement.
  • 18. 18 2021 and 2022 IO Subsystems (TR-1) The evolution of node local and system local storage has led to the replacement of an optional storage system requirement in the original CORAL RFP with a TR-1 requirement to include the entire I/O subsystem in the RFP response. ORNL and LLNL use their file systems in different ways that require different I/O subsystem requirements: • The I/O subsystem for the 2021 system at ORNL will be the center-wide storage system, providing a global POSIX namespace that is accessible by both the CN partition as well as external resources in the facility; • The I/O subsystem for the 2022 system at LLNL will not be a center-wide system but will support two types of namespaces: a global namespace and transient namespaces and will support two logical storage tiers, each optimized based on their performance and data residency requirements. The 2021 System’s I/O subsystem and the 2022 System’s I/O subsystem requirements are described in separate sections to make it clear to the vendors what to respond to in each of the two proposals.
  • 19. 19 Application performance requirements are the highest priority to CORAL An average “figure of merit” (FOM) improvement of at least 50X for scalable science, throughput and data science apps over today’s DOE systems. – Reference machines are Sequoia and Titan – The Offeror will provide actual, predicted and/or extrapolated performance results for the proposed system for the following: • CORAL-2 system performance (TR-1) – Average FOM over five scalable science apps >= 50 – Average FOM over six throughput apps >= 50 – Average FOM over four data science and machine learning apps >= 50 – Raw results for nine skeleton apps
  • 20. 20 CORAL-2 benchmarking strategy • Assuring that actual DOE applications perform well on CORAL platforms is key to success – Develop a benchmark suite that is representative of the DOE workload – Develop a benchmark suite that can evaluate future technology options – Provide flexibility to vendors for diverse architectures • Full applications not ideal to achieve all these goals – Millions of lines of code, may be MPI only, & perhaps export controlled – Limit benchmark suite to smaller applications, skeleton benchmarks, and micro-benchmarks
  • 21. 21 CORAL-2 benchmark categories represent DOE workloads and technical requirements Benchmark Categories • Scalable Science Benchmarks – Expected to run at full scale of the CORAL-2 systems • Throughput Benchmarks – Represent large ensemble runs; may be subsets of full applications; run many small to medium sized jobs across full system • Data Science & Deep Learning Benchmarks – Represent emerging data intensive & deep learning workloads – Integer operations, instruction throughput, indirect addressing, graph analysis – ML algorithms, regression analysis, training • Skeleton Benchmarks – Investigate various platform characteristics including network performance, threading overheads, I/O, memory, memory hierarchies, system software, and programming models Scalable Science Through put Skeleton Breadth Complexity Data Sciences and Machine Learning Scalable Science Throughput Data Sciences & Deep Learning
  • 22. 22 CORAL-2 benchmarking suite uses proxy-apps and a few larger applications Scalable Science Throughput Data Science Deep Learning Skeleton QMCPACK HACC NEKbone ACME VPIC Quicksilver Kripke LAMMPS AMG PENNANT Laghos Integer sort Havoq Big Data Suite Deep Learning Suite MPI suite Memory suite GEMM Pynamic CLOMP ML suite I/O suite RAJA suite Hash All Benchmarks are Target Requirements (TR-1) in the RFP CORAL-2 Benchmarks code and procedures are available at https://asc.llnl.gov/CORAL-2-benchmarks
  • 23. 23 CORAL-2 performance targets for scalable science and throughput codes address key application/workload requirements 50X Improvement on Scalable Science 50X Improvement on Throughput Geometric Mean Geometric Mean QMCPACK HACC NEKbone ACME VPIC Each run/projected at full machine scale Recommend 5X larger problem and 10X faster Each run/projected at 1/24th and 1/4th machine scale Multiple jobs to fill machine Recommend 2-10X larger problem and 5-25X faster Quicksilver Kripke LAMMPS AMG PENNANT Laghos
  • 24. 24 CORAL-2 system performance targets will be projected for each benchmark category Quicksilver Kripke AMG Laghos PENNANT LAMMPS Quicksilver Kripke AMG Quicksilver Kripke AMG Quicksilver Kripke AMG Laghos Laghos Laghos PENNANT PENNANT PENNANT LAMMPS LAMMPS LAMMPS Throughput Benchmarks QMCPACK HACC NEKbone ACME VPIC Scalable Science Benchmarks (each run/projected at full machine scale) Si = projected FOMi / baseline FOMi Throughput 1 Throughput 2 Throughput 3 Throughput 4 𝑆𝑆24 𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 = � 𝑖𝑖=1 24 𝑆𝑆𝑖𝑖 1 24 𝑆𝑆4 𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 = � 𝑖𝑖=1 6 𝑆𝑆𝑖𝑖 1 6 𝑆𝑆𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 𝑠𝑠 𝑠𝑠𝑠𝑠𝑠𝑠 = � 𝑖𝑖=1 5 𝑆𝑆𝑖𝑖 1 5 Box represent full scale Run mode#2: Ensembles @ 1/24th machine scale Run mode#1: Capability @ 1/4th machine scale
  • 25. 25 CORAL-2 addresses emerging data centric and deep learning workloads Stressed features • Integer operations • Instruction throughput • Indirect addressing • Both full machine and single node benchmarks • Parallel runs across entire target platform Exercised capabilities • Interconnect • Entire memory hierarchy • Irregular access patterns 50X Improvement on Data Science & Deep Learning Geometric Mean Integer Sort Havoq Big Data Suite Deep Learning Suite
  • 26. 26 CORAL-2 benchmarking includes emerging Data Science, Deep Learning, & Big Data Analytic workloads • Data Science – Parallel Integer Sort + Havoq (Graph Analytics) Triangle counting • Deep Learning Benchmark suite contains: – Convolutional Neural Networks (CNNs) that comprise convolutional layers followed by fully connected layers; – Long short-term memory (LSTM) recurrent neural network (RNN) architecture that remembers values over arbitrary intervals to deal with temporal and time-series prediction; – Distributed training code for classification in ImageNet data set at scale • Big Data Benchmark suite contains: – Subset of CANDLE benchmark codes that implement deep learning architectures that are relevant to problems in cancer. – Address problems at different biological scales, specifically problems at the molecular, cellular and population scales – Two diverse benchmark problems, namely • P1B1, a sparse autoencoder to compress the expression profile into a low-dimensional vector • P3B1 a multi-task deep neural net for data extraction from clinical reports.
  • 27. 27 CORAL-2 system performance targets will be projected for each benchmark category Big Data Big Data Deep Learning Big Data Deep Learning Deep Learning Big Data Big Data Deep Learning Big Data Big Data Deep Learning Big Data Big Data Deep Learning Big Data Big Data Big Data Deep Learning Deep Learning Deep Learning Deep Learning Deep Learning Deep Learning Big Data and Deep Learning Integer Sort Havoq Data/Graph Analytics Si = projected FOMi / baseline FOMi 𝑆𝑆𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑 = � 𝑖𝑖=1 4 𝑆𝑆𝑖𝑖 1 4 Box represent full scale Run mode: Small Ensembles Run mode: Full machine scale
  • 28. 28 Rules for CORAL-2 benchmark procedures • Define a variety of problem sizes: − Small is single node − Medium is ~1K nodes − Large is reference problem on current platforms − CORAL-2 class problems are: − SS: 5X or larger + 10X faster − TP: 2-10X larger + 5-25X faster − Require 5 PB of memory • Follow CORAL-2 code modification guidelines − Baseline and optimized results allowed • Offeror estimates sustained performance for science, throughput, & data science deep learning benchmarks − CORAL-2 class problem sizes
  • 29. 29 Allowed code modification to CORAL-2 benchmarks • Benchmarks may be modified as necessary to get them to compile and run – Portability changes for programming models are allowed • A full set of benchmark runs must be reported with this “baseline” source code – Can include non-intrusive and/or portable optimizations • E.g., compiler flags and standard pragma-style guidance – Can include anticipated changes to system software • E.g., MPI and OpenMP runtime improvements • At least 1 GB per MPI task and use threading within each task if necessary to utilize all compute resources – Requirement tied directly to current CORAL codes and production usage
  • 30. 30 Allowed code modifications for optimized results • Offeror may also report optimized results – Any and all code modifications are allowed – However, wholesale algorithmic changes that are strongly architecture specific have less value – All benchmark code modification will be documented and provided to CORAL-2 • CORAL-2 and Offeror will continue to improve the efficiency and scalability of all benchmarks between award of the contracts and delivery of the systems – Emphasis on higher level optimizations as well as compiler optimization technology improvements while maintaining readable and maintainable code
  • 31. 31 Questions? http://procurement.ornl.gov/rfp/CORAL2/ ORNL Contract Administrator Contact Information: William (Willy) Besancenez Procurement Officer Acquisition Management Services Division Oak Ridge National Laboratory Phone - (865) 576-1538 E-mail - besancenezwr@ornl.gov