Mauricio breteernitiz hpc-exascale-iscte

High Performance Computing at Exascale:
Application Requirements and Technology Development
Mauricio Breternitz, Ph.D.
05 July 2017

Outline
1. Brief Introduction
2. High Performance Computing and Key Application Requirements
3. Supercomputing Systems Under the Hood
4. Technological Challenges and Path to Exascale
5. Exascale Development Program

Brief Bio, Publications, Patents
Work: IBM Research, Motorola, Times N, Intel Labs, AMD Research, ISCTE
Education: Ph.D. – Carnegie Mellon, ECE
M.Sc. – UNICAMP, Computer Science
E.Eng. – ITA, Brazil (Honors)
Area(s): Computer Architecture, Computer Systems, Performance, Tuning, Big Data, Machine Learning
Patents: 46 U.S. Patents Issued, 50 U.S. Patents Pending
Publications: 1010 citations, h-index 16, i10-index 31, citations 885
Service: Creator /General Chair: International Workshop on
Architectural/Microarchitectural Support for Binary Translation,
joint with and CGO.
Chair, ICCD
PC: CGO
Academic: Guest Lecturer – U.Texas/Austin, CMU, UNICAMP
Collaboration: U.Texas/Austin, Rice University, CMU, UNICAMP, Edinburgh

Timeline: Education, Professional, Personal

Brief Introduction
1. High Performance Computing - Key Application Requirements
3. Technological Challenges and Path to Exascale

Exascale Research Areas
credit: Exascale Computing Project, 2017

Key Application Requirements
• Experiments: impossible, dangerous, costly
• Vastly more accurate predictive models +
• Analysis of vast quantities of data
IMPROVE
• Regional climate
• Carbon footprint
• Nuclear efficiency, safety
• Renewable energy
• Nuclear stockpile safety
• Human Brain
• Advanced Materials
Report on Exascale Computing
Summary Report of the Advanced Scientific Computing Advisory Committee (ASCAC) Subcommittee, U.S. Dept. of Energy, Fall 2010
https://science.energy.gov/~/media/ascr/ascac/pdf/reports/Exascale_subcommittee_report.pdf

Exascale Application Areas
credit: Paul Messina, Exascale Computing Project, 2017

Exascale Challenges

Technology Progress

Achieving Exascale
Credit: Paul Messina, Exascale Computing Project, 2017

Parallelism
• Inter-Node Parallelism
• Intra-Node Parallelism
  • Vectors
  • Pipelining
  • Multiple Cores
• Memory Organization
  • Shared Memory
  • Distributed Memory
  • Locality
• Programming for Parallelism
  • Intra-Node:
    • Vectorization
    • OpenMP
  • Inter-Node:
    • MPI
• I/O

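Programming Parallelism
• Intra-Node: Multiple cores -> Multiple threads
  • Memory: shared
  • Synchronization
  • OpenMP
• Inter-Node: Multiple CPUs -> Multiple processes
  • Memory: distributed
  • Communication
  • MPI

OpenMP example (intra-node):

    /* The loop iterations are divided among the threads of one node. */
    void simple(int n, float *a, float *b) {
        int i;
        #pragma omp parallel for
        for (i = 0; i < n; i++) {
            b[i] = a[i] * 3.1416f;
        }
    }

MPI example (inter-node):

    /* Rank 0 sends one integer to rank 1; run with at least two processes. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int my_rank, number;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        if (my_rank == 0) {
            number = -1;
            MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (my_rank == 1) {
            MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Process 1 received number %d from process 0\n", number);
        }
        MPI_Finalize();
        return 0;
    }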
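
The inter-node model leaves open how the work is divided among processes. A minimal sketch of one common choice, a contiguous block decomposition of n loop iterations over the ranks, is shown below. The name block_range and the policy of giving the first n % nprocs ranks one extra iteration are assumptions for illustration, not something taken from the slides; no MPI call is made, so rank and nprocs are passed in as plain integers.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper (not from the slides): returns the half-open range
// [begin, end) of loop iterations owned by `rank` when `n` iterations are
// split as evenly as possible over `nprocs` processes.
std::pair<std::size_t, std::size_t> block_range(std::size_t n,
                                                std::size_t nprocs,
                                                std::size_t rank) {
    assert(nprocs > 0 && rank < nprocs);
    const std::size_t base = n / nprocs;   // iterations every rank gets
    const std::size_t extra = n % nprocs;  // leftover iterations
    // The first `extra` ranks take one additional iteration each.
    const std::size_t begin = rank * base + (rank < extra ? rank : extra);
    const std::size_t end = begin + base + (rank < extra ? 1 : 0);
    return {begin, end};
}
```

In an MPI program, rank and nprocs would come from MPI_Comm_rank and MPI_Comm_size, and each process would loop only over its own range.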

Exascale System
• Cabinets
• Nodes
• Switch
• I/O

HPC System Evolution
credit: Nader Bagherzadeh, U.C.Irvine

ARCHITECTURE vs. MICRO-ARCHITECTURE
Credit: Alan Lee, CVP, AMD Research, 2017
• Architecture:
Describes the high level attributes of the system. Sometimes referred to as
Instruction Set Architecture when applied to processors.
Examples include x86 and ARM.
For the purist, Architecture consists of instructions, data types, and
addressing modes
A programmer can “see” the architecture
• Microarchitecture:
Describes the implementation details of the processor. Examples include
pipelining and instruction level parallelism.
The microarchitecture is most often hidden from programming
languages, but it is often quite important to the programmer
Note: These terms are often conflated in casual discussions with computer
engineers
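
One way to see why the microarchitecture matters to the programmer: the two functions below compute the same sum with the same kind of instructions, but the second keeps four independent accumulators so that a pipelined core with instruction level parallelism can overlap the additions. This is a sketch under assumptions: the function names are hypothetical, the factor of four is arbitrary, and whether the second version actually runs faster depends on the compiler and the processor.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One accumulator: every addition depends on the result of the previous one.
std::uint64_t sum_single(const std::vector<std::uint32_t>& v) {
    std::uint64_t s = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        s += v[i];
    }
    return s;
}

// Four accumulators: the four additions in the loop body are independent
// of each other, so the hardware is free to execute them in parallel.
std::uint64_t sum_unrolled(const std::vector<std::uint32_t>& v) {
    std::uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < v.size(); ++i) {  // leftover elements
        s0 += v[i];
    }
    return s0 + s1 + s2 + s3;
}
```

Both functions are identical at the architecture level (same result, same instruction set); any difference in speed comes from the implementation underneath.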

Basic Definitions
credit: Alan Lee, CVP, AMD Research, 2017
• CPU: Central Processing Unit. Current CPUs include multiple cores, on-
package I/O and sometimes include integrated GPUs for display and
compute purposes.
• GPU: Graphics Processing Unit. Current GPUs comprise many
small compute units that are optimized for parallel operations.
Specialized graphics hardware is often included to support graphics and
displays.
• FPGA: Field Programmable Gate Array. The FPGA is made up of a large
number of logic blocks surrounded by a digital routing fabric. Current
FPGAs often include floating point building block hardware, small CPU
cores and dedicated I/O circuitry such as memory controllers.
• DSP: Digital Signal Processor. DSPs include a small number of cores that
are highly optimized for multiply-accumulate operations and other
operations often used in signal processing algorithms. DSPs include large
amounts of tightly integrated signal processing I/O.
• ASIC: Application Specific Integrated Circuit. ASICs are specialized
circuitry implemented as a chip for a specific purpose.
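
The multiply-accumulate operation mentioned in the DSP definition is the inner step of many signal processing kernels. As an illustration, a direct-form FIR filter can be sketched as below; the function name fir_filter and the choice to treat samples before the start of the input as zero are assumptions, not part of the original slides.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration: y[n] = sum over k of h[k] * x[n - k],
// where samples before x[0] are treated as zero.
std::vector<double> fir_filter(const std::vector<double>& x,
                               const std::vector<double>& h) {
    std::vector<double> y(x.size(), 0.0);
    for (std::size_t n = 0; n < x.size(); ++n) {
        double acc = 0.0;
        for (std::size_t k = 0; k < h.size() && k <= n; ++k) {
            acc += h[k] * x[n - k];  // one multiply-accumulate (MAC)
        }
        y[n] = acc;
    }
    return y;
}
```

A DSP executes the marked line as a single operation, which is why the inner loop dominates the design of such processors.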

Computing Devices
Credit: Alan Lee, AMD Research CVP, 2017
• CPU, GPU, and DSP architectures are the closest to one another. What differentiates
them is how their microarchitectures handle parallelism and control, and how they are
combined with memory and I/O.
• FPGAs provide a semi-flexible solution where digital logic design is used to
implement algorithms and I/O for a specific task. Modern FPGAs include a number of
hardware multiply units that make them suitable for algorithms such as the FFT.
• ASICs are custom chips that can achieve better performance than FPGAs. They are
suitable for well defined algorithms.

Technology Trends
Credit: AMD Research, 2017
• Heterogeneous computing and accelerators
• Increased on-chip integration (e.g. CPU + GPU on the same die)
• Increased on-package integration using multiple chips on an interposer.
E.g: CPU + NIC + memory
• 3D or Die-stacking: Stacked memory chips and logic chips (as seen in
recent GPU products such as AMD’s Fiji)
• Higher core counts
• New memory technologies (e.g. NVRAM, stacked memory)
• Faster interconnects
• New programming paradigms: C++ AMP, OpenMP and OpenACC
standards

CPU Trends
Credit: Alan Lee, AMD Research CVP, 2017
CPU architecture trends:
• Bigger pipelines
• Increased out of order execution
• Improved speculative execution
• Wider vector operations
• Memory scatter/gather instructions
(vectored I/O)
These architectural features improve performance at the cost of die space
and reduced energy efficiency
Increasing core counts enables parallel thread execution
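
The scatter/gather instructions listed above load from, or store to, a set of non-contiguous addresses given by an index vector. Their semantics can be sketched in scalar code as follows; the function names are hypothetical, and a vectorizing compiler may or may not map these loops to hardware gather/scatter instructions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Gather: out[i] = src[idx[i]]
std::vector<float> gather(const std::vector<float>& src,
                          const std::vector<std::size_t>& idx) {
    std::vector<float> out(idx.size());
    for (std::size_t i = 0; i < idx.size(); ++i) {
        assert(idx[i] < src.size());
        out[i] = src[idx[i]];
    }
    return out;
}

// Scatter: dst[idx[i]] = values[i]
void scatter(std::vector<float>& dst,
             const std::vector<std::size_t>& idx,
             const std::vector<float>& values) {
    assert(idx.size() == values.size());
    for (std::size_t i = 0; i < idx.size(); ++i) {
        assert(idx[i] < dst.size());
        dst[idx[i]] = values[i];
    }
}
```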

GPU Overview
• Very high core count with highly parallel architectures
• Simplified core architecture to reduce die space and improve energy efficiency
• Sequential code runs poorly on the GPU, although current GPUs have better support for general purpose compute
• Excellent floating point capability
• High throughput memory architecture
Programmable using OpenCL, C++ and other high level languages via OpenMP and
OpenACC.
GPUs are good choices for highly parallel data processing such as signal and image
processing.

Handling Large Data Sets at High Speed
• A conventional CPU executes one thread at a time
• A multi-core CPU might execute tens of threads at a time
• A GPU can process thousands of threads concurrently (repurposing pixel processing for general purpose processing)
• Result: Huge increase in power-performance efficiency
• Highly parallel algorithms (e.g., cross-correlation) experience massive acceleration
• Trend: accelerators are increasingly deployed to attack more algorithms and problems
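
Cross-correlation, cited above as a highly parallel algorithm, shows why such workloads map well to thousands of threads: every output lag can be computed independently of the others. The sketch below is sequential C++ with a hypothetical function name; on a GPU, each iteration of the outer loop could be assigned to its own thread. Only non-negative lags are computed, which is a simplification.

```cpp
#include <cstddef>
#include <vector>

// r[lag] = sum over n of x[n] * y[n + lag], for lag = 0 .. max_lag.
// Terms whose index falls outside y are skipped.
std::vector<double> cross_correlate(const std::vector<double>& x,
                                    const std::vector<double>& y,
                                    std::size_t max_lag) {
    std::vector<double> r(max_lag + 1, 0.0);
    // Each lag is independent of the others: this is the parallel dimension.
    for (std::size_t lag = 0; lag <= max_lag; ++lag) {
        double acc = 0.0;
        for (std::size_t n = 0; n < x.size() && n + lag < y.size(); ++n) {
            acc += x[n] * y[n + lag];
        }
        r[lag] = acc;
    }
    return r;
}
```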

Memory Integration
Trend:
• Integration of memory directly inside the processor package
• Provides TB/s bandwidths through 10s of channels
Resilience is a Showstopper
• Virginia Tech’s System X, with 1100 Apple G5 motherboards, worked only at night. Why? Cosmic rays from the sun generated single-bit memory errors
• At national labs, supercomputers experience similar problems
• At 128 PB, an exascale system can have a double-bit error every few minutes
• Remedies may include:
  • Self healing: work around the faulty area
  • Self correction: redundant calculations
  • Fault-tolerant algorithms
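
As one illustration of the "redundant calculations" remedy, the same computation can be run three times and the results combined by a bitwise majority vote, so that a bit flipped in any single copy is outvoted by the other two. This triple modular redundancy sketch is an assumption about what such a remedy could look like, not a description of any particular exascale system; the function name is hypothetical.

```cpp
#include <cstdint>

// Hypothetical illustration: bitwise majority of three redundant results.
// Each output bit takes the value held by at least two of the inputs.
std::uint64_t majority_vote(std::uint64_t a, std::uint64_t b, std::uint64_t c) {
    return (a & b) | (a & c) | (b & c);
}
```

The vote masks an error in one copy but not the same bit flipped in two copies, and it triples the cost of the computation, which is one reason fault-tolerant algorithms are listed as a separate remedy.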

Top 500 Supercomputers (Nov 2016)
• China: Top 2
• US and China tied in total systems in list
• 117 systems above 1 PFlop
• Top 10 with accelerators: Xeon Phi (#2, #5, #6) and NVIDIA GPUs (#3, #8)

U.S. Department of Energy Exascale Program
RFP             Year  Awardees                            Total (US$ million)
FastForward     2011  NVIDIA, AMD, Intel, IBM, WhamCloud   62.0
DesignForward   2013  AMD, Cray, IBM, Intel, NVIDIA        25.4
FastForward2          NVIDIA, AMD, Intel, IBM             100.0
DesignForward2  2014  AMD, Cray, IBM                       20.0
PathForward     2016  AMD, Cray, IBM, HP, Intel, NVIDIA   258.0

Exascale Computing Program – CANDLE
Precision Medicine for Cancer
Exascale Deep Learning and Simulation Enabled Precision Medicine for Cancer
• CANcer Distributed Learning Environment (CANDLE): scalable deep neural network code
Three top challenges of the National Cancer Institute:
1. Understanding the molecular basis of key protein interactions
2. Developing predictive models for drug response
3. Automating the analysis and extraction of information from millions of cancer patient records to determine optimal cancer treatment strategies

Exascale Technology Benefits
• "Big data is what happened when the cost of storing information became
less than the cost of making the decision to throw it away."
- George Dyson
• Big Data and Analytics
• Machine Learning at Exascale
• Commercial Applications
• High Performance Computing Systems
• Memory
• Storage
• Communication

Concluding Remarks
• Exascale Systems and Applications
• Technology Development to Exascale
• Effects on Big Data and Analytics

Mauricio Breternitz
mbjrz@iscte.pt
Thank You!

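// C++17 parallel algorithms (the standardized form of the Parallelism TS).
// The mutex serializes the increments of x, which is safe with the par
// policy but would not be with par_unseq.
#include <algorithm>
#include <execution>
#include <iterator>
#include <mutex>

int main() {
    int x = 0;
    std::mutex m;
    int a[] = {1, 2};
    std::for_each(std::execution::par, std::begin(a), std::end(a), [&](int) {
        std::lock_guard<std::mutex> guard(m);
        ++x;
    });
    return x == 2 ? 0 : 1;
}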
