High Performance Computing at Exascale:
Application Requirements and Technology Development
Mauricio Breternitz, Ph.D.
05 July 2017
Outline
1. Brief Introduction
2. High Performance Computing and Key Application Requirements
3. Supercomputing Systems Under the Hood
4. Technological Challenges and Path to Exascale
5. Exascale Development Program
Brief BIO, Publications, Patents
Work: IBM Research, Motorola, Times N, Intel Labs, AMD Research, ISCTE
Education: Ph.D – Carnegie Mellon, ECE
M.Sc. – UNICAMP, Computer Science
E.Eng. – ITA, Brazil (Honors)
Area(s): Computer Architecture, Computer Systems, Performance, Tuning, Big Data, Machine Learning
Patents: 46 U.S. Patents Issued, 50 U.S. Patents Pending
Publications: 1010 citations, h-index 16, i10-index 31, citations 885
Service: Creator /General Chair: International Workshop on
Architectural/Microarchitectural Support for Binary Translation,
joint with and CGO.
Chair, ICCD
PC: CGO
Academic: Guest Lecturer – U.Texas/Austin, CMU, UNICAMP
Collaboration: U.Texas/Austin, Rice University, CMU, UNICAMP, Edinburgh
[Figure: timeline with Education, Professional, and Personal tracks]
Brief Introduction
Exascale Research Areas
credit: Exascale Computing Project, 2017
Key Application Requirements
• Experiments: impossible, dangerous, costly
• Vastly more accurate predictive models +
• Analysis of vast quantities of data
IMPROVE
• Regional climate
• Carbon footprint
• Nuclear efficiency, safety
• Renewable energy
• Nuclear stockpile safety
• Human Brain
• Advanced Materials
Report on Exascale Computing: Summary Report of the Advanced Scientific Computing Advisory Committee (ASCAC) Subcommittee, U.S. Dept. of Energy, Fall 2010
https://science.energy.gov/~/media/ascr/ascac/pdf/reports/Exascale_subcommittee_report.pdf
Exascale Application Areas
credit: Paul Messina, Exascale Computing Project, 2017
Cancer Research
Exascale Challenges
credit: Paul Messina, Exascale Computing Project, 2017
Technology Progress
credit: Paul Messina, Exascale Computing Project, 2017
Achieving Exascale
Credit: Paul Messina, Exascale Computing Project, 2017
Performance Via Parallelism
Parallelism
• Inter-Node Parallelism
• Intra-Node Parallelism
  • Vectors
  • Pipelining
  • Multiple Cores
• Memory Organization
  • Shared Memory
  • Distributed Memory
  • Locality
• Programming for Parallelism
  • Intra-Node:
    • Vectorization
    • OpenMP
  • Inter-Node:
    • MPI
• I/O
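The two intra-node levels listed above (multiple cores and vectors) can be combined in a single loop. The following is a minimal sketch, not taken from the talk: the function name `scale_add` and its parameters are assumptions chosen for illustration. When OpenMP is not enabled, the pragma is ignored and the loop runs sequentially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: y[i] = alpha * x[i] + y[i] for i in [0, n).
 * "parallel for" divides the iterations among cores (threads);
 * "simd" asks the compiler to vectorize each thread's share. */
void scale_add(int n, float alpha, const float *x, float *y) {
    assert(x != NULL && y != NULL);
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++) {
        y[i] = alpha * x[i] + y[i];
    }
}
```

Each iteration touches only its own element, so no synchronization is needed; the loop also walks both arrays with unit stride, which is the access pattern that benefits most from locality.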
Programming Parallelism
• Intra-Node: Multiple Cores -> Multiple threads
  • Memory: shared
  • Synchronization
  • OpenMP
• Inter-Node: Multiple CPUs -> Multiple processes
  • Memory: distributed
  • Communication
  • MPI
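#include <mpi.h>
#include <stdio.h>

/* Intra-node (OpenMP): loop iterations are divided among threads. */
void simple(int n, float *a, float *b) {
    int i;
    #pragma omp parallel for
    for (i = 1; i < n; i++) {
        b[i] = a[i] * 3.1416f;
    }
}

/* Inter-node (MPI): rank 0 sends one integer to rank 1.
   Run with at least two processes. */
int main(int argc, char **argv) {
    int my_rank, number;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    if (my_rank == 0) {
        number = -1;
        MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (my_rank == 1) {
        MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received number %d from process 0\n", number);
    }
    MPI_Finalize();
    return 0;
}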
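The "Synchronization" bullet for shared-memory threads can be illustrated with an OpenMP reduction. This is a sketch under assumptions: the function `dot` is a hypothetical example, not code from the talk. Every thread adds into the same logical variable `sum`; the `reduction` clause gives each thread a private copy and combines the copies at the end, which avoids a data race without an explicit lock.

```c
#include <assert.h>

/* Hypothetical example: dot product of a and b, both of length n. */
double dot(int n, const double *a, const double *b) {
    assert(n >= 0);
    double sum = 0.0;
    /* Without reduction(+:sum), concurrent updates of sum would race. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

The order in which partial sums are combined is not fixed, so floating-point results can differ in the last bits between runs with different thread counts.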
Exascale System
• Cabinets
• Nodes
• Switch
• I/O
Exascale System Specification
HPC System Evolution
credit: Nader Bagherzadeh, U.C.Irvine
ARCHITECTURE vs. MICRO-ARCHITECTURE
Credit: Alan Lee, CVP, AMD Research, 2017
• Architecture:
  Describes the high-level attributes of the system. Sometimes referred to as Instruction Set Architecture when applied to processors. Examples include x86 and ARM.
  For the purist, architecture consists of instructions, data types, and addressing modes.
  A programmer can “see” the architecture.
• Microarchitecture:
  Describes the implementation details of the processor. Examples include pipelining and instruction-level parallelism.
  The microarchitecture is most often hidden from programming languages, but it is often quite important to the programmer.
Note: These terms are often conflated in casual discussions with computer engineers.
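One way to see why a hidden microarchitecture still matters to the programmer: the two functions below compute the same sum, but the second exposes independent additions that a pipelined, superscalar core can overlap. This is an illustrative sketch; the function names are assumptions, and any speedup depends on the processor and compiler.

```c
#include <assert.h>

/* One accumulator: every add depends on the result of the previous add. */
double sum_chain(int n, const double *v) {
    assert(n >= 0);
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        s += v[i];
    }
    return s;
}

/* Four independent accumulators: the adds within one iteration do not
 * depend on each other, exposing instruction-level parallelism. */
double sum_ilp4(int n, const double *v) {
    assert(n >= 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++) {   /* leftover elements */
        s0 += v[i];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Both versions use the same instructions of the architecture. Reordering floating-point additions can change rounding, so the results match exactly only when the sums are exactly representable.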
Basic Definitions
credit: Alan Lee, CVP, AMD Research, 2017
• CPU: Central Processing Unit. Current CPUs include multiple cores and on-package I/O, and sometimes include integrated GPUs for display and compute purposes.
• GPU: Graphics Processing Unit. Current GPUs consist of many small compute units that are optimized for parallel operations. Specialized graphics hardware is often included to support graphics and displays.
• FPGA: Field Programmable Gate Array. The FPGA is made up of a large number of logic blocks surrounded by a digital routing fabric. Current FPGAs often include floating-point building-block hardware, small CPU cores, and dedicated I/O circuitry such as memory controllers.
• DSP: Digital Signal Processor. DSPs include a small number of cores that are highly optimized for multiply-accumulate operations and other operations often used in signal processing algorithms. DSPs include large amounts of tightly integrated signal processing I/O.
• ASIC: Application Specific Integrated Circuit. ASICs are specialized circuitry implemented as a chip for a specific purpose.
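The multiply-accumulate operation that DSPs are optimized for is the inner step of a finite impulse response (FIR) filter. The sketch below is illustrative only; the function `fir` and its argument layout are assumptions, and samples before the start of the input are treated as zero.

```c
#include <assert.h>

/* Hypothetical FIR filter: y[n] = sum over k of h[k] * x[n - k],
 * for k in [0, taps), skipping terms that would read before x[0]. */
void fir(int len, const float *x, int taps, const float *h, float *y) {
    assert(len >= 0 && taps > 0);
    for (int n = 0; n < len; n++) {
        float acc = 0.0f;
        for (int k = 0; k < taps && k <= n; k++) {
            acc += h[k] * x[n - k];   /* one multiply-accumulate */
        }
        y[n] = acc;
    }
}
```

For example, with `h = {1, 1}` each output is the sum of the current and the previous input sample.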
Computing Devices
Credit: Alan Lee, CVP, AMD Research, 2017
• CPU, GPU, and DSP architectures are closest, differing on parallelism and control. What differentiates them is how the microarchitectures are combined with each other (parallelism and control) and with memory and I/O.
• FPGAs provide a semi-flexible solution where digital logic design is used to implement algorithms and I/O for a specific task. Modern FPGAs include a number of hardware multiply units that make them suitable for algorithms such as the FFT.
• ASICs are custom chips that can achieve better performance than FPGAs. They are suitable for well-defined algorithms.
Technology Trends
Credit: AMD Research, 2017
• Heterogeneous computing and accelerators
• Increased on-chip integration (e.g. CPU + GPU on the same die)
• Increased on-package integration using multiple chips on an interposer (e.g. CPU + NIC + memory)
• 3D or die-stacking: stacked memory chips and logic chips (as seen in recent GPU products such as AMD’s Fiji)
• Higher core counts
• New memory technologies (e.g. NVRAM, stacked memory)
• Faster interconnects
• New programming paradigms: C++ AMP, OpenMP and OpenACC standards
CPU Trends
Credit: Alan Lee, AMD Research CVP, 2017
CPU architecture trends:
• Bigger pipelines
• Increased out-of-order execution
• Improved speculative execution
• Wider vector operations
• Memory scatter/gather instructions (indexed vector loads and stores)
These architectural features improve performance at the cost of die space and reduced energy efficiency.
Increasing core counts enables parallel thread execution.
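The access pattern that gather instructions target is an indexed load, where each vector lane reads from a different address. The following sketch is an assumption-laden illustration (the function `gather` is hypothetical); whether the compiler emits a hardware gather instruction depends on the target CPU and compiler flags.

```c
#include <assert.h>

/* Hypothetical example: out[i] = table[idx[i]] for i in [0, n).
 * Every idx[i] must be a valid index into table. */
void gather(int n, const float *table, const int *idx, float *out) {
    assert(n >= 0);
    #pragma omp simd
    for (int i = 0; i < n; i++) {
        out[i] = table[idx[i]];   /* indexed (gather) load */
    }
}
```

A scatter is the mirror image, `table[idx[i]] = in[i]`, and is safe to vectorize only when the indices do not repeat.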
GPU Overview
Very high core count with highly parallel architectures
• Simplified core architecture to reduce die space and improve energy efficiency
• Sequential code runs poorly on the GPU, although current GPUs have better support for general-purpose compute
• Excellent floating-point capability
• High-throughput memory architecture
Programmable using OpenCL, C++ and other high-level languages via OpenMP and OpenACC.
GPUs are good choices for highly parallel data processing such as signal and image processing.
CPU, GPU Comparison
Handling Large Data Sets at High Speed
• A conventional CPU executes one thread at a time
• A multi-core CPU might execute tens of threads at a time
• A GPU can process thousands of threads concurrently (pixel-processing hardware repurposed for general-purpose processing)
• Result: huge increase in power-performance efficiency
• Highly parallel algorithms (e.g., cross-correlation) experience massive acceleration
Trend: accelerators are increasingly deployed to attack more algorithms and problems
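Cross-correlation parallelizes well because every lag can be computed independently of the others. The sketch below shows that structure on a CPU with an OpenMP loop; on a GPU each lag (or each product) would map to its own thread. The function `xcorr` and its argument layout are assumptions for illustration.

```c
#include <assert.h>

/* Hypothetical example: r[lag] = sum over i of x[i] * y[i + lag],
 * for lag in [0, nlags). Terms that would read past the end of y are skipped. */
void xcorr(int nx, const double *x, int ny, const double *y,
           int nlags, double *r) {
    assert(nx >= 0 && ny >= 0 && nlags >= 0);
    #pragma omp parallel for
    for (int lag = 0; lag < nlags; lag++) {
        double acc = 0.0;   /* private to this lag, so no synchronization */
        for (int i = 0; i < nx && i + lag < ny; i++) {
            acc += x[i] * y[i + lag];
        }
        r[lag] = acc;
    }
}
```

No iteration writes to memory that another iteration reads, which is the property that lets the work spread across thousands of threads.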
Memory Integration
Trend:
• Integration of memory directly inside the processor package
• Provides TB/s bandwidths through tens of channels
Resilience is a Showstopper
• Virginia Tech’s System X, built from 1100 Apple G5 motherboards, worked only at night. Why? Cosmic rays from the sun generated single-bit memory errors
• At national labs, supercomputers experience similar problems
• At 128 PB, an exascale system can have a double-bit error every few minutes
• Remedies may include:
  • Self-healing: work around the faulty area
  • Self-correction: redundant calculations
  • Fault-tolerant algorithms
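The "redundant calculations" remedy can be sketched as running a computation three times and accepting the value that at least two runs agree on. This is a simplified illustration, not a method described in the talk: the names `run_redundant`, `kernel_fn`, and `square` are hypothetical, and a real system would run the copies on separate hardware so that one fault cannot affect all three.

```c
#include <assert.h>
#include <stddef.h>

typedef long (*kernel_fn)(long);

/* Runs f three times on the same input. Returns 1 and stores the agreed
 * value in *result if at least two runs match; returns 0 otherwise. */
int run_redundant(kernel_fn f, long input, long *result) {
    assert(f != NULL && result != NULL);
    long r1 = f(input);
    long r2 = f(input);
    long r3 = f(input);
    if (r1 == r2 || r1 == r3) {
        *result = r1;
        return 1;
    }
    if (r2 == r3) {
        *result = r2;
        return 1;
    }
    return 0;   /* all three disagree: the caller must recompute */
}

/* Hypothetical kernel used only to exercise the voter. */
long square(long v) {
    return v * v;
}
```

The cost is three times the computation, which is why fault-tolerant algorithms that check a cheaper invariant are listed as a separate remedy.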
Top 500 Supercomputers (Nov 2016)
• China: top 2
• US and China tied in total systems in list
• 117 systems above 1 PFlop
• Top 10 with accelerators: Xeon Phi (#2, #5, #6) and NVIDIA GPUs (#3, #8)
U.S. Department of Energy Exascale Program
RFP              Year   Awardees                              Total (US$ million)
FastForward      2011   NVIDIA, AMD, Intel, IBM, WhamCloud    62.0
DesignForward    2013   AMD, Cray, IBM, Intel, NVIDIA         25.4
FastForward2            NVIDIA, AMD, Intel, IBM               100.0
DesignForward2   2014   AMD, Cray, IBM                        20.0
PathForward      2016   AMD, Cray, IBM, HP, Intel, NVIDIA     258.0
Exascale Computing Program – CANDLE
Precision Medicine for Cancer
Exascale Deep Learning and Simulation Enabled Precision Medicine for Cancer
scalable deep neural network code
CANcer Distributed Learning Environment (CANDLE)
Three top challenges of the National Cancer Institute:
1. understanding the molecular basis of key protein interactions,
2. developing predictive models for drug response, and
3. automating the analysis and extraction of information from millions of cancer patient records to determine optimal cancer treatment strategies.
Exascale Technology Benefits
• "Big data is what happened when the cost of storing information became less than the cost of making the decision to throw it away." - George Dyson
• Big Data and Analytics
• Machine Learning at Exascale
• Commercial Applications
• High Performance Computing Systems
• Memory
• Storage
• Communication
Concluding Remarks
• Exascale Systems and Applications
• Technology Development to Exascale
• Effects on Big Data and Analytics
Mauricio Breternitz
mbjrz@iscte.pt
Thank You!
Backup
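// C++17 form of this Parallelism TS example:
// std::experimental::parallel::par is now std::execution::par.
#include <algorithm>
#include <execution>
#include <iterator>
#include <mutex>

int main() {
    int x = 0;        // shared counter (left uninitialized on the slide)
    std::mutex m;
    int a[] = {1, 2};
    std::for_each(std::execution::par, std::begin(a), std::end(a), [&](int) {
        std::lock_guard<std::mutex> guard(m);  // serializes the update of x
        ++x;
    });
    return x == 2 ? 0 : 1;
}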