When We Need High Performance
► Case 1
To perform time-consuming operations in less time or before a tighter deadline.
►I am a bioinformatics engineer.
►I need to run computationally complex programs.
►I’d rather have the result in 5 minutes than in 5 days.
► Case 2
To perform a high number of operations per second.
►I am an engineer.
►My web server gets 1,000 hits per second.
►I’d like my web server and my databases to handle 1,000 transactions per second so that
customers do not experience bad delays.
Amazon processes several GBytes of data per second.
What Does High Performance
Computing Study
► It includes the following subjects
 Hardware
- Computer Architecture
- Network Connections
 Software
- Programming paradigms
- Languages
- Middleware
[Figure: historical machines placed on a logarithmic scale from 1 to 10^12: Univac 1, Cray 1, Cray XMP, Cray YMP, CDC 6600, Babbage Difference Engine, Intel Delta, Cray X1, Mark 1, IBM 7094, Cray XT5]
Nations and global regions including China, the United States, Japan, and
Russia, are racing ahead and have created national programs that are
investing large sums of money to develop exascale supercomputers.
Supercomputer: A computing system exhibiting high-end performance
capabilities and resource capacities within practical constraints of technology, cost,
power, and reliability. Thomas Sterling, 2007
Supercomputer: a large very fast mainframe used especially for scientific
computations. Merriam-Webster Online
Supercomputer: any of a class of extremely powerful computers. The term is
commonly applied to the fastest high-performance systems available at any given time.
Such computers are used primarily for scientific and engineering work requiring
exceedingly high-speed computations. Encyclopedia Britannica Online
Research Applications of HPCs
• Finance
• Sports and Entertainment
• Weather Forecasting
• Space Research
• Health-Care-Related Applications
• to unravel the morphology of cancer cells,
• to diagnose and treat cancers and improve the safety of cancer
• medical research,
• biomedicine,
• bioinformatics,
• epidemiology,
• Personalized medicine
‘Big Data’
Company Share of Global HPC Revenues
Countries Share of Global HPC Revenues
 Introduction - today’s lecture
 System Architectures (Single Instruction - Single Data, Single Instruction - Multiple
Data, Multiple Instruction - Multiple Data, Shared Memory, Distributed Memory,
Cluster, Multiple Instruction - Single Data)
 Performance Analysis of parallel calculations (speedup, efficiency, execution time
of an algorithm…)
 Parallel numerical methods (Principles of Parallel Algorithm Design, Analytical
Modeling of Parallel Programs, Matrix Operations, Matrix-Vector Operations, Graph Algorithms)
 Software (Programming Using Compute Unified Device Architecture)
In the first section we will discuss the importance of parallel computing to high
performance computing. We will show the basic concepts of parallel computing.
The advantages and disadvantages of parallel computing will be discussed. We
will present an overview of current and future trends in HPC hardware.
The second section will provide an introduction to parallel GPU implementations
of numerical methods, such as matrix operations, matrix-vector operations, and
graph algorithms. It will also present the application of parallel computing
techniques using Graphics Processing Units (GPUs) to improve the computational
efficiency of numerical methods.
The third section will briefly discuss important HPC topics such as computing
on GPUs with CUDA, running relatively simple examples on this hardware. As
tradition dictates, we will show how to write "Hello World" in CUDA. Some
computational libraries available for HPC with GPUs will be highlighted.
 What is traditional programming view?
 Why Use Parallel Computing?
 Motivation for parallelism (Moore’s law).
 An Overview of Parallel Processing
 Parallelism in Uniprocessor Systems
 Organization of Multiprocessor
 Flynn’s Classification
 System Topologies
 MIMD System Architectures
 Concepts and terminology.
Parallelism is a method to improve computer
system performance by executing two or more
instructions simultaneously.
 The goals of parallel processing.
 One goal is to reduce the “wall-clock” time or the
amount of real time that you need to wait for a
problem to be solved.
 Another goal is to solve bigger problems that
might not fit in the limited memory of a single
Consider your favorite computational application
• One processor can give me results in N hours
• Why not use N processors -- and get the
results in just one hour?
Parallel computing: the use
of multiple computers or
processors working together
on a common task.
 Each processor works on
its section of the problem.
 Processors are allowed to
exchange information
with other processors.
Grid of a Problem to be
Flynn’s Classification (Taxonomy)
 Was proposed by researcher Michael J. Flynn in
 It is the most commonly accepted taxonomy of
computer organization.
 In this classification, a computer is classified by
whether it processes a single instruction at a time or
multiple instructions simultaneously, and whether it
operates on one or multiple data sets.
4 categories of Flynn’s classification of multiprocessor
systems by their instruction and data streams
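As a minimal sketch of how the two questions above (one or many instruction streams, one or many data streams) combine into the four category names, the following Python function maps stream counts to a label. The function name and the example machines are hypothetical and chosen only for illustration.

```python
def flynn_category(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn category name for the given numbers of concurrent streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a machine needs at least one instruction and one data stream")
    instr = "S" if instruction_streams == 1 else "M"  # Single or Multiple instruction
    data = "S" if data_streams == 1 else "M"          # Single or Multiple data
    return f"{instr}I{data}D"


if __name__ == "__main__":
    # Hypothetical machines, described only by their stream counts.
    examples = [
        ("classic uniprocessor", 1, 1),
        ("array processor", 1, 64),
        ("4-CPU multiprocessor", 4, 4),
    ]
    for name, instr, data in examples:
        print(f"{name}: {flynn_category(instr, data)}")
```

The sketch only encodes the naming rule; it says nothing about how a real machine is built.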
Simple Diagrammatic Representation
 SISD machines execute a single instruction
on individual data values using a single
processor.
 Based on the traditional von Neumann
uniprocessor architecture, instructions are
executed sequentially or serially, one step
after the next.
 Until recently, most computers were of the
SISD type.
 An SIMD machine executes a single
instruction on multiple data values
simultaneously using many processors.
 Since there is only one instruction, each
processor does not have to fetch and
decode each instruction. Instead, a single
control unit does the fetch and decoding for
all processors.
 SIMD architectures include array processors.
 The MISD (multiple instruction, single data)
category has few, if any, practical implementations.
It was included in the taxonomy for
the sake of completeness.
 MIMD machines are usually referred to as
multiprocessors or multicomputers.
 In contrast to SIMD machines, they can execute
multiple instructions simultaneously.
 Each processor must include its own control
unit; the processors can be assigned parts
of a task or separate tasks.
 MIMD has two subclasses: shared memory and
distributed memory.
Shared memory UMA (all processors have equal
access to memory and can talk via memory).
Distributed Memory
Processors have access only to their
local memory and “talk” to other processors
over a network.
Shared memory nodes connected by a network
Hybrid Machines
•Add special purpose
processors to normal
•Not a new concept
but, regaining traction
Example: our Tesla
Nvidia node,
 Shared memory and distributed memory
machines have become mainstream.
 Manycore architectures: GPUs used for
computing (GPGPU)
 High performance computing is now almost
synonymous with parallel computing.
 Finally, the architecture of a MIMD system,
in contrast to its topology, refers to its
connections to its system memory.
 Systems may also be classified by their
architectures. Two of these are:
 Uniform memory access (UMA)
 Nonuniform memory access (NUMA)
 UMA is a type of symmetric
multiprocessor, or SMP, that has two or more
processors that perform symmetric functions.
UMA gives all CPUs equal (uniform) access to
all memory locations in shared memory.
The processors interact with shared memory through a
communications mechanism such as a simple bus
or a complex multistage interconnection network.
[Figure: UMA architecture, showing Processor 1, Processor 2, ..., Processor n]
 NUMA architectures, unlike UMA architectures,
do not allow uniform access to all shared
memory locations. This architecture still
allows all processors to access all shared
memory locations, but in a nonuniform way:
each processor can access its local shared
memory more quickly than the other memory
modules not next to it.
[Figure: NUMA architecture, showing Processor 1 to Processor n, each paired with a memory module (Memory 1 to Memory n), and a communications mechanism]
An analogy of Flynn’s classification is the
check-in desk at an airport
 SISD: a single desk
 SIMD: many desks and a supervisor with a
megaphone giving instructions that every desk
 MIMD: many desks working at their own pace,
synchronized through a central database
All parallel programs contain:
• Parallel sections
• Serial sections
• Serial sections are when work is being duplicated or no
useful work is being done, (waiting for others)
• Serial sections limit the parallel effectiveness
• If you have a lot of serial computation then you will not
get good speedup
• No serial work “allows” perfect speedup
• Amdahl’s Law states this formally
 Amdahl’s Law places a strict limit on the speedup that can be
realized by using multiple processors.
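 Effect of multiple processors on run time: $t_N = \left( f_p / N + f_s \right) t_1$
• Effect of multiple processors on speedup: $S = t_1 / t_N = 1 / \left( f_s + f_p / N \right)$
• Where
• $f_s$ = serial fraction of code
• $f_p$ = parallel fraction of code ($f_s + f_p = 1$)
• $N$ = number of processors
• Perfect speedup: $t_N = t_1 / N$, or $S(N) = N$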
 Amdahl’s Law provides a theoretical upper limit
on parallel speedup assuming that there are no
costs for communications.
 In reality, communications will result in a
further degradation of performance
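To make the limit concrete, here is a small Python sketch of Amdahl's Law in its usual form, speedup = 1 / (serial fraction + parallel fraction / N), using the serial fraction and the number of processors defined above. The function names and the 10% serial fraction are illustrative assumptions, and communication costs are ignored, as the law itself assumes.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Theoretical speedup on `processors` processors with no communication cost."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def amdahl_limit(serial_fraction: float) -> float:
    """Upper bound on speedup as the number of processors grows without limit."""
    if serial_fraction == 0.0:
        return float("inf")  # no serial work: speedup is not bounded by the law
    return 1.0 / serial_fraction


if __name__ == "__main__":
    serial_fraction = 0.10  # assumed value, for illustration only
    for n in (1, 2, 8, 64, 1024):
        print(f"N = {n:5d}  speedup = {amdahl_speedup(serial_fraction, n):.2f}")
    print(f"limit for any N: {amdahl_limit(serial_fraction):.2f}")
```

With a 10% serial fraction the speedup can never exceed 10, however many processors are added.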
Writing effective parallel applications is difficult
• Communication can limit parallel efficiency
• Serial time can dominate
• Load balance is important
Is it worth your time to rewrite your application?
• Do the CPU requirements justify
• Will the code be used just once?
Superlinear speedup, S(n) > n,
may be seen on occasion, but usually this is due to using a
suboptimal sequential algorithm or some unique feature of the
architecture that favors the parallel formulation.
One common reason for superlinear speedup is the extra cache
in the multiprocessor system, which can hold more of the
problem data at any instant and so leads to less traffic to the
relatively slow main memory.
Efficiency = Execution time using one
processor over
(Execution time using N processors × N)
It is just the speedup divided by the number of processors.
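A short Python sketch of these two measures, computed from measured run times. The timings used in the example (120 s on one processor, 20 s on eight) are made-up numbers for illustration, not measurements from any real system.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup: run time on one processor divided by run time on N processors."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, processors: int) -> float:
    """Efficiency: speedup divided by the number of processors used."""
    return speedup(t_serial, t_parallel) / processors


if __name__ == "__main__":
    t1, t8 = 120.0, 20.0  # hypothetical run times in seconds
    print("speedup:", speedup(t1, t8))
    print("efficiency:", efficiency(t1, t8, 8))
```

An efficiency of 1.0 corresponds to perfect speedup; values below 1.0 show how much processor time is lost to serial work, communication, or idle waiting.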
Scalability is used to indicate a hardware design that allows the system to be
increased in size and, in doing so, to obtain increased
performance. This could be described as architecture or hardware
scalability.
Scalability is also used to indicate that a parallel algorithm can
accommodate increased data items with a low and bounded
increase in computational steps. This could be described as
algorithmic scalability.
Problem size: the number of basic steps in the
best sequential algorithm for a given problem and
data set size
•Intuitively, we would think of the number of data elements
being processed in the algorithm as a measure of size.
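•However, doubling the data set size would not necessarily
double the number of computational steps. It will depend upon
the problem.
•For example, adding two matrices doubles the steps when the number of
elements doubles, but multiplying two n × n matrices with the standard
algorithm takes about n³ steps, so doubling n (four times the data)
multiplies the operations by eight.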
Note: bad sequential algorithms tend to scale well.
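The difference between data size and problem size can be checked by counting steps directly. The sketch below counts one step per element for matrix addition and one multiply-add per innermost iteration for the standard triple-loop matrix multiplication; the function names are hypothetical and the counting convention is an assumption made for illustration.

```python
def count_add_steps(n: int) -> int:
    """Count element additions needed to add two n x n matrices."""
    steps = 0
    for _row in range(n):
        for _col in range(n):
            steps += 1  # one addition per output element
    return steps


def count_multiply_steps(n: int) -> int:
    """Count multiply-add steps in the standard triple-loop n x n matrix product."""
    steps = 0
    for _row in range(n):
        for _col in range(n):
            for _k in range(n):
                steps += 1  # one multiply-add per term of the dot product
    return steps


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(f"n = {n:2d}  elements = {n * n:4d}  "
              f"add steps = {count_add_steps(n):4d}  "
              f"multiply steps = {count_multiply_steps(n):5d}")
```

Going from n = 4 to n = 8 quadruples the number of elements and the addition steps, while the multiplication steps grow eightfold.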
• Latency
• How long it takes to get between nodes in the network.
• Bandwidth
• How much data can be moved per unit time.
• Bandwidth is limited by the number of wires
and the rate at which each wire can accept data and
choke points.
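A common first-order way to combine the two quantities is to model the time to send one message as latency plus message size divided by bandwidth. The Python sketch below uses that model; the latency and bandwidth values are assumptions picked only to illustrate the effect and do not describe any particular network.

```python
def transfer_time(message_bytes: float, latency_s: float,
                  bandwidth_bytes_per_s: float) -> float:
    """Estimated time for one message: fixed latency plus size over bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    latency = 1e-6    # assumed: 1 microsecond per message
    bandwidth = 1e9   # assumed: 1 GB/s

    # The same 1 MB of data, sent as one large message or as 1000 small ones.
    one_large = transfer_time(1_000_000, latency, bandwidth)
    many_small = 1000 * transfer_time(1_000, latency, bandwidth)
    print(f"one 1 MB message:   {one_large * 1e3:.3f} ms")
    print(f"1000 1 kB messages: {many_small * 1e3:.3f} ms")
```

Under these assumed numbers the many small messages take about twice as long, because each one pays the latency again. This is one reason to aggregate communication where possible.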
 For ultimate performance you may be
concerned how your nodes are connected.
 Avoid communications between distant nodes.
 For some machines it might be difficult to
control or know the placement of applications.
 A system may also be classified by its topology.
 A topology is the pattern of connections between processors.
 The cost-performance tradeoff determines which
topologies to use for a multiprocessor system.
A topology is characterized by its diameter,
total bandwidth, and bisection bandwidth
◦ Diameter – the maximum distance between two
processors in the computer system.
◦ Total bandwidth – the capacity of a
communications link multiplied by the number of
such links in the system.
◦ Bisection bandwidth – represents the maximum
data transfer that could occur at the bottleneck in
the topology.
 Shared Bus
◦ Processors communicate
with each other via a single
bus that can only handle
one data transmission at
a time.
◦ In most shared buses,
processors directly
communicate with their
own local memory.
Shared Bus
Network Topologies
 Ring Topology
◦ Uses direct connections between processors instead of a shared bus.
◦ Allows communication links to be active simultaneously, but data
may have to travel through several processors to reach its destination.
 Tree Topology
◦ Uses direct connections between processors, each having three connections.
◦ There is only one unique path between any pair of processors.
 Mesh Topology
◦ In the mesh topology, every processor connects to the
processors above and below it, and to its right and left.
 Hypercube
◦ Is a multiple mesh topology.
◦ Each processor connects to all other processors whose
binary values differ by one bit. For example, in a four-dimensional
hypercube, processor 0 (0000) connects to 1 (0001), 2 (0010),
4 (0100), and 8 (1000).
 Completely Connected
 Every processor has n-1 connections, one to
each of the other processors.
 Complexity increases as the system grows, but this
topology offers maximum communication capabilities.
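The trade-off between link count and diameter in these topologies can be explored with a small Python sketch. It builds ring, hypercube, and completely connected networks as adjacency lists and measures the diameter by breadth-first search. All function names are hypothetical, and the 16-processor size is an arbitrary choice for illustration.

```python
from collections import deque


def ring(n: int) -> dict:
    """Each processor links to its two neighbours around the ring."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def hypercube(dim: int) -> dict:
    """Each processor links to those whose binary label differs in exactly one bit."""
    return {i: [i ^ (1 << bit) for bit in range(dim)] for i in range(1 << dim)}


def completely_connected(n: int) -> dict:
    """Each processor links to the n-1 others."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def link_count(graph: dict) -> int:
    """Number of bidirectional links (each link appears in two adjacency lists)."""
    return sum(len(neighbours) for neighbours in graph.values()) // 2


def diameter(graph: dict) -> int:
    """Largest shortest-path distance between any two processors (BFS from each)."""
    worst = 0
    for start in graph:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        worst = max(worst, max(dist.values()))
    return worst


if __name__ == "__main__":
    for name, graph in [("ring", ring(16)),
                        ("hypercube", hypercube(4)),
                        ("completely connected", completely_connected(16))]:
        print(f"{name:22s} links = {link_count(graph):3d}  diameter = {diameter(graph)}")
```

For 16 processors the sketch shows the cost-performance trade-off directly: more links buy a smaller diameter.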
Moore's Law describes a long-
term trend in the history of
computing hardware, in which
the number of transistors that
can be placed inexpensively on
an integrated circuit has doubled
approximately every two years.
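As a rough arithmetic illustration of what doubling every two years implies, the Python sketch below projects a transistor count forward in time. The starting count of 1,000 is a made-up value, and the function name and two-year default are assumptions for illustration, not measured data.

```python
def projected_transistors(initial_count: float, years_elapsed: float,
                          doubling_period_years: float = 2.0) -> float:
    """Project a count forward assuming it doubles every `doubling_period_years`."""
    return initial_count * 2 ** (years_elapsed / doubling_period_years)


if __name__ == "__main__":
    start = 1_000  # hypothetical starting transistor count
    for years in (0, 2, 10, 20, 40):
        print(f"after {years:2d} years: {projected_transistors(start, years):,.0f}")
```

Twenty years of doubling every two years is ten doublings, a factor of 1,024.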
 Mechanical Computing
◦ Babbage, Hollerith, Aiken
 Electronic Digital Calculating
◦ Atanasoff, Eckert, Mauchly
 von Neumann Architecture
◦ Turing, von Neumann, Eckert, Mauchly, Foster, Wilkes
 Semiconductor Technologies
 Birth of the Supercomputer
◦ Cray, Watanabe
 The Golden Age
◦ Batcher, Dennis, S. Chen, Hillis, Dally, Blank, B. Smith
 Common Era of Killer Micros
◦ Scott, Culler, Sterling/Becker, Goodhue, A. Chen, Tomkins
 Petaflops
◦ Messina, Sterling, Stevens, P. Smith,
Historical Machines
• Leibniz Stepped Reckoner
• Babbage Difference Engine
• Hollerith Tabulator
• Harvard Mark 1
• University of Pennsylvania ENIAC
• Cambridge EDSAC
• MIT Whirlwind
• Cray 1
• TMC CM-2
• Intel Touchstone Delta
• Beowulf
• IBM Blue Gene/L
ENIAC
 Eckert and Mauchly.
 Vacuum tubes.
 Numerical solutions to problems in fields such as atomic energy and ballistics.
EDSAC
 Maurice Wilkes, 1949.
 Mercury delay lines for memory and vacuum tubes for logic.
 Used one of the first assemblers, called Initial Orders.
 Calculation of prime numbers and solutions of algebraic equations.
Whirlwind
 Jay Forrester, 1949.
 Fastest computer.
 First computer to use magnetic core memory.
 Displayed real time text and graphics on a large oscilloscope.
Cray 1
 Cray Research.
 Pipelined vector arithmetic units.
 Unique C-shape to help increase the signal speeds from one end to the other.
CM-2
 Thinking Machines Corporation, 1987.
 Hypercube architecture with 65,536 processors.
 Performance in the range of GFLOPS.
Intel Touchstone Delta
 Intel, 1990.
 MIMD hypercube.
 LINPACK rating of 13.9 GFLOPS.
 Enough computing power for applications like real-time processing of
satellite images and molecular models for AIDS research.
Beowulf
 Thomas Sterling and Donald Becker, 1994.
 Cluster formed of one head node and one or more compute nodes.
 Nodes and network dedicated to the cluster.
 Compute nodes are mass produced.
 Uses open source software, including Linux.
Earth Simulator
 Japan, 1997.
 Fastest supercomputer from 2002-2004: 35.86 TFLOPS.
 640 nodes with eight vector processors and 16 gigabytes of
computer memory at each node.
IBM Blue Gene/L
 IBM, 2004.
 First supercomputer ever to run over 100 TFLOPS sustained
on a real world application, namely a molecular dynamics
code (ddcMD).
 1975 – 1992
 Vector
◦ Cray-1&2, NEC SX,
Fujitsu VPP
◦ Maspar, CM-2
 Systolic
◦ Warp
 Dataflow
◦ Manchester, Sigma,
 Multithreaded
 Actor-based
◦ J-Machine
Cray 1
 1992 to present
 Killer Micro and mass market
 High density DRAM
 High cost of fab lines
◦ Message passing
 Economy of scale S-curve
 Weak scaling
◦ Gustafson et al
 Beowulf, NOW Clusters
 Ethernet, Myrinet
 Linux
 Automated calculating
◦ 17th century
 Stored program digital electronic
◦ 1948
 Vector
◦ 1975
◦ 1980s
 MPPs
◦ 1991
 Commodity Clusters
◦ 1993/4
 Multicore
◦ 2006
 Hybrid cluster solutions & services that fully leverage the performance
of accelerators
Demand for computing power is growing steadily, as scientists and
engineers seek to tackle increasingly complex problems. The emergence
of multi-core CPUs has allowed systems to keep pace with these demands, but
energy consumption, space, and cooling have become major inhibitors to
the expansion of computing systems. Hence the success of acceleration
technologies such as GPGPUs (General-Purpose Graphics Processing
Units), which offer both high performance and good space and
energy efficiency.
GPGPUs can accelerate processing by a factor of up to 100, depending on the workload.
 Starvation
◦ Not enough work to do due to insufficient parallelism or
poor load balancing among distributed resources
 Latency
◦ Waiting for access to memory or other parts of the system
 Overhead
◦ Extra work that has to be done to manage program
concurrency and parallel resources, beyond the real work you
want to perform
 Waiting for Contention
◦ Delays due to fighting over what task gets to use a shared
resource next. Network bandwidth is a major constraint.
Lecture 2-3/10
Programming of Graphic Processors
Olesia Barkovskaya, KHTURE, 2018
Electronic Computers Department
Simply put, in terms of architecture, a CPU is composed of a few large
Arithmetic Logic Unit (ALU) cores for general purpose processing, with lots
of cache memory and one large control module that can handle a few
software threads at a time. A CPU is optimized for serial operations, since its
clock rate is very high. A GPU, on the other hand, has many small ALUs,
small control modules, and a small cache. A GPU is optimized for parallel
operations.
A simple way to understand the difference between a GPU and a CPU is to compare how
they process tasks. A CPU consists of a few cores optimized for sequential serial processing
while a GPU has a massively parallel architecture consisting of thousands of smaller, more
efficient cores designed for handling multiple tasks simultaneously.
GPUs have thousands of cores to process parallel workloads efficiently
 CUDA: More mature, bigger ‘ecosystem’, NVIDIA only
 OpenCL: Vendor-independent, open industry standard
 Interfaces to C/C++, Fortran, Python, .NET, . . .
 Important: Hardware abstraction and
‘expressiveness’ are identical
Fourier Transforms
 CUFFT: NVIDIA, part of the CUDA toolkit
 APPML (formerly ACML-GPU): AMD Accelerated
Parallel Processing Math Libraries
Dense linear algebra
 CUBLAS: NVIDIA’s basic linear algebra subprograms
 APPML (formerly ACML-GPU): AMD Accelerated
Parallel Processing Math Libraries
 CULA: Third-party LAPACK, matrix decompositions
and eigenvalue problems
 MAGMA and PLASMA: BLAS/LAPACK for multicore
and manycore (ICL, Tennessee)
Ten years ago, when GPUs were first used to perform general-purpose computation, they were programmed using
low-level mechanisms such as the interrupt services of the BIOS, or by using graphics APIs such as OpenGL and
DirectX [16]. Later, programs for GPUs were developed in assembly language for each card model, and they had
very limited portability. So, high-level languages were developed to fully exploit the capabilities of GPUs. In
2007, NVIDIA introduced CUDA [25], a software architecture for managing the GPU as a parallel computing device
without requiring the data and the computation to be mapped onto a graphics API. CUDA is based on an extension of the C
language, and it is available for GeForce 8 Series graphics cards and later, using the 32-bit and 64-bit versions of
the Linux and Windows (XP and successors) operating systems.

Three software layers are used in CUDA to communicate with the GPU (see Figure 1): a low-level hardware driver
that performs the data communications between the CPU and the GPU, a high-level API, and a set of libraries that
includes CUBLAS for linear algebra calculations and CUFFT for Fourier transform calculations.

For the CUDA programmer, the GPU is a computing device that is able to execute a large number of threads in
parallel. A specific procedure to be executed many times over different data can be isolated in a GPU function using
many execution threads. The function is compiled using a specific set of instructions, and the resulting program,
named a kernel, is loaded onto the GPU. The GPU has its own DRAM, and the data are copied from the DRAM of the GPU
to the RAM of the host (and vice versa) using optimized calls to the CUDA API.

The CUDA architecture is built around a scalable array of multiprocessors, each one of them having eight scalar
processors, one multithreading unit, and a shared memory chip. The multiprocessors are able to create, manage, and
execute parallel threads with reduced overhead. The threads are grouped in blocks (with up to 512 threads), each of
which is executed on a single multiprocessor of the graphics card, and the blocks are grouped in grids. Each time
a CUDA program launches a grid on the GPU, each of the blocks in the grid is numbered and distributed to an
available multiprocessor.

When a multiprocessor receives one (or more) blocks to execute, it splits the threads into warps, sets of 32
consecutive threads. Each warp executes a single instruction at a time, so the best efficiency is achieved when the 32
threads in the warp execute the same instruction. Otherwise, the warp serializes the threads. Each time a
block finishes its execution, a new block is assigned to the available multiprocessor.

The threads are able to access data using three memory spaces: the shared memory of the block, which can be used by
the threads in the block; the local memory of the thread; and the global memory of the GPU. Minimizing access to the
slower memory spaces (the local memory of the thread and the global memory of the GPU) is very important for
achieving efficiency in GPU programming. The shared memory, on the other hand, is located on the GPU chip, so it
provides a faster way to store the data.
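The arithmetic behind the grid, block, and warp hierarchy can be sketched on the host side in plain Python. This is not CUDA code. It only mirrors the index calculation that a one-dimensional kernel typically performs with the CUDA built-ins `blockIdx.x * blockDim.x + threadIdx.x`. The helper names are hypothetical, and the block size of 512 and warp size of 32 are taken from the description above.

```python
WARP_SIZE = 32  # threads per warp, as described in the text


def global_thread_index(block_index: int, block_size: int, thread_index: int) -> int:
    """Index of a thread within the whole 1-D grid (block offset plus position in block)."""
    return block_index * block_size + thread_index


def blocks_needed(n_elements: int, block_size: int) -> int:
    """Number of blocks so that every element gets its own thread (rounded up)."""
    return (n_elements + block_size - 1) // block_size


def warps_per_block(block_size: int) -> int:
    """Number of warps a block is split into (a partial last warp still counts)."""
    return (block_size + WARP_SIZE - 1) // WARP_SIZE


def warp_of_thread(thread_index: int) -> int:
    """Which warp of its block a thread belongs to (32 consecutive threads per warp)."""
    return thread_index // WARP_SIZE


if __name__ == "__main__":
    n, block_size = 1000, 512  # hypothetical problem size and block size
    print("blocks in grid:", blocks_needed(n, block_size))
    print("warps per block:", warps_per_block(block_size))
    print("global index of thread 10 in block 1:",
          global_thread_index(1, block_size, 10))
```

Because the last block is usually only partly needed, a kernel written this way normally checks that its global index is below the number of elements before touching memory.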

Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
The Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptxThe Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptx
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
Palestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptxPalestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx

intro, definitions, basic laws+.pptx

  • 1. When Do We Need High Performance Computing? ► Case 1: To perform time-consuming operations in less time or before a tighter deadline. ► I am a bioinformatics engineer. ► I need to run computationally complex programs. ► I’d rather have the result in 5 minutes than in 5 days. ► Case 2: To perform a high number of operations per second. ► I am an engineer. ► My web server gets 1,000 hits per second. ► I’d like my web server and my databases to handle 1,000 transactions per second so that customers do not experience bad delays. Amazon “processes” several GBytes of data per second.
  • 2. What Does High Performance Computing Study? ► It includes the following subjects: • Hardware - Computer architecture - Network connections • Software - Programming paradigms - Languages - Middleware
  • 3. [Figure: timeline of computer performance on a logarithmic scale from 1 OPS through KiloOPS (10^3), MegaOPS (10^6), GigaOPS (10^9), and TeraOPS (10^12) to PetaOPS (10^15). Machines shown: 1823 Babbage Difference Engine, 1943 Harvard Mark 1, 1949 Edsac, 1951 Univac 1, 1959 IBM 7094, 1964 CDC 6600, 1976 Cray 1, 1982 Cray XMP, 1988 Cray YMP, 1991 Intel Delta, 1996 T3E, 1997 ASCI Red, 2001 Earth Simulator, 2003 Cray X1, 2006 BlueGene/L, 2009 Cray XT5.]
  • 4. INTERNATIONAL COMPETITION FOR HPC LEADERSHIP Nations and global regions, including China, the United States, Japan, and Russia, are racing ahead and have created national programs that are investing large sums of money to develop exascale supercomputers.
  • 5. Supercomputer: “A computing system exhibiting high-end performance capabilities and resource capacities within practical constraints of technology, cost, power, and reliability.” (Thomas Sterling, 2007) Supercomputer: “a large very fast mainframe used especially for scientific computations.” (Merriam-Webster Online) Supercomputer: “any of a class of extremely powerful computers. The term is commonly applied to the fastest high-performance systems available at any given time. Such computers are used primarily for scientific and engineering work requiring exceedingly high-speed computations.” (Encyclopedia Britannica Online)
  • 6. Research Applications of HPC • Finance • Sports and Entertainment • Weather Forecasting • Space Research • Health-Care-Related Applications: • unraveling the morphology of cancer cells, • diagnosing and treating cancers and improving the safety of cancer treatments, • medical research, • biomedicine, • bioinformatics, • epidemiology, • personalized medicine, including ‘Big Data’ aspects
  • 7. THE GLOBAL HPC MARKET [Chart: company share of global HPC revenues]
  • 8. THE GLOBAL HPC MARKET [Chart: countries' share of global HPC revenues]
  • 9. • Introduction - today’s lecture • System Architectures (Single Instruction - Single Data, Single Instruction - Multiple Data, Multiple Instruction - Multiple Data, Shared Memory, Distributed Memory, Cluster, Multiple Instruction - Single Data) • Performance Analysis of parallel calculations (speedup, efficiency, execution time of an algorithm…) • Parallel numerical methods (Principles of Parallel Algorithm Design, Analytical Modeling of Parallel Programs, Matrix Operations, Matrix-Vector Operations, Graph Algorithms…) • Software (Programming Using Compute Unified Device Architecture)
  • 10. In the first section we will discuss the importance of parallel computing to high performance computing. We will show the basic concepts of parallel computing and discuss its advantages and disadvantages. We will present an overview of current and future trends in HPC hardware. The second section will provide an introduction to parallel GPU implementations of numerical methods, such as matrix operations, matrix-vector operations, and graph algorithms. It will also present the application of parallel computing techniques on a Graphics Processing Unit (GPU) to improve the computational efficiency of numerical methods. The third section will briefly discuss important HPC topics such as computing on graphics processing units (GPUs) with CUDA, running relatively simple examples on this hardware. As tradition dictates, we will show how to write "Hello World" in CUDA. Some computational libraries available for HPC with GPUs will be highlighted.
  • 11. • What is the traditional programming view? • Why use parallel computing? • Motivation for parallelism (Moore’s law). • An overview of parallel processing • Parallelism in uniprocessor systems • Organization of multiprocessors • Flynn’s classification • System topologies • MIMD system architectures • Concepts and terminology.
  • 12. Parallelism is a method to improve computer system performance by executing two or more instructions simultaneously. • The goals of parallel processing: • One goal is to reduce the “wall-clock” time, the amount of real time that you need to wait for a problem to be solved. • Another goal is to solve bigger problems that might not fit in the limited memory of a single CPU.
  • 13. Consider your favorite computational application • One processor can give me results in N hours • Why not use N processors -- and get the results in just one hour?
  • 14. Parallel computing: the use of multiple computers or processors working together on a common task. • Each processor works on its section of the problem. • Processors are allowed to exchange information with other processors. [Figure: grid of a problem to be solved]
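
The idea that each processor works on its own section of the problem can be sketched as a block decomposition of a one-dimensional grid. This is only an illustration and is not taken from the slides: the helper name `block_range` and the sizes (10 grid points, 4 processors) are assumptions chosen for the example.

```python
def block_range(n_points, n_procs, rank):
    """Return the half-open index range [start, stop) owned by processor `rank`.

    Hypothetical helper: splits n_points grid points as evenly as possible,
    giving one extra point to the first (n_points % n_procs) processors.
    """
    base, extra = divmod(n_points, n_procs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# Example (assumed sizes): 10 grid points shared by 4 processors.
sections = [block_range(10, 4, rank) for rank in range(4)]
print(sections)  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Every grid point belongs to exactly one processor, and the section sizes differ by at most one point, which keeps the load roughly balanced.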
  • 15. Flynn’s Classification (Taxonomy) • Was proposed by researcher Michael J. Flynn in 1966. • It is the most commonly accepted taxonomy of computer organization. • In this classification, computers are classified by whether they process a single instruction at a time or multiple instructions simultaneously, and whether they operate on one or multiple data sets.
  • 16. The four categories of Flynn’s classification of multiprocessor systems by their instruction and data streams [Figure: simple diagrammatic representation]
  • 17. • SISD machines execute a single instruction on individual data values using a single processor. • Based on the traditional von Neumann uniprocessor architecture, instructions are executed sequentially or serially, one step after the next. • Until recently, most computers were of the SISD type.
  • 19. • An SIMD machine executes a single instruction on multiple data values simultaneously using many processors. • Since there is only one instruction, each processor does not have to fetch and decode each instruction. Instead, a single control unit does the fetching and decoding for all processors. • SIMD architectures include array processors.
  • 21. • MISD (Multiple Instruction - Single Data): this category does not actually exist. It was included in the taxonomy for the sake of completeness.
  • 22. • MIMD machines are usually referred to as multiprocessors or multicomputers. • In contrast to SIMD machines, they can execute multiple instructions simultaneously. • Each processor must include its own control unit; the processors can be assigned parts of a task or separate tasks. • MIMD has two subclasses: shared memory and distributed memory.
  • 23. Shared memory (UMA): all processors have equal access to memory and can talk via memory. Distributed memory: processors have access only to their local memory and “talk” to other processors over a network. Hybrid: shared memory nodes connected by a network.
  • 24. Hybrid Machines • Add special-purpose processors to normal processors • Not a new concept, but regaining traction. Example: our NVIDIA Tesla node (CUDA)
  • 25. • Shared memory and distributed memory machines have become mainstream. • Manycore architectures: GPUs used for computing (GPGPU) • High performance computing is almost synonymous with parallel computing
  • 26. • Finally, the architecture of a MIMD system, in contrast to its topology, refers to its connections to its system memory. • Systems may also be classified by their architectures. Two of these are: • Uniform memory access (UMA) • Nonuniform memory access (NUMA)
  • 27. • UMA is a type of symmetric multiprocessor, or SMP, that has two or more processors that perform symmetric functions. UMA gives all CPUs equal (uniform) access to all memory locations in shared memory. The processors interact with shared memory through a communications mechanism such as a simple bus or a complex multistage interconnection network.
  • 29. • NUMA architectures, unlike UMA architectures, do not allow uniform access to all shared memory locations. All processors can still access all shared memory locations, but in a nonuniform way: each processor can access its local shared memory more quickly than the memory modules that are not next to it.
  • 30. [Figure: Processor 1, Processor 2, ..., Processor n, each paired with its own memory (Memory 1, Memory 2, ..., Memory n) and connected through a communications mechanism.]
  • 31. An analogy for Flynn’s classification is the check-in desk at an airport: • SISD: a single desk • SIMD: many desks and a supervisor with a megaphone giving instructions that every desk obeys • MIMD: many desks working at their own pace, synchronized through a central database
  • 33. All parallel programs contain: • Parallel sections • Serial sections • Serial sections occur when work is being duplicated or no useful work is being done (waiting for others) • Serial sections limit the parallel effectiveness • If you have a lot of serial computation, you will not get good speedup • No serial work “allows” perfect speedup • Amdahl’s Law states this formally
  • 34. • Amdahl’s Law places a strict limit on the speedup that can be realized by using multiple processors. • Effect of multiple processors on run time: t_N = (fS + fp / N) · t_1 • Effect of multiple processors on speedup: S(N) = t_1 / t_N = 1 / (fS + fp / N) • Where • fS = serial fraction of code • fp = parallel fraction of code (fS + fp = 1) • N = number of processors • Perfect speedup: t_N = t_1 / N, or S(N) = N
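
As a small numerical illustration of Amdahl’s Law in its standard form, S(N) = 1 / (fS + fp / N) with fp = 1 - fS, the sketch below tabulates the speedup for a code with a serial fraction of 10%. The function name `amdahl_speedup` and the chosen serial fraction are assumptions made for this example.

```python
def amdahl_speedup(serial_fraction, n_procs):
    """Speedup predicted by Amdahl's Law: S(N) = 1 / (fS + fp / N)."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_procs)


# Assumed example: 10% of the run time is serial.
for n in (1, 2, 8, 64, 1024):
    print(f"N = {n:5d}  speedup = {amdahl_speedup(0.1, n):6.2f}")

# However many processors are added, the speedup stays below 1 / fS = 10.
```

With no serial work (fS = 0) the function returns N, which is the perfect speedup case.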
  • 36. • Amdahl’s Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communications. • In reality, communications will result in a further degradation of performance.
  • 37. Writing effective parallel applications is difficult • Communication can limit parallel efficiency • Serial time can dominate • Load balance is important Is it worth your time to rewrite your application? • Do the CPU requirements justify parallelization? • Will the code be used just once?
  • 38. Superlinear speedup, S(n) > n, may be seen on occasion, but usually this is due to using a suboptimal sequential algorithm or some unique feature of the architecture that favors the parallel formulation. One common reason for superlinear speedup is the extra cache in the multiprocessor system, which can hold more of the problem data at any instant and so leads to less traffic to the relatively slow main memory.
  • 39. Efficiency = (execution time using one processor) / (execution time using n processors × n). It is just the speedup divided by the number of processors: E = S(n) / n.
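
A minimal sketch of how speedup and efficiency are computed from measured run times, using the usual definitions S = t_1 / t_n and E = S / n. The timings below (120 s on one processor, 20 s on 8 processors) are made-up values for illustration only.

```python
def speedup(t_serial, t_parallel):
    """Speedup: run time on one processor divided by run time on n processors."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_procs):
    """Efficiency: speedup divided by the number of processors."""
    return speedup(t_serial, t_parallel) / n_procs


# Hypothetical timings: 120 s on one processor, 20 s on 8 processors.
s = speedup(120.0, 20.0)
e = efficiency(120.0, 20.0, 8)
print(f"speedup = {s:.2f}, efficiency = {e:.0%}")  # speedup = 6.00, efficiency = 75%
```

An efficiency of 75% means that, on average, a quarter of each processor's time is lost to serial work, communication, or idling.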
  • 40. Scalability is used to indicate a hardware design that allows the system to be increased in size and, in doing so, to obtain increased performance; this could be described as architecture or hardware scalability. Scalability is also used to indicate that a parallel algorithm can accommodate increased data items with a low and bounded increase in computational steps; this could be described as algorithmic scalability.
  • 41. Problem size: the number of basic steps in the best sequential algorithm for a given problem and data set size • Intuitively, we would think of the number of data elements being processed in the algorithm as a measure of size. • However, doubling the data set size would not necessarily double the number of computational steps. It will depend upon the problem. • For example, adding two matrices has this effect, but multiplying them does not: with the standard algorithm, two n × n matrices hold 2n² elements but need about n³ multiplications, so doubling the number of elements increases the operations by a factor of about 2.8. Note: bad sequential algorithms tend to scale well.
  • 42. • Latency • How long it takes a message to travel between nodes in the network. • Bandwidth • How much data can be moved per unit time. • Bandwidth is limited by the number of wires, the rate at which each wire can accept data, and choke points.
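
Latency and bandwidth are often combined in a simple first-order cost model, time = latency + size / bandwidth. This model is a common approximation and is not stated on the slide; the function name and the figures used (1 microsecond latency, 10 GB/s bandwidth) are assumptions for illustration.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order model: fixed start-up latency plus size divided by bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


LATENCY = 1e-6     # 1 microsecond per message (assumed)
BANDWIDTH = 10e9   # 10 GB/s (assumed)

for size in (8, 1_000, 1_000_000, 100_000_000):
    t = transfer_time(size, LATENCY, BANDWIDTH)
    latency_share = LATENCY / t
    print(f"{size:>11} bytes: {t * 1e6:10.2f} us, latency share {latency_share:.0%}")
```

Under these assumptions, small messages are dominated by latency and large messages by bandwidth, which is why sending a few large messages is usually cheaper than sending many small ones.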
  • 43. • For ultimate performance you may be concerned with how your nodes are connected. • Avoid communications between distant nodes. • For some machines it might be difficult to control or know the placement of applications.
  • 44. Topologies • A system may also be classified by its topology. • A topology is the pattern of connections between processors. • The cost-performance tradeoff determines which topologies to use for a multiprocessor system.
  • 45. A topology is characterized by its diameter, total bandwidth, and bisection bandwidth ◦ Diameter – the maximum distance between two processors in the computer system. ◦ Total bandwidth – the capacity of a communications link multiplied by the number of such links in the system. ◦ Bisection bandwidth – represents the maximum data transfer that could occur at the bottleneck in the topology.
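
The diameter and total bandwidth of a topology can be computed directly from its connection pattern. The sketch below does this for a ring of eight processors, chosen here only as an example; the function names, the ring size, and the unit link capacity are assumptions, and distance is measured as the number of links on the shortest path.

```python
from collections import deque


def ring(n):
    """Adjacency list of a ring of n processors (assumed example, n >= 3)."""
    return {p: [(p - 1) % n, (p + 1) % n] for p in range(n)}


def hops_from(graph, source):
    """Breadth-first search: number of links from `source` to every processor."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def diameter(graph):
    """Largest shortest-path distance between any two processors."""
    return max(max(hops_from(graph, p).values()) for p in graph)


def total_bandwidth(graph, link_capacity):
    """Link capacity times the number of links (each link is listed twice)."""
    n_links = sum(len(neighbors) for neighbors in graph.values()) // 2
    return link_capacity * n_links


g = ring(8)
print("diameter:", diameter(g))                                   # 4
print("total bandwidth:", total_bandwidth(g, link_capacity=1.0))  # 8.0
```

The same two functions work for any topology expressed as an adjacency list, so other connection patterns can be compared by swapping out `ring`.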
  • 47. Network Topologies • Shared Bus Topology ◦ Processors communicate with each other via a single bus that can handle only one data transmission at a time. ◦ In most shared buses, processors directly communicate with their own local memory. [Figure: processors (P), each with a local memory (M), attached to a shared bus together with a global memory.]
  • 48. • Ring Topology ◦ Uses direct connections between processors instead of a shared bus. ◦ Allows communication links to be active simultaneously, but data may have to travel through several processors to reach its destination. [Figure: six processors (P) connected in a ring.]
  • 49. • Tree Topology ◦ Uses direct connections between processors, each having three connections. ◦ There is exactly one path between any pair of processors. [Figure: seven processors (P) connected as a tree.]
  • 50. • Mesh Topology ◦ In the mesh topology, every processor connects to the processors above and below it, and to its right and left. [Figure: nine processors (P) connected in a mesh.]
  • 51. • Hypercube Topology ◦ Is a multiple mesh topology. ◦ Each processor connects to all other processors whose binary addresses differ in exactly one bit. For example, processor 0 (0000) connects to 1 (0001), 2 (0010), 4 (0100), and 8 (1000). [Figure: sixteen processors (P) connected as a hypercube.]
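
The rule that hypercube neighbors differ in exactly one bit maps directly onto a bitwise XOR. The short sketch below lists the neighbors of a processor in a 4-dimensional hypercube (16 processors); the function name `hypercube_neighbors` is an assumption made for this example.

```python
def hypercube_neighbors(node, dimensions):
    """Neighbors of `node`: flip each of the `dimensions` address bits in turn."""
    return [node ^ (1 << bit) for bit in range(dimensions)]


for node in (0, 5):
    neighbors = hypercube_neighbors(node, 4)
    print(f"{node:04b} ->", [f"{m:04b}" for m in neighbors])
# 0000 -> ['0001', '0010', '0100', '1000']
# 0101 -> ['0100', '0111', '0001', '1101']
```

Each processor in a d-dimensional hypercube therefore has exactly d links, and any message needs at most d hops, one per differing address bit.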
  • 52. • Completely Connected Topology ◦ Every processor has n-1 connections, one to each of the other processors. ◦ There is an increase in complexity as the system grows, but this offers maximum communication capabilities. [Figure: eight processors (P), each connected to all the others.]
  • 53. Moore's Law describes a long-term trend in the history of computing hardware, in which the number of transistors that can be placed inexpensively on an integrated circuit has doubled approximately every two years.
  • 54. • Mechanical Computing ◦ Babbage, Hollerith, Aiken • Electronic Digital Calculating ◦ Atanasoff, Eckert, Mauchly • von Neumann Architecture ◦ Turing, von Neumann, Eckert, Mauchly, Foster, Wilkes • Semiconductor Technologies • Birth of the Supercomputer ◦ Cray, Watanabe • The Golden Age ◦ Batcher, Dennis, S. Chen, Hillis, Dally, Blank, B. Smith • Common Era of Killer Micros ◦ Scott, Culler, Sterling/Becker, Goodhue, A. Chen, Tomkins • Petaflops ◦ Messina, Sterling, Stevens, P. Smith
  • 55. Historical Machines • Leibniz Stepped Reckoner • Babbage Difference Engine • Hollerith Tabulator • Harvard Mark 1 • University of Pennsylvania Eniac • Cambridge Edsac • MIT Whirlwind • Cray 1 • TMC CM-2 • Intel Touchstone Delta • Beowulf • IBM Blue Gene/L
  • 56. Eniac • Eckert and Mauchly, 1946. • Vacuum tubes. • Numerical solutions to problems in fields such as atomic energy and ballistic trajectories.
  • 57. Edsac • Maurice Wilkes, 1949. • Mercury delay lines for memory and vacuum tubes for logic. • Used one of the first assemblers, called Initial Orders. • Calculation of prime numbers, solutions of algebraic equations, etc.
  • 58. Whirlwind • Jay Forrester, 1949. • Fastest computer of its time. • First computer to use magnetic core memory. • Displayed real-time text and graphics on a large oscilloscope screen.
  • 59. Cray 1 • Cray Research, 1976. • Pipelined vector arithmetic units. • Unique C-shape to help increase the signal speeds from one end to the other.
  • 60. CM-2 • Thinking Machines Corporation, 1987. • Hypercube architecture with 65,536 processors. • SIMD. • Performance in the range of GFLOPS.
  • 61. Touchstone Delta • Intel, 1990. • MIMD hypercube. • LINPACK rating of 13.9 GFLOPS. • Enough computing power for applications like real-time processing of satellite images and molecular models for AIDS research.
  • 62. Beowulf • Thomas Sterling and Donald Becker, 1994. • Cluster formed of one head node and one or more compute nodes. • Nodes and network dedicated to the Beowulf. • Compute nodes are mass-produced commodities. • Uses open source software, including Linux.
  • 63. Earth Simulator • Japan, 1997. • Fastest supercomputer from 2002 to 2004: 35.86 TFLOPS. • 640 nodes with eight vector processors and 16 gigabytes of computer memory at each node.
  • 64. Blue Gene/L • IBM, 2004. • First supercomputer ever to run over 100 TFLOPS sustained on a real-world application, namely a three-dimensional molecular dynamics code (ddcMD).
  • 65. • 1975 – 1992 • Vector ◦ Cray-1 & 2, NEC SX, Fujitsu VPP • SIMD ◦ Maspar, CM-2 • Systolic ◦ Warp • Dataflow ◦ Manchester, Sigma, Monsoon • Multithreaded ◦ HEP, MTA • Actor-based ◦ J-Machine [Figure: 1976 Cray 1]
  • 66. • 1992 to present • Killer Micro and mass market PCs • High density DRAM • High cost of fab lines • CSP ◦ Message passing • Economy of scale S-curve • MPP • Weak scaling ◦ Gustafson et al • Beowulf, NOW Clusters • MPI • Ethernet, Myrinet • Linux
  • 67. • Automated calculating ◦ 17th century • Stored program digital electronic ◦ 1948 • Vector ◦ 1975 • SIMD ◦ 1980s • MPPs ◦ 1991 • Commodity Clusters ◦ 1993/4 • Multicore ◦ 2006
  • 68. • Hybrid cluster solutions & services that fully leverage the performance of accelerators. Demand for computing power is growing steadily, as scientists & engineers seek to tackle increasingly complex problems. The emergence of multi-core CPUs has allowed systems to keep pace with these demands, but energy consumption, space, & cooling have become major inhibitors to the expansion of computing systems. Hence the success of acceleration technologies such as GPGPUs (General-Purpose Graphics Processing Units), which offer both breakthrough performance & outstanding space & energy efficiency. GPGPUs can accelerate processing by a factor of 1 to 100!
  • 69. • Starvation ◦ Not enough work to do, due to insufficient parallelism or poor load balancing among distributed resources • Latency ◦ Waiting for access to memory or other parts of the system • Overhead ◦ Extra work that has to be done to manage program concurrency and parallel resources, on top of the real work you want to perform • Waiting for Contention ◦ Delays due to fighting over which task gets to use a shared resource next. Network bandwidth is a major constraint.
  • 70. Lecture 2-3/10: Programming of Graphic Processors. Olesia Barkovskaya, KHTURE, 2018, Electronic Computers Department
  • 71. Simply put, in architectural terms a CPU is composed of a few large Arithmetic Logic Unit (ALU) cores for general-purpose processing, with lots of cache memory and one large control module that can handle a few software threads at a time. The CPU is optimized for serial operations, since its clock rate is very high. The GPU, on the other hand, has many small ALUs, small control modules, and a small cache. The GPU is optimized for parallel operations.
  • 72. A simple way to understand the difference between a GPU and a CPU is to compare how they process tasks. A CPU consists of a few cores optimized for sequential serial processing, while a GPU has a massively parallel architecture consisting of thousands of smaller, more efficient cores designed for handling multiple tasks simultaneously. GPUs have thousands of cores to process parallel workloads efficiently.
  • 73. Environments • CUDA: more mature, bigger ‘ecosystem’, NVIDIA only • OpenCL: vendor-independent, open industry standard • Interfaces to C/C++, Fortran, Python, .NET, ... • Important: hardware abstraction and ‘expressiveness’ are identical
  • 74. Fourier transforms • CUFFT: NVIDIA, part of the CUDA toolkit • APPML (formerly ACML-GPU): AMD Accelerated Parallel Processing Math Libraries. Dense linear algebra • CUBLAS: NVIDIA’s basic linear algebra subprograms • APPML (formerly ACML-GPU): AMD Accelerated Parallel Processing Math Libraries • CULA: third-party LAPACK, matrix decompositions and eigenvalue problems • MAGMA and PLASMA: BLAS/LAPACK for multicore and manycore (ICL, Tennessee)
  • 75. Ten years ago, when GPUs were first used to perform general-purpose computation, they were programmed using low-level mechanisms such as the interruption services of the BIOS, or by using graphic APIs such as OpenGL and DirectX [16]. Later, programs for the GPU were developed in assembly language for each card model, and they had very limited portability. So, high-level languages were developed to fully exploit the capabilities of the GPUs. In 2007, NVIDIA introduced CUDA [25], a software architecture for managing the GPU as a parallel computing device without requiring the data and the computation to be mapped onto a graphic API. CUDA is based on an extension of the C language, and it is available for graphic cards of the GeForce 8 Series and later, using the 32- and 64-bit versions of the Linux and Windows (XP and successors) operating systems. Three software layers are used in CUDA to communicate with the GPU (see Figure 1): a low-level hardware driver that performs the data communications between the CPU and the GPU, a high-level API, and a set of libraries that includes CUBLAS for linear algebra calculations and CUFFT for Fourier transform calculations. For the CUDA programmer, the GPU is a computing device that is able to execute a large number of threads in parallel. A specific procedure to be executed many times over different data can be isolated in a GPU function using many execution threads. The function is compiled using a specific set of instructions, and the resulting program, named a kernel, is loaded onto the GPU. The GPU has its own DRAM, and the data are copied from the DRAM of the GPU to the RAM of the host (and vice versa) using optimized calls to the CUDA API. The CUDA architecture is built around a scalable array of multiprocessors, each one having eight scalar processors, one multithreading unit, and a shared memory chip. The multiprocessors are able to create, manage, and execute parallel threads with reduced overhead.
The threads are grouped in blocks (with up to 512 threads), each of which is executed on a single multiprocessor of the graphic card, and the blocks are grouped in grids. Each time a CUDA program calls a grid to be executed on the GPU, each of the blocks in the grid is numbered and distributed to an available multiprocessor. When a multiprocessor receives one (or more) blocks to execute, it splits the threads into warps, sets of 32 consecutive threads. Each warp executes a single instruction at a time, so the best efficiency is achieved when the 32 threads in the warp execute the same instruction. Otherwise, the warp serializes the threads. Each time a block finishes its execution, a new block is assigned to the available multiprocessor. The threads are able to access the data using three memory spaces: the shared memory of the block, which can be used by the threads in the block; the local memory of the thread; and the global memory of the GPU. Minimizing access to the slower memory spaces (the local memory of the thread and the global memory of the GPU) is very important for achieving efficiency in GPU programming. On the other hand, the shared memory is placed within the GPU chip, so it provides a faster way to store the data.
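
The grouping of threads into blocks, grids, and warps can be mimicked on the CPU to see how the indices relate. The sketch below is plain Python, not CUDA code: it only reproduces the index arithmetic, following the usual CUDA convention that a thread's global index in a one-dimensional launch is blockIdx.x * blockDim.x + threadIdx.x. The function names and the launch sizes (2 blocks of 96 threads) are assumptions chosen for illustration.

```python
WARP_SIZE = 32  # threads per warp, as stated in the text


def thread_layout(n_blocks, threads_per_block):
    """Yield (block, thread, warp, global_index) for a 1-D grid of 1-D blocks."""
    for block in range(n_blocks):
        for thread in range(threads_per_block):
            warp = thread // WARP_SIZE  # warp number inside the block
            global_index = block * threads_per_block + thread
            yield block, thread, warp, global_index


def warps_per_block(threads_per_block):
    """Number of warps needed for one block (the last warp may be partly empty)."""
    return (threads_per_block + WARP_SIZE - 1) // WARP_SIZE


# Assumed launch configuration: a grid of 2 blocks with 96 threads each.
layout = list(thread_layout(n_blocks=2, threads_per_block=96))
print(len(layout), "threads,", warps_per_block(96), "warps per block")
print(layout[100])  # (1, 4, 0, 100): block 1, thread 4, warp 0, global index 100
```

Choosing a block size that is a multiple of 32 avoids partly empty warps; with 100 threads per block, for example, the fourth warp of each block would carry only 4 active threads.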