UBa/NAHPI-2020
Department of Computer
Engineering
PARALLEL AND DISTRIBUTED
COMPUTING
By
Malobe LOTTIN Cyrille .M
Network and Telecoms Engineer
PhD Student- ICT–U USA/CAMEROON
Contact
Email:malobecyrille.marcel@ictuniversity.org
Phone:243004411/695654002
CHAPTER 2
Parallel and Distributed Computer
Architectures, Performance Metrics
And Parallel Programming Models
Previous … Chap 1: General Introduction (Parallel and Distributed Computing)
CONTENTS
• INTRODUCTION
• Why Parallel Architecture?
• Modern Classification of Parallel Computers
• Structural Classification of Parallel Computers
• Parallel Computers Memory Architectures
• Hardware Classification
• Performance of Parallel Computer Architectures
- Peak and Sustained Performance
• Measuring Performance of Parallel Computers
• Other Common Benchmarks
• Parallel Programming Models
- Shared Memory Programming Model
- Thread Model
- Distributed Memory
- Data Parallel
- SPMD/MPMD
• Conclusion
Exercises ( Check your Progress, Further Reading and Evaluation)
Previously on Chap 1
 Part 1- Introducing Parallel and Distributed Computing
• Background Review of Parallel and Distributed Computing
• INTRODUCTION TO PARALLEL AND DISTRIBUTED COMPUTING
• Some key terminologies
• Why parallel Computing?
• Parallel Computing: the Facts
• Basic Design Computer Architecture: the von Neumann Architecture
• Classification of Parallel Computers (SISD,SIMD,MISD,MIMD)
• Assignment 1a
 Part 2- Initiation to Parallel Programming Principles
• High Performance Computing (HPC)
• Speed: a need to solve Complexity
• Some Case Studies Showing the need of Parallel Computing
• Challenge of explicit Parallelism
• General Structure of Parallel Programs
• Introduction to the Amdahl's LAW
• The GUSTAFSON’s LAW
• SCALABILITY
• Fixed Size Versus Scale Size
• Assignment 1b
• Conclusion
INTRODUCTION
• Parallel computer architecture is the method of organizing and maximizing the use of computer resources to achieve maximum performance.
- Performance, at any instant in time, is achievable only within the limits given by the technology.
- The same system may be characterized both as "parallel" and
"distributed"; the processors in a typical distributed system run
concurrently in parallel.
• The use of more processors to compute tasks simultaneously contributes to providing greater capability to computer systems.
• In a parallel architecture, processors may have access to a shared memory during computation in order to exchange information between them.
Image source: Wikipedia, Distributed Computing, 2020
• In a distributed architecture, each processor makes use of its own private memory (distributed memory) during computation. In this case, information is exchanged by passing messages between the processors.
• Significant characteristics of distributed systems are: concurrency of components, lack of a global clock (clock synchronization), and independent failure of components.
• The use of distributed systems to solve computational problems is called distributed computing (the problem is divided into many tasks, each handled by one or more computers, which communicate with each other via message passing).
• High-performance parallel computation on a shared-memory multiprocessor uses parallel algorithms, while the coordination of a large-scale distributed system uses distributed algorithms.
INTRODUCTION
Image source: Wikipedia, Distributed Computing, 2020
• Parallelism is nowadays present at all levels of computer architecture.
• Enhancements in processors largely explain the success in the development of parallelism.
• Today, processors are superscalar (they execute several instructions in parallel each clock cycle).
- Besides, the underlying Very Large-Scale Integration (VLSI) technology has advanced, allowing larger and larger numbers of components to fit on a chip and clock rates to increase.
• Three main elements define the structure and performance of a multiprocessor:
- Processors
- Memory Hierarchies (registers, cache, main memory, magnetic discs, magnetic tapes)
- Interconnection Network
• But the performance gap between the processor and the memory is still increasing.
• Parallelism is used by computer architecture to translate the raw potential of
the technology into greater performance and expanded capability of the
computer system
• Diversity in parallel computer architecture makes the field challenging to learn
and challenging to present.
INTRODUCTION ( Cont…)
Remember that:
A parallel computer is a collection of processing elements that
cooperate and communicate to solve large problems fast.
• The attempt to solve these large problems raises some fundamental questions that can only be answered by understanding:
- The various components of parallel and distributed systems (design and operation),
- How large a problem a given parallel and distributed system can solve,
- How processors cooperate and communicate/transmit data between them,
- The primitive abstractions that the hardware and software provide to the programmer for better control,
- And how to ensure a proper translation into performance once these elements are under control.
INTRODUCTION (Cont…)
Why Parallel Architecture?
• No matter the performance of a single processor at a given time, we can in principle achieve higher performance by utilizing many such processors, as long as we are ready to pay the price (cost).
Parallel Architecture is needed to:
 Respond to Application Trends
• Advances in hardware capability enable new application functionality, which in turn drives parallel architecture harder, since parallel architecture focuses on the most demanding of these applications.
• At the low end we have the largest volume of machines and the greatest number of users; at the high end, the most demanding applications.
• Consequence: pressure for increased performance; the most demanding applications must be written as parallel programs to respond to this demand generated from the high end.
 Satisfy the need for high-performance computing in the field of computational science and engineering
- A response to the need to simulate physical phenomena that are impossible or very costly to observe through empirical means (modeling global climate change over long periods, the evolution of galaxies, the atomic structure of materials, etc.)
 Respond to Technology Trends
• Can’t “wait for the single processor to get fast enough ”
Respond to Architectural Trends
• Advances in technology determine what is possible; architecture
translates the potential of the technology into performance and
capability .
• Four generations of computer architectures (tubes, transistors, integrated circuits, and VLSI), where the strong distinction between them is a function of the type of parallelism implemented (bit-level parallelism: from 4 bits to 64 bits; 128 bits is the future).
• There have been tremendous architectural advances over this period: bit-level parallelism, instruction-level parallelism, thread-level parallelism.
All these forces driving the development of parallel architectures can be summed up in one main quest: achieving absolute maximum performance (supercomputing).
Why Parallel Architecture ? (Cont …)
Modern Classification
According to (Sima, Fountain, Kacsuk)
Before modern classification,
Recall Flynn’s taxonomy classification of Computers
- based on the number of instructions that can be executed and how they operate on data.
Four Main Types:
• SISD: traditional sequential architecture
• SIMD: processor arrays, vector processor
• Parallel computing on a budget – reduced control unit cost
• Many early supercomputers
• MIMD: most general purpose parallel computer today
• Clusters, MPP, data centers
• MISD: not a general purpose architecture
Note: Globally, four types of parallelism are implemented:
- Bit-Level Parallelism: performance of processors based on word size (bits)
- Instruction-Level Parallelism: gives processors the ability to execute more than one instruction per clock cycle
- Task Parallelism: characterizes parallel programs
- Superword-Level Parallelism: based on vectorization techniques
Computer Architectures: SISD, SIMD, MIMD, MISD
• Classification here is based on how parallelism is achieved
• by operating on multiple data: Data parallelism
• by performing many functions in parallel: Task parallelism (function)
• Control parallelism or task parallelism, depending on the level of the functional parallelism.
Modern Classification
According to (Sima, Fountain, Kacsuk)
Parallel architectures are divided into data-parallel architectures and function-parallel architectures.

Function-parallel architectures:
- Different operations are performed on the same or on different data
- Asynchronous computation
- Speedup is smaller, as each processor executes a different thread or process on the same or a different set of data
- The amount of parallelization is proportional to the number of independent tasks to be performed
- Load balancing depends on the availability of the hardware and on scheduling algorithms such as static and dynamic scheduling
- Applicability: pipelining

Data-parallel architectures:
- The same operations are performed on different subsets of the same data
- Synchronous computation
- Speedup is larger, as there is only one execution thread operating on all sets of data
- The amount of parallelization is proportional to the input data size
- Designed for optimum load balance on a multiprocessor system
- Applicability: arrays, matrices
• Flynn’s classification Focus on the behavioral aspect of computers .
• Looking at the structure, Parallel computers can be classified based on a focus on
how processors communicate with the memory.
 When multiprocessors communicate through the global shared memory modules
then this organization is called Shared memory computer or Tightly
 when every processor in a multiprocessor system, has its own local memory and
the processors communicate via messages transmitted between their local memories,
then this organization is called Distributed memory computer or Loosely coupled system
Structural Classification of Parallel Computers
Parallel Computer Memory Architectures
Shared Memory Parallel Computer Architecture
- Processors can access all memory as global
address space
- Multi-processors can operate independently but
share the same memory resources
- Changes in a memory location effected by one
processor are visible to all other processors
Based on memory access time, we can
classify Shared memory Parallel Computers into
two:
 Uniform Memory Access (UMA)
 Non-Uniform Memory Access (NUMA)
Parallel Computer Memory Architectures (Cont…)
 Uniform Memory Access (UMA) (sometimes called Cache-Coherent UMA, CC-UMA)
• Commonly represented today by Symmetric Multiprocessor (SMP) machines
• Identical processors
• Equal access and access times to memory
Note: Cache coherence is a hardware mechanism whereby any update of a location in shared memory by one processor is announced to all the other processors.
Source: Images retrieved from https://computing.llnl.gov/tutorials/parallel_comp/#SharedMemory
Non-Uniform Memory Access (NUMA)
• The architecture often links two or more SMPs, such that:
- One SMP can directly access the memory of another SMP
- Not all processors have equal access time to all memories
- Memory access across the link is slower
Note: if cache coherence is maintained, this is also called Cache-Coherent NUMA (CC-NUMA).
• The proximity of memory to CPUs on a shared memory parallel computer makes data sharing between tasks fast and uniform.
• But there is a lack of scalability between memory and CPUs.
Parallel Computer Memory Architectures (Cont…)
Source: Images retrieved from https://computing.llnl.gov/tutorials/parallel_comp/#SharedMemory
Bruce Jacob, ... David T. Wang, in Memory Systems, 2008
Parallel Computer Memory Architectures (Cont…)
 Distributed Memory Parallel Computer Architecture
• Comes in different varieties, like the shared memory computer.
• Requires a communication network to connect inter-processor memory.
- Each processor operates independently with its own local memory
- Changes made by individual processors do not affect the memory of other processors
- Cache coherency does not apply here!
• Access to data in another processor is usually the task of the programmer (who must explicitly define how and when data is communicated).
• This architecture is cost effective (it can use commodity, off-the-shelf processors and networking).
• But the programmer bears more responsibility for data communication between processors.
Source: Retrieved from https://www.futurelearn.com/courses/supercomputing/0/steps/24022
Parallel Computer Memory Architectures (Cont…)
Source: Nikolaos Ploskas, Nikolaos Samaras, in GPU Programming in MATLAB, 2016
Parallel Computer Memory Architectures (Cont…)
Overview of Parallel Memory Architecture
Note:
- The largest and fastest computers in the world today employ both shared and distributed memory architectures (hybrid memory).
- In a hybrid design, the shared memory component can be a shared memory machine and/or graphics processing units (GPUs).
- The distributed memory component is the networking of multiple shared memory/GPU machines.
- This type of memory architecture is expected to continue to prevail and increase.
• Parallel computers can be roughly classified according to the level
at which the hardware in the parallel architecture supports
parallelism.
Hardware Classification
 Multicore Computing
 Symmetric Multiprocessing (tightly coupled multiprocessing)
- Made of a computer system with multiple identical processors that share memory and connect via a bus
- Does not comprise more than 32 processors, in order to minimize bus contention
- Symmetric multiprocessors are extremely cost-effective
Retrieved from https://en.wikipedia.org/wiki/Parallel_computing#Bit-level_parallelism, 2020
- The processor includes multiple processing units (called "cores") on the same chip
- Issues multiple instructions per clock cycle from multiple instruction streams
- Differs from a superscalar processor; but each core in a multi-core processor can potentially be superscalar as well
Superscalar: issues multiple instructions per clock cycle from one instruction stream (thread).
- Example: IBM's Cell microprocessor, used in the Sony PlayStation 3
Hardware Classification (Cont…)
 Distributed Computing (distributed memory multiprocessor)
 Cluster Computing
• Not to be confused with decentralized computing
- the allocation of resources (hardware + software) to individual workstations
• Components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another
• Components interact to achieve a common goal
• Characterized by concurrency of components, lack of a global clock, and independent failure of components
• Can include heterogeneous computations, where some nodes perform a lot more computation, some perform very little computation, and a few others perform specialized functionality
• Example: a multiplayer online game
• Loosely coupled computers that work together closely
• In some respects they can be regarded as a single computer
• Multiple standalone machines constitute a cluster, connected by a network
• Computer clusters have each node set to perform the same task, controlled and scheduled by software
• Computer clustering relies on a centralized management approach which makes the nodes available as orchestrated shared servers
• Example: IBM's Sequoia
Sources: Dinkar Sitaram, Geetha Manjunath, in Moving To The Cloud, 2012;
Cisco Systems, 2003
PERFORMANCE METRICS
Performance of parallel architectures
 Various ways to measure the performance of a parallel algorithm running
on a parallel processor.
 The most commonly used measurements are:
- Speed-up
- Efficiency / Isoefficiency
- Elapsed time (a very important factor)
- Price/performance (elapsed time for a program divided by the cost of the machine that ran the job)
Note: none of these metrics should be used independently of the run time of the parallel system.
 Common metrics of Performance
• FLOPS and MIPS are units of measure for the numerical computing performance of a
computer
• Distributed computing uses the Internet to link personal computers to achieve more
FLOPS
- MIPS: millions of instructions per second
MIPS = instruction count / (execution time × 10^6)
- MFLOPS: millions of floating point operations per second
MFLOPS = FP operations in program / (execution time × 10^6)
• Which metric is better?
• FLOP count is more closely tied to the actual work of a numerical code; the number of FLOPs per program is determined by, for example, the matrix size (a small worked example follows).
See Chapter 1
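As a small worked illustration of the two formulas above (the instruction count, floating point operation count and execution time below are made-up numbers, not measurements from any real program):

```cpp
#include <iostream>

int main() {
    // Hypothetical measurements, for illustration only.
    double instruction_count = 4.0e9;  // instructions executed by the program
    double fp_ops            = 1.5e9;  // floating point operations in the program
    double exec_time_s       = 2.0;    // execution time in seconds

    // Direct application of the two formulas above.
    double mips   = instruction_count / (exec_time_s * 1.0e6);
    double mflops = fp_ops            / (exec_time_s * 1.0e6);

    std::cout << "MIPS   = " << mips   << "\n";  // 2000
    std::cout << "MFLOPS = " << mflops << "\n";  // 750
    return 0;
}
```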
“In June 2020, Fugaku turned in a High Performance Linpack (HPL) result
of 415.5 petaFLOPS, besting the now second-place Summit system by a
factor of 2.8x. Fugaku is powered by Fujitsu’s 48-core A64FX SoC,
becoming the first number one system on the list to be powered by ARM
processors. In single or further reduced precision, used in machine learning
and AI applications, Fugaku’s peak performance is over 1,000 petaflops (1
exaflops). The new system is installed at RIKEN Center for Computational
Science (R-CCS) in Kobe, Japan ” (wikipedia Flops, 2020).
Performance of parallel architectures
[Figure: single-CPU performance over time, annotated "Here we are!" and "The future"]
Peak and sustained performance
Peak performance
• Measured in MFLOPS
• Highest possible MFLOPS when the system does nothing but
numerical computation
• Rough hardware measure
• Little indication on how the system will perform in practice.
Peak Theoretical Performance
• Node performance in GFlops = (CPU speed in GHz) × (number of CPU cores) × (CPU instructions per cycle) × (number of CPUs per node)
(a worked sketch follows after the sustained performance notes below)
Peak and sustained performance
• Sustained performance
• The MFLOPS rate that a program achieves over the entire run.
• Measuring sustained performance
• Using benchmarks
• Peak MFLOPS is usually much larger than sustained MFLOPS
• Efficiency rate = sustained MFLOPS / peak MFLOPS
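A minimal sketch applying the peak-performance formula and the efficiency rate above; the node configuration and the sustained benchmark result are assumptions chosen only for illustration:

```cpp
#include <iostream>

int main() {
    // Hypothetical node configuration (all numbers are assumptions).
    double clock_ghz       = 2.5;  // CPU speed in GHz
    int    cores_per_cpu   = 16;   // number of CPU cores
    int    flops_per_cycle = 8;    // CPU (floating point) instructions per cycle
    int    cpus_per_node   = 2;    // number of CPUs per node

    // Peak theoretical node performance, as in the formula above.
    double peak_gflops = clock_ghz * cores_per_cpu * flops_per_cycle * cpus_per_node;

    // Assume a benchmark sustained 400 GFLOPS over its entire run.
    double sustained_gflops = 400.0;
    double efficiency = sustained_gflops / peak_gflops;

    std::cout << "Peak:       " << peak_gflops << " GFLOPS\n";    // 640 GFLOPS
    std::cout << "Efficiency: " << efficiency * 100.0 << " %\n";  // 62.5 %
    return 0;
}
```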
Measuring the performance of
parallel computers
• Benchmarks: programs that are used to measure the
performance.
• LINPACK benchmark: a measure of a system’s floating point
computing power
• Solving a dense N by N system of linear equations Ax=b
• Used to rank supercomputers in the TOP500 list.
No. 1 since June 2020:
Fugaku is powered by Fujitsu's 48-core A64FX SoC, becoming the first number-one system on the list to be powered by ARM processors.
Other common benchmarks
• Micro benchmark suites
• Numerical computing
• LAPACK
• ScaLAPACK
• Memory bandwidth
• STREAM
• Kernel benchmarks
• NPB (NAS parallel benchmark)
• PARKBENCH
• SPEC
• Splash
PARALLEL PROGRAMMING MODELS
A programming perspective of Parallelism implementation in parallel
and distributed Computer architectures
Parallel Programming Models
Parallel programming models exist as an abstraction above hardware
and memory architectures.
 There are commonly several parallel programming models used
• Shared Memory (without threads)
• Threads
• Distributed Memory / Message Passing
• Data Parallel
• Hybrid
• Single Program Multiple Data (SPMD)
• Multiple Program Multiple Data (MPMD)
 These models are NOT specific to a particular type of machine or
memory architecture (a given model can be implemented on any
underlying hardware).
Example: a SHARED memory model on a DISTRIBUTED memory machine (machine memory is physically distributed across networked machines, but appears at the user level as a single shared memory global address space, as in the Kendall Square Research (KSR) ALLCACHE approach).
Which model to use?
There is no "best" model.
However, there are certainly better implementations of some models than of others.
Parallel Programming Models
Shared Memory Programming Model
(Without Threads)
• A thread is the basic unit to which the operating system allocates processor time; it is the smallest sequence of programmed instructions that can be managed independently.
• In a shared memory programming model,
- Processes/tasks share a common address space, which they read and write to asynchronously.
- They make use of mechanisms such as locks/semaphores to control access to the shared memory, resolve contention, and prevent race conditions and deadlocks.
• This may be considered the simplest parallel programming model.
• Note: locks, mutexes and semaphores are types of synchronization objects in a shared-resource environment. They are abstract concepts.
- A lock protects access to some kind of shared resource and grants the right to access the protected shared resource when owned.
For example, if you have a lockable object ABC you may:
- acquire the lock on ABC,
- take the lock on ABC,
- lock ABC,
- take ownership of ABC, or relinquish ownership of ABC when no longer needed.
- Mutex (MUTual EXclusion): a lockable object that can be owned by exactly one thread at a time.
• Example: in C++, std::mutex, std::timed_mutex, std::recursive_mutex
- Semaphore: a very relaxed type of lockable object, with a predefined maximum count and a current count.
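A minimal C++ sketch of the lock/mutex idea above, using the std::mutex mentioned in the example; the shared counter, thread count and iteration count are arbitrary choices for illustration:

```cpp
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

std::mutex counter_lock;  // protects the shared counter
long counter = 0;         // shared resource

void worker(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::lock_guard<std::mutex> guard(counter_lock);  // acquire the lock (released at scope exit)
        ++counter;                                         // safe: only one thread owns the lock here
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back(worker, 100000);
    for (auto& th : threads)
        th.join();
    std::cout << "Final counter = " << counter << "\n";  // 400000, no race condition
    return 0;
}
```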
Shared Memory Programming Model (Cont…)
Advantages:
• No need to explicitly specify the communication of data between tasks, so there is no notion of data "ownership" to implement. Very advantageous for a programmer.
• All processes see and have equal access to shared memory.
• Open to simplification during the development of the program.
Disadvantages:
• It becomes more difficult to understand and manage data locality.
• Keeping data local to a given process conserves memory accesses, cache refreshes and bus traffic, but controlling data locality is hard to understand and may be beyond the control of the average user.
Shared Memory Programming Model (Cont…)
During Implementation,
• Case: stand-alone shared memory machines
- native operating systems, compilers and/or hardware provide support for
shared memory programming. E.g. POSIX standard provides an API for using shared memory.
• Case: distributed memory machines:
- memory is physically distributed across a network of machines, but made
global through specialized hardware and software
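As a hedged sketch of the POSIX shared memory API mentioned above (the object name and region size are arbitrary; error handling is kept minimal, and on some systems linking with -lrt is required):

```cpp
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    const char* name = "/demo_region";  // hypothetical shared memory object name
    const size_t size = 4096;

    // Create (or open) a named shared memory object and size it.
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { std::perror("shm_open"); return 1; }
    if (ftruncate(fd, size) != 0) { std::perror("ftruncate"); return 1; }

    // Map it into this process's address space.
    void* mem = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { std::perror("mmap"); return 1; }

    // Any cooperating process that maps the same name sees this data.
    std::strcpy(static_cast<char*>(mem), "hello from shared memory");

    munmap(mem, size);
    close(fd);
    shm_unlink(name);  // remove the object once no longer needed
    return 0;
}
```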
• This is a type of shared memory programming.
• Here, a single "heavy weight" process can have multiple "light weight",
concurrent execution paths.
• To understand this model, let us consider the execution of a main
program a.out , scheduled to run by the native operating system.
Thread Model
 a.out starts by loading and acquiring all of the necessary system and user resources to run. This constitutes the "heavy weight" process.
 a.out performs some serial work, and then creates a number of tasks (threads) that can be scheduled and run by the operating system concurrently.
 Each thread has local data, but also shares the entire resources of a.out ("light weight") and benefits from a global memory view because it shares the memory space of a.out.
 Synchronization/coordination is needed to ensure that no more than one thread updates the same global address at any time.
• During implementation, threads implementations commonly comprise:
 A library of subroutines that are called from within parallel source code
 A set of compiler directives embedded in either serial or parallel source code
Note: often, the programmer is responsible for determining the parallelism.
• Unrelated standardization efforts have resulted in two very different
implementations of threads:
- POSIX Threads
* Specified by the IEEE POSIX 1003.1c standard (1995). C Language only, Part of Unix/Linux operating systems and
Very explicit parallelism--requires significant programmer attention to detail.
- OpenMP ( Used for Tutorial in the context of this course).
* Industry standard, Compiler directive based Portable / multi-platform, including Unix and Windows
platforms, available in C/C++ and Fortran implementations, Can be very easy and simple to use - provides for
"incremental parallelism". Can begin with serial code.
Others include: - Microsoft threads
- Java, Python threads
- CUDA threads for GPUs
Thread Model (Cont…)
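Since OpenMP is the threading implementation used for the tutorials in this course, here is a minimal sketch of the thread model: the master ("heavy weight") process runs serially, then a team of threads shares the loop iterations (the array size is arbitrary; compile with an OpenMP-capable compiler, e.g. g++ -fopenmp):

```cpp
#include <cstdio>
#include <omp.h>

int main() {
    const int N = 8;
    double a[N];

    // Serial ("heavy weight") part runs on the master thread only.
    std::printf("serial part, %d threads available\n", omp_get_max_threads());

    // A team of "light weight" threads is created here and shares the iterations.
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        a[i] = 2.0 * i;  // each thread works on its share, all see the same array
        std::printf("iteration %d done by thread %d\n", i, omp_get_thread_num());
    }

    // Implicit barrier at the end of the parallel region; back to serial execution.
    std::printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}
```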
• In the Distributed Memory / Message Passing Model,
 A set of tasks use their own local memory during computation.
 Multiple tasks can reside on the same physical machine and/or across an arbitrary number of machines.
 Tasks exchange data through communication (sending/receiving messages).
 But there must be a certain cooperation between processes during data transfer (for example, a send operation must have a matching receive operation).
During implementation,
• The programmer is responsible for determining all parallelism.
• Message passing implementations usually comprise a library of subroutines that are embedded in source code.
• MPI is the "de facto" industry standard for message passing.
- Message Passing Interface (MPI), specification available at http://www.mpi-forum.org/docs/.
Distributed Memory / Message Passing Model
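A minimal MPI sketch of the send/receive cooperation described above: two tasks, each with its own local data, exchange one integer by message passing (the tag and value are arbitrary; assumes an MPI installation, compile with mpic++ and run with mpirun -np 2):

```cpp
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // each task learns its own id

    if (rank == 0) {
        int value = 42;  // data held in rank 0's local memory
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  // explicit send ...
    } else if (rank == 1) {
        int value = 0;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                          // ... matched by a receive
        std::printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```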
The Data Parallel model can also be referred to as the Partitioned Global Address Space (PGAS) model.
Here,
 The address space is treated globally.
 Most of the parallel work focuses on performing operations on a data set, typically organized into a common structure such as an array or cube.
 A set of tasks work collectively on the same data structure; however, each task works on a different partition of the same data structure.
 Tasks perform the same operation on their partition of work, for example "add 4 to every array element".
 Can be implemented on shared memory (the data structure is accessed through global memory) and on distributed memory architectures (the global data structure can be logically and/or physically split across tasks).
Data Parallel Model
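As an illustration of the "add 4 to every array element" example, a minimal data parallel sketch using an OpenMP work-sharing loop on a shared memory machine (the array size is arbitrary; on a distributed memory machine the array would instead be split across tasks):

```cpp
#include <cstdio>

int main() {
    const int N = 16;
    int data[N];
    for (int i = 0; i < N; ++i) data[i] = i;  // the common data structure

    // Every thread performs the same operation ("add 4") on its own
    // partition of the array: the essence of the data parallel model.
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        data[i] += 4;

    for (int i = 0; i < N; ++i) std::printf("%d ", data[i]);
    std::printf("\n");
    return 0;
}
```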
For the implementation,
• Various popular, and sometimes developmental, parallel programming implementations are based on the Data Parallel / PGAS model.
• - Coarray Fortran, compiler dependent
* further reading (https://en.wikipedia.org/wiki/Coarray_Fortran)
• - Unified Parallel C (UPC), extension to the C programming
language for SPMD parallel programming.
* further reading http://upc.lbl.gov/
- Global Arrays , shared memory style programming environment in the context of
distributed array data structures.
* Further reading on https://en.wikipedia.org/wiki/Global_Arrays
Data Parallel Model ( Cont…)
Single Program Multiple Data (SPMD) and Multiple Program Multiple Data (MPMD) are "high level" programming models that can be built on top of any of the parallel programming models above.

Single Program Multiple Data (SPMD):
- Why SINGLE PROGRAM? All tasks execute their own copy of the same program (threads, message passing, data parallel or hybrid) simultaneously.
- Why MULTIPLE DATA? All tasks may use different data.
- "Intelligent" enough: tasks do not necessarily have to execute the entire program; they may run only a portion of it.

Multiple Program Multiple Data (MPMD):
- Why MULTIPLE PROGRAM? Tasks may execute different programs (threads, message passing, data parallel or hybrid) simultaneously.
- Why MULTIPLE DATA? All tasks may use different data.
- Not as "intelligent" as SPMD, but may be better suited for certain types of problems (functional decomposition problems).

Single Program Multiple Data (SPMD) / Multiple Program Multiple Data (MPMD)
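A minimal SPMD sketch: every process runs the same MPI program below, but branches on its rank so that each task works on its own partition and not every task executes every part (hypothetical example; same compile/run assumptions as the earlier MPI sketch):

```cpp
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Single Program: every process executes this same code.
    // Multiple Data: each rank works on its own partition, identified by its rank.
    std::printf("rank %d of %d working on partition %d\n", rank, size, rank);

    if (rank == 0) {
        // Only rank 0 executes this branch: tasks need not run the entire program.
        std::printf("rank 0 additionally performs the coordination work\n");
    }

    MPI_Finalize();
    return 0;
}
```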
Conclusion
• Parallel computer architectures contribute to achieving maximum performance within the limits given by the technology.
• Diversity in parallel computer architecture makes the field challenging to learn and challenging to
present
• Classification can be based on the number of instructions that can be executed and how they
operate on data- Flynn (SISD,SIMD,MISD,MIMD)
• Also, classification can be based on how parallelism is achieved (Data parallel architectures,
Function-parallel architectures)
• Classification can as well focus on how processors communicate with the memory (shared memory computer or tightly coupled system, distributed memory computer or loosely coupled system)
• There must be a way to assess the performance of the parallel architecture
• FLOPS and MIPS are units of measure for the numerical computing performance of a computer.
• Parallelism is made possible with the implementation of adequate parallel programming models.
• The simplest model appears to be the Shared Memory Programming Model.
• SPMD and MPMD programming require mastery of the previous programming models for proper implementation.
• How do we then design a Parallel Program for effective parallelism?
See Next Chapter: Designing Parallel Programs and understanding notion of
Concurrency and Decomposition.
Challenge your understanding
1- What difference do you make between Parallel computer and Parallel Computing ?
2- What do you understand by True data dependency and Resource dependency?
3- Illustrate the notion of Vertical Waste and Horizontal Waste.
4- According to you, which of the design architectures can provide better performance? Use performance metrics to justify your arguments.
5- What is a concurrent-read, concurrent-write (CRCW) PRAM?
6-
On this figure, we have an illustration of bus-based interconnects (a) with no local caches and (b) with local memory/caches.
Explain the difference focusing on :
- The design architecture
- The operation
- The Pros and Cons
7- Discuss the HANDLER’S CLASSIFICATION of computer architectures compared with Flynn’s and other classifications.
Class Work Group and Presentation
• Purpose: demonstrate the conditions to detect potential parallelism.
“Parallel computing requires that the segments to be executed
in parallel must be independent of each other. So, before
executing parallelism, all the conditions of parallelism between
the segments must be analyzed”.
Use Bernstein's Conditions for the Detection of Parallelism to demonstrate when instructions i1, i2, …, in can be said to be "parallelizable".
REFERENCES
1. Xin Yuan, CIS4930/CDA5125: Parallel and Distributed Systems, retrieved from http://www.cs.fsu.edu/~xyuan/cda5125/index.html
2. EECC722 – Shaaban, #1 lec # 3, Fall 2000, 9-18-2000
3. Blaise Barney, Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/parallel_comp/#ModelsOverview, last modified 11/02/2020
4. J. Blazewicz et al., Handbook on Parallel and Distributed Processing, International Handbooks on Information Systems, Springer, 2000
5. Phillip J. Windley, Parallel Architectures, Lesson 6, CS462: Large-Scale Distributed Systems, 2020
6. A. Grama et al., Introduction to Parallel Computing, Lecture 3
END.

More Related Content

What's hot

Directed Acyclic Graph
Directed Acyclic Graph Directed Acyclic Graph
Directed Acyclic Graph
AJAL A J
 
Lecture 01 introduction to database
Lecture 01 introduction to databaseLecture 01 introduction to database
Lecture 01 introduction to databaseemailharmeet
 
Thread scheduling in Operating Systems
Thread scheduling in Operating SystemsThread scheduling in Operating Systems
Thread scheduling in Operating SystemsNitish Gulati
 
Parallelism
ParallelismParallelism
Parallelism
Md Raseduzzaman
 
12. Indexing and Hashing in DBMS
12. Indexing and Hashing in DBMS12. Indexing and Hashing in DBMS
12. Indexing and Hashing in DBMSkoolkampus
 
2 parallel processing presentation ph d 1st semester
2 parallel processing presentation ph d 1st semester2 parallel processing presentation ph d 1st semester
2 parallel processing presentation ph d 1st semester
Rafi Ullah
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithms
Julie Iskander
 
Parallel programming model
Parallel programming modelParallel programming model
Parallel programming model
Illuru Phani Kumar
 
Data Structure and Algorithms
Data Structure and AlgorithmsData Structure and Algorithms
Data Structure and Algorithms
iqbalphy1
 
Parallel algorithms
Parallel algorithmsParallel algorithms
Parallel algorithmsguest084d20
 
Distributed Computing ppt
Distributed Computing pptDistributed Computing ppt
Algorithms Lecture 1: Introduction to Algorithms
Algorithms Lecture 1: Introduction to AlgorithmsAlgorithms Lecture 1: Introduction to Algorithms
Algorithms Lecture 1: Introduction to Algorithms
Mohamed Loey
 
Bfs and Dfs
Bfs and DfsBfs and Dfs
Bfs and Dfs
Masud Parvaze
 
Superscalar & superpipeline processor
Superscalar & superpipeline processorSuperscalar & superpipeline processor
Superscalar & superpipeline processorMuhammad Ishaq
 
Recursion - Algorithms and Data Structures
Recursion - Algorithms and Data StructuresRecursion - Algorithms and Data Structures
Recursion - Algorithms and Data Structures
Priyanka Rana
 
Distributed data processing
Distributed data processingDistributed data processing
Distributed data processing
Ayisha Kowsar
 
Merge sort algorithm
Merge sort algorithmMerge sort algorithm
Merge sort algorithm
Shubham Dwivedi
 
Concurrency Control in Distributed Database.
Concurrency Control in Distributed Database.Concurrency Control in Distributed Database.
Concurrency Control in Distributed Database.
Meghaj Mallick
 
Algorithms Lecture 7: Graph Algorithms
Algorithms Lecture 7: Graph AlgorithmsAlgorithms Lecture 7: Graph Algorithms
Algorithms Lecture 7: Graph Algorithms
Mohamed Loey
 

What's hot (20)

Parallel Algorithms
Parallel AlgorithmsParallel Algorithms
Parallel Algorithms
 
Directed Acyclic Graph
Directed Acyclic Graph Directed Acyclic Graph
Directed Acyclic Graph
 
Lecture 01 introduction to database
Lecture 01 introduction to databaseLecture 01 introduction to database
Lecture 01 introduction to database
 
Thread scheduling in Operating Systems
Thread scheduling in Operating SystemsThread scheduling in Operating Systems
Thread scheduling in Operating Systems
 
Parallelism
ParallelismParallelism
Parallelism
 
12. Indexing and Hashing in DBMS
12. Indexing and Hashing in DBMS12. Indexing and Hashing in DBMS
12. Indexing and Hashing in DBMS
 
2 parallel processing presentation ph d 1st semester
2 parallel processing presentation ph d 1st semester2 parallel processing presentation ph d 1st semester
2 parallel processing presentation ph d 1st semester
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithms
 
Parallel programming model
Parallel programming modelParallel programming model
Parallel programming model
 
Data Structure and Algorithms
Data Structure and AlgorithmsData Structure and Algorithms
Data Structure and Algorithms
 
Parallel algorithms
Parallel algorithmsParallel algorithms
Parallel algorithms
 
Distributed Computing ppt
Distributed Computing pptDistributed Computing ppt
Distributed Computing ppt
 
Algorithms Lecture 1: Introduction to Algorithms
Algorithms Lecture 1: Introduction to AlgorithmsAlgorithms Lecture 1: Introduction to Algorithms
Algorithms Lecture 1: Introduction to Algorithms
 
Bfs and Dfs
Bfs and DfsBfs and Dfs
Bfs and Dfs
 
Superscalar & superpipeline processor
Superscalar & superpipeline processorSuperscalar & superpipeline processor
Superscalar & superpipeline processor
 
Recursion - Algorithms and Data Structures
Recursion - Algorithms and Data StructuresRecursion - Algorithms and Data Structures
Recursion - Algorithms and Data Structures
 
Distributed data processing
Distributed data processingDistributed data processing
Distributed data processing
 
Merge sort algorithm
Merge sort algorithmMerge sort algorithm
Merge sort algorithm
 
Concurrency Control in Distributed Database.
Concurrency Control in Distributed Database.Concurrency Control in Distributed Database.
Concurrency Control in Distributed Database.
 
Algorithms Lecture 7: Graph Algorithms
Algorithms Lecture 7: Graph AlgorithmsAlgorithms Lecture 7: Graph Algorithms
Algorithms Lecture 7: Graph Algorithms
 

Similar to Chap 2 classification of parralel architecture and introduction to parllel program. models

Chap 1(one) general introduction
Chap 1(one)  general introductionChap 1(one)  general introduction
Chap 1(one) general introduction
Malobe Lottin Cyrille Marcel
 
Parallel processing
Parallel processingParallel processing
Parallel processing
Praveen Kumar
 
CC unit 1.pptx
CC unit 1.pptxCC unit 1.pptx
CC unit 1.pptx
DivyaRadharapu1
 
Simulation of Heterogeneous Cloud Infrastructures
Simulation of Heterogeneous Cloud InfrastructuresSimulation of Heterogeneous Cloud Infrastructures
Simulation of Heterogeneous Cloud Infrastructures
CloudLightning
 
CCUnit1.pdf
CCUnit1.pdfCCUnit1.pdf
CCUnit1.pdf
AnayGupta26
 
Aca module 1
Aca module 1Aca module 1
Aca module 1
Avinash_N Rao
 
Parallel Computing-Part-1.pptx
Parallel Computing-Part-1.pptxParallel Computing-Part-1.pptx
Parallel Computing-Part-1.pptx
krnaween
 
Parallel and Distributed Computing chapter 1
Parallel and Distributed Computing chapter 1Parallel and Distributed Computing chapter 1
Parallel and Distributed Computing chapter 1
AbdullahMunir32
 
SecondPresentationDesigning_Parallel_Programs.ppt
SecondPresentationDesigning_Parallel_Programs.pptSecondPresentationDesigning_Parallel_Programs.ppt
SecondPresentationDesigning_Parallel_Programs.ppt
RubenGabrielHernande
 
Lec 2 (parallel design and programming)
Lec 2 (parallel design and programming)Lec 2 (parallel design and programming)
Lec 2 (parallel design and programming)
Sudarshan Mondal
 
Cloud computing basic introduction and notes for exam
Cloud computing basic introduction and notes for examCloud computing basic introduction and notes for exam
Cloud computing basic introduction and notes for exam
UtkarshAnand512529
 
Computing notes
Computing notesComputing notes
Computing notesthenraju24
 
CSA unit5.pptx
CSA unit5.pptxCSA unit5.pptx
CSA unit5.pptx
AbcvDef
 
Data Parallel and Object Oriented Model
Data Parallel and Object Oriented ModelData Parallel and Object Oriented Model
Data Parallel and Object Oriented Model
Nikhil Sharma
 
Introduction to Parallel Computing
Introduction to Parallel ComputingIntroduction to Parallel Computing
Introduction to Parallel Computing
Akhila Prabhakaran
 
introduction to cloud computing for college.pdf
introduction to cloud computing for college.pdfintroduction to cloud computing for college.pdf
introduction to cloud computing for college.pdf
snehan789
 
Week 1 lecture material cc
Week 1 lecture material ccWeek 1 lecture material cc
Week 1 lecture material cc
Ankit Gupta
 
_Cloud_Computing_Overview.pdf
_Cloud_Computing_Overview.pdf_Cloud_Computing_Overview.pdf
_Cloud_Computing_Overview.pdf
TyStrk
 
Week 1 Lecture_1-5 CC_watermark.pdf
Week 1 Lecture_1-5 CC_watermark.pdfWeek 1 Lecture_1-5 CC_watermark.pdf
Week 1 Lecture_1-5 CC_watermark.pdf
John422973
 

Similar to Chap 2 classification of parralel architecture and introduction to parllel program. models (20)

Chap 1(one) general introduction
Chap 1(one)  general introductionChap 1(one)  general introduction
Chap 1(one) general introduction
 
Parallel processing
Parallel processingParallel processing
Parallel processing
 
CC unit 1.pptx
CC unit 1.pptxCC unit 1.pptx
CC unit 1.pptx
 
Simulation of Heterogeneous Cloud Infrastructures
Simulation of Heterogeneous Cloud InfrastructuresSimulation of Heterogeneous Cloud Infrastructures
Simulation of Heterogeneous Cloud Infrastructures
 
CCUnit1.pdf
CCUnit1.pdfCCUnit1.pdf
CCUnit1.pdf
 
Aca module 1
Aca module 1Aca module 1
Aca module 1
 
Parallel Computing-Part-1.pptx
Parallel Computing-Part-1.pptxParallel Computing-Part-1.pptx
Parallel Computing-Part-1.pptx
 
Parallel and Distributed Computing chapter 1
Parallel and Distributed Computing chapter 1Parallel and Distributed Computing chapter 1
Parallel and Distributed Computing chapter 1
 
SecondPresentationDesigning_Parallel_Programs.ppt
SecondPresentationDesigning_Parallel_Programs.pptSecondPresentationDesigning_Parallel_Programs.ppt
SecondPresentationDesigning_Parallel_Programs.ppt
 
Lec 2 (parallel design and programming)
Lec 2 (parallel design and programming)Lec 2 (parallel design and programming)
Lec 2 (parallel design and programming)
 
Cloud computing basic introduction and notes for exam
Cloud computing basic introduction and notes for examCloud computing basic introduction and notes for exam
Cloud computing basic introduction and notes for exam
 
Computing notes
Computing notesComputing notes
Computing notes
 
Par com
Par comPar com
Par com
 
CSA unit5.pptx
CSA unit5.pptxCSA unit5.pptx
CSA unit5.pptx
 
Data Parallel and Object Oriented Model
Data Parallel and Object Oriented ModelData Parallel and Object Oriented Model
Data Parallel and Object Oriented Model
 
Introduction to Parallel Computing
Introduction to Parallel ComputingIntroduction to Parallel Computing
Introduction to Parallel Computing
 
introduction to cloud computing for college.pdf
introduction to cloud computing for college.pdfintroduction to cloud computing for college.pdf
introduction to cloud computing for college.pdf
 
Week 1 lecture material cc
Week 1 lecture material ccWeek 1 lecture material cc
Week 1 lecture material cc
 
_Cloud_Computing_Overview.pdf
_Cloud_Computing_Overview.pdf_Cloud_Computing_Overview.pdf
_Cloud_Computing_Overview.pdf
 
Week 1 Lecture_1-5 CC_watermark.pdf
Week 1 Lecture_1-5 CC_watermark.pdfWeek 1 Lecture_1-5 CC_watermark.pdf
Week 1 Lecture_1-5 CC_watermark.pdf
 

Recently uploaded

Halogenation process of chemical process industries
Halogenation process of chemical process industriesHalogenation process of chemical process industries
Halogenation process of chemical process industries
MuhammadTufail242431
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
AJAYKUMARPUND1
 
Automobile Management System Project Report.pdf
Automobile Management System Project Report.pdfAutomobile Management System Project Report.pdf
Automobile Management System Project Report.pdf
Kamal Acharya
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
VENKATESHvenky89705
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
 
ethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.pptethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.ppt
Jayaprasanna4
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
ViniHema
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
karthi keyan
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
Kamal Acharya
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
LIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.pptLIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.ppt
ssuser9bd3ba
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
fxintegritypublishin
 
DESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docxDESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docx
FluxPrime1
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
obonagu
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
JoytuBarua2
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
Pratik Pawar
 
Democratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek AryaDemocratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek Arya
abh.arya
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
Kamal Acharya
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
AafreenAbuthahir2
 

Recently uploaded (20)

Halogenation process of chemical process industries
Halogenation process of chemical process industriesHalogenation process of chemical process industries
Halogenation process of chemical process industries
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
 
Automobile Management System Project Report.pdf
Automobile Management System Project Report.pdfAutomobile Management System Project Report.pdf
Automobile Management System Project Report.pdf
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
ethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.pptethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.ppt
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
LIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.pptLIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.ppt
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
 
DESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docxDESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docx
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
 
Democratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek AryaDemocratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek Arya
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
 

Chap 2 classification of parralel architecture and introduction to parllel program. models

  • 1. UBa/NAHPI-2020 DepartmentofComputer Engineering PARALLEL AND DISTRIBUTED COMPUTING By Malobe LOTTIN Cyrille .M Network and Telecoms Engineer PhD Student- ICT–U USA/CAMEROON Contact Email:malobecyrille.marcel@ictuniversity.org Phone:243004411/695654002
  • 2. CHAPTER 2 Parallel and Distributed Computer Architectures, Performance Metrics And Parallel Programming Models Previous … Chap 1: General Introduction (Parallel and Distributed Computing)
  • 3. CONTENTS • INTRODUCTION • Why parallel Architecture ? • Modern Classification of Parallel Computers • Structural Classification of Parallel Computers • Parallel Computers Memory Architectures • Hardware Classification • Performance of Parallel Computers architectures - Peak and Sustained Performance • Measuring Performance of Parallel Computers • Other Common Benchmarks • Parallel Programming Models - Shared Memory Programming Model - Thread Model - Distributed Memory - Data Parallel - SPMD/MPMD • Conclusion Exercises ( Check your Progress, Further Reading and Evaluation)
  • 4. Previously on Chap 1  Part 1- Introducing Parallel and Distributed Computing • Background Review of Parallel and Distributed Computing • INTRODUCTION TO PARALLEL AND DISTRIBUTED COMPUTING • Some keys terminologies • Why parallel Computing? • Parallel Computing: the Facts • Basic Design Computer Architecture: the von Neumann Architecture • Classification of Parallel Computers (SISD,SIMD,MISD,MIMD) • Assignment 1a  Part 2- Initiation to Parallel Programming Principles • High Performance Computing (HPC) • Speed: a need to solve Complexity • Some Case Studies Showing the need of Parallel Computing • Challenge of explicit Parallelism • General Structure of Parallel Programs • Introduction to the Amdahl's LAW • The GUSTAFSON’s LAW • SCALIBILITY • Fixed Size Versus Scale Size • Assignment 1b • Conclusion
  • 5. INTRODUCTION • Parallel Computer Architecture is the method that consist of Maximizing and organizing computer resources to achieve Maximum performance. - Performance at any instance of time, is achievable within the limit given by the technology. - The same system may be characterized both as "parallel" and "distributed"; the processors in a typical distributed system run concurrently in parallel. • The use of more processors to compute tasks simultaneously contribute in providing more features to computers systems. • In the Parallel architecture, Processors during computation may have access to a shared memory to exchange information between them. • imagesSource:Wikipedia,DistributingComputing,2020
  • 6. • In a Distributed architecture, each processor during computation, make use of its own private memory (distributed memory). In this case, Information is exchanged by passing messages between the processors. • Significant characteristics of distributed systems are: concurrency of components, lack of a global clock (Clock synchronization) , and independent failure of components. • The use of distributed systems to solve computational problems is Called Distributed Computing (Divide problem into many tasks, each task is handle by one or more computers, which communicate with each other via message passing). • High-performance parallel computation operating shared-memory multiprocessor uses parallel algorithms while the coordination of a large-scale distributed system uses distributed algorithms. INTRODUCTION imagesSource:Wikipedia,DistributingComputing,2020
  • 7. • Parallelism is nowadays in all levels of computer architectures. • It is the Enhancements of Processors that justify the success in the development of Parallelism. • Today, they are superscalar (Execute several instructions in parallel each clock cycle). - besides, The advancement of the underlying Very Large-Scale Integration (VLSI )technology, which allows larger and larger numbers of components to fit on a chip and clock rates to increase. • Three main elements define structure and performance of Multiprocessor: - Processors - Memory Hierarchies (registers, cache, main memory, magnetic discs, magnetic tapes) - Interconnection Network • But, the gap of performance between the processor and the memory is still increasing …. • Parallelism is used by computer architecture to translate the raw potential of the technology into greater performance and expanded capability of the computer system • Diversity in parallel computer architecture makes the field challenging to learn and challenging to present. INTRODUCTION ( Cont…)
  • 8. Remember that: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. • The attempt to solve this large problems raises some fundamental questions which the answer can only by satisfy by understanding: - Various components of Parallel and Distributed systems( Design and operation), - How much problems a given Parallel and Distributed system can solve, - How processors corporate, communicate / transmit data between them, - The primitive abstractions that the hardware and software provide to the programmer for better control, - And, How to ensure a proper translation to performance once these elements are under control. INTRODUCTION (Cont…)
  • 9. Why Parallel Architecture ? • No matter the performance of a single processor at a given time, we can achieve in principle higher performance by utilizing many such processors so far we are ready to pay the price (Cost). Parallel Architecture is needed To:  Respond to Applications Trends • Advances in hardware capability enable new application functionality  drives parallel architecture harder, since parallel architecture focuses on the most demanding of these applications. • At the Low end level, we have the largest volume of machines and greatest number of users; at the High end, most demanding applications. • Consequence: pressure for increased performance  most demanding applications must be written as parallel programs to respond to this demand generated from the High end  Satisfy the need of High Computing in the field of computational science and engineering - A response to simulate physical phenomena impossible or very costly to observe through empirical means (modeling global climate change over long periods, the evolution of galaxies, the atomic structure of materials, etc…)
  • 10.  Respond to Technology Trends • Can’t “wait for the single processor to get fast enough ” Respond to Architectural Trends • Advances in technology determine what is possible; architecture translates the potential of the technology into performance and capability . • Four generation of Computer architectures (tubes, transistors, integrated circuits, and VLSI ) where strong distinction is function of the type of parallelism implemented ( Bit level parallelism  4-bits to 64 bits, 128 bits is the future). • There has been tremendous architectural advances over this period : Bit level parallelism, Instruction level Parallelism, Thread Level Parallelism All these forces driving the development of parallel architectures are resumed under one main quest: Achieve absolute maximum performance ( Supercomputing) Why Parallel Architecture ? (Cont …)
  • 11. Modernclassification Accordingto(Sima,Fountain,Kacsuk) Before modern classification, Recall Flynn’s taxonomy classification of Computers - based on the number of instructions that can be executed and how they operate on data. Four Main Type: • SISD: traditional sequential architecture • SIMD: processor arrays, vector processor • Parallel computing on a budget – reduced control unit cost • Many early supercomputers • MIMD: most general purpose parallel computer today • Clusters, MPP, data centers • MISD: not a general purpose architecture Note: Globally four type of parallelism are implemented: - Bit Level Parallelism: performance of processors based on word size ( bits) - Instruction Level Parallelism: give ability to processors to execute more than instruction per clock cycle - Task Parallelism: characterize Parallel programs - Superword Level Parallelism: Based on vectorization Techniques Computer Architectures SISD SIMD MIMD MISD
• 12. Modern Classification According to (Sima, Fountain, Kacsuk)
• Classification here is based on how parallelism is achieved:
- by operating on multiple data: data parallelism
- by performing many functions in parallel: task (function) parallelism
- Control parallelism or task parallelism, depending on the level of the functional parallelism.
Parallel architectures are therefore divided into data-parallel architectures and function-parallel architectures.
Function-parallel architectures:
- Different operations are performed on the same or different data
- Asynchronous computation
- Speedup is lower, as each processor executes a different thread or process on the same or a different set of data
- The amount of parallelization is proportional to the number of independent tasks to be performed
- Load balancing depends on the availability of the hardware and on scheduling algorithms such as static and dynamic scheduling
- Applicability: pipelining
Data-parallel architectures:
- The same operations are performed on different subsets of the same data
- Synchronous computation
- Speedup is higher, as there is only one execution thread operating on all sets of data
- The amount of parallelization is proportional to the input data size
- Designed for optimum load balance on multiprocessor systems
- Applicability: arrays, matrices
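To make the distinction concrete in software terms, here is a minimal, illustrative OpenMP sketch in C (assuming a compiler flag such as gcc -fopenmp; the array contents and the "unrelated work" message are arbitrary). The parallel for loop expresses data parallelism (the same operation on different subsets of the data), while the sections construct expresses function/task parallelism (different operations running concurrently).

#include <stdio.h>

#define N 8

int main(void) {
    int a[N], b[N];

    /* Data parallelism: the SAME operation applied to different subsets of the data */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = i * i;

    /* Function/task parallelism: DIFFERENT operations performed in parallel */
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            for (int i = 0; i < N; i++)   /* task 1: derive b from a */
                b[i] = a[i] + 1;
        }
        #pragma omp section
        {
            printf("task 2: doing unrelated work\n");   /* task 2 */
        }
    }

    printf("a[%d] = %d, b[%d] = %d\n", N - 1, a[N - 1], N - 1, b[N - 1]);
    return 0;
}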
• 13. Structural Classification of Parallel Computers
• Flynn's classification focuses on the behavioral aspect of computers.
• Looking at the structure, parallel computers can instead be classified by how processors communicate with memory:
- When multiple processors communicate through globally shared memory modules, the organization is called a shared memory computer, or tightly coupled system.
- When every processor in a multiprocessor system has its own local memory and the processors communicate via messages transmitted between their local memories, the organization is called a distributed memory computer, or loosely coupled system.
• 14. Parallel Computer Memory Architectures
Shared Memory Parallel Computer Architecture
- Processors can access all memory as a global address space
- Multiple processors can operate independently but share the same memory resources
- Changes to a memory location made by one processor are visible to all other processors
Based on memory access time, shared memory parallel computers fall into two classes:
- Uniform Memory Access (UMA)
- Non-Uniform Memory Access (NUMA)
• 15. Parallel Computer Memory Architectures (Cont…)
• Uniform Memory Access (UMA) (also known as Cache Coherent UMA, CC-UMA)
- Most commonly represented today by Symmetric Multiprocessor (SMP) machines
- Identical processors
- Equal access and equal access times to memory
Note: cache coherence is a hardware mechanism whereby any update of a location in shared memory by one processor is announced to all the other processors.
Source: images retrieved from https://computing.llnl.gov/tutorials/parallel_comp/#SharedMemory
• 16. Parallel Computer Memory Architectures (Cont…)
Non-Uniform Memory Access (NUMA)
• The architecture often links two or more SMPs such that:
- One SMP can directly access the memory of another SMP
- Not all processors have equal access time to all memories
- Memory access across the link is slower
Note: if cache coherence is maintained, the architecture is also called Cache Coherent NUMA (CC-NUMA).
• The proximity of memory to CPUs on a shared memory parallel computer makes data sharing between tasks fast and uniform.
• But there is a lack of scalability between memory and CPUs.
Sources: images retrieved from https://computing.llnl.gov/tutorials/parallel_comp/#SharedMemory; Bruce Jacob, ... David T. Wang, in Memory Systems, 2008
• 18. Parallel Computer Memory Architectures (Cont…)
• Distributed Memory Parallel Computer Architecture
- Comes in as many varieties as shared memory computers.
- Requires a communication network to connect inter-processor memory.
- Each processor operates independently with its own local memory.
- Changes made by an individual processor do not affect the memory of other processors.
- Cache coherency does not apply here!
• Access to data in another processor is usually the task of the programmer (who must explicitly define how and when data is communicated).
• This architecture is cost effective (it can use commodity, off-the-shelf processors and networking).
• But the programmer carries more responsibility for data communication between processors.
Source: retrieved from https://www.futurelearn.com/courses/supercomputing/0/steps/24022
• 19. Parallel Computer Memory Architectures (Cont…)
Overview of Parallel Memory Architecture
Note:
- The largest and fastest computers in the world today employ both shared and distributed memory architectures (hybrid memory)
- In a hybrid design, the shared memory component can be a shared memory machine and/or graphics processing units (GPUs)
- The distributed memory component is the networking of multiple shared memory/GPU machines
- This type of memory architecture will continue to prevail and increase
Source: Nikolaos Ploskas, Nikolaos Samaras, in GPU Programming in MATLAB, 2016
• 20. Hardware Classification
• Parallel computers can be roughly classified according to the level at which the hardware in the parallel architecture supports parallelism.
• Multicore computing
- The processor includes multiple processing units (called "cores") on the same chip.
- Issues multiple instructions per clock cycle from multiple instruction streams.
- Differs from a superscalar processor, which issues multiple instructions per clock cycle from one instruction stream (thread); each core in a multi-core processor can, however, potentially be superscalar as well.
- Example: IBM's Cell microprocessor in the Sony PlayStation 3
• Symmetric multiprocessing (tightly coupled multiprocessing)
- A computer system with multiple identical processors that share memory and connect via a bus
- Usually does not comprise more than 32 processors, to minimize bus contention
- Symmetric multiprocessors are extremely cost-effective
Retrieved from https://en.wikipedia.org/wiki/Parallel_computing#Bit-level_parallelism, 2020
• 21. Hardware Classification (Cont…)
• Distributed computing (distributed memory multiprocessor)
- Not to be confused with decentralized computing (the allocation of resources, hardware and software, to individual workstations)
- Components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another
- The components interact to achieve a common goal
- Characterized by concurrency of components, lack of a global clock, and independent failure of components
- Can include heterogeneous computations, where some nodes perform a lot of computation, some perform very little, and a few others perform specialized functionality
- Example: multiplayer online games
• Cluster computing
- Loosely coupled computers that work together closely
- In some respects they can be regarded as a single computer
- Multiple standalone machines constitute a cluster and are connected by a network
- Computer clusters have each node set to perform the same task, controlled and scheduled by software
- Computer clustering relies on a centralized management approach which makes the nodes available as orchestrated shared servers
- Example: IBM's Sequoia
Sources: Dinkar Sitaram, Geetha Manjunath, in Moving To The Cloud, 2012; Cisco Systems, 2003
• 23. Performance of parallel architectures
• There are various ways to measure the performance of a parallel algorithm running on a parallel processor.
• The most commonly used measurements are:
- Speed-up
- Efficiency / isoefficiency
- Elapsed time (a very important factor)
- Price/performance: elapsed time for a program divided by the cost of the machine that ran the job
Note: none of these metrics should be used independently of the run time of the parallel system.
• Common metrics of performance
- FLOPS and MIPS are units of measure for the numerical computing performance of a computer.
- Distributed computing uses the Internet to link personal computers to achieve more FLOPS.
- MIPS: millions of instructions per second; MIPS = instruction count / (execution time x 10^6)
- MFLOPS: millions of floating-point operations per second; MFLOPS = FP operations in program / (execution time x 10^6)
• Which metric is better?
- MFLOPS is more closely related to the run time of a task in numerical code: the number of FLOPs per program is determined by the problem (e.g. the matrix size). See Chapter 1.
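As a purely illustrative worked example of the two formulas (the numbers are hypothetical, not taken from the slides):
- MIPS: a program that executes 5 x 10^8 instructions in 2 seconds achieves 5 x 10^8 / (2 x 10^6) = 250 MIPS.
- MFLOPS: a kernel that performs 2 x 10^8 floating-point operations in 4 seconds achieves 2 x 10^8 / (4 x 10^6) = 50 MFLOPS.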
• 24. Performance of parallel architectures
“In June 2020, Fugaku turned in a High Performance Linpack (HPL) result of 415.5 petaFLOPS, besting the now second-place Summit system by a factor of 2.8x. Fugaku is powered by Fujitsu’s 48-core A64FX SoC, becoming the first number one system on the list to be powered by ARM processors. In single or further reduced precision, used in machine learning and AI applications, Fugaku’s peak performance is over 1,000 petaflops (1 exaflops). The new system is installed at RIKEN Center for Computational Science (R-CCS) in Kobe, Japan” (Wikipedia, FLOPS, 2020).
[Figure: performance trend over time, annotated "Single CPU Performance", "Here we are!", "The future"]
• 25. Peak and sustained performance
Peak performance
• Measured in MFLOPS
• The highest possible MFLOPS rate, achieved when the system does nothing but numerical computation
• A rough hardware measure
• Gives little indication of how the system will perform in practice
Peak theoretical performance
• Node performance in GFlops = (CPU speed in GHz) x (number of CPU cores) x (floating-point operations per cycle) x (number of CPUs per node)
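As a purely illustrative worked example of the peak theoretical performance formula (a hypothetical node configuration): a node with 2.5 GHz CPUs, 16 cores per CPU, 8 floating-point operations per cycle and 2 CPUs per node peaks at 2.5 x 16 x 8 x 2 = 640 GFlops per node.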
• 26. Peak and sustained performance
• Sustained performance: the MFLOPS rate that a program achieves over its entire run.
• Measuring sustained performance: using benchmarks.
• Peak MFLOPS is usually much larger than sustained MFLOPS.
• Efficiency rate = sustained MFLOPS / peak MFLOPS
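Continuing the hypothetical example above: if a benchmark sustains 80 GFlops on the 640 GFlops node, the efficiency rate is 80 / 640 = 12.5%, illustrating how far sustained performance can fall below peak.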
• 27. Measuring the performance of parallel computers
• Benchmarks: programs that are used to measure performance.
• LINPACK benchmark: a measure of a system’s floating-point computing power.
- Solves a dense N-by-N system of linear equations Ax = b.
- Used to rank supercomputers in the TOP500 list.
No. 1 since June 2020: Fugaku, powered by Fujitsu’s 48-core A64FX SoC, the first number-one system on the list to be powered by ARM processors.
• 28. Other common benchmarks
• Micro benchmark suites
• Numerical computing: LAPACK, ScaLAPACK
• Memory bandwidth: STREAM
• Kernel benchmarks: NPB (NAS Parallel Benchmarks), PARKBENCH, SPEC, SPLASH
• 29. PARALLEL PROGRAMMING MODELS
A programming perspective on how parallelism is implemented in parallel and distributed computer architectures
• 30. Parallel Programming Models
Parallel programming models exist as an abstraction above hardware and memory architectures.
• Several parallel programming models are in common use:
- Shared Memory (without threads)
- Threads
- Distributed Memory / Message Passing
- Data Parallel
- Hybrid
- Single Program Multiple Data (SPMD)
- Multiple Program Multiple Data (MPMD)
• These models are NOT specific to a particular type of machine or memory architecture (a given model can be implemented on any underlying hardware). Example:
- A SHARED memory model on a DISTRIBUTED memory machine: the machine memory is physically distributed across networked machines, but appears at the user level as a single shared global address space (the Kendall Square Research (KSR) ALLCACHE approach).
• 31. Parallel Programming Models
Which model to use? There is no "best" model; however, there are certainly better implementations of some models than others.
• 32. Shared Memory Programming Model (Without Threads)
• A thread is the basic unit to which the operating system allocates processor time; threads are the smallest sequences of programmed instructions.
• In the shared memory programming model:
- Processes/tasks share a common address space, which they read from and write to asynchronously.
- They use mechanisms such as locks and semaphores to control access to the shared memory, resolve contention, and prevent race conditions and deadlocks.
• This may be considered the simplest parallel programming model.
• 33. Shared Memory Programming Model (Cont…)
• Note: locks, mutexes, and semaphores are types of synchronization objects used when resources are shared. They are abstract concepts.
- A lock protects access to some kind of shared resource; owning the lock gives the right to access the protected shared resource. For example, if you have a lockable object ABC you may:
- acquire the lock on ABC,
- take the lock on ABC,
- lock ABC,
- take ownership of ABC, or relinquish ownership of ABC when it is no longer needed.
- A mutex (MUTual EXclusion) is a lockable object that can be owned by exactly one thread at a time.
Example: in C++, std::mutex, std::timed_mutex, std::recursive_mutex.
- A semaphore is a much more relaxed type of lockable object, with a predefined maximum count and a current count.
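As an illustration of mutual exclusion, here is a minimal sketch using POSIX threads in C (compile with -pthread). The course tutorials use OpenMP; pthreads is used here only because it exposes the mutex object directly, and the names counter and worker are hypothetical. Only one thread at a time owns the lock and updates the shared counter.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* acquire ownership of the lock */
        counter++;                    /* critical section on the shared resource */
        pthread_mutex_unlock(&lock);  /* relinquish ownership */
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld (expected 400000)\n", counter);
    return 0;
}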
• 34. Shared Memory Programming Model (Cont…)
Advantages:
• There is no need to specify explicitly the communication of data between tasks, so there is no need to implement "ownership". Very advantageous for the programmer.
• All processes see and have equal access to shared memory.
• Program development can often be simplified.
Disadvantages:
• It becomes more difficult to understand and manage data locality.
• Keeping data local to a given process conserves memory accesses, cache refreshes and bus traffic, but controlling data locality is hard to understand and may be beyond the control of the average user.
During implementation:
• Case: stand-alone shared memory machines. Native operating systems, compilers and/or hardware provide support for shared memory programming; e.g. the POSIX standard provides an API for using shared memory.
• Case: distributed memory machines. Memory is physically distributed across a network of machines, but made global through specialized hardware and software.
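Since the slide mentions that the POSIX standard provides an API for shared memory, here is a minimal, hedged sketch of that API in C (on some systems, link with -lrt). The object name "/demo_shm" and the 4096-byte size are arbitrary choices for illustration; a second, cooperating process could shm_open() the same name and see the same bytes.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const char *name = "/demo_shm";
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);   /* create/open the object */
    if (fd == -1) { perror("shm_open"); return 1; }
    ftruncate(fd, 4096);                                /* size the shared region */
    char *region = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);             /* map it into the address space */
    strcpy(region, "hello from shared memory");         /* write to the shared region */
    printf("%s\n", region);
    munmap(region, 4096);
    close(fd);
    shm_unlink(name);                                   /* remove the object */
    return 0;
}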
• 35. Thread Model
• This is a type of shared memory programming.
• Here, a single "heavy weight" process can have multiple "light weight", concurrent execution paths.
• To understand this model, consider the execution of a main program a.out, scheduled to run by the native operating system:
- a.out starts by loading and acquiring all of the necessary system and user resources to run. This constitutes the "heavy weight" process.
- a.out performs some serial work and then creates a number of tasks (threads) that can be scheduled and run by the operating system concurrently.
- Each thread has local data, but also shares the entire resources of a.out. Threads are "light weight" and benefit from a global memory view because they share the memory space of a.out.
- Synchronization is needed to ensure that no two threads update the same global address at the same time.
• 36. Thread Model (Cont…)
• In practice, thread implementations commonly comprise:
- A library of subroutines that are called from within parallel source code
- A set of compiler directives embedded in either serial or parallel source code
Note: often, the programmer is responsible for determining the parallelism.
• Unrelated standardization efforts have resulted in two very different implementations of threads:
- POSIX Threads: specified by the IEEE POSIX 1003.1c standard (1995); C language only; part of Unix/Linux operating systems; very explicit parallelism that requires significant programmer attention to detail.
- OpenMP (used for the tutorials in the context of this course): an industry standard; compiler-directive based; portable / multi-platform, including Unix and Windows platforms; available in C/C++ and Fortran implementations; can be very easy and simple to use, providing for "incremental parallelism" (you can begin with serial code).
Others include: Microsoft threads; Java and Python threads; CUDA threads for GPUs.
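A minimal OpenMP sketch in C of the fork-join behaviour described on the previous slide (compile with a flag such as gcc -fopenmp); the printed messages are illustrative only.

#include <omp.h>
#include <stdio.h>

int main(void) {
    printf("serial part, single thread (the \"heavy weight\" process)\n");

    #pragma omp parallel            /* fork a team of "light weight" threads */
    {
        int id = omp_get_thread_num();
        printf("hello from thread %d of %d\n", id, omp_get_num_threads());
    }                               /* implicit join: the team disappears here */

    printf("serial part again\n");
    return 0;
}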
• 37. Distributed Memory / Message Passing Model
• In this model, a set of tasks use their own local memory during computation. Multiple tasks can reside on the same physical machine and/or across an arbitrary number of machines.
• Tasks exchange data through communication (sending and receiving messages); data transfer usually requires cooperative operations to be performed by each process (a send must have a matching receive).
During implementation:
• The programmer is responsible for determining all parallelism.
• Message passing implementations usually comprise a library of subroutines that are embedded in source code.
• MPI is the "de facto" industry standard for message passing.
- The Message Passing Interface (MPI) specification is available at http://www.mpi-forum.org/docs/.
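A minimal MPI sketch in C of cooperative message passing between two tasks with separate local memories (run with at least two processes, e.g. mpirun -np 2 ./a.out); the value 42 is arbitrary.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;                       /* each task has its own copy */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* cooperative send */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                             /* matching receive */
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}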
• 38. Data Parallel Model
• Can also be referred to as the Partitioned Global Address Space (PGAS) model. Here:
- The address space is treated globally.
- Most of the parallel work focuses on performing operations on a data set, typically organized into a common structure such as an array or cube.
- A set of tasks works collectively on the same data structure; however, each task works on a different partition of that data structure.
- Tasks perform the same operation on their partition of the work, for example "add 4 to every array element".
- The model can be implemented on shared memory architectures (the data structure is accessed through global memory) and on distributed memory architectures (the global data structure can be logically and/or physically split across tasks).
• 39. Data Parallel Model (Cont…)
For the implementation, there are various popular, and sometimes developmental, parallel programming environments based on the Data Parallel / PGAS model:
- Coarray Fortran, compiler dependent. Further reading: https://en.wikipedia.org/wiki/Coarray_Fortran
- Unified Parallel C (UPC), an extension to the C programming language for SPMD parallel programming. Further reading: http://upc.lbl.gov/
- Global Arrays, a shared-memory-style programming environment in the context of distributed array data structures. Further reading: https://en.wikipedia.org/wiki/Global_Arrays
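A minimal sketch of the data-parallel idea on a shared-memory machine, written here with OpenMP rather than one of the PGAS languages listed above: every task applies the same operation, "add 4 to every array element", to its own partition of the array. The array size N is arbitrary.

#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];             /* zero-initialized shared array */

    #pragma omp parallel for        /* the loop iterations are partitioned  */
    for (int i = 0; i < N; i++)     /* across threads; each thread works on */
        a[i] += 4.0;                /* a different chunk of the same array  */

    printf("a[0] = %.1f, a[N-1] = %.1f\n", a[0], a[N - 1]);
    return 0;
}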
• 40. Single Program Multiple Data (SPMD) / Multiple Program Multiple Data (MPMD)
Both are "high level" programming models that can be built on top of any other parallel programming model.
Single Program Multiple Data (SPMD):
- Why SINGLE program? All tasks execute their copy of the same program (threads, message passing, data parallel or hybrid) simultaneously.
- Why MULTIPLE data? All tasks may use different data.
- "Intelligent" enough: tasks do not necessarily have to execute the entire program.
Multiple Program Multiple Data (MPMD):
- Why MULTIPLE program? Tasks may execute different programs (threads, message passing, data parallel or hybrid) simultaneously.
- Why MULTIPLE data? All tasks may use different data.
- Not as "intelligent" as SPMD, but may be better suited to certain types of problems (functional decomposition problems).
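A minimal SPMD sketch in C with MPI: every task runs a copy of the same program but branches on its rank, so tasks do not all execute the same portion of the work. The division of roles shown is hypothetical.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        printf("rank 0: doing the I/O and coordination part\n");   /* one role */
    else
        printf("rank %d: doing a slice of the computation\n", rank); /* another role */

    MPI_Finalize();
    return 0;
}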
• 41. Conclusion
• Parallel computer architectures contribute to achieving maximum performance within the limit given by the technology.
• The diversity of parallel computer architectures makes the field challenging to learn and challenging to present.
• Classification can be based on the number of instruction and data streams and how instructions operate on data: Flynn's taxonomy (SISD, SIMD, MISD, MIMD).
• Classification can also be based on how parallelism is achieved (data-parallel architectures, function-parallel architectures).
• Classification can as well focus on how processors communicate with memory (shared memory, or tightly coupled, computers; distributed memory, or loosely coupled, systems).
• There must be a way to appreciate the performance of the parallel architecture.
• FLOPS and MIPS are units of measure for the numerical computing performance of a computer.
• Parallelism is made possible by implementing adequate parallel programming models.
• The simplest model appears to be the shared memory programming model.
• SPMD and MPMD programming require mastery of the previous programming models for proper implementation.
• How do we then design a parallel program for effective parallelism? See the next chapter: Designing Parallel Programs and understanding the notions of Concurrency and Decomposition.
• 42. Challenge your understanding
1- What difference do you make between a parallel computer and parallel computing?
2- What do you understand by true data dependency and resource dependency?
3- Illustrate the notions of vertical waste and horizontal waste.
4- According to you, which of the design architectures can provide better performance? Use performance metrics to justify your arguments.
5- What is concurrent-read, concurrent-write (CRCW) PRAM?
6- On this figure, we have an illustration of (a) bus-based interconnects with no local caches and (b) bus-based interconnects with local memory/caches. Explain the difference, focusing on:
- The design architecture
- The operation
- The pros and cons
7- Discuss Handler's classification of computer architectures compared to Flynn's and other classifications.
• 43. Class Work: Group and Presentation
• Purpose: demonstrate the conditions for detecting potential parallelism.
"Parallel computing requires that the segments to be executed in parallel must be independent of each other. So, before executing in parallel, all the conditions of parallelism between the segments must be analyzed."
Use Bernstein's conditions for the detection of parallelism to demonstrate when instructions i1, i2, …, in can be said to be "parallelizable".
• 44. REFERENCES
1. Xin Yuan, CIS4930/CDA5125: Parallel and Distributed Systems. Retrieved from http://www.cs.fsu.edu/~xyuan/cda5125/index.html
2. EECC722 – Shaaban, Lecture #3, Fall 2000, 9-18-2000.
3. Blaise Barney, Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/parallel_comp/#ModelsOverview, last modified 11/02/2020.
4. J. Blazewicz et al., Handbook on Parallel and Distributed Processing, International Handbooks on Information Systems, Springer, 2000.
5. Phillip J. Windley, Parallel Architectures, Lesson 6, CS462: Large Scale Distributed Systems, 2020.
6. A. Grama et al., Introduction to Parallel Computing, Lecture 3.
  • 45. END.